# WAL Disk Format

The write-ahead log (WAL) is organized into sequentially numbered segments,
e.g. `000000`, `000001`, `000002`, each limited to 128MB by default.
A segment is written in pages of 32KB. Only the last page of the most recent segment
may be partial. A WAL record is an opaque byte slice that is split into sub-records (fragments)
if it exceeds the remaining space of the current page. Records are never split across
segment boundaries. If a single record exceeds the default segment size, a segment with
a larger size is created.
The encoding of pages is largely borrowed from [LevelDB's/RocksDB's write ahead log](https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format).

Notable deviations are that the record fragment is encoded as:

```
┌───────────┬──────────┬────────────┬──────────────┐
│ type <1b> │ len <2b> │ CRC32 <4b> │ data <bytes> │
└───────────┴──────────┴────────────┴──────────────┘
```

The type flag has the following states:

* `0`: rest of page will be empty
* `1`: a full record encoded in a single fragment
* `2`: first fragment of a record
* `3`: middle fragment of a record
* `4`: final fragment of a record
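
The paging and fragment rules can be illustrated with a minimal Python sketch. It is not the actual implementation. Several details are assumptions because this document does not specify them: big-endian byte order for the header fields, a checksum computed over the fragment data only, and `zlib.crc32` as a stand-in for whichever CRC32 polynomial the real writer uses. The names `fragment_record` and `page_used` are hypothetical.

```python
import struct
import zlib

PAGE_SIZE = 32 * 1024      # 32KB pages
HEADER_SIZE = 1 + 2 + 4    # type <1b> + len <2b> + CRC32 <4b>

# Fragment type flags, as listed above.
PAGE_TERM, FULL, FIRST, MIDDLE, LAST = 0, 1, 2, 3, 4


def fragment_record(record: bytes, page_used: int = 0) -> list:
    """Split one record into encoded fragments.

    page_used is the number of bytes already written to the current page.
    Assumptions: big-endian header fields, CRC over the data only,
    zlib.crc32 as a placeholder checksum.
    """
    if not record:
        raise ValueError("this sketch does not handle empty records")
    out = []
    offset = 0
    while offset < len(record):
        room = PAGE_SIZE - page_used - HEADER_SIZE
        if room <= 0:
            # No space for a header plus one data byte: zero-fill the page.
            # The leading zero byte reads as type 0 (rest of page empty).
            out.append(bytes(PAGE_SIZE - page_used))
            page_used = 0
            continue
        chunk = record[offset:offset + room]
        is_first = offset == 0
        offset += len(chunk)
        is_last = offset == len(record)
        if is_first and is_last:
            typ = FULL
        elif is_first:
            typ = FIRST
        elif is_last:
            typ = LAST
        else:
            typ = MIDDLE
        header = struct.pack(">BHI", typ, len(chunk), zlib.crc32(chunk))
        out.append(header + chunk)
        page_used = (page_used + HEADER_SIZE + len(chunk)) % PAGE_SIZE
    return out
```

Because a fragment never carries more than `PAGE_SIZE - HEADER_SIZE` data bytes, its length always fits in the 2-byte `len` field.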

## Record encoding

The records written to the write ahead log are encoded as follows:

### Series records

Series records encode the labels that identify a series and its unique ID.

```
┌────────────────────────────────────────────┐
│ type = 1 <1b>                              │
├────────────────────────────────────────────┤
│ ┌─────────┬──────────────────────────────┐ │
│ │ id <8b> │ n = len(labels) <uvarint>    │ │
│ ├─────────┴────────────┬─────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>   │ │
│ ├──────────────────────┴─────────────────┤ │
│ │  ...                                   │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes> │ │
│ └───────────────────────┴────────────────┘ │
│                  . . .                     │
└────────────────────────────────────────────┘
```
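
As an illustration of the layout above, here is a minimal Python sketch of a series record encoder. It assumes that `uvarint` means the unsigned base-128 varint used by Go's `encoding/binary` package, that the 8-byte id is big-endian, and that the `2n` strings are stored as alternating label name and label value. None of these details are spelled out in this document, and the function names are hypothetical.

```python
import struct


def put_uvarint(n: int) -> bytes:
    """Unsigned base-128 varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint cannot represent negative values")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # set the continuation bit
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_series_record(series) -> bytes:
    """Encode a list of (series_id, labels_dict) pairs as one series record."""
    buf = bytearray([1])                     # type = 1
    for series_id, labels in series:
        buf += struct.pack(">Q", series_id)  # id <8b>, assumed big-endian
        buf += put_uvarint(len(labels))      # n = len(labels)
        for name, value in labels.items():
            # Assumed order: name then value, giving 2n strings in total.
            for s in (name, value):
                raw = s.encode("utf-8")
                buf += put_uvarint(len(raw)) + raw
    return bytes(buf)
```

For example, `encode_series_record([(1, {"job": "api"})])` yields the type byte, the 8-byte id, a label count of 1, and two length-prefixed strings.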

### Sample records

Sample records encode samples as a list of triples `(series_id, timestamp, value)`.
Series reference and timestamp are encoded as deltas relative to the first sample.
The first row stores the starting id and the starting timestamp.
The first sample record begins at the second row.

```
┌──────────────────────────────────────────────────────────────────┐
│ type = 2 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ └────────────────────┴───────────────────────────┴─────────────┘ │
│                              . . .                               │
└──────────────────────────────────────────────────────────────────┘
```
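
The delta scheme can be sketched in Python as follows. This follows the diagram literally and is only an illustration: it assumes big-endian fixed-width fields, a signed 64-bit timestamp, an IEEE 754 double for `value`, and Go-style unsigned varints for the deltas. Since an unsigned varint cannot hold a negative number, the sketch also assumes no sample has an id or timestamp smaller than the first sample's. The function names are hypothetical.

```python
import struct


def put_uvarint(n: int) -> bytes:
    """Unsigned base-128 varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint cannot represent negative values")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_sample_record(samples) -> bytes:
    """Encode a non-empty list of (series_id, timestamp, value) triples."""
    if not samples:
        raise ValueError("need at least one sample")
    base_id, base_ts, _ = samples[0]
    buf = bytearray([2])                          # type = 2
    buf += struct.pack(">Qq", base_id, base_ts)   # first row: starting id and timestamp
    for series_id, ts, value in samples:
        # The first sample is written too, with both deltas equal to zero.
        buf += put_uvarint(series_id - base_id)
        buf += put_uvarint(ts - base_ts)
        buf += struct.pack(">d", value)           # value <8b>, assumed float64
    return bytes(buf)
```

Samples that share nearby ids and timestamps then cost only a few bytes each on top of the 8-byte value.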

### Tombstone records

Tombstone records encode tombstones as a list of triples `(series_id, min_time, max_time)`
and specify an interval for which samples of a series were deleted.

```
┌─────────────────────────────────────────────────────┐
│ type = 3 <1b>                                       │
├─────────────────────────────────────────────────────┤
│ ┌─────────┬───────────────────┬───────────────────┐ │
│ │ id <8b> │ min_time <varint> │ max_time <varint> │ │
│ └─────────┴───────────────────┴───────────────────┘ │
│                        . . .                        │
└─────────────────────────────────────────────────────┘
```
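
A tombstone record encoder might look like the Python sketch below. It assumes that `varint` means the signed, zigzag-encoded varint of Go's `encoding/binary` package and that the 8-byte id is big-endian. Both are assumptions rather than statements from this document, and the function names are hypothetical.

```python
import struct


def put_uvarint(n: int) -> bytes:
    """Unsigned base-128 varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint cannot represent negative values")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def put_varint(n: int) -> bytes:
    """Signed varint: zigzag-map the value (0, -1, 1, -2 -> 0, 1, 2, 3), then uvarint."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    return put_uvarint((n << 1) ^ (n >> 63))


def encode_tombstone_record(tombstones) -> bytes:
    """Encode a list of (series_id, min_time, max_time) triples."""
    buf = bytearray([3])                     # type = 3
    for series_id, min_time, max_time in tombstones:
        buf += struct.pack(">Q", series_id)  # id <8b>, assumed big-endian
        buf += put_varint(min_time)
        buf += put_varint(max_time)
    return bytes(buf)
```

The signed encoding lets `min_time` and `max_time` be negative without always spending the full ten bytes a plain unsigned varint would need for a 64-bit two's complement value.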