1ae4a0502SMauro Carvalho Chehab=============================
2ae4a0502SMauro Carvalho ChehabBTT - Block Translation Table
3ae4a0502SMauro Carvalho Chehab=============================
4ae4a0502SMauro Carvalho Chehab
5ae4a0502SMauro Carvalho Chehab
6ae4a0502SMauro Carvalho Chehab1. Introduction
7ae4a0502SMauro Carvalho Chehab===============
8ae4a0502SMauro Carvalho Chehab
9ae4a0502SMauro Carvalho ChehabPersistent memory based storage is able to perform IO at byte (or more
10ae4a0502SMauro Carvalho Chehabaccurately, cache line) granularity. However, we often want to expose such
11ae4a0502SMauro Carvalho Chehabstorage as traditional block devices. The block drivers for persistent memory
12ae4a0502SMauro Carvalho Chehabwill do exactly this. However, they do not provide any atomicity guarantees.
13ae4a0502SMauro Carvalho ChehabTraditional SSDs typically provide protection against torn sectors in hardware,
14ae4a0502SMauro Carvalho Chehabusing stored energy in capacitors to complete in-flight block writes, or perhaps
15ae4a0502SMauro Carvalho Chehabin firmware. We don't have this luxury with persistent memory - if a write is in
16ae4a0502SMauro Carvalho Chehabprogress, and we experience a power failure, the block will contain a mix of old
17ae4a0502SMauro Carvalho Chehaband new data. Applications may not be prepared to handle such a scenario.
18ae4a0502SMauro Carvalho Chehab
19ae4a0502SMauro Carvalho ChehabThe Block Translation Table (BTT) provides atomic sector update semantics for
20ae4a0502SMauro Carvalho Chehabpersistent memory devices, so that applications that rely on sector writes not
21ae4a0502SMauro Carvalho Chehabbeing torn can continue to do so. The BTT manifests itself as a stacked block
22ae4a0502SMauro Carvalho Chehabdevice, and reserves a portion of the underlying storage for its metadata. At
23ae4a0502SMauro Carvalho Chehabthe heart of it, is an indirection table that re-maps all the blocks on the
24ae4a0502SMauro Carvalho Chehabvolume. It can be thought of as an extremely simple file system that only
25ae4a0502SMauro Carvalho Chehabprovides atomic sector updates.
26ae4a0502SMauro Carvalho Chehab
27ae4a0502SMauro Carvalho Chehab
28ae4a0502SMauro Carvalho Chehab2. Static Layout
29ae4a0502SMauro Carvalho Chehab================
30ae4a0502SMauro Carvalho Chehab
31ae4a0502SMauro Carvalho ChehabThe underlying storage on which a BTT can be laid out is not limited in any way.
32ae4a0502SMauro Carvalho ChehabThe BTT, however, splits the available space into chunks of up to 512 GiB,
33ae4a0502SMauro Carvalho Chehabcalled "Arenas".
34ae4a0502SMauro Carvalho Chehab
35ae4a0502SMauro Carvalho ChehabEach arena follows the same layout for its metadata, and all references in an
36ae4a0502SMauro Carvalho Chehabarena are internal to it (with the exception of one field that points to the
37ae4a0502SMauro Carvalho Chehabnext arena). The following depicts the "On-disk" metadata layout::
38ae4a0502SMauro Carvalho Chehab
39ae4a0502SMauro Carvalho Chehab
40ae4a0502SMauro Carvalho Chehab    Backing Store     +------->  Arena
41ae4a0502SMauro Carvalho Chehab  +---------------+   |   +------------------+
42ae4a0502SMauro Carvalho Chehab  |               |   |   | Arena info block |
43ae4a0502SMauro Carvalho Chehab  |    Arena 0    +---+   |       4K         |
44ae4a0502SMauro Carvalho Chehab  |     512G      |       +------------------+
45ae4a0502SMauro Carvalho Chehab  |               |       |                  |
46ae4a0502SMauro Carvalho Chehab  +---------------+       |                  |
47ae4a0502SMauro Carvalho Chehab  |               |       |                  |
48ae4a0502SMauro Carvalho Chehab  |    Arena 1    |       |   Data Blocks    |
49ae4a0502SMauro Carvalho Chehab  |     512G      |       |                  |
50ae4a0502SMauro Carvalho Chehab  |               |       |                  |
51ae4a0502SMauro Carvalho Chehab  +---------------+       |                  |
52ae4a0502SMauro Carvalho Chehab  |       .       |       |                  |
53ae4a0502SMauro Carvalho Chehab  |       .       |       |                  |
54ae4a0502SMauro Carvalho Chehab  |       .       |       |                  |
55ae4a0502SMauro Carvalho Chehab  |               |       |                  |
56ae4a0502SMauro Carvalho Chehab  |               |       |                  |
57ae4a0502SMauro Carvalho Chehab  +---------------+       +------------------+
58ae4a0502SMauro Carvalho Chehab                          |                  |
59ae4a0502SMauro Carvalho Chehab                          |     BTT Map      |
60ae4a0502SMauro Carvalho Chehab                          |                  |
61ae4a0502SMauro Carvalho Chehab                          |                  |
62ae4a0502SMauro Carvalho Chehab                          +------------------+
63ae4a0502SMauro Carvalho Chehab                          |                  |
64ae4a0502SMauro Carvalho Chehab                          |     BTT Flog     |
65ae4a0502SMauro Carvalho Chehab                          |                  |
66ae4a0502SMauro Carvalho Chehab                          +------------------+
67ae4a0502SMauro Carvalho Chehab                          | Info block copy  |
68ae4a0502SMauro Carvalho Chehab                          |       4K         |
69ae4a0502SMauro Carvalho Chehab                          +------------------+
70ae4a0502SMauro Carvalho Chehab
71ae4a0502SMauro Carvalho Chehab
72ae4a0502SMauro Carvalho Chehab3. Theory of Operation
73ae4a0502SMauro Carvalho Chehab======================
74ae4a0502SMauro Carvalho Chehab
75ae4a0502SMauro Carvalho Chehab
76ae4a0502SMauro Carvalho Chehaba. The BTT Map
77ae4a0502SMauro Carvalho Chehab--------------
78ae4a0502SMauro Carvalho Chehab
79ae4a0502SMauro Carvalho ChehabThe map is a simple lookup/indirection table that maps an LBA to an internal
80ae4a0502SMauro Carvalho Chehabblock. Each map entry is 32 bits. The two most significant bits are special
81ae4a0502SMauro Carvalho Chehabflags, and the remaining form the internal block number.
82ae4a0502SMauro Carvalho Chehab
83ae4a0502SMauro Carvalho Chehab======== =============================================================
84ae4a0502SMauro Carvalho ChehabBit      Description
85ae4a0502SMauro Carvalho Chehab======== =============================================================
86*eddeed12SMauro Carvalho Chehab31 - 30	 Error and Zero flags - Used in the following way::
87ae4a0502SMauro Carvalho Chehab
88ae4a0502SMauro Carvalho Chehab	   == ==  ====================================================
89ae4a0502SMauro Carvalho Chehab	   31 30  Description
90ae4a0502SMauro Carvalho Chehab	   == ==  ====================================================
91ae4a0502SMauro Carvalho Chehab	   0  0	  Initial state. Reads return zeroes; Premap = Postmap
92ae4a0502SMauro Carvalho Chehab	   0  1	  Zero state: Reads return zeroes
93ae4a0502SMauro Carvalho Chehab	   1  0	  Error state: Reads fail; Writes clear 'E' bit
94ae4a0502SMauro Carvalho Chehab	   1  1	  Normal Block – has valid postmap
95ae4a0502SMauro Carvalho Chehab	   == ==  ====================================================
96ae4a0502SMauro Carvalho Chehab
97ae4a0502SMauro Carvalho Chehab29 - 0	 Mappings to internal 'postmap' blocks
98ae4a0502SMauro Carvalho Chehab======== =============================================================
99ae4a0502SMauro Carvalho Chehab
100ae4a0502SMauro Carvalho Chehab
101ae4a0502SMauro Carvalho ChehabSome of the terminology that will be subsequently used:
102ae4a0502SMauro Carvalho Chehab
103ae4a0502SMauro Carvalho Chehab============	================================================================
104ae4a0502SMauro Carvalho ChehabExternal LBA	LBA as made visible to upper layers.
105ae4a0502SMauro Carvalho ChehabABA		Arena Block Address - Block offset/number within an arena
106ae4a0502SMauro Carvalho ChehabPremap ABA	The block offset into an arena, which was decided upon by range
107ae4a0502SMauro Carvalho Chehab		checking the External LBA
108ae4a0502SMauro Carvalho ChehabPostmap ABA	The block number in the "Data Blocks" area obtained after
109ae4a0502SMauro Carvalho Chehab		indirection from the map
110ae4a0502SMauro Carvalho Chehabnfree		The number of free blocks that are maintained at any given time.
111ae4a0502SMauro Carvalho Chehab		This is the number of concurrent writes that can happen to the
112ae4a0502SMauro Carvalho Chehab		arena.
113ae4a0502SMauro Carvalho Chehab============	================================================================
114ae4a0502SMauro Carvalho Chehab
115ae4a0502SMauro Carvalho Chehab
116ae4a0502SMauro Carvalho ChehabFor example, after adding a BTT, we surface a disk of 1024G. We get a read for
117ae4a0502SMauro Carvalho Chehabthe external LBA at 768G. This falls into the second arena, and of the 512G
118ae4a0502SMauro Carvalho Chehabworth of blocks that this arena contributes, this block is at 256G. Thus, the
119ae4a0502SMauro Carvalho Chehabpremap ABA is 256G. We now refer to the map, and find out the mapping for block
120ae4a0502SMauro Carvalho Chehab'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
121ae4a0502SMauro Carvalho Chehab
122ae4a0502SMauro Carvalho Chehab
123ae4a0502SMauro Carvalho Chehabb. The BTT Flog
124ae4a0502SMauro Carvalho Chehab---------------
125ae4a0502SMauro Carvalho Chehab
126ae4a0502SMauro Carvalho ChehabThe BTT provides sector atomicity by making every write an "allocating write",
127ae4a0502SMauro Carvalho Chehabi.e. Every write goes to a "free" block. A running list of free blocks is
128ae4a0502SMauro Carvalho Chehabmaintained in the form of the BTT flog. 'Flog' is a combination of the words
129ae4a0502SMauro Carvalho Chehab"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
130ae4a0502SMauro Carvalho Chehab
131ae4a0502SMauro Carvalho Chehab========  =====================================================================
132ae4a0502SMauro Carvalho Chehablba       The premap ABA that is being written to
133ae4a0502SMauro Carvalho Chehabold_map   The old postmap ABA - after 'this' write completes, this will be a
134ae4a0502SMauro Carvalho Chehab	  free block.
135ae4a0502SMauro Carvalho Chehabnew_map   The new postmap ABA. The map will up updated to reflect this
136ae4a0502SMauro Carvalho Chehab	  lba->postmap_aba mapping, but we log it here in case we have to
137ae4a0502SMauro Carvalho Chehab	  recover.
138ae4a0502SMauro Carvalho Chehabseq	  Sequence number to mark which of the 2 sections of this flog entry is
139ae4a0502SMauro Carvalho Chehab	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
140ae4a0502SMauro Carvalho Chehab	  operation, with 00 indicating an uninitialized state.
141ae4a0502SMauro Carvalho Chehablba'	  alternate lba entry
142ae4a0502SMauro Carvalho Chehabold_map'  alternate old postmap entry
143ae4a0502SMauro Carvalho Chehabnew_map'  alternate new postmap entry
144ae4a0502SMauro Carvalho Chehabseq'	  alternate sequence number.
145ae4a0502SMauro Carvalho Chehab========  =====================================================================
146ae4a0502SMauro Carvalho Chehab
147ae4a0502SMauro Carvalho ChehabEach of the above fields is 32-bit, making one entry 32 bytes. Entries are also
148ae4a0502SMauro Carvalho Chehabpadded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
149ae4a0502SMauro Carvalho Chehabdone such that for any entry being written, it:
150ae4a0502SMauro Carvalho Chehaba. overwrites the 'old' section in the entry based on sequence numbers
151ae4a0502SMauro Carvalho Chehabb. writes the 'new' section such that the sequence number is written last.
152ae4a0502SMauro Carvalho Chehab
153ae4a0502SMauro Carvalho Chehab
154ae4a0502SMauro Carvalho Chehabc. The concept of lanes
155ae4a0502SMauro Carvalho Chehab-----------------------
156ae4a0502SMauro Carvalho Chehab
157ae4a0502SMauro Carvalho ChehabWhile 'nfree' describes the number of concurrent IOs an arena can process
158ae4a0502SMauro Carvalho Chehabconcurrently, 'nlanes' is the number of IOs the BTT device as a whole can
159ae4a0502SMauro Carvalho Chehabprocess::
160ae4a0502SMauro Carvalho Chehab
161ae4a0502SMauro Carvalho Chehab	nlanes = min(nfree, num_cpus)
162ae4a0502SMauro Carvalho Chehab
163ae4a0502SMauro Carvalho ChehabA lane number is obtained at the start of any IO, and is used for indexing into
164ae4a0502SMauro Carvalho Chehaball the on-disk and in-memory data structures for the duration of the IO. If
165ae4a0502SMauro Carvalho Chehabthere are more CPUs than the max number of available lanes, than lanes are
166ae4a0502SMauro Carvalho Chehabprotected by spinlocks.
167ae4a0502SMauro Carvalho Chehab
168ae4a0502SMauro Carvalho Chehab
169ae4a0502SMauro Carvalho Chehabd. In-memory data structure: Read Tracking Table (RTT)
170ae4a0502SMauro Carvalho Chehab------------------------------------------------------
171ae4a0502SMauro Carvalho Chehab
172ae4a0502SMauro Carvalho ChehabConsider a case where we have two threads, one doing reads and the other,
173ae4a0502SMauro Carvalho Chehabwrites. We can hit a condition where the writer thread grabs a free block to do
174ae4a0502SMauro Carvalho Chehaba new IO, but the (slow) reader thread is still reading from it. In other words,
175ae4a0502SMauro Carvalho Chehabthe reader consulted a map entry, and started reading the corresponding block. A
176ae4a0502SMauro Carvalho Chehabwriter started writing to the same external LBA, and finished the write updating
177ae4a0502SMauro Carvalho Chehabthe map for that external LBA to point to its new postmap ABA. At this point the
178ae4a0502SMauro Carvalho Chehabinternal, postmap block that the reader is (still) reading has been inserted
179ae4a0502SMauro Carvalho Chehabinto the list of free blocks. If another write comes in for the same LBA, it can
180ae4a0502SMauro Carvalho Chehabgrab this free block, and start writing to it, causing the reader to read
181ae4a0502SMauro Carvalho Chehabincorrect data. To prevent this, we introduce the RTT.
182ae4a0502SMauro Carvalho Chehab
183ae4a0502SMauro Carvalho ChehabThe RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
184ae4a0502SMauro Carvalho Chehabinto rtt[lane_number], the postmap ABA it is reading, and clears it after the
185ae4a0502SMauro Carvalho Chehabread is complete. Every writer thread, after grabbing a free block, checks the
186ae4a0502SMauro Carvalho ChehabRTT for its presence. If the postmap free block is in the RTT, it waits till the
187ae4a0502SMauro Carvalho Chehabreader clears the RTT entry, and only then starts writing to it.
188ae4a0502SMauro Carvalho Chehab
189ae4a0502SMauro Carvalho Chehab
190ae4a0502SMauro Carvalho Chehabe. In-memory data structure: map locks
191ae4a0502SMauro Carvalho Chehab--------------------------------------
192ae4a0502SMauro Carvalho Chehab
193ae4a0502SMauro Carvalho ChehabConsider a case where two writer threads are writing to the same LBA. There can
194ae4a0502SMauro Carvalho Chehabbe a race in the following sequence of steps::
195ae4a0502SMauro Carvalho Chehab
196ae4a0502SMauro Carvalho Chehab	free[lane] = map[premap_aba]
197ae4a0502SMauro Carvalho Chehab	map[premap_aba] = postmap_aba
198ae4a0502SMauro Carvalho Chehab
199ae4a0502SMauro Carvalho ChehabBoth threads can update their respective free[lane] with the same old, freed
200ae4a0502SMauro Carvalho Chehabpostmap_aba. This has made the layout inconsistent by losing a free entry, and
201ae4a0502SMauro Carvalho Chehabat the same time, duplicating another free entry for two lanes.
202ae4a0502SMauro Carvalho Chehab
203ae4a0502SMauro Carvalho ChehabTo solve this, we could have a single map lock (per arena) that has to be taken
204ae4a0502SMauro Carvalho Chehabbefore performing the above sequence, but we feel that could be too contentious.
205ae4a0502SMauro Carvalho ChehabInstead we use an array of (nfree) map_locks that is indexed by
206ae4a0502SMauro Carvalho Chehab(premap_aba modulo nfree).
207ae4a0502SMauro Carvalho Chehab
208ae4a0502SMauro Carvalho Chehab
209ae4a0502SMauro Carvalho Chehabf. Reconstruction from the Flog
210ae4a0502SMauro Carvalho Chehab-------------------------------
211ae4a0502SMauro Carvalho Chehab
212ae4a0502SMauro Carvalho ChehabOn startup, we analyze the BTT flog to create our list of free blocks. We walk
213ae4a0502SMauro Carvalho Chehabthrough all the entries, and for each lane, of the set of two possible
214ae4a0502SMauro Carvalho Chehab'sections', we always look at the most recent one only (based on the sequence
215ae4a0502SMauro Carvalho Chehabnumber). The reconstruction rules/steps are simple:
216ae4a0502SMauro Carvalho Chehab
217ae4a0502SMauro Carvalho Chehab- Read map[log_entry.lba].
218ae4a0502SMauro Carvalho Chehab- If log_entry.new matches the map entry, then log_entry.old is free.
219ae4a0502SMauro Carvalho Chehab- If log_entry.new does not match the map entry, then log_entry.new is free.
220ae4a0502SMauro Carvalho Chehab  (This case can only be caused by power-fails/unsafe shutdowns)
221ae4a0502SMauro Carvalho Chehab
222ae4a0502SMauro Carvalho Chehab
223ae4a0502SMauro Carvalho Chehabg. Summarizing - Read and Write flows
224ae4a0502SMauro Carvalho Chehab-------------------------------------
225ae4a0502SMauro Carvalho Chehab
226ae4a0502SMauro Carvalho ChehabRead:
227ae4a0502SMauro Carvalho Chehab
228ae4a0502SMauro Carvalho Chehab1.  Convert external LBA to arena number + pre-map ABA
229ae4a0502SMauro Carvalho Chehab2.  Get a lane (and take lane_lock)
230ae4a0502SMauro Carvalho Chehab3.  Read map to get the entry for this pre-map ABA
231ae4a0502SMauro Carvalho Chehab4.  Enter post-map ABA into RTT[lane]
232ae4a0502SMauro Carvalho Chehab5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
233ae4a0502SMauro Carvalho Chehab6.  If ERROR flag set in map, end IO with EIO (go to step 8)
234ae4a0502SMauro Carvalho Chehab7.  Read data from this block
235ae4a0502SMauro Carvalho Chehab8.  Remove post-map ABA entry from RTT[lane]
236ae4a0502SMauro Carvalho Chehab9.  Release lane (and lane_lock)
237ae4a0502SMauro Carvalho Chehab
238ae4a0502SMauro Carvalho ChehabWrite:
239ae4a0502SMauro Carvalho Chehab
240ae4a0502SMauro Carvalho Chehab1.  Convert external LBA to Arena number + pre-map ABA
241ae4a0502SMauro Carvalho Chehab2.  Get a lane (and take lane_lock)
242ae4a0502SMauro Carvalho Chehab3.  Use lane to index into in-memory free list and obtain a new block, next flog
243ae4a0502SMauro Carvalho Chehab    index, next sequence number
244ae4a0502SMauro Carvalho Chehab4.  Scan the RTT to check if free block is present, and spin/wait if it is.
245ae4a0502SMauro Carvalho Chehab5.  Write data to this free block
246ae4a0502SMauro Carvalho Chehab6.  Read map to get the existing post-map ABA entry for this pre-map ABA
247ae4a0502SMauro Carvalho Chehab7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
248ae4a0502SMauro Carvalho Chehab8.  Write new post-map ABA into map.
249ae4a0502SMauro Carvalho Chehab9.  Write old post-map entry into the free list
250ae4a0502SMauro Carvalho Chehab10. Calculate next sequence number and write into the free list entry
251ae4a0502SMauro Carvalho Chehab11. Release lane (and lane_lock)
252ae4a0502SMauro Carvalho Chehab
253ae4a0502SMauro Carvalho Chehab
254ae4a0502SMauro Carvalho Chehab4. Error Handling
255ae4a0502SMauro Carvalho Chehab=================
256ae4a0502SMauro Carvalho Chehab
257ae4a0502SMauro Carvalho ChehabAn arena would be in an error state if any of the metadata is corrupted
258ae4a0502SMauro Carvalho Chehabirrecoverably, either due to a bug or a media error. The following conditions
259ae4a0502SMauro Carvalho Chehabindicate an error:
260ae4a0502SMauro Carvalho Chehab
261ae4a0502SMauro Carvalho Chehab- Info block checksum does not match (and recovering from the copy also fails)
262ae4a0502SMauro Carvalho Chehab- All internal available blocks are not uniquely and entirely addressed by the
263ae4a0502SMauro Carvalho Chehab  sum of mapped blocks and free blocks (from the BTT flog).
264ae4a0502SMauro Carvalho Chehab- Rebuilding free list from the flog reveals missing/duplicate/impossible
265ae4a0502SMauro Carvalho Chehab  entries
266ae4a0502SMauro Carvalho Chehab- A map entry is out of bounds
267ae4a0502SMauro Carvalho Chehab
268ae4a0502SMauro Carvalho ChehabIf any of these error conditions are encountered, the arena is put into a read
269ae4a0502SMauro Carvalho Chehabonly state using a flag in the info block.
270ae4a0502SMauro Carvalho Chehab
271ae4a0502SMauro Carvalho Chehab
272ae4a0502SMauro Carvalho Chehab5. Usage
273ae4a0502SMauro Carvalho Chehab========
274ae4a0502SMauro Carvalho Chehab
275ae4a0502SMauro Carvalho ChehabThe BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
276ae4a0502SMauro Carvalho Chehab(pmem, or blk mode). The easiest way to set up such a namespace is using the
277ae4a0502SMauro Carvalho Chehab'ndctl' utility [1]:
278ae4a0502SMauro Carvalho Chehab
279ae4a0502SMauro Carvalho ChehabFor example, the ndctl command line to setup a btt with a 4k sector size is::
280ae4a0502SMauro Carvalho Chehab
281ae4a0502SMauro Carvalho Chehab    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
282ae4a0502SMauro Carvalho Chehab
283ae4a0502SMauro Carvalho ChehabSee ndctl create-namespace --help for more options.
284ae4a0502SMauro Carvalho Chehab
285ae4a0502SMauro Carvalho Chehab[1]: https://github.com/pmem/ndctl
286