
			HAMMER2 Freemap Design Notes

				Overview

    HAMMER2 media is broken down into 2 GByte zones.  Each 2 GByte zone
    contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
    blocks in this header are reserved for various purposes.  For example,
    block #0 is reserved for a volume header.  Most of the remaining
    64KB blocks in this header are reserved for use by the freemap.

    The freemap only uses blocks from these reserved areas.  In order to
    ensure that any of the four volume headers can be used by the mount code
    (in case some are found to be corrupted), each freemap block in the
    logical freemap topology will iterate through up to 8 copies whose
    block numbers are taken from the reserved area.

    - Four copies, one for each of the four volume headers which H2 sequences
      through on each flush.  This ensures that a mount from any of the four
      volume headers is handed a consistent freemap topology.

    - One copy to ensure that recovery operations during mount do not modify
      the state of the freemap topology pointed to by older volume headers
      which are still valid.  Note that the freemap for volume headers
      indexed after the mount point being recovered may lose freemap
      consistency, so if you choose an older mount point for a RW mount,
      you have to stick with it.

    - One copy for live operations.  This allows HAMMER2 to retire the
      related buffer (or for the OS to retire the buffer cache buffer)
      prior to the next flush and also allows the buffers to be flushed
      asynchronously.

    - The two remaining copies add robustness to the specification.  For
      example, with appropriate feature code the filesystem can tolerate
      a limited number of bad blocks in the reserved area.

    For the moment we use a simple calculation for the freemap block.  In
    a later version I would like to mix the blocks up a bit so the blocks
    in each set of 8 are not situated near each other.

				RW Mount Restrictions

    If an older volume header is explicitly selected by the mount code, any
    newer (presumably corrupt since the mount code didn't select it) volume
    headers will lose freemap consistency as the freemap code rotates into
    freemap blocks that might have been used by the topology pointed to by
    the newer (but not selected) volume headers.  For a RW mount, this means
    that if an older volume header is selected, the newer ones that were
    not selected WILL be formally invalidated by the mount code and cannot
    be used in a remount attempt.

    During normal operation, each filesystem flush rotates to a new volume
    header.  A filesystem may have up to four volume headers spread at 2GB
    intervals.  Filesystems smaller than ~9GB or so will have fewer volume
    headers to rotate through.
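
    As a rough illustration, the number of volume headers a filesystem can
    rotate through might be computed as below.  This is a sketch only; the
    cutoff rule (a header is usable only when its whole 2GB zone fits on the
    media) is an assumption, not the kernel's exact test.

	#include <stdint.h>

	#define ZONE_BYTES	(2ULL * 1024 * 1024 * 1024)	/* 2GB zone */
	#define MAX_VOLHDRS	4

	/*
	 * Count how many of the four possible volume headers (at 2GB
	 * intervals starting at offset 0) a volume of 'volsize' bytes
	 * can rotate through.
	 */
	static int
	usable_volhdrs(uint64_t volsize)
	{
		uint64_t zones = volsize / ZONE_BYTES;	/* whole 2GB zones */

		if (zones < 1)
			return 1;		/* header #0 always exists */
		if (zones > MAX_VOLHDRS)
			return MAX_VOLHDRS;
		return (int)zones;
	}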

				Freemap Topology

    The freemap topology contains 4 levels of meta-data (blockref arrays),
    one of which is embedded in the volume header (so only three real
    meta-data levels), plus one level of leaf-data.  Unlike normal files,
    which use a variable radix, the freemap topology uses a fixed radix to
    simplify the algorithm and to ensure freemap locality to the blocks
    under management.

    Freemap blocks are allocated from the reserved area in each 2GB zone.
    The leafs represent data in the zone.  Higher levels in the freemap
    topology will cover more area but the physical freemap meta-data blocks
    always occur prior to the area being covered.  Thus a HAMMER2 filesystem
    of almost any size can be formatted and the related freemap blocks
    will always exist.

    Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
	      by a hammer2_bmap_data[1024] array.  Each entry represents
	      2MB worth of media storage x 1024 entries to represent 2GB.
	      Each entry contains a 128x2 bit bitmap representing 16KB
	      of storage in 2 bits (128 x 16KB = 2MB).

    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
	      (this conveniently eats one 512-byte 'sector' of the 64KB
	      volume header).
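
    The coverage numbers above follow directly from the radix values: the
    leaf covers 2^(10+21) bytes and each meta-data level above it multiplies
    coverage by 2^10.  A minimal sketch of that arithmetic (illustration
    only, not kernel code):

	#include <stdint.h>

	#define FREEMAP_LEAF_RADIX	(10 + 21)	/* Level 1 covers 2GB */
	#define FREEMAP_META_RADIX	10		/* 1024 entries/level */

	/*
	 * Coverage of one freemap block at levels 1-4 (2GB, 2TB, 2PB, 2EB).
	 * Level 5 is the 8 blockrefs in the volume header and covers the
	 * full 2^64 byte address space, which does not fit in a shift here.
	 */
	static uint64_t
	freemap_level_coverage(int level)		/* level 1..4 */
	{
		int radix = FREEMAP_LEAF_RADIX +
			    (level - 1) * FREEMAP_META_RADIX;

		return (uint64_t)1 << radix;
	}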

    Each level is assigned reserved blocks in the 4MB header per 2GB zone.
    Since we use block 0 for the volume header, the first freemap reserved
    block in the zone begins at block 1.

    Freemap copy #0:
	Level 1 uses block 1 (this is the leaf block)
	Level 2 uses block 2
	Level 3 uses block 3
	Level 4 uses block 4

    Freemap copy #1:
	Level 1 uses block 5 (this is the leaf block)
	Level 2 uses block 6
	Level 3 uses block 7
	Level 4 uses block 8

    ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
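
    In other words the reserved block for a given copy and level is a simple
    linear function.  A small sketch of that calculation (matching the table
    above, but not the actual kernel routine):

	#include <stdint.h>

	#define FREEMAP_COPIES		8
	#define FREEMAP_LEVELS		4	/* on-media levels 1..4 */
	#define RESERVED_BLOCK_BYTES	(64 * 1024)

	/* Reserved 64KB block within a zone's 4MB header (block 0 = volhdr). */
	static int
	freemap_reserved_block(int copy, int level)	/* copy 0..7, level 1..4 */
	{
		return 1 + copy * FREEMAP_LEVELS + (level - 1);
	}

	/* Byte offset of that block relative to the start of its 2GB zone. */
	static uint64_t
	freemap_reserved_offset(int copy, int level)
	{
		return (uint64_t)freemap_reserved_block(copy, level) *
		       RESERVED_BLOCK_BYTES;
	}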

				    Flushing

    The freemap does not have to be flushed by fsync/sync, but should probably
    be flushed at least once a minute by the normal filesystem sync.  The
    reason it does not have to be flushed every time is that freemap recovery
    is executed on-mount and will use the last fully flushed freemap TID
    stored in the volume header to do an incremental meta-data scan of the
    H2 filesystem between that TID and the last flushed TID.  Any block
    referenced by that scan which is not already marked allocated in the
    freemap will be marked allocated.  Simple as that.
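
    The core of that recovery rule can be illustrated with a toy TID window
    check.  This is a sketch under assumed names (toy_blockref, modify_tid);
    it is not the HAMMER2 recovery code itself.

	#include <stdbool.h>
	#include <stdint.h>

	struct toy_blockref {
		uint64_t modify_tid;	/* TID when this blockref changed */
		uint64_t data_off;	/* media offset it references */
	};

	/*
	 * Meta-data modified at or before the freemap flush TID is already
	 * reflected in the on-media freemap; meta-data modified after the
	 * last filesystem flush never reached the media.  Only the window
	 * in between needs to be scanned, and every block referenced inside
	 * it is forced to the allocated state.
	 */
	static bool
	needs_recovery_scan(const struct toy_blockref *bref,
			    uint64_t freemap_flush_tid, uint64_t last_flush_tid)
	{
		return bref->modify_tid > freemap_flush_tid &&
		       bref->modify_tid <= last_flush_tid;
	}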

				Freemap Granularity

    The freemap granularity is 16KB (radix of 14) but the minimum
    allocation radix is 1KB (radix of 10) (and can be in multiples of
    1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
    data, so tiny files eat exactly 1KB of media storage inclusive of the
    inode.
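
    As a minimal sketch of what those limits mean for an allocation request
    (illustration only, not the allocator): sizes are rounded up to a power
    of two no smaller than 1KB, while the freemap bitmap itself only tracks
    16KB chunks.

	#include <stddef.h>

	#define MIN_ALLOC_RADIX		10	/* 1KB minimum allocation */
	#define FREEMAP_BLOCK_RADIX	14	/* 16KB bitmap granularity */

	/* Smallest power-of-two radix >= bytes, clamped to the 1KB minimum. */
	static int
	alloc_radix(size_t bytes)
	{
		int radix = MIN_ALLOC_RADIX;

		while (((size_t)1 << radix) < bytes)
			radix++;
		return radix;
	}

	/*
	 * e.g. a 300-byte tiny file still consumes a 1KB inode:
	 * alloc_radix(300) == 10.
	 */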

    The freemap keeps track of partial allocations in-memory but not
    on-media, so even a normal umount will cause partially allocated
    blocks to appear fully allocated until some later date when the
    bulk scan code defragments it.

				 Block Selection

    Block selection is localized to be near the inode's (or nearby data)
    blockref.  The algorithmic complexity of determining locality is not
    defined here atm.

			     Freemap Leaf Substructure

    * linear - Linear sub-granular allocation offset.  Allows ~1KB granular
	       linear allocations.

    * class  - Allocation clustering class ((type << 8) | radix).

    * avail  - Available space in bytes, currently only used by layer 1 leaf.
	       Used as an allocation clustering aid.

    * bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
	       at 2 bits per chunk.  The filesystem allocation granularity
	       can be smaller (currently ~1KB minimum), and the live
	       filesystem caches iterations when allocating multiple chunks.
	       However, on remount any partial allocations out of a 64KB
	       allocation block MAY cause the entire 64KB to be considered
	       allocated.  Fragmented space can potentially be reclaimed
	       and/or relocated by the bulk block free scan.

	       The 2-bit bitmap fields are assigned as follows:

	       00	FREE
	       01	POSSIBLY FREE (type 1)
	       10	POSSIBLY FREE (type 2)
	       11	ALLOCATED
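
    A small sketch of how those 2-bit codes are addressed within a bmap
    entry's eight 32-bit words (field and function names here are
    illustrative, not the exact on-media structure): 16 chunks per word,
    128 chunks (~2MB) per entry.

	#include <stdint.h>

	#define BMAP_FREE	0	/* 00 */
	#define BMAP_POSSIBLE1	1	/* 01 possibly free (type 1) */
	#define BMAP_POSSIBLE2	2	/* 10 possibly free (type 2) */
	#define BMAP_ALLOCATED	3	/* 11 */

	struct toy_bmap {
		uint32_t bitmap[8];	/* 8 x 32 bits = 128 x 2-bit codes */
	};

	/* Extract the 2-bit code for a 16KB chunk (chunk 0..127). */
	static int
	bmap_get(const struct toy_bmap *bm, int chunk)
	{
		return (bm->bitmap[chunk >> 4] >> ((chunk & 15) * 2)) & 3;
	}

	/* Overwrite the 2-bit code for a 16KB chunk. */
	static void
	bmap_set(struct toy_bmap *bm, int chunk, int code)
	{
		int shift = (chunk & 15) * 2;

		bm->bitmap[chunk >> 4] &= ~((uint32_t)3 << shift);
		bm->bitmap[chunk >> 4] |= (uint32_t)code << shift;
	}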

			  Freemap Metadata Substructure
			     (Levels 2, 3, 4, and 5)

    Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal
    some of the check area (a 24-byte area) for freemap-specific meta-data.
    We reserve a few fields to store information which allows the block
    allocator to do its work more efficiently.

    * bigmask - A mask of radixes available for allocation under this
		blockref.  Typically initialized to -1.

    * avail   - Total available space in bytes.
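
    Purely as an illustration of the idea (the real blockref layout lives in
    hammer2_disk.h and is not reproduced here), the stolen 24-byte check area
    of a freemap blockref could be pictured as:

	#include <stdint.h>

	/* Hypothetical picture of a freemap blockref's 24-byte check area. */
	struct toy_freemap_check {
		uint32_t bigmask;	/* radixes possibly allocatable below */
		uint32_t reserved;
		uint64_t avail;		/* total bytes available below */
		uint64_t unused;	/* pads the area out to 24 bytes */
	};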

    The freemap allocator uses a cylinder-group-like abstraction using
    the localized allocation concept first implemented by UFS.  In HAMMER2
    there is no such thing as a real cylinder group, nor are there specific
    reserved areas for inodes vs data, but we do the next best thing by
    roughly typing leafs (each leaf representing ~2MB) to hopefully allow
    the drive to employ its zone-cache to make both stat-only and tar-style
    bulk accesses efficient (in addition to normal file accesses).

    Levels 2, 3, and 4 contain an array blockmap[1024] (64KB total),
    supplying 10 bits of address space each.  Level 5 is a blockmap[8]
    stored in the volume header supplying 3 bits of address space.
    (Level 1 supplies 10 + 21 bits of address space).
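
    Put together, those radix fields slice a 64-bit media offset into the
    per-level indices.  A sketch of that decomposition (not the kernel's
    radix code; names are illustrative):

	#include <stdint.h>

	struct freemap_index {
		int chunk;	/* 16KB chunk within a 2MB bmap entry (0..127) */
		int l1;		/* bmap entry within the Level 1 leaf (0..1023) */
		int l2, l3, l4;	/* blockmap indices at levels 2-4 (0..1023) */
		int l5;		/* blockref within the volume header (0..7) */
	};

	/* 14 + 7 + 10 + 10 + 10 + 10 + 3 bits account for all 64 bits. */
	static struct freemap_index
	freemap_decompose(uint64_t off)
	{
		struct freemap_index ix;

		ix.chunk = (int)((off >> 14) & 127);
		ix.l1    = (int)((off >> 21) & 1023);
		ix.l2    = (int)((off >> 31) & 1023);
		ix.l3    = (int)((off >> 41) & 1023);
		ix.l4    = (int)((off >> 51) & 1023);
		ix.l5    = (int)((off >> 61) & 7);
		return ix;
	}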

    The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
    effectively fixed at multiples of ~2MB or so.

			        Initial Conditions

    The freemap is a multi-indirect block structure but there is no real
    reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
    leaves the associated top-level indirect blocks empty and uses the
    (voldata->allocator_beg) field to allocate space linearly, then leaves
    it to the live filesystem to initialize the freemap as more space gets
    allocated.

    The freemap does NOT use a fixed 5-level radix tree.  It uses the same
    blockmap algorithm used for file blocks but restricts any recursion to
    specific radix values.  This means that small filesystems will have much
    smaller freemap depths.  Two levels (counting the 8 volume header
    blockrefs as one level and the leaf blocks as the other) gets us 16GB,
    three levels gets us 16TB.
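
    The reach of each depth follows from the numbers above (8 header
    blockrefs, 1024 entries per blockmap, 2GB per leaf).  A sketch of that
    arithmetic, assumed from those figures rather than taken from the code:

	#include <stdint.h>

	/*
	 * How much media a freemap of the given depth can address, where
	 * the 8 volume header blockrefs count as one level.  Depth 2 means
	 * the header blockrefs point directly at 2GB leaves; each extra
	 * level multiplies the reach by 1024.  Depth 5 would be the full
	 * 2^64 bytes and overflows a 64-bit byte count.
	 */
	static uint64_t
	freemap_reach(int levels)		/* 2..4 */
	{
		uint64_t reach = 8ULL << 31;	/* 8 x 2GB leaves = 16GB */
		int i;

		for (i = 2; i < levels; i++)
			reach <<= 10;		/* one more 1024-entry level */
		return reach;			/* 16GB, 16TB, 16PB */
	}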

			How blocks are allocated and freed

    The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
    contains a linear allocation offset that can keep track of sub-16KB
    allocations with certain restrictions.  More random sub-16KB allocations
    are tracked in-memory, but will be lost (assumed to be a full 16KB) if
    a crash occurs.

    NOTE!  All operations on the freemap occur on the current live version
	   of the freemap, including bulkfree operations.

    Blocks are allocated by transitioning the 2-bit pattern in the leaf
    to 11.  That is, (00, 01, 10) -> (11).  This handles races between the
    live allocator and the asynchronous bulkfree code.  A live allocation
    which occurs while the asynchronous bulkfree process is running will
    operate race-free by transitioning the (01) and (10) states back
    to (11), which prevents bulkfree from later marking those blocks
    FREE (00).
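
    Continuing the toy bitmap sketch from the leaf substructure section
    (again illustrative, not kernel code), the allocation transition and the
    corresponding bulkfree check look roughly like this:

	#include <stdbool.h>
	#include <stdint.h>

	/* (00, 01, 10) -> 11: the live allocator simply ORs in both bits. */
	static void
	bmap_allocate_chunk(uint32_t bitmap[8], int chunk)	/* 0..127 */
	{
		bitmap[chunk >> 4] |= (uint32_t)3 << ((chunk & 15) * 2);
	}

	/*
	 * bulkfree only transitions POSSIBLY FREE (01/10) chunks to FREE;
	 * a chunk the live system re-marked ALLOCATED (11) in the meantime
	 * is left alone, which is what makes the race harmless.
	 */
	static bool
	bulkfree_may_free_chunk(const uint32_t bitmap[8], int chunk)
	{
		int code = (bitmap[chunk >> 4] >> ((chunk & 15) * 2)) & 3;

		return (code == 1 || code == 2);
	}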

    Blocks can be freed several ways, but all block freeing operations
    require at least two passes before the related blocks can actually be
    reused.

    Method #1 - Removal in the filesystem marks the related freemap bitmap
		POSSIBLY FREE (either 01 or 10).  The asynchronous bulkfree
		process later determines that the block is actually free and
		transitions it to FREE (00), or moves it back to
		ALLOCATED (11).

		This works whether the blocks can actually be freed or not,
		so we don't care if the related blocks are part of some other
		snapshot or not.  bulkfree will figure it out.

    Method #2 - Removal in the filesystem ignores the freemap.  The freemap
		blocks are left alone (typically ALLOCATED (11)).

		In this situation bulkfree must make extra passes to determine
		if blocks are possibly free, then transition the leaf bitmap
		entries to POSSIBLY FREE (01 or 10).  bulkfree cannot directly
		transition the entries to FREE (00) without another pass.

		However, this method has numerous advantages including making
		snapshot manipulation (including deletions) instantaneous
		and allowing whole subtrees and/or large files to be rm -rf'd
		with only a single disk write to update the inode in the
		best case.

    Method #3 - Brute force.  *ALL* freemap bitmap entries are marked
		POSSIBLY FREE and bulkfree then must do multiple passes
		(particularly in order to ensure that its memory use remains
		bounded) to again transition all the freemap bitmap entries
		to either FREE (00) or ALLOCATED (11).

		This method can be faster than #2 but wastes a considerable
		amount of write bandwidth (and incurs SSD wear if the target
		drive is an SSD).

    In all cases the bulkfree code must make a final pass on the filesystem
    to do the final transition of POSSIBLY FREE blocks to FREE (00) or
    ALLOCATED (11).  Again, races for the FREE (00) transition are handled
    by observing whether the bitmap code was moved to ALLOCATED (11) by the
    live system while bulkfree ran asynchronously, and not transitioning the
    element to FREE (00) in that situation.

    All bulkfree passes are done on meta-data.  Actual data blocks do not
    need to be read unless the media is being verified.  H2 uses method #2
    by default and efficiency depends on how much RAM the system has to
    cache scan information.  That said, the bulkfree process is not only
    incremental but it is also interruptible and restartable.  It does not
    interfere with live operations other than using disk bandwidth, so
    there are several ways to run it including in the background.

    The biggest issue is that *NO* space can be freed up by the live
    filesystem without the bulkfree process unless we optimize the case
    where data is created and deleted from within a single snapshot.
    This is made more difficult by the fact that each flush represents
    a fine-grained snapshot (up to four, representing the four volume
    headers the flush iterates through).

		      Snapshots and Replicated Topologies

    The bulkfree code maintains information in-memory to the best of its
    ability for a multitude of reasons, including attempting to detect
    snapshot recursions down block chains which have already been scanned
    via some other snapshot.  Without this, a large number of snapshots
    can cause a huge multiplication of disk I/O reads (but not writes) during
    the topology scan.

			Use of Generic indirect-block API

    I decided to use the same indirect-block allocation model for the
    freemap that normal files use, with a few special cases added to force
    specific radix values and to 'allocate' the freemap-related blocks
    and indirect blocks via a reserved-block calculation and (obviously)
    not via a recursive call to the allocator.

    The Freemap is defined above as a fixed 5-level scheme (levels 1-5),
    but in actual operation the radix tree can be shortcut just as it
    is with normal files.  However, unlike normal files, shortcuts will
    be forced to use specific radix values in order to guarantee that
    reserved block numbers can be trivially calculated.  As the freemap
    becomes more fleshed out the tree on-media will look more and more like
    the actual specification.

    One advantage of doing things this way is that smaller filesystems
    won't actually use a 5-level scheme.  A 16GB filesystem can use 8
    blockrefs in the volume header which point directly to layer 1 leaf
    blocks.  A 16TB filesystem can be managed with only three levels
    (layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in
    the volume header).  And so forth.

    At the moment we have no plans to return any of the unused 4MB zone
    header space (per 2GB of storage) back to the filesystem for general use.
    There are lots of things we may want to use the reserved areas for in
    the future.

				Emergency Deletions

    All filesystem modifications including deletions must allocate blocks
    in order to update the main topology all the way to the root.  H2 will
    reserve roughly 5% of the available blocks in the filesystem for
    deletions in order to allow a system operator to recover from a
    filesystem full condition.

    Despite this, situations may come up, due to having snapshots, where
    deletions eat up available blocks but fail to create freeable space.
    When this situation occurs the system operator may be forced to issue
    emergency in-place deletions which replace existing blocks rather than
    allocate new blocks.  For the moment the spec for dealing with these
    situations remains incomplete.
335