
			HAMMER2 Freemap Design Notes

				Overview

    HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
    contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
    blocks in this header are reserved for various purposes.  For example,
    block #0 is reserved for a volume header in the first four zones.  Most
    of the remaining 64KB blocks in this header are reserved for use by the
    freemap.

    The freemap only uses blocks from these reserved areas.  In order to
    ensure that any of the four volume headers can be used by the mount code
    (in case some are found to be corrupted), each freemap block in the
    logical freemap topology will iterate through up to 8 copies whose
    block numbers are taken from the reserved area.

    - Four copies, one for each of the four volume headers which H2 sequences
      through on each flush.  This ensures that a mount from any of the four
      volume headers is handed a consistent freemap topology.

    - One copy to ensure that recovery operations during mount do not modify
      the state of the freemap topology pointed to by older volume headers
      which are still valid.  Note that the freemap for volume headers
      indexed after the mount point being recovered may lose freemap
      consistency, so if you choose an older mount point for an RW mount,
      you have to stick with it.

    - One copy for live operations.  This allows HAMMER2 to retire the
      related buffer asynchronously in the background (or for the OS to
      retire the buffer cache buffer on its own) prior to the formal
      flush.  The later formal flush then has less work to do.

    - The two remaining copies add robustness to the specification.  For
      example, with appropriate feature code added, the filesystem can
      tolerate a limited number of bad blocks in the reserved area.

    For the moment we use a simple calculation for the freemap block.  In
    a later version I would like to mix the blocks up a bit so the blocks
    in each set of 8 are not situated near each other.

			    RW Mount Restrictions

    If an older volume header is explicitly selected by the mount code, any
    newer (presumably corrupt since the mount code didn't select it) volume
    headers will lose freemap consistency as the freemap code rotates into
    freemap blocks that might have been used by the topology pointed to by
    the newer (but not selected) volume headers.  For an RW mount, this means
    that if an older volume header is selected, the newer ones that were
    not selected WILL be formally invalidated by the mount code and cannot
    be used in a remount attempt.

    During normal operation, each filesystem flush rotates to a new volume
    header.  A filesystem may have up to four volume headers spread at 2GB
    intervals.  Filesystems smaller than ~9GB or so will have fewer volume
    headers to rotate through.

				Freemap Topology

    The freemap topology contains 4 levels of meta-data (blockref arrays),
    one of which is embedded in the volume header (so only three real
    meta-data levels), plus one level of leaf-data.  Unlike normal files,
    which use a variable-radix, the freemap topology uses a fixed radix to
    simplify the algorithm and to ensure freemap locality to the blocks
    under management.

    Freemap blocks are allocated from the reserved area in each 2GB zone.
    The leafs represent data in the zone.  Higher levels in the freemap
    topology will cover more area but the physical freemap meta-data blocks
    always occur prior to the area being covered.  Thus a HAMMER2 filesystem
    of almost any size can be formatted and the related freemap blocks
    will always exist.

    Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
	      by a hammer2_bmap_data[1024] array.  Each entry represents
	      2MB worth of media storage x 1024 entries to represent 2GB.
	      Each entry contains a 128x2 bit bitmap representing 16KB
	      of storage in 2 bits (128 x 16KB = 2MB).

    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
	      (the per-level radixes sum to 21+10+10+10+10+3 = 64 bits, and
	      the 8 blockrefs conveniently eat one 512-byte 'sector' of the
	      64KB volume header).

    Each level is assigned reserved blocks in the 4MB header per 2GB zone.
    Since we use block 0 for the volume header, the first freemap reserved
    block in the zone begins at block 1.

    Freemap copy #0:
	Level 1 uses block 1 (this is the leaf block)
	Level 2 uses block 2
	Level 3 uses block 3
	Level 4 uses block 4

    Freemap copy #1:
	Level 1 uses block 5 (this is the leaf block)
	Level 2 uses block 6
	Level 3 uses block 7
	Level 4 uses block 8

    ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
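
    As a concrete illustration, the reserved block for a given freemap copy
    and level follows from the table above with simple arithmetic (a minimal
    sketch; the macro and function names are hypothetical, not the kernel's
    actual helpers):

	#include <stdint.h>

	#define H2_ZONE_BYTES	(2048ULL * 1024 * 1024)	/* 2GB zone */
	#define H2_RESBLK_BYTES	(64 * 1024)		/* 64KB blocks */

	/*
	 * Media byte offset of the reserved freemap block for copy 0-7
	 * and level 1-4.  'area_base' is the base of the area the block
	 * covers; higher levels simply use the reserved area of the
	 * first 2GB zone they cover.  Block 0 is the volume header, so
	 * copy #0 uses blocks 1-4, copy #1 uses blocks 5-8, ... and
	 * copy #7 uses blocks 29-32.
	 */
	static uint64_t
	h2_freemap_blkoff(uint64_t area_base, int copy, int level)
	{
		uint64_t zone = area_base & ~(H2_ZONE_BYTES - 1);
		uint64_t blkno = 1 + copy * 4 + (level - 1);

		return (zone + blkno * H2_RESBLK_BYTES);
	}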
105
106				    Flushing
107
108    The freemap does not have to be flushed by fsync/sync, but should probably
109    be flushed at least once a minute by the normal filesystem sync.  The
110    reason it does not have to be flushed with fsync is that freemap recovery
111    is executed on-mount and will use the last fully flushed freemap TID
112    stored in the volume header to do an incremental meta-data scan of the
113    H2 filesystem between that TID and the last flushed TID.  All blocks not
114    found to have been marked allocated will be marked allocated.  Simple as
115    that.  Since the scan is incremental, this typically costs very little
116    time.
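
    In outline the recovery pass reduces to the following (a minimal sketch;
    the structure and function names are hypothetical stand-ins and the real
    code walks the media topology rather than an array):

	#include <stdint.h>
	#include <stddef.h>

	/* Hypothetical stand-in for a blockref gathered by the scan. */
	struct h2_blockref {
		uint64_t modify_tid;	/* TID of last modification */
		uint64_t data_off;	/* media offset referenced */
		uint64_t bytes;		/* bytes referenced */
	};

	/* Stub: set the covering leaf bitmap bits to 11 (ALLOCATED). */
	static void
	h2_freemap_mark(uint64_t data_off, uint64_t bytes)
	{
		(void)data_off;
		(void)bytes;
	}

	/*
	 * Mark as allocated every block referenced by meta-data whose
	 * modify TID falls in the window between the last fully flushed
	 * freemap TID and the last flushed filesystem TID.  Marking is
	 * idempotent, so already-allocated blocks are unaffected.
	 */
	static void
	h2_freemap_recover(const struct h2_blockref *brefs, size_t count,
			   uint64_t freemap_tid, uint64_t mirror_tid)
	{
		size_t i;

		for (i = 0; i < count; ++i) {
			if (brefs[i].modify_tid > freemap_tid &&
			    brefs[i].modify_tid <= mirror_tid)
				h2_freemap_mark(brefs[i].data_off,
				    brefs[i].bytes);
		}
	}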

				Freemap Granularity

    The freemap granularity is 16KB (radix of 14) but the minimum
    allocation radix is 1KB (radix of 10), and allocations can be made
    in multiples of 1KB with some additional coding.  1KB inodes can hold
    up to 512 bytes of direct data, so tiny files eat exactly 1KB of media
    storage inclusive of the inode.

    The freemap keeps track of partial allocations in-memory but not
    on-media, so even a normal umount will cause partially allocated
    blocks to appear fully allocated until some later date when the
    bulk scan code defragments them.

				 Block Selection

    Block selection is localized to be near the inode's (or nearby data)
    blockref.  The algorithmic complexity of determining locality is not
    defined here at the moment.

			     Freemap Leaf Substructure

    * linear - Linear sub-granular allocation offset.  Allows ~1KB granular
	       linear allocations.

    * class  - Allocation clustering class ((type << 8) | radix).

    * avail  - Available space in bytes, currently only used by the layer 1
	       leaf.  Used as an allocation clustering aid.

    * bitmap - Eight 32-bit words representing ~2MB in 16KB allocation chunks
	       at 2 bits per chunk.  The filesystem allocation granularity
	       can be smaller (currently ~1KB minimum), and the live
	       filesystem caches iterations when allocating multiple chunks.
	       However, on remount any partial allocations out of a 64KB
	       allocation block MAY cause the entire 64KB to be considered
	       allocated.  Fragmented space can potentially be reclaimed
	       and/or relocated by the bulk block free scan.

	       The 2-bit bitmap fields are assigned as follows:

	       00	FREE
	       01	POSSIBLY FREE (type 1)
	       10	POSSIBLY FREE (type 2)
	       11	ALLOCATED
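
    Putting the leaf fields together, the substructure and its 2-bit bitmap
    accessors might be modeled as follows (a sketch only; the on-media
    structure in hammer2_disk.h differs in field widths, padding, and order):

	#include <stdint.h>

	#define H2_FREE		0	/* 00 */
	#define H2_MAYFREE1	1	/* 01 POSSIBLY FREE (type 1) */
	#define H2_MAYFREE2	2	/* 10 POSSIBLY FREE (type 2) */
	#define H2_ALLOC	3	/* 11 ALLOCATED */

	/* One leaf entry covering ~2MB (illustrative layout only). */
	struct h2_bmap_data {
		int32_t		linear;		/* sub-16KB linear offset */
		uint16_t	class;		/* (type << 8) | radix */
		uint64_t	avail;		/* available bytes */
		uint32_t	bitmap[8];	/* 128 x 2-bit chunk codes */
	};

	/* Fetch the 2-bit code for 16KB chunk n (0-127) of the entry. */
	static int
	h2_bmap_get(const struct h2_bmap_data *bmap, int n)
	{
		return ((bmap->bitmap[n >> 4] >> ((n & 15) * 2)) & 3);
	}

	/* Store the 2-bit code for 16KB chunk n of the entry. */
	static void
	h2_bmap_set(struct h2_bmap_data *bmap, int n, int code)
	{
		int shift = (n & 15) * 2;
		uint32_t *wp = &bmap->bitmap[n >> 4];

		*wp = (*wp & ~(3U << shift)) | ((uint32_t)code << shift);
	}

    Allocating a chunk is then h2_bmap_set(bmap, n, H2_ALLOC), matching the
    (00, 01, 10) -> (11) transition described later.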

			  Freemap Metadata Substructure
			     (Levels 2, 3, 4, and 5)

    Freemap levels 2, 3, 4, and 5 operate as arrays of blockrefs but steal
    some of the check area (a 24-byte area) for freemap-specific meta-data.
    We reserve a few fields to store information which allows the block
    allocator to do its work more efficiently.

    * bigmask - A mask of radixes available for allocation under this
		blockref.  Typically initialized to -1.

    * avail   - Total available space in bytes.
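
    For illustration, the stolen check-area fields can be pictured as a
    24-byte overlay (a sketch; the field widths and layout here are
    assumptions, not the on-media definition in hammer2_disk.h):

	#include <stdint.h>

	/*
	 * Illustrative 24-byte overlay of the blockref check area as
	 * used by the freemap (layout is an assumption).
	 */
	struct h2_check_freemap {
		uint32_t	icrc32;		/* meta-data check code */
		uint32_t	bigmask;	/* allocatable radix mask,
						 * typically init to -1 */
		uint64_t	avail;		/* available bytes below */
		uint8_t		reserved[8];	/* pad to 24 bytes */
	};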

    The freemap allocator uses a cylinder-group-like abstraction using
    the localized allocation concept first implemented by UFS.  In HAMMER2
    there is no such thing as a real cylinder group, nor are there specific
    reserved areas for inodes vs data, but we do the next best thing by
    roughly typing leafs (each leaf representing ~2MB) to hopefully allow
    the drive to employ its zone-cache to make both stat-only and tar-style
    bulk accesses efficient (in addition to normal file accesses).

    Levels 2, 3, and 4 each contain a blockmap[1024] array (64KB total),
    supplying 10 bits of address space each.  Level 5 is a blockmap[8]
    stored in the volume header supplying 3 bits of address space.
    (Level 1 supplies 10 + 21 bits of address space.)

    The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
    effectively fixed at multiples of ~2MB or so.

			        Initial Conditions

    newfs_hammer2 does not need to format the freemap.  Instead, newfs_hammer2
    simply leaves the associated top-level indirect blocks empty and uses
    the (voldata->allocator_beg) field to allocate space linearly, then
    leaves it to the live filesystem to initialize the freemap as more space
    gets allocated.
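
    That early-life behavior conceptually amounts to a bump allocator
    (a sketch; names other than allocator_beg are hypothetical, and details
    such as skipping each zone's reserved header blocks are omitted):

	#include <stdint.h>

	/* Minimal stand-in for the volume-header fields involved. */
	struct h2_voldata {
		uint64_t allocator_beg;	/* linear allocation point */
	};

	/*
	 * Allocate 'bytes' linearly before the freemap has been
	 * populated.  The updated allocator_beg is persisted with the
	 * next volume header flush.
	 */
	static uint64_t
	h2_alloc_linear(struct h2_voldata *voldata, uint64_t bytes)
	{
		uint64_t off = voldata->allocator_beg;

		voldata->allocator_beg += bytes;
		return (off);
	}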

    The freemap does NOT use a fixed 5-level radix tree.  It uses the same
    blockmap algorithm used for file blocks but restricts any recursion to
    specific radix values.  This means that small filesystems will have much
    smaller freemap depths.  2 layers (counting the blockrefs embedded in
    the volume header as a layer) gets us 16GB, 3 layers gets us 16TB.
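
    The depth needed for a given volume size follows directly from the radix
    values (a sketch; the layer counting matches the paragraph above, with
    the volume header blockrefs counted as a layer):

	#include <stdint.h>

	/*
	 * Layers needed to cover 'size' bytes, counting the 8 blockrefs
	 * embedded in the volume header as a layer and the level 1
	 * leaves as another.  Two layers cover 8 x 2GB = 16GB and each
	 * additional blockmap layer multiplies coverage by 1024, capped
	 * at the full 5-level (16EB) scheme.
	 */
	static int
	h2_freemap_layers(uint64_t size)
	{
		uint64_t cover = 8ULL << 31;	/* 16GB */
		int layers = 2;

		while (layers < 5 && cover < size) {
			cover <<= 10;
			++layers;
		}
		return (layers);
	}

    For example, h2_freemap_layers(16ULL << 30) returns 2 and
    h2_freemap_layers(16ULL << 40) returns 3, matching the 16GB and 16TB
    figures above.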

			How blocks are allocated and freed

    The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
    contains a linear allocation offset that can keep track of sub-16KB
    allocations with certain restrictions.  More random sub-16KB allocations
    are tracked in-memory, but will be lost (assumed to be a full 16KB) if
    a crash occurs.  Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10,
    or 11.

    NOTE!  All operations on the freemap occur on the current live version
	   of the freemap, including bulkfree operations.

    Blocks are allocated by transitioning the 2-bit pattern in the leaf
    to 11.  That is, (00, 01, 10) -> (11).

    The primary mechanism used to free a block is via the asynchronous
    bulkfree scan.  This scans all filesystem meta-data in two major passes
    (and potentially multiple sub-passes).

    Pass#1 - The first pass figures out which blocks might be freeable.  The
	     most recently flushed meta-data topology (including all four
	     volume headers and all snapshots) is scanned and an in-memory
	     copy of the freemap is built from scratch.  Multiple sub-scans
	     might be required to break the larger scan up into more easily
	     digested pieces based on the amount of memory available to hold
	     the temporary freemap.

	     Any block marked allocated (11) in the live freemap which the
	     scan found to be unreferenced is then transitioned to (10)
	     or (01).

	     The blocks are still assumed to be allocated at this time and
	     any new allocations will transition them back to (11).

    Pass#2 - The second pass is required to deal with races against the
	     live filesystem while the freemap scan was running.  It also
	     allows the freemap scans to run asynchronously from any flush,
	     improving concurrency.  However, at least one synchronous flush
	     is required between Pass#1 and Pass#2.

	     The second pass is a duplicate of the first pass.  The meta-data
	     topology is scanned and a freemap is built in-memory and then
	     compared against the live freemap.  Instead of transitioning
	     from (11) to (10)/(01), this pass transitions from (10)/(01)
	     to (00).

	     If a block the scan believes is free is found to be (11), no
	     transition occurs because this could be due to a race against
	     the live filesystem.

	     This pass will incidentally transition (10)/(01) back to (11)
	     if the block was found to be allocated after all, but it is
	     perfectly acceptable for the block to remain in a (10)/(01)
	     state after completion.
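
    Expressed against the 2-bit codes from the leaf sketch earlier, the two
    passes reduce to the following transitions (a sketch; the real bulkfree
    code operates on whole bitmap words and tracks more state):

	/*
	 * Pass #1: demote a live ALLOCATED (11) block to POSSIBLY FREE
	 * when the in-memory scan found no reference to it.  'type' is
	 * H2_MAYFREE1 or H2_MAYFREE2; which of the two codes is used
	 * is an implementation detail.
	 */
	static int
	h2_bulkfree_pass1(int live, int referenced, int type)
	{
		if (live == H2_ALLOC && !referenced)
			return (type);		/* 11 -> 01 or 10 */
		return (live);
	}

	/*
	 * Pass #2: a block still POSSIBLY FREE after the intervening
	 * synchronous flush becomes FREE if the second scan also found
	 * no reference, and is promoted back to ALLOCATED otherwise
	 * (leaving it possibly-free would also be acceptable).  A live
	 * (11) is never touched here since it may reflect a new
	 * allocation racing the scan.
	 */
	static int
	h2_bulkfree_pass2(int live, int referenced)
	{
		if (live == H2_MAYFREE1 || live == H2_MAYFREE2)
			return (referenced ? H2_ALLOC : H2_FREE);
		return (live);
	}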

    NOTE! The meta-data scanning passes must also explicitly scan blocks
	  associated with any open files, since these might represent
	  open-but-deleted files.  These blocks must not be accidentally freed
	  while the system is still using the file.  Again, since this is
	  done in two passes it does not have to be synchronized against
	  frontend operations.  So in total:

	  * Topology under all four volume headers.  This includes all
	    PFSs and snapshots.

	  * Topology under all open hammer2 files.

    The bulkfree operation is expensive but uses a bounded amount of RAM.
    The ADVANTAGE of this mechanism is that deletions in the live filesystem
    do not have to clean up the freemap and thus do not have to recurse
    the topology during the deletion.  In fact, a 'rm -rf' equivalent of a
    directory topology can be handled simply by blowing away the top-level
    directory inode.  This is instantaneous and thus can be dangerous, but
    you always have your snapshots to fall back on.

    The DISADVANTAGE is that all meta-data must be scanned.  Twice.  This
    can be mitigated by using swapcache(8) to cache the meta-data on a SSD.
    It is also mitigated by the fact that the bulkfree scan can be run less
    often on very large filesystems, which presumably have a lot of free
    space (so the interval is not as big an issue).  In a sense the operation
    does scale, in that it takes longer on larger filesystems but can also
    be run less often.

    The biggest issue is that *NO* space can be freed up by the live
    filesystem without the bulkfree process unless we optimize the case
    where data is created and deleted from within a single snapshot.
    This is made more difficult by the fact that each flush represents
    a fine-grained snapshot (up to four, representing the four volume
    headers the flush iterates through).

		      Snapshots and Replicated Topologies

    The bulkfree code maintains information in-memory to the best of its
    ability for a multitude of reasons, including attempting to detect
    snapshot recursions down block chains which have already been scanned
    via some other snapshot.  Without this, a large number of snapshots
    can cause a huge multiplication of disk I/O reads (but not writes) during
    the topology scan.

			Use of Generic indirect-block API

    I decided to use the same indirect-block allocation model for the
    freemap that normal files use, with a few special cases added to force
    specific radix values and to 'allocate' the freemap-related blocks
    and indirect blocks via a reserved-block calculation and (obviously)
    not via a recursive call to the allocator.

    The freemap is defined above as a fixed 5-level scheme (levels 1-5),
    but in actual operation the radix tree can be shortcut just as it
    is with normal files.  However, unlike normal files, shortcuts will
    be forced to use specific radix values in order to guarantee that
    reserved block numbers can be trivially calculated.  As the freemap
    becomes more fleshed out, the tree on-media will look more and more
    like the actual specification.

    One advantage of doing things this way is that smaller filesystems
    won't actually use a 5-level scheme.  A 16GB filesystem can use 8
    blockrefs in the volume header which point directly to layer 1 leaf
    blocks.  A 16TB filesystem can be managed with only three levels
    (layers 3, 2, and 1 only, where the 8 layer-3 blockrefs are stored
    in the volume header).  And so forth.

    At the moment we have no plans to return any of the unused 4MB zone
    header space (per 2GB of storage) back to the filesystem for general use.
    There are lots of things we may want to use the reserved areas for in
    the future.

				Emergency Deletions

    All filesystem modifications including deletions must allocate blocks
    in order to update the main topology all the way to the root.  H2 will
    reserve roughly 5% of the available blocks in the filesystem for
    deletions in order to allow a system operator to recover from a
    filesystem full condition.

    However, due to the snapshot capability as well as the possibility of
    fragmentation, it is possible that the administrator cannot delete
    enough to actually free up blocks.  Once the reserve is used up the
    filesystem can become unwritable.

    When this situation occurs the only way to recover is to update blocks
    in-place.  Updating blocks in-place will destroy the data on any
    related snapshots or otherwise corrupt the snapshots.  Emergency
    recovery thus recommends that all related snapshots be destroyed.  You
    can choose not to do this, in which case your snapshots might wind up
    containing broken links and generating CRC failure messages.

    For the moment the spec for dealing with these situations remains
    incomplete.
353