Notes on `sharded-slab`'s implementation and design.

# Design

The sharded slab's design is strongly inspired by the ideas presented by
Leijen, Zorn, and de Moura in [Mimalloc: Free List Sharding in
Action][mimalloc]. In this report, the authors present a novel design for a
memory allocator based on a concept of _free list sharding_.

Memory allocators must keep track of what memory regions are not currently
allocated ("free") in order to provide them to future allocation requests.
The term [_free list_][freelist] refers to a technique for performing this
bookkeeping, where each free block stores a pointer to the next free block,
forming a linked list. The memory allocator keeps a pointer to the most
recently freed block, the _head_ of the free list. To allocate more memory,
the allocator pops from the free list: it sets the head pointer to the
next free block of the current head block, and returns the previous head.
To deallocate a block, the block is pushed onto the free list: its first
word is set to the current head pointer, and the head pointer is updated to
point to the deallocated block. Most implementations of slab allocators
backed by arrays or vectors use a similar technique, where pointers are
replaced by indices into the backing array.

When allocations and deallocations can occur concurrently across threads,
accesses to the free list must be synchronized, either by putting the
entire allocator state inside of a lock, or by using atomic operations to
treat the free list as a lock-free structure (such as a [Treiber stack]). In
both cases, there is a significant performance cost — even when the free
list is lock-free, a noticeable amount of time is likely to be spent in
compare-and-swap loops. Ideally, the global synchronization point created
by the single global free list should be avoided as much as possible.
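The push and pop operations described above can be sketched as an index-based free list over a backing vector. This is an illustrative, single-threaded sketch — the `Slab` and `Entry` names here are invented for the example, not `sharded-slab`'s actual code:

```rust
/// A toy array-backed slab where free slots form a linked list of indices.
struct Slab<T> {
    slots: Vec<Entry<T>>,
    /// Index of the most recently freed slot: the head of the free list.
    free_head: Option<usize>,
}

enum Entry<T> {
    /// A free slot, storing the index of the next free slot (if any).
    Free { next: Option<usize> },
    Occupied(T),
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { slots: Vec::new(), free_head: None }
    }

    /// Allocate: pop the head of the free list, or grow the backing vector.
    fn insert(&mut self, value: T) -> usize {
        match self.free_head.take() {
            Some(idx) => {
                // The popped slot's `next` index becomes the new head.
                if let Entry::Free { next } = &self.slots[idx] {
                    self.free_head = *next;
                }
                self.slots[idx] = Entry::Occupied(value);
                idx
            }
            None => {
                self.slots.push(Entry::Occupied(value));
                self.slots.len() - 1
            }
        }
    }

    /// Deallocate: push the slot back onto the free list.
    fn remove(&mut self, idx: usize) -> Option<T> {
        let entry = std::mem::replace(
            &mut self.slots[idx],
            // The freed slot's first word points at the old head...
            Entry::Free { next: self.free_head },
        );
        match entry {
            Entry::Occupied(value) => {
                // ...and the head is updated to point at the freed slot.
                self.free_head = Some(idx);
                Some(value)
            }
            free => {
                // The slot was already free; put it back unchanged.
                self.slots[idx] = free;
                None
            }
        }
    }
}
```

Note how a removed slot's index is reused by the next insertion, exactly as a pointer-based free list reuses freed blocks.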
The approach presented by Leijen, Zorn, and de Moura is to introduce
sharding and thus increase the granularity of synchronization significantly.
In mimalloc, the heap is _sharded_ so that each thread has its own
thread-local heap. Objects are always allocated from the local heap of the
thread where the allocation is performed. Because allocations are always
done from a thread's local heap, they need not be synchronized.

However, since objects can move between threads before being deallocated,
_deallocations_ may still occur concurrently. Therefore, Leijen et al.
introduce a concept of _local_ and _global_ free lists. When an object is
deallocated on the same thread it was originally allocated on, it is placed
on the local free list; if it is deallocated on another thread, it goes on
the global free list for the heap of the thread from which it originated. To
allocate, the local free list is used first; if it is empty, the entire
global free list is popped onto the local free list. Since the local free
list is only ever accessed by the thread it belongs to, it does not require
synchronization at all, and because the global free list is popped from
infrequently, the cost of synchronization has a reduced impact. A majority
of allocations can occur without any synchronization at all, and
deallocations only require synchronization when an object has left its
parent thread (a relatively uncommon case).

[mimalloc]: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf
[freelist]: https://en.wikipedia.org/wiki/Free_list
[Treiber stack]: https://en.wikipedia.org/wiki/Treiber_stack

# Implementation

A slab is represented as an array of [`MAX_THREADS`] _shards_. A shard
consists of a vector of one or more _pages_ plus associated metadata.
Finally, a page consists of an array of _slots_ and head indices for the
local and remote free lists.
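A page's two free lists play the roles of mimalloc's local and global lists described above. The following is a hypothetical sketch of that structure — the names are invented for illustration, and a `Mutex` stands in for the atomic operations a real lock-free implementation would use for the remote list:

```rust
use std::sync::Mutex;

/// A toy page: an array of slots plus a head index for each free list.
struct Page {
    slots: Vec<Slot>,
    /// Head of the free list used only by the owning thread (unsynchronized).
    local_head: Option<usize>,
    /// Head of the free list pushed to by *other* threads (synchronized).
    remote_head: Mutex<Option<usize>>,
}

struct Slot {
    /// Index of the next free slot, when this slot is on a free list.
    next: Option<usize>,
}

impl Page {
    fn with_capacity(n: usize) -> Self {
        // Initially, every slot is on the local free list, linked in order.
        let slots = (0..n)
            .map(|i| Slot { next: if i + 1 < n { Some(i + 1) } else { None } })
            .collect();
        Page { slots, local_head: Some(0), remote_head: Mutex::new(None) }
    }

    /// Allocate a slot: try the local list first; if it is empty, steal the
    /// *entire* remote list in one synchronized step, then retry.
    fn alloc(&mut self) -> Option<usize> {
        if self.local_head.is_none() {
            self.local_head = self.remote_head.lock().unwrap().take();
        }
        let idx = self.local_head?;
        self.local_head = self.slots[idx].next;
        Some(idx)
    }

    /// Free from the owning thread: push onto the unsynchronized local list.
    fn free_local(&mut self, idx: usize) {
        self.slots[idx].next = self.local_head;
        self.local_head = Some(idx);
    }

    /// Free from another thread: push onto the synchronized remote list.
    /// (The real crate links slots with atomics so this needs only `&self`;
    /// `&mut self` here keeps the sketch simple.)
    fn free_remote(&mut self, idx: usize) {
        let mut head = self.remote_head.lock().unwrap();
        self.slots[idx].next = *head;
        *head = Some(idx);
    }
}
```

The fast path (`free_local` and a non-empty `local_head`) touches no synchronization primitive at all; only cross-thread frees and the occasional wholesale steal of the remote list pay for locking.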
```text
┌─────────────┐
│ shard 1     │
│             │    ┌─────────────┐        ┌────────┐
│ pages───────┼───▶│ page 1      │        │        │
├─────────────┤    ├─────────────┤  ┌────▶│  next──┼─┐
│ shard 2     │    │ page 2      │  │     ├────────┤ │
├─────────────┤    │             │  │     │XXXXXXXX│ │
│ shard 3     │    │ local_head──┼──┘     ├────────┤ │
└─────────────┘    │ remote_head─┼──┐     │        │◀┘
      ...          ├─────────────┤  │     │  next──┼─┐
┌─────────────┐    │ page 3      │  │     ├────────┤ │
│ shard n     │    └─────────────┘  │     │XXXXXXXX│ │
└─────────────┘          ...        │     ├────────┤ │
                   ┌─────────────┐  │     │XXXXXXXX│ │
                   │ page n      │  │     ├────────┤ │
                   └─────────────┘  │     │        │◀┘
                                    └────▶│  next──┼───▶ ...
                                          ├────────┤
                                          │XXXXXXXX│
                                          └────────┘
```

The size of the first page in a shard is always a power of two, and every
subsequent page added after the first is twice as large as the page that
precedes it.

```text
pg.
┌───┐   ┌─┬─┐
│ 0 │───▶ │ │
├───┤   ├─┼─┼─┬─┐
│ 1 │───▶ │ │ │ │
├───┤   ├─┼─┼─┼─┼─┬─┬─┬─┐
│ 2 │───▶ │ │ │ │ │ │ │ │
├───┤   ├─┼─┼─┼─┼─┼─┼─┼─┼─┬─┬─┬─┬─┬─┬─┬─┐
│ 3 │───▶ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
└───┘   └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
```

When searching for a free slot, the smallest page is searched first, and if
it is full, the search proceeds to the next page until either a free slot is
found or all available pages have been searched. If all available pages have
been searched and the maximum number of pages has not yet been reached, a
new page is then allocated.

Since every page is twice as large as the previous page, and all page sizes
are powers of two, we can determine the page index that contains a given
address by shifting the address down by the smallest page size and looking
at how many twos places are necessary to represent that number, which tells
us what power-of-two page size it fits inside of.
We can determine the number of twos places by counting the number of leading
zeros (unused twos places) in the number's binary representation, and
subtracting that count from the total number of bits in a word.

The formula for determining the page number that contains an offset is thus:

```rust,ignore
WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()
```

where `WIDTH` is the number of bits in a `usize`, and `INDEX_SHIFT` is:

```rust,ignore
INITIAL_PAGE_SIZE.trailing_zeros() + 1;
```

[`MAX_THREADS`]: https://docs.rs/sharded-slab/latest/sharded_slab/trait.Config.html#associatedconstant.MAX_THREADS
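The formula can be sanity-checked against a naive linear walk over the doubling page sizes. This is an illustrative sketch rather than the crate's code, and it assumes an initial page size of 32 slots (the constant is a stand-in, not the crate's actual default):

```rust
// Assumed initial page size for the example; pages then hold 32, 64, 128, ...
const INITIAL_PAGE_SIZE: usize = 32;
const WIDTH: u32 = usize::BITS;
const INDEX_SHIFT: u32 = INITIAL_PAGE_SIZE.trailing_zeros() + 1;

/// The bit-twiddling formula from above.
fn page_index(offset: usize) -> u32 {
    WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()
}

/// A naive version for comparison: walk the pages in order, doubling the
/// size each time, until the page whose range contains `offset` is found.
fn page_index_linear(offset: usize) -> u32 {
    let (mut page, mut start, mut size) = (0u32, 0usize, INITIAL_PAGE_SIZE);
    while offset >= start + size {
        start += size;
        size *= 2;
        page += 1;
    }
    page
}
```

For example, with a first page of 32 slots, offsets 0–31 land in page 0, offsets 32–95 in page 1 (64 slots), and offsets 96–223 in page 2 (128 slots); both functions agree on every offset.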