`folly/ThreadCachedInt.h`
-------------------------

High-performance atomic increment using thread caching.

`folly/ThreadCachedInt.h` introduces an integer class designed for
high-performance increments from multiple threads simultaneously without loss
of precision. It has two read modes: `readFast` gives a potentially stale
value with a single load, while `readFull` gives the exact value but is much
slower, as discussed below.


### Performance
***

Increment performance is up to 10x greater than `std::atomic_fetch_add` in
high-contention environments. See `folly/test/ThreadCachedIntTest.h` for more
comprehensive benchmarks.

`readFast` is as fast as a single load.

`readFull`, on the other hand, requires acquiring a mutex and iterating
through a list to accumulate the values of all the thread-local counters, so
it is significantly slower than `readFast`.


### Usage
***

Create an instance and increment it with `increment` or the operator
overloads. Read the value with `readFast` for quick, potentially stale data,
or `readFull` for a more expensive but precise result. There are additional
convenience functions as well, such as `set`.

``` Cpp
    ThreadCachedInt<int64_t> val;
    EXPECT_EQ(0, val.readFast());
    ++val;                        // increment in thread local counter only
    EXPECT_EQ(0, val.readFast()); // increment has not been flushed
    EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
    val.set(2);
    EXPECT_EQ(2, val.readFast());
    EXPECT_EQ(2, val.readFull());
```

### Implementation
***

`folly::ThreadCachedInt` uses `folly::ThreadLocal` to store thread-specific
objects that each have a local counter. When incrementing, the thread-local
instance is incremented. If the local counter passes the cache size, the value
is flushed to the global counter with an atomic increment.
It is this global counter that is read by `readFast` via a simple load, but it
will not count any of the updates that haven't yet been flushed.

In order to read the exact value, `ThreadCachedInt` uses the extended
`readAllThreads()` API of `folly::ThreadLocal` to iterate through all the
references to all the associated thread-local object instances. This currently
requires acquiring a global mutex and iterating through the references,
accumulating the counters along with the global counter. This also means that
the first use of the object from a new thread will acquire the mutex in order
to insert the thread-local reference into the list. By default, there is one
global mutex per integer type used in `ThreadCachedInt`. If you plan on using
a lot of `ThreadCachedInt`s in your application, consider breaking up the
global mutex by introducing additional `Tag` template parameters.

`set` simply sets the global counter value and marks all the thread-local
instances as needing to be reset. When iterating with `readFull`, thread-local
counters that have been marked as reset are skipped. When incrementing,
thread-local counters marked for reset are set to zero and unmarked for reset.

Upon destruction, thread-local counters are flushed to the parent so that
counts are not lost after increments in temporary threads. This requires
grabbing the global mutex to make sure the parent itself wasn't already
destroyed in another thread.

### Alternate Implementations
***

There are of course many ways to skin a cat, and you may notice there is a
partial alternate implementation in `folly/test/ThreadCachedIntTest.cpp` that
provides similar performance. `ShardedAtomicInt` simply uses an array of
`std::atomic<int64_t>`s and hashes threads across them to do low-contention
atomic increments, and its `readFull` just sums up all the ints.

This sounds great, but in order to get contention low enough to match the
performance of `ThreadCachedInt` with 24 threads, `ShardedAtomicInt` needs
about 2000 ints to hash across. This uses about 20x more memory, and the
lock-free `readFull` has to sum up all 2048 ints, which ends up being about
50x slower than `ThreadCachedInt` in low-contention situations. Low contention
is hopefully the common case, since the class is designed for high-write,
low-read access patterns. In high-contention environments, `readFull`
performance is about the same for both implementations.

Depending on the operating conditions, it may make more sense to use one
implementation over the other. For example, a lower-contention environment
will probably be able to use a `ShardedAtomicInt` with a much smaller array
without hurting performance, while improving memory consumption and the
performance of `readFull`.