Memory Barriers
===============

Modern CPUs make extensive use of pipelining and out-of-order execution,
meaning that the CPU is often executing more than one instruction at a
time, and not necessarily in the order that the source code would suggest.
Furthermore, even before the CPU gets a chance to reorder operations, the
compiler may (and often does) reorganize the code for greater efficiency,
particularly at higher optimization levels.  Optimizing compilers and
out-of-order execution are both critical for good performance, but they
can lead to surprising results when multiple processes access the same
memory space.

Example
=======

Suppose x is a pointer to a structure stored in shared memory, and that the
entire structure has been initialized to zero bytes.  One backend executes
the following code fragment:

    x->foo = 1;
    x->bar = 1;

Meanwhile, at approximately the same time, another backend executes this
code fragment:

    bar = x->bar;
    foo = x->foo;

The second backend might end up with foo = 1 and bar = 1 (if it executes
both statements after the first backend), or with foo = 0 and bar = 0 (if
it executes both statements before the first backend), or with foo = 1 and
bar = 0 (if the first backend executes the first statement, the second
backend executes both statements, and then the first backend executes the
second statement).

Surprisingly, however, the second backend could also end up with foo = 0
and bar = 1.  The compiler might swap the order of the two stores performed
by the first backend, or the two loads performed by the second backend.
Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
or Itanium) the CPU might choose to execute either the loads or the stores
out of order.  This surprising result can lead to bugs.

A common pattern where this actually does result in a bug is when adding
items onto a queue.
The writer does this:

    q->items[q->num_items] = new_item;
    ++q->num_items;

The reader does this:

    num_items = q->num_items;
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

This code turns out to be unsafe, because the writer might increment
q->num_items before it finishes storing the new item into the appropriate
slot.  More subtly, the reader might prefetch the contents of the q->items
array before reading q->num_items.  Thus, there's still a bug here *even if
the writer does everything in the order we expect*.  We need the writer to
update the array before bumping the item counter, and the reader to examine
the item counter before examining the array.

Note that these types of highly counterintuitive bugs can *only* occur when
multiple processes are interacting with the same memory segment.  A given
process always perceives its *own* writes to memory in program order.

Avoiding Memory Ordering Bugs
=============================

The simplest (and often best) way to avoid memory ordering bugs is to
protect the data structures involved with an lwlock.  For more details, see
src/backend/storage/lmgr/README.  For instance, in the above example, the
writer could acquire an lwlock in exclusive mode before appending to the
queue, and each reader could acquire the same lock in shared mode before
reading it.  If the data structure is not heavily trafficked, this solution
is generally entirely adequate.

However, in some cases, it is desirable to avoid the overhead of acquiring
and releasing locks.  In this case, memory barriers may be used to ensure
that the apparent order of execution is as the programmer desires.  In
PostgreSQL backend code, the pg_memory_barrier() macro may be used to
achieve this result.
In the example above, we can prevent the reader from seeing a garbage value
by having the writer do this:

    q->items[q->num_items] = new_item;
    pg_memory_barrier();
    ++q->num_items;

And by having the reader do this:

    num_items = q->num_items;
    pg_memory_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
the code in such a way as to allow the memory accesses to occur out of order
and (2) generate any code (often, inline assembly) that is needed to prevent
the CPU from executing the memory accesses out of order.  Specifically, the
barrier prevents loads and stores written after the barrier from being
performed before the barrier, and vice-versa.

Although this code will work, it is needlessly inefficient.  On systems with
strong memory ordering (such as x86), the CPU never reorders loads with
other loads, nor stores with other stores.  It can, however, allow a load to
be performed before a subsequent store.  To avoid emitting unnecessary
memory instructions, we provide two additional primitives: pg_read_barrier()
and pg_write_barrier().  When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier().  pg_memory_barrier() can always substitute
for either a read or a write barrier, but is typically more expensive, and
therefore should be used only when needed.
With these guidelines in mind, the writer can do this:

    q->items[q->num_items] = new_item;
    pg_write_barrier();
    ++q->num_items;

And the reader can do this:

    num_items = q->num_items;
    pg_read_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

On machines with strong memory ordering, these weaker barriers will simply
prevent compiler rearrangement, without emitting any actual machine code.
On machines with weak memory ordering, they will prevent compiler
reordering and also emit whatever hardware barrier may be required.  Even
on machines with weak memory ordering, a read or write barrier may be able
to use a less expensive instruction than a full barrier.

Weaknesses of Memory Barriers
=============================

While memory barriers are a powerful tool, and much cheaper than locks, they
are also much less capable than locks.  Here are some of the problems.

1. Concurrent writers are unsafe.  In the above example of a queue, using
memory barriers doesn't make it safe for two processes to add items to the
same queue at the same time.  If more than one process can write to the
queue, a spinlock or lwlock must be used to synchronize access.  The readers
can perhaps proceed without any lock, but the writers may not.

Even very simple write operations often require additional synchronization.
For example, it's not safe for multiple writers to simultaneously execute
this code (supposing x is a pointer into shared memory):

    x->foo++;

Although this may compile down to a single machine-language instruction,
the CPU will execute that instruction by reading the current value of foo,
adding one to it, and then storing the result back to the original address.
If two CPUs try to do this simultaneously, both may do their reads before
either one does their writes.
Such a case could be made safe by using an atomic variable and an atomic
add.  See port/atomics.h.

2. Eight-byte loads and stores aren't necessarily atomic.  We assume in
various places in the source code that an aligned four-byte load or store is
atomic, and that other processes therefore won't see a half-set value.
Sadly, the same can't be said for eight-byte values: on some platforms, an
aligned eight-byte load or store will generate two four-byte operations.  If
you need an atomic eight-byte read or write, you must either serialize
access with a lock or use an atomic variable.

3. No ordering guarantees.  While memory barriers ensure that any given
process performs loads and stores to shared memory in order, they don't
guarantee synchronization.  In the queue example above, we can use memory
barriers to be sure that readers won't see garbage, but there's nothing to
say whether a given reader will run before or after a given writer.  If this
matters in a given situation, some other mechanism must be used instead of
or in addition to memory barriers.

4. Barrier proliferation.  Many algorithms that at first seem appealing
require multiple barriers.  If the number of barriers required is more than
one or two, you may be better off just using a lock.  Keep in mind that, on
some platforms, a barrier may be implemented by acquiring and releasing a
backend-private spinlock.  This may be better than a centralized lock under
contention, but it may also be slower in the uncontended case.

Further Reading
===============

Much of the documentation about memory barriers appears to be quite
Linux-specific.  The following papers may be helpful:

Memory Ordering in Modern Microprocessors, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

Memory Barriers: a Hardware View for Software Hackers, by Paul E.
McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

The Linux kernel also has some useful documentation on this topic.  Start
with Documentation/memory-barriers.txt