Memory Barriers
===============

Modern CPUs make extensive use of pipelining and out-of-order execution,
meaning that the CPU is often executing more than one instruction at a
time, and not necessarily in the order that the source code would suggest.
Furthermore, even before the CPU gets a chance to reorder operations, the
compiler may (and often does) reorganize the code for greater efficiency,
particularly at higher optimization levels.  Optimizing compilers and
out-of-order execution are both critical for good performance, but they
can lead to surprising results when multiple processes access the same
memory space.

Example
=======

Suppose x is a pointer to a structure stored in shared memory, and that the
entire structure has been initialized to zero bytes.  One backend executes
the following code fragment:

    x->foo = 1;
    x->bar = 1;

Meanwhile, at approximately the same time, another backend executes this
code fragment:

    bar = x->bar;
    foo = x->foo;

The second backend might end up with foo = 1 and bar = 1 (if it executes
both statements after the first backend), or with foo = 0 and bar = 0 (if
it executes both statements before the first backend), or with foo = 1 and
bar = 0 (if the first backend executes the first statement, the second
backend executes both statements, and then the first backend executes the
second statement).

Surprisingly, however, the second backend could also end up with foo = 0
and bar = 1.  The compiler might swap the order of the two stores performed
by the first backend, or the two loads performed by the second backend.
Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
or Itanium) the CPU might choose to execute either the loads or the stores
out of order.  This surprising result can lead to bugs.

A common pattern where this actually does result in a bug is when adding items
onto a queue.  The writer does this:

    q->items[q->num_items] = new_item;
    ++q->num_items;

The reader does this:

    num_items = q->num_items;
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

This code turns out to be unsafe, because the writer might increment
q->num_items before it finishes storing the new item into the appropriate slot.
More subtly, the reader might prefetch the contents of the q->items array
before reading q->num_items.  Thus, there's still a bug here *even if the
writer does everything in the order we expect*.  We need the writer to update
the array before bumping the item counter, and the reader to examine the item
counter before examining the array.

Note that these types of highly counterintuitive bugs can *only* occur when
multiple processes are interacting with the same memory segment.  A given
process always perceives its *own* writes to memory in program order.

Avoiding Memory Ordering Bugs
=============================

The simplest (and often best) way to avoid memory ordering bugs is to
protect the data structures involved with an lwlock.  For more details, see
src/backend/storage/lmgr/README.  For instance, in the above example, the
writer could acquire an lwlock in exclusive mode before appending to the
queue, and each reader could acquire the same lock in shared mode before
reading it.  If the data structure is not heavily trafficked, this solution is
generally entirely adequate.

However, in some cases, it is desirable to avoid the overhead of acquiring
and releasing locks.  In this case, memory barriers may be used to ensure
that the apparent order of execution is as the programmer desires.  In
PostgreSQL backend code, the pg_memory_barrier() macro may be used to achieve
this result.  In the example above, we can prevent the reader from seeing a
garbage value by having the writer do this:

    q->items[q->num_items] = new_item;
    pg_memory_barrier();
    ++q->num_items;

And by having the reader do this:

    num_items = q->num_items;
    pg_memory_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
the code in such a way as to allow the memory accesses to occur out of order
and (2) generate any code (often, inline assembly) that is needed to prevent
the CPU from executing the memory accesses out of order.  Specifically, the
barrier prevents loads and stores written after the barrier from being
performed before the barrier, and vice-versa.

Although this code will work, it is needlessly inefficient.  On systems with
strong memory ordering (such as x86), the CPU never reorders loads with other
loads, nor stores with other stores.  It can, however, allow a load to be
performed before a subsequent store.  To avoid emitting unnecessary memory
instructions, we provide two additional primitives: pg_read_barrier() and
pg_write_barrier().  When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it separates a load and a store (in either
order), use pg_memory_barrier().  pg_memory_barrier() can always substitute
for either a read or a write barrier, but is typically more expensive, and
therefore should be used only when needed.

With these guidelines in mind, the writer can do this:

    q->items[q->num_items] = new_item;
    pg_write_barrier();
    ++q->num_items;

And the reader can do this:

    num_items = q->num_items;
    pg_read_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

On machines with strong memory ordering, these weaker barriers will simply
prevent compiler rearrangement, without emitting any actual machine code.
On machines with weak memory ordering, they will prevent compiler
reordering and also emit whatever hardware barrier may be required.  Even
on machines with weak memory ordering, a read or write barrier may be able
to use a less expensive instruction than a full barrier.

Weaknesses of Memory Barriers
=============================

While memory barriers are a powerful tool, and much cheaper than locks, they
are also much less capable than locks.  Here are some of the problems.

1. Concurrent writers are unsafe.  In the above example of a queue, using
memory barriers doesn't make it safe for two processes to add items to the
same queue at the same time.  If more than one process can write to the queue,
a spinlock or lwlock must be used to synchronize access.  The readers can
perhaps proceed without any lock, but the writers may not.

Even very simple write operations often require additional synchronization.
For example, it's not safe for multiple writers to simultaneously execute
this code (supposing x is a pointer into shared memory):

    x->foo++;

Although this may compile down to a single machine-language instruction,
the CPU will execute that instruction by reading the current value of foo,
adding one to it, and then storing the result back to the original address.
If two CPUs try to do this simultaneously, both may do their reads before
either one does their writes.  Such a case could be made safe by using an
atomic variable and an atomic add.  See port/atomics.h.

2. Eight-byte loads and stores aren't necessarily atomic.  We assume in
various places in the source code that an aligned four-byte load or store is
atomic, and that other processes therefore won't see a half-set value.
Sadly, the same can't be said for eight-byte values: on some platforms, an
aligned eight-byte load or store will generate two four-byte operations.  If
you need an atomic eight-byte read or write, you must either serialize access
with a lock or use an atomic variable.

3. No ordering guarantees.  While memory barriers ensure that any given
process performs loads and stores to shared memory in order, they don't
guarantee synchronization.  In the queue example above, we can use memory
barriers to be sure that readers won't see garbage, but there's nothing to
say whether a given reader will run before or after a given writer.  If this
matters in a given situation, some other mechanism must be used instead of
or in addition to memory barriers.

4. Barrier proliferation.  Many algorithms that at first seem appealing
require multiple barriers.  If the number of barriers required is more than
one or two, you may be better off just using a lock.  Keep in mind that, on
some platforms, a barrier may be implemented by acquiring and releasing a
backend-private spinlock.  This may be better than a centralized lock under
contention, but it may also be slower in the uncontended case.

Further Reading
===============

Much of the documentation about memory barriers appears to be quite
Linux-specific.  The following papers may be helpful:

Memory Ordering in Modern Microprocessors, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

The Linux kernel also has some useful documentation on this topic.  Start
with Documentation/memory-barriers.txt.