1## esl_alloc : portable aligned memory allocation
2
3The `alloc` module provides for portable aligned allocation. This
4is generally only needed for SIMD vector code.
5
6### rationale
7
8Yes, the C99 standard states that malloc() is _suitably aligned so
9that it may be assigned to a pointer to any type of object_. But SIMD
10vector types are not part of the C99 standard, vector types may be
11wider than any C99 object type, and vector memory should be aligned.
12
13Most, if not all systems we work on will provide posix_memalign().
14But I don't trust it to be there; plus we have a policy of making it
15as easy as possible to port to non-POSIX platforms when we can.
16
17We use POSIX's posix_memalign(), C11's aligned_alloc, or Intel's
18_mm_malloc() (in that preference order), if available on the system.
19Easel configure.ac tests for them, and sets HAVE_POSIX_MEMALIGN,
20HAVE_ALIGNED_ALLOC, and/or HAVE__MM_MALLOC as appropriate. If none are
21available, we fall back to a handrolled implementation.
22
23### the quasi-portable fallback implementation
24
25The fallback implementation does unspeakable things, things that are
26technically undefined-behaviour in C99, but which happen to work
27everywhere I know of. Specifically, given a pointer to a malloc()
28allocation, we cast that pointer to an integer (of type `uintptr_t`),
29mask off its low-order bits to achieve alignment, and store those
30low-order bits in a byte preceding our allocation:
31
32```
33  pp = pointer that malloc() gives us
34  v
35  ...X[------------------------]
36     ^^
37     | \
38     \  p = aligned pointer we give to the caller
39      \
40        one byte storing r-1, where r=(p-pp), total alignment shift
41        r is 1..256, so we can store r-1 in an unsigned byte.
42```
43
44When we free, we use the alignment shift to reconstruct what p was, so
45we can call free() with the pointer that malloc() originally gave us.
46
47#### alignment is limited to <= 256 bytes
48
49We only use one byte for the shift r, because we don't anticipate
50needing to align on more than a 256 byte boundary. Currently the
51largest vectors are AVX-512's 64-byte vectors, and Intel is
52projecting an AVX-1024 with 128-byte vectors. We can revisit if
53needed.
54
55#### there is an overallocation cost
56
57In the best case, malloc() gives us an allocation that's off by
58exactly one byte from a properly aligned location; r=1 and we store
590 in the byte. In the worst case, malloc() gives us a properly
60aligned allocation, in which case our extra byte looks pretty
61stupid, r=V, and we store V-1 in the byte.
62
63Because the worst case behavior means we overallocate by V bytes,
64for a pointer that was already properly aligned, the fallback
65implementation is potentially wasteful, and to minimize the
66wastage, you should minimize allocation calls where possible. For
67example, it'd be better to do 2D arrays by setting pointers into a
68single allocation, for example.
69
70It would be desirable to know when the system malloc() already
71returns a suitably aligned pointer, and we could then just call
72malloc() directly - but I don't know a reliable way to test for that.
73
74#### may cause unnecessary unit test failure
75
76Currently, the unit tests deliberately compile and test the fallback
77implementation, even if `esl_alloc_aligned()` is using a system call
78like `posix_memalign()`. Thus it may happen that `esl_alloc_aligned()`
79is working fine, but the unit test fails because `esl_alloc_aligned()`
80doesn't work on some system (perhaps because of the unspeakable things
81it does).
82
83#### there is no realloc, by design
84
85Aligned realloc() is a problem in general. There's no POSIX aligned
86realloc counterpart for posix_memalign(), nor for C11 aligned_alloc(),
87not for Intel _mm_malloc().
88
89If we try to write our own realloc, we have a problem that the
90reallocated unaligned pointer could formally have a different offset
91$r$, so the system realloc() is not guaranteed to move our data
92correctly. To be sure, we would have to copy our data *again* in the
93correct alignment, and we would need to know the size of the data, not
94just the pointer to it.
95
96Instead, at least for now, we will avoid reallocating aligned memory
97altogether; instead we will free() and do a fresh allocation.  Thus we
98can only do `_Reinit()` style functions that do not guarantee
99preservation of data, not `_Resize()`, which assume that the data will
100be preserved.
101
102
103-----------------------------------------------------------------
104### benchmarking
105
106Real time for -L 100, -N 10000: $10^6$ reallocations, so you can think
107of these as $u$sec per reallocation.
108
109**on Mac OS/X:** timings are essentially the same w/ gcc vs. clang:
110_[11 Feb 17 on wumpus. 2.5Ghz Core i7, Mac OS/X 10.10.5 Yosemite, gcc 4.9.3, gcc -O3]_
111
112|                        | M=5000 | M=500000   | M=5000000  |
113|------------------------|--------|------------|------------|
114| malloc/realloc         | 0.159  | **10.480** |  **5.009** |
115| malloc/free/malloc     | 0.136  |    0.482   |    0.897   |
116| alloc_aligned_fallback | 0.139  |    0.641   | **26.394** |
117| posix_memalign         | 0.189  |    0.481   |    0.908   |
118
119
120
121**on Linux:**
122_[11 Feb 17 on ody eddyfs01. icc -O3]_
123
124|                        | M=5000 | M=500000   | M=5000000  |
125|------------------------|--------|------------|------------|
126| malloc/realloc         | 0.115  |  **0.662** | **1.094**  |
127| malloc/free/malloc     | 0.100  |    0.252   |   1.868    |
128| alloc_aligned_fallback | 0.106  |    0.249   |   1.877    |
129| posix_memalign         | 0.206  |    0.366   |   1.944    |
130
131
132#### dependence on allocation size isn't obvious
133
134Timings go up and down as max allocation size M changes. Maybe what's
135happening is that the system is treating different sizes with
136different strategies.
137
138#### realloc copies data, so it can be slow
139
140In general, if you don't need data to be preserved, allocating fresh
141memory (with free()/malloc()) may be faster than realloc(), because
142realloc() copies data if it has to move the allocation.  However, note
143one example on Linux where realloc() is faster - perhaps because it's
144smart enough to recognize cases where it doesn't need to expand an
145allocation.
146
147#### easel's aligned alloc can be slow on OS/X
148
149I ran the -M5000000 case under Instruments. It is spending all its
150time in free(), in madvise(). Not sure why.
151
152#### conclusion
153
154* posix_memalign() is usually available and performs well.
155* we'll design HMMER vector code to `_Reinit()` with fresh
156  allocations, rather than using reallocation. This may even
157  speed things up a small bit.
158* the `madvise()` stall with the easel fallback code is puzzling
159  and worrying, though it only happens on MacOS, not Linux.
160
161
162
163