• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..16-Feb-2021-

README.mdH A D16-Feb-20215.7 KiB209150

crc32.sxH A D16-Feb-202114.8 KiB788653

crc32_constants.hH A D16-Feb-202129.9 KiB902350

crc32_wrapper.cH A D16-Feb-20212.5 KiB11079

ppc-opcode.hH A D16-Feb-2021916 2419

README.md

1crc32-vpmsum
2============
3
4A set of examples for accelerating CRC32 calculations using the vector
5polynomial multiply sum (vpmsum) instructions introduced in POWER8. These
6instructions implement byte, halfword, word and doubleword carryless
7multiply/add.
8
9Performance
10-----------
11
12An implementation of slice-by-8, one of the fastest lookup table methods
13is included so we can compare performance against it. Testing 5000000
14iterations of a CRC of 32 kB of data (to keep it L1 cache contained):
15
16```
17# time slice_by_8_bench 32768 5000000
18122.220 seconds
19
20# time crc32_bench 32768 5000000
212.937 seconds
22```
23
24The vpmsum accelerated CRC is just over 41x faster.
25
26This test was run on a 4.1 GHz POWER8, so the algorithm sustains about
2752 GiB/sec or 13.6 bytes/cycle. The theoretical limit is 16 bytes/cycle
28since we can execute a maximum of one vpmsum instruction per cycle.
29
30In another test, a version was added to the kernel and btrfs write
31performance was shown to be 3.8x faster. The test was done to a ramdisk
32to mitigate any I/O induced variability.
33
34Quick start
35-----------
36
37- Modify CRC and OPTIONS in the Makefile. There are examples for the two most
38  common crc32s.
39
40- Type make to create the constants (crc32_constants.h)
41
42- Import the code into your application (crc32.sx crc32_wrapper.c
43  crc32_constants.h ppc-opcode.h) and call the CRC:
44
45```
46unsigned int crc32_vpmsum(unsigned int crc, unsigned char *p, unsigned long len);
47```
48
49CRC background
50--------------
51
52For a good background on CRCs, check out:
53
54http://www.ross.net/crc/download/crc_v3.txt
55
56A few key points:
57
58- A CRC is the remainder after dividing a message by the CRC polynomial,
59  ie M mod CRC_POLY
60- multiply/divide is carryless
61- add/subtract is an xor
62- n (where n is the order of the CRC) bits of zeroes are appended to the
63  end of the message.
64
65One more important piece of information - a CRC is a linear function, so:
66
67```
68	CRC(A xor B) = CRC(A) xor CRC(B)
69
70	CRC(A . B) = CRC(A) . CRC(B)	(remember this is carryless multiply)
71```
72
73If we take 64bits of data, represented by two 32 bit chunks (AAAAAAAA
74and BBBBBBBB):
75
76```
77CRC(AAAAAAAABBBBBBBB)
78	= CRC(AAAAAAAA00000000 xor BBBBBBBB)
79	= CRC(AAAAAAAA00000000) xor CRC(BBBBBBBB)
80```
81
82If we operate on AAAAAAAA:
83
84```
85CRC(AAAAAAAA00000000)
86	= CRC(AAAAAAAA . 100000000)
87	= CRC(AAAAAAAA) . CRC(100000000)
88```
89
90And CRC(100000000) is a constant which we can pre-calculate:
91
92```
93CRC(100000000)
94	= 100000000 mod CRC_POLY
95	= 2^32 mod CRC_POLY
96```
97
98Finally we can add our modified AAAAAAAA to BBBBBBBB:
99
100```
101CRC(AAAAAAAABBBBBBBB)
102	= ((2^32 mod CRC_POLY) . CRC(AAAAAAAA)) xor CRC(BBBBBBBB)
103```
104
105In other words, with the right constants pre-calculated we can shift the
106input data around and we can also calculate the CRC in as many parallel
107chunks as we want.
108
109No matter how much shifting we do, the final result will be be 64 bits of
110data (63 actually, because there is no carry into the top bit). To reduce
111it further we need a another trick, and that is Barrett reduction:
112
113http://en.wikipedia.org/wiki/Barrett_reduction
114
115Barrett reduction is a method of calculating a mod n. The idea is to
116calculate q, the multiple of our polynomial that we need to subtract. By
117doing the computation 2x bits higher (ie 64 bits) and shifting the
118result back down 2x bits, we round down to the nearest multiple.
119
120```
121	k = 32
122	m = floor((4^k)/n) = floor((4^32))/n)
123	n = 64 bits of data
124	a = 32 bit CRC
125
126	q = floor(ma/(2^64))
127	result = a - qn
128```
129
130An example in the floating point domain makes it clearer how this works:
131
132```
133a mod n = a - floor(am) * n
134```
135
136Let's use it to calculate 22 mod 10:
137
138```
139	a = 22
140	n = 10
141	m = 1/n = 1/10 = 0.1
142
14322 mod 10
144	= 22 - floor(22*0.1) * 10
145	= 22 - 2 * 10
146	= 22 - 20
147	= 2
148```
149
150There is one more issue left - bit reflection. Some CRCs are defined to
151operate on the least significant bit first (eg CRC32c). Lets look at
152how this would get laid out in a register, and lets simplify it to just
153two bytes (vs a 16 byte VMX register):
154
155    [ 8..15 ] [ 0..7 ]
156
157Notice how the bits and bytes are out of order. Since we are doing
158multi word multiplication on these values we need them to both be
159in order.
160
161The simplest way to fix this is to reflect the bits in each byte:
162
163    [ 15..8 ] [ 7..0 ]
164
165However shuffling bits in a byte is expensive on most CPUs. It is
166however relatively cheap to shuffle bytes around. What if we load
167the bytes in reversed:
168
169    [ 0..7 ] [ 8..15 ]
170
171Now the bits and bytes are in order, except the least significant bit
172of the register is now on the left and the most significant bit is on the
173right. We operate as if the register is reflected, which normally we
174cannot do. The reason we get away with this is our multiplies are carryless
175and our addition and subtraction is xor, so our operations never create
176carries.
177
178The only trick is we have to shift the result of multiplies left one
179because the high bit of the multiply is always 0, and we want that high bit
180on the right not the left.
181
182Implementation
183--------------
184
185The vpmsum instructions on POWER8 have a 6 cycle latency and we can
186execute one every cycle. In light of this the main loop has 8 parallel
187streams which consume 8 x 16 B each iteration. At the completion of this
188loop we have taken 32 kB of data and reduced it to 8 x 16 B (128 B).
189
190The next step is to take this 128 B and reduce it to 8 B. At this stage
191we also add 32 bits of 0 to the end.
192
193We then apply Barrett reduction to get our CRC.
194
195Examples
196--------
197- barrett_reduction: An example of Barrett reduction
198
199- final_fold: Starting with 128 bits, add 32 bits of zeros and reduce it to
200  64 bits, then apply Barrett reduction
201
202- final_fold2: A second method of reduction
203
204Acknowledgements
205----------------
206
207Thanks to Michael Gschwind, Jeff Derby, Lorena Pesantez and Stewart Smith
208for their ideas and assistance.
209