1Partial Reductions for multiplications
2--------------------------------------
3
4It is possible to get away with partial reductions for multiplications
5instead of fully reducing everything. The largest input to square/mult
6will come from an unreduced add, which will double the element values.
7Test values are 1 bit larger than actual maximum values.
8
9    max27 = (1 << 27) - 1
10    max26 = (1 << 26) - 1
11
12Largest values from an add of full bit values (max27,max26,max27,max26..)
13
14    m0 0x1f1fffea8000042c
15    m1 0x133ffff190000268
16    m2 0x185fffef00000354
17    m3 0x0ebffff4f00001d8
18    m4 0x119ffff38000027c
19    m5 0x0a3ffff850000148
20    m6 0x0adffff8000001a4
21    m7 0x05bffffbb00000b8
22    m8 0x041ffffc800000cc
23    m9 0x013fffff10000028
24
25Carry values from reducing sums
26
27    c0 0x00000007c7fffaa0
28    c1 0x000000099ffffcab
29    c2 0x0000000617fffe27
30    c3 0x000000075ffffd83
31    c4 0x0000000467fffeb7
32    c5 0x000000051ffffe5b
33    c6 0x00000002b7ffff47
34    c7 0x00000002dfffff33
35    c8 0x0000000107ffffd7
36    c9         0xa000000b
37    c0         0x000002f8
38
39
40The largest carried value r1 could receive is 0x2f8, with everything else
41fitting in 25 or 26 bits. Assuming full values for everything, with 0x2f8
42added to r1 (max27,maxr1,max27,max26..):
43
44    max27 = (1 << 27) - 1
45    max26 = (1 << 26) - 1
46    maxr1 = (((1 << 25) - 1) + 0x2f8) * 2
47
48    m0 0x1f2006f77ffc7dac
49    m1 0x134000508fffeaa8
50    m2 0x1860004e004655d4
51    m3 0x0ec00053efffea18
52    m4 0x11a000527fffd2fc
53    m5 0x0a4000574fffe988
54    m6 0x0ae00056ffffd224
55    m7 0x05c0005aafffe8f8
56    m8 0x0420005b7fffd14c
57    m9 0x0140005e0fffe868
58
59Carry values
60
61    c0 0x00000007c801bddf
62    c1 0x00000009a0002c2c
63    c2 0x00000006180015e8
64    c3 0x0000000760002d04
65    c4 0x0000000468001678
66    c5 0x0000000520002ddc
67    c6 0x00000002b8001708
68    c7 0x00000002e0002eb4
69    c8 0x0000000108001798
70    c9         0xa0002f8c
71    c0         0x000002f9
72
73The largest carried value is now 0x2f9 (max27,maxr1b,max27,max26..)
74
75    max27 = (1 << 27) - 1
76    max26 = (1 << 26) - 1
77    maxr1b = (((1 << 25) - 1) + 0x2f9) * 2
78
79    m0 0x1f2006f9dffc7c7c
80    m1 0x13400050afffeaa0
81    m2 0x1860004e2046854c
82    m3 0x0ec000540fffea10
83    m4 0x11a000529fffd2ec
84    m5 0x0a4000576fffe980
85    m6 0x0ae000571fffd214
86    m7 0x05c0005acfffe8f0
87    m8 0x0420005b9fffd13c
88    m9 0x0140005e2fffe860
89
90Carry values
91
92    c0 0x00000007c801be77
93    c1 0x00000009a0002c3c
94    c2 0x00000006180015f0
95    c3 0x0000000760002d14
96    c4 0x0000000468001680
97    c5 0x0000000520002dec
98    c6 0x00000002b8001710
99    c7 0x00000002e0002ec4
100    c8 0x00000001080017a0
101    c9         0xa0002f9c
102    c0         0x000002f9
103
104The largest carried value is fixed at 0x2f9. Subtracting the largest values
105from 0 will result in r0 exceeding 26 bits, but r0-r4 are safe for
106multiplications up to 30 bits, so partial reductions throughout the entire
107calculation should be safe to chain. This especially helps with speeding up
108the SSE2 version by freeing it from large serial carry chains. Testing of
109course continues, but no problems as of yet have shown up.
110
111
112Subtraction
113-----------
114Subtraction with unsigned elements is done using Emilia Kasper's trick, via
115agl: http://www.imperialviolet.org/2010/12/04/ecc.html
116
117Adding a large enough value that is equivalent to 0 mod p before subracting
118ensures no elements underflow.
119
120Compiler
121--------
122gcc (as of 4.4.5) has a difficult time optimizing the 32 bit C version properly.
123icc produces code that is roughly 40% faster.