1Partial Reductions for multiplications 2-------------------------------------- 3 4It is possible to get away with partial reductions for multiplications 5instead of fully reducing everything. The largest input to square/mult 6will come from an unreduced add, which will double the element values. 7Test values are 1 bit larger than actual maximum values. 8 9 max27 = (1 << 27) - 1 10 max26 = (1 << 26) - 1 11 12Largest values from an add of full bit values (max27,max26,max27,max26..) 13 14 m0 0x1f1fffea8000042c 15 m1 0x133ffff190000268 16 m2 0x185fffef00000354 17 m3 0x0ebffff4f00001d8 18 m4 0x119ffff38000027c 19 m5 0x0a3ffff850000148 20 m6 0x0adffff8000001a4 21 m7 0x05bffffbb00000b8 22 m8 0x041ffffc800000cc 23 m9 0x013fffff10000028 24 25Carry values from reducing sums 26 27 c0 0x00000007c7fffaa0 28 c1 0x000000099ffffcab 29 c2 0x0000000617fffe27 30 c3 0x000000075ffffd83 31 c4 0x0000000467fffeb7 32 c5 0x000000051ffffe5b 33 c6 0x00000002b7ffff47 34 c7 0x00000002dfffff33 35 c8 0x0000000107ffffd7 36 c9 0xa000000b 37 c0 0x000002f8 38 39 40The largest carried value r1 could receive is 0x2f8, with everything else 41fitting in 25 or 26 bits. Assuming full values for everything, with 0x2f8 42added to r1 (max27,maxr1,max27,max26..): 43 44 max27 = (1 << 27) - 1 45 max26 = (1 << 26) - 1 46 maxr1 = (((1 << 25) - 1) + 0x2f8) * 2 47 48 m0 0x1f2006f77ffc7dac 49 m1 0x134000508fffeaa8 50 m2 0x1860004e004655d4 51 m3 0x0ec00053efffea18 52 m4 0x11a000527fffd2fc 53 m5 0x0a4000574fffe988 54 m6 0x0ae00056ffffd224 55 m7 0x05c0005aafffe8f8 56 m8 0x0420005b7fffd14c 57 m9 0x0140005e0fffe868 58 59Carry values 60 61 c0 0x00000007c801bddf 62 c1 0x00000009a0002c2c 63 c2 0x00000006180015e8 64 c3 0x0000000760002d04 65 c4 0x0000000468001678 66 c5 0x0000000520002ddc 67 c6 0x00000002b8001708 68 c7 0x00000002e0002eb4 69 c8 0x0000000108001798 70 c9 0xa0002f8c 71 c0 0x000002f9 72 73The largest carried value is now 0x2f9 (max27,maxr1b,max27,max26..) 74 75 max27 = (1 << 27) - 1 76 max26 = (1 << 26) - 1 77 maxr1b = (((1 << 25) - 1) + 0x2f9) * 2 78 79 m0 0x1f2006f9dffc7c7c 80 m1 0x13400050afffeaa0 81 m2 0x1860004e2046854c 82 m3 0x0ec000540fffea10 83 m4 0x11a000529fffd2ec 84 m5 0x0a4000576fffe980 85 m6 0x0ae000571fffd214 86 m7 0x05c0005acfffe8f0 87 m8 0x0420005b9fffd13c 88 m9 0x0140005e2fffe860 89 90Carry values 91 92 c0 0x00000007c801be77 93 c1 0x00000009a0002c3c 94 c2 0x00000006180015f0 95 c3 0x0000000760002d14 96 c4 0x0000000468001680 97 c5 0x0000000520002dec 98 c6 0x00000002b8001710 99 c7 0x00000002e0002ec4 100 c8 0x00000001080017a0 101 c9 0xa0002f9c 102 c0 0x000002f9 103 104The largest carried value is fixed at 0x2f9. Subtracting the largest values 105from 0 will result in r0 exceeding 26 bits, but r0-r4 are safe for 106multiplications up to 30 bits, so partial reductions throughout the entire 107calculation should be safe to chain. This especially helps with speeding up 108the SSE2 version by freeing it from large serial carry chains. Testing of 109course continues, but no problems as of yet have shown up. 110 111 112Subtraction 113----------- 114Subtraction with unsigned elements is done using Emilia Kasper's trick, via 115agl: http://www.imperialviolet.org/2010/12/04/ecc.html 116 117Adding a large enough value that is equivalent to 0 mod p before subracting 118ensures no elements underflow. 119 120Compiler 121-------- 122gcc (as of 4.4.5) has a difficult time optimizing the 32 bit C version properly. 123icc produces code that is roughly 40% faster.