Copyright 1996, 1999-2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/.




                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors. The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                 12.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_divrem_1              44.0
        mpn_mod_1                 28.0
        mpn_divexact_by3          15.0

        mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75

        mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.
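
For orientation, the work being counted per limb in mpn_add_n is one
add-with-carry step. The following is a minimal C sketch of that operation,
assuming 32-bit limbs. The names limb_t and ref_add_n are hypothetical and
chosen for illustration only; this is not the assembly in this directory
and not the library's declaration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type: one 32-bit word, as on P5. */
typedef uint32_t limb_t;

/* Add the n-limb numbers {ap,n} and {bp,n}, least significant limb first,
   store the low n limbs of the sum at rp and return the carry out (0 or 1).
   Illustrative reference only, not the library's implementation. */
static limb_t ref_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp,
                        size_t n)
{
    limb_t carry = 0;

    assert(n == 0 || (rp != NULL && ap != NULL && bp != NULL));

    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t s = a + bp[i];    /* wraps modulo 2^32 */
        limb_t c1 = s < a;       /* carry out of a + b */
        limb_t r = s + carry;    /* add the incoming carry */
        limb_t c2 = r < s;       /* carry out of that second add */
        rp[i] = r;
        carry = c1 | c2;         /* both cannot be set at once */
    }
    return carry;
}
```

Each loop iteration here is one limb: two loads, one store and a carry
passed on to the next iteration.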

mpn_mul_1, mpn_addmul_1 and mpn_submul_1 all run 1 cycle faster than they
should. Intel documentation says a mul instruction takes 10 cycles, but it
measures 9 and the routines using it run as if it takes 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance). To avoid that cost the two aren't
mixed, and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1. If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically. The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two. For example, back to back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc. A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

   Pentium doesn't allocate cache lines on writes, unlike most other modern
   processors.
   Since the functions in the mpn class do array writes, we have to
   handle allocating the destination cache lines by reading a word from
   each of them in the loops, to achieve the best performance.

Prefetching Sources

   Prefetching of sources is pointless since there are no out-of-order
   loads. Any load instruction blocks until the line is brought into L1,
   so it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

   Pairing of memory operations requires that the two issued operations
   refer to different cache banks (i.e. different addresses modulo 32
   bytes). The simplest way to ensure this is to read/write two words from
   the same object. If we operate on different objects, the accesses might
   or might not fall in the same cache bank.

PIC %eip Fetching

   A simple call $+5 and popl can be used to get %eip; there's no need to
   balance calls and returns since P5 doesn't have any return stack branch
   prediction.

Float Multiplies

   fmul is pairable and can be issued every 2 cycles (with a 4 cycle
   latency for data ready to use). This is a lot better than integer mull
   or imull at 9 cycles non-pairing. Unfortunately the advantage is
   quickly eaten away by needing to throw data through memory back to the
   integer registers, to adjust for fild and fist being signed, and to do
   things like propagating carry bits.





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816. This
is mostly about P5, the parts about P6 aren't relevant. Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: