Copyright 1996, 1999-2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                 12.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_divrem_1              44.0
        mpn_mod_1                 28.0
        mpn_divexact_by3          15.0

        mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75

        mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at 2 cycles/limb asymptotically.  In practice,
loop overhead and other delays (cache refill?) bring them to or near 2.5
cycles/limb.
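
The gap between an asymptotic and a measured figure can be modelled as a
fixed per-iteration overhead spread over an unrolled loop.  The sketch below
is only an illustration: the name cycles_per_limb, the unroll factor of 8
and the 3 cycle overhead are assumptions, not values read from the code in
this directory.

```c
#include <assert.h>

/* Average cycles per limb for a loop unrolled `unroll' times, where each
   limb costs `per_limb' cycles and each trip round the loop adds
   `overhead' cycles.  All three parameters are illustrative assumptions,
   not measurements. */
double
cycles_per_limb (unsigned per_limb, unsigned overhead, unsigned unroll)
{
  assert (unroll > 0);
  return (double) (unroll * per_limb + overhead) / unroll;
}
```

With those assumed values, cycles_per_limb (2, 3, 8) gives 19/8 = 2.375,
matching the mpn_add_n/sub_n entry in the table, and cycles_per_limb (5, 3,
8) gives the 43/8 = 5.375 quoted for the shifts.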

mpn_mul_1, mpn_addmul_1 and mpn_submul_1 all run 1 cycle faster than the
documentation predicts.  Intel documentation says a mul instruction takes
10 cycles, but it measures as 9, and the routines using it run as if it
took 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, which currently means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.
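
For reference, the operation each shldl performs in a left shift loop can be
written out in C.  This is only a sketch: ref_lshift is a hypothetical name,
32-bit limbs stored least significant first are assumed, and the code makes
no attempt to reflect the pairing or scheduling discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift the n-limb number {sp,n} left by cnt bits (1 <= cnt <= 31), store
   the result at {rp,n} and return the bits shifted out of the top limb.
   Hypothetical reference code, not the assembly in this directory. */
uint32_t
ref_lshift (uint32_t *rp, const uint32_t *sp, size_t n, unsigned cnt)
{
  assert (n >= 1 && cnt >= 1 && cnt <= 31);

  uint32_t high = sp[n - 1];
  uint32_t retval = high >> (32 - cnt);

  /* Work down from the most significant limb so that rp may equal sp. */
  for (size_t i = n - 1; i > 0; i--)
    {
      uint32_t low = sp[i - 1];
      /* One shldl: high shifted left, with the vacated low bits filled
         from the top of the next limb down. */
      rp[i] = (high << cnt) | (low >> (32 - cnt));
      high = low;
    }
  rp[0] = high << cnt;
  return retval;
}
```

The loop does one double-limb shift per limb, which is why the cost of
shldl dominates the cycles/limb figure for these routines.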




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to allocate the destination cache lines by reading a word from
    each of them in the loops, to achieve the best performance.
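
    A C sketch of the idea follows.  copy_touching_dst and LINE_LIMBS are
    names invented here, a 32-byte cache line and 4-byte limbs are assumed,
    and a compiler is free to schedule this quite differently from
    hand-written loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 32-byte cache lines holding eight 4-byte limbs. */
enum { LINE_LIMBS = 8 };

/* Copy n limbs from src to dst, reading one word from the destination
   every LINE_LIMBS limbs before storing there, so that the line is
   already in the cache when the stores arrive.  If dst is not line
   aligned, a final partial line may go untouched. */
void
copy_touching_dst (uint32_t *dst, const uint32_t *src, size_t n)
{
  assert (n == 0 || (dst != NULL && src != NULL));

  for (size_t i = 0; i < n; i++)
    {
      if (i % LINE_LIMBS == 0)
        {
          /* Dummy read; volatile stops the compiler discarding it. */
          (void) *(volatile uint32_t *) &dst[i];
        }
      dst[i] = src[i];
    }
}
```

    The read result is thrown away; only its side effect on the cache
    matters.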

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (i.e. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we make operations on different objects, they might
    or might not be to the same cache bank.
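
    As an illustration, the bank an address falls in can be computed as
    below.  The names cache_bank and banks_clash are made up for this
    sketch, and the split into 8 banks of 4 bytes each is an assumption;
    the text above only says the banks repeat every 32 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 8 banks of 4 bytes, repeating every 32 bytes. */
enum { BANK_BYTES = 4, NUM_BANKS = 8 };

/* Bank number (0 to 7) for an address under the assumed layout. */
unsigned
cache_bank (uintptr_t addr)
{
  return (unsigned) ((addr / BANK_BYTES) % NUM_BANKS);
}

/* Nonzero if accesses to a and b would fall in the same bank and so,
   under this model, could not pair. */
int
banks_clash (const void *a, const void *b)
{
  assert (a != NULL && b != NULL);
  return cache_bank ((uintptr_t) a) == cache_bank ((uintptr_t) b);
}
```

    Two adjacent limbs of one object (4 bytes apart) never share a bank
    under this model, while limbs 32 bytes apart always do.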

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip.  There's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.




REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: