
Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




			AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6; the k62mmx subdirectory
has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX;
the separate directories exist just so that ./configure can omit them if the
assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

	mpn_add_n/sub_n            3.25 normal, 2.75 in-place

	mpn_mul_1                  6.25
	mpn_add/submul_1           7.65-8.4  (varying with data values)

	mpn_mul_basecase           9.25 cycles/crossproduct (approx)
	mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

	mpn_l/rshift               3.0

	mpn_divrem_1              20.0
	mpn_mod_1                 20.0
	mpn_divexact_by3          11.0

	mpn_copyi                  1.0
	mpn_copyd                  1.0


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

	mpn_l/rshift               1.75


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

The write-allocate L1 data cache means prefetching of destinations is
unnecessary.  The store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.
Sometimes computed jumps into the unrolling are used to handle sizes that
aren't a multiple of the unrolling.  An attractive feature of this is that
times increase smoothly with operand size, but an indirect jump costs about
6 cycles and the setups about another 6, so whether a computed jump is
worthwhile depends on how much faster the unrolled code is than a simple
loop.

Position independent code is implemented using a call to get eip for
computed jumps, and a matching ret is always done rather than an addl
$4,%esp or a popl, so the CPU's return address branch prediction stack
stays synchronised with the actual stack in memory.  Such a call still
costs 4 to 7 cycles, however.

Branch prediction, in the absence of any history, will guess that forward
jumps are not taken and backward jumps are taken.  Where possible it's
arranged that the less likely or less important case is under a taken
forward jump.



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done, because in GMP programs it's expected that x87
floating point won't be much used, so chances are an mpn routine won't have
been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short-decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

K6 does some out-of-order execution, so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor, things can be scheduled that might not execute until later.



Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.

Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem and can be used as an equivalent, or it's easier just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length, then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100, since
  with mod=00 the sib byte determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp), which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero, so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67, and 3DNow Porting Guide rev B pages
  13-14 and 36-37.

Calls

- indirect jumps and calls are not branch predicted; they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl %eax,reg  1.5 cycles, back-to-back (strange)
        reg,reg   2 cycles, back-to-back



REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End: