Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory exists only so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.
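
The two mpn_sqr_basecase figures in the table describe the same loop counted
two ways.  A rough sketch of the conversion in C follows.  It assumes (this
is not stated above) that the triangle product count for an n-limb square is
n*(n+1)/2, meaning the diagonal plus one half of the off-diagonal products,
against n*n crossproducts.  The function names are hypothetical and are not
part of GMP.

```c
#include <assert.h>

/* Crossproducts in an n-limb square: every limb times every limb. */
static unsigned long crossproducts(unsigned long n)
{
    return n * n;
}

/* ASSUMPTION: triangle products are the diagonal plus one half of the
   off-diagonal products, n*(n+1)/2 in total. */
static unsigned long triangleproducts(unsigned long n)
{
    return n * (n + 1) / 2;
}

/* Convert a cycles/crossproduct figure into cycles/triangleproduct for
   an n-limb operand, under the assumption above. */
static double cycles_per_triangleproduct(double cycles_per_crossproduct,
                                         unsigned long n)
{
    assert(n > 0);
    return cycles_per_crossproduct * (double) crossproducts(n)
           / (double) triangleproducts(n);
}
```

Under that assumption, 2.3 cycles/crossproduct at n = 100 comes out at
roughly 4.55 cycles/triangleproduct, which is consistent with the table.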



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles per
multiply is a limit on the speed of the multiplication routines.  The
documentation shows mul executing in IEU0 (or maybe in IEU0 and IEU1
together), so it might be that, to get near 3 cycles, code has to be
arranged so that nothing else is issued to IEU0.  A busy IEU0 could explain
why some code takes 4 cycles and other apparently equivalent code takes 5.
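
The reason the mul issue rate bounds these routines is that each limb
processed needs exactly one unsigned multiply.  The following is a portable
C reference for the operation mpn_addmul_1 performs, given only as a sketch
of the arithmetic and not as the K7 assembler code.  The names limb_t and
ref_addmul_1 are hypothetical, and a 32-bit limb is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limb assumed */

/* Compute {rp,n} += {up,n} * v and return the carry-out limb.
   One 32x32->64 multiply is done per limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this cannot overflow. */
        uint64_t t = (uint64_t) up[i] * v + rp[i] + carry;
        rp[i] = (limb_t) t;            /* low word goes to the result */
        carry = (limb_t) (t >> 32);    /* high word carries to next limb */
    }
    return carry;
}
```

With one multiply per limb and one multiply issued every 3 cycles, a loop of
this shape cannot go below 3 cycles/limb, whatever else is overlapped.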



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes that are not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but it may be that some routines should
just have simple loops to finish up, especially when PIC adds between 2 and
16 cycles to get %eip.
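
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch whose cases fall inside the loop body (Duff's device).  This
is only an illustration: the assembler routines compute a code address
rather than using a switch, and the 4 limbs/loop unrolling and the names
limb_t and copy_unrolled4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limb assumed */

/* Copy n limbs from src to dst with a 4 limbs/loop unrolling.  The
   switch enters the unrolled body part way through on the first pass,
   so sizes that are not a multiple of 4 need no separate clean-up loop. */
static void copy_unrolled4(limb_t *dst, const limb_t *src, size_t n)
{
    size_t passes;

    if (n == 0)
        return;
    assert(dst != NULL && src != NULL);

    passes = (n + 3) / 4;   /* loop passes, including the partial one */
    switch (n % 4) {        /* entry point computed from the size */
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

For n = 5 the first pass enters at "case 1" and copies one limb, then one
full pass copies the remaining four, so each extra limb adds the same small
amount of work.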

Position independent code is implemented using a call to get %eip for the
computed jumps.  A ret is always done to return from that call, rather than
an addl $4,%esp or a popl, so the CPU's return address branch prediction
stack stays synchronised with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code are shown grouped if they can execute together,
which means up to three direct-path instructions which have no successive
dependencies.  The K7 always decodes three and has out-of-order execution,
but the groupings show what slots might be available and what dependency
chains exist.

When there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: