Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions. All Athlons have
MMX, the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles. This suggests 3 is a limit on
the speed of the multiplication routines. The documentation shows mul
executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that,
to get near 3 cycles, code has to be arranged so that nothing else is
issued to IEU0. A busy IEU0 could explain why some code takes 4 cycles and
other apparently equivalent code takes 5.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead. The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling. An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.

Position independent code is implemented using a call to get %eip for the
computed jumps. A ret is always done, rather than an addl $4,%esp or a
popl, so that the CPU return address branch prediction stack stays
synchronised with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken. Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies. K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.
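
As an illustration of the computed jumps into unrolled loops described under
OPTIMIZATIONS, here is a minimal sketch in portable C rather than K7
assembler. It is not taken from the library: the names limb_t, ADD_LIMB and
sketch_add_n are hypothetical, the limb is assumed to be 32 bits, and the
unrolling is fixed at 4 for brevity. A switch that enters the middle of the
unrolled body stands in for the computed jump.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for a 32-bit limb */

/* One unrolled step: *rp = *ap + *bp + cy, leaving the carry-out in cy. */
#define ADD_LIMB()                                      \
  do {                                                  \
    limb_t a = *ap++;                                   \
    limb_t s = a + *bp++;          /* may wrap */       \
    limb_t c1 = (limb_t) (s < a);  /* carry from a+b */ \
    limb_t t = s + cy;             /* may wrap */       \
    cy = c1 | (limb_t) (t < s);    /* at most one set */\
    *rp++ = t;                                          \
  } while (0)

/* Adds the n-limb vectors ap and bp into rp and returns the carry-out.
   The switch jumps into the unrolled body so that the first, partial pass
   handles n % 4 limbs; every later pass handles 4.  */
limb_t
sketch_add_n (limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
  limb_t cy = 0;
  size_t passes;

  if (n == 0)
    return 0;

  passes = (n + 3) / 4;            /* partial pass included */
  switch (n % 4)
    {
    case 0: do { ADD_LIMB ();
    case 3:      ADD_LIMB ();
    case 2:      ADD_LIMB ();
    case 1:      ADD_LIMB ();
            } while (--passes > 0);
    }
  return cy;
}
```

For example, with n = 5 the switch enters at the last of the four unrolled
steps, so one limb is handled before a single full pass covers the remaining
four.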

When there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are
dependencies, since this maximizes decoding throughput and might save a
cycle or two if decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002. Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions. Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000. This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms. Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999. This has some notes on general Athlon optimizations as well as
3DNow. Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: