Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX, the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3  cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.
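
As a rough aid to reading the STATUS figures, the sketch below (Python, not
part of GMP) turns a few of them into loop-only cycle estimates.  The
function names are hypothetical, and the model is an assumption: cost is
taken to be simply the quoted rate times the number of limbs or
crossproducts, with everything in L1 and no call, setup or PIC overhead.

```python
# Loop-only cycle estimates from the STATUS table above.
# Assumption: cost = rate * count, ignoring call/setup overhead and cache
# misses.  All names here are illustrative, not GMP interfaces.

CYCLES_PER_LIMB = {
    "mpn_add_n": 1.6,
    "mpn_lshift": 1.2,
    "mpn_mul_1": 3.4,
    "mpn_addmul_1": 3.9,
}

MUL_BASECASE_PER_CROSSPRODUCT = 4.42  # approximate, per the table
SQR_BASECASE_PER_CROSSPRODUCT = 2.3   # approximate, per the table


def linear_cycles(routine, limbs):
    """Estimate for a routine whose cost is quoted in cycles/limb."""
    return CYCLES_PER_LIMB[routine] * limbs


def mul_basecase_cycles(xsize, ysize):
    """An xsize by ysize limb product has xsize*ysize crossproducts."""
    return MUL_BASECASE_PER_CROSSPRODUCT * xsize * ysize


def sqr_basecase_cycles(size):
    """Squaring, counted as size*size crossproducts at the quoted rate."""
    return SQR_BASECASE_PER_CROSSPRODUCT * size * size


if __name__ == "__main__":
    n = 20
    print("mpn_mul_1, %d limbs: about %.0f cycles"
          % (n, linear_cycles("mpn_mul_1", n)))
    print("mul_basecase %dx%d: about %.0f cycles"
          % (n, n, mul_basecase_cycles(n, n)))
    print("sqr_basecase %d:    about %.0f cycles"
          % (n, sqr_basecase_cycles(n)))
```

Under this model a basecase square costs roughly half of a same-size
basecase multiply, which is what the 2.3 versus 4.42 figures express.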

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 is a limit on
the speed of the multiplication routines.  The documentation shows mul
executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that,
to get near 3 cycles, code has to be arranged so that nothing else is issued
to IEU0.  A busy IEU0 could explain why some code takes 4 cycles and other
apparently equivalent code takes 5.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has 64k L1 code cache so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.

Position independent code is implemented using a call to get %eip for the
computed jumps, and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.
K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions an effort is made to get triplets of
direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.
Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: