Copyright 1999-2001, 2003-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



                    POWERPC-64 MPN SUBROUTINES


This directory contains mpn functions for 64-bit PowerPC chips.


CODE ORGANIZATION

        mpn/powerpc64          mode-neutral code
        mpn/powerpc64/mode32   code for mode32
        mpn/powerpc64/mode64   code for mode64


The mode32 and mode64 sub-directories contain code which is for use in the
respective chip mode, 32 or 64.  The top-level directory is code that's
unaffected by the mode.

The "adde" instruction is the main difference between mode32 and mode64.  It
operates on either a 32-bit or a 64-bit quantity according to the chip mode.
Other instructions have an operand size in their opcode and hence don't vary.



POWER3/PPC630 pipeline information:

Decoding is 4-way + branch and issue is 8-way with some out-of-order
capability.

Functional units:
LS1  - ld/st unit 1
LS2  - ld/st unit 2
FXU1 - integer unit 1, handles any simple integer instruction
FXU2 - integer unit 2, handles any simple integer instruction
FXU3 - integer unit 3, handles integer multiply and divide
FPU1 - floating-point unit 1
FPU2 - floating-point unit 2

Memory:           Any two memory operations can issue, but the memory
                  subsystem can sustain just one store per cycle.  No need
                  for data prefetch; the hardware has very sophisticated
                  prefetch logic.
Simple integer:   2 operations (such as add, rl*)
Integer multiply: 1 operation every 9th cycle worst case; exact timing depends
                  on 2nd operand's most significant bit position (10 bits per
                  cycle).  Multiply unit is not pipelined, only one multiply
                  operation in progress is allowed.
Integer divide:   ?
Floating-point:   Any 2 plain arithmetic instructions (such as fmul, fadd, and
                  fmadd), latency 4 cycles.
Floating-point divide:
                  ?
Floating-point square root:
                  ?

POWER3/PPC630 best possible times for the main loops:
shift:            1.5 cycles, limited by integer unit contention.
                  With 63 special loops, one for each shift count, we could
                  reduce the needed integer instructions to 2, which would
                  reduce the best possible time to 1 cycle.
add/sub:          1.5 cycles, limited by ld/st unit contention.
mul:              18 cycles (average) unless floating-point operations are
                  used, but that would only help for multiplies of perhaps 10
                  and more limbs.
addmul/submul:    Same situation as for mul.


POWER4/PPC970 and POWER5 pipeline information:

This is a very odd pipeline; it is basically a VLIW masquerading as a plain
architecture.  Its issue rules are not made public, and since it is so weird,
it is very hard to figure out any useful information from experimentation.
An example:

  A well-aligned loop with nop's takes 3, 4, 6, 7, ... cycles.
    3 cycles for  0,  1,  2,  3,  4,  5,  6,  7 nop's
    4 cycles for  8,  9, 10, 11, 12, 13, 14, 15 nop's
    6 cycles for 16, 17, 18, 19, 20, 21, 22, 23 nop's
    7 cycles for 24, 25, 26, 27 nop's
    8 cycles for 28, 29, 30, 31 nop's
    ... continues regularly


Functional units:
LS1  - ld/st unit 1
LS2  - ld/st unit 2
FXU1 - integer unit 1, handles any integer instruction
FXU2 - integer unit 2, handles any integer instruction
FPU1 - floating-point unit 1
FPU2 - floating-point unit 2

While this is one integer unit less than POWER3/PPC630, the remaining units
are more powerful; here they handle multiply and divide.

Memory:           2 ld/st.  Stores go to the L2 cache, which can sustain just
                  one store per cycle.
                  L1 load latency: to gregs 3-4 cycles, to fregs 5-6 cycles.
                  Operations that modify the address register might be split
                  to use also an integer issue slot.
Simple integer:   2 operations every cycle, latency 2.
Integer multiply: 2 operations every 6th cycle, latency 7 cycles.
Integer divide:   ?
Floating-point:   Any 2 plain arithmetic instructions (such as fmul, fadd, and
                  fmadd), latency 6 cycles.
Floating-point divide:
                  ?
Floating-point square root:
                  ?


IDEAS

*mul_1: Handling one limb using mulld/mulhdu and two limbs using floating-
point operations should give performance of about 20 cycles for 3 limbs, or 7
cycles/limb.

We should probably split the single-limb operand in 32-bit chunks, and the
multi-limb operand in 16-bit chunks, allowing us to accumulate well in fp
registers.

The problem is to get 32-bit or 16-bit words to the fp registers.  Only 64-bit
fp memops copy bits without fiddling with them.  We might therefore need to
load to integer registers with zero extension, store as 64 bits into temp
space, and then load to fp regs.  Alternatively, load directly to fp space
and add well-chosen constants to get cancellation.
(The other part is then given by a subsequent subtraction.)

Possible code mix for load-via-intregs variant:

lwz,std,lfd
fmadd,fmadd,fmul,fmul
fctidz,stfd,ld,fctidz,stfd,ld
add,adde
lwz,std,lfd
fmadd,fmadd,fmul,fmul
fctidz,stfd,ld,fctidz,stfd,ld
add,adde
srd,sld,add,adde,add,adde
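
The chunking idea for *mul_1 above can be illustrated with a minimal C sketch.
It is not GMP code and not tuned for any chip.  The names mul_1_fp_sketch and
acc_add_shifted are hypothetical, and it assumes IEEE double precision (53-bit
significand).  A 16-bit chunk times a 32-bit chunk is below 2^48, so each
product, and each sum of two products of equal weight, is exact in a double.
Unlike the idea above, the sketch does not keep sums in fp registers across
limbs; it converts back to integers once per limb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add (x << shift) into the 128-bit accumulator hi:lo.
   Assumes shift is a multiple of 16 in 0..80 and that the caller
   guarantees the running total still fits in 128 bits. */
static void acc_add_shifted(uint64_t *lo, uint64_t *hi, uint64_t x,
                            unsigned shift)
{
    uint64_t add_lo, add_hi;

    assert(shift <= 80 && shift % 16 == 0);
    if (shift == 0) {
        add_lo = x;
        add_hi = 0;
    } else if (shift < 64) {
        add_lo = x << shift;           /* low part, like sld */
        add_hi = x >> (64 - shift);    /* high part, like srd */
    } else {
        add_lo = 0;
        add_hi = x << (shift - 64);
    }
    *lo += add_lo;                     /* like add  */
    *hi += add_hi + (*lo < add_lo);    /* like adde */
}

/* Hypothetical helper: {rp,n} = {up,n} * v, returns the carry-out limb.
   v is split into two 32-bit chunks, each limb of up into four 16-bit
   chunks, and all partial products are formed in double precision. */
uint64_t mul_1_fp_sketch(uint64_t *rp, const uint64_t *up, size_t n,
                         uint64_t v)
{
    const double v0 = (double)(uint32_t)v;
    const double v1 = (double)(uint32_t)(v >> 32);
    uint64_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        double u[4], col[6];

        for (int j = 0; j < 4; j++)
            u[j] = (double)(uint16_t)(up[i] >> (16 * j));

        /* col[k] carries weight 2^(16k); every value is below 2^49. */
        col[0] = u[0] * v0;
        col[1] = u[1] * v0;
        col[2] = u[2] * v0 + u[0] * v1;
        col[3] = u[3] * v0 + u[1] * v1;
        col[4] = u[2] * v1;
        col[5] = u[3] * v1;

        uint64_t lo = carry, hi = 0;
        for (int k = 0; k < 6; k++)
            acc_add_shifted(&lo, &hi, (uint64_t)col[k], 16u * (unsigned)k);

        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

The double-to-integer conversions stand in for fctidz, and the shifted
additions for the srd/sld/add/adde tail of the code mix.  Whether such a
scheme reaches the estimated cycle counts is not something this sketch shows.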