| Copyright (c) 1988, 1993
|	Regents of the University of California.  All rights reserved.
|
| %sccs.include.redist.gas%
|
|	@(#)oc_cksum.s	8.1 (Berkeley) 06/10/93
|
|
| oc_cksum: ones complement 16 bit checksum for MC68020.
|
| oc_cksum (buffer, count, strtval)
|
| Do a 16 bit one's complement sum of 'count' bytes from 'buffer'.
| 'strtval' is the starting value of the sum (usually zero).
|
| It simplifies life in in_cksum if strtval can be >= 2^16.
| This routine will work as long as strtval is < 2^31.
|
| Performance
| -----------
| This routine is intended for MC68020s but should also work
| for 68030s.  It (deliberately) doesn't worry about the alignment
| of the buffer so will only work on a 68010 if the buffer is
| aligned on an even address.  (Also, a routine written to use
| 68010 "loop mode" would almost certainly be faster than this
| code on a 68010.)
|
| We don't worry about alignment because this routine is frequently
| called with small counts: 20 bytes for IP header checksums and 40
| bytes for TCP ack checksums.  For these small counts, testing for
| bad alignment adds ~10% to the per-call cost.  Since, by the nature
| of the kernel's allocator, the data we're called with is almost
| always longword aligned, there is no benefit to this added cost
| and we're better off letting the loop take a big performance hit
| in the rare cases where we're handed an unaligned buffer.
|
| Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
| tested on random data on four different types of processors (see
| list below -- 64 was the largest unrolling because anything more
| overflows the 68020 Icache).  On all the processors, the
| throughput asymptote was located between 8 and 16 (closer to 8).
| However, 16 was substantially better than 8 for small counts.
| (It's clear why this happens for a count of 40: unroll-8 pays a
| loop branch cost and unroll-16 doesn't.  But the tests also showed
| that 16 was better than 8 for a count of 20.  It's not obvious to
| me why.)  So, since 16 was good for both large and small counts,
| the loop below is unrolled 16 times.
|
| The processors tested and their average time to checksum 1024 bytes
| of random data were:
|	Sun 3/50 (15MHz)	190 us/KB
|	Sun 3/180 (16.6MHz)	175 us/KB
|	Sun 3/60 (20MHz)	134 us/KB
|	Sun 3/280 (25MHz)	 95 us/KB
|
| The cost of calling this routine was typically 10% of the per-
| kilobyte cost.  E.g., checksumming zero bytes on a 3/60 cost 9us
| and each additional byte cost 125ns.  With the high fixed cost,
| it would clearly be a gain to "inline" this routine -- the
| subroutine call adds 400% overhead to an IP header checksum.
| However, in absolute terms, inlining would only gain 10us per
| packet -- a 1% effect for a 1ms ethernet packet.  This is not
| enough gain to be worth the effort.

	.data
	.asciz	"@(#)$Header: oc_cksum.s,v 1.1 90/07/09 16:04:43 mike Exp $"
	.even
	.text

	.globl	_oc_cksum
_oc_cksum:
	movl	sp@(4),a0	| get buffer ptr
	movl	sp@(8),d1	| get byte count
	movl	sp@(12),d0	| get starting value
	movl	d2,sp@-		| free a reg

	| test for possible 1, 2 or 3 bytes of excess at end
	| of buffer.  The usual case is no excess (the usual
	| case is header checksums) so we give that the faster
	| 'not taken' leg of the compare.  (We do the excess
	| first because we're about to trash the low order
	| bits of the count in d1.)
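For reference, the sum this routine computes can be sketched in C (my illustration, not part of the original source; the name `oc_cksum_ref` is made up). It adds big-endian 16-bit words, treats a trailing odd byte as the high half of a final word, and folds carries back into the low 16 bits -- the same result the assembly below reaches with its carry-propagating addxl loop and the swap/addxw fold:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of oc_cksum: a 16-bit one's complement
 * sum of 'count' bytes from 'buf', seeded with 'strtval'.  Words are
 * taken in big-endian (network) order, as on the 68020. */
uint32_t
oc_cksum_ref(const unsigned char *buf, size_t count, uint32_t strtval)
{
	uint64_t sum = strtval;

	/* sum full 16-bit big-endian words */
	while (count >= 2) {
		sum += ((uint32_t)buf[0] << 8) | buf[1];
		buf += 2;
		count -= 2;
	}
	/* a trailing odd byte is the high half of a final word */
	if (count)
		sum += (uint32_t)buf[0] << 8;

	/* fold carries back in until the sum fits in 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint32_t)sum;
}
```

This is only a behavioral sketch; the assembly gets the same answer much faster by summing 32-bit longwords with addxl and folding once at the end.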

	btst	#0,d1
	jne	L5		| if one or three bytes excess
	btst	#1,d1
	jne	L7		| if two bytes excess
L1:
	movl	d1,d2
	lsrl	#6,d1		| make cnt into # of 64 byte chunks
	andl	#0x3c,d2	| then find fractions of a chunk
	negl	d2
	andb	#0xf,cc		| clear X
	jmp	pc@(L3-.-2:b,d2)
L2:
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
L3:
	dbra	d1,L2		| (NB- dbra doesn't affect X)

	movl	d0,d1		| fold 32 bit sum to 16 bits
	swap	d1		| (NB- swap doesn't affect X)
	addxw	d1,d0
	jcc	L4
	addw	#1,d0
L4:
	andl	#0xffff,d0
	movl	sp@+,d2
	rts

L5:	| deal with 1 or 3 excess bytes at the end of the buffer.
	btst	#1,d1
	jeq	L6		| if 1 excess

	| 3 bytes excess
	clrl	d2
	movw	a0@(-3,d1:l),d2	| add in last full word then drop
	addl	d2,d0		| through to pick up last byte

L6:	| 1 byte excess
	clrl	d2
	movb	a0@(-1,d1:l),d2
	lsll	#8,d2
	addl	d2,d0
	jra	L1

L7:	| 2 bytes excess
	clrl	d2
	movw	a0@(-2,d1:l),d2
	addl	d2,d0
	jra	L1