k7/mmx/divrem_1.asm

35 C K7: 17.0 cycles/limb integer part, 15.0 cycles/limb fraction part.  label
38 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,  label
39 C                         mp_srcptr src, mp_size_t size,  label
40 C                         mp_limb_t divisor);  label
41 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,  label
42 C                          mp_srcptr src, mp_size_t size,  label
43 C                          mp_limb_t divisor, mp_limb_t carry);  label
44 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,  label
45 C                                mp_srcptr src, mp_size_t size,  label
46 C                                mp_limb_t divisor, mp_limb_t inverse,  label
47 C                                unsigned shift);  label
48 C  label
49 C Algorithm:  label
50 C  label
52 C Integers using Multiplication" by Granlund and Montgomery, reference in  label
53 C gmp.texi.  label
54 C  label
55 C The "and"s shown in the paper are done here with "cmov"s.  "m" is written  label
56 C for m', and "d" for d_norm, which won't cause any confusion since it's  label
57 C only the normalized divisor that's of any use in the code.  "b" is written  label
58 C for 2^N, the size of a limb, N being 32 here.  label
59 C  label
60 C The step "sdword dr = n - 2^N*d + (2^N-1-q1) * d" is instead done as  label
61 C "n-(q1+1)*d"; this rearrangement gives the same two-limb answer.  If  label
62 C q1==0xFFFFFFFF, then q1+1 would overflow.  We branch to a special case  label
63 C "q1_ff" if this occurs.  Since the true quotient is either q1 or q1+1 then  label
64 C if q1==0xFFFFFFFF that must be the right value.  label
65 C  label
66 C For the last and second last steps q1==0xFFFFFFFF is instead handled by an  label
67 C sbbl to go back to 0xFFFFFFFF if an overflow occurs when adding 1.  This  label
68 C then goes through as normal, and finding no addback required.  sbbl costs  label
69 C an extra cycle over what the main loop code does, but it keeps code size  label
70 C and complexity down.  label
71 C  label
72 C Notes:  label
73 C  label
74 C mpn_divrem_1 and mpn_preinv_divrem_1 avoid one division if the src high  label
75 C limb is less than the divisor.  mpn_divrem_1c doesn't check for a zero  label
76 C carry, since in normal circumstances that will be a very rare event.  label
77 C  label
78 C The test for skipping a division is branch free (once size>=1 is tested).  label
79 C The store to the destination high limb is 0 when a divide is skipped, or  label
80 C if it's not skipped then a copy of the src high limb is used.  The latter  label
81 C is in case src==dst.  label
82 C  label
83 C There's a small bias towards expecting xsize==0, by having code for  label
84 C xsize==0 in a straight line and xsize!=0 under forward jumps.  label
85 C  label
86 C Alternatives:  label
87 C  label
88 C If the divisor is normalized (high bit set) then a division step can  label
89 C always be skipped, since the high destination limb is always 0 or 1 in  label
90 C that case.  It doesn't seem worth checking for this though, since it  label
91 C probably occurs infrequently, in particular note that big_base for a  label
92 C decimal mpn_get_str is not normalized in a 32-bit limb.  label
264 C With MUL_THRESHOLD set to 3, the simple loops here only do 0 to 2 limbs.  label
265 C It'd be possible to write them out without the looping, but no speedup  label
266 C would be expected.  label
267 C  label
268 C Using PARAM_DIVISOR instead of %ebp measures 1 cycle/loop faster on the  label
269 C integer part, but curiously not on the fractional part, where %ebp is a  label
270 C (fixed) couple of cycles faster.  label
331 C -----------------------------------------------------------------------------  label
426 C -----------------------------------------------------------------------------  label
427 C  label
428 C The multiply by inverse loop is 17 cycles, and relies on some out-of-order  label
429 C execution.  The instruction scheduling is important, with various  label
430 C apparently equivalent forms running 1 to 5 cycles slower.  label
431 C  label
432 C A lower bound for the time would seem to be 16 cycles, based on the  label
433 C following successive dependencies.  label
434 C  label
435 C		      cycles  label
436 C		n2+n1	1  label
437 C		mul	6  label
438 C		q1+1	1  label
439 C		mul	6  label
440 C		sub	1  label
441 C		addback	1  label
442 C		       ---  label
443 C		       16  label
444 C  label
445 C This chain is what the loop has already, but 16 cycles isn't achieved.  label
446 C K7 has enough decode, and probably enough execute (depending maybe on what  label
447 C a mul actually consumes), but nothing running under 17 has been found.  label
448 C  label
449 C In theory n2+n1 could be done in the sub and addback stages (by  label
450 C calculating both n2 and n2+n1 there), but lack of registers makes this an  label
451 C unlikely proposition.  label
452 C  label
453 C The jz in the loop keeps the q1+1 stage to 1 cycle.  Handling an overflow  label
454 C from q1+1 with an "sbbl $0, %ebx" would add a cycle to the dependent  label
455 C chain, and nothing better than 18 cycles has been found when using it.  label
456 C The jump is taken only when q1 is 0xFFFFFFFF, and on random data this will  label
457 C be an extremely rare event.  label
458 C  label
459 C Branch mispredictions will hit random occurrences of q1==0xFFFFFFFF, but  label
460 C if some special data is coming out with this always, the q1_ff special  label
461 C case actually runs at 15 c/l.  0x2FFF...FFFD divided by 3 is a good way to  label
462 C induce the q1_ff case, for speed measurements or testing.  Note that  label
463 C 0xFFF...FFF divided by 1 or 2 doesn't induce it.  label
464 C  label
465 C The instruction groupings and empty comments show the cycles for a naive  label
466 C in-order view of the code (conveniently ignoring the load latency on  label
467 C VAR_INVERSE).  This shows some of where the time is going, but is nonsense  label
468 C to the extent that out-of-order execution rearranges it.  In this case  label
469 C there's 19 cycles shown, but it executes at 17.  label
542 C -----------------------------------------------------------------------------  label
543 C  label
544 C Here, and in integer_one_left below, an sbbl $0 is used rather than a jz  label
545 C q1_ff special case.  This make the code a bit smaller and simpler, and  label
546 C costs only 1 cycle (each).  label
607 C -----------------------------------------------------------------------------  label
691 C -----------------------------------------------------------------------------  label
692 C  label
693 C Special case for q1=0xFFFFFFFF, giving q=0xFFFFFFFF meaning the low dword  label
694 C of q*d is simply -d and the remainder n-q*d = n10+d  label
723 C -----------------------------------------------------------------------------  label
724 C  label
725 C Being the fractional part, the "source" limbs are all zero, meaning  label
726 C n10=0, n1=0, and hence nadj=0, leading to many instructions eliminated.  label
727 C  label
728 C The loop runs at 15 cycles.  The dependent chain is the same as the  label
729 C general case above, but without the n2+n1 stage (due to n1==0), so 15  label
730 C would seem to be the lower bound.  label
731 C  label
732 C A not entirely obvious simplification is that q1+1 never overflows a limb,  label
733 C and so there's no need for the sbbl $0 or jz q1_ff from the general case.  label
734 C q1 is the high word of m*n2+b*n2 and the following shows q1<=b-2 always.  label
735 C rnd() means rounding down to a multiple of d.  label
736 C  label
737 C	m*n2 + b*n2 <= m*(d-1) + b*(d-1)  label
738 C		     = m*d + b*d - m - b  define
739 C		     = floor((b(b-d)-1)/d)*d + b*d - m - b  define
740 C		     = rnd(b(b-d)-1) + b*d - m - b  define
741 C		     = rnd(b(b-d)-1 + b*d) - m - b  define
742 C		     = rnd(b*b-1) - m - b  define
743 C		     <= (b-2)*b  label
744 C  label
745 C Unchanged from the general case is that the final quotient limb q can be  label
746 C either q1 or q1+1, and the q1+1 case occurs often.  This can be seen from  label
747 C equation 8.4 of the paper which simplifies as follows when n1==0 and  label
748 C n0==0.  label
749 C  label
750 C	n-q1*d = (n2*k+q0*d)/b <= d + (d*d-2d)/b  label
751 C  label
752 C As before, the instruction groupings and empty comments show a naive  label
753 C in-order view of the code, which is made a nonsense by out of order  label
754 C execution.  There's 17 cycles shown, but it executes at 15.  label
755 C  label
756 C Rotating the store q and remainder->n2 instructions up to the top of the  label
757 C loop gets the run time down from 16 to 15.  label