Lines Matching defs:C

35 C K7: 17.0 cycles/limb integer part, 15.0 cycles/limb fraction part.  label
38 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize, label
39 C mp_srcptr src, mp_size_t size, label
40 C mp_limb_t divisor); label
41 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize, label
42 C mp_srcptr src, mp_size_t size, label
43 C mp_limb_t divisor, mp_limb_t carry); label
44 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize, label
45 C mp_srcptr src, mp_size_t size, label
46 C mp_limb_t divisor, mp_limb_t inverse, label
47 C unsigned shift); label
48 C label
49 C Algorithm: label
50 C label
52 C Integers using Multiplication" by Granlund and Montgomery, reference in label
53 C gmp.texi. label
54 C label
55 C The "and"s shown in the paper are done here with "cmov"s. "m" is written label
56 C for m', and "d" for d_norm, which won't cause any confusion since it's label
57 C only the normalized divisor that's of any use in the code. "b" is written label
58 C for 2^N, the size of a limb, N being 32 here. label
59 C label
60 C The step "sdword dr = n - 2^N*d + (2^N-1-q1) * d" is instead done as label
61 C "n-(q1+1)*d"; this rearrangement gives the same two-limb answer. If label
62 C q1==0xFFFFFFFF, then q1+1 would overflow. We branch to a special case label
63 C "q1_ff" if this occurs. Since the true quotient is either q1 or q1+1 then label
64 C if q1==0xFFFFFFFF that must be the right value. label
65 C label
66 C For the last and second last steps q1==0xFFFFFFFF is instead handled by an label
67 C sbbl to go back to 0xFFFFFFFF if an overflow occurs when adding 1. This label
68 C then goes through as normal, and finding no addback required. sbbl costs label
69 C an extra cycle over what the main loop code does, but it keeps code size label
70 C and complexity down. label
71 C label
72 C Notes: label
73 C label
74 C mpn_divrem_1 and mpn_preinv_divrem_1 avoid one division if the src high label
75 C limb is less than the divisor. mpn_divrem_1c doesn't check for a zero label
76 C carry, since in normal circumstances that will be a very rare event. label
77 C label
78 C The test for skipping a division is branch free (once size>=1 is tested). label
79 C The store to the destination high limb is 0 when a divide is skipped, or label
80 C if it's not skipped then a copy of the src high limb is used. The latter label
81 C is in case src==dst. label
82 C label
83 C There's a small bias towards expecting xsize==0, by having code for label
84 C xsize==0 in a straight line and xsize!=0 under forward jumps. label
85 C label
86 C Alternatives: label
87 C label
88 C If the divisor is normalized (high bit set) then a division step can label
89 C always be skipped, since the high destination limb is always 0 or 1 in label
90 C that case. It doesn't seem worth checking for this though, since it label
91 C probably occurs infrequently, in particular note that big_base for a label
92 C decimal mpn_get_str is not normalized in a 32-bit limb. label
264 C With MUL_THRESHOLD set to 3, the simple loops here only do 0 to 2 limbs. label
265 C It'd be possible to write them out without the looping, but no speedup label
266 C would be expected. label
267 C label
268 C Using PARAM_DIVISOR instead of %ebp measures 1 cycle/loop faster on the label
269 C integer part, but curiously not on the fractional part, where %ebp is a label
270 C (fixed) couple of cycles faster. label
331 C ----------------------------------------------------------------------------- label
426 C ----------------------------------------------------------------------------- label
427 C label
428 C The multiply by inverse loop is 17 cycles, and relies on some out-of-order label
429 C execution. The instruction scheduling is important, with various label
430 C apparently equivalent forms running 1 to 5 cycles slower. label
431 C label
432 C A lower bound for the time would seem to be 16 cycles, based on the label
433 C following successive dependencies. label
434 C label
435 C cycles label
436 C n2+n1 1 label
437 C mul 6 label
438 C q1+1 1 label
439 C mul 6 label
440 C sub 1 label
441 C addback 1 label
442 C --- label
443 C 16 label
444 C label
445 C This chain is what the loop has already, but 16 cycles isn't achieved. label
446 C K7 has enough decode, and probably enough execute (depending maybe on what label
447 C a mul actually consumes), but nothing running under 17 has been found. label
448 C label
449 C In theory n2+n1 could be done in the sub and addback stages (by label
450 C calculating both n2 and n2+n1 there), but lack of registers makes this an label
451 C unlikely proposition. label
452 C label
453 C The jz in the loop keeps the q1+1 stage to 1 cycle. Handling an overflow label
454 C from q1+1 with an "sbbl $0, %ebx" would add a cycle to the dependent label
455 C chain, and nothing better than 18 cycles has been found when using it. label
456 C The jump is taken only when q1 is 0xFFFFFFFF, and on random data this will label
457 C be an extremely rare event. label
458 C label
459 C Branch mispredictions will hit random occurrences of q1==0xFFFFFFFF, but label
460 C if some special data is coming out with this always, the q1_ff special label
461 C case actually runs at 15 c/l. 0x2FFF...FFFD divided by 3 is a good way to label
462 C induce the q1_ff case, for speed measurements or testing. Note that label
463 C 0xFFF...FFF divided by 1 or 2 doesn't induce it. label
464 C label
465 C The instruction groupings and empty comments show the cycles for a naive label
466 C in-order view of the code (conveniently ignoring the load latency on label
467 C VAR_INVERSE). This shows some of where the time is going, but is nonsense label
468 C to the extent that out-of-order execution rearranges it. In this case label
469 C there's 19 cycles shown, but it executes at 17. label
542 C ----------------------------------------------------------------------------- label
543 C label
544 C Here, and in integer_one_left below, an sbbl $0 is used rather than a jz label
545 C q1_ff special case. This make the code a bit smaller and simpler, and label
546 C costs only 1 cycle (each). label
607 C ----------------------------------------------------------------------------- label
691 C ----------------------------------------------------------------------------- label
692 C label
693 C Special case for q1=0xFFFFFFFF, giving q=0xFFFFFFFF meaning the low dword label
694 C of q*d is simply -d and the remainder n-q*d = n10+d label
723 C ----------------------------------------------------------------------------- label
724 C label
725 C Being the fractional part, the "source" limbs are all zero, meaning label
726 C n10=0, n1=0, and hence nadj=0, leading to many instructions eliminated. label
727 C label
728 C The loop runs at 15 cycles. The dependent chain is the same as the label
729 C general case above, but without the n2+n1 stage (due to n1==0), so 15 label
730 C would seem to be the lower bound. label
731 C label
732 C A not entirely obvious simplification is that q1+1 never overflows a limb, label
733 C and so there's no need for the sbbl $0 or jz q1_ff from the general case. label
734 C q1 is the high word of m*n2+b*n2 and the following shows q1<=b-2 always. label
735 C rnd() means rounding down to a multiple of d. label
736 C label
737 C m*n2 + b*n2 <= m*(d-1) + b*(d-1) label
738 C = m*d + b*d - m - b define
739 C = floor((b(b-d)-1)/d)*d + b*d - m - b define
740 C = rnd(b(b-d)-1) + b*d - m - b define
741 C = rnd(b(b-d)-1 + b*d) - m - b define
742 C = rnd(b*b-1) - m - b define
743 C <= (b-2)*b label
744 C label
745 C Unchanged from the general case is that the final quotient limb q can be label
746 C either q1 or q1+1, and the q1+1 case occurs often. This can be seen from label
747 C equation 8.4 of the paper which simplifies as follows when n1==0 and label
748 C n0==0. label
749 C label
750 C n-q1*d = (n2*k+q0*d)/b <= d + (d*d-2d)/b label
751 C label
752 C As before, the instruction groupings and empty comments show a naive label
753 C in-order view of the code, which is made a nonsense by out of order label
754 C execution. There's 17 cycles shown, but it executes at 15. label
755 C label
756 C Rotating the store q and remainder->n2 instructions up to the top of the label
757 C loop gets the run time down from 16 to 15. label