• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..25-Mar-2014-

Makefile.amH A D25-Mar-20146.7 KiB18288

Makefile.inH A D25-Mar-201429.4 KiB861640

READMEH A D25-Mar-201419.1 KiB502329

alpha.asmH A D25-Mar-20142 KiB6050

common.cH A D25-Mar-201461.2 KiB2,6602,196

div_qr_1_tune.cH A D25-Mar-20141.5 KiB4812

div_qr_1n_pi1_1.cH A D25-Mar-20141.2 KiB407

div_qr_1n_pi1_2.cH A D25-Mar-20141.2 KiB407

divrem1div.cH A D25-Mar-20141.3 KiB439

divrem1inv.cH A D25-Mar-20141.3 KiB439

divrem2div.cH A D25-Mar-20141.2 KiB428

divrem2inv.cH A D25-Mar-20141.2 KiB428

freq.cH A D25-Mar-201423.8 KiB895597

gcdext_double.cH A D25-Mar-20141.2 KiB406

gcdext_single.cH A D25-Mar-20141.2 KiB406

gcdextod.cH A D25-Mar-20141.2 KiB417

gcdextos.cH A D25-Mar-20141.2 KiB417

hgcd_appr_lehmer.cH A D25-Mar-20141.3 KiB417

hgcd_lehmer.cH A D25-Mar-20141.2 KiB417

hgcd_reduce_1.cH A D25-Mar-20141.3 KiB427

hgcd_reduce_2.cH A D25-Mar-20141.3 KiB417

hppa.asmH A D25-Mar-20141.4 KiB4337

hppa2.asmH A D25-Mar-20141.4 KiB4539

hppa2w.asmH A D25-Mar-20141.4 KiB4539

ia64.asmH A D25-Mar-20141.4 KiB4841

jacbase1.cH A D25-Mar-20141.2 KiB396

jacbase2.cH A D25-Mar-20141.2 KiB396

jacbase3.cH A D25-Mar-20141.2 KiB396

jacbase4.cH A D25-Mar-20141.2 KiB396

many.plH A D25-Mar-201439.5 KiB1,335907

mod_1_1-1.cH A D25-Mar-20141.2 KiB429

mod_1_1-2.cH A D25-Mar-20141.2 KiB429

mod_1_div.cH A D25-Mar-20141.5 KiB4713

mod_1_inv.cH A D25-Mar-20141.5 KiB4713

modlinv.cH A D25-Mar-20147.5 KiB179121

noop.cH A D25-Mar-20141.5 KiB6923

pentium.asmH A D25-Mar-20141.7 KiB6151

powerpc.asmH A D25-Mar-20141.4 KiB5443

powerpc64.asmH A D25-Mar-20141.3 KiB5039

powm_mod.cH A D25-Mar-20141.2 KiB406

powm_redc.cH A D25-Mar-20141.3 KiB427

pre_divrem_1.cH A D25-Mar-20141.2 KiB427

set_strb.cH A D25-Mar-20141.6 KiB4914

set_strp.cH A D25-Mar-20141.4 KiB4411

set_strs.cH A D25-Mar-20141.5 KiB4511

sparcv9.asmH A D25-Mar-20141.4 KiB4639

speed-ext.cH A D25-Mar-20147 KiB234114

speed.cH A D25-Mar-201442.5 KiB1,3851,138

speed.hH A D25-Mar-2014116.1 KiB3,6473,274

time.cH A D25-Mar-201447.2 KiB1,5981,108

tune-gcd-p.cH A D25-Mar-20144.5 KiB226151

tuneup.cH A D25-Mar-201477.1 KiB2,9132,180

x86_64.asmH A D25-Mar-20141.6 KiB5646

README

1Copyright 2000-2002, 2004 Free Software Foundation, Inc.
2
3This file is part of the GNU MP Library.
4
5The GNU MP Library is free software; you can redistribute it and/or modify
6it under the terms of either:
7
8  * the GNU Lesser General Public License as published by the Free
9    Software Foundation; either version 3 of the License, or (at your
10    option) any later version.
11
12or
13
14  * the GNU General Public License as published by the Free Software
15    Foundation; either version 2 of the License, or (at your option) any
16    later version.
17
18or both in parallel, as here.
19
20The GNU MP Library is distributed in the hope that it will be useful, but
21WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
22or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
23for more details.
24
25You should have received copies of the GNU General Public License and the
26GNU Lesser General Public License along with the GNU MP Library.  If not,
27see https://www.gnu.org/licenses/.
28
29
30
31
32
33               GMP SPEED MEASURING AND PARAMETER TUNING
34
35
36The programs in this directory are for knowledgeable users who want to
37measure GMP routines on their machine, and perhaps tweak some settings or
38identify things that can be improved.
39
40The programs here are tools, not ready to run solutions.  Nothing is built
41in a normal "make all", but various Makefile targets described below exist.
42
43Relatively few systems and CPUs have been tested, so be sure to verify that
44results are sensible before relying on them.
45
46
47
48
49MISCELLANEOUS NOTES
50
51--enable-assert
52
53    Don't configure with --enable-assert, since the extra code added by
54    assertion checking may influence measurements.
55
56Direct mapped caches
57
58    Some effort has been made to accommodate CPUs with direct mapped caches,
59    by putting data blocks more or less contiguously on the stack.  But this
60    will depend on TMP_ALLOC using alloca, and even then it may or may not
61    be enough.
62
63FreeBSD 4.2 i486 getrusage
64
65    This getrusage seems to be a bit doubtful, it looks like it's
66    microsecond accurate, but sometimes ru_utime remains unchanged after a
67    time of many microseconds has elapsed.  It'd be good to detect this in
68    the time.c initializations, but for now the suggestion is to pretend it
69    doesn't exist.
70
71        ./configure ac_cv_func_getrusage=no
72
73NetBSD 1.4.1 m68k macintosh time base
74
75    On this system it's been found getrusage often goes backwards, making it
76    unusable (time.c getrusage_backwards_p detects this).  gettimeofday
77    sometimes doesn't update atomically when it crosses a 1 second boundary.
78    Not sure what to do about this.  Expect possible intermittent failures.
79
80SCO OpenUNIX 8 /etc/hw
81
82    /etc/hw takes about a second to return the cpu frequency, which suggests
83    perhaps it's measuring each time it runs.  If this is annoying when
84    running the speed program repeatedly then set a GMP_CPU_FREQUENCY
85    environment variable (see TIME BASE section below).
86
87Timing on GNU/Linux
88
89    On Linux, timing currently uses the cycle counter. This is unreliable,
90    since the counter is not saved and restored at context switches (unlike
91    FreeBSD and Solaris where the cycle counter is "virtualized").
92
93    Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
94    CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
95    with glibc, one has to link with -lrt (which also drags in the pthreads
96    threading library). configure.in must be hacked to detect this and
97    arrange proper linking. Something like
98
99      old_LIBS="$LIBS"
100      AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
101      TUNE_LIBS="$LIBS"
102      LIBS="$old_LIBS"
103
104      AC_SUBST(TUNE_LIBS)
105
106    might work.
107
108Low resolution timebase
109
110    Parameter tuning can be very time consuming if the only timebase
111    available is a 10 millisecond clock tick, to the point of being
112    unusable.  This is currently the case on VAX and ARM systems.
113
114
115
116
117PARAMETER TUNING
118
119The "tuneup" program runs some tests designed to find the best settings for
120various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
121into gmp-mparam.h.  The program is built and run with
122
123        make tune
124
125If the thresholds indicated are grossly different from the values in the
126selected gmp-mparam.h then there may be a performance boost in applicable
127size ranges by changing gmp-mparam.h accordingly.
128
129Be sure to do a full reconfigure and rebuild to get any newly set thresholds
130to take effect.  A partial rebuild is enough sometimes, but a fresh
131configure and make is certain to be correct.
132
133If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
134the mpn subdirectories then the values from "make tune" should be similar.
135But check that the configured CPU is right and there are no machine specific
136effects causing a difference.
137
138It's hoped the compiler and options used won't have too much effect on
139thresholds, since for most CPUs they ultimately come down to comparisons
140between assembler subroutines.  Missing out on the longlong.h macros by not
141using gcc will probably have an effect.
142
143Some thresholds produced by the tune program are merely single values chosen
144from what's a range of sizes where two algorithms are pretty much the same
145speed.  When this happens the program is likely to give somewhat different
146values on successive runs.  This is noticeable on the toom3 thresholds for
147instance.
148
149
150
151
152SPEED PROGRAM
153
154The "speed" program can be used for measuring and comparing various
155routines, and producing tables of data or gnuplot graphs.  Compile it with
156
157	make speed
158
159(Or on DOS systems "make speed.exe".)
160
161Here are some examples of how to use it.  Check the code for all the
162options.
163
164Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
165(whichever is greater).
166
167        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
168	gnuplot foo.gnuplot
169
170Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
171showing under mpn_lshift the difference between it and mpn_add_n.
172
173	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
174
175Using option -c for times in cycles is interesting but normally only
176necessary when looking carefully at assembler subroutines.  You might think
177it would always give an integer value, but this doesn't happen in practice,
178probably due to overheads in the time measurements.
179
180In the free-form output the "#" symbol against a measurement means the
181corresponding routine is fastest at that size.  This is a convenient visual
182cue when comparing different routines.  The graph data files <name>.data
183don't get this since it would upset gnuplot or other data viewers.
184
185
186
187
188TIME BASE
189
190The time measuring method is determined in time.c, based on what the
191configured host has available.  A cycle counter is preferred, possibly
192supplemented by another method if the counter has a limited range.  A
193microsecond accurate getrusage() or gettimeofday() will work quite well too.
194
195The cycle counters (except possibly on alpha) and gettimeofday() will depend
196on the machine being otherwise idle, or rather on other jobs not stealing
197CPU time from the measuring program.  Short routines (those that complete
198within a timeslice) should work even on a busy machine.
199
200Some trouble is taken by speed_measure() in common.c to avoid ill effects
201from sporadic interrupts, or other intermittent things (like cron waking up
202every minute).  But generally an idle machine will be necessary to be
203certain of consistent results.
204
205The CPU frequency is needed to convert between cycles and seconds, or for
206when a cycle counter is supplemented by getrusage() etc.  The speed program
207will convert as necessary according to the output format requested.  The
208tune program will work with either cycles or seconds.
209
210freq.c knows how to get the frequency on some systems, or can measure a
211cycle counter against gettimeofday() or getrusage(), but when that fails, or
212needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
213used (in Hertz).  For example in "bash" on a 650 MHz machine,
214
215	export GMP_CPU_FREQUENCY=650e6
216
217A high precision time base makes it possible to get accurate measurements in
218a shorter time.
219
220
221
222
223EXAMPLE COMPARISONS - VARIOUS
224
225Here are some ideas for things that can be done with the speed program.
226
227There's always going to be a certain amount of overhead in the time
228measurements, due to reading the time base, and in the loop that runs a
229routine enough times to get a reading of the desired precision.  Noop
230functions taking various arguments are available to measure this.  The
231"overhead" printed by the speed program each time in its intro is the "noop"
232routine, but note that this is just for information, it isn't deducted from
233the times printed or anything.
234
235	./speed -s 1 noop noop_wxs noop_wxys
236
237To see how many cycles per limb a routine is taking, look at the time
238increase when the size increments, using option -D.  This avoids fixed
239overheads in the measuring.  Also, remember many of the assembler routines
240have unrolled loops, so it might be necessary to compare times at, say, 16,
24132, 48, 64 etc to see what the unrolled part is taking, as opposed to any
242finishing off.
243
244        ./speed -s 16-64 -t 16 -C -D mpn_add_n
245
246The -C option on its own gives cycles per limb, but is really only useful at
247big sizes where fixed overheads are small compared to the code doing the
248real work.  Remember of course memory caching and/or page swapping will
249affect results at large sizes.
250
251        ./speed -s 500000 -C mpn_add_n
252
253Once a calculation stops fitting in the CPU data cache, it's going to start
254taking longer.  Exactly where this happens depends on the cache priming in
255the measuring routines, and on what sort of "least recently used" the
256hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
25732-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
258limbs.
259
260        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
261	gnuplot foo.gnuplot
262
263When a routine has an unrolled loop for, say, multiples of 8 limbs and then
264an ordinary loop for the remainder, it can happen that it's actually faster
265to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
266draws a graph of mpn_sub_n, to see whether times smoothly increase with
267size.
268
269        ./speed -s 1-100 -c -P foo mpn_sub_n
270	gnuplot foo.gnuplot
271
272If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
273ought to be faster (or at least not slower) than shifting by, say, 2 bits.
274
275        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
276
277An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
278if the lshift isn't faster there's an obvious improvement that's possible.
279
280        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
281
282On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
283destination is one of the sources is faster than a separate destination.
284Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
285mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
286
287        ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
288
289The gmp manual points out that divisions by powers of two should be done
290using a right shift because it'll be significantly faster than an actual
291division.  The following shows by what factor mpn_rshift is faster than
292mpn_divrem_1, using division by 32 as an example.
293
294        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
295
296
297
298
299EXAMPLE COMPARISONS - MULTIPLICATION
300
301mul_basecase takes a ".<r>" parameter. If positive, it gives the second
302(smaller) operand size.  For example to show speeds for 3x3 up to 20x3 in
303cycles,
304
305        ./speed -s 3-20 -c mpn_mul_basecase.3
306
307A negative ".<-r>" parameter fixes the size of the product to the absolute
308value r.  For example to show speeds for 10x10 up to 19x1 in cycles,
309
310        ./speed -s 10-19 -c mpn_mul_basecase.-20
311
312mul_basecase with no parameter does an NxN multiply, so for example to show
313speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
314
315        ./speed -s 1-20 -c mpn_mul_basecase
316
317sqr_basecase is implemented by a "triangular" method on most CPUs, making it
318up to twice as fast as mul_basecase.  In practice loop overheads and the
319products on the diagonal mean it falls short of this.  Here's an example
320running the two and showing by what factor an NxN mul_basecase is slower
321than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
322below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)
323
324        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
325
326The technique described above with -CD for showing the time difference in
327cycles per limb between two size operations can be done on an NxN
328mul_basecase using -E to change the basis for the size increment to N*N.
329For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
330doing 256 limbs.  The following therefore shows the per crossproduct speed
331of mul_basecase and sqr_basecase at around 20x20 limbs.
332
333        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
334
335Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
336interesting to compare it to mul_basecase as if it was.  For sqr_basecase
337the -F option can be used to base the deltas on N*(N+1)/2 operations, which
338is the triangular products sqr_basecase does.  For example,
339
340        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
341
342Both -E and -F are preliminary and might change.  A consistent approach to
343using them when claiming certain per crossproduct or per triangularproduct
344speeds hasn't really been established, but the increment between speeds in
345the range karatsuba will call seems sensible, that being k to k/2.  For
346instance, if the karatsuba threshold was 20 for the multiply and 30 for the
347square,
348
349        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
350        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
351
352
353
354EXAMPLE COMPARISONS - MALLOC
355
356The gmp manual recommends application programs avoid excessive initializing
357and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
358variable will at a minimum go through an init, a realloc for its first
359store, and finally a clear.  Quite how long that takes depends on the C
360library.  The following compares an mpz_init/realloc/clear to a 10 limb
361mpz_add.  Don't be surprised if the mallocing is quite slow.
362
363        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
364
365On some systems malloc and free are much slower when dynamic linked.  The
366speed-dynamic program can be used to see this.  For example the following
367measures malloc/free, first static then dynamic.
368
369        ./speed -s 10 -c malloc_free
370        ./speed-dynamic -s 10 -c malloc_free
371
372Of course a real world program has big problems if it's doing so many
373mallocs and frees that it gets slowed down by a dynamic linked malloc.
374
375
376
377
378
379EXAMPLE COMPARISONS - STRING CONVERSIONS
380
381mpn_get_str does a binary to string conversion.  The base is specified with
382a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
383than general bases.  The following compares decimal and hex for instance.
384
385        ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
386
387Smaller bases need more divisions to split a given size number, and so are
388slower.  The following compares base 3 and base 9.  On small operands 9 will
389be nearly twice as fast, though at bigger sizes this reduces since in the
390current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
391limbs) and those divisions come to dominate.
392
393        ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
394
395mpn_set_str does a string to binary conversion.  The base is specified with
396a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
397general bases on large conversions.
398
399	./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
400
401mpn_set_str also has some special case code for decimal which is a bit
402faster than the general case, basically by giving the compiler a chance to
403optimize some multiplications by 10.
404
405	./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
406
407
408
409
410EXAMPLE COMPARISONS - GCDs
411
412mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
413x and y are single limbs.  This isn't tuned currently, but a value can be
414established by a measurement like
415
416	./speed -s 10-32 mpn_gcd_1.10
417
418This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
419threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
420is done by nibbling away at the 32-bit operands bit-by-bit.  When the
421threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
42210x10 bit operation.
423
424The threshold in mpn/generic/gcd_1.c or the various assembler
425implementations can be tweaked up or down until there's no more speedups on
426interesting combinations of sizes.  Note that this affects only a 1x1 limb
427operation and so isn't very important.  (An Nx1 limb operation always does
428an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
429
430
431
432
433SPEED PROGRAM EXTENSIONS
434
435Potentially lots of things could be made available in the program, but it's
436been left at only the things that have actually been wanted and are likely
437to be reasonably useful in the future.
438
439Extensions should be fairly easy to make though.  speed-ext.c is an example,
440in a style that should suit one-off tests, or new code fragments under
441development.
442
443many.pl is a script for generating a new speed program supplemented with
444alternate versions of the standard routines.  It can be used for measuring
445experimental code, or for comparing different implementations that exist
446within a CPU family.
447
448
449
450
451THRESHOLD EXAMINING
452
453The speed program can be used to examine the speeds of different algorithms
454to check the tune program has done the right thing.  For example to examine
455the karatsuba multiply threshold,
456
457	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
458
459When examining the toom3 threshold, remember it depends on the karatsuba
460threshold, so the right karatsuba threshold needs to be compiled into the
461library first.  The tune program uses specially recompiled versions of
462mpn/mul_n.c etc for this reason, but the speed program simply uses the
463normal libgmp.la.
464
465Note further that the various routines may recurse into themselves on sizes
466far enough above applicable thresholds.  For example, mpn_kara_mul_n will
467recurse into itself on sizes greater than twice the compiled-in
468MUL_TOOM22_THRESHOLD.
469
470When doing the above comparison between mul_basecase and kara_mul_n what's
471probably of interest is mul_basecase versus a kara_mul_n that does one level
472of Karatsuba then calls to mul_basecase, but this only happens on sizes less
473than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
474setting can be compiled-in to avoid the problem if necessary.  The same
475applies to toom3 and DC, though in a trickier fashion.
476
477There are some upper limits on some of the thresholds, arising from arrays
478dimensioned according to a threshold (mpn_mul_n), or asm code with certain
479sized displacements (some x86 versions of sqr_basecase).  So putting huge
480values for the thresholds, even just for testing, may fail.
481
482
483
484
485FUTURE
486
487Make a program to check the time base is working properly, for small and
488large measurements.  Make it able to test each available method, including
489perhaps the apparent resolution of each.
490
491Make a general mechanism for specifying operand overlap, and a syntax like
492maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
493sort of thing with the "r" parameter currently.
494
495
496
497----------------
498Local variables:
499mode: text
500fill-column: 76
501End:
502