1\documentclass[b5paper]{book}
2\usepackage{hyperref}
3\usepackage{makeidx}
4\usepackage{amssymb}
5\usepackage{color}
6\usepackage{alltt}
7\usepackage{graphicx}
8\usepackage{layout}
9\def\union{\cup}
10\def\intersect{\cap}
11\def\getsrandom{\stackrel{\rm R}{\gets}}
12\def\cross{\times}
13\def\cat{\hspace{0.5em} \| \hspace{0.5em}}
14\def\catn{$\|$}
15\def\divides{\hspace{0.3em} | \hspace{0.3em}}
16\def\nequiv{\not\equiv}
17\def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}}
18\def\lcm{{\rm lcm}}
19\def\gcd{{\rm gcd}}
20\def\log{{\rm log}}
21\def\ord{{\rm ord}}
22\def\abs{{\mathit abs}}
23\def\rep{{\mathit rep}}
24\def\mod{{\mathit\ mod\ }}
25\renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})}
26\newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor}
27\newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil}
28\def\Or{{\rm\ or\ }}
29\def\And{{\rm\ and\ }}
30\def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}}
31\def\implies{\Rightarrow}
32\def\undefined{{\rm ``undefined"}}
33\def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}}
34\let\oldphi\phi
35\def\phi{\varphi}
36\def\Pr{{\rm Pr}}
37\newcommand{\str}[1]{{\mathbf{#1}}}
38\def\F{{\mathbb F}}
39\def\N{{\mathbb N}}
40\def\Z{{\mathbb Z}}
41\def\R{{\mathbb R}}
42\def\C{{\mathbb C}}
43\def\Q{{\mathbb Q}}
44\definecolor{DGray}{gray}{0.5}
45\newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}}
46\def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}}
47\def\gap{\vspace{0.5ex}}
48\makeindex
49\begin{document}
50\frontmatter
51\pagestyle{empty}
52\title{TomsFastMath User Manual \\ v0.13.1}
53\author{Tom St Denis \\ tomstdenis@gmail.com}
54\maketitle
55This text and library are all hereby placed in the public domain.  This book has been formatted for B5
56[176x250] paper using the \LaTeX{} {\em book} macro package.
57
58\vspace{13cm}
59
60\begin{flushleft}This project was sponsored in part by
61
62Secure Science Corporation \url{http://www.securescience.net}.
63\end{flushleft}
64
65\tableofcontents
66\listoffigures
67\mainmatter
68\pagestyle{headings}
69\chapter{Introduction}
70\section{What is TomsFastMath?}
71
72TomsFastMath is meant to be a very fast yet still fairly portable and easy to port large
73integer arithmetic library written in ISO C.  The goal specifically is to be able to perform
74very fast modular exponentiations and other related functions required for ECC, DH and RSA
75cryptosystems.
76
77Most of the library is pure ISO C portable source code while a small portion (three files) contain
78a mixture of ISO C and assembler inline fragments.  Compared to LibTomMath this new library is
79meant to be much faster while sacrificing flexibiltiy.  This is accomplished through several means.
80
81\begin{enumerate}
82   \item The new code is slightly messier and contains asm blocks.
83   \item This uses fixed not multiple precision integers.
84   \item It is designed only for fast modular exponentiations [e.g. less flexibility].
85\end{enumerate}
86
87To mitigate some of the problems that arise from using assembler it has been carefully and
88appropriately used where it would make the most gain in performance.  Also we use macro's
89for assembler code which allows new ports to be inserted easily.
90
91The new code uses fixed precision arithmetic which means at compile time you choose a maximum
92precision and all numbers are limited to that.  This has the benefit of not requiring any
93memory heap operations (which are slow) in any of the functions.  It has the downside that
94integers that are too large are truncated.
95
96The goal of this library is to be able to perform modular exponentiations (with an odd modulus) very
97fast.  This is what takes the most time in systems such as RSA and DH.  This also requires
98fast multiplication and squaring and has the side effect of speeding up ECC operations as well.
99
100\section{License}
101TomsFastMath is public domain.
102
103\section{Building}
104To build the library simply type ``make''.  Or to install in typical *unix like directories use
105``make install''.  Similarly a shared library can be built with ``make -f makefile.shared install''.
106
107You can build the test program with ``make test''.  To perform simple static testing (useful to
108test out new assembly ports) use the stest program.  Type ``make stest'' and run it on your
109target.  The program will perform three multiplications, squarings and montgomery reductions.
110Likely if your assembly code is invalid this code will exhibit the bug.
111
112\subsection{Intel CC}
113In theory you should be able to build the library with
114
115\begin{verbatim}
116CFLAGS="-O3 -ip" CC=icc make IGNORE_SPEED=1
117\end{verbatim}
118
119However, Intels inline assembler is way less advanced than GCCs.  As a result it doesn't compile.
120Fortunately it doesn't really matter.
121
122\subsection{MSVC}
123The library doesn't build with MSVC.  Imagine that.
124
125\subsection{Build Limitations}
126TomsFastMath has the following build requirements which are non--portable but under most
127circumstances not problematic.
128
129\begin{enumerate}
130\item ``CHAR\_BIT'' must be eight.
131\item The ``fp\_digit'' type must be a multiple of eight bits long.
132\item The ``fp\_word'' must be at least twice the length of fp\_digit.
133\end{enumerate}
134
135\subsection{Optimization Configuration}
136By default TFM is configured for 32--bit digits using ISO C source code.  This mode while portable
137is not very efficient.  While building the library (from scratch) you can define one of
138several ``CFLAGS'' defines.
139
140For example, to build with with SSE2 optimizations type
141
142\begin{verbatim}
143CFLAGS=-DTFM_SSE2 make clean libtfm.a
144\end{verbatim}
145
146\subsubsection{x86--32}  The ``x86--32'' mode is defined by ``TFM\_X86'' and covers all
147i386 and beyond processors.  It requires GCC to build and only works with 32--bit digits.  In this
148mode fp\_digit is 32--bits and fp\_word is 64--bits.  This mode will be autodetected when building
149with GCC to an  ``i386'' target.  You can override this behaviour by defining TFM\_NO\_ASM or
150another optimization mode (such as SSE2).
151
152\subsubsection{SSE2} The ``SSE2'' mode is defined by ``TFM\_SSE2'' and requires a Pentium 4, Pentium
153M or Athlon64 processor.  It requires GCC to build.  Note that you shouldn't define both
154TFM\_X86 and TFM\_SSE2 at the same time.   This mode only works with 32--bit digits.  In this
155mode fp\_digit is 32--bits and fp\_word is 64--bits.  While this mode will work on the AMD Athlon64
156series of processors it is less efficient than the native ``x86--64'' mode and not recommended.
157
158There is an additional ``TFM\_PRESCOTT'' flag that you can define for P4 Prescott processors.  This causes
159the mul/sqr functions to use x86\_32 and the montgomery reduction to use SSE2 which is (so far) the fastest
160combination.  If you are using an older (e.g. Northwood) generation P4 don't define this.
161
162\subsubsection{x86--64}  The ``x86--64'' mode is defined by ``TFM\_X86\_64'' and requires a
163``x86--64'' capable processor (Athlon64 and future Pentium processors).  It requires GCC to
164build and only works with 64--bit digits.  Note that by enabling this mode it will automatically
165enable 64--bit digits.  In this mode fp\_digit is 64--bits and fp\_word is 128--bits.  This mode will
166be autodetected when building with GCC to an ``x86--64'' target.  You can override this behaviour by defining
167TFM\_NO\_ASM.
168
169\subsubsection{ARM}  The ``ARM'' mode is defined by ``TFM\_ARM'' and requires a ARMv4 with the M instructions (enhanced
170multipliers) or higher processor.  It requires GCC and works with 32--bit digits.  In this mode fp\_digit is 32--bits and
171fp\_word is 64--bits.
172
173\subsubsection{PPC32} The ``PPC32'' mode is defined by ``TFM\_PPC32'' and requires a standard PPC processor.  It doesn't
174use altivec or other extensions so it should work on all compliant implementations of PPC.  It requires GCC and works
175with 32--bit digits.  In this mode fp\_digit is 32--bits and fp\_word is 64--bits.
176
177\subsubsection{PPC64} The ``PPC64'' mode is defined by ``TFM\_PPC64'' and requires a 64--bit PPC processor.
178
179\subsubsection{AVR32} The ``AVR32'' mode is defined by ``TFM\_AVR32'' and requires an Atmel AVR32 processor.
180
181\subsubsection{Future Releases}  Future releases will support additional platform optimizations.
182Developers of MIPS and SPARC platforms are encouraged to submit GCC asm inline patches
183(see chapter \ref{chap:asmops} for more information).
184
185\begin{figure}[here]
186\begin{small}
187\begin{center}
188\begin{tabular}{|l|l|}
189\hline \textbf{Processor} & \textbf{Recommended Mode} \\
190\hline All 32--bit x86 platforms  & TFM\_X86 \\
191\hline Pentium 4                  & TFM\_SSE2 \\
192\hline Pentium 4 Prescott         & TFM\_SSE2 + TFM\_PRESCOTT \\
193\hline Athlon64                   & TFM\_X86\_64 \\
194\hline ARMv4 or higher with M     & TFM\_ARM \\
195\hline G3/G4 (32-bit PPC)         & TFM\_PPC32 \\
196\hline G5 (64-bit PPC)            & TFM\_PPC64 \\
197\hline Atmel AVR32                & TFM\_AVR32 \\
198\hline &\\
199\hline x86--32 or x86--64 (with GCC) & Leave blank and let autodetect work \\
200\hline
201\end{tabular}
202\caption{Recommended Build Modes}
203\end{center}
204\end{small}
205\end{figure}
206
207\subsection{Build Configurations}
208TomsFastMath is configurable in terms of which unrolled code (if any) is included.  By default, the majority of the code is included which
209results in large binaries.  The first flag to try out is TFM\_ALREADY\_SET which tells TFM to turn off \textbf{all} unrolled code.  This will
210result in a smaller library but also a much slower library.
211
212From this clean state, you can start enabling unrolled code for given cryptographic tasks at hand.  A series of TFM\_MULXYZ and TFM\_SQRXYZ macros
213exist to enable specific unrolled code.  For instance, TFM\_MUL32 will enable a 32 digit unrolled multiplier.  For a complete list see the tfm.h header
214file.  Keep in mind this is for digits not bits.  For example, you should enable TFM\_MUL16 if you are doing 1024-bit exptmods on a 64--bit platform, enable
215TFM\_MUL32 on 32--bit platforms.
216
217To help developers use ECC there are a set of defines for the five NIST curve sizes.  They are named TFM\_ECCXYZ where XYZ is one of 192, 224, 256, 384, or 521.  These
218enable the multipliers and squaring code for a given curve, autodetecting 64--bit platforms as well.
219
220\subsection{Precision Configuration}
221The precision of all integers in this library are fixed to a limited precision.  Essentially
222the rule of setting the precision is if you plan on doing modular exponentiation with $k$--bit
223numbers than the precision must be fixed to $2k$--bits plus four digits.
224
225This is changed by altering the value of ``FP\_MAX\_SIZE'' in tfm.h to your desired size.  By default,
226the library is configured to handle upto 2048--bit inputs to the modular exponentiator.
227
228\chapter{Getting Started}
229\section{Data Types}
230TomsFastMath is a large fixed precision integer library.  It provides the functionality to
231manipulate large signed integers through a relatively trivial api and a single data type.
232
233The ``fp\_int'' or fixed precision integer is the data type that the functions operate with.
234
235\begin{verbatim}
236typedef struct {
237    fp_digit dp[FP_SIZE];
238    int      used,
239             sign;
240} fp_int;
241\end{verbatim}
242
243The \textbf{dp} member is the array of digits that forms the number.  It must always be zero
244padded.  The \textbf{used} member is the count of digits used in the array.  Although the
245precision is fixed the algorithms are still tuned to not process the entire array if it
246does not have to.  The \textbf{sign} indicates the sign of the integer.  It is \textbf{FP\_ZPOS} (0)
247if the integer is zero or positive and \textbf{FP\_NEG} (1) otherwise.
248
249\section{Initialization}
250\subsection{Simple Initialization}
251To initialize an integer to the default state of zero use the fp\_init() function.
252
253\index{fp\_init}
254\begin{verbatim}
255void fp_init(fp_int *a);
256\end{verbatim}
257
258This will initialize the fp\_int $a$ to zero.  Note that the function fp\_zero() is an alias
259for fp\_init().
260
261\subsection{Initialize Small Constants}
262To initialize an integer with a small single digit value use the fp\_set() function.
263
264\index{fp\_set}
265\begin{verbatim}
266void fp_set(fp_int *a, fp_digit b);
267\end{verbatim}
268
269This will initialize $a$ and set it equal to the digit $b$.
270
271\subsection{Initialize Copy}
272To initialize an integer with a copy of another integer use the fp\_init\_copy() function.
273
274\index{fp\_init\_copy}
275\begin{verbatim}
276void fp_init_copy(fp_int *a, fp_int *b)
277\end{verbatim}
278
279This will initialize $a$ as a copy of $b$.  Note that for compatibility with LibTomMath the function
280fp\_copy() is also provided.
281
282\subsection{Initialization with a random value}
283To initialize an integer with a random value of a specific length use the fp\_rand() function.
284
285\index{fp\_rand}
286\begin{verbatim}
287void fp_rand(fp_int *a, int digits)
288\end{verbatim}
289
290This will initialize $a$ with $digits$ random digits.
291
292\index{FP\_GEN\_RANDOM} \index{FP\_GEN\_RANDOM\_MAX}
293The source of the random data is \textbf{arc4random()} on *BSD systems that provide this function
294and the standard C function \textbf{rand()} on all other systems. It can be configured at compile time
295by pre-defining \textbf{FP\_GEN\_RANDOM} and \textbf{FP\_GEN\_RANDOM\_MAX}.
296
297\chapter{Arithmetic Operations}
298\section{Odds and Evens}
299To quickly and easily tell if an integer is zero, odd or even use the following functions.
300
301\index{fp\_iszero} \index{fp\_iseven} \index{fp\_isodd}
302\begin{verbatim}
303int fp_iszero(fp_int *a);
304int fp_iseven(fp_int *a);
305int fp_isodd(fp_int *a);
306\end{verbatim}
307
308These will return \textbf{FP\_YES} if the answer to their respective questions is yes.  Otherwise they
309return \textbf{FP\_NO}.  Note that these are implemented as macros and as such you should avoid using
310++ or --~-- operators on the input operand.
311
312\section{Sign Manipulation}
313To negate or compute the absolute of an integer use the following functions.
314
315\index{fp\_neg} \index{fp\_abs}
316\begin{verbatim}
317void fp_neg(fp_int *a, fp_int *b);
318void fp_abs(fp_int *a, fp_int *b);
319\end{verbatim}
320This will compute the negation (or absolute) of $a$ and store the result in $b$.  Note that these
321are implemented as macros and as such you should avoid using ++ or --~-- operators on the input
322operand.
323
324\section{Comparisons}
325To perform signed or unsigned comparisons use following functions.
326
327\index{fp\_cmp} \index{fp\_cmp\_mag}
328\begin{verbatim}
329int fp_cmp(fp_int *a, fp_int *b);
330int fp_cmp_mag(fp_int *a, fp_int *b);
331\end{verbatim}
332These will compare $a$ to $b$.  They will return \textbf{FP\_GT} if $a$ is larger than $b$,
333\textbf{FP\_EQ} if they are equal and \textbf{FP\_LT} if $a$ is less than $b$.
334
335The function fp\_cmp performs signed comparisons while the other performs unsigned comparisons.
336
337\section{Shifting}
338To shift the digits of an fp\_int left or right use the following functions.
339
340\index{fp\_lshd} \index{fp\_rshd}
341\begin{verbatim}
342void fp_lshd(fp_int *a, int x);
343void fp_rshd(fp_int *a, int x);
344\end{verbatim}
345
346These will shift the digits of $a$ left (or right respectively) $x$ digits.
347
348To shift individual bits of an fp\_int use the following functions.
349
350\index{fp\_div\_2d} \index{fp\_mod\_2d} \index{fp\_mul\_2d} \index{fp\_div\_2} \index{fp\_mul\_2}
351\begin{verbatim}
352void fp_div_2d(fp_int *a, int b, fp_int *c, fp_int *d);
353void fp_mod_2d(fp_int *a, int b, fp_int *c);
354void fp_mul_2d(fp_int *a, int b, fp_int *c);
355void fp_mul_2(fp_int *a, fp_int *c);
356void fp_div_2(fp_int *a, fp_int *c);
357void fp_2expt(fp_int *a, int b);
358\end{verbatim}
359fp\_div\_2d() will divide $a$ by $2^b$ and store the quotient in $c$ and remainder in $d$.  Either of
360$c$ or $d$ can be \textbf{NULL} if their value is not required.  fp\_mod\_2d() is a shortcut to
361compute the remainder directly.  fp\_mul\_2d() will multiply $a$ by $2^b$ and store the result in $c$.
362
363The fp\_mul\_2() and fp\_div\_2() functions are optimized multiplication and divisions by two.  The
364function fp\_2expt() will compute $a = 2^b$ quickly.
365
366To quickly count the number of least significant bits that are zero use the following function.
367
368\index{fp\_cnt\_lsb}
369\begin{verbatim}
370int fp_cnt_lsb(fp_int *a);
371\end{verbatim}
372This will return the number of adjacent least significant bits that are zero.  This is equivalent
373to the number of times two evenly divides $a$.
374
375\section{Basic Algebra}
376
377The following functions round out the basic algebraic functionality of the library.
378
379\index{fp\_add} \index{fp\_sub} \index{fp\_mul} \index{fp\_sqr} \index{fp\_div} \index{fp\_mod}
380\begin{verbatim}
381void fp_add(fp_int *a, fp_int *b, fp_int *c);
382void fp_sub(fp_int *a, fp_int *b, fp_int *c);
383void fp_mul(fp_int *a, fp_int *b, fp_int *c);
384void fp_sqr(fp_int *a, fp_int *b);
385int fp_div(fp_int *a, fp_int *b, fp_int *c, fp_int *d);
386int fp_mod(fp_int *a, fp_int *b, fp_int *c);
387\end{verbatim}
388
389The functions fp\_add(), fp\_sub() and fp\_mul() perform their respective operations on $a$ and
390$b$ and store the result in $c$.  The function fp\_sqr() computes $b = a^2$ and is faster than
391using fp\_mul() to perform the same operation.
392
393The function fp\_div() divides $a$ by $b$ and stores the quotient in $c$ and remainder in $d$.  Either
394of $c$ and $d$ can be \textbf{NULL} if the result is not required.  The function fp\_mod() is a simple
395shortcut to find the remainder.
396
397\section{Modular Exponentiation}
398To compute a modular exponentiation use the following function.
399
400\index{fp\_exptmod}
401\begin{verbatim}
402int fp_exptmod(fp_int *a, fp_int *b, fp_int *c, fp_int *d);
403\end{verbatim}
404This computes $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for any odd $c$ and $b$.  $b$ may be negative so long as
405$a^{-1} \mbox{ (mod }c\mbox{)}$ exists.  The initial value of $a$ may be larger than $c$.  The size of $c$ must be
406half of the maximum precision used during the build of the library.  For example, by default $c$ must be less
407than $2^{2048}$.
408
409\section{Number Theoretic}
410
411To perform modular inverses, greatest common divisor or least common multiples use the following
412functions.
413
414\index{fp\_invmod} \index{fp\_gcd} \index{fp\_lcm}
415\begin{verbatim}
416int fp_invmod(fp_int *a, fp_int *b, fp_int *c);
417void fp_gcd(fp_int *a, fp_int *b, fp_int *c);
418void fp_lcm(fp_int *a, fp_int *b, fp_int *c);
419\end{verbatim}
420
421The fp\_invmod() function will find the modular inverse of $a$ modulo an odd modulus $b$ and store
422it in $c$ (provided it exists).  The function fp\_gcd() will compute the greatest common
423divisor of $a$ and $b$ and store it in $c$.  Similarly the fp\_lcm() function will compute
424the least common multiple of $a$ and $b$ and store it in $c$.
425
426\section{Prime Numbers}
427To quickly test a number for primality call this function.
428
429\index{fp\_isprime}
430\index{fp\_isprime\_ex}
431\begin{verbatim}
432int fp_isprime_ex(fp_int *a, int t);
433\end{verbatim}
434This will return \textbf{FP\_YES} if $a$ is probably prime.  It uses 256 trial divisions and
435$t$ rounds of Rabin-Miller testing.  Note that this routine performs modular exponentiations
436which means that $a$ must be in a valid range of precision.
437The valid range of $t$ is $1 \ldots 256$.
438
439For backwards compatibility reasons the API function fp\_isprime(a) is still available
440and simply calls fp\_isprime\_ex(a, 8).
441
442\chapter{Porting TomsFastMath}
443\label{chap:asmops}
444\section{Getting Started}
445Porting TomsFastMath to a given processor target is usually a simple procedure.  For the most part
446assembly is used to get around the lack of a ``add with carry'' operation in the C language.  To
447make matters simpler the use of assembler is through macro blocks.
448
449Each ``port'' is defined by a block of code that re-defines the portable ISO C macros with assembler
450inline blocks.  To add a new port you must designate a TFM\_XXX define that will enable your
451port when built.
452
453\section{Multiply with Comba}
454The file ``fp\_mul\_comba.c'' is responsible for providing the fast multiplication within the
455library.  This comba multiplication is fairly simple.  It uses a sliding three digit carry
456system with the variables $c0$, $c1$, $c2$.  For every digit of output $c0$ is the what will
457be that digit, $c1$ will carry into the next digit and $c2$ will be the ``c1'' carry for
458the next digit.  For every ``next'' digit effectively $c0$ is stored as output, $c1$ moves into
459$c0$, $c2$ into $c1$ and zero into $c2$.
460
461The following macros define the assmebler interface to the code.
462
463\begin{verbatim}
464#define COMBA_START
465\end{verbatim}
466
467This is issued at the beginning of the multiplication function.  This is in place to allow you to
468initialize any registers or machine words required.  You can leave it blank if you do not need
469it.
470
471\begin{verbatim}
472#define COMBA_CLEAR \
473   c0 = c1 = c2 = 0;
474\end{verbatim}
475
476This clears the three comba carries.  If you are going to place carries in registers then
477zero the appropriate registers.  Note that the functions do not use $c0$, $c1$ or $c2$ directly
478so you are free to ignore these varibles and use registers directly.
479
480\begin{verbatim}
481#define COMBA_FORWARD \
482   c0 = c1; c1 = c2; c2 = 0;
483\end{verbatim}
484
485This propagates the carries after a digit has been produced.
486
487\begin{verbatim}
488#define COMBA_STORE(x) \
489   x = c0;
490\end{verbatim}
491
492This stores the $c0$ digit in the memory location specified by $x$.  Note that if you manually
493aliased $c0$ with a register than just store that register in $x$.
494
495\begin{verbatim}
496#define COMBA_STORE2(x) \
497   x = c1;
498\end{verbatim}
499
500This stores the $c1$ digit in the memory location specified by $x$.  Note that if you manually
501aliased $c1$ with a register than just store that register in $x$.
502
503\begin{verbatim}
504#define COMBA_FINI
505\end{verbatim}
506
507If at the end of the function you need to perform some action fill this macro in.
508
509\begin{verbatim}
510#define MULADD(i, j)                                          \
511   t  = ((fp_word)i) * ((fp_word)j);                          \
512   c0 = (c0 + t);              if (c0 < ((fp_digit)t))  ++c1; \
513   c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2;
514\end{verbatim}
515
516This macro performs the ``multiply and add'' step that is central to the comba
517multiplier.  It multiplies the fp\_digits $i$ and $j$ to produce a fp\_word result.  Effectively
518the double--digit value is added to the three-digit carry formed by $c0$, $c1$, $c2$ where $c0$
519is the least significant digit.
520
521\section{Squaring with Comba}
522Squaring is similar to multiplication except that it uses a special ``multiply and add twice'' macro
523that replaces multiplications that are not required.
524
525\begin{verbatim}
526#define COMBA_START
527\end{verbatim}
528
529This allows for any initialization code you might have.
530
531\begin{verbatim}
532#define CLEAR_CARRY \
533   c0 = c1 = c2 = 0;
534\end{verbatim}
535
536This will clear the carries.  Like multiplication you can safely alias the three carry variables
537to registers if you can/want to.
538
539\begin{verbatim}
540#define COMBA_STORE(x) \
541   x = c0;
542\end{verbatim}
543
544Store the $c0$ carry to a given memory location.
545
546\begin{verbatim}
547#define COMBA_STORE2(x) \
548   x = c1;
549\end{verbatim}
550
551Store the $c1$ carry to a given memory location.
552
553\begin{verbatim}
554#define CARRY_FORWARD \
555   c0 = c1; c1 = c2; c2 = 0;
556\end{verbatim}
557
558Forward propagate all three carry variables.
559
560\begin{verbatim}
561#define COMBA_FINI
562\end{verbatim}
563
564If you need to clean up at the end of the function.
565
566\begin{verbatim}
567/* multiplies point i and j, updates carry "c1" and digit c2 */
568#define SQRADD(i, j)                       \
569   t  = ((fp_word)i) * ((fp_word)j);       \
570   c0 = (c0 + t);              if (c0 < ((fp_digit)t))  ++c1; \
571   c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2;
572\end{verbatim}
573
574This is essentially the MULADD macro from the multiplication code.
575
576\begin{verbatim}
577/* for squaring some of the terms are doubled... */
578#define SQRADD2(i, j)                       \
579   t  = ((fp_word)i) * ((fp_word)j);       \
580   c0 = (c0 + t);              if (c0 < ((fp_digit)t))  ++c1; \
581   c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; \
582   c0 = (c0 + t);              if (c0 < ((fp_digit)t))  ++c1; \
583   c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2;
584\end{verbatim}
585
586This is like SQRADD except it adds the produce twice.  It's similar to
587computing SQRADD(i, j*2).
588
589To further make things interesting the squaring code also has ``doubles'' (see my LTM book chapter five...) which are
590handled with these macros.
591
592\begin{verbatim}
593#define SQRADDSC(i, j)                                                         \
594   do { fp_word t;                                                             \
595      t =  ((fp_word)i) * ((fp_word)j);                                        \
596      sc0 = (fp_digit)t; sc1 = (t >> DIGIT_BIT); sc2 = 0;                      \
597   } while (0);
598\end{verbatim}
599This computes a product and stores it in the ``secondary'' carry registers $\left < sc0, sc1, sc2 \right >$.
600
601\begin{verbatim}
602#define SQRADDAC(i, j)                                                         \
603   do { fp_word t;                                                             \
604   t = sc0 + ((fp_word)i) * ((fp_word)j);  sc0 = t;                            \
605   t = sc1 + (t >> DIGIT_BIT);             sc1 = t; sc2 += t >> DIGIT_BIT;     \
606   } while (0);
607\end{verbatim}
608This computes a product and adds it to the ``secondary'' carry registers.
609
610\begin{verbatim}
611#define SQRADDDB                                                               \
612   do { fp_word t;                                                             \
613   t = ((fp_word)sc0) + ((fp_word)sc0) + c0; c0 = t;                                                 \
614   t = ((fp_word)sc1) + ((fp_word)sc1) + c1 + (t >> DIGIT_BIT); c1 = t;                              \
615   c2 = c2 + ((fp_word)sc2) + ((fp_word)sc2) + (t >> DIGIT_BIT);                                     \
616   } while (0);
617\end{verbatim}
618This doubles the ``secondary'' carry registers and adds the sum to the main carry registers.  Really complicated.
619
620\section{Montgomery with Comba}
621Montgomery reduction is used in modular exponentiation and is most called function during
622that operation.  It's important to make sure this routine is very fast or all is lost.
623
624Unlike the two other comba routines this one does not use a single three--digit carry
625system.  It does have three--digit carries except that the routine steps through them
626in the inner loop.  This means you cannot alias them to registers (at all).
627
628To make matters simple though the three arrays of carries are stored in one array.  The
629``c0'' array resides in $c[0 \ldots OFF1-1]$, ``c1'' in $c[OFF1 \ldots OFF2-1]$ and ``c2'' in
630$c[OFF2 \ldots OFF2+FP\_SIZE-1]$.
631
632\begin{verbatim}
633#define MONT_START
634\end{verbatim}
635
636This allows you to insert anything at the start that you need.
637
638\begin{verbatim}
639#define MONT_FINI
640\end{verbatim}
641
642This allows you to insert anything at the end that you need.
643
644\begin{verbatim}
645#define LOOP_START \
646   mu = c[x] * mp;
647\end{verbatim}
648
649This computes the $\mu$ value for the inner loop.  You can safely alias $mu$ and $mp$ to
650a register if you want.
651
652\begin{verbatim}
653#define INNERMUL                                      \
654   do { fp_word t;                                    \
655   _c[0] = t  = ((fp_word)_c[0] + (fp_word)cy) +      \
656                (((fp_word)mu) * ((fp_word)*tmpm++)); \
657   cy = (t >> DIGIT_BIT);                             \
658   } while (0)
659\end{verbatim}
660
661This computes the inner product and adds it to the destination and carry variable $cy$.
662This uses the $mu$ value computed above (can be in a register already) and the
663$cy$ which is a chaining carry.  Inside the INNERMUL loop the $cy$ value can be kept
664inside a register (hint: it always starts as $cy = 0$ in the first iteration).
665
666Upon completion of the inner loop the macro LOOP\_END is called which is used to fetch
667$cy$ into the variable the C program can see.  This is where, if you cached $cy$ in a
668register you would copy it to the locally accessible C variable.
669
670\begin{verbatim}
671#define PROPCARRY \
672   do { fp_digit t = _c[0] += cy; cy = (t < cy); } while (0)
673\end{verbatim}
674
675This propagates the carry upwards by one digit.
676
677\input{tfm.ind}
678
679\end{document}
680