1\documentclass[b5paper]{book} 2\usepackage{hyperref} 3\usepackage{makeidx} 4\usepackage{amssymb} 5\usepackage{color} 6\usepackage{alltt} 7\usepackage{graphicx} 8\usepackage{layout} 9\def\union{\cup} 10\def\intersect{\cap} 11\def\getsrandom{\stackrel{\rm R}{\gets}} 12\def\cross{\times} 13\def\cat{\hspace{0.5em} \| \hspace{0.5em}} 14\def\catn{$\|$} 15\def\divides{\hspace{0.3em} | \hspace{0.3em}} 16\def\nequiv{\not\equiv} 17\def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}} 18\def\lcm{{\rm lcm}} 19\def\gcd{{\rm gcd}} 20\def\log{{\rm log}} 21\def\ord{{\rm ord}} 22\def\abs{{\mathit abs}} 23\def\rep{{\mathit rep}} 24\def\mod{{\mathit\ mod\ }} 25\renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})} 26\newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor} 27\newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil} 28\def\Or{{\rm\ or\ }} 29\def\And{{\rm\ and\ }} 30\def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}} 31\def\implies{\Rightarrow} 32\def\undefined{{\rm ``undefined"}} 33\def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}} 34\let\oldphi\phi 35\def\phi{\varphi} 36\def\Pr{{\rm Pr}} 37\newcommand{\str}[1]{{\mathbf{#1}}} 38\def\F{{\mathbb F}} 39\def\N{{\mathbb N}} 40\def\Z{{\mathbb Z}} 41\def\R{{\mathbb R}} 42\def\C{{\mathbb C}} 43\def\Q{{\mathbb Q}} 44\definecolor{DGray}{gray}{0.5} 45\newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}} 46\def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}} 47\def\gap{\vspace{0.5ex}} 48\makeindex 49\begin{document} 50\frontmatter 51\pagestyle{empty} 52\title{TomsFastMath User Manual \\ v0.13.1} 53\author{Tom St Denis \\ tomstdenis@gmail.com} 54\maketitle 55This text and library are all hereby placed in the public domain. This book has been formatted for B5 56[176x250] paper using the \LaTeX{} {\em book} macro package. 57 58\vspace{13cm} 59 60\begin{flushleft}This project was sponsored in part by 61 62Secure Science Corporation \url{http://www.securescience.net}. 63\end{flushleft} 64 65\tableofcontents 66\listoffigures 67\mainmatter 68\pagestyle{headings} 69\chapter{Introduction} 70\section{What is TomsFastMath?} 71 72TomsFastMath is meant to be a very fast yet still fairly portable and easy to port large 73integer arithmetic library written in ISO C. The goal specifically is to be able to perform 74very fast modular exponentiations and other related functions required for ECC, DH and RSA 75cryptosystems. 76 77Most of the library is pure ISO C portable source code while a small portion (three files) contain 78a mixture of ISO C and assembler inline fragments. Compared to LibTomMath this new library is 79meant to be much faster while sacrificing flexibiltiy. This is accomplished through several means. 80 81\begin{enumerate} 82 \item The new code is slightly messier and contains asm blocks. 83 \item This uses fixed not multiple precision integers. 84 \item It is designed only for fast modular exponentiations [e.g. less flexibility]. 85\end{enumerate} 86 87To mitigate some of the problems that arise from using assembler it has been carefully and 88appropriately used where it would make the most gain in performance. Also we use macro's 89for assembler code which allows new ports to be inserted easily. 90 91The new code uses fixed precision arithmetic which means at compile time you choose a maximum 92precision and all numbers are limited to that. This has the benefit of not requiring any 93memory heap operations (which are slow) in any of the functions. It has the downside that 94integers that are too large are truncated. 95 96The goal of this library is to be able to perform modular exponentiations (with an odd modulus) very 97fast. This is what takes the most time in systems such as RSA and DH. This also requires 98fast multiplication and squaring and has the side effect of speeding up ECC operations as well. 99 100\section{License} 101TomsFastMath is public domain. 102 103\section{Building} 104To build the library simply type ``make''. Or to install in typical *unix like directories use 105``make install''. Similarly a shared library can be built with ``make -f makefile.shared install''. 106 107You can build the test program with ``make test''. To perform simple static testing (useful to 108test out new assembly ports) use the stest program. Type ``make stest'' and run it on your 109target. The program will perform three multiplications, squarings and montgomery reductions. 110Likely if your assembly code is invalid this code will exhibit the bug. 111 112\subsection{Intel CC} 113In theory you should be able to build the library with 114 115\begin{verbatim} 116CFLAGS="-O3 -ip" CC=icc make IGNORE_SPEED=1 117\end{verbatim} 118 119However, Intels inline assembler is way less advanced than GCCs. As a result it doesn't compile. 120Fortunately it doesn't really matter. 121 122\subsection{MSVC} 123The library doesn't build with MSVC. Imagine that. 124 125\subsection{Build Limitations} 126TomsFastMath has the following build requirements which are non--portable but under most 127circumstances not problematic. 128 129\begin{enumerate} 130\item ``CHAR\_BIT'' must be eight. 131\item The ``fp\_digit'' type must be a multiple of eight bits long. 132\item The ``fp\_word'' must be at least twice the length of fp\_digit. 133\end{enumerate} 134 135\subsection{Optimization Configuration} 136By default TFM is configured for 32--bit digits using ISO C source code. This mode while portable 137is not very efficient. While building the library (from scratch) you can define one of 138several ``CFLAGS'' defines. 139 140For example, to build with with SSE2 optimizations type 141 142\begin{verbatim} 143CFLAGS=-DTFM_SSE2 make clean libtfm.a 144\end{verbatim} 145 146\subsubsection{x86--32} The ``x86--32'' mode is defined by ``TFM\_X86'' and covers all 147i386 and beyond processors. It requires GCC to build and only works with 32--bit digits. In this 148mode fp\_digit is 32--bits and fp\_word is 64--bits. This mode will be autodetected when building 149with GCC to an ``i386'' target. You can override this behaviour by defining TFM\_NO\_ASM or 150another optimization mode (such as SSE2). 151 152\subsubsection{SSE2} The ``SSE2'' mode is defined by ``TFM\_SSE2'' and requires a Pentium 4, Pentium 153M or Athlon64 processor. It requires GCC to build. Note that you shouldn't define both 154TFM\_X86 and TFM\_SSE2 at the same time. This mode only works with 32--bit digits. In this 155mode fp\_digit is 32--bits and fp\_word is 64--bits. While this mode will work on the AMD Athlon64 156series of processors it is less efficient than the native ``x86--64'' mode and not recommended. 157 158There is an additional ``TFM\_PRESCOTT'' flag that you can define for P4 Prescott processors. This causes 159the mul/sqr functions to use x86\_32 and the montgomery reduction to use SSE2 which is (so far) the fastest 160combination. If you are using an older (e.g. Northwood) generation P4 don't define this. 161 162\subsubsection{x86--64} The ``x86--64'' mode is defined by ``TFM\_X86\_64'' and requires a 163``x86--64'' capable processor (Athlon64 and future Pentium processors). It requires GCC to 164build and only works with 64--bit digits. Note that by enabling this mode it will automatically 165enable 64--bit digits. In this mode fp\_digit is 64--bits and fp\_word is 128--bits. This mode will 166be autodetected when building with GCC to an ``x86--64'' target. You can override this behaviour by defining 167TFM\_NO\_ASM. 168 169\subsubsection{ARM} The ``ARM'' mode is defined by ``TFM\_ARM'' and requires a ARMv4 with the M instructions (enhanced 170multipliers) or higher processor. It requires GCC and works with 32--bit digits. In this mode fp\_digit is 32--bits and 171fp\_word is 64--bits. 172 173\subsubsection{PPC32} The ``PPC32'' mode is defined by ``TFM\_PPC32'' and requires a standard PPC processor. It doesn't 174use altivec or other extensions so it should work on all compliant implementations of PPC. It requires GCC and works 175with 32--bit digits. In this mode fp\_digit is 32--bits and fp\_word is 64--bits. 176 177\subsubsection{PPC64} The ``PPC64'' mode is defined by ``TFM\_PPC64'' and requires a 64--bit PPC processor. 178 179\subsubsection{AVR32} The ``AVR32'' mode is defined by ``TFM\_AVR32'' and requires an Atmel AVR32 processor. 180 181\subsubsection{Future Releases} Future releases will support additional platform optimizations. 182Developers of MIPS and SPARC platforms are encouraged to submit GCC asm inline patches 183(see chapter \ref{chap:asmops} for more information). 184 185\begin{figure}[here] 186\begin{small} 187\begin{center} 188\begin{tabular}{|l|l|} 189\hline \textbf{Processor} & \textbf{Recommended Mode} \\ 190\hline All 32--bit x86 platforms & TFM\_X86 \\ 191\hline Pentium 4 & TFM\_SSE2 \\ 192\hline Pentium 4 Prescott & TFM\_SSE2 + TFM\_PRESCOTT \\ 193\hline Athlon64 & TFM\_X86\_64 \\ 194\hline ARMv4 or higher with M & TFM\_ARM \\ 195\hline G3/G4 (32-bit PPC) & TFM\_PPC32 \\ 196\hline G5 (64-bit PPC) & TFM\_PPC64 \\ 197\hline Atmel AVR32 & TFM\_AVR32 \\ 198\hline &\\ 199\hline x86--32 or x86--64 (with GCC) & Leave blank and let autodetect work \\ 200\hline 201\end{tabular} 202\caption{Recommended Build Modes} 203\end{center} 204\end{small} 205\end{figure} 206 207\subsection{Build Configurations} 208TomsFastMath is configurable in terms of which unrolled code (if any) is included. By default, the majority of the code is included which 209results in large binaries. The first flag to try out is TFM\_ALREADY\_SET which tells TFM to turn off \textbf{all} unrolled code. This will 210result in a smaller library but also a much slower library. 211 212From this clean state, you can start enabling unrolled code for given cryptographic tasks at hand. A series of TFM\_MULXYZ and TFM\_SQRXYZ macros 213exist to enable specific unrolled code. For instance, TFM\_MUL32 will enable a 32 digit unrolled multiplier. For a complete list see the tfm.h header 214file. Keep in mind this is for digits not bits. For example, you should enable TFM\_MUL16 if you are doing 1024-bit exptmods on a 64--bit platform, enable 215TFM\_MUL32 on 32--bit platforms. 216 217To help developers use ECC there are a set of defines for the five NIST curve sizes. They are named TFM\_ECCXYZ where XYZ is one of 192, 224, 256, 384, or 521. These 218enable the multipliers and squaring code for a given curve, autodetecting 64--bit platforms as well. 219 220\subsection{Precision Configuration} 221The precision of all integers in this library are fixed to a limited precision. Essentially 222the rule of setting the precision is if you plan on doing modular exponentiation with $k$--bit 223numbers than the precision must be fixed to $2k$--bits plus four digits. 224 225This is changed by altering the value of ``FP\_MAX\_SIZE'' in tfm.h to your desired size. By default, 226the library is configured to handle upto 2048--bit inputs to the modular exponentiator. 227 228\chapter{Getting Started} 229\section{Data Types} 230TomsFastMath is a large fixed precision integer library. It provides the functionality to 231manipulate large signed integers through a relatively trivial api and a single data type. 232 233The ``fp\_int'' or fixed precision integer is the data type that the functions operate with. 234 235\begin{verbatim} 236typedef struct { 237 fp_digit dp[FP_SIZE]; 238 int used, 239 sign; 240} fp_int; 241\end{verbatim} 242 243The \textbf{dp} member is the array of digits that forms the number. It must always be zero 244padded. The \textbf{used} member is the count of digits used in the array. Although the 245precision is fixed the algorithms are still tuned to not process the entire array if it 246does not have to. The \textbf{sign} indicates the sign of the integer. It is \textbf{FP\_ZPOS} (0) 247if the integer is zero or positive and \textbf{FP\_NEG} (1) otherwise. 248 249\section{Initialization} 250\subsection{Simple Initialization} 251To initialize an integer to the default state of zero use the fp\_init() function. 252 253\index{fp\_init} 254\begin{verbatim} 255void fp_init(fp_int *a); 256\end{verbatim} 257 258This will initialize the fp\_int $a$ to zero. Note that the function fp\_zero() is an alias 259for fp\_init(). 260 261\subsection{Initialize Small Constants} 262To initialize an integer with a small single digit value use the fp\_set() function. 263 264\index{fp\_set} 265\begin{verbatim} 266void fp_set(fp_int *a, fp_digit b); 267\end{verbatim} 268 269This will initialize $a$ and set it equal to the digit $b$. 270 271\subsection{Initialize Copy} 272To initialize an integer with a copy of another integer use the fp\_init\_copy() function. 273 274\index{fp\_init\_copy} 275\begin{verbatim} 276void fp_init_copy(fp_int *a, fp_int *b) 277\end{verbatim} 278 279This will initialize $a$ as a copy of $b$. Note that for compatibility with LibTomMath the function 280fp\_copy() is also provided. 281 282\subsection{Initialization with a random value} 283To initialize an integer with a random value of a specific length use the fp\_rand() function. 284 285\index{fp\_rand} 286\begin{verbatim} 287void fp_rand(fp_int *a, int digits) 288\end{verbatim} 289 290This will initialize $a$ with $digits$ random digits. 291 292\index{FP\_GEN\_RANDOM} \index{FP\_GEN\_RANDOM\_MAX} 293The source of the random data is \textbf{arc4random()} on *BSD systems that provide this function 294and the standard C function \textbf{rand()} on all other systems. It can be configured at compile time 295by pre-defining \textbf{FP\_GEN\_RANDOM} and \textbf{FP\_GEN\_RANDOM\_MAX}. 296 297\chapter{Arithmetic Operations} 298\section{Odds and Evens} 299To quickly and easily tell if an integer is zero, odd or even use the following functions. 300 301\index{fp\_iszero} \index{fp\_iseven} \index{fp\_isodd} 302\begin{verbatim} 303int fp_iszero(fp_int *a); 304int fp_iseven(fp_int *a); 305int fp_isodd(fp_int *a); 306\end{verbatim} 307 308These will return \textbf{FP\_YES} if the answer to their respective questions is yes. Otherwise they 309return \textbf{FP\_NO}. Note that these are implemented as macros and as such you should avoid using 310++ or --~-- operators on the input operand. 311 312\section{Sign Manipulation} 313To negate or compute the absolute of an integer use the following functions. 314 315\index{fp\_neg} \index{fp\_abs} 316\begin{verbatim} 317void fp_neg(fp_int *a, fp_int *b); 318void fp_abs(fp_int *a, fp_int *b); 319\end{verbatim} 320This will compute the negation (or absolute) of $a$ and store the result in $b$. Note that these 321are implemented as macros and as such you should avoid using ++ or --~-- operators on the input 322operand. 323 324\section{Comparisons} 325To perform signed or unsigned comparisons use following functions. 326 327\index{fp\_cmp} \index{fp\_cmp\_mag} 328\begin{verbatim} 329int fp_cmp(fp_int *a, fp_int *b); 330int fp_cmp_mag(fp_int *a, fp_int *b); 331\end{verbatim} 332These will compare $a$ to $b$. They will return \textbf{FP\_GT} if $a$ is larger than $b$, 333\textbf{FP\_EQ} if they are equal and \textbf{FP\_LT} if $a$ is less than $b$. 334 335The function fp\_cmp performs signed comparisons while the other performs unsigned comparisons. 336 337\section{Shifting} 338To shift the digits of an fp\_int left or right use the following functions. 339 340\index{fp\_lshd} \index{fp\_rshd} 341\begin{verbatim} 342void fp_lshd(fp_int *a, int x); 343void fp_rshd(fp_int *a, int x); 344\end{verbatim} 345 346These will shift the digits of $a$ left (or right respectively) $x$ digits. 347 348To shift individual bits of an fp\_int use the following functions. 349 350\index{fp\_div\_2d} \index{fp\_mod\_2d} \index{fp\_mul\_2d} \index{fp\_div\_2} \index{fp\_mul\_2} 351\begin{verbatim} 352void fp_div_2d(fp_int *a, int b, fp_int *c, fp_int *d); 353void fp_mod_2d(fp_int *a, int b, fp_int *c); 354void fp_mul_2d(fp_int *a, int b, fp_int *c); 355void fp_mul_2(fp_int *a, fp_int *c); 356void fp_div_2(fp_int *a, fp_int *c); 357void fp_2expt(fp_int *a, int b); 358\end{verbatim} 359fp\_div\_2d() will divide $a$ by $2^b$ and store the quotient in $c$ and remainder in $d$. Either of 360$c$ or $d$ can be \textbf{NULL} if their value is not required. fp\_mod\_2d() is a shortcut to 361compute the remainder directly. fp\_mul\_2d() will multiply $a$ by $2^b$ and store the result in $c$. 362 363The fp\_mul\_2() and fp\_div\_2() functions are optimized multiplication and divisions by two. The 364function fp\_2expt() will compute $a = 2^b$ quickly. 365 366To quickly count the number of least significant bits that are zero use the following function. 367 368\index{fp\_cnt\_lsb} 369\begin{verbatim} 370int fp_cnt_lsb(fp_int *a); 371\end{verbatim} 372This will return the number of adjacent least significant bits that are zero. This is equivalent 373to the number of times two evenly divides $a$. 374 375\section{Basic Algebra} 376 377The following functions round out the basic algebraic functionality of the library. 378 379\index{fp\_add} \index{fp\_sub} \index{fp\_mul} \index{fp\_sqr} \index{fp\_div} \index{fp\_mod} 380\begin{verbatim} 381void fp_add(fp_int *a, fp_int *b, fp_int *c); 382void fp_sub(fp_int *a, fp_int *b, fp_int *c); 383void fp_mul(fp_int *a, fp_int *b, fp_int *c); 384void fp_sqr(fp_int *a, fp_int *b); 385int fp_div(fp_int *a, fp_int *b, fp_int *c, fp_int *d); 386int fp_mod(fp_int *a, fp_int *b, fp_int *c); 387\end{verbatim} 388 389The functions fp\_add(), fp\_sub() and fp\_mul() perform their respective operations on $a$ and 390$b$ and store the result in $c$. The function fp\_sqr() computes $b = a^2$ and is faster than 391using fp\_mul() to perform the same operation. 392 393The function fp\_div() divides $a$ by $b$ and stores the quotient in $c$ and remainder in $d$. Either 394of $c$ and $d$ can be \textbf{NULL} if the result is not required. The function fp\_mod() is a simple 395shortcut to find the remainder. 396 397\section{Modular Exponentiation} 398To compute a modular exponentiation use the following function. 399 400\index{fp\_exptmod} 401\begin{verbatim} 402int fp_exptmod(fp_int *a, fp_int *b, fp_int *c, fp_int *d); 403\end{verbatim} 404This computes $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for any odd $c$ and $b$. $b$ may be negative so long as 405$a^{-1} \mbox{ (mod }c\mbox{)}$ exists. The initial value of $a$ may be larger than $c$. The size of $c$ must be 406half of the maximum precision used during the build of the library. For example, by default $c$ must be less 407than $2^{2048}$. 408 409\section{Number Theoretic} 410 411To perform modular inverses, greatest common divisor or least common multiples use the following 412functions. 413 414\index{fp\_invmod} \index{fp\_gcd} \index{fp\_lcm} 415\begin{verbatim} 416int fp_invmod(fp_int *a, fp_int *b, fp_int *c); 417void fp_gcd(fp_int *a, fp_int *b, fp_int *c); 418void fp_lcm(fp_int *a, fp_int *b, fp_int *c); 419\end{verbatim} 420 421The fp\_invmod() function will find the modular inverse of $a$ modulo an odd modulus $b$ and store 422it in $c$ (provided it exists). The function fp\_gcd() will compute the greatest common 423divisor of $a$ and $b$ and store it in $c$. Similarly the fp\_lcm() function will compute 424the least common multiple of $a$ and $b$ and store it in $c$. 425 426\section{Prime Numbers} 427To quickly test a number for primality call this function. 428 429\index{fp\_isprime} 430\index{fp\_isprime\_ex} 431\begin{verbatim} 432int fp_isprime_ex(fp_int *a, int t); 433\end{verbatim} 434This will return \textbf{FP\_YES} if $a$ is probably prime. It uses 256 trial divisions and 435$t$ rounds of Rabin-Miller testing. Note that this routine performs modular exponentiations 436which means that $a$ must be in a valid range of precision. 437The valid range of $t$ is $1 \ldots 256$. 438 439For backwards compatibility reasons the API function fp\_isprime(a) is still available 440and simply calls fp\_isprime\_ex(a, 8). 441 442\chapter{Porting TomsFastMath} 443\label{chap:asmops} 444\section{Getting Started} 445Porting TomsFastMath to a given processor target is usually a simple procedure. For the most part 446assembly is used to get around the lack of a ``add with carry'' operation in the C language. To 447make matters simpler the use of assembler is through macro blocks. 448 449Each ``port'' is defined by a block of code that re-defines the portable ISO C macros with assembler 450inline blocks. To add a new port you must designate a TFM\_XXX define that will enable your 451port when built. 452 453\section{Multiply with Comba} 454The file ``fp\_mul\_comba.c'' is responsible for providing the fast multiplication within the 455library. This comba multiplication is fairly simple. It uses a sliding three digit carry 456system with the variables $c0$, $c1$, $c2$. For every digit of output $c0$ is the what will 457be that digit, $c1$ will carry into the next digit and $c2$ will be the ``c1'' carry for 458the next digit. For every ``next'' digit effectively $c0$ is stored as output, $c1$ moves into 459$c0$, $c2$ into $c1$ and zero into $c2$. 460 461The following macros define the assmebler interface to the code. 462 463\begin{verbatim} 464#define COMBA_START 465\end{verbatim} 466 467This is issued at the beginning of the multiplication function. This is in place to allow you to 468initialize any registers or machine words required. You can leave it blank if you do not need 469it. 470 471\begin{verbatim} 472#define COMBA_CLEAR \ 473 c0 = c1 = c2 = 0; 474\end{verbatim} 475 476This clears the three comba carries. If you are going to place carries in registers then 477zero the appropriate registers. Note that the functions do not use $c0$, $c1$ or $c2$ directly 478so you are free to ignore these varibles and use registers directly. 479 480\begin{verbatim} 481#define COMBA_FORWARD \ 482 c0 = c1; c1 = c2; c2 = 0; 483\end{verbatim} 484 485This propagates the carries after a digit has been produced. 486 487\begin{verbatim} 488#define COMBA_STORE(x) \ 489 x = c0; 490\end{verbatim} 491 492This stores the $c0$ digit in the memory location specified by $x$. Note that if you manually 493aliased $c0$ with a register than just store that register in $x$. 494 495\begin{verbatim} 496#define COMBA_STORE2(x) \ 497 x = c1; 498\end{verbatim} 499 500This stores the $c1$ digit in the memory location specified by $x$. Note that if you manually 501aliased $c1$ with a register than just store that register in $x$. 502 503\begin{verbatim} 504#define COMBA_FINI 505\end{verbatim} 506 507If at the end of the function you need to perform some action fill this macro in. 508 509\begin{verbatim} 510#define MULADD(i, j) \ 511 t = ((fp_word)i) * ((fp_word)j); \ 512 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ 513 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; 514\end{verbatim} 515 516This macro performs the ``multiply and add'' step that is central to the comba 517multiplier. It multiplies the fp\_digits $i$ and $j$ to produce a fp\_word result. Effectively 518the double--digit value is added to the three-digit carry formed by $c0$, $c1$, $c2$ where $c0$ 519is the least significant digit. 520 521\section{Squaring with Comba} 522Squaring is similar to multiplication except that it uses a special ``multiply and add twice'' macro 523that replaces multiplications that are not required. 524 525\begin{verbatim} 526#define COMBA_START 527\end{verbatim} 528 529This allows for any initialization code you might have. 530 531\begin{verbatim} 532#define CLEAR_CARRY \ 533 c0 = c1 = c2 = 0; 534\end{verbatim} 535 536This will clear the carries. Like multiplication you can safely alias the three carry variables 537to registers if you can/want to. 538 539\begin{verbatim} 540#define COMBA_STORE(x) \ 541 x = c0; 542\end{verbatim} 543 544Store the $c0$ carry to a given memory location. 545 546\begin{verbatim} 547#define COMBA_STORE2(x) \ 548 x = c1; 549\end{verbatim} 550 551Store the $c1$ carry to a given memory location. 552 553\begin{verbatim} 554#define CARRY_FORWARD \ 555 c0 = c1; c1 = c2; c2 = 0; 556\end{verbatim} 557 558Forward propagate all three carry variables. 559 560\begin{verbatim} 561#define COMBA_FINI 562\end{verbatim} 563 564If you need to clean up at the end of the function. 565 566\begin{verbatim} 567/* multiplies point i and j, updates carry "c1" and digit c2 */ 568#define SQRADD(i, j) \ 569 t = ((fp_word)i) * ((fp_word)j); \ 570 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ 571 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; 572\end{verbatim} 573 574This is essentially the MULADD macro from the multiplication code. 575 576\begin{verbatim} 577/* for squaring some of the terms are doubled... */ 578#define SQRADD2(i, j) \ 579 t = ((fp_word)i) * ((fp_word)j); \ 580 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ 581 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; \ 582 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ 583 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; 584\end{verbatim} 585 586This is like SQRADD except it adds the produce twice. It's similar to 587computing SQRADD(i, j*2). 588 589To further make things interesting the squaring code also has ``doubles'' (see my LTM book chapter five...) which are 590handled with these macros. 591 592\begin{verbatim} 593#define SQRADDSC(i, j) \ 594 do { fp_word t; \ 595 t = ((fp_word)i) * ((fp_word)j); \ 596 sc0 = (fp_digit)t; sc1 = (t >> DIGIT_BIT); sc2 = 0; \ 597 } while (0); 598\end{verbatim} 599This computes a product and stores it in the ``secondary'' carry registers $\left < sc0, sc1, sc2 \right >$. 600 601\begin{verbatim} 602#define SQRADDAC(i, j) \ 603 do { fp_word t; \ 604 t = sc0 + ((fp_word)i) * ((fp_word)j); sc0 = t; \ 605 t = sc1 + (t >> DIGIT_BIT); sc1 = t; sc2 += t >> DIGIT_BIT; \ 606 } while (0); 607\end{verbatim} 608This computes a product and adds it to the ``secondary'' carry registers. 609 610\begin{verbatim} 611#define SQRADDDB \ 612 do { fp_word t; \ 613 t = ((fp_word)sc0) + ((fp_word)sc0) + c0; c0 = t; \ 614 t = ((fp_word)sc1) + ((fp_word)sc1) + c1 + (t >> DIGIT_BIT); c1 = t; \ 615 c2 = c2 + ((fp_word)sc2) + ((fp_word)sc2) + (t >> DIGIT_BIT); \ 616 } while (0); 617\end{verbatim} 618This doubles the ``secondary'' carry registers and adds the sum to the main carry registers. Really complicated. 619 620\section{Montgomery with Comba} 621Montgomery reduction is used in modular exponentiation and is most called function during 622that operation. It's important to make sure this routine is very fast or all is lost. 623 624Unlike the two other comba routines this one does not use a single three--digit carry 625system. It does have three--digit carries except that the routine steps through them 626in the inner loop. This means you cannot alias them to registers (at all). 627 628To make matters simple though the three arrays of carries are stored in one array. The 629``c0'' array resides in $c[0 \ldots OFF1-1]$, ``c1'' in $c[OFF1 \ldots OFF2-1]$ and ``c2'' in 630$c[OFF2 \ldots OFF2+FP\_SIZE-1]$. 631 632\begin{verbatim} 633#define MONT_START 634\end{verbatim} 635 636This allows you to insert anything at the start that you need. 637 638\begin{verbatim} 639#define MONT_FINI 640\end{verbatim} 641 642This allows you to insert anything at the end that you need. 643 644\begin{verbatim} 645#define LOOP_START \ 646 mu = c[x] * mp; 647\end{verbatim} 648 649This computes the $\mu$ value for the inner loop. You can safely alias $mu$ and $mp$ to 650a register if you want. 651 652\begin{verbatim} 653#define INNERMUL \ 654 do { fp_word t; \ 655 _c[0] = t = ((fp_word)_c[0] + (fp_word)cy) + \ 656 (((fp_word)mu) * ((fp_word)*tmpm++)); \ 657 cy = (t >> DIGIT_BIT); \ 658 } while (0) 659\end{verbatim} 660 661This computes the inner product and adds it to the destination and carry variable $cy$. 662This uses the $mu$ value computed above (can be in a register already) and the 663$cy$ which is a chaining carry. Inside the INNERMUL loop the $cy$ value can be kept 664inside a register (hint: it always starts as $cy = 0$ in the first iteration). 665 666Upon completion of the inner loop the macro LOOP\_END is called which is used to fetch 667$cy$ into the variable the C program can see. This is where, if you cached $cy$ in a 668register you would copy it to the locally accessible C variable. 669 670\begin{verbatim} 671#define PROPCARRY \ 672 do { fp_digit t = _c[0] += cy; cy = (t < cy); } while (0) 673\end{verbatim} 674 675This propagates the carry upwards by one digit. 676 677\input{tfm.ind} 678 679\end{document} 680