1% This file is part of crlibm, the correctly rounded mathematical library, 2% which has been developed by the Arénaire project at École normale supérieure 3% de Lyon. 4% 5% Copyright (C) 2004-2011 David Defour, Catherine Daramy-Loirat, 6% Florent de Dinechin, Matthieu Gallet, Nicolas Gast, Christoph Quirin Lauter, 7% and Jean-Michel Muller 8% 9% This library is free software; you can redistribute it and/or 10% modify it under the terms of the GNU Lesser General Public 11% License as published by the Free Software Foundation; either 12% version 2.1 of the License, or (at your option) any later version. 13% 14% This library is distributed in the hope that it will be useful, 15% but WITHOUT ANY WARRANTY; without even the implied warranty of 16% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 17% Lesser General Public License for more details. 18% 19% You should have received a copy of the GNU Lesser General Public 20% License along with this library; if not, write to the Free Software 21% Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 22 23 24Some of \crlibm's functions need high precision square roots. They 25are not intended to be used outside \crlibm. In particular, we do 26currently not guarantee the correct rounding of their results because 27this property is not needed for our purposes. Their implementation 28does not handle all possible special cases ($x < 0$, $\nan$, $\infty$ 29etc.) neither. 30 31We currently provide two C macros computing the square root of a 32double precision argument either in double-double precision with at 33least $100$ correct bits (in faithful rounding) or in triple-double 34precision with an accuracy of at least $146$ bits (in faithful 35rounding). The corresponding macros are called \SqrtD~ and 36\SqrtT. 37 38The implementation of these macros was guided by the following 39principles: 40\begin{itemize} 41\item no dependency on other \texttt{libm}s, so avoidance of 42bootstrapping a Newton iteration by a double precision square root 43implemented elsewhere, 44\item high efficiency, 45\item a small memory footprint, 46\item the possible use of hardware support on some platforms in the 47future. 48\end{itemize} 49\subsubsection{Overview of the algorithm} 50The algorithm uses a combination of polynomial approximation and 51Newton iteration. 52 53After handling some special cases, the argument $x = 2^{E^\prime} 54\cdot m^\prime$ is reduced into its exponent $E^\prime$ stored in 55integer and its fractional part $m^\prime$ stored as a double 56precision number. This argument reduction is obviously exact. The two 57values are then adjusted as follows: 58\vspace{-3mm} 59\begin{center} 60 \begin{tabular}{cc} 61 \begin{minipage}{60mm} 62 $$E = \left \lbrace \begin{array}{ll} E^\prime & \mbox{ if } \exists n \in \N \mbox{ . } E^\prime = 2n\\ 63 E^\prime +1 & \mbox{ otherwise} \end{array} \right.$$ 64 \end{minipage} 65 & 66 \begin{minipage}{60mm} 67 $$m = \left \lbrace \begin{array}{ll} m^\prime & \mbox{ if } \exists n \in \N \mbox{ . } E^\prime = 2n \\ 68 \frac{m^\prime}{2} & \mbox{ otherwise } \end{array} \right.$$ 69 \end{minipage} 70 \end{tabular} 71\end{center} 72One easily checks that $\frac{1}{2} \leq m \leq 2$ and that $E$ is always even. Thus 73$$\sqrt{x} = \sqrt{2^E \cdot m} = 2^{\frac{E}{2}} \cdot \sqrt{m} = 742^{\frac{E}{2}} \cdot m \cdot \frac{1}{\sqrt{m}}$$ The algorithm 75therefore approximates $\hat{r} = \frac{1}{\sqrt{m}}$ and reconstructs 76the square root by multiplying by $m$ and exactly by 77$2^{\frac{E}{2}}$. 78 79The reciprocal square root $\hat{r}$ is approximated in two 80steps. First, a polynomial approximation yields to $r_0 = \hat{r} 81\cdot \left( 1 + \epsilon_1 \right)$, which is exact to about $8$ 82bits. In a second step, this approximation is refined by a Newton 83iteration that approximately doubles its accuracy at each step. So for 84a double-double result, $4$ iterations are needed and for a 85triple-double result $5$. 86 87The initial polynomial approximation is less exact than the one 88provided by Itanium's \texttt{} operation, which allows for using this 89hardware assistance in the future. 90 91\subsubsection{Special case handling} 92The square root of a double precision number can never be 93subnormal. In fact, if $\sqrt{x} \leq 2^{-1021}$, $x = \sqrt{x}^2 \leq 942^{-1042441}$, a value that is is not representable in double 95precision. 96 97Concerning subnormals in argument, it to be mentioned that still 98$E^\prime$ and $m^\prime$ can be found such that $x = 2^{E^\prime} 99\cdot m$ exactly and $1 \leq m^\prime \leq 2$. Only the extraction 100sequence must be modified: $x$ is first multiplied by $2^{52}$ where 101$E^\prime$ is set to $-52$. The double number $x$ is thus no longer a 102subnormal an integer handling can extract its mantissa easily. The 103extraction of the exponent takes into account the preceeding bias of 104$E^\prime$. The case $x = 0$ is filtered out before. Obviously 105$\sqrt{0} = 0$ is returned for this argument. 106 107The special cases $x < 0$, $x = \pm \infty$ and $x = \nan$ are not 108handled since they can be easily excluded by the code using the square 109root macros. 110 111Special case handling is implemented as follows: 112\begin{lstlisting}[caption={Special case handling},firstnumber=1] 113/* Special case x = 0 */ 114if (x == 0) { 115 *resh = x; 116 *resl = 0; 117} else { 118 119 E = 0; 120 121 /* Convert to integer format */ 122 xdb.d = x; 123 124 /* Handle subnormal case */ 125 if (xdb.i[HI] < 0x00100000) { 126 E = -52; 127 xdb.d *= ((db_number) ((double) SQRTTWO52)).d; /* make x a normal number */ 128 } 129 130 /* Extract exponent E and mantissa m */ 131 E += (xdb.i[HI]>>20)-1023; 132 xdb.i[HI] = (xdb.i[HI] & 0x000fffff) | 0x3ff00000; 133 m = xdb.d; 134 135 /* Make exponent even */ 136 if (E & 0x00000001) { 137 E++; 138 m *= 0.5; /* Suppose now 1/2 <= m <= 2 */ 139 } 140 141 /* Construct sqrt(2^E) = 2^(E/2) */ 142 xdb.i[HI] = (E/2 + 1023) << 20; 143 xdb.i[LO] = 0; 144\end{lstlisting} 145 146\subsubsection{Polynomial approximation} 147The reciprocal square root $\hat{r} = \frac{1}{\sqrt{m}}$ is 148approximated in the domain $m \in \left[ \frac{1}{2}; 2 \right]$ by a 149polynomial $p\left( m \right) = \sum\limits_{i=0}^4 c_i \cdot m^i$ of 150degree $4$. The polynomial's coefficients $c_0$ through $c_4$ are 151stored in double precision. The following values are used: 152\begin{eqnarray*} 153c_0 & = & 2.50385236695888790947606139525305479764938354492188 \\ 154c_1 & = & -3.29763389114324168005509818613063544034957885742188 \\ 155c_2 & = & 2.75726076139124520736345402838196605443954467773438 \\ 156c_3 & = & -1.15233725777933848632983426796272397041320800781250 \\ 157c_4 & = & 0.186900066679800969104974228685023263096809387207031 158\end{eqnarray*} 159 160The relative approximation error $\epsilon_{\mbox{\tiny approx}} = 161\frac{p\left( m\right) - \hat{r}}{\hat{r}}$ is bounded by $\left \vert 162\epsilon_{\mbox{\tiny approx}} \right \vert \leq 2^{-8.32}$ for $m \in 163\left[ \frac{1}{2}; 2 \right]$. 164 165The polynomial is evaluated in double precision using Horner's 166scheme. There may be some cancellation in the different steps but the 167relative arithmetical error $\epsilon_{\mbox{\tiny arithpoly}}$ is 168always less in magnitude than $2^{-30}$. This will be shown in more 169detail below. 170 171The code implementing the polynomial approximation reads: 172\begin{lstlisting}[caption={Polynomial approximation},firstnumber=1] 173r0 = SQRTPOLYC0 + m * (SQRTPOLYC1 + m * (SQRTPOLYC2 + m * (SQRTPOLYC3 + m * SQRTPOLYC4))); 174\end{lstlisting} 175So 4 double precision multiplications and 4 additions are needed for computing the 176initial approximation. They can be replaced by 4 FMA instructions, if available. 177 178\subsubsection{Double and double-double Newton iteration} 179The polynomial approximation is then refined using the following iteration scheme: 180$$r_{i+1} = \frac{1}{2} \cdot r_i \cdot (3 - m \cdot r_i^2)$$ 181If the arithmetic operations were exact, one would obtain the following error estimate: 182\begin{eqnarray*} 183\epsilon_{i+1} & = & \frac{r_i - \hat{r}}{\hat{r}} \\ & = & 184\frac{\frac{1}{2} \cdot r_i \cdot \left(3 - m \cdot r_i^2\right) - 185\hat{r}}{\hat{r}} \\ 186& = & \frac{\frac{1}{2} \cdot \hat{r} \cdot 187\left( 1 + \epsilon_i \right) \cdot \left( 3 - m \cdot \hat{r}^2 \cdot 188\left( 1 + \epsilon_i \right)^2 \right) - \hat{r}}{\hat{r}} \\ 189& = & \frac{1}{2} \cdot \left( 1 + \epsilon_i \right) \cdot \left( 3 - m \cdot \frac{1}{m} \cdot \left( 1 + 190\epsilon_i\right)^2 \right) - 1 \\ 191& = & \frac{1}{2} \cdot \left( 1 + \epsilon_i \right) \cdot \left( 3 - 1 - 2 \cdot \epsilon_i - \epsilon_i^2 192\right) - 1 \\ 193& = & \left( 1 + \epsilon_i \right) \cdot \left( 1 - \epsilon_i - \frac{1}{2} \cdot \epsilon_i^2 194\right) - 1 \\ 195& = & 1 - \epsilon_i - \frac{1}{2} \cdot \epsilon_i^2 + \epsilon_i - \epsilon_i^2 - \frac{1}{2} \cdot \epsilon_i^3 - 1\\ 196& = & - \frac{3}{2} \cdot \epsilon_i^2 - \frac{1}{2} \cdot \epsilon_i^3 197\end{eqnarray*} 198So the accuracy of the approximation of the reciprocal square root is doubled at each step. 199 200Since the initial accuracy is about $8$ bits, it is possible to iterate two times on pure double precision 201without any considerable loss of accuracy. After the two iterations about $31$ bits will be correct. 202The macro implements therefore: 203\begin{lstlisting}[caption={Newton iteration - double precision steps},firstnumber=1] 204r1 = 0.5 * r0 * (3 - m * (r0 * r0)); 205r2 = 0.5 * r1 * (3 - m * (r1 * r1)); 206\end{lstlisting} 207For these two iterations, 8 double precision multiplications and 2 additions are needed. 208 209The next iteration steps must be performed in double-double precision 210because the $53$ bit mantissa of a double cannot contain the about 211$60$ bit exact value $m \cdot r_2^2 \approx 1$ before cancellation in 212the substraction with $3$ and the multiplication by $r_2$. 213 214In order to exploit maximally the parallelism in the iteration equation, we rewrite it as 215\begin{eqnarray*} 216r_{3} & = & \frac{1}{2} \cdot r_2 \cdot \left( 3 - m \cdot r_2^2 \right) \\ 217& = & \left( r_2 + \frac{1}{2} \cdot r_2 \right) - \frac{1}{2} \cdot \left( m \cdot r_2 \right) \cdot 218\left( r_2 \cdot r_2 \right) 219\end{eqnarray*} 220Since multiplications by integer powers of $2$ are exact, it is 221possible to compute $r_2 + \frac{1}{2} \cdot r_2$ exactly as a 222double-double. Concurrently it is possible to compute $m \cdot r_2$ 223and $r_2 \cdot r_2$ exactly as double-doubles by means of an exact 224multiplication. The multiplication $\left( m \cdot r_2 \right) \cdot 225\left( r_2 \cdot r_2 \right)$ is then implemented as a double-double 226multiplication. The multiplication by $\frac{1}{2}$ of the value 227obtained is exact and can be performed pairwise on the 228double-double. A final double-double addition leads to $r_3 = \left( 229r_2 + \frac{1}{2} \cdot r_2 \right) - \frac{1}{2} \cdot \left( m \cdot 230r_2 \right) \cdot \left( r_2 \cdot r_2 \right)$. Here, massive 231cancellation is no longer possible since the values added are 232approximately $\frac{3}{2} \cdot r_2$ and $\frac{1}{2} \cdot r_2$. 233 234These steps are implemented as follows: 235\begin{lstlisting}[caption={Newton iteration - first double-double step},firstnumber=1] 236Mul12(&r2Sqh, &r2Sql, r2, r2); Add12(r2PHr2h, r2PHr2l, r2, 0.5 * r2); 237Mul12(&mMr2h, &mMr2l, m, r2); 238Mul22(&mMr2Ch, &mMr2Cl, mMr2h, mMr2l, r2Sqh, r2Sql); 239 240MHmMr2Ch = -0.5 * mMr2Ch; 241MHmMr2Cl = -0.5 * mMr2Cl; 242 243Add22(&r3h, &r3l, r2PHr2h, r2PHr2l, MHmMr2Ch, MHmMr2Cl); 244\end{lstlisting} 245 246The next iteration step provides enough accuracy for a double-double result. 247We rewrite the basic iteration equation once again as: 248\begin{eqnarray*} 249r_4 & = & \frac{1}{2} \cdot r_3 \cdot \left( 3 - m \cdot r_3^2 \right) \\ 250& = & r_3 \cdot \left( \frac{3}{2} - \frac{1}{2} \cdot m \cdot r_3^2 \right) \\ 251& = & r_3 \cdot \left( \frac{3}{2} - \frac{1}{2} \cdot \left( \left( m \cdot r_3^2 - 1 \right) + 1 \right) \right) \\ 252& = & r_3 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_3^2 - 1 \right) \right) 253\end{eqnarray*} 254Further, we know that $r_3$, stored as a double-double, verifies $r_3 255= \hat{r} \cdot \left( 1 + \epsilon_3 \right)$ with $\left \vert 256\epsilon_3 \right \vert \leq 2^{-60}$. So we check that 257$$m \cdot r_3^2 = m \cdot \hat{r}^2 \cdot \left( 1 + 258\epsilon_3 \right)^2 = 1 + 2 \cdot \epsilon_3 + \epsilon_3^2$$ 259Clearly, $\left \vert 2 \cdot \epsilon_3 + \epsilon_3^2 \right \vert < 260\frac{1}{2} \mUlp\left( 1 \right)$. So when squaring $r_{3\hi} + r_{3\lo}$ in double-double precision 261and multiplying it in double-double precision by $m$ produces a 262double-double $mMr3Sq_\hi + mMr3Sq_\lo = m \cdot \left( r_{3\hi} + 263r_{3\lo} \right)^2 \cdot \left( 1 + \epsilon \right)$, $\left \vert 264\epsilon \right \vert \leq 2^{-100}$ such that $mMr3Sq_\hi = 1$ in all 265cases. 266 267So we can implement the iteration equation 268$$r_4 = r_3 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_3^2 - 1 \right) \right)$$ 269as follows: 270\begin{lstlisting}[caption={Newton iteration - second double-double step},firstnumber=1] 271Mul22(&r3Sqh, &r3Sql, r3h, r3l, r3h, r3l); 272Mul22(&mMr3Sqh, &mMr3Sql, m, 0, r3Sqh, r3Sql); 273 274Mul22(&r4h, &r4l, r3h, r3l, 1, -0.5 * mMr3Sql); 275\end{lstlisting} 276We since get $r_{4\hi} + r_{4\lo} = \hat{r} \cdot \left( 1 + 277\epsilon_4 \right)$ with $\left \vert \epsilon_4 \right \vert \leq 2782^{-102}$, the accuracy being limited by the accuracy of the last 279double-double multiplication operator. 280 281This approximation is than multiplied by $m$ in double-double 282precision, leading to an approximation $srtm_\hi + srtm_\lo = \sqrt{m} 283\cdot \left( 1 + \epsilon \right)$ with $\left \vert \epsilon \right \vert \leq 2842^{-100}$. 285 286Out of this value, the square root of the initial argument can be 287reconstructed by multiplying by $2^{\frac{E}{2}}$, which has already 288been stored in $xdb.d$. This multiplication is exact because it cannot 289produce a subnormal. 290 291These two steps are implemented as shown below: 292\begin{lstlisting}[caption={Multiplication $m \cdot \hat{r}$, reconstruction},firstnumber=1] 293Mul22(&srtmh,&srtml,m,0,r4h,r4l); 294 295/* Multiply componentwise by sqrt(2^E), which is an integer power of 2 that may not produce a subnormal */ 296 297*resh = xdb.d * srtmh; 298*resl = xdb.d * srtml; 299\end{lstlisting} 300 301\subsubsection{Triple-double Newton iteration} 302For producing a triple-double approximate to $\hat{r}$ with an 303accuracy of at least $147$ bits, one more Newton iteration is 304needed. We apply the same equation as in the last double-double step, 305which reads: 306$$r_5 = r_4 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_4^2 - 3071 \right) \right)$$ Once again, the first component of the 308triple-double number holding an approximation to $m \cdot r_4^2$ is 309exactly equal to $1$. So by neglecting this component, we substract $1$ from it. 310Unfortunately, a renormalization step is needed after the multiplications for 311squaring $r_4$ and by $m$ because the values computed might be overlapped which would prevent us 312form substracting $1$ by neglecting a component. 313 314We implement thus: 315\begin{lstlisting}[caption={Newton iteration - triple-double step},firstnumber=1] 316Mul23(&r4Sqh, &r4Sqm, &r4Sql, r4h, r4l, r4h, r4l); 317Mul133(&mMr4Sqhover, &mMr4Sqmover, &mMr4Sqlover, m, r4Sqh, r4Sqm, r4Sql); 318Renormalize3(&mMr4Sqh, &mMr4Sqm, &mMr4Sql, mMr4Sqhover, mMr4Sqmover, mMr4Sqlover); 319 320HmMr4Sqm = -0.5 * mMr4Sqm; 321HmMr4Sql = -0.5 * mMr4Sql; 322 323Mul233(&r5h,&r5m,&r5l,r4h,r4l,1,HmMr4Sqm,HmMr4Sql); 324\end{lstlisting} 325 326This approximation $r_{5\hi} + r_{5\mi} + r_{5\lo} = \hat{r} \cdot 327\left( 1 + \epsilon_5 \right)$, where $\left \vert \epsilon_5 \right 328\vert \leq 2^{-147}$ is then multiplied by $m$ in order to obtain a 329triple-double approximation of $\sqrt{m}$. Once renormalized result is 330exactly multiplied by $2^{\frac{E}{2}}$ stored in $xdb.d$. We 331implement: 332\begin{lstlisting}[caption={Newton iteration - triple-double step},firstnumber=1] 333Mul133(&srtmhover, &srtmmover, &srtmlover,m,r5h,r5m,r5l); 334 335Renormalize3(&srtmh,&srtmm,&srtml,srtmhover,srtmmover,srtmlover); 336 337(*(resh)) = xdb.d * srtmh; 338(*(resm)) = xdb.d * srtmm; 339(*(resl)) = xdb.d * srtml; 340\end{lstlisting} 341 342\subsubsection{Accuracy bounds} 343 344TODO: see possibly available Gappa files meanwhile 345 346 347