1% This file is part of crlibm, the correctly rounded mathematical library,
2% which has been developed by the Arénaire project at École normale supérieure
3% de Lyon.
4%
5% Copyright (C) 2004-2011 David Defour, Catherine Daramy-Loirat,
6% Florent de Dinechin, Matthieu Gallet, Nicolas Gast, Christoph Quirin Lauter,
7% and Jean-Michel Muller
8%
9% This library is free software; you can redistribute it and/or
10% modify it under the terms of the GNU Lesser General Public
11% License as published by the Free Software Foundation; either
12% version 2.1 of the License, or (at your option) any later version.
13%
14% This library is distributed in the hope that it will be useful,
15% but WITHOUT ANY WARRANTY; without even the implied warranty of
16% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
17% Lesser General Public License for more details.
18%
19% You should have received a copy of the GNU Lesser General Public
20% License along with this library; if not, write to the Free Software
21% Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
22
23
24Some of \crlibm's functions need high precision square roots.  They
25are not intended to be used outside \crlibm. In particular, we do
26currently not guarantee the correct rounding of their results because
27this property is not needed for our purposes. Their implementation
28does not handle all possible special cases ($x < 0$, $\nan$, $\infty$
29etc.) neither.
30
31We currently provide two C macros computing the square root of a
32double precision argument either in double-double precision with at
33least $100$ correct bits (in faithful rounding) or in triple-double
34precision with an accuracy of at least $146$ bits (in faithful
35rounding). The corresponding macros are called \SqrtD~ and
36\SqrtT.
37
38The implementation of these macros was guided by the following
39principles:
40\begin{itemize}
41\item no dependency on other \texttt{libm}s, so avoidance of
42bootstrapping a Newton iteration by a double precision square root
43implemented elsewhere,
44\item high efficiency,
45\item a small memory footprint,
46\item the possible use of hardware support on some platforms in the
47future.
48\end{itemize}
49\subsubsection{Overview of the algorithm}
50The algorithm uses a combination of polynomial approximation and
51Newton iteration.
52
53After handling some special cases, the argument $x = 2^{E^\prime}
54\cdot m^\prime$ is reduced into its exponent $E^\prime$ stored in
55integer and its fractional part $m^\prime$ stored as a double
56precision number.  This argument reduction is obviously exact. The two
57values are then adjusted as follows:
58\vspace{-3mm}
59\begin{center}
60  \begin{tabular}{cc}
61    \begin{minipage}{60mm}
62      $$E = \left \lbrace \begin{array}{ll} E^\prime & \mbox{ if } \exists n \in \N \mbox{ . } E^\prime = 2n\\
63          E^\prime +1 & \mbox{ otherwise} \end{array} \right.$$
64    \end{minipage}
65    &
66    \begin{minipage}{60mm}
67      $$m = \left \lbrace \begin{array}{ll} m^\prime & \mbox{ if } \exists n \in \N \mbox{ . } E^\prime = 2n \\
68          \frac{m^\prime}{2} & \mbox{ otherwise } \end{array} \right.$$
69    \end{minipage}
70  \end{tabular}
71\end{center}
72One easily checks that $\frac{1}{2} \leq m \leq 2$ and that $E$ is always even. Thus
73$$\sqrt{x} = \sqrt{2^E \cdot m} = 2^{\frac{E}{2}} \cdot \sqrt{m} =
742^{\frac{E}{2}} \cdot m \cdot \frac{1}{\sqrt{m}}$$ The algorithm
75therefore approximates $\hat{r} = \frac{1}{\sqrt{m}}$ and reconstructs
76the square root by multiplying by $m$ and exactly by
77$2^{\frac{E}{2}}$.
78
79The reciprocal square root $\hat{r}$ is approximated in two
80steps. First, a polynomial approximation yields to $r_0 = \hat{r}
81\cdot \left( 1 + \epsilon_1 \right)$, which is exact to about $8$
82bits.  In a second step, this approximation is refined by a Newton
83iteration that approximately doubles its accuracy at each step. So for
84a double-double result, $4$ iterations are needed and for a
85triple-double result $5$.
86
87The initial polynomial approximation is less exact than the one
88provided by Itanium's \texttt{} operation, which allows for using this
89hardware assistance in the future.
90
91\subsubsection{Special case handling}
92The square root of a double precision number can never be
93subnormal. In fact, if $\sqrt{x} \leq 2^{-1021}$, $x = \sqrt{x}^2 \leq
942^{-1042441}$, a value that is is not representable in double
95precision.
96
97Concerning subnormals in argument, it to be mentioned that still
98$E^\prime$ and $m^\prime$ can be found such that $x = 2^{E^\prime}
99\cdot m$ exactly and $1 \leq m^\prime \leq 2$. Only the extraction
100sequence must be modified: $x$ is first multiplied by $2^{52}$ where
101$E^\prime$ is set to $-52$. The double number $x$ is thus no longer a
102subnormal an integer handling can extract its mantissa easily. The
103extraction of the exponent takes into account the preceeding bias of
104$E^\prime$. The case $x = 0$ is filtered out before. Obviously
105$\sqrt{0} = 0$ is returned for this argument.
106
107The special cases $x < 0$, $x = \pm \infty$ and $x = \nan$ are not
108handled since they can be easily excluded by the code using the square
109root macros.
110
111Special case handling is implemented as follows:
112\begin{lstlisting}[caption={Special case handling},firstnumber=1]
113/* Special case x = 0 */
114if (x == 0) {
115  *resh = x;
116  *resl = 0;
117} else {
118
119  E = 0;
120
121  /* Convert to integer format */
122  xdb.d = x;
123
124  /* Handle subnormal case */
125  if (xdb.i[HI] < 0x00100000) {
126    E = -52;
127    xdb.d *= ((db_number) ((double) SQRTTWO52)).d; 	  /* make x a normal number */
128  }
129
130  /* Extract exponent E and mantissa m */
131  E += (xdb.i[HI]>>20)-1023;
132  xdb.i[HI] = (xdb.i[HI] & 0x000fffff) | 0x3ff00000;
133  m = xdb.d;
134
135  /* Make exponent even */
136  if (E & 0x00000001) {
137    E++;
138    m *= 0.5;    /* Suppose now 1/2 <= m <= 2 */
139  }
140
141  /* Construct sqrt(2^E) = 2^(E/2) */
142  xdb.i[HI] = (E/2 + 1023) << 20;
143  xdb.i[LO] = 0;
144\end{lstlisting}
145
146\subsubsection{Polynomial approximation}
147The reciprocal square root $\hat{r} = \frac{1}{\sqrt{m}}$ is
148approximated in the domain $m \in \left[ \frac{1}{2}; 2 \right]$ by a
149polynomial $p\left( m \right) = \sum\limits_{i=0}^4 c_i \cdot m^i$ of
150degree $4$. The polynomial's coefficients $c_0$ through $c_4$ are
151stored in double precision.  The following values are used:
152\begin{eqnarray*}
153c_0 & = & 2.50385236695888790947606139525305479764938354492188 \\
154c_1 & = & -3.29763389114324168005509818613063544034957885742188  \\
155c_2 & = & 2.75726076139124520736345402838196605443954467773438   \\
156c_3 & = & -1.15233725777933848632983426796272397041320800781250  \\
157c_4 & = & 0.186900066679800969104974228685023263096809387207031
158\end{eqnarray*}
159
160The relative approximation error $\epsilon_{\mbox{\tiny approx}} =
161\frac{p\left( m\right) - \hat{r}}{\hat{r}}$ is bounded by $\left \vert
162\epsilon_{\mbox{\tiny approx}} \right \vert \leq 2^{-8.32}$ for $m \in
163\left[ \frac{1}{2}; 2 \right]$.
164
165The polynomial is evaluated in double precision using Horner's
166scheme. There may be some cancellation in the different steps but the
167relative arithmetical error $\epsilon_{\mbox{\tiny arithpoly}}$ is
168always less in magnitude than $2^{-30}$. This will be shown in more
169detail below.
170
171The code implementing the polynomial approximation reads:
172\begin{lstlisting}[caption={Polynomial approximation},firstnumber=1]
173r0 = SQRTPOLYC0 + m * (SQRTPOLYC1 + m * (SQRTPOLYC2 + m * (SQRTPOLYC3 + m * SQRTPOLYC4)));
174\end{lstlisting}
175So 4 double precision multiplications and 4 additions are needed for computing the
176initial approximation. They can be replaced by 4 FMA instructions, if available.
177
178\subsubsection{Double and double-double Newton iteration}
179The polynomial approximation is then refined using the following iteration scheme:
180$$r_{i+1} = \frac{1}{2} \cdot r_i \cdot (3 - m \cdot r_i^2)$$
181If the arithmetic operations were exact, one would obtain the following error estimate:
182\begin{eqnarray*}
183\epsilon_{i+1} & = & \frac{r_i - \hat{r}}{\hat{r}} \\ & = &
184\frac{\frac{1}{2} \cdot r_i \cdot \left(3 - m \cdot r_i^2\right) -
185\hat{r}}{\hat{r}} \\
186& = & \frac{\frac{1}{2} \cdot \hat{r} \cdot
187\left( 1 + \epsilon_i \right) \cdot \left( 3 - m \cdot \hat{r}^2 \cdot
188\left( 1 + \epsilon_i \right)^2 \right) - \hat{r}}{\hat{r}} \\
189& = & \frac{1}{2} \cdot \left( 1 + \epsilon_i \right) \cdot \left( 3 - m \cdot \frac{1}{m} \cdot \left( 1 +
190\epsilon_i\right)^2 \right) - 1 \\
191& = & \frac{1}{2} \cdot \left( 1 + \epsilon_i \right) \cdot \left( 3 - 1 - 2 \cdot \epsilon_i - \epsilon_i^2
192\right) - 1 \\
193& = & \left( 1 + \epsilon_i \right) \cdot \left( 1 - \epsilon_i - \frac{1}{2} \cdot \epsilon_i^2
194\right) - 1 \\
195& = & 1 - \epsilon_i - \frac{1}{2} \cdot \epsilon_i^2 + \epsilon_i - \epsilon_i^2 - \frac{1}{2} \cdot \epsilon_i^3 - 1\\
196& = & - \frac{3}{2} \cdot \epsilon_i^2 - \frac{1}{2} \cdot \epsilon_i^3
197\end{eqnarray*}
198So the accuracy of the approximation of the reciprocal square root is doubled at each step.
199
200Since the initial accuracy is about $8$ bits, it is possible to iterate two times on pure double precision
201without any considerable loss of accuracy. After the two iterations about $31$ bits will be correct.
202The macro implements therefore:
203\begin{lstlisting}[caption={Newton iteration - double precision steps},firstnumber=1]
204r1 = 0.5 * r0 * (3 - m * (r0 * r0));
205r2 = 0.5 * r1 * (3 - m * (r1 * r1));
206\end{lstlisting}
207For these two iterations, 8 double precision multiplications and 2 additions are needed.
208
209The next iteration steps must be performed in double-double precision
210because the $53$ bit mantissa of a double cannot contain the about
211$60$ bit exact value $m \cdot r_2^2 \approx 1$ before cancellation in
212the substraction with $3$ and the multiplication by $r_2$.
213
214In order to exploit maximally the parallelism in the iteration equation, we rewrite it as
215\begin{eqnarray*}
216r_{3} & = & \frac{1}{2} \cdot r_2 \cdot \left( 3 - m \cdot r_2^2 \right) \\
217& = & \left( r_2 + \frac{1}{2} \cdot r_2 \right) - \frac{1}{2} \cdot \left( m \cdot r_2 \right) \cdot
218\left( r_2 \cdot r_2 \right)
219\end{eqnarray*}
220Since multiplications by integer powers of $2$ are exact, it is
221possible to compute $r_2 + \frac{1}{2} \cdot r_2$ exactly as a
222double-double. Concurrently it is possible to compute $m \cdot r_2$
223and $r_2 \cdot r_2$ exactly as double-doubles by means of an exact
224multiplication.  The multiplication $\left( m \cdot r_2 \right) \cdot
225\left( r_2 \cdot r_2 \right)$ is then implemented as a double-double
226multiplication.  The multiplication by $\frac{1}{2}$ of the value
227obtained is exact and can be performed pairwise on the
228double-double. A final double-double addition leads to $r_3 = \left(
229r_2 + \frac{1}{2} \cdot r_2 \right) - \frac{1}{2} \cdot \left( m \cdot
230r_2 \right) \cdot \left( r_2 \cdot r_2 \right)$. Here, massive
231cancellation is no longer possible since the values added are
232approximately $\frac{3}{2} \cdot r_2$ and $\frac{1}{2} \cdot r_2$.
233
234These steps are implemented as follows:
235\begin{lstlisting}[caption={Newton iteration - first double-double step},firstnumber=1]
236Mul12(&r2Sqh, &r2Sql, r2, r2);    Add12(r2PHr2h, r2PHr2l, r2, 0.5 * r2);
237Mul12(&mMr2h, &mMr2l, m, r2);
238Mul22(&mMr2Ch, &mMr2Cl, mMr2h, mMr2l, r2Sqh, r2Sql);
239
240MHmMr2Ch = -0.5 * mMr2Ch;
241MHmMr2Cl = -0.5 * mMr2Cl;
242
243Add22(&r3h, &r3l, r2PHr2h, r2PHr2l, MHmMr2Ch, MHmMr2Cl);
244\end{lstlisting}
245
246The next iteration step provides enough accuracy for a double-double result.
247We rewrite the basic iteration equation once again as:
248\begin{eqnarray*}
249r_4 & = & \frac{1}{2} \cdot r_3 \cdot \left( 3 - m \cdot r_3^2 \right) \\
250& = & r_3 \cdot \left( \frac{3}{2} - \frac{1}{2} \cdot m \cdot r_3^2 \right) \\
251& = & r_3 \cdot \left( \frac{3}{2} - \frac{1}{2} \cdot \left( \left( m \cdot r_3^2 - 1 \right) + 1 \right) \right) \\
252& = & r_3 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_3^2 - 1 \right) \right)
253\end{eqnarray*}
254Further, we know that $r_3$, stored as a double-double, verifies $r_3
255= \hat{r} \cdot \left( 1 + \epsilon_3 \right)$ with $\left \vert
256\epsilon_3 \right \vert \leq 2^{-60}$. So we check that
257$$m \cdot r_3^2 = m \cdot \hat{r}^2 \cdot \left( 1 +
258\epsilon_3 \right)^2 = 1 + 2 \cdot \epsilon_3 + \epsilon_3^2$$
259Clearly, $\left \vert 2 \cdot \epsilon_3 + \epsilon_3^2 \right \vert <
260\frac{1}{2} \mUlp\left( 1 \right)$. So when squaring $r_{3\hi} + r_{3\lo}$ in double-double precision
261and multiplying it in double-double precision by $m$ produces a
262double-double $mMr3Sq_\hi + mMr3Sq_\lo = m \cdot \left( r_{3\hi} +
263r_{3\lo} \right)^2 \cdot \left( 1 + \epsilon \right)$, $\left \vert
264\epsilon \right \vert \leq 2^{-100}$ such that $mMr3Sq_\hi = 1$ in all
265cases.
266
267So we can implement the iteration equation
268$$r_4 = r_3 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_3^2 - 1 \right) \right)$$
269as follows:
270\begin{lstlisting}[caption={Newton iteration - second double-double step},firstnumber=1]
271Mul22(&r3Sqh, &r3Sql, r3h, r3l, r3h, r3l);
272Mul22(&mMr3Sqh, &mMr3Sql, m, 0, r3Sqh, r3Sql);
273
274Mul22(&r4h, &r4l, r3h, r3l, 1, -0.5 * mMr3Sql);
275\end{lstlisting}
276We since get $r_{4\hi} + r_{4\lo} = \hat{r} \cdot \left( 1 +
277\epsilon_4 \right)$ with $\left \vert \epsilon_4 \right \vert \leq
2782^{-102}$, the accuracy being limited by the accuracy of the last
279double-double multiplication operator.
280
281This approximation is than multiplied by $m$ in double-double
282precision, leading to an approximation $srtm_\hi + srtm_\lo = \sqrt{m}
283\cdot \left( 1 + \epsilon \right)$ with $\left \vert \epsilon \right \vert \leq
2842^{-100}$.
285
286Out of this value, the square root of the initial argument can be
287reconstructed by multiplying by $2^{\frac{E}{2}}$, which has already
288been stored in $xdb.d$. This multiplication is exact because it cannot
289produce a subnormal.
290
291These two steps are implemented as shown below:
292\begin{lstlisting}[caption={Multiplication $m \cdot \hat{r}$, reconstruction},firstnumber=1]
293Mul22(&srtmh,&srtml,m,0,r4h,r4l);
294
295/* Multiply componentwise by sqrt(2^E), which is an integer power of 2 that may not produce a subnormal */
296
297*resh = xdb.d * srtmh;
298*resl = xdb.d * srtml;
299\end{lstlisting}
300
301\subsubsection{Triple-double Newton iteration}
302For producing a triple-double approximate to $\hat{r}$ with an
303accuracy of at least $147$ bits, one more Newton iteration is
304needed. We apply the same equation as in the last double-double step,
305which reads:
306$$r_5 = r_4 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_4^2 -
3071 \right) \right)$$ Once again, the first component of the
308triple-double number holding an approximation to $m \cdot r_4^2$ is
309exactly equal to $1$. So by neglecting this component, we substract $1$ from it.
310Unfortunately, a renormalization step is needed after the multiplications for
311squaring $r_4$ and by $m$ because the values computed might be overlapped which would prevent us
312form substracting $1$ by neglecting a component.
313
314We implement thus:
315\begin{lstlisting}[caption={Newton iteration - triple-double step},firstnumber=1]
316Mul23(&r4Sqh, &r4Sqm, &r4Sql, r4h, r4l, r4h, r4l);
317Mul133(&mMr4Sqhover, &mMr4Sqmover, &mMr4Sqlover, m, r4Sqh, r4Sqm, r4Sql);
318Renormalize3(&mMr4Sqh, &mMr4Sqm, &mMr4Sql, mMr4Sqhover, mMr4Sqmover, mMr4Sqlover);
319
320HmMr4Sqm = -0.5 * mMr4Sqm;
321HmMr4Sql = -0.5 * mMr4Sql;
322
323Mul233(&r5h,&r5m,&r5l,r4h,r4l,1,HmMr4Sqm,HmMr4Sql);
324\end{lstlisting}
325
326This approximation $r_{5\hi} + r_{5\mi} + r_{5\lo} = \hat{r} \cdot
327\left( 1 + \epsilon_5 \right)$, where $\left \vert \epsilon_5 \right
328\vert \leq 2^{-147}$ is then multiplied by $m$ in order to obtain a
329triple-double approximation of $\sqrt{m}$. Once renormalized result is
330exactly multiplied by $2^{\frac{E}{2}}$ stored in $xdb.d$.  We
331implement:
332\begin{lstlisting}[caption={Newton iteration - triple-double step},firstnumber=1]
333Mul133(&srtmhover, &srtmmover, &srtmlover,m,r5h,r5m,r5l);
334
335Renormalize3(&srtmh,&srtmm,&srtml,srtmhover,srtmmover,srtmlover);
336
337(*(resh)) = xdb.d * srtmh;
338(*(resm)) = xdb.d * srtmm;
339(*(resl)) = xdb.d * srtml;
340\end{lstlisting}
341
342\subsubsection{Accuracy bounds}
343
344TODO: see possibly available Gappa files meanwhile
345
346
347