1Some of \crlibm's functions need high precision square roots.  They
2are not intended to be used outside \crlibm. In particular, we do
3currently not guarantee the correct rounding of their results because
4this property is not needed for our purposes. Their implementation
5does not handle all possible special cases ($x < 0$, $\nan$, $\infty$
6etc.) neither.
7
8We currently provide two C macros computing the square root of a
9double precision argument either in double-double precision with at
10least $100$ correct bits (in faithful rounding) or in triple-double
11precision with an accuracy of at least $146$ bits (in faithful
12rounding). The corresponding macros are called \SqrtD~ and
13\SqrtT.
14
15The implementation of these macros was guided by the following
16principles:
17\begin{itemize}
18\item no dependency on other \texttt{libm}s, so avoidance of
19bootstrapping a Newton iteration by a double precision square root
20implemented elsewhere,
21\item high efficiency,
22\item a small memory footprint,
23\item the possible use of hardware support on some platforms in the
24future.
25\end{itemize}
26\subsubsection{Overview of the algorithm}
27The algorithm uses a combination of polynomial approximation and
28Newton iteration.
29
30After handling some special cases, the argument $x = 2^{E^\prime}
31\cdot m^\prime$ is reduced into its exponent $E^\prime$ stored in
32integer and its fractional part $m^\prime$ stored as a double
33precision number.  This argument reduction is obviously exact. The two
34values are then adjusted as follows:
35\vspace{-3mm}
36\begin{center}
37  \begin{tabular}{cc}
38    \begin{minipage}{60mm}
39      $$E = \left \lbrace \begin{array}{ll} E^\prime & \mbox{ if } \exists n \in \N \mbox{ . } E^\prime = 2n\\
40          E^\prime +1 & \mbox{ otherwise} \end{array} \right.$$
41    \end{minipage}
42    &
43    \begin{minipage}{60mm}
44      $$m = \left \lbrace \begin{array}{ll} m^\prime & \mbox{ if } \exists n \in \N \mbox{ . } E^\prime = 2n \\
45          \frac{m^\prime}{2} & \mbox{ otherwise } \end{array} \right.$$
46    \end{minipage}
47  \end{tabular}
48\end{center}
49One easily checks that $\frac{1}{2} \leq m \leq 2$ and that $E$ is always even. Thus
50$$\sqrt{x} = \sqrt{2^E \cdot m} = 2^{\frac{E}{2}} \cdot \sqrt{m} =
512^{\frac{E}{2}} \cdot m \cdot \frac{1}{\sqrt{m}}$$ The algorithm
52therefore approximates $\hat{r} = \frac{1}{\sqrt{m}}$ and reconstructs
53the square root by multiplying by $m$ and exactly by
54$2^{\frac{E}{2}}$.
55
56The reciprocal square root $\hat{r}$ is approximated in two
57steps. First, a polynomial approximation yields to $r_0 = \hat{r}
58\cdot \left( 1 + \epsilon_1 \right)$, which is exact to about $8$
59bits.  In a second step, this approximation is refined by a Newton
60iteration that approximately doubles its accuracy at each step. So for
61a double-double result, $4$ iterations are needed and for a
62triple-double result $5$.
63
64The initial polynomial approximation is less exact than the one
65provided by Itanium's \texttt{} operation, which allows for using this
66hardware assistance in the future.
67
68\subsubsection{Special case handling}
69The square root of a double precision number can never be
70subnormal. In fact, if $\sqrt{x} \leq 2^{-1021}$, $x = \sqrt{x}^2 \leq
712^{-1042441}$, a value that is is not representable in double
72precision.
73
74Concerning subnormals in argument, it to be mentioned that still
75$E^\prime$ and $m^\prime$ can be found such that $x = 2^{E^\prime}
76\cdot m$ exactly and $1 \leq m^\prime \leq 2$. Only the extraction
77sequence must be modified: $x$ is first multiplied by $2^{52}$ where
78$E^\prime$ is set to $-52$. The double number $x$ is thus no longer a
79subnormal an integer handling can extract its mantissa easily. The
80extraction of the exponent takes into account the preceeding bias of
81$E^\prime$. The case $x = 0$ is filtered out before. Obviously
82$\sqrt{0} = 0$ is returned for this argument.
83
84The special cases $x < 0$, $x = \pm \infty$ and $x = \nan$ are not
85handled since they can be easily excluded by the code using the square
86root macros.
87
88Special case handling is implemented as follows:
89\begin{lstlisting}[caption={Special case handling},firstnumber=1]
90/* Special case x = 0 */
91if (x == 0) {
92  *resh = x;
93  *resl = 0;
94} else {
95
96  E = 0;
97
98  /* Convert to integer format */
99  xdb.d = x;
100
101  /* Handle subnormal case */
102  if (xdb.i[HI] < 0x00100000) {
103    E = -52;
104    xdb.d *= ((db_number) ((double) SQRTTWO52)).d; 	  /* make x a normal number */
105  }
106
107  /* Extract exponent E and mantissa m */
108  E += (xdb.i[HI]>>20)-1023;
109  xdb.i[HI] = (xdb.i[HI] & 0x000fffff) | 0x3ff00000;
110  m = xdb.d;
111
112  /* Make exponent even */
113  if (E & 0x00000001) {
114    E++;
115    m *= 0.5;    /* Suppose now 1/2 <= m <= 2 */
116  }
117
118  /* Construct sqrt(2^E) = 2^(E/2) */
119  xdb.i[HI] = (E/2 + 1023) << 20;
120  xdb.i[LO] = 0;
121\end{lstlisting}
122
123\subsubsection{Polynomial approximation}
124The reciprocal square root $\hat{r} = \frac{1}{\sqrt{m}}$ is
125approximated in the domain $m \in \left[ \frac{1}{2}; 2 \right]$ by a
126polynomial $p\left( m \right) = \sum\limits_{i=0}^4 c_i \cdot m^i$ of
127degree $4$. The polynomial's coefficients $c_0$ through $c_4$ are
128stored in double precision.  The following values are used:
129\begin{eqnarray*}
130c_0 & = & 2.50385236695888790947606139525305479764938354492188 \\
131c_1 & = & -3.29763389114324168005509818613063544034957885742188  \\
132c_2 & = & 2.75726076139124520736345402838196605443954467773438   \\
133c_3 & = & -1.15233725777933848632983426796272397041320800781250  \\
134c_4 & = & 0.186900066679800969104974228685023263096809387207031
135\end{eqnarray*}
136
137The relative approximation error $\epsilon_{\mbox{\tiny approx}} =
138\frac{p\left( m\right) - \hat{r}}{\hat{r}}$ is bounded by $\left \vert
139\epsilon_{\mbox{\tiny approx}} \right \vert \leq 2^{-8.32}$ for $m \in
140\left[ \frac{1}{2}; 2 \right]$.
141
142The polynomial is evaluated in double precision using Horner's
143scheme. There may be some cancellation in the different steps but the
144relative arithmetical error $\epsilon_{\mbox{\tiny arithpoly}}$ is
145always less in magnitude than $2^{-30}$. This will be shown in more
146detail below.
147
148The code implementing the polynomial approximation reads:
149\begin{lstlisting}[caption={Polynomial approximation},firstnumber=1]
150r0 = SQRTPOLYC0 + m * (SQRTPOLYC1 + m * (SQRTPOLYC2 + m * (SQRTPOLYC3 + m * SQRTPOLYC4)));
151\end{lstlisting}
152So 4 double precision multiplications and 4 additions are needed for computing the
153initial approximation. They can be replaced by 4 FMA instructions, if available.
154
155\subsubsection{Double and double-double Newton iteration}
156The polynomial approximation is then refined using the following iteration scheme:
157$$r_{i+1} = \frac{1}{2} \cdot r_i \cdot (3 - m \cdot r_i^2)$$
158If the arithmetic operations were exact, one would obtain the following error estimate:
159\begin{eqnarray*}
160\epsilon_{i+1} & = & \frac{r_i - \hat{r}}{\hat{r}} \\ & = &
161\frac{\frac{1}{2} \cdot r_i \cdot \left(3 - m \cdot r_i^2\right) -
162\hat{r}}{\hat{r}} \\
163& = & \frac{\frac{1}{2} \cdot \hat{r} \cdot
164\left( 1 + \epsilon_i \right) \cdot \left( 3 - m \cdot \hat{r}^2 \cdot
165\left( 1 + \epsilon_i \right)^2 \right) - \hat{r}}{\hat{r}} \\
166& = & \frac{1}{2} \cdot \left( 1 + \epsilon_i \right) \cdot \left( 3 - m \cdot \frac{1}{m} \cdot \left( 1 +
167\epsilon_i\right)^2 \right) - 1 \\
168& = & \frac{1}{2} \cdot \left( 1 + \epsilon_i \right) \cdot \left( 3 - 1 - 2 \cdot \epsilon_i - \epsilon_i^2
169\right) - 1 \\
170& = & \left( 1 + \epsilon_i \right) \cdot \left( 1 - \epsilon_i - \frac{1}{2} \cdot \epsilon_i^2
171\right) - 1 \\
172& = & 1 - \epsilon_i - \frac{1}{2} \cdot \epsilon_i^2 + \epsilon_i - \epsilon_i^2 - \frac{1}{2} \cdot \epsilon_i^3 - 1\\
173& = & - \frac{3}{2} \cdot \epsilon_i^2 - \frac{1}{2} \cdot \epsilon_i^3
174\end{eqnarray*}
175So the accuracy of the approximation of the reciprocal square root is doubled at each step.
176
177Since the initial accuracy is about $8$ bits, it is possible to iterate two times on pure double precision
178without any considerable loss of accuracy. After the two iterations about $31$ bits will be correct.
179The macro implements therefore:
180\begin{lstlisting}[caption={Newton iteration - double precision steps},firstnumber=1]
181r1 = 0.5 * r0 * (3 - m * (r0 * r0));
182r2 = 0.5 * r1 * (3 - m * (r1 * r1));
183\end{lstlisting}
184For these two iterations, 8 double precision multiplications and 2 additions are needed.
185
186The next iteration steps must be performed in double-double precision
187because the $53$ bit mantissa of a double cannot contain the about
188$60$ bit exact value $m \cdot r_2^2 \approx 1$ before cancellation in
189the substraction with $3$ and the multiplication by $r_2$.
190
191In order to exploit maximally the parallelism in the iteration equation, we rewrite it as
192\begin{eqnarray*}
193r_{3} & = & \frac{1}{2} \cdot r_2 \cdot \left( 3 - m \cdot r_2^2 \right) \\
194& = & \left( r_2 + \frac{1}{2} \cdot r_2 \right) - \frac{1}{2} \cdot \left( m \cdot r_2 \right) \cdot
195\left( r_2 \cdot r_2 \right)
196\end{eqnarray*}
197Since multiplications by integer powers of $2$ are exact, it is
198possible to compute $r_2 + \frac{1}{2} \cdot r_2$ exactly as a
199double-double. Concurrently it is possible to compute $m \cdot r_2$
200and $r_2 \cdot r_2$ exactly as double-doubles by means of an exact
201multiplication.  The multiplication $\left( m \cdot r_2 \right) \cdot
202\left( r_2 \cdot r_2 \right)$ is then implemented as a double-double
203multiplication.  The multiplication by $\frac{1}{2}$ of the value
204obtained is exact and can be performed pairwise on the
205double-double. A final double-double addition leads to $r_3 = \left(
206r_2 + \frac{1}{2} \cdot r_2 \right) - \frac{1}{2} \cdot \left( m \cdot
207r_2 \right) \cdot \left( r_2 \cdot r_2 \right)$. Here, massive
208cancellation is no longer possible since the values added are
209approximately $\frac{3}{2} \cdot r_2$ and $\frac{1}{2} \cdot r_2$.
210
211These steps are implemented as follows:
212\begin{lstlisting}[caption={Newton iteration - first double-double step},firstnumber=1]
213Mul12(&r2Sqh, &r2Sql, r2, r2);    Add12(r2PHr2h, r2PHr2l, r2, 0.5 * r2);
214Mul12(&mMr2h, &mMr2l, m, r2);
215Mul22(&mMr2Ch, &mMr2Cl, mMr2h, mMr2l, r2Sqh, r2Sql);
216
217MHmMr2Ch = -0.5 * mMr2Ch;
218MHmMr2Cl = -0.5 * mMr2Cl;
219
220Add22(&r3h, &r3l, r2PHr2h, r2PHr2l, MHmMr2Ch, MHmMr2Cl);
221\end{lstlisting}
222
223The next iteration step provides enough accuracy for a double-double result.
224We rewrite the basic iteration equation once again as:
225\begin{eqnarray*}
226r_4 & = & \frac{1}{2} \cdot r_3 \cdot \left( 3 - m \cdot r_3^2 \right) \\
227& = & r_3 \cdot \left( \frac{3}{2} - \frac{1}{2} \cdot m \cdot r_3^2 \right) \\
228& = & r_3 \cdot \left( \frac{3}{2} - \frac{1}{2} \cdot \left( \left( m \cdot r_3^2 - 1 \right) + 1 \right) \right) \\
229& = & r_3 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_3^2 - 1 \right) \right)
230\end{eqnarray*}
231Further, we know that $r_3$, stored as a double-double, verifies $r_3
232= \hat{r} \cdot \left( 1 + \epsilon_3 \right)$ with $\left \vert
233\epsilon_3 \right \vert \leq 2^{-60}$. So we check that
234$$m \cdot r_3^2 = m \cdot \hat{r}^2 \cdot \left( 1 +
235\epsilon_3 \right)^2 = 1 + 2 \cdot \epsilon_3 + \epsilon_3^2$$
236Clearly, $\left \vert 2 \cdot \epsilon_3 + \epsilon_3^2 \right \vert <
237\frac{1}{2} \mUlp\left( 1 \right)$. So when squaring $r_{3\hi} + r_{3\lo}$ in double-double precision
238and multiplying it in double-double precision by $m$ produces a
239double-double $mMr3Sq_\hi + mMr3Sq_\lo = m \cdot \left( r_{3\hi} +
240r_{3\lo} \right)^2 \cdot \left( 1 + \epsilon \right)$, $\left \vert
241\epsilon \right \vert \leq 2^{-100}$ such that $mMr3Sq_\hi = 1$ in all
242cases.
243
244So we can implement the iteration equation
245$$r_4 = r_3 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_3^2 - 1 \right) \right)$$
246as follows:
247\begin{lstlisting}[caption={Newton iteration - second double-double step},firstnumber=1]
248Mul22(&r3Sqh, &r3Sql, r3h, r3l, r3h, r3l);
249Mul22(&mMr3Sqh, &mMr3Sql, m, 0, r3Sqh, r3Sql);
250
251Mul22(&r4h, &r4l, r3h, r3l, 1, -0.5 * mMr3Sql);
252\end{lstlisting}
253We since get $r_{4\hi} + r_{4\lo} = \hat{r} \cdot \left( 1 +
254\epsilon_4 \right)$ with $\left \vert \epsilon_4 \right \vert \leq
2552^{-102}$, the accuracy being limited by the accuracy of the last
256double-double multiplication operator.
257
258This approximation is than multiplied by $m$ in double-double
259precision, leading to an approximation $srtm_\hi + srtm_\lo = \sqrt{m}
260\cdot \left( 1 + \epsilon \right)$ with $\left \vert \epsilon \right \vert \leq
2612^{-100}$.
262
263Out of this value, the square root of the initial argument can be
264reconstructed by multiplying by $2^{\frac{E}{2}}$, which has already
265been stored in $xdb.d$. This multiplication is exact because it cannot
266produce a subnormal.
267
268These two steps are implemented as shown below:
269\begin{lstlisting}[caption={Multiplication $m \cdot \hat{r}$, reconstruction},firstnumber=1]
270Mul22(&srtmh,&srtml,m,0,r4h,r4l);
271
272/* Multiply componentwise by sqrt(2^E), which is an integer power of 2 that may not produce a subnormal */
273
274*resh = xdb.d * srtmh;
275*resl = xdb.d * srtml;
276\end{lstlisting}
277
278\subsubsection{Triple-double Newton iteration}
279For producing a triple-double approximate to $\hat{r}$ with an
280accuracy of at least $147$ bits, one more Newton iteration is
281needed. We apply the same equation as in the last double-double step,
282which reads:
283$$r_5 = r_4 \cdot \left( 1 - \frac{1}{2} \cdot \left( m \cdot r_4^2 -
2841 \right) \right)$$ Once again, the first component of the
285triple-double number holding an approximation to $m \cdot r_4^2$ is
286exactly equal to $1$. So by neglecting this component, we substract $1$ from it.
287Unfortunately, a renormalization step is needed after the multiplications for
288squaring $r_4$ and by $m$ because the values computed might be overlapped which would prevent us
289form substracting $1$ by neglecting a component.
290
291We implement thus:
292\begin{lstlisting}[caption={Newton iteration - triple-double step},firstnumber=1]
293Mul23(&r4Sqh, &r4Sqm, &r4Sql, r4h, r4l, r4h, r4l);
294Mul133(&mMr4Sqhover, &mMr4Sqmover, &mMr4Sqlover, m, r4Sqh, r4Sqm, r4Sql);
295Renormalize3(&mMr4Sqh, &mMr4Sqm, &mMr4Sql, mMr4Sqhover, mMr4Sqmover, mMr4Sqlover);
296
297HmMr4Sqm = -0.5 * mMr4Sqm;
298HmMr4Sql = -0.5 * mMr4Sql;
299
300Mul233(&r5h,&r5m,&r5l,r4h,r4l,1,HmMr4Sqm,HmMr4Sql);
301\end{lstlisting}
302
303This approximation $r_{5\hi} + r_{5\mi} + r_{5\lo} = \hat{r} \cdot
304\left( 1 + \epsilon_5 \right)$, where $\left \vert \epsilon_5 \right
305\vert \leq 2^{-147}$ is then multiplied by $m$ in order to obtain a
306triple-double approximation of $\sqrt{m}$. Once renormalized result is
307exactly multiplied by $2^{\frac{E}{2}}$ stored in $xdb.d$.  We
308implement:
309\begin{lstlisting}[caption={Newton iteration - triple-double step},firstnumber=1]
310Mul133(&srtmhover, &srtmmover, &srtmlover,m,r5h,r5m,r5l);
311
312Renormalize3(&srtmh,&srtmm,&srtml,srtmhover,srtmmover,srtmlover);
313
314(*(resh)) = xdb.d * srtmh;
315(*(resm)) = xdb.d * srtmm;
316(*(resl)) = xdb.d * srtml;
317\end{lstlisting}
318
319\subsubsection{Accuracy bounds}
320
321TODO: see possibly available Gappa files meanwhile
322
323
324