.. _using:

Using ball arithmetic
===============================================================================

This section gives an introduction to working with
real numbers in Arb (see :ref:`arb` for the API and technical documentation).
The general principles carry over to complex numbers, polynomials and
matrices.
Ball semantics
-------------------------------------------------------------------------------

Let `f : A \to B` be a function.
A ball implementation of `f` is a function `F` that maps sets `X \subseteq A`
to sets `F(X) \subseteq B` subject to the following rule:

    For all `x \in X`,
    we have `f(x) \in F(X)`.

In other words, `F(X)` is an *enclosure* for the set `\{f(x) : x \in X\}`.
This rule is sometimes called the *inclusion principle*.
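
To make the rule concrete, here is a toy model of ball addition in plain C,
with a ``double`` midpoint and radius. This is only an illustration of the
principle, not Arb's representation (Arb uses arbitrary-precision components),
and the names ``toy_ball``, ``toy_ball_add`` and ``toy_ball_contains`` are
invented for this sketch:

```c
#include <math.h>

/* Toy ball: a double midpoint and radius (illustration only). */
typedef struct { double mid; double rad; } toy_ball;

/* A ball implementation F of f(x, y) = x + y. The computed midpoint
   may be rounded, so a bound for the rounding error is absorbed
   into the radius. */
toy_ball toy_ball_add(toy_ball x, toy_ball y)
{
    toy_ball res;
    res.mid = x.mid + y.mid;
    /* One ulp of the midpoint safely covers the rounding error of
       the double addition; the input radii add, since the errors of
       the two operands need not cancel. */
    res.rad = x.rad + y.rad + ldexp(fabs(res.mid), -52);
    return res;
}

/* Inclusion principle: any point sum x0 + y0 with x0 in x and
   y0 in y must lie inside the output ball. */
int toy_ball_contains(toy_ball b, double point)
{
    return fabs(point - b.mid) <= b.rad;
}
```

Here `[1 \pm 0.25] + [2 \pm 0.5]` yields (a slight enlargement of)
`[3 \pm 0.75]`, which contains every possible point sum such as `3.7`.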

Throughout the documentation (except where otherwise noted),
we will simply write `f(x)` instead of `F(X)`
when describing ball implementations of pointwise-defined mathematical
functions, understanding that the input is a set of point values and that
the output is an enclosure.

It is not possible to represent general subsets of `\mathbb{R}`
on a computer. Instead, we work with subsets of the form
`[m \pm r] = [m-r, m+r]` where the midpoint *m* and radius *r* are binary
floating-point numbers, i.e. numbers of the form `u 2^v` with `u, v \in \mathbb{Z}`
(to make this scheme complete,
we also need to adjoin the special floating-point
values `-\infty`, `+\infty` and `\operatorname{NaN}`).

Given a ball `[m \pm r]` with `m \in \mathbb{R}` (not necessarily a
floating-point number),
we can always round *m* to a nearby floating-point number that has at
most *prec* bits in the component *u*,
and add an upper bound for the rounding error to *r*.
In Arb, ball functions that take a *prec* argument as input
(e.g. :func:`arb_add`) always round their output to *prec* bits.
Some functions are always exact (e.g. :func:`arb_neg`), and thus do not take a *prec* argument.

The programming interface resembles that of GMP.
Each :type:`arb_t` variable must be initialized with :func:`arb_init` before use
(this also sets its value to zero), and deallocated with :func:`arb_clear`
after use. Variables have pass-by-reference semantics.
In the list of arguments to a function, output variables come first,
followed by input variables, and finally the precision:

.. code-block:: c

    #include "arb.h"

    int main()
    {
        arb_t x, y;
        arb_init(x); arb_init(y);
        arb_set_ui(x, 3);       /* x = 3 */
        arb_const_pi(y, 128);   /* y = pi, to 128 bits */
        arb_sub(y, y, x, 53);   /* y = y - x, to 53 bits */
        arb_clear(x); arb_clear(y);
    }

Binary and decimal
-------------------------------------------------------------------------------

While the internal representation uses binary floating-point numbers,
it is usually preferable to print numbers in decimal. The binary-to-decimal
conversion generally requires rounding. Three different methods
are available for printing a number to standard output:

* :func:`arb_print` shows the exact internal representation of a ball, with binary exponents.
* :func:`arb_printd` shows an inexact view of the internal representation, approximated by decimal floating-point numbers.
* :func:`arb_printn` shows a *decimal ball* that is guaranteed to be an enclosure of the binary
  floating-point ball. By default, it only prints digits in the midpoint that are certain to
  be correct, up to an error of at most one unit in the last place.
  Converting from binary to decimal is generally inexact, and the output of this
  method takes this rounding into account when printing the radius.

This snippet computes a 53-bit enclosure of `\pi` and prints it
in three ways:

.. code-block:: c

    arb_const_pi(x, 53);
    arb_print(x); printf("\n");
    arb_printd(x, 20); printf("\n");
    arb_printn(x, 20, 0); printf("\n");

The output is:

.. code-block:: text

    (884279719003555 * 2^-48) +/- (536870913 * 2^-80)
    3.141592653589793116 +/- 4.4409e-16
    [3.141592653589793 +/- 5.61e-16]

The :func:`arb_get_str` and :func:`arb_set_str` methods are useful for
converting rigorously between decimal strings and binary balls
(:func:`arb_get_str` produces the same string as :func:`arb_printn`,
and :func:`arb_set_str` can parse such strings back).

A potential mistake is to create a ball from a ``double`` constant
such as ``2.3``, when this actually represents
``2.29999999999999982236431605997495353221893310546875``.
To produce a ball containing the rational number
`23/10`, one of the following can be used:

.. code-block:: c

    arb_set_str(x, "2.3", prec);

    arb_set_ui(x, 23);
    arb_div_ui(x, x, 10, prec);

    fmpq_set_si(q, 23, 10);   /* q is a FLINT fmpq_t */
    arb_set_fmpq(x, q, prec);
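
The exact value behind the literal ``2.3`` can be verified in plain C,
without Arb. The nearest binary64 value has exactly 50 decimal digits after
the point, so ``"%.50f"`` prints it with no further rounding (the helper name
below is invented for this check):

```c
#include <stdio.h>
#include <string.h>

/* Plain C check (no Arb needed): the literal 2.3 denotes the nearest
   binary64 value, which is exactly
   2.29999999999999982236431605997495353221893310546875. */
int double_23_matches_exact_decimal(void)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%.50f", 2.3);
    return strcmp(buf,
        "2.29999999999999982236431605997495353221893310546875") == 0;
}
```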

Quality of enclosures
-------------------------------------------------------------------------------

The main problem when working with ball arithmetic (or interval arithmetic)
is *overestimation*. In general, the enclosure of a value or set
of values as computed with ball arithmetic will be larger than the smallest
possible enclosure.

Overestimation results naturally from rounding errors and cancellations
in the individual steps of a calculation.
As a general principle, formula rewriting techniques that make
floating-point code more numerically stable also make ball arithmetic code
more numerically stable, in the sense of producing tighter enclosures.

As a result of the *dependency problem*, ball or interval
arithmetic can produce error
bounds that are much larger than the actual numerical errors
resulting from doing floating-point arithmetic.
Consider the expression `(x + 1) - x` as an example.
When evaluated in floating-point
arithmetic, `x` may have a large initial error. However, that error will
cancel itself out in the subtraction, so that the result equals 1
(except perhaps for a small rounding error left from the operation `x + 1`).
In ball arithmetic, dependent errors add up instead of cancelling out.
If `x = [3 \pm 0.1]`, the result will be `[1 \pm 0.2]`, where
the error bound has doubled.
In unfavorable circumstances, error bounds can grow exponentially
with the number of steps.
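
The radius doubling in `(x + 1) - x` can be reproduced with a toy ball type
in plain C (an illustration with invented names, not Arb code; rounding of
the midpoint is ignored for simplicity):

```c
#include <math.h>

/* Toy ball: a double midpoint and radius (illustration only). */
typedef struct { double mid; double rad; } toy_ball;

static toy_ball toy_add(toy_ball x, toy_ball y)
{
    toy_ball r = { x.mid + y.mid, x.rad + y.rad };
    return r;
}

static toy_ball toy_sub(toy_ball x, toy_ball y)
{
    /* The radii still add: the subtraction cannot "see" that both
       operands depend on the same x. */
    toy_ball r = { x.mid - y.mid, x.rad + y.rad };
    return r;
}

/* (x + 1) - x for x = [3 +/- 0.1] gives [1 +/- 0.2]. */
toy_ball dependency_example(void)
{
    toy_ball x = { 3.0, 0.1 };
    toy_ball one = { 1.0, 0.0 };
    return toy_sub(toy_add(x, one), x);
}
```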

If all inputs to a calculation are "point values", i.e.
exact numbers and known mathematical constants that can
be approximated arbitrarily closely (such as `\pi`), then an error
of order `2^n` can typically be overcome by working with *n* extra bits of
precision, increasing the computation time by an amount
that is polynomial in *n*.
In certain situations, however, overestimation leads to exponential
slowdown or even failure of an algorithm to converge.
For example, root-finding algorithms that refine the result iteratively
may fail to converge in ball arithmetic, even if they do converge in plain
floating-point arithmetic.

Therefore, ball arithmetic is not a silver bullet: there will always
be situations where some amount of numerical or mathematical analysis
is required. Some experimentation may be required to find whether
(and how) it can be used effectively for a given problem.

Predicates
-------------------------------------------------------------------------------

A ball implementation of a predicate
`f : \mathbb{R} \to \{\operatorname{True}, \operatorname{False}\}`
would need to be able to return a third logical value indicating
that the result could be either True or False.
In most cases, predicates in Arb are implemented as
functions that return the *int* value 1 to indicate that the
result certainly is True, and the *int* value 0 to indicate
that the result could be either True or False.
To test whether a predicate certainly is False, the user must
test whether the negated predicate certainly is True.

For example, the following code would *not* be correct in general:

.. code-block:: c

    if (arb_is_positive(x))
    {
        ...  /* do things assuming that x > 0 */
    }
    else
    {
        ...  /* do things assuming that x <= 0 */
    }

Instead, the following can be used:

.. code-block:: c

    if (arb_is_positive(x))
    {
        ...  /* do things assuming that x > 0 */
    }
    else if (arb_is_nonpositive(x))
    {
        ...  /* do things assuming that x <= 0 */
    }
    else
    {
        ... /* do things assuming that the sign of x is unknown */
    }

Likewise, we will write `x \le y` in mathematical notation with the meaning
that `x \le y` holds for all `x \in X, y \in Y` where `X` and `Y` are balls.

Note that some predicates such as :func:`arb_overlaps` and :func:`arb_contains`
actually are predicates on balls viewed as sets, and not ball implementations
of pointwise predicates.

Some predicates are also complementary.
For example :func:`arb_contains_zero` tests whether the input ball
contains the point zero.
Negated, it is equivalent to :func:`arb_is_nonzero`,
and complementary to :func:`arb_is_zero` as a pointwise predicate:

.. code-block:: c

    if (arb_is_zero(x))
    {
        ...  /* do things assuming that x = 0 */
    }
    #if 1
    else if (arb_is_nonzero(x))
    #else
    else if (!arb_contains_zero(x))      /* equivalent */
    #endif
    {
        ...  /* do things assuming that x != 0 */
    }
    else
    {
        ... /* do things not knowing whether x = 0 */
    }

A worked example: the sine function
-------------------------------------------------------------------------------

We implement the function `\sin(x)` naively using
the Taylor series `\sum_{k=0}^{\infty} (-1)^k x^{2k+1} / (2k+1)!`
and :type:`arb_t` arithmetic.
Since there are infinitely many terms, we need to split the series
in two parts: a finite sum that can be evaluated directly, and
a tail that has to be bounded.

We stop as soon as we reach a term `t` bounded by `|t| \le 2^{-prec} < 1`.
The terms are alternating and must have decreasing magnitude
from that point, so the tail of the series
is bounded by `|t|`. We add this magnitude to the radius
of the output. Since ball arithmetic automatically bounds the numerical errors
resulting from all arithmetic operations, the output *res* is a
ball guaranteed to contain `\sin(x)`.

.. code-block:: c

    #include "arb.h"

    void arb_sin_naive(arb_t res, const arb_t x, slong prec)
    {
        arb_t s, t, u, tol;
        slong k;
        arb_init(s); arb_init(t); arb_init(u); arb_init(tol);

        arb_one(tol);
        arb_mul_2exp_si(tol, tol, -prec);  /* tol = 2^-prec */

        for (k = 0; ; k++)
        {
            arb_pow_ui(t, x, 2 * k + 1, prec);
            arb_fac_ui(u, 2 * k + 1, prec);
            arb_div(t, t, u, prec);  /* t = x^(2k+1) / (2k+1)! */

            arb_abs(u, t);
            if (arb_le(u, tol))   /* if |t| <= 2^-prec */
            {
                arb_add_error(s, u);    /* add |t| to the radius and stop */
                break;
            }

            if (k % 2 == 0)
                arb_add(s, s, t, prec);
            else
                arb_sub(s, s, t, prec);
        }

        arb_set(res, s);
        arb_clear(s); arb_clear(t); arb_clear(u); arb_clear(tol);
    }

This algorithm is naive, because the Taylor series is slow to converge
and suffers from catastrophic cancellation when `|x|` is large
(we could also improve the efficiency of the code slightly by
computing the terms using recurrence relations instead of
computing `x^k` and `k!` from scratch each iteration).

As a test, we compute `\sin(2016.1)`.
The largest term in the Taylor series for `\sin(x)` reaches
a magnitude of about `x^x / x!`, or about `10^{873}` in this case.
Therefore, we need over 873 digits (about 3000 bits) of precision
to overcome the catastrophic cancellation and determine
the result with sufficient accuracy to tell whether it is positive
or negative.

.. code-block:: c

    int main()
    {
        arb_t x, y;
        slong prec;
        arb_init(x); arb_init(y);

        for (prec = 64; ; prec *= 2)
        {
            arb_set_str(x, "2016.1", prec);
            arb_sin_naive(y, x, prec);
            printf("Using %5ld bits, sin(x) = ", prec);
            arb_printn(y, 10, 0); printf("\n");
            if (!arb_contains_zero(y))  /* stopping condition */
                break;
        }

        arb_clear(x); arb_clear(y);
    }

The program produces the following output:

.. code-block:: text

    Using    64 bits, sin(x) = [+/- 2.67e+859]
    Using   128 bits, sin(x) = [+/- 1.30e+840]
    Using   256 bits, sin(x) = [+/- 3.60e+801]
    Using   512 bits, sin(x) = [+/- 3.01e+724]
    Using  1024 bits, sin(x) = [+/- 2.18e+570]
    Using  2048 bits, sin(x) = [+/- 1.22e+262]
    Using  4096 bits, sin(x) = [-0.7190842207 +/- 1.20e-11]

As an exercise, the reader may improve the naive algorithm by making it
subtract a well-chosen multiple of `2 \pi` from `x` before invoking
the Taylor series (hint: use :func:`arb_const_pi`, :func:`arb_div`
and :func:`arf_get_fmpz`).
If done correctly, 64 bits of precision should be more than enough to
compute `\sin(2016.1)`, and with minor adjustments
to the code, the user should be able to compute
`\sin(\exp(2016.1))` quite easily as well.

This example illustrates how ball arithmetic can be used to perform
nontrivial calculations. To evaluate an infinite series, the user
needs to know how to bound the tail of the series, but everything
else is automatic.
When evaluating a finite formula that can be expressed
completely using built-in functions, all error bounding is automatic
from the point of view of the user.
In particular, the :func:`arb_sin` method should be used to compute the sine
of a real number; it uses a much more efficient algorithm
than the naive code above.

This example also illustrates the "guess-and-verify" paradigm:
instead of determining *a priori* the floating-point precision necessary
to get a correct result, we *guess* some initial precision, use ball arithmetic
to *verify* that the result is accurate enough, and restart with
higher precision (or signal failure) if it is not.

If we think of rounding errors as essentially random processes,
then a floating-point computation is analogous to a
*Monte Carlo algorithm*. Using ball arithmetic to get a verified result
effectively turns it into the analog of a *Las Vegas algorithm*,
which is a randomized algorithm that always gives a correct result if it terminates, but
may fail to terminate (alternatively, instead of actually looping forever,
it might signal failure after a certain number of iterations).

The loop will fail to terminate if we attempt to determine the sign of
`\sin(\pi)`:

.. code-block:: text

    Using    64 bits, sin(x) = [+/- 3.96e-18]
    Using   128 bits, sin(x) = [+/- 2.17e-37]
    Using   256 bits, sin(x) = [+/- 6.10e-76]
    Using   512 bits, sin(x) = [+/- 5.13e-153]
    Using  1024 bits, sin(x) = [+/- 4.01e-307]
    Using  2048 bits, sin(x) = [+/- 2.13e-615]
    Using  4096 bits, sin(x) = [+/- 6.85e-1232]
    Using  8192 bits, sin(x) = [+/- 6.46e-2465]
    Using 16384 bits, sin(x) = [+/- 5.09e-4931]
    Using 32768 bits, sin(x) = [+/- 5.41e-9863]
    ...

The sign of a nonzero real number can be
decided by computing it to sufficiently high accuracy, but the sign
of an expression that is exactly equal to zero cannot be decided
by a numerical computation unless the entire computation
happens to be exact (in this example, we could use the :func:`arb_sin_pi`
function which computes `\sin(\pi x)` in one step, with the input `x = 1`).

It is up to the user to implement a stopping criterion appropriate for
the circumstances of a given application. For example, breaking
when it is clear that `|\sin(x)| < 10^{-10000}` would allow the program
to terminate and convey some meaningful information about the input `x = \pi`,
though this would not constitute a mathematical proof that
`\sin(\pi) = 0`.

More on precision and accuracy
-------------------------------------------------------------------------------

The relation between the working precision and the accuracy of the output
is not always easy to predict. The following remarks might help
to choose *prec* optimally.

For a ball `[m \pm r]` it is convenient to define the following notions:

* Absolute error: `e_{abs} = |r|`
* Relative error: `e_{rel} = |r| / \max(0, |m| - |r|)` (or `e_{rel} = 0` if `r = m = 0`)
* Absolute accuracy: `a_{abs} = 1 / e_{abs}`
* Relative accuracy: `a_{rel} = 1 / e_{rel}`

Expressed in bits, one takes the corresponding `\log_2` values.

Of course, if `x` is the exact value being approximated, then
the "absolute error" so defined is an upper bound for the
actual absolute error `|x-m|` and the "absolute accuracy"
a lower bound for `1/|x-m|`, etc.

The *prec* argument in Arb should be thought of as controlling
the working precision.
Generically, when evaluating a fixed expression (that is, when the
sequence of operations does not depend on the precision), the
absolute or relative error will be bounded by

.. math ::

    2^{O(1) - prec}

where the `O(1)` term depends on the expression and implementation
details of the ball functions used to evaluate it.
Accordingly, for an accuracy of *p* bits, we need to use a working precision
of `p + O(1)` bits.
If the expression is numerically well-behaved, then the `O(1)` term
will be small, which leads to the heuristic of "adding a few guard bits"
(for most basic calculations, 10 or 20 guard bits is enough).
If the `O(1)` term is unknown, then increasing the number of guard
bits in exponential steps until the result is accurate enough
is generally a good heuristic.

Sometimes, a partially accurate result can be used to estimate the `O(1)`
term. For example, if the goal is to achieve 100 bits of accuracy
and a precision of 120 bits yields 80 bits of accuracy, then
it is plausible that a precision of just over
140 bits yields 100 bits of accuracy.
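
The estimation in this example amounts to a one-line heuristic, sketched
here with an invented name (a plausible guess, not a guarantee, since the
`O(1)` term need not stay constant as the precision changes):

```c
/* Heuristic: if a trial precision trial_prec gave trial_acc accurate
   bits, the O(1) overhead is about trial_prec - trial_acc, so retry
   with roughly goal + overhead bits (plus a few guard bits). */
long estimate_precision(long goal, long trial_prec, long trial_acc)
{
    long overhead = trial_prec - trial_acc;
    return goal + overhead;
}
```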

Built-in functions in Arb can roughly be characterized as
belonging to one of two extremes (though there is actually a spectrum):

* Simple operations, including basic arithmetic operations and many
  elementary functions. In most cases, for an input `x = [m \pm r]`,
  `f(x)` is evaluated by computing `f(m)` and then separately bounding the
  *propagated error* `|f(m) - f(m + \varepsilon)|, |\varepsilon| \le r`.
  The working precision is automatically increased internally
  so that `f(m)` is computed to *prec* bits of relative accuracy
  with an error of at most a few units in the last place (perhaps with
  rare exceptions).
  The propagated error can generally be bounded quite tightly as well (see :ref:`general_formulas`).
  As a result, the enclosure will be close to the best possible
  at the given precision, and the user can estimate the precision to use
  accordingly.

* Complex operations, such as certain higher
  transcendental functions (for example, the Riemann zeta function).
  The function is evaluated by performing a sequence of simpler operations,
  each using ball arithmetic with a working precision of roughly *prec*
  bits. The sequence of operations might depend on *prec*;
  for example, an infinite series might be truncated
  so that the remainder is smaller than `2^{-prec}`.
  The final result can be far from tight, and it is not guaranteed
  that the error converges to zero as `prec \to \infty`, though
  in practice, it should do so in most cases.

In short, the *inclusion principle* is the fundamental contract in Arb.
Enclosures computed by built-in functions may or may not be tight
enough to be useful, but the hope is that they will be sufficient
for most purposes.
Tightening the error bounds for more complex operations is a long-term
optimization goal, which in many cases will require a
fair amount of research.
A tradeoff also has to be made for efficiency: tighter error bounds
allow the user to work with a lower precision, but they may
also be much more expensive to compute.

Polynomial time guarantee
-------------------------------------------------------------------------------

Arb provides a soft guarantee that the time used to evaluate a ball
function will depend polynomially on *prec* and the bit size
of the input, uniformly regardless of the numerical value of the input.

The idea behind this soft guarantee is to allow Arb to be used as a
black box to evaluate expressions numerically without potentially
slowing down, hanging indefinitely or crashing
because of "bad" input such as nested exponentials.
By controlling the precision, the user can cancel
a computation before it uses up
an unreasonable amount of resources,
without having to rely on other timeout or exception mechanisms.
A result that is feasible but very expensive to compute
can still be forced by setting the precision high enough.

As motivation, consider evaluating `\sin(x)` or `\exp(x)` with
the exact floating-point number
`x = 2^{2^n}` as input.
The time and space required to compute an accurate floating-point
approximation of `\sin(x)` or `\exp(x)` increases as `2^n`,
in the first case because of the need to subtract an accurate
multiple of `2\pi` and in the second case due to the size of the
output exponent and the internal subtraction of an accurate multiple of `\log(2)`.
This is despite the fact that the size of `x` as an object in memory only
increases linearly with `n`.
Already `n = 33` would require at least 1 GB of memory, and
`n = 100` would be physically impossible to process.
For functions that are computed by direct use of power series expansions,
e.g. `f(x) = \sum_{k=0}^{\infty} c_k x^k`,
without fast argument-reduction techniques
like those for elementary functions,
the time would be exponential in `n` already when `x = 2^n`.

Therefore, Arb caps internal work parameters
(the internal working precision,
the number of terms of an infinite series to add, etc.) by polynomial,
usually linear, functions of *prec*.
When the limit is exceeded, the output is set to a crude bound.
For example, if `x` is too large, :func:`arb_sin` will
simply return `[\pm 1]`, and :func:`arb_exp`
will simply return `[\pm \infty]` if `x` is positive
or `[\pm 2^{-m}]` for some large `m` if `x` is negative.

This is not just a failsafe, but occasionally a useful optimization.
It is not entirely uncommon to have formulas where one term
is modest and another term decreases exponentially, such as:

.. math ::

    \log(x) + \sin(x) \exp(-x).

For example, the reflection formula of the digamma function has
a similar structure.
When `x` is large, the right term would be expensive to compute
to high relative accuracy. Doing so is unnecessary, however,
since a crude bound of `[\pm 1] \cdot [\pm 2^{-m}]` is enough to evaluate
the expression as a whole accurately.

The polynomial time guarantee is "soft" in that there are a few exceptions.
For example, the complexity of computing the Riemann zeta function
`\zeta(\sigma+it)` increases linearly with the imaginary height `|t|`
in the current implementation, and all known algorithms
have a complexity of `|t|^{\alpha}` where the best known value for `\alpha`
is about `0.3`.
Input with large `|t|` is most likely to be given deliberately
by users with the explicit intent of evaluating the zeta
function itself, so the evaluation is not cut off automatically.