1Statistics (`scipy.stats`)
2==========================
3
4.. sectionauthor:: Travis E. Oliphant
5.. sectionauthor:: Josef Perktold
6.. sectionauthor:: Nicky van Foreest
7.. sectionauthor:: Sambit Panda
8.. sectionauthor:: Pamphile T. Roy (@tupui)
9
10.. currentmodule:: scipy
11
12Introduction
13------------
14
15In this tutorial, we discuss many, but certainly not all, features of
16``scipy.stats``. The intention here is to provide a user with a
17working knowledge of this package. We refer to the
18:ref:`reference manual <statsrefmanual>` for further details.
19
Note: This documentation is a work in progress.
21
22.. toctree::
23   :maxdepth: 1
24
25   stats/discrete
26   stats/continuous
27
28
29Random variables
30----------------
31
32There are two general distribution classes that have been implemented
33for encapsulating :ref:`continuous random variables
34<continuous-random-variables>` and :ref:`discrete random variables
<discrete-random-variables>`. More than 100 continuous random variables
(RVs) and more than 15 discrete random variables have been implemented using
37these classes. Besides this, new routines and distributions can be
38easily added by the end user. (If you create one, please contribute it.)
39
40All of the statistics functions are located in the sub-package
41:mod:`scipy.stats` and a fairly complete listing of these functions
42can be obtained using ``info(stats)``. The list of the random
43variables available can also be obtained from the docstring for the
44stats sub-package.
45
46In the discussion below, we mostly focus on continuous RVs. Nearly everything
47also applies to discrete variables, but we point out some differences
48here: :ref:`discrete_points_label`.
49
50In the code samples below, we assume that the :mod:`scipy.stats` package
51is imported as
52
53    >>> from scipy import stats
54
55and in some cases we assume that individual objects are imported as
56
57    >>> from scipy.stats import norm
58
59Getting help
60^^^^^^^^^^^^
61
First of all, all distributions are accompanied by help
63functions. To obtain just some basic information, we print the relevant
64docstring: ``print(stats.norm.__doc__)``.
65
To find the support, i.e., the upper and lower bounds of the distribution,
67call:
68
69    >>> print('bounds of distribution lower: %s, upper: %s' % norm.support())
70    bounds of distribution lower: -inf, upper: inf
71
72We can list all methods and properties of the distribution with
``dir(norm)``. As it turns out, some of the methods are private,
even though they are not named as such (their names do not start
with a leading underscore); for example, ``veccdf`` is only available
for internal calculation (such methods will give warnings when one tries to
use them, and will be removed at some point).
78
79To obtain the *real* main methods, we list the methods of the frozen
80distribution. (We explain the meaning of a `frozen` distribution
81below).
82
83    >>> rv = norm()
84    >>> dir(rv)  # reformatted
85    ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
86     '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__',
87     '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__',
88     '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
89     '__str__', '__subclasshook__', '__weakref__', 'a', 'args', 'b', 'cdf',
90     'dist', 'entropy', 'expect', 'interval', 'isf', 'kwds', 'logcdf',
91     'logpdf', 'logpmf', 'logsf', 'mean', 'median', 'moment', 'pdf', 'pmf',
92     'ppf', 'random_state', 'rvs', 'sf', 'stats', 'std', 'var']
93
Finally, we can obtain the list of available distributions through
95introspection:
96
97    >>> dist_continu = [d for d in dir(stats) if
98    ...                 isinstance(getattr(stats, d), stats.rv_continuous)]
99    >>> dist_discrete = [d for d in dir(stats) if
100    ...                  isinstance(getattr(stats, d), stats.rv_discrete)]
101    >>> print('number of continuous distributions: %d' % len(dist_continu))
102    number of continuous distributions: 104
103    >>> print('number of discrete distributions:   %d' % len(dist_discrete))
104    number of discrete distributions:   19
105
106Common methods
107^^^^^^^^^^^^^^
108
The main public methods for continuous RVs are:
110
111* rvs:   Random Variates
112* pdf:   Probability Density Function
113* cdf:   Cumulative Distribution Function
114* sf:    Survival Function (1-CDF)
115* ppf:   Percent Point Function (Inverse of CDF)
116* isf:   Inverse Survival Function (Inverse of SF)
117* stats: Return mean, variance, (Fisher's) skew, or (Fisher's) kurtosis
118* moment: non-central moments of the distribution
119
120
121Let's take a normal RV as an example.
122
123    >>> norm.cdf(0)
124    0.5
125
126To compute the ``cdf`` at a number of points, we can pass a list or a numpy array.
127
128    >>> norm.cdf([-1., 0, 1])
129    array([ 0.15865525,  0.5,  0.84134475])
130    >>> import numpy as np
131    >>> norm.cdf(np.array([-1., 0, 1]))
132    array([ 0.15865525,  0.5,  0.84134475])
133
134Thus, the basic methods, such as `pdf`, `cdf`, and so on, are vectorized.
135
136Other generally useful methods are supported too:
137
138    >>> norm.mean(), norm.std(), norm.var()
139    (0.0, 1.0, 1.0)
140    >>> norm.stats(moments="mv")
141    (array(0.0), array(1.0))
142
143To find the median of a distribution, we can use the percent point
144function ``ppf``, which is the inverse of the ``cdf``:
145
146    >>> norm.ppf(0.5)
147    0.0
148
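The survival function ``sf`` is the complement of the ``cdf`` and ``isf`` is
its inverse; as a small numerical check (using `numpy.allclose` to allow for
rounding in the last digits):

    >>> np.allclose(norm.sf(1.), 1 - norm.cdf(1.))
    True
    >>> np.allclose(norm.isf(norm.sf(1.)), 1.)
    True
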
149To generate a sequence of random variates, use the ``size`` keyword
150argument:
151
152    >>> norm.rvs(size=3)
153    array([-0.35687759,  1.34347647, -0.11710531])   # random
154
155Don't think that ``norm.rvs(5)`` generates 5 variates:
156
157    >>> norm.rvs(5)
158    5.471435163732493  # random
159
160Here, ``5`` with no keyword is being interpreted as the first possible
161keyword argument, ``loc``, which is the first of a pair of keyword arguments
162taken by all continuous distributions.
163This brings us to the topic of the next subsection.
164
165Random number generation
166^^^^^^^^^^^^^^^^^^^^^^^^
167
Drawing random numbers relies on generators from the `numpy.random` package.
169In the examples above, the specific stream of
170random numbers is not reproducible across runs. To achieve reproducibility,
171you can explicitly *seed* a random number generator. In NumPy, a generator
172is an instance of `numpy.random.Generator`. Here is the canonical way to create
173a generator:
174
175    >>> from numpy.random import default_rng
176    >>> rng = default_rng()
177
178And fixing the seed can be done like this:
179
180    >>> # do NOT copy this value
181    >>> rng = default_rng(301439351238479871608357552876690613766)
182
183.. warning:: Do not use this number or common values such as 0. Using just a
184             small set of seeds to instantiate larger state spaces means that
185             there are some initial states that are impossible to reach. This
186             creates some biases if everyone uses such values. A good way to
187             get a seed is to use a `numpy.random.SeedSequence`:
188
189             >>> from numpy.random import SeedSequence
190             >>> print(SeedSequence().entropy)
191             301439351238479871608357552876690613766  # random
192
The `random_state` parameter in distributions accepts an instance of the
`numpy.random.Generator` class, or an integer, which is then used to
195seed an internal ``Generator`` object:
196
197    >>> norm.rvs(size=5, random_state=rng)
198    array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873])  # random
199
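As a small illustration of reproducibility (a sketch only; the particular
seed value is irrelevant), two generators created from the same seed produce
identical variates:

    >>> from numpy.random import SeedSequence, default_rng
    >>> seed = SeedSequence().entropy
    >>> sample_a = norm.rvs(size=3, random_state=default_rng(seed))
    >>> sample_b = norm.rvs(size=3, random_state=default_rng(seed))
    >>> np.allclose(sample_a, sample_b)
    True
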
200For further info, see `NumPy's documentation
201<https://numpy.org/doc/stable/reference/random/index.html>`__.
202
203Shifting and scaling
204^^^^^^^^^^^^^^^^^^^^
205
206All continuous distributions take ``loc`` and ``scale`` as keyword
207parameters to adjust the location and scale of the distribution,
208e.g., for the standard normal distribution, the location is the mean and
209the scale is the standard deviation.
210
211    >>> norm.stats(loc=3, scale=4, moments="mv")
212    (array(3.0), array(16.0))
213
214In many cases, the standardized distribution for a random variable ``X``
215is obtained through the transformation ``(X - loc) / scale``. The
216default values are ``loc = 0`` and ``scale = 1``.
217
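As a quick numerical check of this identity (a small sketch; the evaluation
point ``pt`` is arbitrary):

    >>> pt = 1.0
    >>> np.allclose(norm.cdf(pt, loc=3, scale=4), norm.cdf((pt - 3) / 4))
    True
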
218Smart use of ``loc`` and ``scale`` can help modify the standard
219distributions in many ways. To illustrate the scaling further, the
220``cdf`` of an exponentially distributed RV with mean :math:`1/\lambda`
221is given by
222
223.. math::
224
225    F(x) = 1 - \exp(-\lambda x)
226
By applying the scaling rule above, it can be seen that taking
``scale = 1./lambda`` gives the proper scale.
229
230    >>> from scipy.stats import expon
231    >>> expon.mean(scale=3.)
232    3.0
233
234.. note:: Distributions that take shape parameters may
235   require more than simple application of ``loc`` and/or
236   ``scale`` to achieve the desired form. For example, the
237   distribution of 2-D vector lengths given a constant vector
238   of length :math:`R` perturbed by independent N(0, :math:`\sigma^2`)
239   deviations in each component is
240   rice(:math:`R/\sigma`, scale= :math:`\sigma`). The first argument
241   is a shape parameter that needs to be scaled along with :math:`x`.
242
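A small simulation sketch of this claim (the values of :math:`R` and
:math:`\sigma` below are arbitrary and the comparison is only approximate):

    >>> from scipy.stats import rice
    >>> R, sigma = 3.0, 0.5
    >>> rng_rice = np.random.default_rng()
    >>> # 2-D points scattered around the constant vector (R, 0)
    >>> xy = rng_rice.normal(loc=[R, 0.], scale=sigma, size=(100000, 2))
    >>> lengths = np.hypot(xy[:, 0], xy[:, 1])
    >>> np.allclose(lengths.mean(), rice.mean(R / sigma, scale=sigma), rtol=1e-2)
    True
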
243The uniform distribution is also interesting:
244
245    >>> from scipy.stats import uniform
246    >>> uniform.cdf([0, 1, 2, 3, 4, 5], loc=1, scale=4)
247    array([ 0.  ,  0.  ,  0.25,  0.5 ,  0.75,  1.  ])
248
249
250Finally, recall from the previous paragraph that we are left with the
problem of the meaning of ``norm.rvs(5)``. As it turns out, when a
distribution is called like this, the first argument, i.e., the 5, is passed
to set the ``loc`` parameter. Let's see:
254
255    >>> np.mean(norm.rvs(5, size=500))
256    5.0098355106969992  # random
257
258Thus, to explain the output of the example of the last section:
259``norm.rvs(5)`` generates a single normally distributed random variate with
260mean ``loc=5``, because of the default ``size=1``.
261
262We recommend that you set ``loc`` and ``scale`` parameters explicitly, by
263passing the values as keywords rather than as arguments. Repetition
264can be minimized when calling more than one method of a given RV by
265using the technique of `Freezing a Distribution`_, as explained below.
266
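For example, an explicit version of the call above would be (reusing the
seeded ``rng`` generator from before; the output is random):

    >>> norm.rvs(loc=5, scale=1, size=3, random_state=rng)
    array([5.47143516, 3.80902431, 6.43270697])  # random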
267
268Shape parameters
269^^^^^^^^^^^^^^^^
270
271While a general continuous random variable can be shifted and scaled
272with the ``loc`` and ``scale`` parameters, some distributions require
273additional shape parameters. For instance, the gamma distribution with density
274
275.. math::
276
277    \gamma(x, a) = \frac{\lambda (\lambda x)^{a-1}}{\Gamma(a)} e^{-\lambda x}\;,
278
requires the shape parameter :math:`a`. Observe that the effect of setting
:math:`\lambda` can be obtained by setting the ``scale`` keyword to
281:math:`1/\lambda`.
282
283Let's check the number and name of the shape parameters of the gamma
284distribution. (We know from the above that this should be 1.)
285
286    >>> from scipy.stats import gamma
287    >>> gamma.numargs
288    1
289    >>> gamma.shapes
290    'a'
291
292Now, we set the value of the shape variable to 1 to obtain the
exponential distribution, so that we can easily check whether we get the
results we expect.
295
296    >>> gamma(1, scale=2.).stats(moments="mv")
297    (array(2.0), array(4.0))
298
299Notice that we can also specify shape parameters as keywords:
300
301   >>> gamma(a=1, scale=2.).stats(moments="mv")
302   (array(2.0), array(4.0))
303
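As an additional check (hedged with `numpy.allclose`, since the two
implementations may differ in the last few digits), the gamma distribution
with ``a=1`` indeed coincides with the exponential distribution:

    >>> pts = np.linspace(0, 10, 5)
    >>> np.allclose(gamma.pdf(pts, a=1, scale=2.), expon.pdf(pts, scale=2.))
    True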
304
305Freezing a distribution
306^^^^^^^^^^^^^^^^^^^^^^^
307
308Passing the ``loc`` and ``scale`` keywords time and again can become
quite bothersome. The concept of `freezing` an RV is used to
310solve such problems.
311
312    >>> rv = gamma(1, scale=2.)
313
314By using ``rv`` we no longer have to include the scale or the shape
parameters. Thus, distributions can be used in one of two
316ways, either by passing all distribution parameters to each method
317call (such as we did earlier) or by freezing the parameters for the
318instance of the distribution. Let us check this:
319
320    >>> rv.mean(), rv.std()
321    (2.0, 2.0)
322
323This is, indeed, what we should get.
324
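Every method of the frozen ``rv`` simply delegates to the corresponding
``gamma`` method with the stored parameters, so, for example:

    >>> np.allclose(rv.cdf(3.), gamma.cdf(3., a=1, scale=2.))
    True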
325
326Broadcasting
327^^^^^^^^^^^^
328
329The basic methods ``pdf``, and so on, satisfy the usual numpy broadcasting rules. For
330example, we can calculate the critical values for the upper tail of
331the t distribution for different probabilities and degrees of freedom.
332
333    >>> stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])
334    array([[ 1.37218364,  1.81246112,  2.76376946],
335           [ 1.36343032,  1.79588482,  2.71807918]])
336
337Here, the first row contains the critical values for 10 degrees of freedom
338and the second row for 11 degrees of freedom (d.o.f.). Thus, the
broadcasting rules give the same result as calling ``isf`` twice:
340
341    >>> stats.t.isf([0.1, 0.05, 0.01], 10)
342    array([ 1.37218364,  1.81246112,  2.76376946])
343    >>> stats.t.isf([0.1, 0.05, 0.01], 11)
344    array([ 1.36343032,  1.79588482,  2.71807918])
345
If the array of probabilities, i.e., ``[0.1, 0.05, 0.01]``, and the
array of degrees of freedom, i.e., ``[10, 11, 12]``, have the same
348array shape, then element-wise matching is used. As an example, we can
349obtain the 10% tail for 10 d.o.f., the 5% tail for 11 d.o.f. and the
3501% tail for 12 d.o.f. by calling
351
352    >>> stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
353    array([ 1.37218364,  1.79588482,  2.68099799])
354
355
356.. _discrete_points_label:
357
358Specific points for discrete distributions
359^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
360
361Discrete distributions have mostly the same basic methods as the
continuous distributions. However, ``pdf`` is replaced by the probability
mass function ``pmf``, no estimation methods, such as ``fit``, are
364available, and ``scale`` is not a valid keyword parameter. The
365location parameter, keyword ``loc``, can still be used to shift the
366distribution.
367
368The computation of the cdf requires some extra attention. In the case
of a continuous distribution, the cumulative distribution function is, in
most standard cases, strictly monotonically increasing on the bounds (a, b)
371and has, therefore, a unique inverse. The cdf of a discrete
372distribution, however, is a step function, hence the inverse cdf,
373i.e., the percent point function, requires a different definition:
374
375::
376
377    ppf(q) = min{x : cdf(x) >= q, x integer}
378
379For further info, see the docs `here
380<https://docs.scipy.org/doc/scipy/reference/tutorial/stats/discrete.html#percent-point-function-inverse-cdf>`__.
381
382
We can look at the hypergeometric distribution as an example:
384
385    >>> from scipy.stats import hypergeom
386    >>> [M, n, N] = [20, 7, 12]
387
388If we use the cdf at some integer points and then evaluate the ppf at those
389cdf values, we get the initial integers back, for example
390
391    >>> x = np.arange(4) * 2
392    >>> x
393    array([0, 2, 4, 6])
394    >>> prb = hypergeom.cdf(x, M, n, N)
395    >>> prb
396    array([  1.03199174e-04,   5.21155831e-02,   6.08359133e-01,
397             9.89783282e-01])
398    >>> hypergeom.ppf(prb, M, n, N)
399    array([ 0.,  2.,  4.,  6.])
400
401If we use values that are not at the kinks of the cdf step function, we get
402the next higher integer back:
403
404    >>> hypergeom.ppf(prb + 1e-8, M, n, N)
405    array([ 1.,  3.,  5.,  7.])
406    >>> hypergeom.ppf(prb - 1e-8, M, n, N)
407    array([ 0.,  2.,  4.,  6.])
408
409
410Fitting distributions
411^^^^^^^^^^^^^^^^^^^^^
412
The main additional methods of the non-frozen distribution are related
to the estimation of distribution parameters (a short example of ``fit``
follows this list):
415
416* fit:   maximum likelihood estimation of distribution parameters, including location
417         and scale
418* fit_loc_scale: estimation of location and scale when shape parameters are given
419* nnlf:  negative log likelihood function
420* expect: calculate the expectation of a function against the pdf or pmf
421
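For example, a minimal sketch of ``fit`` on normally distributed data
(reusing the ``rng`` generator from above; the estimates are random and
should land close to the true values ``loc=3`` and ``scale=2``):

    >>> data = stats.norm.rvs(loc=3., scale=2., size=1000, random_state=rng)
    >>> loc_hat, scale_hat = stats.norm.fit(data)
    >>> print('estimated loc = %6.4f, scale = %6.4f' % (loc_hat, scale_hat))
    estimated loc = 2.9617, scale = 1.9646  # random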
422
423.. _performance_issues_label:
424
425Performance issues and cautionary remarks
426^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
427
428The performance of the individual methods, in terms of speed, varies
429widely by distribution and method. The results of a method are
430obtained in one of two ways: either by explicit calculation, or by a
431generic algorithm that is independent of the specific distribution.
432
433Explicit calculation, on the one hand, requires that the method is
434directly specified for the given distribution, either through analytic
435formulas or through special functions in ``scipy.special`` or
436``numpy.random`` for ``rvs``. These are usually relatively fast
437calculations.
438
439The generic methods, on the other hand, are used if the distribution
440does not specify any explicit calculation. To define a distribution,
441only one of pdf or cdf is necessary; all other methods can be derived
442using numeric integration and root finding. However, these indirect
443methods can be `very` slow. As an example, ``rgh =
444stats.gausshyper.rvs(0.5, 2, 2, 2, size=100)`` creates random
445variables in a very indirect way and takes about 19 seconds for 100
446random variables on my computer, while one million random variables
447from the standard normal or from the t distribution take just above
448one second.
449
450
451Remaining issues
452^^^^^^^^^^^^^^^^
453
454The distributions in ``scipy.stats`` have recently been corrected and improved
455and gained a considerable test suite; however, a few issues remain:
456
457* The distributions have been tested over some range of parameters;
458  however, in some corner ranges, a few incorrect results may remain.
459* The maximum likelihood estimation in `fit` does not work with
460  default starting parameters for all distributions and the user
461  needs to supply good starting parameters. Also, for some
  distributions, using a maximum likelihood estimator might
463  inherently not be the best choice.
464
465
466Building specific distributions
467-------------------------------
468
The next examples show how to build your own distributions. Further
470examples show the usage of the distributions and some statistical
471tests.
472
473
474Making a continuous distribution, i.e., subclassing ``rv_continuous``
475^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
476
477Making continuous distributions is fairly simple.
478
479    >>> from scipy import stats
480    >>> class deterministic_gen(stats.rv_continuous):
481    ...     def _cdf(self, x):
482    ...         return np.where(x < 0, 0., 1.)
483    ...     def _stats(self):
484    ...         return 0., 0., 0., 0.
485
486    >>> deterministic = deterministic_gen(name="deterministic")
487    >>> deterministic.cdf(np.arange(-3, 3, 0.5))
488    array([ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.])
489
Interestingly, the ``pdf`` is now computed automatically:
491
492    >>> deterministic.pdf(np.arange(-3, 3, 0.5))
493    array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
494             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
495             5.83333333e+04,   4.16333634e-12,   4.16333634e-12,
496             4.16333634e-12,   4.16333634e-12,   4.16333634e-12])
497
498
499Be aware of the performance issues mentioned in
500:ref:`performance_issues_label`. The computation of unspecified
501common methods can become very slow, since only general methods are
502called, which, by their very nature, cannot use any specific
503information about the distribution. Thus, as a cautionary example:
504
505    >>> from scipy.integrate import quad
506    >>> quad(deterministic.pdf, -1e-1, 1e-1)
507    (4.163336342344337e-13, 0.0)
508
509But this is not correct: the integral over this pdf should be 1. Let's make the
510integration interval smaller:
511
512    >>> quad(deterministic.pdf, -1e-3, 1e-3)  # warning removed
513    (1.000076872229173, 0.0010625571718182458)
514
515This looks better. However, the problem originated from the fact that
516the pdf is not specified in the class definition of the deterministic
517distribution.
518
519
520Subclassing ``rv_discrete``
521^^^^^^^^^^^^^^^^^^^^^^^^^^^
522
523In the following, we use `stats.rv_discrete` to generate a discrete
524distribution that has the probabilities of the truncated normal for the
525intervals centered around the integers.
526
527**General info**
528
529From the docstring of rv_discrete, ``help(stats.rv_discrete)``,
530
531  "You can construct an arbitrary discrete rv where P{X=xk} = pk by
532  passing to the rv_discrete initialization method (through the values=
533  keyword) a tuple of sequences (xk, pk) which describes only those
534  values of X (xk) that occur with nonzero probability (pk)."
535
536Next to this, there are some further requirements for this approach to
537work:
538
539* The keyword `name` is required.
540* The support points of the distribution xk have to be integers.
541* The number of significant digits (decimals) needs to be specified.
542
543In fact, if the last two requirements are not satisfied, an exception
544may be raised or the resulting numbers may be incorrect.
545
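Before the longer example below, here is a minimal sketch of the ``values=``
keyword for a small custom distribution (the support points and
probabilities are arbitrary):

    >>> xk = np.arange(4)
    >>> pk = (0.125, 0.25, 0.5, 0.125)
    >>> custm = stats.rv_discrete(name='custm', values=(xk, pk))
    >>> custm.pmf(2)
    0.5
    >>> custm.cdf(2)
    0.875
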
546**An example**
547
548Let's do the work. First:
549
550    >>> npoints = 20   # number of integer support points of the distribution minus 1
551    >>> npointsh = npoints // 2
552    >>> npointsf = float(npoints)
553    >>> nbound = 4   # bounds for the truncated normal
554    >>> normbound = (1+1/npointsf) * nbound   # actual bounds of truncated normal
555    >>> grid = np.arange(-npointsh, npointsh+2, 1)   # integer grid
556    >>> gridlimitsnorm = (grid-0.5) / npointsh * nbound   # bin limits for the truncnorm
557    >>> gridlimits = grid - 0.5   # used later in the analysis
558    >>> grid = grid[:-1]
559    >>> probs = np.diff(stats.truncnorm.cdf(gridlimitsnorm, -normbound, normbound))
560    >>> gridint = grid
561
562And, finally, we can subclass ``rv_discrete``:
563
564    >>> normdiscrete = stats.rv_discrete(values=(gridint,
565    ...              np.round(probs, decimals=7)), name='normdiscrete')
566
567Now that we have defined the distribution, we have access to all
568common methods of discrete distributions.
569
570    >>> print('mean = %6.4f, variance = %6.4f, skew = %6.4f, kurtosis = %6.4f' %
571    ...       normdiscrete.stats(moments='mvsk'))
572    mean = -0.0000, variance = 6.3302, skew = 0.0000, kurtosis = -0.0076
573
574    >>> nd_std = np.sqrt(normdiscrete.stats(moments='v'))
575
576**Testing the implementation**
577
578Let's generate a random sample and compare observed frequencies with
579the probabilities.
580
581    >>> n_sample = 500
582    >>> rvs = normdiscrete.rvs(size=n_sample)
583    >>> f, l = np.histogram(rvs, bins=gridlimits)
584    >>> sfreq = np.vstack([gridint, f, probs*n_sample]).T
585    >>> print(sfreq)
586    [[-1.00000000e+01  0.00000000e+00  2.95019349e-02]  # random
587     [-9.00000000e+00  0.00000000e+00  1.32294142e-01]
588     [-8.00000000e+00  0.00000000e+00  5.06497902e-01]
589     [-7.00000000e+00  2.00000000e+00  1.65568919e+00]
590     [-6.00000000e+00  1.00000000e+00  4.62125309e+00]
591     [-5.00000000e+00  9.00000000e+00  1.10137298e+01]
592     [-4.00000000e+00  2.60000000e+01  2.24137683e+01]
593     [-3.00000000e+00  3.70000000e+01  3.89503370e+01]
594     [-2.00000000e+00  5.10000000e+01  5.78004747e+01]
595     [-1.00000000e+00  7.10000000e+01  7.32455414e+01]
596     [ 0.00000000e+00  7.40000000e+01  7.92618251e+01]
597     [ 1.00000000e+00  8.90000000e+01  7.32455414e+01]
598     [ 2.00000000e+00  5.50000000e+01  5.78004747e+01]
599     [ 3.00000000e+00  5.00000000e+01  3.89503370e+01]
600     [ 4.00000000e+00  1.70000000e+01  2.24137683e+01]
601     [ 5.00000000e+00  1.10000000e+01  1.10137298e+01]
602     [ 6.00000000e+00  4.00000000e+00  4.62125309e+00]
603     [ 7.00000000e+00  3.00000000e+00  1.65568919e+00]
604     [ 8.00000000e+00  0.00000000e+00  5.06497902e-01]
605     [ 9.00000000e+00  0.00000000e+00  1.32294142e-01]
606     [ 1.00000000e+01  0.00000000e+00  2.95019349e-02]]
607
608
609.. plot:: tutorial/examples/normdiscr_plot1.py
610   :align: center
611   :include-source: 0
612
613
614.. plot:: tutorial/examples/normdiscr_plot2.py
615   :align: center
616   :include-source: 0
617
618
619Next, we can test whether our sample was generated by our norm-discrete
620distribution. This also verifies whether the random numbers were generated
621correctly.
622
623The chisquare test requires that there are a minimum number of observations
624in each bin. We combine the tail bins into larger bins so that they contain
625enough observations.
626
627    >>> f2 = np.hstack([f[:5].sum(), f[5:-5], f[-5:].sum()])
628    >>> p2 = np.hstack([probs[:5].sum(), probs[5:-5], probs[-5:].sum()])
629    >>> ch2, pval = stats.chisquare(f2, p2*n_sample)
630
631    >>> print('chisquare for normdiscrete: chi2 = %6.3f pvalue = %6.4f' % (ch2, pval))
632    chisquare for normdiscrete: chi2 = 12.466 pvalue = 0.4090  # random
633
634The pvalue in this case is high, so we can be quite confident that
635our random sample was actually generated by the distribution.
636
637
638Analysing one sample
639--------------------
640
First, we create some random variables. (Passing a seeded ``Generator`` as
``random_state``, as shown above, would make the results reproducible across
runs.) As an example, we take a sample from the Student t distribution:
644
645    >>> x = stats.t.rvs(10, size=1000)
646
647Here, we set the required shape parameter of the t distribution, which
648in statistics corresponds to the degrees of freedom, to 10. Using size=1000 means
649that our sample consists of 1000 independently drawn (pseudo) random numbers.
650Since we did not specify the keyword arguments `loc` and `scale`, those are
651set to their default values zero and one.
652
653Descriptive statistics
654^^^^^^^^^^^^^^^^^^^^^^
655
656`x` is a numpy array, and we have direct access to all array methods, e.g.,
657
658    >>> print(x.min())   # equivalent to np.min(x)
659    -3.78975572422  # random
660    >>> print(x.max())   # equivalent to np.max(x)
661    5.26327732981  # random
662    >>> print(x.mean())  # equivalent to np.mean(x)
663    0.0140610663985  # random
    >>> print(x.var())   # equivalent to np.var(x)
665    1.28899386208  # random
666
667How do the sample properties compare to their theoretical counterparts?
668
669    >>> m, v, s, k = stats.t.stats(10, moments='mvsk')
670    >>> n, (smin, smax), sm, sv, ss, sk = stats.describe(x)
671
672    >>> sstr = '%-14s mean = %6.4f, variance = %6.4f, skew = %6.4f, kurtosis = %6.4f'
    >>> print(sstr % ('distribution:', m, v, s, k))
674    distribution:  mean = 0.0000, variance = 1.2500, skew = 0.0000, kurtosis = 1.0000  # random
675    >>> print(sstr % ('sample:', sm, sv, ss, sk))
676    sample:        mean = 0.0141, variance = 1.2903, skew = 0.2165, kurtosis = 1.0556  # random
677
678Note: `stats.describe` uses the unbiased estimator for the variance, while
679np.var is the biased estimator.
680
681
For our sample, the sample statistics differ by a small amount from
their theoretical counterparts.
684
685
686T-test and KS-test
687^^^^^^^^^^^^^^^^^^
688
689We can use the t-test to test whether the mean of our sample differs
690in a statistically significant way from the theoretical expectation.
691
692    >>> print('t-statistic = %6.3f pvalue = %6.4f' %  stats.ttest_1samp(x, m))
693    t-statistic =  0.391 pvalue = 0.6955  # random
694
The p-value is 0.7; this means that, with an alpha error of, for
example, 10%, we cannot reject the hypothesis that the sample mean
697is equal to zero, the expectation of the standard t-distribution.
698
699
As an exercise, we can also calculate our t-test directly without
using the provided function, which should give us the same answer,
702and so it does:
703
704    >>> tt = (sm-m)/np.sqrt(sv/float(n))  # t-statistic for mean
705    >>> pval = stats.t.sf(np.abs(tt), n-1)*2  # two-sided pvalue = Prob(abs(t)>tt)
706    >>> print('t-statistic = %6.3f pvalue = %6.4f' % (tt, pval))
707    t-statistic =  0.391 pvalue = 0.6955  # random
708
709The Kolmogorov-Smirnov test can be used to test the hypothesis that
the sample comes from the standard t-distribution:
711
712    >>> print('KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(x, 't', (10,)))
713    KS-statistic D =  0.016 pvalue = 0.9571  # random
714
715Again, the p-value is high enough that we cannot reject the
716hypothesis that the random sample really is distributed according to the
717t-distribution. In real applications, we don't know what the
718underlying distribution is. If we perform the Kolmogorov-Smirnov
719test of our sample against the standard normal distribution, then we
720also cannot reject the hypothesis that our sample was generated by the
721normal distribution given that, in this example, the p-value is almost 40%.
722
723    >>> print('KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(x, 'norm'))
724    KS-statistic D =  0.028 pvalue = 0.3918  # random
725
726However, the standard normal distribution has a variance of 1, while our
727sample has a variance of 1.29. If we standardize our sample and test it
728against the normal distribution, then the p-value is again large enough
that we cannot reject the hypothesis that the sample came from the
730normal distribution.
731
732    >>> d, pval = stats.kstest((x-x.mean())/x.std(), 'norm')
733    >>> print('KS-statistic D = %6.3f pvalue = %6.4f' % (d, pval))
734    KS-statistic D =  0.032 pvalue = 0.2397  # random
735
Note: The Kolmogorov-Smirnov test assumes that we test against a
distribution with given parameters. Since, in the last case, we
estimated mean and variance, this assumption is violated, and the
distribution of the test statistic, on which the p-value is based, is
not correct.
741
742Tails of the distribution
743^^^^^^^^^^^^^^^^^^^^^^^^^
744
745Finally, we can check the upper tail of the distribution. We can use
746the percent point function ppf, which is the inverse of the cdf
747function, to obtain the critical values, or, more directly, we can use
the inverse of the survival function:
749
750    >>> crit01, crit05, crit10 = stats.t.ppf([1-0.01, 1-0.05, 1-0.10], 10)
751    >>> print('critical values from ppf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' % (crit01, crit05, crit10))
752    critical values from ppf at 1%, 5% and 10%   2.7638   1.8125   1.3722
753    >>> print('critical values from isf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' % tuple(stats.t.isf([0.01,0.05,0.10],10)))
754    critical values from isf at 1%, 5% and 10%   2.7638   1.8125   1.3722
755
756    >>> freq01 = np.sum(x>crit01) / float(n) * 100
757    >>> freq05 = np.sum(x>crit05) / float(n) * 100
758    >>> freq10 = np.sum(x>crit10) / float(n) * 100
759    >>> print('sample %%-frequency at 1%%, 5%% and 10%% tail %8.4f %8.4f %8.4f' % (freq01, freq05, freq10))
760    sample %-frequency at 1%, 5% and 10% tail   1.4000   5.8000  10.5000  # random
761
762In all three cases, our sample has more weight in the top tail than the
763underlying distribution.
764We can briefly check a larger sample to see if we get a closer match. In this
765case, the empirical frequency is quite close to the theoretical probability,
766but if we repeat this several times, the fluctuations are still pretty large.
767
768    >>> freq05l = np.sum(stats.t.rvs(10, size=10000) > crit05) / 10000.0 * 100
769    >>> print('larger sample %%-frequency at 5%% tail %8.4f' % freq05l)
770    larger sample %-frequency at 5% tail   4.8000  # random
771
772We can also compare it with the tail of the normal distribution, which
773has less weight in the tails:
774
775    >>> print('tail prob. of normal at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' %
776    ...       tuple(stats.norm.sf([crit01, crit05, crit10])*100))
777    tail prob. of normal at 1%, 5% and 10%   0.2857   3.4957   8.5003
778
The chisquare test can be used to test whether, for a finite number of bins,
780the observed frequencies differ significantly from the probabilities of the
781hypothesized distribution.
782
783    >>> quantiles = [0.0, 0.01, 0.05, 0.1, 1-0.10, 1-0.05, 1-0.01, 1.0]
784    >>> crit = stats.t.ppf(quantiles, 10)
785    >>> crit
786    array([       -inf, -2.76376946, -1.81246112, -1.37218364,  1.37218364,
787            1.81246112,  2.76376946,         inf])
788    >>> n_sample = x.size
789    >>> freqcount = np.histogram(x, bins=crit)[0]
790    >>> tprob = np.diff(quantiles)
791    >>> nprob = np.diff(stats.norm.cdf(crit))
792    >>> tch, tpval = stats.chisquare(freqcount, tprob*n_sample)
793    >>> nch, npval = stats.chisquare(freqcount, nprob*n_sample)
794    >>> print('chisquare for t:      chi2 = %6.2f pvalue = %6.4f' % (tch, tpval))
795    chisquare for t:      chi2 =  2.30 pvalue = 0.8901  # random
796    >>> print('chisquare for normal: chi2 = %6.2f pvalue = %6.4f' % (nch, npval))
797    chisquare for normal: chi2 = 64.60 pvalue = 0.0000  # random
798
799We see that the standard normal distribution is clearly rejected, while the
800standard t-distribution cannot be rejected. Since the variance of our sample
801differs from both standard distributions, we can again redo the test taking
802the estimate for scale and location into account.
803
804The fit method of the distributions can be used to estimate the parameters
805of the distribution, and the test is repeated using probabilities of the
806estimated distribution.
807
808    >>> tdof, tloc, tscale = stats.t.fit(x)
809    >>> nloc, nscale = stats.norm.fit(x)
810    >>> tprob = np.diff(stats.t.cdf(crit, tdof, loc=tloc, scale=tscale))
811    >>> nprob = np.diff(stats.norm.cdf(crit, loc=nloc, scale=nscale))
812    >>> tch, tpval = stats.chisquare(freqcount, tprob*n_sample)
813    >>> nch, npval = stats.chisquare(freqcount, nprob*n_sample)
814    >>> print('chisquare for t:      chi2 = %6.2f pvalue = %6.4f' % (tch, tpval))
815    chisquare for t:      chi2 =  1.58 pvalue = 0.9542  # random
816    >>> print('chisquare for normal: chi2 = %6.2f pvalue = %6.4f' % (nch, npval))
817    chisquare for normal: chi2 = 11.08 pvalue = 0.0858  # random
818
819Taking account of the estimated parameters, we can still reject the
820hypothesis that our sample came from a normal distribution (at the 5% level),
821but again, with a p-value of 0.95, we cannot reject the t-distribution.
822
823
824Special tests for normal distributions
825^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
826
827Since the normal distribution is the most common distribution in statistics,
828there are several additional functions available to test whether a sample
829could have been drawn from a normal distribution.
830
831First, we can test if skew and kurtosis of our sample differ significantly from
832those of a normal distribution:
833
834    >>> print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(x))
835    normal skewtest teststat =  2.785 pvalue = 0.0054  # random
836    >>> print('normal kurtosistest teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(x))
837    normal kurtosistest teststat =  4.757 pvalue = 0.0000  # random
838
These two tests are combined in the normality test:
840
841    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(x))
842    normaltest teststat = 30.379 pvalue = 0.0000  # random
843
844In all three tests, the p-values are very low and we can reject the hypothesis
that our sample has the skew and kurtosis of the normal distribution.
846
847Since skew and kurtosis of our sample are based on central moments, we get
848exactly the same results if we test the standardized sample:
849
850    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' %
851    ...       stats.normaltest((x-x.mean())/x.std()))
852    normaltest teststat = 30.379 pvalue = 0.0000  # random
853
854Because normality is rejected so strongly, we can check whether the
855normaltest gives reasonable results for other cases:
856
857    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' %
858    ...       stats.normaltest(stats.t.rvs(10, size=100)))
859    normaltest teststat =  4.698 pvalue = 0.0955  # random
860    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' %
861    ...              stats.normaltest(stats.norm.rvs(size=1000)))
862    normaltest teststat =  0.613 pvalue = 0.7361  # random
863
When testing for normality of a small sample of t-distributed observations
and a large sample of normally distributed observations, in neither case
866can we reject the null hypothesis that the sample comes from a normal
867distribution. In the first case, this is because the test is not powerful
868enough to distinguish a t and a normally distributed random variable in a
869small sample.
870
871
872Comparing two samples
873---------------------
874
875In the following, we are given two samples, which can come either from the
same or from different distributions, and we want to test whether these
877samples have the same statistical properties.
878
879
880Comparing means
881^^^^^^^^^^^^^^^
882
Test with samples with identical means:
884
885    >>> rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
886    >>> rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)
887    >>> stats.ttest_ind(rvs1, rvs2)
888    Ttest_indResult(statistic=-0.5489036175088705, pvalue=0.5831943748663959)  # random
889
Test with samples with different means:
891
892    >>> rvs3 = stats.norm.rvs(loc=8, scale=10, size=500)
893    >>> stats.ttest_ind(rvs1, rvs3)
894    Ttest_indResult(statistic=-4.533414290175026, pvalue=6.507128186389019e-06)  # random
895
896Kolmogorov-Smirnov test for two samples ks_2samp
897^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
898
For the example where both samples are drawn from the same distribution,
we cannot reject the null hypothesis, since the p-value is high:
901
902    >>> stats.ks_2samp(rvs1, rvs2)
903    KstestResult(statistic=0.026, pvalue=0.9959527565364388)  # random
904
905In the second example, with different location, i.e., means, we can
reject the null hypothesis, since the p-value is below 1%:
907
908    >>> stats.ks_2samp(rvs1, rvs3)
909    KstestResult(statistic=0.114, pvalue=0.00299005061044668)  # random
910
911Kernel density estimation
912-------------------------
913
914A common task in statistics is to estimate the probability density function
915(PDF) of a random variable from a set of data samples. This task is called
916density estimation. The most well-known tool to do this is the histogram.
917A histogram is a useful tool for visualization (mainly because everyone
918understands it), but doesn't use the available data very efficiently. Kernel
919density estimation (KDE) is a more efficient tool for the same task. The
920:func:`~stats.gaussian_kde` estimator can be used to estimate the PDF of univariate as
921well as multivariate data. It works best if the data is unimodal.
922
923
924Univariate estimation
925^^^^^^^^^^^^^^^^^^^^^
926
927We start with a minimal amount of data in order to see how :func:`~stats.gaussian_kde`
928works and what the different options for bandwidth selection do. The data
929sampled from the PDF are shown as blue dashes at the bottom of the figure (this
930is called a rug plot):
931
932.. plot::
933
    >>> import numpy as np
    >>> from scipy import stats
    >>> import matplotlib.pyplot as plt
936
937    >>> x1 = np.array([-7, -5, 1, 4, 5], dtype=np.float64)
938    >>> kde1 = stats.gaussian_kde(x1)
939    >>> kde2 = stats.gaussian_kde(x1, bw_method='silverman')
940
941    >>> fig = plt.figure()
942    >>> ax = fig.add_subplot(111)
943
944    >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20)  # rug plot
945    >>> x_eval = np.linspace(-10, 10, num=200)
946    >>> ax.plot(x_eval, kde1(x_eval), 'k-', label="Scott's Rule")
947    >>> ax.plot(x_eval, kde2(x_eval), 'r-', label="Silverman's Rule")
948
949    >>> plt.show()
950
951We see that there is very little difference between Scott's Rule and
952Silverman's Rule, and that the bandwidth selection with a limited amount of
953data is probably a bit too wide. We can define our own bandwidth function to
954get a less smoothed-out result.
955
956    >>> def my_kde_bandwidth(obj, fac=1./5):
957    ...     """We use Scott's Rule, multiplied by a constant factor."""
958    ...     return np.power(obj.n, -1./(obj.d+4)) * fac
959
960    >>> fig = plt.figure()
961    >>> ax = fig.add_subplot(111)
962
963    >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20)  # rug plot
964    >>> kde3 = stats.gaussian_kde(x1, bw_method=my_kde_bandwidth)
965    >>> ax.plot(x_eval, kde3(x_eval), 'g-', label="With smaller BW")
966
967    >>> plt.show()
968
969.. plot:: tutorial/stats/plots/kde_plot2.py
970   :align: center
971   :include-source: 0
972
We see that if we set the bandwidth to be very narrow, the obtained estimate for
974the probability density function (PDF) is simply the sum of Gaussians around
975each data point.
976
977We now take a more realistic example and look at the difference between the
978two available bandwidth selection rules. Those rules are known to work well
979for (close to) normal distributions, but even for unimodal distributions that
980are quite strongly non-normal they work reasonably well. As a non-normal
distribution we take a Student's t distribution with 5 degrees of freedom.
982
983.. plot:: tutorial/stats/plots/kde_plot3.py
984   :align: center
985   :include-source: 1
986
987We now take a look at a bimodal distribution with one wider and one narrower
988Gaussian feature. We expect that this will be a more difficult density to
989approximate, due to the different bandwidths required to accurately resolve
990each feature.
991
992    >>> from functools import partial
993
994    >>> loc1, scale1, size1 = (-2, 1, 175)
995    >>> loc2, scale2, size2 = (2, 0.2, 50)
996    >>> x2 = np.concatenate([np.random.normal(loc=loc1, scale=scale1, size=size1),
997    ...                      np.random.normal(loc=loc2, scale=scale2, size=size2)])
998
999    >>> x_eval = np.linspace(x2.min() - 1, x2.max() + 1, 500)
1000
1001    >>> kde = stats.gaussian_kde(x2)
1002    >>> kde2 = stats.gaussian_kde(x2, bw_method='silverman')
1003    >>> kde3 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.2))
1004    >>> kde4 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.5))
1005
1006    >>> pdf = stats.norm.pdf
1007    >>> bimodal_pdf = pdf(x_eval, loc=loc1, scale=scale1) * float(size1) / x2.size + \
1008    ...               pdf(x_eval, loc=loc2, scale=scale2) * float(size2) / x2.size
1009
1010    >>> fig = plt.figure(figsize=(8, 6))
1011    >>> ax = fig.add_subplot(111)
1012
1013    >>> ax.plot(x2, np.zeros(x2.shape), 'b+', ms=12)
1014    >>> ax.plot(x_eval, kde(x_eval), 'k-', label="Scott's Rule")
1015    >>> ax.plot(x_eval, kde2(x_eval), 'b-', label="Silverman's Rule")
1016    >>> ax.plot(x_eval, kde3(x_eval), 'g-', label="Scott * 0.2")
1017    >>> ax.plot(x_eval, kde4(x_eval), 'c-', label="Scott * 0.5")
1018    >>> ax.plot(x_eval, bimodal_pdf, 'r--', label="Actual PDF")
1019
1020    >>> ax.set_xlim([x_eval.min(), x_eval.max()])
1021    >>> ax.legend(loc=2)
1022    >>> ax.set_xlabel('x')
1023    >>> ax.set_ylabel('Density')
1024    >>> plt.show()
1025
1026.. plot:: tutorial/stats/plots/kde_plot4.py
1027   :align: center
1028   :include-source: 0
1029
1030As expected, the KDE is not as close to the true PDF as we would like due to
1031the different characteristic size of the two features of the bimodal
1032distribution. By halving the default bandwidth (``Scott * 0.5``), we can do
somewhat better, while using a bandwidth 5 times smaller than the default
doesn't smooth enough. What we really need, though, in this case, is a
1035non-uniform (adaptive) bandwidth.
1036
1037
1038Multivariate estimation
1039^^^^^^^^^^^^^^^^^^^^^^^
1040
1041With :func:`~stats.gaussian_kde` we can perform multivariate, as well as univariate
1042estimation. We demonstrate the bivariate case. First, we generate some random
1043data with a model in which the two variates are correlated.
1044
1045    >>> def measure(n):
1046    ...     """Measurement model, return two coupled measurements."""
1047    ...     m1 = np.random.normal(size=n)
1048    ...     m2 = np.random.normal(scale=0.5, size=n)
1049    ...     return m1+m2, m1-m2
1050
1051    >>> m1, m2 = measure(2000)
1052    >>> xmin = m1.min()
1053    >>> xmax = m1.max()
1054    >>> ymin = m2.min()
1055    >>> ymax = m2.max()
1056
1057Then we apply the KDE to the data:
1058
1059    >>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
1060    >>> positions = np.vstack([X.ravel(), Y.ravel()])
1061    >>> values = np.vstack([m1, m2])
1062    >>> kernel = stats.gaussian_kde(values)
1063    >>> Z = np.reshape(kernel.evaluate(positions).T, X.shape)
1064
1065Finally, we plot the estimated bivariate distribution as a colormap and plot
1066the individual data points on top.
1067
1068    >>> fig = plt.figure(figsize=(8, 6))
1069    >>> ax = fig.add_subplot(111)
1070
1071    >>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
1072    ...           extent=[xmin, xmax, ymin, ymax])
1073    >>> ax.plot(m1, m2, 'k.', markersize=2)
1074
1075    >>> ax.set_xlim([xmin, xmax])
1076    >>> ax.set_ylim([ymin, ymax])
1077
1078    >>> plt.show()
1079
1080.. plot:: tutorial/stats/plots/kde_plot5.py
1081   :align: center
1082   :include-source: 0
1083
1084
1085Multiscale Graph Correlation (MGC)
1086^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1087
1088With :func:`~stats.multiscale_graphcorr`, we can test for independence on high
1089dimensional and nonlinear data. Before we start, let's import some useful
1090packages:
1091
1092    >>> import numpy as np
1093    >>> import matplotlib.pyplot as plt; plt.style.use('classic')
1094    >>> from scipy.stats import multiscale_graphcorr
1095
1096Let's use a custom plotting function to plot the data relationship:
1097
1098    >>> def mgc_plot(x, y, sim_name, mgc_dict=None, only_viz=False,
1099    ...              only_mgc=False):
1100    ...     """Plot sim and MGC-plot"""
1101    ...     if not only_mgc:
1102    ...         # simulation
1103    ...         plt.figure(figsize=(8, 8))
1104    ...         ax = plt.gca()
1105    ...         ax.set_title(sim_name + " Simulation", fontsize=20)
1106    ...         ax.scatter(x, y)
1107    ...         ax.set_xlabel('X', fontsize=15)
1108    ...         ax.set_ylabel('Y', fontsize=15)
1109    ...         ax.axis('equal')
1110    ...         ax.tick_params(axis="x", labelsize=15)
1111    ...         ax.tick_params(axis="y", labelsize=15)
1112    ...         plt.show()
1113    ...     if not only_viz:
1114    ...         # local correlation map
1115    ...         plt.figure(figsize=(8,8))
1116    ...         ax = plt.gca()
1117    ...         mgc_map = mgc_dict["mgc_map"]
1118    ...         # draw heatmap
1119    ...         ax.set_title("Local Correlation Map", fontsize=20)
1120    ...         im = ax.imshow(mgc_map, cmap='YlGnBu')
1121    ...         # colorbar
1122    ...         cbar = ax.figure.colorbar(im, ax=ax)
1123    ...         cbar.ax.set_ylabel("", rotation=-90, va="bottom")
1124    ...         ax.invert_yaxis()
1125    ...         # Turn spines off and create white grid.
1126    ...         for edge, spine in ax.spines.items():
1127    ...             spine.set_visible(False)
1128    ...         # optimal scale
1129    ...         opt_scale = mgc_dict["opt_scale"]
1130    ...         ax.scatter(opt_scale[0], opt_scale[1],
1131    ...                    marker='X', s=200, color='red')
1132    ...         # other formatting
    ...         ax.tick_params(bottom=False, left=False)
1134    ...         ax.set_xlabel('#Neighbors for X', fontsize=15)
1135    ...         ax.set_ylabel('#Neighbors for Y', fontsize=15)
1136    ...         ax.tick_params(axis="x", labelsize=15)
1137    ...         ax.tick_params(axis="y", labelsize=15)
1138    ...         ax.set_xlim(0, 100)
1139    ...         ax.set_ylim(0, 100)
1140    ...         plt.show()
1141
1142Let's look at some linear data first:
1143
1144    >>> rng = np.random.default_rng()
1145    >>> x = np.linspace(-1, 1, num=100)
1146    >>> y = x + 0.3 * rng.random(x.size)
1147
1148The simulation relationship can be plotted below:
1149
1150    >>> mgc_plot(x, y, "Linear", only_viz=True)
1151
1152.. plot:: tutorial/stats/plots/mgc_plot1.py
1153   :align: center
1154   :include-source: 0
1155
1156Now, we can see the test statistic, p-value, and MGC map visualized below. The
1157optimal scale is shown on the map as a red "x":
1158
1159    >>> stat, pvalue, mgc_dict = multiscale_graphcorr(x, y)
1160    >>> print("MGC test statistic: ", round(stat, 1))
1161    MGC test statistic:  1.0
1162    >>> print("P-value: ", round(pvalue, 1))
1163    P-value:  0.0
1164    >>> mgc_plot(x, y, "Linear", mgc_dict, only_mgc=True)
1165
1166.. plot:: tutorial/stats/plots/mgc_plot2.py
1167   :align: center
1168   :include-source: 0
1169
It is clear from here that MGC is able to determine a relationship between the
1171input data matrices because the p-value is very low and the MGC test statistic
1172is relatively high. The MGC-map indicates a **strongly linear relationship**.
1173Intuitively, this is because having more neighbors will help in identifying a
1174linear relationship between :math:`x` and :math:`y`. The optimal scale in this
1175case is **equivalent to the global scale**, marked by a red spot on the map.
1176
1177The same can be done for nonlinear data sets. The following :math:`x` and
1178:math:`y` arrays are derived from a nonlinear simulation:
1179
1180    >>> unif = np.array(rng.uniform(0, 5, size=100))
1181    >>> x = unif * np.cos(np.pi * unif)
1182    >>> y = unif * np.sin(np.pi * unif) + 0.4 * rng.random(x.size)
1183
1184The simulation relationship can be plotted below:
1185
1186    >>> mgc_plot(x, y, "Spiral", only_viz=True)
1187
1188.. plot:: tutorial/stats/plots/mgc_plot3.py
1189   :align: center
1190   :include-source: 0
1191
1192Now, we can see the test statistic, p-value, and MGC map visualized below. The
1193optimal scale is shown on the map as a red "x":
1194
1195    >>> stat, pvalue, mgc_dict = multiscale_graphcorr(x, y)
1196    >>> print("MGC test statistic: ", round(stat, 1))
1197    MGC test statistic:  0.2  # random
1198    >>> print("P-value: ", round(pvalue, 1))
1199    P-value:  0.0
1200    >>> mgc_plot(x, y, "Spiral", mgc_dict, only_mgc=True)
1201
1202.. plot:: tutorial/stats/plots/mgc_plot4.py
1203   :align: center
1204   :include-source: 0
1205
It is clear from here that MGC is able to determine a relationship again
1207because the p-value is very low and the MGC test statistic is relatively high.
1208The MGC-map indicates a **strongly nonlinear relationship**. The optimal scale
1209in this case is **equivalent to the local scale**, marked by a red spot on the
1210map.
1211
1212Quasi-Monte Carlo
1213-----------------
1214
Before talking about Quasi-Monte Carlo (QMC), a quick introduction to Monte
Carlo (MC) is in order. MC methods, or MC experiments, are a broad class of
1217computational algorithms that rely on repeated random sampling to obtain
1218numerical results. The underlying concept is to use randomness to solve
1219problems that might be deterministic in principle. They are often used in
1220physical and mathematical problems and are most useful when it is difficult or
1221impossible to use other approaches. MC methods are mainly used in
1222three problem classes: optimization, numerical integration, and generating
1223draws from a probability distribution.
1224
1225Generating random numbers with specific properties is a more complex problem
1226than it sounds. Simple MC methods are designed to sample points to be
1227independent and identically distributed (IID). But generating multiple sets
1228of random points can produce radically different results.
1229
1230.. plot:: tutorial/stats/plots/qmc_plot_mc.py
1231   :align: center
1232   :include-source: 0
1233
1234In both cases in the plot above, points are generated randomly without any
1235knowledge about previously drawn points. It is clear that some regions of
the space are left unexplored, which can cause problems in simulations, as a
1237particular set of points might trigger a totally different behaviour.
1238
1239A great benefit of MC is that it has known convergence properties.
1240Let's look at the mean of the squared sum in 5 dimensions:
1241
1242.. math::
1243
1244    f(\mathbf{x}) = \left( \sum_{j=1}^{5}x_j \right)^2,
1245
1246with :math:`x_j \sim \mathcal{U}(0,1)`. It has a known mean value,
1247:math:`\mu = 5/3+5(5-1)/4`. Using MC sampling, we
1248can compute that mean numerically, and the approximation error follows a
1249theoretical rate of :math:`O(n^{-1/2})`.
1250
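A small sketch of such an MC estimate (the sample size is arbitrary and the
result is random; it should land close to :math:`\mu \approx 6.67`):

    >>> import numpy as np
    >>> rng = np.random.default_rng()
    >>> sample = rng.random((10000, 5))  # 10000 points in 5 dimensions
    >>> f_values = np.sum(sample, axis=1) ** 2
    >>> f_values.mean()
    6.665420336297217  # random
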
1251.. plot:: tutorial/stats/plots/qmc_plot_conv_mc.py
1252   :align: center
1253   :include-source: 0
1254
1255Although the convergence is ensured, practitioners tend to want to have an
1256exploration process which is more deterministic. With normal MC, a seed can be
1257used to have a repeatable process. But fixing the seed would break the
1258convergence property: a given seed could work for a given class of problem
1259and break for another one.
1260
What is commonly done to walk through the space in a deterministic manner is
to use a regular grid spanning all parameter dimensions, also called a
saturated design. Let's consider the unit hypercube, with all bounds ranging
1264from 0 to 1. Now, having a distance of 0.1 between points, the number of points
1265required to fill the unit interval would be 10. In a 2-dimensional hypercube
the same spacing would require 100 points, and in 3 dimensions 1,000 points.
The number of experiments required to fill the space thus rises exponentially
with the dimensionality of the space.
1269This exponential growth is called "the curse of dimensionality".
1270
1271    >>> import numpy as np
1272    >>> disc = 10
1273    >>> x1 = np.linspace(0, 1, disc)
1274    >>> x2 = np.linspace(0, 1, disc)
1275    >>> x3 = np.linspace(0, 1, disc)
1276    >>> x1, x2, x3 = np.meshgrid(x1, x2, x3)
1277
1278.. plot:: tutorial/stats/plots/qmc_plot_curse.py
1279   :align: center
1280   :include-source: 0
1281
1282To mitigate this issue, QMC methods have been designed. They are
1283deterministic, have a good coverage of the space and some of them can be
1284continued and retain good properties.
The main difference from MC methods is that the points are not IID, but
depend on the previously drawn points. Hence, some methods are also referred
to as sequences.
1288
1289.. plot:: tutorial/stats/plots/qmc_plot_mc_qmc.py
1290   :align: center
1291   :include-source: 0
1292
This figure presents 2 sets of 256 points. The design on the left is a plain
MC whereas the design on the right is a QMC design using the *Sobol'* method.
1295We clearly see that the QMC version is more uniform. The points sample better
near the boundaries and there are fewer clusters or gaps.
1297
1298One way to assess the uniformity is to use a measure called the discrepancy.
Here, the discrepancy of the *Sobol'* points is lower (better) than that of crude MC.
1300
1301Coming back to the computation of the mean, QMC methods also have better rates
1302of convergence for the error. They can achieve :math:`O(n^{-1})` for this
1303function, and even better rates on very smooth functions. This figure shows
1304that the *Sobol'* method has a rate of :math:`O(n^{-1})`:
1305
1306.. plot:: tutorial/stats/plots/qmc_plot_conv_mc_sobol.py
1307   :align: center
1308   :include-source: 0
1309
1310We refer to the documentation of :mod:`scipy.stats.qmc` for
1311more mathematical details.
1312
1313Calculate the discrepancy
1314^^^^^^^^^^^^^^^^^^^^^^^^^
1315
1316Let's consider two sets of points. From the figure below, it is clear that
1317the design on the left covers more of the space than the design on the right.
1318This can be quantified using a :func:`~stats.qmc.discrepancy` measure.
1319The lower the discrepancy, the more uniform a sample is.
1320
1321    >>> import numpy as np
1322    >>> from scipy.stats import qmc
1323    >>> space_1 = np.array([[1, 3], [2, 6], [3, 2], [4, 5], [5, 1], [6, 4]])
1324    >>> space_2 = np.array([[1, 5], [2, 4], [3, 3], [4, 2], [5, 1], [6, 6]])
1325    >>> l_bounds = [0.5, 0.5]
1326    >>> u_bounds = [6.5, 6.5]
1327    >>> space_1 = qmc.scale(space_1, l_bounds, u_bounds, reverse=True)
1328    >>> space_2 = qmc.scale(space_2, l_bounds, u_bounds, reverse=True)
1329    >>> qmc.discrepancy(space_1)
1330    0.008142039609053464
1331    >>> qmc.discrepancy(space_2)
1332    0.010456854423869011
1333
1334.. plot:: tutorial/stats/plots/qmc_plot_discrepancy.py
1335   :align: center
1336   :include-source: 0
1337
1338Using a QMC engine
1339^^^^^^^^^^^^^^^^^^
1340
1341Several QMC samplers/engines are implemented. Here we look at two of the most
1342used QMC methods: :class:`~stats.qmc.Sobol` and :class:`~stats.qmc.Halton`
1343sequences.
1344
1345.. plot:: tutorial/stats/plots/qmc_plot_sobol_halton.py
1346   :align: center
1347   :include-source: 1
1348
1349.. warning:: QMC methods require particular care and the user must read the
   documentation to avoid common pitfalls. *Sobol'*, for instance, requires a
   number of points that is a power of 2. Also, thinning, burn-in, or other
1352   point selection can break the properties of the sequence and result in a
1353   set of points which would not be better than MC.
1354
QMC engines are state-aware, meaning that you can continue the sequence,
skip some points, or reset it. Let's take 5 points from
:class:`~stats.qmc.Halton`, and then ask for a second set of 5 points:
1358
1359    >>> from scipy.stats import qmc
1360    >>> engine = qmc.Halton(d=2)
1361    >>> engine.random(5)
1362    array([[0.22166437, 0.07980522],  # random
1363           [0.72166437, 0.93165708],
1364           [0.47166437, 0.41313856],
1365           [0.97166437, 0.19091633],
1366           [0.01853937, 0.74647189]])
1367    >>> engine.random(5)
1368    array([[0.51853937, 0.52424967],  # random
1369           [0.26853937, 0.30202745],
1370           [0.76853937, 0.857583  ],
1371           [0.14353937, 0.63536078],
1372           [0.64353937, 0.01807683]])
1373
1374Now we reset the sequence. Asking for 5 points leads to the same first 5
1375points:
1376
1377    >>> engine.reset()
1378    >>> engine.random(5)
1379    array([[0.22166437, 0.07980522],  # random
1380           [0.72166437, 0.93165708],
1381           [0.47166437, 0.41313856],
1382           [0.97166437, 0.19091633],
1383           [0.01853937, 0.74647189]])
1384
1385And here we advance the sequence to get the same second set of 5 points:
1386
1387    >>> engine.reset()
1388    >>> engine.fast_forward(5)
1389    >>> engine.random(5)
1390    array([[0.51853937, 0.52424967],  # random
1391           [0.26853937, 0.30202745],
1392           [0.76853937, 0.857583  ],
1393           [0.14353937, 0.63536078],
1394           [0.64353937, 0.01807683]])
1395
1396.. note:: By default, both :class:`~stats.qmc.Sobol` and
1397   :class:`~stats.qmc.Halton` are scrambled. The convergence properties are
1398   better, and it prevents the appearance of fringes or noticeable patterns
1399   of points in high dimensions. There should be no practical reason not to
1400   use the scrambled version.
1401
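For completeness, here is a short *Sobol'* sketch; ``random_base2(m)`` draws
:math:`2^m` points, which respects the power-of-2 requirement mentioned in
the warning above:

    >>> sobol = qmc.Sobol(d=2)
    >>> sample = sobol.random_base2(m=3)  # 2**3 = 8 points in 2 dimensions
    >>> sample.shape
    (8, 2)
    >>> qmc.discrepancy(sample)
    0.021201455241801676  # random
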
1402Making a QMC engine, i.e., subclassing ``QMCEngine``
1403^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1404
1405To make your own :class:`~stats.qmc.QMCEngine`, a few methods have to be
defined. The following is an example wrapping `numpy.random.Generator`.
1407
1408    >>> import numpy as np
1409    >>> from scipy.stats import qmc
1410    >>> class RandomEngine(qmc.QMCEngine):
1411    ...     def __init__(self, d, seed=None):
1412    ...         super().__init__(d=d, seed=seed)
1413    ...         self.rng = np.random.default_rng(self.rng_seed)
1414    ...
1415    ...
1416    ...     def random(self, n=1):
1417    ...         self.num_generated += n
1418    ...         return self.rng.random((n, self.d))
1419    ...
1420    ...
1421    ...     def reset(self):
1422    ...         self.rng = np.random.default_rng(self.rng_seed)
1423    ...         self.num_generated = 0
1424    ...         return self
1425    ...
1426    ...
1427    ...     def fast_forward(self, n):
1428    ...         self.random(n)
1429    ...         return self
1430
1431Then we use it as any other QMC engine:
1432
1433    >>> engine = RandomEngine(2)
1434    >>> engine.random(5)
1435    array([[0.22733602, 0.31675834],  # random
1436           [0.79736546, 0.67625467],
1437           [0.39110955, 0.33281393],
1438           [0.59830875, 0.18673419],
1439           [0.67275604, 0.94180287]])
1440    >>> engine.reset()
1441    >>> engine.random(5)
1442    array([[0.22733602, 0.31675834],  # random
1443           [0.79736546, 0.67625467],
1444           [0.39110955, 0.33281393],
1445           [0.59830875, 0.18673419],
1446           [0.67275604, 0.94180287]])
1447
1448Guidelines on using QMC
1449^^^^^^^^^^^^^^^^^^^^^^^
1450
1451* QMC has rules! Be sure to read the documentation or you might have no
1452  benefit over MC.
1453* Use :class:`~stats.qmc.Sobol` if you need **exactly** :math:`2^m` points.
* :class:`~stats.qmc.Halton` allows sampling, or skipping, an arbitrary number of
1455  points. This is at the cost of a slower rate of convergence than *Sobol'*.
1456* Never remove the first points of the sequence. It will destroy the
1457  properties.
1458* Scrambling is always better.
1459* If you use LHS based methods, you cannot add points without losing the LHS
1460  properties. (There are some methods to do so, but this is not implemented.)
1461