1Statistics (`scipy.stats`) 2========================== 3 4.. sectionauthor:: Travis E. Oliphant 5.. sectionauthor:: Josef Perktold 6.. sectionauthor:: Nicky van Foreest 7.. sectionauthor:: Sambit Panda 8.. sectionauthor:: Pamphile T. Roy (@tupui) 9 10.. currentmodule:: scipy 11 12Introduction 13------------ 14 15In this tutorial, we discuss many, but certainly not all, features of 16``scipy.stats``. The intention here is to provide a user with a 17working knowledge of this package. We refer to the 18:ref:`reference manual <statsrefmanual>` for further details. 19 20Note: This documentation is work in progress. 21 22.. toctree:: 23 :maxdepth: 1 24 25 stats/discrete 26 stats/continuous 27 28 29Random variables 30---------------- 31 32There are two general distribution classes that have been implemented 33for encapsulating :ref:`continuous random variables 34<continuous-random-variables>` and :ref:`discrete random variables 35<discrete-random-variables>`. Over 80 continuous random variables 36(RVs) and 10 discrete random variables have been implemented using 37these classes. Besides this, new routines and distributions can be 38easily added by the end user. (If you create one, please contribute it.) 39 40All of the statistics functions are located in the sub-package 41:mod:`scipy.stats` and a fairly complete listing of these functions 42can be obtained using ``info(stats)``. The list of the random 43variables available can also be obtained from the docstring for the 44stats sub-package. 45 46In the discussion below, we mostly focus on continuous RVs. Nearly everything 47also applies to discrete variables, but we point out some differences 48here: :ref:`discrete_points_label`. 49 50In the code samples below, we assume that the :mod:`scipy.stats` package 51is imported as 52 53 >>> from scipy import stats 54 55and in some cases we assume that individual objects are imported as 56 57 >>> from scipy.stats import norm 58 59Getting help 60^^^^^^^^^^^^ 61 62First of all, all distributions are accompanied with help 63functions. To obtain just some basic information, we print the relevant 64docstring: ``print(stats.norm.__doc__)``. 65 66To find the support, i.e., upper and lower bounds of the distribution, 67call: 68 69 >>> print('bounds of distribution lower: %s, upper: %s' % norm.support()) 70 bounds of distribution lower: -inf, upper: inf 71 72We can list all methods and properties of the distribution with 73``dir(norm)``. As it turns out, some of the methods are private, 74although they are not named as such (their names do not start 75with a leading underscore), for example ``veccdf``, are only available 76for internal calculation (those methods will give warnings when one tries to 77use them, and will be removed at some point). 78 79To obtain the *real* main methods, we list the methods of the frozen 80distribution. (We explain the meaning of a `frozen` distribution 81below). 
82 83 >>> rv = norm() 84 >>> dir(rv) # reformatted 85 ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', 86 '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', 87 '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', 88 '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', 89 '__str__', '__subclasshook__', '__weakref__', 'a', 'args', 'b', 'cdf', 90 'dist', 'entropy', 'expect', 'interval', 'isf', 'kwds', 'logcdf', 91 'logpdf', 'logpmf', 'logsf', 'mean', 'median', 'moment', 'pdf', 'pmf', 92 'ppf', 'random_state', 'rvs', 'sf', 'stats', 'std', 'var'] 93 94Finally, we can obtain the list of available distribution through 95introspection: 96 97 >>> dist_continu = [d for d in dir(stats) if 98 ... isinstance(getattr(stats, d), stats.rv_continuous)] 99 >>> dist_discrete = [d for d in dir(stats) if 100 ... isinstance(getattr(stats, d), stats.rv_discrete)] 101 >>> print('number of continuous distributions: %d' % len(dist_continu)) 102 number of continuous distributions: 104 103 >>> print('number of discrete distributions: %d' % len(dist_discrete)) 104 number of discrete distributions: 19 105 106Common methods 107^^^^^^^^^^^^^^ 108 109The main public methods for continuous RVs are: 110 111* rvs: Random Variates 112* pdf: Probability Density Function 113* cdf: Cumulative Distribution Function 114* sf: Survival Function (1-CDF) 115* ppf: Percent Point Function (Inverse of CDF) 116* isf: Inverse Survival Function (Inverse of SF) 117* stats: Return mean, variance, (Fisher's) skew, or (Fisher's) kurtosis 118* moment: non-central moments of the distribution 119 120 121Let's take a normal RV as an example. 122 123 >>> norm.cdf(0) 124 0.5 125 126To compute the ``cdf`` at a number of points, we can pass a list or a numpy array. 127 128 >>> norm.cdf([-1., 0, 1]) 129 array([ 0.15865525, 0.5, 0.84134475]) 130 >>> import numpy as np 131 >>> norm.cdf(np.array([-1., 0, 1])) 132 array([ 0.15865525, 0.5, 0.84134475]) 133 134Thus, the basic methods, such as `pdf`, `cdf`, and so on, are vectorized. 135 136Other generally useful methods are supported too: 137 138 >>> norm.mean(), norm.std(), norm.var() 139 (0.0, 1.0, 1.0) 140 >>> norm.stats(moments="mv") 141 (array(0.0), array(1.0)) 142 143To find the median of a distribution, we can use the percent point 144function ``ppf``, which is the inverse of the ``cdf``: 145 146 >>> norm.ppf(0.5) 147 0.0 148 149To generate a sequence of random variates, use the ``size`` keyword 150argument: 151 152 >>> norm.rvs(size=3) 153 array([-0.35687759, 1.34347647, -0.11710531]) # random 154 155Don't think that ``norm.rvs(5)`` generates 5 variates: 156 157 >>> norm.rvs(5) 158 5.471435163732493 # random 159 160Here, ``5`` with no keyword is being interpreted as the first possible 161keyword argument, ``loc``, which is the first of a pair of keyword arguments 162taken by all continuous distributions. 163This brings us to the topic of the next subsection. 164 165Random number generation 166^^^^^^^^^^^^^^^^^^^^^^^^ 167 168Drawing random numbers relies on generators from `numpy.random` package. 169In the examples above, the specific stream of 170random numbers is not reproducible across runs. To achieve reproducibility, 171you can explicitly *seed* a random number generator. In NumPy, a generator 172is an instance of `numpy.random.Generator`. 
Here is the canonical way to create 173a generator: 174 175 >>> from numpy.random import default_rng 176 >>> rng = default_rng() 177 178And fixing the seed can be done like this: 179 180 >>> # do NOT copy this value 181 >>> rng = default_rng(301439351238479871608357552876690613766) 182 183.. warning:: Do not use this number or common values such as 0. Using just a 184 small set of seeds to instantiate larger state spaces means that 185 there are some initial states that are impossible to reach. This 186 creates some biases if everyone uses such values. A good way to 187 get a seed is to use a `numpy.random.SeedSequence`: 188 189 >>> from numpy.random import SeedSequence 190 >>> print(SeedSequence().entropy) 191 301439351238479871608357552876690613766 # random 192 193The `random_state` parameter in distributions accepts an instance of 194`numpy.random.Generator` class, or an integer, which is then used to 195seed an internal ``Generator`` object: 196 197 >>> norm.rvs(size=5, random_state=rng) 198 array([ 0.47143516, -1.19097569, 1.43270697, -0.3126519 , -0.72058873]) # random 199 200For further info, see `NumPy's documentation 201<https://numpy.org/doc/stable/reference/random/index.html>`__. 202 203Shifting and scaling 204^^^^^^^^^^^^^^^^^^^^ 205 206All continuous distributions take ``loc`` and ``scale`` as keyword 207parameters to adjust the location and scale of the distribution, 208e.g., for the standard normal distribution, the location is the mean and 209the scale is the standard deviation. 210 211 >>> norm.stats(loc=3, scale=4, moments="mv") 212 (array(3.0), array(16.0)) 213 214In many cases, the standardized distribution for a random variable ``X`` 215is obtained through the transformation ``(X - loc) / scale``. The 216default values are ``loc = 0`` and ``scale = 1``. 217 218Smart use of ``loc`` and ``scale`` can help modify the standard 219distributions in many ways. To illustrate the scaling further, the 220``cdf`` of an exponentially distributed RV with mean :math:`1/\lambda` 221is given by 222 223.. math:: 224 225 F(x) = 1 - \exp(-\lambda x) 226 227By applying the scaling rule above, it can be seen that by 228taking ``scale = 1./lambda`` we get the proper scale. 229 230 >>> from scipy.stats import expon 231 >>> expon.mean(scale=3.) 232 3.0 233 234.. note:: Distributions that take shape parameters may 235 require more than simple application of ``loc`` and/or 236 ``scale`` to achieve the desired form. For example, the 237 distribution of 2-D vector lengths given a constant vector 238 of length :math:`R` perturbed by independent N(0, :math:`\sigma^2`) 239 deviations in each component is 240 rice(:math:`R/\sigma`, scale= :math:`\sigma`). The first argument 241 is a shape parameter that needs to be scaled along with :math:`x`. 242 243The uniform distribution is also interesting: 244 245 >>> from scipy.stats import uniform 246 >>> uniform.cdf([0, 1, 2, 3, 4, 5], loc=1, scale=4) 247 array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ]) 248 249 250Finally, recall from the previous paragraph that we are left with the 251problem of the meaning of ``norm.rvs(5)``. As it turns out, calling a 252distribution like this, the first argument, i.e., the 5, gets passed 253to set the ``loc`` parameter. Let's see: 254 255 >>> np.mean(norm.rvs(5, size=500)) 256 5.0098355106969992 # random 257 258Thus, to explain the output of the example of the last section: 259``norm.rvs(5)`` generates a single normally distributed random variate with 260mean ``loc=5``, because of the default ``size=1``. 
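
The same call can be written without ambiguity by spelling out the keyword;
a minimal sketch (the drawn value itself is random):

    >>> norm.rvs(loc=5)
    4.479974505822873  # random
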

We recommend that you set ``loc`` and ``scale`` parameters explicitly, by
passing the values as keywords rather than as arguments. Repetition
can be minimized when calling more than one method of a given RV by
using the technique of `Freezing a Distribution`_, as explained below.


Shape parameters
^^^^^^^^^^^^^^^^

While a general continuous random variable can be shifted and scaled
with the ``loc`` and ``scale`` parameters, some distributions require
additional shape parameters. For instance, the gamma distribution with density

.. math::

    \gamma(x, a) = \frac{\lambda (\lambda x)^{a-1}}{\Gamma(a)} e^{-\lambda x}\;,

requires the shape parameter :math:`a`. Observe that setting
:math:`\lambda` can be obtained by setting the ``scale`` keyword to
:math:`1/\lambda`.

Let's check the number and name of the shape parameters of the gamma
distribution. (We know from the above that this should be 1.)

    >>> from scipy.stats import gamma
    >>> gamma.numargs
    1
    >>> gamma.shapes
    'a'

Now, we set the value of the shape variable to 1 to obtain the
exponential distribution, so that we can easily check whether we get the
results we expect.

    >>> gamma(1, scale=2.).stats(moments="mv")
    (array(2.0), array(4.0))

Notice that we can also specify shape parameters as keywords:

    >>> gamma(a=1, scale=2.).stats(moments="mv")
    (array(2.0), array(4.0))


Freezing a distribution
^^^^^^^^^^^^^^^^^^^^^^^

Passing the ``loc`` and ``scale`` keywords time and again can become
quite bothersome. The concept of `freezing` an RV is used to
solve such problems.

    >>> rv = gamma(1, scale=2.)

By using ``rv`` we no longer have to include the scale or the shape
parameters. Thus, distributions can be used in one of two
ways, either by passing all distribution parameters to each method
call (such as we did earlier) or by freezing the parameters for the
instance of the distribution. Let us check this:

    >>> rv.mean(), rv.std()
    (2.0, 2.0)

This is, indeed, what we should get.


Broadcasting
^^^^^^^^^^^^

The basic methods, such as ``pdf``, satisfy the usual NumPy broadcasting rules. For
example, we can calculate the critical values for the upper tail of
the t distribution for different probabilities and degrees of freedom.

    >>> stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])
    array([[ 1.37218364,  1.81246112,  2.76376946],
           [ 1.36343032,  1.79588482,  2.71807918]])

Here, the first row contains the critical values for 10 degrees of freedom
and the second row for 11 degrees of freedom (d.o.f.). Thus, the
broadcasting rules give the same result as calling ``isf`` twice:

    >>> stats.t.isf([0.1, 0.05, 0.01], 10)
    array([ 1.37218364,  1.81246112,  2.76376946])
    >>> stats.t.isf([0.1, 0.05, 0.01], 11)
    array([ 1.36343032,  1.79588482,  2.71807918])

If the array of probabilities, i.e., ``[0.1, 0.05, 0.01]``, and the
array of degrees of freedom, i.e., ``[10, 11, 12]``, have the same
array shape, then element-wise matching is used. As an example, we can
obtain the 10% tail for 10 d.o.f., the 5% tail for 11 d.o.f., and the
1% tail for 12 d.o.f. by calling

    >>> stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
    array([ 1.37218364,  1.79588482,  2.68099799])


..
_discrete_points_label: 357 358Specific points for discrete distributions 359^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 360 361Discrete distributions have mostly the same basic methods as the 362continuous distributions. However ``pdf`` is replaced by the probability 363mass function ``pmf``, no estimation methods, such as fit, are 364available, and ``scale`` is not a valid keyword parameter. The 365location parameter, keyword ``loc``, can still be used to shift the 366distribution. 367 368The computation of the cdf requires some extra attention. In the case 369of continuous distribution, the cumulative distribution function is, in 370most standard cases, strictly monotonic increasing in the bounds (a,b) 371and has, therefore, a unique inverse. The cdf of a discrete 372distribution, however, is a step function, hence the inverse cdf, 373i.e., the percent point function, requires a different definition: 374 375:: 376 377 ppf(q) = min{x : cdf(x) >= q, x integer} 378 379For further info, see the docs `here 380<https://docs.scipy.org/doc/scipy/reference/tutorial/stats/discrete.html#percent-point-function-inverse-cdf>`__. 381 382 383We can look at the hypergeometric distribution as an example 384 385 >>> from scipy.stats import hypergeom 386 >>> [M, n, N] = [20, 7, 12] 387 388If we use the cdf at some integer points and then evaluate the ppf at those 389cdf values, we get the initial integers back, for example 390 391 >>> x = np.arange(4) * 2 392 >>> x 393 array([0, 2, 4, 6]) 394 >>> prb = hypergeom.cdf(x, M, n, N) 395 >>> prb 396 array([ 1.03199174e-04, 5.21155831e-02, 6.08359133e-01, 397 9.89783282e-01]) 398 >>> hypergeom.ppf(prb, M, n, N) 399 array([ 0., 2., 4., 6.]) 400 401If we use values that are not at the kinks of the cdf step function, we get 402the next higher integer back: 403 404 >>> hypergeom.ppf(prb + 1e-8, M, n, N) 405 array([ 1., 3., 5., 7.]) 406 >>> hypergeom.ppf(prb - 1e-8, M, n, N) 407 array([ 0., 2., 4., 6.]) 408 409 410Fitting distributions 411^^^^^^^^^^^^^^^^^^^^^ 412 413The main additional methods of the not frozen distribution are related 414to the estimation of distribution parameters: 415 416* fit: maximum likelihood estimation of distribution parameters, including location 417 and scale 418* fit_loc_scale: estimation of location and scale when shape parameters are given 419* nnlf: negative log likelihood function 420* expect: calculate the expectation of a function against the pdf or pmf 421 422 423.. _performance_issues_label: 424 425Performance issues and cautionary remarks 426^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 427 428The performance of the individual methods, in terms of speed, varies 429widely by distribution and method. The results of a method are 430obtained in one of two ways: either by explicit calculation, or by a 431generic algorithm that is independent of the specific distribution. 432 433Explicit calculation, on the one hand, requires that the method is 434directly specified for the given distribution, either through analytic 435formulas or through special functions in ``scipy.special`` or 436``numpy.random`` for ``rvs``. These are usually relatively fast 437calculations. 438 439The generic methods, on the other hand, are used if the distribution 440does not specify any explicit calculation. To define a distribution, 441only one of pdf or cdf is necessary; all other methods can be derived 442using numeric integration and root finding. However, these indirect 443methods can be `very` slow. 
As an example, ``rgh = stats.gausshyper.rvs(0.5, 2, 2, 2, size=100)``
creates random variables in a very indirect way and takes about 19 seconds
for 100 random variables on my computer, while one million random variables
from the standard normal or from the t distribution take just above
one second.


Remaining issues
^^^^^^^^^^^^^^^^

The distributions in ``scipy.stats`` have recently been corrected and improved
and gained a considerable test suite; however, a few issues remain:

* The distributions have been tested over some range of parameters;
  however, in some corner ranges, a few incorrect results may remain.
* The maximum likelihood estimation in `fit` does not work with
  default starting parameters for all distributions and the user
  needs to supply good starting parameters. Also, for some
  distributions, using a maximum likelihood estimator might
  inherently not be the best choice.


Building specific distributions
-------------------------------

The next examples show how to build your own distributions. Further
examples show the usage of the distributions and some statistical
tests.


Making a continuous distribution, i.e., subclassing ``rv_continuous``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Making continuous distributions is fairly simple.

    >>> from scipy import stats
    >>> class deterministic_gen(stats.rv_continuous):
    ...     def _cdf(self, x):
    ...         return np.where(x < 0, 0., 1.)
    ...     def _stats(self):
    ...         return 0., 0., 0., 0.

    >>> deterministic = deterministic_gen(name="deterministic")
    >>> deterministic.cdf(np.arange(-3, 3, 0.5))
    array([ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.])

Interestingly, the ``pdf`` is now computed automatically:

    >>> deterministic.pdf(np.arange(-3, 3, 0.5))
    array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             5.83333333e+04,   4.16333634e-12,   4.16333634e-12,
             4.16333634e-12,   4.16333634e-12,   4.16333634e-12])


Be aware of the performance issues mentioned in
:ref:`performance_issues_label`. The computation of unspecified
common methods can become very slow, since only general methods are
called, which, by their very nature, cannot use any specific
information about the distribution. Thus, as a cautionary example:

    >>> from scipy.integrate import quad
    >>> quad(deterministic.pdf, -1e-1, 1e-1)
    (4.163336342344337e-13, 0.0)

But this is not correct: the integral over this pdf should be 1. Let's make the
integration interval smaller:

    >>> quad(deterministic.pdf, -1e-3, 1e-3)  # warning removed
    (1.000076872229173, 0.0010625571718182458)

This looks better. However, the problem originated from the fact that
the pdf is not specified in the class definition of the deterministic
distribution.


Subclassing ``rv_discrete``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the following, we use `stats.rv_discrete` to generate a discrete
distribution that has the probabilities of the truncated normal for the
intervals centered around the integers.
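
Before working through that construction, it may help to see the mechanism in
isolation. The following minimal sketch builds a made-up three-point
distribution (the name ``toy``, the support points, and the probabilities are
purely illustrative):

    >>> import numpy as np
    >>> from scipy import stats
    >>> xk = np.arange(3)        # integer support points 0, 1, 2
    >>> pk = (0.25, 0.5, 0.25)   # probabilities, summing to 1
    >>> toy = stats.rv_discrete(name='toy', values=(xk, pk))
    >>> toy.pmf(1)
    0.5
    >>> toy.cdf(1)
    0.75
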
526 527**General info** 528 529From the docstring of rv_discrete, ``help(stats.rv_discrete)``, 530 531 "You can construct an arbitrary discrete rv where P{X=xk} = pk by 532 passing to the rv_discrete initialization method (through the values= 533 keyword) a tuple of sequences (xk, pk) which describes only those 534 values of X (xk) that occur with nonzero probability (pk)." 535 536Next to this, there are some further requirements for this approach to 537work: 538 539* The keyword `name` is required. 540* The support points of the distribution xk have to be integers. 541* The number of significant digits (decimals) needs to be specified. 542 543In fact, if the last two requirements are not satisfied, an exception 544may be raised or the resulting numbers may be incorrect. 545 546**An example** 547 548Let's do the work. First: 549 550 >>> npoints = 20 # number of integer support points of the distribution minus 1 551 >>> npointsh = npoints // 2 552 >>> npointsf = float(npoints) 553 >>> nbound = 4 # bounds for the truncated normal 554 >>> normbound = (1+1/npointsf) * nbound # actual bounds of truncated normal 555 >>> grid = np.arange(-npointsh, npointsh+2, 1) # integer grid 556 >>> gridlimitsnorm = (grid-0.5) / npointsh * nbound # bin limits for the truncnorm 557 >>> gridlimits = grid - 0.5 # used later in the analysis 558 >>> grid = grid[:-1] 559 >>> probs = np.diff(stats.truncnorm.cdf(gridlimitsnorm, -normbound, normbound)) 560 >>> gridint = grid 561 562And, finally, we can subclass ``rv_discrete``: 563 564 >>> normdiscrete = stats.rv_discrete(values=(gridint, 565 ... np.round(probs, decimals=7)), name='normdiscrete') 566 567Now that we have defined the distribution, we have access to all 568common methods of discrete distributions. 569 570 >>> print('mean = %6.4f, variance = %6.4f, skew = %6.4f, kurtosis = %6.4f' % 571 ... normdiscrete.stats(moments='mvsk')) 572 mean = -0.0000, variance = 6.3302, skew = 0.0000, kurtosis = -0.0076 573 574 >>> nd_std = np.sqrt(normdiscrete.stats(moments='v')) 575 576**Testing the implementation** 577 578Let's generate a random sample and compare observed frequencies with 579the probabilities. 580 581 >>> n_sample = 500 582 >>> rvs = normdiscrete.rvs(size=n_sample) 583 >>> f, l = np.histogram(rvs, bins=gridlimits) 584 >>> sfreq = np.vstack([gridint, f, probs*n_sample]).T 585 >>> print(sfreq) 586 [[-1.00000000e+01 0.00000000e+00 2.95019349e-02] # random 587 [-9.00000000e+00 0.00000000e+00 1.32294142e-01] 588 [-8.00000000e+00 0.00000000e+00 5.06497902e-01] 589 [-7.00000000e+00 2.00000000e+00 1.65568919e+00] 590 [-6.00000000e+00 1.00000000e+00 4.62125309e+00] 591 [-5.00000000e+00 9.00000000e+00 1.10137298e+01] 592 [-4.00000000e+00 2.60000000e+01 2.24137683e+01] 593 [-3.00000000e+00 3.70000000e+01 3.89503370e+01] 594 [-2.00000000e+00 5.10000000e+01 5.78004747e+01] 595 [-1.00000000e+00 7.10000000e+01 7.32455414e+01] 596 [ 0.00000000e+00 7.40000000e+01 7.92618251e+01] 597 [ 1.00000000e+00 8.90000000e+01 7.32455414e+01] 598 [ 2.00000000e+00 5.50000000e+01 5.78004747e+01] 599 [ 3.00000000e+00 5.00000000e+01 3.89503370e+01] 600 [ 4.00000000e+00 1.70000000e+01 2.24137683e+01] 601 [ 5.00000000e+00 1.10000000e+01 1.10137298e+01] 602 [ 6.00000000e+00 4.00000000e+00 4.62125309e+00] 603 [ 7.00000000e+00 3.00000000e+00 1.65568919e+00] 604 [ 8.00000000e+00 0.00000000e+00 5.06497902e-01] 605 [ 9.00000000e+00 0.00000000e+00 1.32294142e-01] 606 [ 1.00000000e+01 0.00000000e+00 2.95019349e-02]] 607 608 609.. 

.. plot:: tutorial/examples/normdiscr_plot1.py
   :align: center
   :include-source: 0


.. plot:: tutorial/examples/normdiscr_plot2.py
   :align: center
   :include-source: 0


Next, we can test whether our sample was generated by our norm-discrete
distribution. This also verifies whether the random numbers were generated
correctly.

The chisquare test requires that there are a minimum number of observations
in each bin. We combine the tail bins into larger bins so that they contain
enough observations.

    >>> f2 = np.hstack([f[:5].sum(), f[5:-5], f[-5:].sum()])
    >>> p2 = np.hstack([probs[:5].sum(), probs[5:-5], probs[-5:].sum()])
    >>> ch2, pval = stats.chisquare(f2, p2*n_sample)

    >>> print('chisquare for normdiscrete: chi2 = %6.3f pvalue = %6.4f' % (ch2, pval))
    chisquare for normdiscrete: chi2 = 12.466 pvalue = 0.4090  # random

The p-value in this case is high, so we can be quite confident that
our random sample was actually generated by the distribution.


Analysing one sample
--------------------

First, we create some random variables. (To get identical results in each
run, you could seed the random number generator explicitly, as explained
above; here, the outputs are simply marked as random.) As an example, we
take a sample from the Student t distribution:

    >>> x = stats.t.rvs(10, size=1000)

Here, we set the required shape parameter of the t distribution, which
in statistics corresponds to the degrees of freedom, to 10. Using size=1000 means
that our sample consists of 1000 independently drawn (pseudo) random numbers.
Since we did not specify the keyword arguments `loc` and `scale`, those are
set to their default values zero and one.

Descriptive statistics
^^^^^^^^^^^^^^^^^^^^^^

`x` is a numpy array, and we have direct access to all array methods, e.g.,

    >>> print(x.min())   # equivalent to np.min(x)
    -3.78975572422  # random
    >>> print(x.max())   # equivalent to np.max(x)
    5.26327732981  # random
    >>> print(x.mean())  # equivalent to np.mean(x)
    0.0140610663985  # random
    >>> print(x.var())   # equivalent to np.var(x)
    1.28899386208  # random

How do the sample properties compare to their theoretical counterparts?

    >>> m, v, s, k = stats.t.stats(10, moments='mvsk')
    >>> n, (smin, smax), sm, sv, ss, sk = stats.describe(x)

    >>> sstr = '%-14s mean = %6.4f, variance = %6.4f, skew = %6.4f, kurtosis = %6.4f'
    >>> print(sstr % ('distribution:', m, v, s, k))
    distribution:  mean = 0.0000, variance = 1.2500, skew = 0.0000, kurtosis = 1.0000  # random
    >>> print(sstr % ('sample:', sm, sv, ss, sk))
    sample:        mean = 0.0141, variance = 1.2903, skew = 0.2165, kurtosis = 1.0556  # random

Note: `stats.describe` uses the unbiased estimator for the variance, while
np.var is the biased estimator.


For our sample, the sample statistics differ by a small amount from
their theoretical counterparts.


T-test and KS-test
^^^^^^^^^^^^^^^^^^

We can use the t-test to test whether the mean of our sample differs
in a statistically significant way from the theoretical expectation.

    >>> print('t-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, m))
    t-statistic =  0.391 pvalue = 0.6955  # random

The p-value is 0.7; this means that, with an alpha error of, for
example, 10%, we cannot reject the hypothesis that the sample mean
is equal to zero, the expectation of the standard t-distribution.
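
Conversely, if we test against a hypothesized mean that is far from the true
mean of the distribution (an arbitrary value such as 1), the test rejects
clearly; a quick sketch (the exact numbers shown are random):

    >>> print('t-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, 1.0))
    t-statistic = -27.487 pvalue = 0.0000  # random
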

As an exercise, we can also calculate the t-test directly, without
using the provided function, which should give us the same answer,
and so it does:

    >>> tt = (sm-m)/np.sqrt(sv/float(n))  # t-statistic for mean
    >>> pval = stats.t.sf(np.abs(tt), n-1)*2  # two-sided pvalue = Prob(abs(t)>tt)
    >>> print('t-statistic = %6.3f pvalue = %6.4f' % (tt, pval))
    t-statistic =  0.391 pvalue = 0.6955  # random

The Kolmogorov-Smirnov test can be used to test the hypothesis that
the sample comes from the standard t-distribution:

    >>> print('KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(x, 't', (10,)))
    KS-statistic D =  0.016 pvalue = 0.9571  # random

Again, the p-value is high enough that we cannot reject the
hypothesis that the random sample really is distributed according to the
t-distribution. In real applications, we don't know what the
underlying distribution is. If we perform the Kolmogorov-Smirnov
test of our sample against the standard normal distribution, then we
also cannot reject the hypothesis that our sample was generated by the
normal distribution, given that, in this example, the p-value is almost 40%.

    >>> print('KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(x, 'norm'))
    KS-statistic D =  0.028 pvalue = 0.3918  # random

However, the standard normal distribution has a variance of 1, while our
sample has a variance of 1.29. If we standardize our sample and test it
against the normal distribution, then the p-value is again large enough
that we cannot reject the hypothesis that the sample came from the
normal distribution.

    >>> d, pval = stats.kstest((x-x.mean())/x.std(), 'norm')
    >>> print('KS-statistic D = %6.3f pvalue = %6.4f' % (d, pval))
    KS-statistic D =  0.032 pvalue = 0.2397  # random

Note: The Kolmogorov-Smirnov test assumes that we test against a
distribution with given parameters. Since, in the last case, we
estimated mean and variance, this assumption is violated, and the
distribution of the test statistic, on which the p-value is based, is
not correct.

Tails of the distribution
^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, we can check the upper tail of the distribution. We can use
the percent point function ppf, which is the inverse of the cdf
function, to obtain the critical values, or, more directly, we can use
the inverse of the survival function:

    >>> crit01, crit05, crit10 = stats.t.ppf([1-0.01, 1-0.05, 1-0.10], 10)
    >>> print('critical values from ppf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' % (crit01, crit05, crit10))
    critical values from ppf at 1%, 5% and 10%   2.7638   1.8125   1.3722
    >>> print('critical values from isf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' % tuple(stats.t.isf([0.01,0.05,0.10],10)))
    critical values from isf at 1%, 5% and 10%   2.7638   1.8125   1.3722

    >>> freq01 = np.sum(x>crit01) / float(n) * 100
    >>> freq05 = np.sum(x>crit05) / float(n) * 100
    >>> freq10 = np.sum(x>crit10) / float(n) * 100
    >>> print('sample %%-frequency at 1%%, 5%% and 10%% tail %8.4f %8.4f %8.4f' % (freq01, freq05, freq10))
    sample %-frequency at 1%, 5% and 10% tail   1.4000   5.8000  10.5000  # random

In all three cases, our sample has more weight in the top tail than the
underlying distribution.
We can briefly check a larger sample to see if we get a closer match.
In this 765case, the empirical frequency is quite close to the theoretical probability, 766but if we repeat this several times, the fluctuations are still pretty large. 767 768 >>> freq05l = np.sum(stats.t.rvs(10, size=10000) > crit05) / 10000.0 * 100 769 >>> print('larger sample %%-frequency at 5%% tail %8.4f' % freq05l) 770 larger sample %-frequency at 5% tail 4.8000 # random 771 772We can also compare it with the tail of the normal distribution, which 773has less weight in the tails: 774 775 >>> print('tail prob. of normal at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' % 776 ... tuple(stats.norm.sf([crit01, crit05, crit10])*100)) 777 tail prob. of normal at 1%, 5% and 10% 0.2857 3.4957 8.5003 778 779The chisquare test can be used to test whether for a finite number of bins, 780the observed frequencies differ significantly from the probabilities of the 781hypothesized distribution. 782 783 >>> quantiles = [0.0, 0.01, 0.05, 0.1, 1-0.10, 1-0.05, 1-0.01, 1.0] 784 >>> crit = stats.t.ppf(quantiles, 10) 785 >>> crit 786 array([ -inf, -2.76376946, -1.81246112, -1.37218364, 1.37218364, 787 1.81246112, 2.76376946, inf]) 788 >>> n_sample = x.size 789 >>> freqcount = np.histogram(x, bins=crit)[0] 790 >>> tprob = np.diff(quantiles) 791 >>> nprob = np.diff(stats.norm.cdf(crit)) 792 >>> tch, tpval = stats.chisquare(freqcount, tprob*n_sample) 793 >>> nch, npval = stats.chisquare(freqcount, nprob*n_sample) 794 >>> print('chisquare for t: chi2 = %6.2f pvalue = %6.4f' % (tch, tpval)) 795 chisquare for t: chi2 = 2.30 pvalue = 0.8901 # random 796 >>> print('chisquare for normal: chi2 = %6.2f pvalue = %6.4f' % (nch, npval)) 797 chisquare for normal: chi2 = 64.60 pvalue = 0.0000 # random 798 799We see that the standard normal distribution is clearly rejected, while the 800standard t-distribution cannot be rejected. Since the variance of our sample 801differs from both standard distributions, we can again redo the test taking 802the estimate for scale and location into account. 803 804The fit method of the distributions can be used to estimate the parameters 805of the distribution, and the test is repeated using probabilities of the 806estimated distribution. 807 808 >>> tdof, tloc, tscale = stats.t.fit(x) 809 >>> nloc, nscale = stats.norm.fit(x) 810 >>> tprob = np.diff(stats.t.cdf(crit, tdof, loc=tloc, scale=tscale)) 811 >>> nprob = np.diff(stats.norm.cdf(crit, loc=nloc, scale=nscale)) 812 >>> tch, tpval = stats.chisquare(freqcount, tprob*n_sample) 813 >>> nch, npval = stats.chisquare(freqcount, nprob*n_sample) 814 >>> print('chisquare for t: chi2 = %6.2f pvalue = %6.4f' % (tch, tpval)) 815 chisquare for t: chi2 = 1.58 pvalue = 0.9542 # random 816 >>> print('chisquare for normal: chi2 = %6.2f pvalue = %6.4f' % (nch, npval)) 817 chisquare for normal: chi2 = 11.08 pvalue = 0.0858 # random 818 819Taking account of the estimated parameters, we can still reject the 820hypothesis that our sample came from a normal distribution (at the 5% level), 821but again, with a p-value of 0.95, we cannot reject the t-distribution. 822 823 824Special tests for normal distributions 825^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 826 827Since the normal distribution is the most common distribution in statistics, 828there are several additional functions available to test whether a sample 829could have been drawn from a normal distribution. 
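
One such function is the Shapiro-Wilk test, `stats.shapiro`. As a brief aside,
applying it typically rejects normality for our heavy-tailed sample as well;
a minimal sketch (the exact values shown are random):

    >>> w, p = stats.shapiro(x)
    >>> print('Shapiro-Wilk W = %6.3f pvalue = %6.4f' % (w, p))
    Shapiro-Wilk W =  0.990 pvalue = 0.0000  # random
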

First, we can test whether the skew and kurtosis of our sample differ
significantly from those of a normal distribution:

    >>> print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(x))
    normal skewtest teststat =  2.785 pvalue = 0.0054  # random
    >>> print('normal kurtosistest teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(x))
    normal kurtosistest teststat =  4.757 pvalue = 0.0000  # random

These two tests are combined in the normality test:

    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(x))
    normaltest teststat = 30.379 pvalue = 0.0000  # random

In all three tests, the p-values are very low and we can reject the hypothesis
that our sample has the skew and kurtosis of the normal distribution.

Since skew and kurtosis of our sample are based on central moments, we get
exactly the same results if we test the standardized sample:

    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' %
    ...       stats.normaltest((x-x.mean())/x.std()))
    normaltest teststat = 30.379 pvalue = 0.0000  # random

Because normality is rejected so strongly, we can check whether the
normaltest gives reasonable results for other cases:

    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' %
    ...       stats.normaltest(stats.t.rvs(10, size=100)))
    normaltest teststat =  4.698 pvalue = 0.0955  # random
    >>> print('normaltest teststat = %6.3f pvalue = %6.4f' %
    ...       stats.normaltest(stats.norm.rvs(size=1000)))
    normaltest teststat =  0.613 pvalue = 0.7361  # random

When testing for normality of a small sample of t-distributed observations
and a large sample of normally distributed observations, we can in neither
case reject the null hypothesis that the sample comes from a normal
distribution. In the first case, this is because the test is not powerful
enough to distinguish a t-distributed and a normally distributed random
variable in a small sample.


Comparing two samples
---------------------

In the following, we are given two samples, which can come either from the
same or from different distributions, and we want to test whether these
samples have the same statistical properties.
878 879 880Comparing means 881^^^^^^^^^^^^^^^ 882 883Test with sample with identical means: 884 885 >>> rvs1 = stats.norm.rvs(loc=5, scale=10, size=500) 886 >>> rvs2 = stats.norm.rvs(loc=5, scale=10, size=500) 887 >>> stats.ttest_ind(rvs1, rvs2) 888 Ttest_indResult(statistic=-0.5489036175088705, pvalue=0.5831943748663959) # random 889 890Test with sample with different means: 891 892 >>> rvs3 = stats.norm.rvs(loc=8, scale=10, size=500) 893 >>> stats.ttest_ind(rvs1, rvs3) 894 Ttest_indResult(statistic=-4.533414290175026, pvalue=6.507128186389019e-06) # random 895 896Kolmogorov-Smirnov test for two samples ks_2samp 897^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 898 899For the example, where both samples are drawn from the same distribution, 900we cannot reject the null hypothesis, since the pvalue is high 901 902 >>> stats.ks_2samp(rvs1, rvs2) 903 KstestResult(statistic=0.026, pvalue=0.9959527565364388) # random 904 905In the second example, with different location, i.e., means, we can 906reject the null hypothesis, since the pvalue is below 1% 907 908 >>> stats.ks_2samp(rvs1, rvs3) 909 KstestResult(statistic=0.114, pvalue=0.00299005061044668) # random 910 911Kernel density estimation 912------------------------- 913 914A common task in statistics is to estimate the probability density function 915(PDF) of a random variable from a set of data samples. This task is called 916density estimation. The most well-known tool to do this is the histogram. 917A histogram is a useful tool for visualization (mainly because everyone 918understands it), but doesn't use the available data very efficiently. Kernel 919density estimation (KDE) is a more efficient tool for the same task. The 920:func:`~stats.gaussian_kde` estimator can be used to estimate the PDF of univariate as 921well as multivariate data. It works best if the data is unimodal. 922 923 924Univariate estimation 925^^^^^^^^^^^^^^^^^^^^^ 926 927We start with a minimal amount of data in order to see how :func:`~stats.gaussian_kde` 928works and what the different options for bandwidth selection do. The data 929sampled from the PDF are shown as blue dashes at the bottom of the figure (this 930is called a rug plot): 931 932.. plot:: 933 934 >>> from scipy import stats 935 >>> import matplotlib.pyplot as plt 936 937 >>> x1 = np.array([-7, -5, 1, 4, 5], dtype=np.float64) 938 >>> kde1 = stats.gaussian_kde(x1) 939 >>> kde2 = stats.gaussian_kde(x1, bw_method='silverman') 940 941 >>> fig = plt.figure() 942 >>> ax = fig.add_subplot(111) 943 944 >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20) # rug plot 945 >>> x_eval = np.linspace(-10, 10, num=200) 946 >>> ax.plot(x_eval, kde1(x_eval), 'k-', label="Scott's Rule") 947 >>> ax.plot(x_eval, kde2(x_eval), 'r-', label="Silverman's Rule") 948 949 >>> plt.show() 950 951We see that there is very little difference between Scott's Rule and 952Silverman's Rule, and that the bandwidth selection with a limited amount of 953data is probably a bit too wide. We can define our own bandwidth function to 954get a less smoothed-out result. 955 956 >>> def my_kde_bandwidth(obj, fac=1./5): 957 ... """We use Scott's Rule, multiplied by a constant factor.""" 958 ... return np.power(obj.n, -1./(obj.d+4)) * fac 959 960 >>> fig = plt.figure() 961 >>> ax = fig.add_subplot(111) 962 963 >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20) # rug plot 964 >>> kde3 = stats.gaussian_kde(x1, bw_method=my_kde_bandwidth) 965 >>> ax.plot(x_eval, kde3(x_eval), 'g-', label="With smaller BW") 966 967 >>> plt.show() 968 969.. 
plot:: tutorial/stats/plots/kde_plot2.py 970 :align: center 971 :include-source: 0 972 973We see that if we set bandwidth to be very narrow, the obtained estimate for 974the probability density function (PDF) is simply the sum of Gaussians around 975each data point. 976 977We now take a more realistic example and look at the difference between the 978two available bandwidth selection rules. Those rules are known to work well 979for (close to) normal distributions, but even for unimodal distributions that 980are quite strongly non-normal they work reasonably well. As a non-normal 981distribution we take a Student's T distribution with 5 degrees of freedom. 982 983.. plot:: tutorial/stats/plots/kde_plot3.py 984 :align: center 985 :include-source: 1 986 987We now take a look at a bimodal distribution with one wider and one narrower 988Gaussian feature. We expect that this will be a more difficult density to 989approximate, due to the different bandwidths required to accurately resolve 990each feature. 991 992 >>> from functools import partial 993 994 >>> loc1, scale1, size1 = (-2, 1, 175) 995 >>> loc2, scale2, size2 = (2, 0.2, 50) 996 >>> x2 = np.concatenate([np.random.normal(loc=loc1, scale=scale1, size=size1), 997 ... np.random.normal(loc=loc2, scale=scale2, size=size2)]) 998 999 >>> x_eval = np.linspace(x2.min() - 1, x2.max() + 1, 500) 1000 1001 >>> kde = stats.gaussian_kde(x2) 1002 >>> kde2 = stats.gaussian_kde(x2, bw_method='silverman') 1003 >>> kde3 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.2)) 1004 >>> kde4 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.5)) 1005 1006 >>> pdf = stats.norm.pdf 1007 >>> bimodal_pdf = pdf(x_eval, loc=loc1, scale=scale1) * float(size1) / x2.size + \ 1008 ... pdf(x_eval, loc=loc2, scale=scale2) * float(size2) / x2.size 1009 1010 >>> fig = plt.figure(figsize=(8, 6)) 1011 >>> ax = fig.add_subplot(111) 1012 1013 >>> ax.plot(x2, np.zeros(x2.shape), 'b+', ms=12) 1014 >>> ax.plot(x_eval, kde(x_eval), 'k-', label="Scott's Rule") 1015 >>> ax.plot(x_eval, kde2(x_eval), 'b-', label="Silverman's Rule") 1016 >>> ax.plot(x_eval, kde3(x_eval), 'g-', label="Scott * 0.2") 1017 >>> ax.plot(x_eval, kde4(x_eval), 'c-', label="Scott * 0.5") 1018 >>> ax.plot(x_eval, bimodal_pdf, 'r--', label="Actual PDF") 1019 1020 >>> ax.set_xlim([x_eval.min(), x_eval.max()]) 1021 >>> ax.legend(loc=2) 1022 >>> ax.set_xlabel('x') 1023 >>> ax.set_ylabel('Density') 1024 >>> plt.show() 1025 1026.. plot:: tutorial/stats/plots/kde_plot4.py 1027 :align: center 1028 :include-source: 0 1029 1030As expected, the KDE is not as close to the true PDF as we would like due to 1031the different characteristic size of the two features of the bimodal 1032distribution. By halving the default bandwidth (``Scott * 0.5``), we can do 1033somewhat better, while using a factor 5 smaller bandwidth than the default 1034doesn't smooth enough. What we really need, though, in this case, is a 1035non-uniform (adaptive) bandwidth. 1036 1037 1038Multivariate estimation 1039^^^^^^^^^^^^^^^^^^^^^^^ 1040 1041With :func:`~stats.gaussian_kde` we can perform multivariate, as well as univariate 1042estimation. We demonstrate the bivariate case. First, we generate some random 1043data with a model in which the two variates are correlated. 1044 1045 >>> def measure(n): 1046 ... """Measurement model, return two coupled measurements.""" 1047 ... m1 = np.random.normal(size=n) 1048 ... m2 = np.random.normal(scale=0.5, size=n) 1049 ... 
return m1+m2, m1-m2 1050 1051 >>> m1, m2 = measure(2000) 1052 >>> xmin = m1.min() 1053 >>> xmax = m1.max() 1054 >>> ymin = m2.min() 1055 >>> ymax = m2.max() 1056 1057Then we apply the KDE to the data: 1058 1059 >>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j] 1060 >>> positions = np.vstack([X.ravel(), Y.ravel()]) 1061 >>> values = np.vstack([m1, m2]) 1062 >>> kernel = stats.gaussian_kde(values) 1063 >>> Z = np.reshape(kernel.evaluate(positions).T, X.shape) 1064 1065Finally, we plot the estimated bivariate distribution as a colormap and plot 1066the individual data points on top. 1067 1068 >>> fig = plt.figure(figsize=(8, 6)) 1069 >>> ax = fig.add_subplot(111) 1070 1071 >>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r, 1072 ... extent=[xmin, xmax, ymin, ymax]) 1073 >>> ax.plot(m1, m2, 'k.', markersize=2) 1074 1075 >>> ax.set_xlim([xmin, xmax]) 1076 >>> ax.set_ylim([ymin, ymax]) 1077 1078 >>> plt.show() 1079 1080.. plot:: tutorial/stats/plots/kde_plot5.py 1081 :align: center 1082 :include-source: 0 1083 1084 1085Multiscale Graph Correlation (MGC) 1086^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1087 1088With :func:`~stats.multiscale_graphcorr`, we can test for independence on high 1089dimensional and nonlinear data. Before we start, let's import some useful 1090packages: 1091 1092 >>> import numpy as np 1093 >>> import matplotlib.pyplot as plt; plt.style.use('classic') 1094 >>> from scipy.stats import multiscale_graphcorr 1095 1096Let's use a custom plotting function to plot the data relationship: 1097 1098 >>> def mgc_plot(x, y, sim_name, mgc_dict=None, only_viz=False, 1099 ... only_mgc=False): 1100 ... """Plot sim and MGC-plot""" 1101 ... if not only_mgc: 1102 ... # simulation 1103 ... plt.figure(figsize=(8, 8)) 1104 ... ax = plt.gca() 1105 ... ax.set_title(sim_name + " Simulation", fontsize=20) 1106 ... ax.scatter(x, y) 1107 ... ax.set_xlabel('X', fontsize=15) 1108 ... ax.set_ylabel('Y', fontsize=15) 1109 ... ax.axis('equal') 1110 ... ax.tick_params(axis="x", labelsize=15) 1111 ... ax.tick_params(axis="y", labelsize=15) 1112 ... plt.show() 1113 ... if not only_viz: 1114 ... # local correlation map 1115 ... plt.figure(figsize=(8,8)) 1116 ... ax = plt.gca() 1117 ... mgc_map = mgc_dict["mgc_map"] 1118 ... # draw heatmap 1119 ... ax.set_title("Local Correlation Map", fontsize=20) 1120 ... im = ax.imshow(mgc_map, cmap='YlGnBu') 1121 ... # colorbar 1122 ... cbar = ax.figure.colorbar(im, ax=ax) 1123 ... cbar.ax.set_ylabel("", rotation=-90, va="bottom") 1124 ... ax.invert_yaxis() 1125 ... # Turn spines off and create white grid. 1126 ... for edge, spine in ax.spines.items(): 1127 ... spine.set_visible(False) 1128 ... # optimal scale 1129 ... opt_scale = mgc_dict["opt_scale"] 1130 ... ax.scatter(opt_scale[0], opt_scale[1], 1131 ... marker='X', s=200, color='red') 1132 ... # other formatting 1133 ... ax.tick_params(bottom="off", left="off") 1134 ... ax.set_xlabel('#Neighbors for X', fontsize=15) 1135 ... ax.set_ylabel('#Neighbors for Y', fontsize=15) 1136 ... ax.tick_params(axis="x", labelsize=15) 1137 ... ax.tick_params(axis="y", labelsize=15) 1138 ... ax.set_xlim(0, 100) 1139 ... ax.set_ylim(0, 100) 1140 ... plt.show() 1141 1142Let's look at some linear data first: 1143 1144 >>> rng = np.random.default_rng() 1145 >>> x = np.linspace(-1, 1, num=100) 1146 >>> y = x + 0.3 * rng.random(x.size) 1147 1148The simulation relationship can be plotted below: 1149 1150 >>> mgc_plot(x, y, "Linear", only_viz=True) 1151 1152.. 
plot:: tutorial/stats/plots/mgc_plot1.py 1153 :align: center 1154 :include-source: 0 1155 1156Now, we can see the test statistic, p-value, and MGC map visualized below. The 1157optimal scale is shown on the map as a red "x": 1158 1159 >>> stat, pvalue, mgc_dict = multiscale_graphcorr(x, y) 1160 >>> print("MGC test statistic: ", round(stat, 1)) 1161 MGC test statistic: 1.0 1162 >>> print("P-value: ", round(pvalue, 1)) 1163 P-value: 0.0 1164 >>> mgc_plot(x, y, "Linear", mgc_dict, only_mgc=True) 1165 1166.. plot:: tutorial/stats/plots/mgc_plot2.py 1167 :align: center 1168 :include-source: 0 1169 1170It is clear from here, that MGC is able to determine a relationship between the 1171input data matrices because the p-value is very low and the MGC test statistic 1172is relatively high. The MGC-map indicates a **strongly linear relationship**. 1173Intuitively, this is because having more neighbors will help in identifying a 1174linear relationship between :math:`x` and :math:`y`. The optimal scale in this 1175case is **equivalent to the global scale**, marked by a red spot on the map. 1176 1177The same can be done for nonlinear data sets. The following :math:`x` and 1178:math:`y` arrays are derived from a nonlinear simulation: 1179 1180 >>> unif = np.array(rng.uniform(0, 5, size=100)) 1181 >>> x = unif * np.cos(np.pi * unif) 1182 >>> y = unif * np.sin(np.pi * unif) + 0.4 * rng.random(x.size) 1183 1184The simulation relationship can be plotted below: 1185 1186 >>> mgc_plot(x, y, "Spiral", only_viz=True) 1187 1188.. plot:: tutorial/stats/plots/mgc_plot3.py 1189 :align: center 1190 :include-source: 0 1191 1192Now, we can see the test statistic, p-value, and MGC map visualized below. The 1193optimal scale is shown on the map as a red "x": 1194 1195 >>> stat, pvalue, mgc_dict = multiscale_graphcorr(x, y) 1196 >>> print("MGC test statistic: ", round(stat, 1)) 1197 MGC test statistic: 0.2 # random 1198 >>> print("P-value: ", round(pvalue, 1)) 1199 P-value: 0.0 1200 >>> mgc_plot(x, y, "Spiral", mgc_dict, only_mgc=True) 1201 1202.. plot:: tutorial/stats/plots/mgc_plot4.py 1203 :align: center 1204 :include-source: 0 1205 1206It is clear from here, that MGC is able to determine a relationship again 1207because the p-value is very low and the MGC test statistic is relatively high. 1208The MGC-map indicates a **strongly nonlinear relationship**. The optimal scale 1209in this case is **equivalent to the local scale**, marked by a red spot on the 1210map. 1211 1212Quasi-Monte Carlo 1213----------------- 1214 1215Before talking about Quasi-Monte Carlo (QMC), a quick introduction about Monte 1216Carlo (MC). MC methods, or MC experiments, are a broad class of 1217computational algorithms that rely on repeated random sampling to obtain 1218numerical results. The underlying concept is to use randomness to solve 1219problems that might be deterministic in principle. They are often used in 1220physical and mathematical problems and are most useful when it is difficult or 1221impossible to use other approaches. MC methods are mainly used in 1222three problem classes: optimization, numerical integration, and generating 1223draws from a probability distribution. 1224 1225Generating random numbers with specific properties is a more complex problem 1226than it sounds. Simple MC methods are designed to sample points to be 1227independent and identically distributed (IID). But generating multiple sets 1228of random points can produce radically different results. 1229 1230.. 
plot:: tutorial/stats/plots/qmc_plot_mc.py 1231 :align: center 1232 :include-source: 0 1233 1234In both cases in the plot above, points are generated randomly without any 1235knowledge about previously drawn points. It is clear that some regions of 1236the space are left unexplored - which can cause problems in simulations as a 1237particular set of points might trigger a totally different behaviour. 1238 1239A great benefit of MC is that it has known convergence properties. 1240Let's look at the mean of the squared sum in 5 dimensions: 1241 1242.. math:: 1243 1244 f(\mathbf{x}) = \left( \sum_{j=1}^{5}x_j \right)^2, 1245 1246with :math:`x_j \sim \mathcal{U}(0,1)`. It has a known mean value, 1247:math:`\mu = 5/3+5(5-1)/4`. Using MC sampling, we 1248can compute that mean numerically, and the approximation error follows a 1249theoretical rate of :math:`O(n^{-1/2})`. 1250 1251.. plot:: tutorial/stats/plots/qmc_plot_conv_mc.py 1252 :align: center 1253 :include-source: 0 1254 1255Although the convergence is ensured, practitioners tend to want to have an 1256exploration process which is more deterministic. With normal MC, a seed can be 1257used to have a repeatable process. But fixing the seed would break the 1258convergence property: a given seed could work for a given class of problem 1259and break for another one. 1260 1261What is commonly done to walk through the space in a deterministic manner, is 1262to use a regular grid spanning all parameter dimensions, also called a 1263saturated design. Let’s consider the unit-hypercube, with all bounds ranging 1264from 0 to 1. Now, having a distance of 0.1 between points, the number of points 1265required to fill the unit interval would be 10. In a 2-dimensional hypercube 1266the same spacing would require 100, and in 3 dimensions 1,000 points. As the 1267number of dimensions grows, the number of experiments which is required to fill 1268the space rises exponentially as the dimensionality of the space increases. 1269This exponential growth is called "the curse of dimensionality". 1270 1271 >>> import numpy as np 1272 >>> disc = 10 1273 >>> x1 = np.linspace(0, 1, disc) 1274 >>> x2 = np.linspace(0, 1, disc) 1275 >>> x3 = np.linspace(0, 1, disc) 1276 >>> x1, x2, x3 = np.meshgrid(x1, x2, x3) 1277 1278.. plot:: tutorial/stats/plots/qmc_plot_curse.py 1279 :align: center 1280 :include-source: 0 1281 1282To mitigate this issue, QMC methods have been designed. They are 1283deterministic, have a good coverage of the space and some of them can be 1284continued and retain good properties. 1285The main difference with MC methods is that the points are not IID but they 1286know about previous points. Hence, some methods are also referred to as 1287sequences. 1288 1289.. plot:: tutorial/stats/plots/qmc_plot_mc_qmc.py 1290 :align: center 1291 :include-source: 0 1292 1293This figure presents 2 sets of 256 points. The design of the left is a plain 1294MC whereas the design of the right is a QMC design using the *Sobol'* method. 1295We clearly see that the QMC version is more uniform. The points sample better 1296near the boundaries and there are less clusters or gaps. 1297 1298One way to assess the uniformity is to use a measure called the discrepancy. 1299Here the discrepancy of *Sobol'* points is better than crude MC. 1300 1301Coming back to the computation of the mean, QMC methods also have better rates 1302of convergence for the error. They can achieve :math:`O(n^{-1})` for this 1303function, and even better rates on very smooth functions. 
This figure shows 1304that the *Sobol'* method has a rate of :math:`O(n^{-1})`: 1305 1306.. plot:: tutorial/stats/plots/qmc_plot_conv_mc_sobol.py 1307 :align: center 1308 :include-source: 0 1309 1310We refer to the documentation of :mod:`scipy.stats.qmc` for 1311more mathematical details. 1312 1313Calculate the discrepancy 1314^^^^^^^^^^^^^^^^^^^^^^^^^ 1315 1316Let's consider two sets of points. From the figure below, it is clear that 1317the design on the left covers more of the space than the design on the right. 1318This can be quantified using a :func:`~stats.qmc.discrepancy` measure. 1319The lower the discrepancy, the more uniform a sample is. 1320 1321 >>> import numpy as np 1322 >>> from scipy.stats import qmc 1323 >>> space_1 = np.array([[1, 3], [2, 6], [3, 2], [4, 5], [5, 1], [6, 4]]) 1324 >>> space_2 = np.array([[1, 5], [2, 4], [3, 3], [4, 2], [5, 1], [6, 6]]) 1325 >>> l_bounds = [0.5, 0.5] 1326 >>> u_bounds = [6.5, 6.5] 1327 >>> space_1 = qmc.scale(space_1, l_bounds, u_bounds, reverse=True) 1328 >>> space_2 = qmc.scale(space_2, l_bounds, u_bounds, reverse=True) 1329 >>> qmc.discrepancy(space_1) 1330 0.008142039609053464 1331 >>> qmc.discrepancy(space_2) 1332 0.010456854423869011 1333 1334.. plot:: tutorial/stats/plots/qmc_plot_discrepancy.py 1335 :align: center 1336 :include-source: 0 1337 1338Using a QMC engine 1339^^^^^^^^^^^^^^^^^^ 1340 1341Several QMC samplers/engines are implemented. Here we look at two of the most 1342used QMC methods: :class:`~stats.qmc.Sobol` and :class:`~stats.qmc.Halton` 1343sequences. 1344 1345.. plot:: tutorial/stats/plots/qmc_plot_sobol_halton.py 1346 :align: center 1347 :include-source: 1 1348 1349.. warning:: QMC methods require particular care and the user must read the 1350 documentation to avoid common pitfalls. *Sobol'* for instance requires a 1351 number of points following a power of 2. Also, thinning, burning or other 1352 point selection can break the properties of the sequence and result in a 1353 set of points which would not be better than MC. 1354 1355QMC engines are state-aware. Meaning that you can continue the sequence, 1356skip some points, or reset it. Let's take 5 points from 1357:class:`~stats.qmc.Halton`. And then ask for a second set of 5 points: 1358 1359 >>> from scipy.stats import qmc 1360 >>> engine = qmc.Halton(d=2) 1361 >>> engine.random(5) 1362 array([[0.22166437, 0.07980522], # random 1363 [0.72166437, 0.93165708], 1364 [0.47166437, 0.41313856], 1365 [0.97166437, 0.19091633], 1366 [0.01853937, 0.74647189]]) 1367 >>> engine.random(5) 1368 array([[0.51853937, 0.52424967], # random 1369 [0.26853937, 0.30202745], 1370 [0.76853937, 0.857583 ], 1371 [0.14353937, 0.63536078], 1372 [0.64353937, 0.01807683]]) 1373 1374Now we reset the sequence. Asking for 5 points leads to the same first 5 1375points: 1376 1377 >>> engine.reset() 1378 >>> engine.random(5) 1379 array([[0.22166437, 0.07980522], # random 1380 [0.72166437, 0.93165708], 1381 [0.47166437, 0.41313856], 1382 [0.97166437, 0.19091633], 1383 [0.01853937, 0.74647189]]) 1384 1385And here we advance the sequence to get the same second set of 5 points: 1386 1387 >>> engine.reset() 1388 >>> engine.fast_forward(5) 1389 >>> engine.random(5) 1390 array([[0.51853937, 0.52424967], # random 1391 [0.26853937, 0.30202745], 1392 [0.76853937, 0.857583 ], 1393 [0.14353937, 0.63536078], 1394 [0.64353937, 0.01807683]]) 1395 1396.. note:: By default, both :class:`~stats.qmc.Sobol` and 1397 :class:`~stats.qmc.Halton` are scrambled. 
The convergence properties are 1398 better, and it prevents the appearance of fringes or noticeable patterns 1399 of points in high dimensions. There should be no practical reason not to 1400 use the scrambled version. 1401 1402Making a QMC engine, i.e., subclassing ``QMCEngine`` 1403^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1404 1405To make your own :class:`~stats.qmc.QMCEngine`, a few methods have to be 1406defined. Following is an example wrapping `numpy.random.Generator`. 1407 1408 >>> import numpy as np 1409 >>> from scipy.stats import qmc 1410 >>> class RandomEngine(qmc.QMCEngine): 1411 ... def __init__(self, d, seed=None): 1412 ... super().__init__(d=d, seed=seed) 1413 ... self.rng = np.random.default_rng(self.rng_seed) 1414 ... 1415 ... 1416 ... def random(self, n=1): 1417 ... self.num_generated += n 1418 ... return self.rng.random((n, self.d)) 1419 ... 1420 ... 1421 ... def reset(self): 1422 ... self.rng = np.random.default_rng(self.rng_seed) 1423 ... self.num_generated = 0 1424 ... return self 1425 ... 1426 ... 1427 ... def fast_forward(self, n): 1428 ... self.random(n) 1429 ... return self 1430 1431Then we use it as any other QMC engine: 1432 1433 >>> engine = RandomEngine(2) 1434 >>> engine.random(5) 1435 array([[0.22733602, 0.31675834], # random 1436 [0.79736546, 0.67625467], 1437 [0.39110955, 0.33281393], 1438 [0.59830875, 0.18673419], 1439 [0.67275604, 0.94180287]]) 1440 >>> engine.reset() 1441 >>> engine.random(5) 1442 array([[0.22733602, 0.31675834], # random 1443 [0.79736546, 0.67625467], 1444 [0.39110955, 0.33281393], 1445 [0.59830875, 0.18673419], 1446 [0.67275604, 0.94180287]]) 1447 1448Guidelines on using QMC 1449^^^^^^^^^^^^^^^^^^^^^^^ 1450 1451* QMC has rules! Be sure to read the documentation or you might have no 1452 benefit over MC. 1453* Use :class:`~stats.qmc.Sobol` if you need **exactly** :math:`2^m` points. 1454* :class:`~stats.qmc.Halton` allows to sample, or skip, an arbitrary number of 1455 points. This is at the cost of a slower rate of convergence than *Sobol'*. 1456* Never remove the first points of the sequence. It will destroy the 1457 properties. 1458* Scrambling is always better. 1459* If you use LHS based methods, you cannot add points without losing the LHS 1460 properties. (There are some methods to do so, but this is not implemented.) 1461
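
As a brief illustration of the *Sobol'* and scrambling guidelines above, here
is a minimal sketch that draws exactly :math:`2^m` points from a scrambled
*Sobol'* sequence and then rescales them to arbitrary bounds (the point values
themselves are random):

    >>> from scipy.stats import qmc
    >>> sampler = qmc.Sobol(d=2, scramble=True)
    >>> sample = sampler.random_base2(m=7)  # exactly 2**7 = 128 points in [0, 1)
    >>> sample.shape
    (128, 2)
    >>> scaled = qmc.scale(sample, l_bounds=[0, -5], u_bounds=[10, 5])
    >>> scaled.shape
    (128, 2)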