1 2Statistics functions 3==================== 4 5 This chapter describes the statistical functions in the library. The 6 basic statistical functions include routines to compute the mean, 7 variance and standard deviation. More advanced functions allow you 8 to calculate absolute deviations, skewness, and kurtosis as well as 9 the median and arbitrary percentiles. 10 11The algorithms provided here use recurrence relations to compute average 12quantities in a stable way, without large intermediate values that might 13overflow. All functions work on any Python sequence (of appropriate 14data-type), but see section [sec:stat:speed-considerations] for 15advantages and drawbacks of different kinds of input data. 16 17For details on the underlying implementation of these functions please 18consult the GNU Scientific Library reference manual. 19 20Organization of the module 21-------------------------- 22 23Individual parts of the GSL functions names, providing artificial 24namespaces in C, are mapped to modules and submodules in PyGSL. That is, 25can be found as and as . 26 27The functions in the module are available in versions for datasets in 28the standard and NumPy floating-point and integer types. The generic 29versions available in the module are using the generic GSL versions. The 30submodules use GSL functions according to the submodule name, e.g. long 31for . 32 33Implemented submodules are , , , , , , and . The latter one also serves 34as default and is used whenever you don’t expclicitely state a different 35datatype. In most cases it is appropriate to simply use the default 36implementation as it covers the widest range of the real space, offers 37high precision, and as such is simple to use. If you have a sequence of 38all integer values it is straightforward to use functions as these use 39an implementation corresponding to Pythons -type. These implemented 40submodules represent all numeric datatypes available in Python (, ) 41besides which has no representation in standard C, as well as all 42numeric datatypes available in NumPy that have corresponding 43implementations in GSL (on 32 bit systems these are: Character, 44UnsigendInt8, Int16, Int32, Int, Float32, Float). 45 46Available functions 47------------------- 48 49Mean, Standard Deviation, and Variance 50~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 51 52meanx Arithmetic mean (*sample mean*) of : 53 54.. math:: \hat\mu = \frac{1}{N} \sum x_i 55 56variancex Estimated (*sample*) variance of : 57 58.. math:: \hat\sigma^2 = \frac{1}{N-1} \sum (x_i - \hat\mu)^2 59 60 This function computes the mean via a call to . If you have already 61computed the mean then you can pass it directly to . 62 63variance\_mx, mean Estimated (*sample*) variance of relative to : 64 65.. math:: \hat\sigma^2 = \frac{1}{N-1} \sum (x_i - mean)^2 66 67sdx 68 69sd\_mx, mean The standard deviation is defined as the square root of the 70variance of . These functions returns the square root of the respective 71variance-functions above. 72 73variance\_with\_fixed\_meanx, mean Compute an unbiased estimate of the 74variance of when the population mean of the underlying distribution is 75known *a priori*. In this case the estimator for the variance uses the 76factor :math:`1/N` and the sample mean :math:`\hat\mu` is replaced by 77the known population mean :math:`\mu`: 78 79.. math:: \hat\sigma^2 = \frac{1}{N} \sum (x_i - \mu)^2 80 81Absolute deviation 82~~~~~~~~~~~~~~~~~~ 83 84absdevdata Compute the absolute deviation from the mean of The absolute 85deviation from the mean is defined as 86 87.. math:: absdev = (1/N) \sum |x_i - \hat\mu| 88 89 where :math:`x_i` are the elements of the dataset . The absolute 90deviation from the mean provides a more robust measure of the width of a 91distribution than the variance. This function computes the mean of via a 92call to . 93 94absdev\_mdata, mean Compute the absolute deviation of the dataset 95relative to the given value of 96 97.. math:: absdev = (1/N) \sum |x_i - mean| 98 99 This function is useful if you have already computed the mean of (and 100want to avoid recomputing it), or wish to calculate the absolute 101deviation relative to another value (such as zero, or the median). 102 103Higher moments (skewness and kurtosis) 104~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 105 106skewdata Compute the skewness of . The skewness is defined as 107 108.. math:: skew = (1/N) \sum ((x_i - \hat\mu)/\hat\sigma)^3 109 110 where :math:`x_i` are the elements of the dataset . The skewness 111measures the asymmetry of the tails of a distribution. 112 113The function computes the mean and estimated standard deviation of via 114calls to and . 115 116skew\_m\_sddata, mean, sd Compute the skewness of the dataset using the 117given values of the mean and standard deviation varsd 118 119.. math:: skew = (1/N) \sum ((x_i - mean)/sd)^3 120 121 These functions are useful if you have already computed the mean and 122standard deviation of and want to avoid recomputing them. 123 124kurtosisdata Compute the kurtosis of . The kurtosis is defined as 125 126.. math:: kurtosis = ((1/N) \sum ((x_i - \hat\mu)/\hat\sigma)^4) - 3 127 128 The kurtosis measures how sharply peaked a distribution is, relative to 129its width. The kurtosis is normalized to zero for a gaussian 130distribution. 131 132kurtosis\_m\_sddata, mean, sd This function computes the kurtosis of the 133dataset using the given values of the mean and standard deviation 134 135.. math:: kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3 136 137 This function is useful if you have already computed the mean and 138standard deviation of and want to avoid recomputing them. 139 140Autocorrelation 141~~~~~~~~~~~~~~~ 142 143lag1\_autocorrelationx Computes the lag-1 autocorrelation of the dataset 144 145.. math:: 146 147 a_1 = \frac{\sum^{n}_{i = 1} (x_{i} - \hat\mu) (x_{i-1} - \hat\mu)}{ 148 \sum^{n}_{i = 1} (x_{i} - \hat\mu) (x_{i} - \hat\mu)} 149 150lag1\_autocorrelation\_mx, mean Computes the lag-1 autocorrelation of 151the dataset using the given value of the mean . 152 153.. math:: 154 155 a_1 = \frac{\sum_{i = 1}^{n} (x_{i} - \var{mean}) (x_{i-1} - \var{mean})}{ 156 \sum^{n}_{i = 1} (x_{i} - \var{mean}) (x_{i} - \var{mean})} 157 158Covariance 159~~~~~~~~~~ 160 161covariancex, y Computes the covariance of the datasets and which must be 162of same length. 163 164.. math:: c = \frac{1}{n-1} \sum^{n}_{i=1} (x_i - \hat x) (y_i - \hat y) 165 166lag1\_autocorrelation\_mx, y, mean\_x, mean\_y Computes the covariance 167of the datasets and using the given values of the means and . The 168datasets and must be of equal length. 169 170.. math:: 171 172 c = \frac{1}{n-1} \sum^{n}_{i=1} (x_i - \var{mean\_x}) (y_i - 173 \var{mean\_y}) 174 175Maximum and Minimum values 176~~~~~~~~~~~~~~~~~~~~~~~~~~ 177 178maxdata This function returns the maximum value in . The maximum value 179is defined as the value of the element :math:`x_i` which satisfies 180:math:`x_i \ge x_j` for all :math:`j`. 181 182If you want instead to find the element with the largest absolute 183magnitude you will need to apply ‘fabs’ or ‘abs’ to your data before 184calling this function. 185 186mindata This function returns the minimum value in . The maximum value 187is defined as the value of the element :math:`x_i` which satisfies 188:math:`x_i \le x_j` for all :math:`j`. 189 190If you want instead to find the element with the smallest absolute 191magnitude you will need to apply ‘fabs’ or ‘abs’ to your data before 192calling this function. 193 194minmaxdata This function returns both the minimum and maximum values of 195, determined in a single pass. 196 197max\_indexdata This function returns the index of the maximum value in . 198The maximum value is defined as the value of the element :math:`x_i` 199which satisfies :math:`x_i \ge x_j` for all :math:`j`. When there are 200several equal maximum elements then the first one is chosen. 201 202min\_indexdata This function returns the index of the minimum value in . 203The minimum value is defined as the value of the element :math:`x_i` 204which satisfies :math:`x_i \le x_j` for all :math:`j`. When there are 205several equal minimum elements then the first one is chosen. 206 207minmax\_indexdata This function returns the indexes of the minimum and 208maximum values of , determined in a single pass. 209 210Median and Percentiles 211~~~~~~~~~~~~~~~~~~~~~~ 212 213The median and percentile functions described in this section operate on 214sorted data. For convenience we use “quantiles”, measured on a scale of 2150 to 1, instead of percentiles (which use a scale of 0 to 100). 216 217median\_from\_sorted\_datadata This function returns the median value of 218. The elements of the array must be in ascending numerical order. There 219are no checks to see whether the data are sorted, so the function should 220always be used first. 221 222When the dataset has an odd number of elements the median is the value 223of element (n-1)/2. When the dataset has an even number of elements the 224median is the mean of the two nearest middle values, elements (n-1)/2 225and n/2. Since the algorithm for computing the median involves 226interpolation this function always returns a floating-point number, even 227for integer data types. 228 229quantile\_from\_sorted\_datadata, F This function returns a quantile 230value of . The elements of the array must be in ascending numerical 231order. The quantile is determined by the , a fraction between 0 and 1. 232For example, to compute the value of the 75th percentile should have the 233value 0.75. 234 235There are no checks to see whether the data are sorted, so the function 236should always be used first. 237 238The quantile is found by interpolation, using the formula 239 240.. math:: quantile = (1 - \delta) x_i + \delta x_{i+1} 241 242 where :math:`i` is :math:`floor((n - 1)f)` and :math:`\delta` is 243:math:`(n-1)f - i`. 244 245Thus the minimum value of the array () is given by equal to zero, the 246maximum value () is given by equal to one and the median value is given 247by equal to 0.5. Since the algorithm for computing quantiles involves 248interpolation this function always returns a floating-point number, even 249for integer data types. 250 251Weighted Samples 252~~~~~~~~~~~~~~~~ 253 254The functions described in this section allow the computation of 255statistics for weighted samples. The functions accept an array of 256samples, :math:`x_i`, with associated weights, :math:`w_i`. Each sample 257:math:`x_i` is considered as having been drawn from a Gaussian 258distribution with variance :math:`\sigma_i^2`. The sample weight 259:math:`w_i` is defined as the reciprocal of this variance, :math:`w_i = 2601/\sigma_i^2`. Setting a weight to zero corresponds to removing a sample 261from a dataset. 262 263wmeanw, data This function returns the weighted mean of the dataset 264using the set of weights . The weighted mean is defined as 265 266.. math:: \hat\mu = (\sum w_i x_i) / (\sum w_i) 267 268wvariance w, data This function returns the estimated variance of the 269dataset , using the set of weights . The estimated variance of a 270weighted dataset is defined as 271 272.. math:: \hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2))) \sum w_i (x_i - \hat\mu)^2 273 274 Note that this expression reduces to an unweighted variance with the 275familiar :math:`1/(N-1)` factor when there are :math:`N` equal non-zero 276weights. 277 278wvariance\_mw, data, wmean This function returns the estimated variance 279of the weighted dataset using the given weighted mean . 280 281wsdw, data The standard deviation is defined as the square root of the 282variance. This function returns the square root of the corresponding 283variance function above. 284 285wsd\_mw, data, wmean This function returns the square root of the 286corresponding variance function above. 287 288wvariance\_with\_fixed\_meanw, data, mean This function computes an 289unbiased estimate of the variance of weighted dataset when the 290population mean of the underlying distribution is known \_a priori\_. In 291this case the estimator for the variance replaces the sample mean 292:math:`\hat\mu` by the known population mean :math:`\mu`, 293 294.. math:: \hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i) 295 296wsd\_with\_fixed\_meanw, data, mean The standard deviation is defined as 297the square root of the variance. This function returns the square root 298of the corresponding variance function above. 299 300wabsdevw, data This function computes the weighted absolute deviation 301from the weighted mean of . The absolute deviation from the mean is 302defined as 303 304.. math:: absdev = (\sum w_i |x_i - \hat\mu|) / (\sum w_i) 305 306wabsdev\_mw, data, wmean This function computes the absolute deviation 307of the weighted dataset DATA about the given weighted mean WMEAN. 308 309wskeww, data This function computes the weighted skewness of the dataset 310DATA. 311 312.. math:: skew = (\sum w_i ((x_i - xbar)/\sigma)^3) / (\sum w_i) 313 314wskew\_m\_sdw, data, mean, wsd This function computes the weighted 315skewness of the dataset using the given values of the weighted mean and 316weighted standard deviation, and . 317 318wkurtosisw, data This function computes the weighted kurtosis of the 319dataset . The kurtosis is defined as 320 321.. math:: kurtosis = ((\sum w_i ((x_i - xbar)/sigma)^4) / (\sum w_i)) - 3 322 323wkurtosis\_m\_sdw, data, mean, wsd This function computes the weighted 324kurtosis of the dataset using the given values of the weighted mean and 325weighted standard deviation, and . 326 327Further Reading 328--------------- 329 330See the GSL reference manual for a description of all available 331functions and the calculations they perform. 332 333The standard reference for almost any topic in statistics is the 334multi-volume *Advanced Theory of Statistics* by Kendall and Stuart. Many 335statistical concepts can be more easily understood by a Bayesian 336approach. The book by Gelman, Carlin, Stern and Rubin gives a 337comprehensive coverage of the subject. For physicists the Particle Data 338Group provides useful reviews of Probability and Statistics in the 339“Mathematical Tools” section of its Annual Review of Particle Physics. 340 341Modules in Testing 342================== 343 344Modules in this package are often reimplementations of an original 345package with significant change to the original. The current rng 346implementation, for example, started its life here. The sf module 347implemented here will supersede the sf package in one of the next 348releases. Concerning the other modules the usage is encouraged for tests 349to see if they work, but use them with caution in your production code! 350