1
2Statistics functions
3====================
4
5    This chapter describes the statistical functions in the library. The
6    basic statistical functions include routines to compute the mean,
7    variance and standard deviation. More advanced functions allow you
8    to calculate absolute deviations, skewness, and kurtosis as well as
9    the median and arbitrary percentiles.
10
11The algorithms provided here use recurrence relations to compute average
12quantities in a stable way, without large intermediate values that might
13overflow. All functions work on any Python sequence (of appropriate
14data-type), but see section [sec:stat:speed-considerations] for
15advantages and drawbacks of different kinds of input data.
16
17For details on the underlying implementation of these functions please
18consult the GNU Scientific Library reference manual.
19
20Organization of the module
21--------------------------
22
23Individual parts of the GSL functions names, providing artificial
24namespaces in C, are mapped to modules and submodules in PyGSL. That is,
25can be found as and as .
26
27The functions in the module are available in versions for datasets in
28the standard and NumPy floating-point and integer types. The generic
29versions available in the module are using the generic GSL versions. The
30submodules use GSL functions according to the submodule name, e.g. long
31for .
32
33Implemented submodules are , , , , , , and . The latter one also serves
34as default and is used whenever you don’t expclicitely state a different
35datatype. In most cases it is appropriate to simply use the default
36implementation as it covers the widest range of the real space, offers
37high precision, and as such is simple to use. If you have a sequence of
38all integer values it is straightforward to use functions as these use
39an implementation corresponding to Pythons -type. These implemented
40submodules represent all numeric datatypes available in Python (, )
41besides which has no representation in standard C, as well as all
42numeric datatypes available in NumPy that have corresponding
43implementations in GSL (on 32 bit systems these are: Character,
44UnsigendInt8, Int16, Int32, Int, Float32, Float).
45
46Available functions
47-------------------
48
49Mean, Standard Deviation, and Variance
50~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
51
52meanx Arithmetic mean (*sample mean*) of :
53
54.. math:: \hat\mu = \frac{1}{N} \sum x_i
55
56variancex Estimated (*sample*) variance of :
57
58.. math:: \hat\sigma^2 = \frac{1}{N-1} \sum (x_i - \hat\mu)^2
59
60 This function computes the mean via a call to . If you have already
61computed the mean then you can pass it directly to .
62
63variance\_mx, mean Estimated (*sample*) variance of relative to :
64
65.. math:: \hat\sigma^2 = \frac{1}{N-1} \sum (x_i - mean)^2
66
67sdx
68
69sd\_mx, mean The standard deviation is defined as the square root of the
70variance of . These functions returns the square root of the respective
71variance-functions above.
72
73variance\_with\_fixed\_meanx, mean Compute an unbiased estimate of the
74variance of when the population mean of the underlying distribution is
75known *a priori*. In this case the estimator for the variance uses the
76factor :math:`1/N` and the sample mean :math:`\hat\mu` is replaced by
77the known population mean :math:`\mu`:
78
79.. math:: \hat\sigma^2 = \frac{1}{N} \sum (x_i - \mu)^2
80
81Absolute deviation
82~~~~~~~~~~~~~~~~~~
83
84absdevdata Compute the absolute deviation from the mean of The absolute
85deviation from the mean is defined as
86
87.. math:: absdev  = (1/N) \sum |x_i - \hat\mu|
88
89 where :math:`x_i` are the elements of the dataset . The absolute
90deviation from the mean provides a more robust measure of the width of a
91distribution than the variance. This function computes the mean of via a
92call to .
93
94absdev\_mdata, mean Compute the absolute deviation of the dataset
95relative to the given value of
96
97.. math:: absdev  = (1/N) \sum |x_i - mean|
98
99 This function is useful if you have already computed the mean of (and
100want to avoid recomputing it), or wish to calculate the absolute
101deviation relative to another value (such as zero, or the median).
102
103Higher moments (skewness and kurtosis)
104~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
105
106skewdata Compute the skewness of . The skewness is defined as
107
108.. math:: skew = (1/N) \sum ((x_i - \hat\mu)/\hat\sigma)^3
109
110 where :math:`x_i` are the elements of the dataset . The skewness
111measures the asymmetry of the tails of a distribution.
112
113The function computes the mean and estimated standard deviation of via
114calls to and .
115
116skew\_m\_sddata, mean, sd Compute the skewness of the dataset using the
117given values of the mean and standard deviation varsd
118
119.. math:: skew = (1/N) \sum ((x_i - mean)/sd)^3
120
121 These functions are useful if you have already computed the mean and
122standard deviation of and want to avoid recomputing them.
123
124kurtosisdata Compute the kurtosis of . The kurtosis is defined as
125
126.. math:: kurtosis = ((1/N) \sum ((x_i - \hat\mu)/\hat\sigma)^4) - 3
127
128 The kurtosis measures how sharply peaked a distribution is, relative to
129its width. The kurtosis is normalized to zero for a gaussian
130distribution.
131
132kurtosis\_m\_sddata, mean, sd This function computes the kurtosis of the
133dataset using the given values of the mean and standard deviation
134
135.. math:: kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3
136
137 This function is useful if you have already computed the mean and
138standard deviation of and want to avoid recomputing them.
139
140Autocorrelation
141~~~~~~~~~~~~~~~
142
143lag1\_autocorrelationx Computes the lag-1 autocorrelation of the dataset
144
145.. math::
146
147   a_1 = \frac{\sum^{n}_{i = 1} (x_{i} - \hat\mu) (x_{i-1} - \hat\mu)}{
148            \sum^{n}_{i = 1} (x_{i} - \hat\mu) (x_{i} - \hat\mu)}
149
150lag1\_autocorrelation\_mx, mean Computes the lag-1 autocorrelation of
151the dataset using the given value of the mean .
152
153.. math::
154
155   a_1 = \frac{\sum_{i = 1}^{n} (x_{i} - \var{mean}) (x_{i-1} - \var{mean})}{
156            \sum^{n}_{i = 1} (x_{i} - \var{mean}) (x_{i} - \var{mean})}
157
158Covariance
159~~~~~~~~~~
160
161covariancex, y Computes the covariance of the datasets and which must be
162of same length.
163
164.. math:: c = \frac{1}{n-1} \sum^{n}_{i=1} (x_i - \hat x) (y_i - \hat y)
165
166lag1\_autocorrelation\_mx, y, mean\_x, mean\_y Computes the covariance
167of the datasets and using the given values of the means and . The
168datasets and must be of equal length.
169
170.. math::
171
172   c = \frac{1}{n-1} \sum^{n}_{i=1} (x_i - \var{mean\_x}) (y_i -
173         \var{mean\_y})
174
175Maximum and Minimum values
176~~~~~~~~~~~~~~~~~~~~~~~~~~
177
178maxdata This function returns the maximum value in . The maximum value
179is defined as the value of the element :math:`x_i` which satisfies
180:math:`x_i \ge x_j` for all :math:`j`.
181
182If you want instead to find the element with the largest absolute
183magnitude you will need to apply ‘fabs’ or ‘abs’ to your data before
184calling this function.
185
186mindata This function returns the minimum value in . The maximum value
187is defined as the value of the element :math:`x_i` which satisfies
188:math:`x_i \le x_j` for all :math:`j`.
189
190If you want instead to find the element with the smallest absolute
191magnitude you will need to apply ‘fabs’ or ‘abs’ to your data before
192calling this function.
193
194minmaxdata This function returns both the minimum and maximum values of
195, determined in a single pass.
196
197max\_indexdata This function returns the index of the maximum value in .
198The maximum value is defined as the value of the element :math:`x_i`
199which satisfies :math:`x_i \ge x_j` for all :math:`j`. When there are
200several equal maximum elements then the first one is chosen.
201
202min\_indexdata This function returns the index of the minimum value in .
203The minimum value is defined as the value of the element :math:`x_i`
204which satisfies :math:`x_i \le x_j` for all :math:`j`. When there are
205several equal minimum elements then the first one is chosen.
206
207minmax\_indexdata This function returns the indexes of the minimum and
208maximum values of , determined in a single pass.
209
210Median and Percentiles
211~~~~~~~~~~~~~~~~~~~~~~
212
213The median and percentile functions described in this section operate on
214sorted data. For convenience we use “quantiles”, measured on a scale of
2150 to 1, instead of percentiles (which use a scale of 0 to 100).
216
217median\_from\_sorted\_datadata This function returns the median value of
218. The elements of the array must be in ascending numerical order. There
219are no checks to see whether the data are sorted, so the function should
220always be used first.
221
222When the dataset has an odd number of elements the median is the value
223of element (n-1)/2. When the dataset has an even number of elements the
224median is the mean of the two nearest middle values, elements (n-1)/2
225and n/2. Since the algorithm for computing the median involves
226interpolation this function always returns a floating-point number, even
227for integer data types.
228
229quantile\_from\_sorted\_datadata, F This function returns a quantile
230value of . The elements of the array must be in ascending numerical
231order. The quantile is determined by the , a fraction between 0 and 1.
232For example, to compute the value of the 75th percentile should have the
233value 0.75.
234
235There are no checks to see whether the data are sorted, so the function
236should always be used first.
237
238The quantile is found by interpolation, using the formula
239
240.. math:: quantile = (1 - \delta) x_i + \delta x_{i+1}
241
242 where :math:`i` is :math:`floor((n - 1)f)` and :math:`\delta` is
243:math:`(n-1)f - i`.
244
245Thus the minimum value of the array () is given by equal to zero, the
246maximum value () is given by equal to one and the median value is given
247by equal to 0.5. Since the algorithm for computing quantiles involves
248interpolation this function always returns a floating-point number, even
249for integer data types.
250
251Weighted Samples
252~~~~~~~~~~~~~~~~
253
254The functions described in this section allow the computation of
255statistics for weighted samples. The functions accept an array of
256samples, :math:`x_i`, with associated weights, :math:`w_i`. Each sample
257:math:`x_i` is considered as having been drawn from a Gaussian
258distribution with variance :math:`\sigma_i^2`. The sample weight
259:math:`w_i` is defined as the reciprocal of this variance, :math:`w_i =
2601/\sigma_i^2`. Setting a weight to zero corresponds to removing a sample
261from a dataset.
262
263wmeanw, data This function returns the weighted mean of the dataset
264using the set of weights . The weighted mean is defined as
265
266.. math:: \hat\mu = (\sum w_i x_i) / (\sum w_i)
267
268wvariance w, data This function returns the estimated variance of the
269dataset , using the set of weights . The estimated variance of a
270weighted dataset is defined as
271
272.. math:: \hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2))) \sum w_i (x_i - \hat\mu)^2
273
274 Note that this expression reduces to an unweighted variance with the
275familiar :math:`1/(N-1)` factor when there are :math:`N` equal non-zero
276weights.
277
278wvariance\_mw, data, wmean This function returns the estimated variance
279of the weighted dataset using the given weighted mean .
280
281wsdw, data The standard deviation is defined as the square root of the
282variance. This function returns the square root of the corresponding
283variance function above.
284
285wsd\_mw, data, wmean This function returns the square root of the
286corresponding variance function above.
287
288wvariance\_with\_fixed\_meanw, data, mean This function computes an
289unbiased estimate of the variance of weighted dataset when the
290population mean of the underlying distribution is known \_a priori\_. In
291this case the estimator for the variance replaces the sample mean
292:math:`\hat\mu` by the known population mean :math:`\mu`,
293
294.. math:: \hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i)
295
296wsd\_with\_fixed\_meanw, data, mean The standard deviation is defined as
297the square root of the variance. This function returns the square root
298of the corresponding variance function above.
299
300wabsdevw, data This function computes the weighted absolute deviation
301from the weighted mean of . The absolute deviation from the mean is
302defined as
303
304.. math:: absdev = (\sum w_i |x_i - \hat\mu|) / (\sum w_i)
305
306wabsdev\_mw, data, wmean This function computes the absolute deviation
307of the weighted dataset DATA about the given weighted mean WMEAN.
308
309wskeww, data This function computes the weighted skewness of the dataset
310DATA.
311
312.. math:: skew = (\sum w_i ((x_i - xbar)/\sigma)^3) / (\sum w_i)
313
314wskew\_m\_sdw, data, mean, wsd This function computes the weighted
315skewness of the dataset using the given values of the weighted mean and
316weighted standard deviation, and .
317
318wkurtosisw, data This function computes the weighted kurtosis of the
319dataset . The kurtosis is defined as
320
321.. math:: kurtosis = ((\sum w_i ((x_i - xbar)/sigma)^4) / (\sum w_i)) - 3
322
323wkurtosis\_m\_sdw, data, mean, wsd This function computes the weighted
324kurtosis of the dataset using the given values of the weighted mean and
325weighted standard deviation, and .
326
327Further Reading
328---------------
329
330See the GSL reference manual for a description of all available
331functions and the calculations they perform.
332
333The standard reference for almost any topic in statistics is the
334multi-volume *Advanced Theory of Statistics* by Kendall and Stuart. Many
335statistical concepts can be more easily understood by a Bayesian
336approach. The book by Gelman, Carlin, Stern and Rubin gives a
337comprehensive coverage of the subject. For physicists the Particle Data
338Group provides useful reviews of Probability and Statistics in the
339“Mathematical Tools” section of its Annual Review of Particle Physics.
340
341Modules in Testing
342==================
343
344Modules in this package are often reimplementations of an original
345package with significant change to the original. The current rng
346implementation, for example, started its life here. The sf module
347implemented here will supersede the sf package in one of the next
348releases. Concerning the other modules the usage is encouraged for tests
349to see if they work, but use them with caution in your production code!
350