# NAME

Statistics::Descriptive::Discrete - Compute descriptive statistics for discrete data sets.

To install, use the CPAN module (https://metacpan.org/pod/Statistics::Descriptive::Discrete).

# SYNOPSIS

```perl
    use Statistics::Descriptive::Discrete;

    my $stats = Statistics::Descriptive::Discrete->new();
    $stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
    print "count = ",$stats->count(),"\n";
    # uniq() returns a list in list context, so force scalar context for the count
    print "uniq  = ",scalar($stats->uniq()),"\n";
    print "sum = ",$stats->sum(),"\n";
    print "min = ",$stats->min(),"\n";
    print "min index = ",$stats->mindex(),"\n";
    print "max = ",$stats->max(),"\n";
    print "max index = ",$stats->maxdex(),"\n";
    print "mean = ",$stats->mean(),"\n";
    print "geometric mean = ",$stats->geometric_mean(),"\n";
    print "harmonic mean = ", $stats->harmonic_mean(),"\n";
    print "standard_deviation = ",$stats->standard_deviation(),"\n";
    print "variance = ",$stats->variance(),"\n";
    print "sample_range = ",$stats->sample_range(),"\n";
    print "mode = ",$stats->mode(),"\n";
    print "median = ",$stats->median(),"\n";
    my $f = $stats->frequency_distribution_ref(3);
    for (sort {$a <=> $b} keys %$f) {
      print "key = $_, count = $f->{$_}\n";
    }
```

# DESCRIPTION

This module provides basic functions used in descriptive statistics.
It borrows very heavily from Statistics::Descriptive::Full
(which is included with Statistics::Descriptive), with one major
difference: this module is optimized for discretized data,
such as data from an A/D conversion, which has a discrete set of possible values.
For example, if your data is produced by an 8-bit A/D converter, you have only 256 possible
values in your data set.  Even though you might have a million data points,
you would have at most 256 different values in those million points.  Instead of storing the
entire data set as Statistics::Descriptive does, this module stores only
the values seen and the number of times each value occurs.

For very large data sets, this storage method results in significant speed
and memory improvements.  For example, for an 8-bit data set (256 possible values)
with 1,000,000 data points, this module is about 10x faster than Statistics::Descriptive::Full
or Statistics::Descriptive::Sparse.

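The value-and-count storage described above can be illustrated with a minimal sketch in plain Perl. This is not the module's actual implementation; the helper names `tally` and `count_and_sum` are hypothetical and exist only for this example. It shows that statistics such as the count and the sum can be recovered from the tallies alone, without keeping every data point.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): record how often each
# distinct value occurs instead of storing every data point.
sub tally {
    my @data = @_;
    my %count_of;
    $count_of{$_}++ for @data;
    return \%count_of;
}

# Hypothetical helper: recover the element count and the sum from the tallies.
sub count_and_sum {
    my ($count_of) = @_;
    my ($n, $sum) = (0, 0);
    for my $value (keys %$count_of) {
        $n   += $count_of->{$value};
        $sum += $value * $count_of->{$value};
    }
    return ($n, $sum);
}

my $tallies = tally(1, 10, 2, 1, 1, 4, 5, 1, 10, 8, 7);
my ($n, $sum) = count_and_sum($tallies);
printf "distinct = %d, count = %d, sum = %d\n",
    scalar(keys %$tallies), $n, $sum;
# prints: distinct = 7, count = 11, sum = 50
```
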
The run time of Statistics::Descriptive depends on the size of the data set. In particular,
repeated calls to `add_data` are slow.  Statistics::Descriptive::Discrete's `add_data` is
optimized for speed.  For a given number of data points, this module's run time increases
as the number of unique data values in the data set increases. For example, while this module
runs at about 10x the speed of Statistics::Descriptive::Full for an 8-bit data set, the
speedup drops to about 3x for an equivalently sized 20-bit data set.

See sdd\_prof.pl in the examples directory to experiment with profiling this module against
Statistics::Descriptive::Full.

# METHODS

- $stat = Statistics::Descriptive::Discrete->new();

    Create a new statistics object.

- $stat->add\_data(1,2,3,4,5);

    Adds data to the statistics object.  Sets a flag so that
    the statistics will be recomputed the next time they're
    needed.

- $stat->add\_data\_tuple(1,2,42,3);

    Adds data to the statistics object, where each pair of elements
    is a value followed by a count (how many times the value occurred).
    The above is equivalent to `$stat->add_data(1,1,42,42,42);`.
    Use this when your data is already in the form of
    ($value, $occurrence) pairs.

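    The equivalence above can be sketched in plain Perl. The helper `expand_tuples` is hypothetical (it is not part of the module) and simply expands (value, count) pairs into the flat list that `add_data` would accept:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper, not part of the module: expand (value, count)
    # pairs into the equivalent flat list of values.
    sub expand_tuples {
        my @pairs = @_;
        die "expected (value, count) pairs\n" if @pairs % 2;
        my @flat;
        while (my ($value, $count) = splice(@pairs, 0, 2)) {
            push @flat, ($value) x $count;   # repeat the value $count times
        }
        return @flat;
    }

    print join(",", expand_tuples(1, 2, 42, 3)), "\n";   # prints 1,1,42,42,42
    ```
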
- $stat->max();

    Returns the maximum value of the data set.

- $stat->min();

    Returns the minimum value of the data set.

- $stat->mindex();

    Returns the index of the minimum value of the data set.
    The index returned is the first occurrence of the minimum value.

    Note: the index is determined by the order data was added using
    `add_data()` or `add_data_tuple()`. It is meaningless in the context of
    `get_data()`, as `get_data()` does not return values in the same
    order in which they were added.  This behavior is different from
    Statistics::Descriptive, which does preserve order.

- $stat->maxdex();

    Returns the index of the maximum value of the data set.
    The index returned is the first occurrence of the maximum value.

    Note: the index is determined by the order data was added using
    `add_data()` or `add_data_tuple()`. It is meaningless in the context of
    `get_data()`, as `get_data()` does not return values in the same
    order in which they were added.  This behavior is different from
    Statistics::Descriptive, which does preserve order.

- $stat->count();

    Returns the total number of elements in the data set.

- $stat->uniq();

    If called in scalar context, returns the total number of unique elements in the data set.
    For example, if your data set is (1,2,2,3,3,3), uniq will return 3.

    If called in list context, returns a list of the distinct data values in sorted order.
    In the above example, `@uniq = $stats->uniq();` would return (1,2,3).

    This function is specific to Statistics::Descriptive::Discrete
    and is not implemented in Statistics::Descriptive.

    It is useful for getting a frequency distribution for each discrete value in the data set:
    ```perl
        my $stats = Statistics::Descriptive::Discrete->new();
        $stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
        my @bins = $stats->uniq();
        my $f = $stats->frequency_distribution_ref(\@bins);
        for (sort {$a <=> $b} keys %$f) {
            print "value = $_, count = $f->{$_}\n";
        }
    ```

- $stat->sum();

    Returns the sum of all the values in the data set.

- $stat->mean();

    Returns the mean of the data.

- $stat->harmonic\_mean();

    Returns the harmonic mean of the data.  Because the harmonic mean is undefined
    if any of the data are zero or if the sum of the reciprocals is zero,
    the method returns `undef` in both of those cases.

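    As a rough illustration of these rules, the following plain-Perl sketch computes the harmonic mean from a hash of value => count tallies. The function name `harmonic_mean_of` and the tally layout are assumptions made for this example, not the module's internals:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sketch: harmonic mean from { value => count } tallies.
    # Returns undef when the harmonic mean is undefined.
    sub harmonic_mean_of {
        my ($count_of) = @_;
        my ($n, $reciprocal_sum) = (0, 0);
        for my $value (keys %$count_of) {
            return undef if $value == 0;          # 1/0 is undefined
            $n              += $count_of->{$value};
            $reciprocal_sum += $count_of->{$value} / $value;
        }
        return undef if $reciprocal_sum == 0;     # also covers empty input
        return $n / $reciprocal_sum;
    }

    # Data set (1, 2, 4, 4): 4 / (1/1 + 1/2 + 1/4 + 1/4) = 2
    my $hm = harmonic_mean_of({ 1 => 1, 2 => 1, 4 => 2 });
    print "$hm\n";   # prints 2
    ```
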
- $stat->geometric\_mean();

    Returns the geometric mean of the data.  Returns `undef` if any of the data
    are less than 0. Returns 0 if any of the data are 0.

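    These return rules can be sketched in plain Perl over a hash of value => count tallies. The function `geometric_mean_of` is hypothetical, and working in log space is a choice made for this example rather than a description of the module's code. The sketch also assumes that a negative value takes precedence over a zero, and that an empty data set yields `undef`:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sketch: geometric mean from { value => count } tallies,
    # computed in log space so that large products do not overflow.
    sub geometric_mean_of {
        my ($count_of) = @_;
        my @values = keys %$count_of;
        return undef if !@values;                   # assumption: no data gives undef
        return undef if grep { $_ < 0 } @values;    # negative data: undefined
        return 0     if grep { $_ == 0 } @values;   # any zero makes the product 0
        my ($n, $log_sum) = (0, 0);
        for my $value (@values) {
            $n       += $count_of->{$value};
            $log_sum += $count_of->{$value} * log($value);
        }
        return exp($log_sum / $n);
    }

    # Data set (1, 2, 4): cube root of 1 * 2 * 4 = 8
    printf "%.4f\n", geometric_mean_of({ 1 => 1, 2 => 1, 4 => 1 });   # prints 2.0000
    ```
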
- $stat->median();

    Returns the median value of the data.

- $stat->mode();

    Returns the mode of the data.

- $stat->variance();

    Returns the variance of the data.

- $stat->standard\_deviation();

    Returns the standard deviation of the data.

- $stat->sample\_range();

    Returns the sample range (max - min) of the data set.

- $stat->frequency\_distribution\_ref($num\_partitions);
- $stat->frequency\_distribution\_ref(\\@bins);
- $stat->frequency\_distribution\_ref();

    `frequency_distribution_ref($num_partitions)` slices the data into
    `$num_partitions` sets (where `$num_partitions` is greater than 1) and counts
    the number of items that fall into each partition. It returns a reference to a
    hash where the keys are the numerical values of the partitions used. The
    minimum value of the data set is not a key and the maximum value of the data
    set is always a key. The number of entries for a particular partition key is
    the number of items which are greater than the previous partition key and less
    than or equal to the current partition key. As an example,
    ```perl
        $stat->add_data(1,1.5,2,2.5,3,3.5,4);
        $f = $stat->frequency_distribution_ref(2);
        for (sort {$a <=> $b} keys %$f) {
           print "key = $_, count = $f->{$_}\n";
        }
    ```
    prints

        key = 2.5, count = 4
        key = 4, count = 3

    since there are four items less than or equal to 2.5, and three items
    greater than 2.5 and less than or equal to 4.

    `frequency_distribution_ref(\@bins)` provides the bins that are to be used
    for the distribution.  This allows for non-uniform distributions as
    well as trimmed or sample distributions to be found.  `@bins` must
    be monotonic and must contain at least one element.  Note that unless the
    set of bins contains the full range of the data, the total counts returned will
    be less than the sample size.

    Calling `frequency_distribution_ref()` with no arguments returns the last
    distribution calculated, if such exists.

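    The bin-counting rule described above (greater than the previous key, less than or equal to the current key) can be sketched in plain Perl. Both helpers are hypothetical and are not the module's code. `partition_edges` assumes evenly spaced edges, which matches the example output above but is not stated in this documentation, and `count_into_bins` assumes the bins are sorted in ascending order:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: upper edges for $num_partitions bins, assuming
    # evenly spaced edges between the minimum and the maximum.
    sub partition_edges {
        my ($min, $max, $num_partitions) = @_;
        my $width = ($max - $min) / $num_partitions;
        my @edges = map { $min + $width * $_ } 1 .. $num_partitions - 1;
        return (@edges, $max);    # the maximum is always the last key
    }

    # Hypothetical helper: count items per bin. Each item is counted under
    # the first edge that is >= the item; items above the last edge are
    # not counted at all.
    sub count_into_bins {
        my ($data, $edges) = @_;
        my %freq = map { $_ => 0 } @$edges;
        for my $item (@$data) {
            for my $edge (@$edges) {
                next if $item > $edge;
                $freq{$edge}++;
                last;
            }
        }
        return \%freq;
    }

    my @data  = (1, 1.5, 2, 2.5, 3, 3.5, 4);
    my @edges = partition_edges(1, 4, 2);    # (2.5, 4)
    my $freq  = count_into_bins(\@data, \@edges);
    print "key = $_, count = $freq->{$_}\n" for sort { $a <=> $b } keys %$freq;
    # prints:
    #   key = 2.5, count = 4
    #   key = 4, count = 3
    ```

    With bins that do not cover the full range, such as `[2, 3]` for the data `(1, 2, 3, 10)`, the value 10 is dropped and the counts sum to 3 rather than 4.
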
- my %hash = $stat->frequency\_distribution($partitions);
- my %hash = $stat->frequency\_distribution(\\@bins);
- my %hash = $stat->frequency\_distribution();

    Same as `frequency_distribution_ref()` except that it returns the hash
    flattened into the return list. Kept for compatibility with previous
    versions of Statistics::Descriptive::Discrete; using it is discouraged.

    Note: in earlier versions of Statistics::Descriptive::Discrete, `frequency_distribution()`
    behaved differently from the Statistics::Descriptive implementation.  Any code that uses
    this function should be carefully checked to ensure compatibility with the current
    implementation.

- $stat->get\_data();

    Returns a copy of the data array.  Note: This array could be
    very large and would thus defeat the purpose of using this
    module.  Make sure you really need it before using `get_data()`.

    The returned array contains the values sorted by value.  It does
    not preserve the order in which the values were added.  Preserving
    order would defeat the purpose of this module, which trades
    order preservation for speed and lower memory usage.  If order is important,
    use Statistics::Descriptive.

- $stat->clear();

    Clears all data and resets the instance as if it were newly created.

    Effectively the same as

    ```perl
        my $class = ref($stat);
        undef $stat;
        $stat = $class->new();
    ```

# NOTE

The interface for this module strives to be identical to Statistics::Descriptive.
Any differences are noted in the description for each method.

# BUGS

- Code for calculating mode is not as robust as it should be.

# TODO

- Add rest of methods (at least ones that don't depend on original order of data)
from Statistics::Descriptive

# AUTHOR

Rhet Turnbull, rturnbull+cpan@gmail.com

# CREDIT

Thanks to the following individuals for finding bugs, providing feedback,
and submitting changes:

- Peter Dienes for finding and fixing a bug in the variance calculation.
- Bill Dueber for suggesting the add\_data\_tuple method.

# COPYRIGHT

    Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved.  This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Portions of this code is from Statistics::Descriptive which is under
    the following copyrights:

    Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    Copyright (c) 1994,1995 Jason Kastner. All rights
    reserved.  This program is free software; you can redistribute it
    and/or modify it under the same terms as Perl itself.

# SEE ALSO

Statistics::Descriptive

Statistics::Discrete
