NAME
    Statistics::Descriptive::Discrete - Compute descriptive statistics for
    discrete data sets.

    To install, use the CPAN module
    (https://metacpan.org/pod/Statistics::Descriptive::Discrete).

SYNOPSIS
      use Statistics::Descriptive::Discrete;

      my $stats = Statistics::Descriptive::Discrete->new();
      $stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
      print "count = ",$stats->count(),"\n";
      print "uniq  = ",$stats->uniq(),"\n";
      print "sum = ",$stats->sum(),"\n";
      print "min = ",$stats->min(),"\n";
      print "min index = ",$stats->mindex(),"\n";
      print "max = ",$stats->max(),"\n";
      print "max index = ",$stats->maxdex(),"\n";
      print "mean = ",$stats->mean(),"\n";
      print "geometric mean = ",$stats->geometric_mean(),"\n";
      print "harmonic mean = ", $stats->harmonic_mean(),"\n";
      print "standard_deviation = ",$stats->standard_deviation(),"\n";
      print "variance = ",$stats->variance(),"\n";
      print "sample_range = ",$stats->sample_range(),"\n";
      print "mode = ",$stats->mode(),"\n";
      print "median = ",$stats->median(),"\n";
      my $f = $stats->frequency_distribution_ref(3);
      for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
      }

DESCRIPTION
    This module provides basic functions used in descriptive statistics. It
    borrows very heavily from Statistics::Descriptive::Full (which is
    included with Statistics::Descriptive) with one major difference: this
    module is optimized for discretized data, e.g. data from an A/D
    conversion that has a discrete set of possible values. For example, if
    your data is produced by an 8-bit A/D converter, you'd have only 256
    possible values in your data set. Even though you might have a million
    data points, you'd only have 256 different values in those million
    points. Instead of storing the entire data set as
    Statistics::Descriptive does, this module only stores the values seen
    and the number of times each value occurs.

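    The storage idea described above can be sketched in plain Perl. This is
    only an illustration, not the module's actual code: the helper name
    `tally' and the hash layout are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): keep a value => count
# hash instead of storing every data point.
sub tally {
    my @values = @_;
    my %count_of;
    $count_of{$_}++ for @values;
    return \%count_of;
}

my @data   = (1, 10, 2, 1, 1, 4, 5, 1, 10, 8, 7);
my $counts = tally(@data);

# Memory use depends on the number of distinct values, not on the
# number of data points.
printf "%d data points, %d distinct values\n",
    scalar @data, scalar keys %$counts;
```

    With an 8-bit source such a hash never holds more than 256 keys, no
    matter how many points are added.
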
    For very large data sets, this storage method results in significant
    speed and memory improvements. For example, for an 8-bit data set (256
    possible values), with 1,000,000 data points, this module is about 10x
    faster than Statistics::Descriptive::Full or
    Statistics::Descriptive::Sparse.

    The run time of Statistics::Descriptive is a function of the size of
    the data set. In particular, repeated calls to `add_data' are slow.
    Statistics::Descriptive::Discrete's `add_data' is optimized for speed.
    For a given number of data points, this module's run time will increase
    as the number of unique data values in the data set increases. For
    example, while this module runs at about 10x the speed of
    Statistics::Descriptive::Full for an 8-bit data set, the speedup drops
    to about 3x for an equivalent sized 20-bit data set.

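    One plausible reason the cost follows the number of unique values is
    that statistics can be computed by looping over the distinct values
    only. The sketch below shows that idea; the helper `summarize_counts'
    is hypothetical and is not claimed to be how the module does it.

```perl
use strict;
use warnings;

# Hypothetical helper: count, sum and mean from a value => count hash.
# The loop runs once per distinct value, not once per data point.
sub summarize_counts {
    my ($count_of) = @_;
    my ($n, $sum) = (0, 0);
    for my $value (keys %$count_of) {
        $n   += $count_of->{$value};
        $sum += $value * $count_of->{$value};
    }
    return { count => $n, sum => $sum, mean => $n ? $sum / $n : undef };
}

# Tally of the SYNOPSIS data: 1,10,2,1,1,4,5,1,10,8,7
my %count_of = (1 => 4, 2 => 1, 4 => 1, 5 => 1, 7 => 1, 8 => 1, 10 => 2);
my $summary  = summarize_counts(\%count_of);
printf "count = %d, sum = %d, mean = %.4f\n",
    @{$summary}{qw(count sum mean)};
```
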
    See sdd_prof.pl in the examples directory to play with profiling this
    module against Statistics::Descriptive::Full.

METHODS
    $stat = Statistics::Descriptive::Discrete->new();
        Create a new statistics object.

    $stat->add_data(1,2,3,4,5);
        Adds data to the statistics object. Sets a flag so that the
        statistics will be recomputed the next time they're needed.

    $stat->add_data_tuple(1,2,42,3);
        Adds data to the statistics object where every two elements are a
        value and a count (how many times the value occurred). The above is
        equivalent to `$stat->add_data(1,1,42,42,42);'. Use this when your
        data is in a form isomorphic to ($value, $occurrence).

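        The (value, count) convention can be illustrated without the
        module. The helper `tally_tuples' below is a hypothetical name used
        only for this sketch; it folds pairs into a value => count hash and
        shows why `(1,2,42,3)' describes the same data as `(1,1,42,42,42)'.

```perl
use strict;
use warnings;

# Hypothetical helper: fold (value, count) pairs into a value => count hash.
sub tally_tuples {
    my @pairs = @_;
    die "expected (value, count) pairs\n" if @pairs % 2;
    my %count_of;
    while (my ($value, $count) = splice(@pairs, 0, 2)) {
        $count_of{$value} += $count;
    }
    return \%count_of;
}

my $from_tuples = tally_tuples(1, 2, 42, 3);

# The same tally built one data point at a time.
my %from_points;
$from_points{$_}++ for (1, 1, 42, 42, 42);

for my $value (sort { $a <=> $b } keys %$from_tuples) {
    print "value = $value, tuples = $from_tuples->{$value}, ",
          "points = $from_points{$value}\n";
}
```
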
    $stat->max();
        Returns the maximum value of the data set.

    $stat->min();
        Returns the minimum value of the data set.

    $stat->mindex();
        Returns the index of the minimum value of the data set. The index
        returned is the first occurrence of the minimum value.

        Note: the index is determined by the order data was added using
        `add_data()' or `add_data_tuple()'. It is meaningless in context of
        `get_data()' as `get_data()' does not return values in the same
        order in which they were added. This behavior is different from
        Statistics::Descriptive which does preserve order.

    $stat->maxdex();
        Returns the index of the maximum value of the data set. The index
        returned is the first occurrence of the maximum value.

        Note: the index is determined by the order data was added using
        `add_data()' or `add_data_tuple()'. It is meaningless in context of
        `get_data()' as `get_data()' does not return values in the same
        order in which they were added. This behavior is different from
        Statistics::Descriptive which does preserve order.

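        A first-occurrence index can be tracked while data arrives, without
        keeping the data in order. The sketch below is an assumption about
        how that could be done, not the module's code; the helper name
        `first_extreme_indices' is hypothetical and indices are assumed to
        be zero-based.

```perl
use strict;
use warnings;

# Hypothetical helper: return (index of first minimum, index of first
# maximum) for values in the order they were added. Indices are zero-based.
sub first_extreme_indices {
    my @values = @_;
    my ($min, $min_index, $max, $max_index);
    for my $i (0 .. $#values) {
        my $v = $values[$i];
        # Strict comparisons keep the FIRST occurrence on ties.
        ($min, $min_index) = ($v, $i) if !defined $min || $v < $min;
        ($max, $max_index) = ($v, $i) if !defined $max || $v > $max;
    }
    return ($min_index, $max_index);
}

my ($mindex, $maxdex) = first_extreme_indices(1, 10, 2, 1, 1, 4, 5, 1, 10, 8, 7);
print "min index = $mindex, max index = $maxdex\n";
```
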
    $stat->count();
        Returns the total number of elements in the data set.

    $stat->uniq();
        If called in scalar context, returns the total number of unique
        elements in the data set. For example, if your data set is
        (1,2,2,3,3,3), uniq will return 3.

        If called in array context, returns an array of each data value in
        the data set in sorted order. In the above example, `@uniq =
        $stats->uniq();' would return (1,2,3).

        This function is specific to Statistics::Descriptive::Discrete and
        is not implemented in Statistics::Descriptive.

        It is useful for getting a frequency distribution for each discrete
        value in the data set:

           my $stats = Statistics::Descriptive::Discrete->new();
           $stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
           my @bins = $stats->uniq();
           my $f = $stats->frequency_distribution_ref(\@bins);
           for (sort {$a <=> $b} keys %$f) {
               print "value = $_, count = $f->{$_}\n";
           }

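        The scalar versus array behavior described for `uniq' is the kind
        of thing Perl's built-in `wantarray' supports. The sketch below
        shows the pattern on a plain value => count hash; `uniq_values' is
        a hypothetical helper, not the module's method.

```perl
use strict;
use warnings;

# Hypothetical helper: sorted distinct values in list context,
# number of distinct values in scalar context.
sub uniq_values {
    my ($count_of) = @_;
    my @values = sort { $a <=> $b } keys %$count_of;
    return wantarray ? @values : scalar @values;
}

# Tally of the data set (1,2,2,3,3,3)
my %count_of = (1 => 1, 2 => 2, 3 => 3);

my $how_many = uniq_values(\%count_of);   # scalar context
my @distinct = uniq_values(\%count_of);   # list context
print "unique count = $how_many, values = @distinct\n";
```
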
    $stat->sum();
        Returns the sum of all the values in the data set.

    $stat->mean();
        Returns the mean of the data.

    $stat->harmonic_mean();
        Returns the harmonic mean of the data. Since the harmonic mean is
        undefined if any of the data are zero or if the sum of the
        reciprocals is zero, it will return `undef' for both of those
        cases.

    $stat->geometric_mean();
        Returns the geometric mean of the data. Returns `undef' if any of
        the data are less than 0. Returns 0 if any of the data are 0.

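        The special cases above can be made concrete with a small sketch
        that works on a value => count hash. Both helper names are
        hypothetical, and the use of logarithms for the geometric mean is a
        choice made for this example, not a statement about the module's
        internals.

```perl
use strict;
use warnings;

# Hypothetical helper: harmonic mean n / sum(count/value).
# Returns undef if any value is zero or the reciprocal sum is zero.
sub harmonic_mean_from_counts {
    my ($count_of) = @_;
    my ($n, $recip_sum) = (0, 0);
    for my $value (keys %$count_of) {
        return undef if $value == 0;
        $n         += $count_of->{$value};
        $recip_sum += $count_of->{$value} / $value;
    }
    return undef if $n == 0 || $recip_sum == 0;
    return $n / $recip_sum;
}

# Hypothetical helper: geometric mean exp(sum(count*log(value)) / n).
# Returns undef if any value is negative, 0 if any value is zero.
sub geometric_mean_from_counts {
    my ($count_of) = @_;
    my ($n, $log_sum, $has_zero) = (0, 0, 0);
    for my $value (keys %$count_of) {
        return undef if $value < 0;
        $n += $count_of->{$value};
        if ($value == 0) { $has_zero = 1; next; }
        $log_sum += $count_of->{$value} * log($value);
    }
    return undef if $n == 0;
    return 0 if $has_zero;
    return exp($log_sum / $n);
}

my %count_of = (1 => 1, 2 => 1, 4 => 1);   # the data set (1,2,4)
printf "harmonic mean = %.4f\n",  harmonic_mean_from_counts(\%count_of);
printf "geometric mean = %.4f\n", geometric_mean_from_counts(\%count_of);
```
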
    $stat->median();
        Returns the median value of the data.

    $stat->mode();
        Returns the mode of the data.

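        Median and mode can both be read off a value => count hash without
        expanding it. The sketch below is an illustration under stated
        assumptions, not the module's code: the helper names are
        hypothetical, an even-sized data set is assumed to use the average
        of the two middle values, and ties for the mode are assumed to go
        to the smallest value.

```perl
use strict;
use warnings;

# Hypothetical helper: value at a 1-based position in the sorted data.
sub value_at {
    my ($count_of, $position) = @_;
    my $seen = 0;
    for my $value (sort { $a <=> $b } keys %$count_of) {
        $seen += $count_of->{$value};
        return $value if $seen >= $position;
    }
    return undef;
}

# Hypothetical helper: median by walking cumulative counts.
sub median_from_counts {
    my ($count_of) = @_;
    my $n = 0;
    $n += $_ for values %$count_of;
    return undef unless $n;
    return value_at($count_of, ($n + 1) / 2) if $n % 2;
    return (value_at($count_of, $n / 2) + value_at($count_of, $n / 2 + 1)) / 2;
}

# Hypothetical helper: the value with the highest count
# (smallest value wins a tie in this sketch).
sub mode_from_counts {
    my ($count_of) = @_;
    my ($mode, $best) = (undef, 0);
    for my $value (sort { $a <=> $b } keys %$count_of) {
        ($mode, $best) = ($value, $count_of->{$value})
            if $count_of->{$value} > $best;
    }
    return $mode;
}

# Tally of the SYNOPSIS data: 1,10,2,1,1,4,5,1,10,8,7
my %count_of = (1 => 4, 2 => 1, 4 => 1, 5 => 1, 7 => 1, 8 => 1, 10 => 2);
print "median = ", median_from_counts(\%count_of), "\n";
print "mode = ",   mode_from_counts(\%count_of),   "\n";
```
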
    $stat->variance();
        Returns the variance of the data.

    $stat->standard_deviation();
        Returns the standard deviation of the data.

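        For reference, variance can be computed from a value => count hash
        by weighting each squared deviation with its count. The sketch
        below assumes the sample form (dividing by n - 1); the text above
        does not say which form the module uses, and the helper names are
        hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: sample variance (n - 1 denominator, an assumption)
# from a value => count hash. Returns undef for fewer than two points.
sub variance_from_counts {
    my ($count_of) = @_;
    my ($n, $sum) = (0, 0);
    for my $value (keys %$count_of) {
        $n   += $count_of->{$value};
        $sum += $value * $count_of->{$value};
    }
    return undef if $n < 2;
    my $mean = $sum / $n;
    my $squares = 0;
    $squares += $count_of->{$_} * ($_ - $mean) ** 2 for keys %$count_of;
    return $squares / ($n - 1);
}

# Hypothetical helper: standard deviation as the square root of the variance.
sub standard_deviation_from_counts {
    my $variance = variance_from_counts(@_);
    return defined $variance ? sqrt($variance) : undef;
}

my %count_of = (1 => 1, 2 => 1, 3 => 1, 4 => 1, 5 => 1);   # data (1,2,3,4,5)
printf "variance = %.4f\n",           variance_from_counts(\%count_of);
printf "standard_deviation = %.4f\n", standard_deviation_from_counts(\%count_of);
```
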
    $stat->sample_range();
        Returns the sample range (max - min) of the data set.

    $stat->frequency_distribution_ref($num_partitions);
    $stat->frequency_distribution_ref(\@bins);
    $stat->frequency_distribution_ref();
        `frequency_distribution_ref($num_partitions)' slices the data into
        `$num_partitions' sets (where `$num_partitions' is greater than 1)
        and counts the number of items that fall into each partition. It
        returns a reference to a hash where the keys are the numerical
        values of the partitions used. The minimum value of the data set is
        not a key and the maximum value of the data set is always a key.
        The number of entries for a particular partition key is the number
        of items which are greater than the previous partition key and less
        than or equal to the current partition key. As an example,

           $stat->add_data(1,1.5,2,2.5,3,3.5,4);
           $f = $stat->frequency_distribution_ref(2);
           for (sort {$a <=> $b} keys %$f) {
              print "key = $_, count = $f->{$_}\n";
           }

        prints

           key = 2.5, count = 4
           key = 4, count = 3

        since there are four items less than or equal to 2.5, and three
        items greater than 2.5 and less than or equal to 4.

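        The partition rule can be checked by hand with a short sketch. It
        assumes equal-width partitions between the minimum and the maximum,
        which is consistent with the example above but is not confirmed as
        the module's exact method; `partition_counts' is a hypothetical
        helper.

```perl
use strict;
use warnings;

# Hypothetical helper: equal-width partitions between min and max.
# Each key is the upper edge of a partition; the last key is the maximum.
sub partition_counts {
    my ($num_partitions, @data) = @_;
    return undef unless @data && $num_partitions > 1;
    my @sorted = sort { $a <=> $b } @data;
    my ($min, $max) = @sorted[0, -1];
    my $width = ($max - $min) / $num_partitions;
    my @edges = map { $min + $width * $_ } 1 .. $num_partitions - 1;
    push @edges, $max;    # use max itself to avoid rounding drift
    my %freq = map { $_ => 0 } @edges;
    for my $x (@sorted) {
        for my $edge (@edges) {
            # Count $x in the first partition whose upper edge is >= $x.
            if ($x <= $edge) { $freq{$edge}++; last; }
        }
    }
    return \%freq;
}

my $f = partition_counts(2, 1, 1.5, 2, 2.5, 3, 3.5, 4);
for (sort { $a <=> $b } keys %$f) {
    print "key = $_, count = $f->{$_}\n";
}
```

        For this input the sketch gives a count of 4 for key 2.5 and 3 for
        key 4, matching the output shown above.
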
        `frequency_distribution_ref(\@bins)' provides the bins that are to
        be used for the distribution. This allows for non-uniform
        distributions as well as trimmed or sample distributions to be
        found. `@bins' must be monotonic and must contain at least one
        element. Note that unless the set of bins contains the full range of
        the data, the total counts returned will be less than the sample
        size.

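        The note about totals can be seen in a small sketch with explicit
        bins. It assumes that values at or below the first bin are counted
        in the first bin and that values above the last bin are dropped,
        following the partition rule described earlier; `bin_counts' is a
        hypothetical helper, not the module's method.

```perl
use strict;
use warnings;

# Hypothetical helper: count data into caller-supplied bin edges.
# A value is counted in the first bin whose edge is >= the value;
# values above the last edge are not counted at all.
sub bin_counts {
    my ($bins, @data) = @_;
    my @edges = sort { $a <=> $b } @$bins;
    my %freq  = map { $_ => 0 } @edges;
    VALUE: for my $x (@data) {
        for my $edge (@edges) {
            if ($x <= $edge) { $freq{$edge}++; next VALUE; }
        }
        # $x exceeds every bin edge, so it is left out of the totals.
    }
    return \%freq;
}

my @data = (1, 1.5, 2, 2.5, 3, 3.5, 4);
my $f    = bin_counts([2, 3], @data);
my $total = 0;
for (sort { $a <=> $b } keys %$f) {
    print "bin = $_, count = $f->{$_}\n";
    $total += $f->{$_};
}
print "counted $total of ", scalar @data, " items\n";
```

        Here the bins stop at 3, so 3.5 and 4 are not counted and the total
        is 5 rather than 7.
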
        Calling `frequency_distribution_ref()' with no arguments returns the
        last distribution calculated, if such exists.

    my %hash = $stat->frequency_distribution($partitions);
    my %hash = $stat->frequency_distribution(\@bins);
    my %hash = $stat->frequency_distribution();
        Same as `frequency_distribution_ref()' except that it returns the
        hash flattened into the return list. Kept for compatibility with
        previous versions of Statistics::Descriptive::Discrete; using it is
        discouraged.

        Note: in earlier versions of Statistics::Descriptive::Discrete,
        `frequency_distribution()' behaved differently from the
        Statistics::Descriptive implementation. Any code that uses this
        function should be carefully checked to ensure compatibility with
        the current implementation.

    $stat->get_data();
        Returns a copy of the data array. Note: This array could be very
        large and would thus defeat the purpose of using this module. Make
        sure you really need it before using `get_data()'.

        The returned array contains the values sorted by value. It does not
        preserve the order in which the values were added. Preserving order
        would defeat the purpose of this module, which gives up order
        preservation in exchange for speed and lower memory usage. If order
        is important, use Statistics::Descriptive.

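        The loss of insertion order follows directly from storing only
        values and counts. The sketch below rebuilds a data array from a
        value => count hash; `expand_counts' is a hypothetical helper used
        for illustration, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a sorted data array from a
# value => count hash. The original insertion order cannot be recovered.
sub expand_counts {
    my ($count_of) = @_;
    return map { ($_) x $count_of->{$_} }
           sort { $a <=> $b } keys %$count_of;
}

my @added = (3, 1, 3, 2);          # order in which values were added
my %count_of;
$count_of{$_}++ for @added;

my @rebuilt = expand_counts(\%count_of);
print "added:   @added\n";
print "rebuilt: @rebuilt\n";
```

        The rebuilt array has as many elements as the original data, which
        is why materializing it can be costly for large data sets.
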
    $stat->clear();
        Clears all data and resets the instance as if it were newly
        created.

        Effectively the same as

          my $class = ref($stat);
          undef $stat;
          $stat = $class->new();

NOTE
    The interface for this module strives to be identical to
    Statistics::Descriptive. Any differences are noted in the description
    for each method.

BUGS
    *   Code for calculating mode is not as robust as it should be.

TODO
    *   Add rest of methods (at least ones that don't depend on original
        order of data) from Statistics::Descriptive

AUTHOR
    Rhet Turnbull, rturnbull+cpan@gmail.com

CREDIT
    Thanks to the following individuals for finding bugs, providing
    feedback, and submitting changes:

    *   Peter Dienes for finding and fixing a bug in the variance
        calculation.

    *   Bill Dueber for suggesting the add_data_tuple method.

COPYRIGHT
      Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved.  This
      program is free software; you can redistribute it and/or modify it
      under the same terms as Perl itself.

      Portions of this code are from Statistics::Descriptive which is under
      the following copyrights:

      Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
      program is free software; you can redistribute it and/or modify it
      under the same terms as Perl itself.

      Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
      is free software; you can redistribute it and/or modify it under the
      same terms as Perl itself.

      Copyright (c) 1994,1995 Jason Kastner. All rights reserved.  This
      program is free software; you can redistribute it and/or modify it
      under the same terms as Perl itself.

SEE ALSO
    Statistics::Descriptive

    Statistics::Discrete
