README
NAME
    Statistics::Descriptive::Discrete - Compute descriptive statistics for
    discrete data sets.

    To install, use the CPAN module
    (https://metacpan.org/pod/Statistics::Descriptive::Discrete).

SYNOPSIS
      use Statistics::Descriptive::Discrete;

      my $stats = Statistics::Descriptive::Discrete->new();
      $stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
      print "count = ",$stats->count(),"\n";
      # print imposes list context; force scalar context to get the count
      print "uniq = ",scalar($stats->uniq()),"\n";
      print "sum = ",$stats->sum(),"\n";
      print "min = ",$stats->min(),"\n";
      print "min index = ",$stats->mindex(),"\n";
      print "max = ",$stats->max(),"\n";
      print "max index = ",$stats->maxdex(),"\n";
      print "mean = ",$stats->mean(),"\n";
      print "geometric mean = ",$stats->geometric_mean(),"\n";
      print "harmonic mean = ", $stats->harmonic_mean(),"\n";
      print "standard_deviation = ",$stats->standard_deviation(),"\n";
      print "variance = ",$stats->variance(),"\n";
      print "sample_range = ",$stats->sample_range(),"\n";
      print "mode = ",$stats->mode(),"\n";
      print "median = ",$stats->median(),"\n";
      my $f = $stats->frequency_distribution_ref(3);
      for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
      }

DESCRIPTION
    This module provides basic functions used in descriptive statistics. It
    borrows very heavily from Statistics::Descriptive::Full (which is
    included with Statistics::Descriptive) with one major difference: this
    module is optimized for discretized data, e.g., data from an A/D
    conversion that has a discrete set of possible values. For example, if
    your data is produced by an 8-bit A/D converter, you have only 256
    possible values in your data set. Even though you might have a million
    data points, only 256 different values can occur among them. Instead of
    storing the entire data set as Statistics::Descriptive does, this module
    stores only the values seen and the number of times each value occurs.

    For very large data sets, this storage method results in significant
    speed and memory improvements. For example, for an 8-bit data set (256
    possible values) with 1,000,000 data points, this module is about 10x
    faster than Statistics::Descriptive::Full or
    Statistics::Descriptive::Sparse.

    The run time of Statistics::Descriptive depends on the size of the data
    set. In particular, repeated calls to `add_data' are slow.
    Statistics::Descriptive::Discrete's `add_data' is optimized for speed.
    For a given number of data points, this module's run time increases as
    the number of unique values in the data set increases. For example,
    while this module runs about 10x faster than
    Statistics::Descriptive::Full for an 8-bit data set, the speedup drops
    to about 3x for an equivalently sized 20-bit data set.

    See sdd_prof.pl in the examples directory to profile this module
    against Statistics::Descriptive::Full.

METHODS
    $stat = Statistics::Descriptive::Discrete->new();
        Creates a new statistics object.

    $stat->add_data(1,2,3,4,5);
        Adds data to the statistics object. Sets a flag so that the
        statistics will be recomputed the next time they're needed.

    $stat->add_data_tuple(1,2,42,3);
        Adds data to the statistics object where every two elements are a
        value and a count (how many times the value occurred). The above
        is equivalent to `$stat->add_data(1,1,42,42,42);' Use this when
        your data is in a form isomorphic to ($value, $occurrence).

    $stat->max();
        Returns the maximum value of the data set.

    $stat->min();
        Returns the minimum value of the data set.

    $stat->mindex();
        Returns the index of the minimum value of the data set. The index
        returned is the first occurrence of the minimum value.

        Note: the index is determined by the order in which data was added
        using `add_data()' or `add_data_tuple()'. It is meaningless in the
        context of `get_data()', as `get_data()' does not return values in
        the order in which they were added. This behavior differs from
        Statistics::Descriptive, which does preserve order.

    $stat->maxdex();
        Returns the index of the maximum value of the data set. The index
        returned is the first occurrence of the maximum value.

        Note: the index is determined by the order in which data was added
        using `add_data()' or `add_data_tuple()'. It is meaningless in the
        context of `get_data()', as `get_data()' does not return values in
        the order in which they were added. This behavior differs from
        Statistics::Descriptive, which does preserve order.

    $stat->count();
        Returns the total number of elements in the data set.

    $stat->uniq();
        If called in scalar context, returns the total number of unique
        elements in the data set. For example, if your data set is
        (1,2,2,3,3,3), uniq will return 3.

        If called in list context, returns a list of the distinct values
        in the data set, in sorted order. In the above example, `@uniq =
        $stats->uniq();' would return (1,2,3).

        This function is specific to Statistics::Descriptive::Discrete and
        is not implemented in Statistics::Descriptive.

        It is useful for getting a frequency distribution for each discrete
        value in the data set:

            my $stats = Statistics::Descriptive::Discrete->new();
            $stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
            my @bins = $stats->uniq();
            my $f = $stats->frequency_distribution_ref(\@bins);
            for (sort {$a <=> $b} keys %$f) {
                print "value = $_, count = $f->{$_}\n";
            }

    $stat->sum();
        Returns the sum of all the values in the data set.

    $stat->mean();
        Returns the mean of the data.

    $stat->harmonic_mean();
        Returns the harmonic mean of the data. The harmonic mean is
        undefined if any of the data are zero or if the sum of the
        reciprocals is zero; `undef' is returned in both cases.

    $stat->geometric_mean();
        Returns the geometric mean of the data. Returns `undef' if any of
        the data are less than 0. Returns 0 if any of the data are 0.

    $stat->median();
        Returns the median value of the data.

    $stat->mode();
        Returns the mode of the data.

    $stat->variance();
        Returns the variance of the data.

    $stat->standard_deviation();
        Returns the standard deviation of the data.

    $stat->sample_range();
        Returns the sample range (max - min) of the data set.

    $stat->frequency_distribution_ref($num_partitions);
    $stat->frequency_distribution_ref(\@bins);
    $stat->frequency_distribution_ref();
        `frequency_distribution_ref($num_partitions)' slices the data into
        `$num_partitions' sets (where $num_partitions is greater than 1) and
        counts the number of items that fall into each partition. It returns
        a reference to a hash where the keys are the numerical values of the
        partitions used. The minimum value of the data set is not a key and
        the maximum value of the data set is always a key. The number of
        entries for a particular partition key is the number of items that
        are greater than the previous partition key and less than or equal
        to the current partition key. As an example,

            $stat->add_data(1,1.5,2,2.5,3,3.5,4);
            $f = $stat->frequency_distribution_ref(2);
            for (sort {$a <=> $b} keys %$f) {
                print "key = $_, count = $f->{$_}\n";
            }

        prints

            key = 2.5, count = 4
            key = 4, count = 3

        since there are four items less than or equal to 2.5, and three
        items greater than 2.5 and less than or equal to 4.

        `frequency_distribution_ref(\@bins)' provides the bins that are to
        be used for the distribution. This allows for non-uniform
        distributions as well as trimmed or sample distributions to be
        found. `@bins' must be monotonic and must contain at least one
        element. Note that unless the set of bins contains the full range of
        the data, the total counts returned will be less than the sample
        size.

        Calling `frequency_distribution_ref()' with no arguments returns the
        last distribution calculated, if one exists.

    my %hash = $stat->frequency_distribution($partitions);
    my %hash = $stat->frequency_distribution(\@bins);
    my %hash = $stat->frequency_distribution();
        Same as `frequency_distribution_ref()' except that it returns the
        hash flattened into the return list. Kept for compatibility with
        previous versions of Statistics::Descriptive::Discrete; using it
        is discouraged.

        Note: in earlier versions of Statistics::Descriptive::Discrete,
        `frequency_distribution()' behaved differently from the
        Statistics::Descriptive implementation. Any code that uses this
        function should be carefully checked to ensure compatibility with
        the current implementation.

    $stat->get_data();
        Returns a copy of the data array. Note: this array could be very
        large and would thus defeat the purpose of using this module. Make
        sure you really need it before using `get_data()'.

        The returned array contains the values sorted by value. It does not
        preserve the order in which the values were added. Preserving order
        would defeat the purpose of this module, which gives up order
        preservation in exchange for speed and lower memory usage. If
        order is important, use Statistics::Descriptive.

    $stat->clear();
        Clears all data and resets the instance as if it were newly
        created.

        Effectively the same as

            my $class = ref($stat);
            undef $stat;
            $stat = $class->new;

NOTE
    The interface for this module strives to be identical to
    Statistics::Descriptive. Any differences are noted in the description
    for each method.

BUGS
    *   Code for calculating mode is not as robust as it should be.

TODO
    *   Add the rest of the methods from Statistics::Descriptive (at least
        the ones that don't depend on the original order of the data).

AUTHOR
    Rhet Turnbull, rturnbull+cpan@gmail.com

CREDIT
    Thanks to the following individuals for finding bugs, providing
    feedback, and submitting changes:

    *   Peter Dienes for finding and fixing a bug in the variance
        calculation.

    *   Bill Dueber for suggesting the add_data_tuple method.

COPYRIGHT
      Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved. This
      program is free software; you can redistribute it and/or modify it
      under the same terms as Perl itself.

      Portions of this code are from Statistics::Descriptive which is under
      the following copyrights:

      Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
      program is free software; you can redistribute it and/or modify it
      under the same terms as Perl itself.

      Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
      is free software; you can redistribute it and/or modify it under the
      same terms as Perl itself.

      Copyright (c) 1994,1995 Jason Kastner. All rights
      reserved. This program is free software; you can redistribute it
      and/or modify it under the same terms as Perl itself.

SEE ALSO
    Statistics::Descriptive

    Statistics::Discrete

README.md
# NAME

Statistics::Descriptive::Discrete - Compute descriptive statistics for discrete data sets.

To install, use the CPAN module (https://metacpan.org/pod/Statistics::Descriptive::Discrete).

# SYNOPSIS

```perl
    use Statistics::Descriptive::Discrete;

    my $stats = Statistics::Descriptive::Discrete->new();
    $stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
    print "count = ",$stats->count(),"\n";
    # print imposes list context; force scalar context to get the count
    print "uniq = ",scalar($stats->uniq()),"\n";
    print "sum = ",$stats->sum(),"\n";
    print "min = ",$stats->min(),"\n";
    print "min index = ",$stats->mindex(),"\n";
    print "max = ",$stats->max(),"\n";
    print "max index = ",$stats->maxdex(),"\n";
    print "mean = ",$stats->mean(),"\n";
    print "geometric mean = ",$stats->geometric_mean(),"\n";
    print "harmonic mean = ", $stats->harmonic_mean(),"\n";
    print "standard_deviation = ",$stats->standard_deviation(),"\n";
    print "variance = ",$stats->variance(),"\n";
    print "sample_range = ",$stats->sample_range(),"\n";
    print "mode = ",$stats->mode(),"\n";
    print "median = ",$stats->median(),"\n";
    my $f = $stats->frequency_distribution_ref(3);
    for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
    }
```
# DESCRIPTION

This module provides basic functions used in descriptive statistics.
It borrows very heavily from Statistics::Descriptive::Full
(which is included with Statistics::Descriptive) with one major
difference: this module is optimized for discretized data,
e.g., data from an A/D conversion that has a discrete set of possible values.
For example, if your data is produced by an 8-bit A/D converter, you have only 256 possible
values in your data set. Even though you might have a million data points,
only 256 different values can occur among them. Instead of storing the
entire data set as Statistics::Descriptive does, this module stores only
the values seen and the number of times each value occurs.

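As a rough illustration of the storage idea described above (a sketch, not the module's actual internals), the following keeps a hash from each value to its occurrence count and derives the count, sum, and mean from those pairs alone. The names `@samples` and `%seen` are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical input: many samples drawn from only a few distinct values.
my @samples = (3, 7, 3, 3, 7, 1, 3);

# Store each distinct value once, together with how often it occurred.
my %seen;
$seen{$_}++ for @samples;

# Derive statistics from the (value => count) pairs alone.
my ($count, $sum) = (0, 0);
for my $value (keys %seen) {
    $count += $seen{$value};
    $sum   += $value * $seen{$value};
}
my $mean = $sum / $count;

printf "distinct = %d, count = %d, sum = %d, mean = %.3f\n",
    scalar(keys %seen), $count, $sum, $mean;
```

The loop visits one entry per distinct value, so its cost grows with the number of distinct values rather than with the number of data points.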
For very large data sets, this storage method results in significant speed
and memory improvements. For example, for an 8-bit data set (256 possible values)
with 1,000,000 data points, this module is about 10x faster than Statistics::Descriptive::Full
or Statistics::Descriptive::Sparse.

The run time of Statistics::Descriptive depends on the size of the data set. In particular,
repeated calls to `add_data` are slow. Statistics::Descriptive::Discrete's `add_data` is
optimized for speed. For a given number of data points, this module's run time increases
as the number of unique values in the data set increases. For example, while this module
runs about 10x faster than Statistics::Descriptive::Full for an 8-bit data set, the
speedup drops to about 3x for an equivalently sized 20-bit data set.

See sdd\_prof.pl in the examples directory to profile this module against
Statistics::Descriptive::Full.

# METHODS

- $stat = Statistics::Descriptive::Discrete->new();

    Creates a new statistics object.

- $stat->add\_data(1,2,3,4,5);

    Adds data to the statistics object. Sets a flag so that
    the statistics will be recomputed the next time they're
    needed.

- $stat->add\_data\_tuple(1,2,42,3);

    Adds data to the statistics object where every two elements
    are a value and a count (how many times the value occurred).
    The above is equivalent to `$stat->add_data(1,1,42,42,42);`
    Use this when your data is in a form isomorphic to
    ($value, $occurrence).

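The equivalence between (value, count) pairs and a plain list of values can be sketched in plain Perl as follows. The helper `add_pairs` is hypothetical and only mimics the idea; it is not part of the module.

```perl
use strict;
use warnings;

# Fold (value, count) pairs into a value => count hash.
# add_pairs is a hypothetical helper, not a module method.
sub add_pairs {
    my ($counts, @pairs) = @_;
    die "expected an even number of arguments\n" if @pairs % 2;
    while (my ($value, $n) = splice(@pairs, 0, 2)) {
        $counts->{$value} += $n;
    }
    return $counts;
}

# Value 1 occurs twice and value 42 occurs three times.
my %from_pairs;
add_pairs(\%from_pairs, 1, 2, 42, 3);

# The same data added one element at a time.
my %from_list;
$from_list{$_}++ for (1, 1, 42, 42, 42);

print "$_ occurs $from_pairs{$_} time(s)\n"
    for sort { $a <=> $b } keys %from_pairs;
```

Both hashes end up with the same contents, which is why pre-counted data never needs to be expanded first.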
- $stat->max();

    Returns the maximum value of the data set.

- $stat->min();

    Returns the minimum value of the data set.

- $stat->mindex();

    Returns the index of the minimum value of the data set.
    The index returned is the first occurrence of the minimum value.

    Note: the index is determined by the order in which data was added using
    `add_data()` or `add_data_tuple()`. It is meaningless in the context of
    `get_data()`, as `get_data()` does not return values in the
    order in which they were added. This behavior differs from
    Statistics::Descriptive, which does preserve order.

- $stat->maxdex();

    Returns the index of the maximum value of the data set.
    The index returned is the first occurrence of the maximum value.

    Note: the index is determined by the order in which data was added using
    `add_data()` or `add_data_tuple()`. It is meaningless in the context of
    `get_data()`, as `get_data()` does not return values in the
    order in which they were added. This behavior differs from
    Statistics::Descriptive, which does preserve order.

- $stat->count();

    Returns the total number of elements in the data set.

- $stat->uniq();

    If called in scalar context, returns the total number of unique elements in the data set.
    For example, if your data set is (1,2,2,3,3,3), uniq will return 3.

    If called in list context, returns a list of the distinct values in the data set, in sorted order.
    In the above example, `@uniq = $stats->uniq();` would return (1,2,3).

    This function is specific to Statistics::Descriptive::Discrete
    and is not implemented in Statistics::Descriptive.

    It is useful for getting a frequency distribution for each discrete value in the data set:

    ```perl
    my $stats = Statistics::Descriptive::Discrete->new();
    $stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
    my @bins = $stats->uniq();
    my $f = $stats->frequency_distribution_ref(\@bins);
    for (sort {$a <=> $b} keys %$f) {
        print "value = $_, count = $f->{$_}\n";
    }
    ```
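The scalar versus list behavior described for `uniq()` is the kind of thing Perl's built-in `wantarray` makes possible. The sketch below uses a hypothetical stand-in function, `distinct_values`, to show the pattern; whether the module is implemented this way is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a context-sensitive method such as uniq().
sub distinct_values {
    my %seen   = map { $_ => 1 } @_;
    my @sorted = sort { $a <=> $b } keys %seen;
    # wantarray is true in list context and false in scalar context.
    return wantarray ? @sorted : scalar @sorted;
}

my @data = (1, 2, 2, 3, 3, 3);

my $how_many = distinct_values(@data);   # scalar context: the count
my @values   = distinct_values(@data);   # list context: the sorted values

# print imposes list context on its arguments, so force scalar context
# explicitly when the count is what you want to display.
print "distinct = ", scalar(distinct_values(@data)), "\n";
print "values   = @values\n";
```

The practical consequence is that a context-sensitive call placed directly in a `print` argument list yields the values, not the count, unless it is wrapped in `scalar(...)`.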
- $stat->sum();

    Returns the sum of all the values in the data set.

- $stat->mean();

    Returns the mean of the data.

- $stat->harmonic\_mean();

    Returns the harmonic mean of the data. The harmonic mean is undefined
    if any of the data are zero or if the sum of the reciprocals is zero;
    `undef` is returned in both cases.

- $stat->geometric\_mean();

    Returns the geometric mean of the data. Returns `undef` if any of the data
    are less than 0. Returns 0 if any of the data are 0.

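The special cases listed for the harmonic and geometric means can be sketched directly on (value => count) pairs. The helpers `harmonic_mean_of` and `geometric_mean_of` are hypothetical names for this illustration, and the order in which the negative and zero cases are checked is an assumption rather than something the text specifies.

```perl
use strict;
use warnings;

# Both helpers take a reference to a (value => count) hash.
sub harmonic_mean_of {
    my ($counts) = @_;
    my ($n, $sum_recip) = (0, 0);
    for my $value (keys %$counts) {
        return undef if $value == 0;              # 1/0 is undefined
        $n         += $counts->{$value};
        $sum_recip += $counts->{$value} / $value; # count copies of 1/value
    }
    return undef if $n == 0 || $sum_recip == 0;
    return $n / $sum_recip;
}

sub geometric_mean_of {
    my ($counts) = @_;
    my @values = keys %$counts;
    return undef if !@values || grep { $_ < 0 } @values;
    return 0     if grep { $_ == 0 } @values;
    my ($n, $sum_log) = (0, 0);
    for my $value (@values) {
        $n       += $counts->{$value};
        $sum_log += $counts->{$value} * log($value);
    }
    # Averaging logarithms avoids overflow from multiplying many values.
    return exp($sum_log / $n);
}

my $hm = harmonic_mean_of({ 1 => 1, 2 => 2 });   # data: 1, 2, 2
my $gm = geometric_mean_of({ 2 => 1, 8 => 1 });  # data: 2, 8
printf "harmonic = %.3f, geometric = %.3f\n", $hm, $gm;
```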
- $stat->median();

    Returns the median value of the data.

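A median can be found from (value => count) pairs without expanding the data, by walking the sorted values and accumulating counts until the middle rank is reached. The sketch below assumes the common convention of averaging the two middle values when the number of points is even; the module's own handling of that case is not stated here. `median_of` is a hypothetical name.

```perl
use strict;
use warnings;

# Median from a (value => count) hash reference.
sub median_of {
    my ($counts) = @_;
    my @values = sort { $a <=> $b } keys %$counts;
    my $n = 0;
    $n += $counts->{$_} for @values;
    return undef unless $n;

    # 1-based ranks of the middle element(s); they coincide when $n is odd.
    my $lo_rank = int(($n + 1) / 2);
    my $hi_rank = int($n / 2) + 1;

    my ($lo, $hi, $cumulative) = (undef, undef, 0);
    for my $value (@values) {
        $cumulative += $counts->{$value};
        $lo = $value if !defined $lo && $cumulative >= $lo_rank;
        if ($cumulative >= $hi_rank) { $hi = $value; last }
    }
    return ($lo + $hi) / 2;
}

# Data 1,2,2,3,3,3: six points, middle values are 2 and 3.
my $even = median_of({ 1 => 1, 2 => 2, 3 => 3 });
# Data 1,1,1,1,2,4,5,7,8,10,10: eleven points, middle value is 4.
my $odd  = median_of({ 1 => 4, 2 => 1, 4 => 1, 5 => 1, 7 => 1, 8 => 1, 10 => 2 });
print "even-sized median = $even, odd-sized median = $odd\n";
```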
- $stat->mode();

    Returns the mode of the data.

- $stat->variance();

    Returns the variance of the data.

- $stat->standard\_deviation();

    Returns the standard deviation of the data.

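Variance and standard deviation can likewise be computed from (value => count) pairs. The sketch below assumes the sample form with an n - 1 divisor; the text above does not say which definition the module uses, so check the source if the distinction matters. `variance_of` is a hypothetical name.

```perl
use strict;
use warnings;

# Sample variance (n - 1 divisor, an assumption) from a count hash.
sub variance_of {
    my ($counts) = @_;
    my ($n, $sum) = (0, 0);
    for my $value (keys %$counts) {
        $n   += $counts->{$value};
        $sum += $value * $counts->{$value};
    }
    return undef if $n < 2;    # not defined for fewer than two points

    my $mean = $sum / $n;
    my $sum_sq_dev = 0;
    for my $value (keys %$counts) {
        # Each distinct value contributes its squared deviation count times.
        $sum_sq_dev += $counts->{$value} * ($value - $mean) ** 2;
    }
    return $sum_sq_dev / ($n - 1);
}

# Data: 2, 4, 4, 4, 5, 5, 7, 9
my %counts   = (2 => 1, 4 => 3, 5 => 2, 7 => 1, 9 => 1);
my $variance = variance_of(\%counts);
my $std_dev  = sqrt($variance);
printf "variance = %.4f, standard deviation = %.4f\n", $variance, $std_dev;
```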
- $stat->sample\_range();

    Returns the sample range (max - min) of the data set.

- $stat->frequency\_distribution\_ref($num\_partitions);
- $stat->frequency\_distribution\_ref(\\@bins);
- $stat->frequency\_distribution\_ref();

    `frequency_distribution_ref($num_partitions)` slices the data into
    `$num_partitions` sets (where $num\_partitions is greater than 1) and counts
    the number of items that fall into each partition. It returns a reference to a
    hash where the keys are the numerical values of the partitions used. The
    minimum value of the data set is not a key and the maximum value of the data
    set is always a key. The number of entries for a particular partition key is
    the number of items that are greater than the previous partition key and less
    than or equal to the current partition key. As an example,

    ```perl
    $stat->add_data(1,1.5,2,2.5,3,3.5,4);
    $f = $stat->frequency_distribution_ref(2);
    for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
    }
    ```

    prints

        key = 2.5, count = 4
        key = 4, count = 3

    since there are four items less than or equal to 2.5, and three items
    greater than 2.5 and less than or equal to 4.

    `frequency_distribution_ref(\@bins)` provides the bins that are to be used
    for the distribution. This allows for non-uniform distributions as
    well as trimmed or sample distributions to be found. `@bins` must
    be monotonic and must contain at least one element. Note that unless the
    set of bins contains the full range of the data, the total counts returned will
    be less than the sample size.

    Calling `frequency_distribution_ref()` with no arguments returns the last
    distribution calculated, if one exists.

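The `$num_partitions` form described above can be approximated with equal-width bins whose upper edges serve as hash keys. This is a sketch under that equal-width assumption, which reproduces the keys 2.5 and 4 from the example but may differ from the module's algorithm in edge cases. `partition_counts` is a hypothetical name.

```perl
use strict;
use warnings;

# Count (value => count) data into $num_partitions equal-width bins.
# Each key is the upper edge of a bin; the first bin includes the minimum.
sub partition_counts {
    my ($counts, $num_partitions) = @_;
    my @values = sort { $a <=> $b } keys %$counts;
    my ($min, $max) = @values[0, -1];
    my $width = ($max - $min) / $num_partitions;

    # Upper edges: min + width, min + 2*width, ..., then max itself.
    my @edges = map { $min + $width * $_ } 1 .. $num_partitions - 1;
    push @edges, $max;    # use max exactly to avoid rounding drift

    my %dist = map { $_ => 0 } @edges;
    my $i = 0;
    for my $value (@values) {
        # Advance to the first edge that is >= the current value.
        $i++ while $value > $edges[$i];
        $dist{ $edges[$i] } += $counts->{$value};
    }
    return \%dist;
}

my %counts = map { $_ => 1 } (1, 1.5, 2, 2.5, 3, 3.5, 4);
my $f = partition_counts(\%counts, 2);
print "key = $_, count = $f->{$_}\n" for sort { $a <=> $b } keys %$f;
```

Run on the data from the example, this prints the same two lines: four items in the bin ending at 2.5 and three in the bin ending at 4.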
- my %hash = $stat->frequency\_distribution($partitions);
- my %hash = $stat->frequency\_distribution(\\@bins);
- my %hash = $stat->frequency\_distribution();

    Same as `frequency_distribution_ref()` except that it returns the hash
    flattened into the return list. Kept for compatibility with previous
    versions of Statistics::Descriptive::Discrete; using it is discouraged.

    Note: in earlier versions of Statistics::Descriptive::Discrete, `frequency_distribution()`
    behaved differently from the Statistics::Descriptive implementation. Any code that uses
    this function should be carefully checked to ensure compatibility with the current
    implementation.

- $stat->get\_data();

    Returns a copy of the data array. Note: this array could be
    very large and would thus defeat the purpose of using this
    module. Make sure you really need it before using `get_data()`.

    The returned array contains the values sorted by value. It does
    not preserve the order in which the values were added. Preserving
    order would defeat the purpose of this module, which gives up order
    preservation in exchange for speed and lower memory usage. If order
    is important, use Statistics::Descriptive.

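To see why the returned data comes back sorted rather than in insertion order, consider rebuilding an array from (value => count) pairs. The sketch below is illustrative only and assumes nothing about the module's internals beyond what is described above.

```perl
use strict;
use warnings;

# Value 42 was seen three times and value 1 twice; the order in which
# they were originally added is not recorded anywhere.
my %counts = (42 => 3, 1 => 2);

# Repeat each value by its count, visiting values in ascending order.
my @data = map { ($_) x $counts{$_} } sort { $a <=> $b } keys %counts;

print "@data\n";
```

The expanded array has one element per original data point, so it costs the memory that the compact representation was meant to save.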
- $stat->clear();

    Clears all data and resets the instance as if it were newly created.

    Effectively the same as

    ```perl
    my $class = ref($stat);
    undef $stat;
    $stat = $class->new;
    ```
# NOTE

The interface for this module strives to be identical to Statistics::Descriptive.
Any differences are noted in the description for each method.

# BUGS

- Code for calculating mode is not as robust as it should be.

# TODO

- Add the rest of the methods from Statistics::Descriptive (at least the ones
that don't depend on the original order of the data).

# AUTHOR

Rhet Turnbull, rturnbull+cpan@gmail.com

# CREDIT

Thanks to the following individuals for finding bugs, providing feedback,
and submitting changes:

- Peter Dienes for finding and fixing a bug in the variance calculation.
- Bill Dueber for suggesting the add\_data\_tuple method.

# COPYRIGHT

    Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Portions of this code are from Statistics::Descriptive which is under
    the following copyrights:

    Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    Copyright (c) 1994,1995 Jason Kastner. All rights
    reserved. This program is free software; you can redistribute it
    and/or modify it under the same terms as Perl itself.

# SEE ALSO

Statistics::Descriptive

Statistics::Discrete
