# NAME

Statistics::Descriptive::Discrete - Compute descriptive statistics for discrete data sets.

To install, use the CPAN module (https://metacpan.org/pod/Statistics::Descriptive::Discrete).

# SYNOPSIS

```perl
use Statistics::Descriptive::Discrete;

my $stats = Statistics::Descriptive::Discrete->new();
$stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
print "count = ",$stats->count(),"\n";
print "uniq  = ",$stats->uniq(),"\n";
print "sum = ",$stats->sum(),"\n";
print "min = ",$stats->min(),"\n";
print "min index = ",$stats->mindex(),"\n";
print "max = ",$stats->max(),"\n";
print "max index = ",$stats->maxdex(),"\n";
print "mean = ",$stats->mean(),"\n";
print "geometric mean = ",$stats->geometric_mean(),"\n";
print "harmonic mean = ", $stats->harmonic_mean(),"\n";
print "standard_deviation = ",$stats->standard_deviation(),"\n";
print "variance = ",$stats->variance(),"\n";
print "sample_range = ",$stats->sample_range(),"\n";
print "mode = ",$stats->mode(),"\n";
print "median = ",$stats->median(),"\n";
my $f = $stats->frequency_distribution_ref(3);
for (sort {$a <=> $b} keys %$f) {
    print "key = $_, count = $f->{$_}\n";
}
```

# DESCRIPTION

This module provides basic functions used in descriptive statistics.
It borrows very heavily from Statistics::Descriptive::Full
(which is included with Statistics::Descriptive) with one major
difference. This module is optimized for discretized data,
e.g. data from an A/D conversion that has a discrete set of possible values.
For example, if your data is produced by an 8-bit A/D converter, then you'd have only 256 possible
values in your data set. Even though you might have a million data points,
you'd only have 256 different values in those million points. Instead of storing the
entire data set as Statistics::Descriptive does, this module only stores
the values seen and the number of times each value occurs.

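
The storage idea described above can be sketched in plain Perl. This is only an illustration of keeping a value-to-count hash instead of the raw data; the variable names (`%count_of`, `@samples`) are hypothetical and the code is not taken from the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical illustration only: store each distinct value once,
# together with the number of times it was seen.
my %count_of;
my @samples = (1, 10, 2, 1, 1, 4, 5, 1, 10, 8, 7);
$count_of{$_}++ for @samples;

# Count, sum, and mean can be recovered from the (value, count) pairs
# without ever keeping the full list of samples.
my ($n, $sum) = (0, 0);
for my $value (keys %count_of) {
    $n   += $count_of{$value};
    $sum += $value * $count_of{$value};
}
my $mean = $sum / $n;
my $uniq = scalar keys %count_of;

print "count = $n, uniq = $uniq, sum = $sum, mean = $mean\n";
```

With this layout, memory use grows with the number of distinct values rather than the number of data points, which is the trade-off the description refers to.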

For very large data sets, this storage method results in significant speed
and memory improvements. For example, for an 8-bit data set (256 possible values)
with 1,000,000 data points, this module is about 10x faster than Statistics::Descriptive::Full
or Statistics::Descriptive::Sparse.

Statistics::Descriptive run time depends on the size of the data set. In particular,
repeated calls to `add_data` are slow. Statistics::Descriptive::Discrete's `add_data` is
optimized for speed. For a given number of data points, this module's run time will increase
as the number of unique data values in the data set increases. For example, while this module
runs at about 10x the speed of Statistics::Descriptive::Full for an 8-bit data set, the
speedup drops to about 3x for an equivalently sized 20-bit data set.

See sdd\_prof.pl in the examples directory to play with profiling this module against
Statistics::Descriptive::Full.

# METHODS

- $stat = Statistics::Descriptive::Discrete->new();

    Create a new statistics object.

- $stat->add\_data(1,2,3,4,5);

    Adds data to the statistics object. Sets a flag so that
    the statistics will be recomputed the next time they're
    needed.

- $stat->add\_data\_tuple(1,2,42,3);

    Adds data to the statistics object where every two elements
    are a value and a count (how many times the value occurred).
    The above is equivalent to `$stat->add_data(1,1,42,42,42);`
    Use this when your data is in a form isomorphic to
    ($value, $occurrence).

- $stat->max();

    Returns the maximum value of the data set.

- $stat->min();

    Returns the minimum value of the data set.

- $stat->mindex();

    Returns the index of the minimum value of the data set.
    The index returned is the first occurrence of the minimum value.

    Note: the index is determined by the order data was added using add\_data() or add\_data\_tuple().
    It is meaningless in the context of get\_data(), as get\_data() does not return values in the same
    order in which they were added. This behavior differs from Statistics::Descriptive, which
    does preserve order.

- $stat->maxdex();

    Returns the index of the maximum value of the data set.
    The index returned is the first occurrence of the maximum value.

    Note: the index is determined by the order data was added using
    `add_data()` or `add_data_tuple()`. It is meaningless in the context of
    `get_data()`, as `get_data()` does not return values in the same
    order in which they were added. This behavior differs from
    Statistics::Descriptive, which does preserve order.

- $stat->count();

    Returns the total number of elements in the data set.

- $stat->uniq();

    If called in scalar context, returns the total number of unique elements in the data set.
    For example, if your data set is (1,2,2,3,3,3), uniq will return 3.

    If called in array context, returns an array of each data value in the data set in sorted order.
    In the above example, `@uniq = $stats->uniq();` would return (1,2,3).

    This function is specific to Statistics::Descriptive::Discrete
    and is not implemented in Statistics::Descriptive.

    It is useful for getting a frequency distribution for each discrete value in the data set:

    ```perl
    my $stats = Statistics::Descriptive::Discrete->new();
    $stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
    my @bins = $stats->uniq();
    my $f = $stats->frequency_distribution_ref(\@bins);
    for (sort {$a <=> $b} keys %$f) {
        print "value = $_, count = $f->{$_}\n";
    }
    ```

- $stat->sum();

    Returns the sum of all the values in the data set.

- $stat->mean();

    Returns the mean of the data.

- $stat->harmonic\_mean();

    Returns the harmonic mean of the data.
    Since the harmonic mean is undefined
    if any of the data are zero or if the sum of the reciprocals is zero,
    it returns `undef` in both of those cases.

- $stat->geometric\_mean();

    Returns the geometric mean of the data. Returns `undef` if any of the data
    are less than 0. Returns 0 if any of the data are 0.

- $stat->median();

    Returns the median value of the data.

- $stat->mode();

    Returns the mode of the data.

- $stat->variance();

    Returns the variance of the data.

- $stat->standard\_deviation();

    Returns the standard deviation of the data.

- $stat->sample\_range();

    Returns the sample range (max - min) of the data set.

- $stat->frequency\_distribution\_ref($num\_partitions);
- $stat->frequency\_distribution\_ref(\\@bins);
- $stat->frequency\_distribution\_ref();

    `frequency_distribution_ref($num_partitions)` slices the data into
    `$num_partitions` sets (where $num\_partitions is greater than 1) and counts
    the number of items that fall into each partition. It returns a reference to a
    hash where the keys are the numerical values of the partitions used. The
    minimum value of the data set is not a key and the maximum value of the data
    set is always a key. The count for a particular partition key is
    the number of items that are greater than the previous partition key and less
    than or equal to the current partition key. As an example,

    ```perl
    $stat->add_data(1,1.5,2,2.5,3,3.5,4);
    $f = $stat->frequency_distribution_ref(2);
    for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
    }
    ```

    prints

        key = 2.5, count = 4
        key = 4, count = 3

    since there are four items less than or equal to 2.5, and three items
    greater than 2.5 and less than or equal to 4.

    `frequency_distribution_ref(\@bins)` provides the bins that are to be used
    for the distribution.
    This allows for non-uniform distributions as
    well as trimmed or sample distributions to be found. `@bins` must
    be monotonic and must contain at least one element. Note that unless the
    set of bins contains the full range of the data, the total counts returned will
    be less than the sample size.

    Calling `frequency_distribution_ref()` with no arguments returns the last
    distribution calculated, if such exists.

- my %hash = $stat->frequency\_distribution($partitions);
- my %hash = $stat->frequency\_distribution(\\@bins);
- my %hash = $stat->frequency\_distribution();

    Same as `frequency_distribution_ref()` except that it returns the hash
    flattened into the return list. It is kept for compatibility with previous
    versions of Statistics::Descriptive::Discrete, and using it is discouraged.

    Note: in earlier versions of Statistics::Descriptive::Discrete, `frequency_distribution()`
    behaved differently than the Statistics::Descriptive implementation. Any code that uses
    this function should be carefully checked to ensure compatibility with the current
    implementation.

- $stat->get\_data();

    Returns a copy of the data array. Note: This array could be
    very large and would thus defeat the purpose of using this
    module. Make sure you really need it before using get\_data().

    The returned array contains the values sorted by value. It does
    not preserve the order in which the values were added. Preserving
    order would defeat the purpose of this module, which favors speed
    and memory usage over preserving order. If order is important,
    use Statistics::Descriptive.

- $stat->clear();

    Clears all data and resets the instance as if it were newly created.

    Effectively the same as

    ```perl
    my $class = ref($stat);
    undef $stat;
    $stat = $class->new();
    ```

# NOTE

The interface for this module strives to be identical to Statistics::Descriptive.
Any differences are noted in the description for each method.

# BUGS

- Code for calculating mode is not as robust as it should be.

# TODO

- Add rest of methods (at least ones that don't depend on original order of data)
from Statistics::Descriptive

# AUTHOR

Rhet Turnbull, rturnbull+cpan@gmail.com

# CREDIT

Thanks to the following individuals for finding bugs, providing feedback,
and submitting changes:

- Peter Dienes for finding and fixing a bug in the variance calculation.
- Bill Dueber for suggesting the add\_data\_tuple method.

# COPYRIGHT

    Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved.  This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Portions of this code are from Statistics::Descriptive which is under
    the following copyrights:

    Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    Copyright (c) 1994,1995 Jason Kastner. All rights
    reserved.  This program is free software; you can redistribute it
    and/or modify it under the same terms as Perl itself.

# SEE ALSO

Statistics::Descriptive

Statistics::Discrete