NAME
    Statistics::Descriptive::Discrete - Compute descriptive statistics for
    discrete data sets.

    To install, use the CPAN module
    (https://metacpan.org/pod/Statistics::Descriptive::Discrete).

SYNOPSIS
      use Statistics::Descriptive::Discrete;

      my $stats = Statistics::Descriptive::Discrete->new();
      $stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
      print "count = ",$stats->count(),"\n";
      print "uniq = ",$stats->uniq(),"\n";
      print "sum = ",$stats->sum(),"\n";
      print "min = ",$stats->min(),"\n";
      print "min index = ",$stats->mindex(),"\n";
      print "max = ",$stats->max(),"\n";
      print "max index = ",$stats->maxdex(),"\n";
      print "mean = ",$stats->mean(),"\n";
      print "geometric mean = ",$stats->geometric_mean(),"\n";
      print "harmonic mean = ",$stats->harmonic_mean(),"\n";
      print "standard_deviation = ",$stats->standard_deviation(),"\n";
      print "variance = ",$stats->variance(),"\n";
      print "sample_range = ",$stats->sample_range(),"\n";
      print "mode = ",$stats->mode(),"\n";
      print "median = ",$stats->median(),"\n";
      my $f = $stats->frequency_distribution_ref(3);
      for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
      }

DESCRIPTION
    This module provides basic functions used in descriptive statistics. It
    borrows very heavily from Statistics::Descriptive::Full (which is
    included with Statistics::Descriptive) with one major difference: this
    module is optimized for discretized data, e.g. data from an A/D
    conversion that has a discrete set of possible values. For example, if
    your data is produced by an 8-bit A/D, then you'd have only 256 possible
    values in your data set. Even though you might have a million data
    points, you'd only have 256 different values in those million points.
    Instead of storing the entire data set as Statistics::Descriptive does,
    this module only stores the values seen and the number of times each
    value occurs.
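
    The storage idea described above can be sketched in plain Perl. This is
    only an illustration under stated assumptions, not the module's actual
    implementation: the hash %count_of and the helpers add_points(),
    total_count() and total_sum() are hypothetical names invented for this
    example.

```perl
use strict;
use warnings;

# Hypothetical illustration, not the module's actual code: keep one
# counter per distinct value instead of storing every data point.
my %count_of;

# Record each data point by incrementing the counter for its value.
sub add_points {
    $count_of{$_}++ for @_;
    return;
}

# Total number of data points: the sum of the per-value counters.
sub total_count {
    my $n = 0;
    $n += $_ for values %count_of;
    return $n;
}

# Sum of all data points: each distinct value weighted by its counter.
sub total_sum {
    my $sum = 0;
    $sum += $_ * $count_of{$_} for keys %count_of;
    return $sum;
}

add_points(1, 10, 2, 1, 1, 4, 5, 1, 10, 8, 7);

printf "stored entries = %d\n", scalar keys %count_of;
printf "count = %d\n", total_count();
printf "sum = %d\n", total_sum();
printf "mean = %.4f\n", total_sum() / total_count();
```

    With the eleven data points from the SYNOPSIS, the sketch stores only
    seven hash entries, and the count (11) and sum (50) are recovered from
    the counters alone. Memory use depends on the number of distinct
    values, not on the number of data points.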

    For very large data sets, this storage method results in significant
    speed and memory improvements. For example, for an 8-bit data set (256
    possible values) with 1,000,000 data points, this module is about 10x
    faster than Statistics::Descriptive::Full or
    Statistics::Descriptive::Sparse.

    Statistics::Descriptive run time is a function of the size of the data
    set. In particular, repeated calls to `add_data' are slow.
    Statistics::Descriptive::Discrete's `add_data' is optimized for speed.
    For a given number of data points, this module's run time will increase
    as the number of unique data values in the data set increases. For
    example, while this module runs about 10x the speed of
    Statistics::Descriptive::Full for an 8-bit data set, the run speed drops
    to about 3x for an equivalent sized 20-bit data set.

    See sdd_prof.pl in the examples directory to play with profiling this
    module against Statistics::Descriptive::Full.

METHODS
    $stat = Statistics::Descriptive::Discrete->new();
        Create a new statistics object.

    $stat->add_data(1,2,3,4,5);
        Adds data to the statistics object. Sets a flag so that the
        statistics will be recomputed the next time they're needed.

    $stat->add_data_tuple(1,2,42,3);
        Adds data to the statistics object where every two elements are a
        value and a count (how many times the value occurred). The above is
        equivalent to `$stat->add_data(1,1,42,42,42);'. Use this when your
        data is in a form isomorphic to ($value, $occurrence).

    $stat->max();
        Returns the maximum value of the data set.

    $stat->min();
        Returns the minimum value of the data set.

    $stat->mindex();
        Returns the index of the minimum value of the data set. The index
        returned is the first occurrence of the minimum value.

        Note: the index is determined by the order data was added using
        `add_data()' or `add_data_tuple()'.
        It is meaningless in the context of `get_data()', as `get_data()'
        does not return values in the same order in which they were added.
        This behavior is different from Statistics::Descriptive, which does
        preserve order.

    $stat->maxdex();
        Returns the index of the maximum value of the data set. The index
        returned is the first occurrence of the maximum value.

        Note: the index is determined by the order data was added using
        `add_data()' or `add_data_tuple()'. It is meaningless in the context
        of `get_data()', as `get_data()' does not return values in the same
        order in which they were added. This behavior is different from
        Statistics::Descriptive, which does preserve order.

    $stat->count();
        Returns the total number of elements in the data set.

    $stat->uniq();
        If called in scalar context, returns the total number of unique
        elements in the data set. For example, if your data set is
        (1,2,2,3,3,3), uniq will return 3.

        If called in array context, returns an array of each data value in
        the data set in sorted order. In the above example, `@uniq =
        $stats->uniq();' would return (1,2,3).

        This function is specific to Statistics::Descriptive::Discrete and
        is not implemented in Statistics::Descriptive.

        It is useful for getting a frequency distribution for each discrete
        value in the data set:

            my $stats = Statistics::Descriptive::Discrete->new();
            $stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
            my @bins = $stats->uniq();
            my $f = $stats->frequency_distribution_ref(\@bins);
            for (sort {$a <=> $b} keys %$f) {
                print "value = $_, count = $f->{$_}\n";
            }

    $stat->sum();
        Returns the sum of all the values in the data set.

    $stat->mean();
        Returns the mean of the data.

    $stat->harmonic_mean();
        Returns the harmonic mean of the data.
        Since the harmonic mean is undefined if any of the data are zero or
        if the sum of the reciprocals is zero, it will return `undef' in
        both of those cases.

    $stat->geometric_mean();
        Returns the geometric mean of the data. Returns `undef' if any of
        the data are less than 0. Returns 0 if any of the data are 0.

    $stat->median();
        Returns the median value of the data.

    $stat->mode();
        Returns the mode of the data.

    $stat->variance();
        Returns the variance of the data.

    $stat->standard_deviation();
        Returns the standard deviation of the data.

    $stat->sample_range();
        Returns the sample range (max - min) of the data set.

    $stat->frequency_distribution_ref($num_partitions);
    $stat->frequency_distribution_ref(\@bins);
    $stat->frequency_distribution_ref();
        `frequency_distribution_ref($num_partitions)' slices the data into
        `$num_partitions' sets (where $num_partitions is greater than 1) and
        counts the number of items that fall into each partition. It returns
        a reference to a hash where the keys are the numerical values of the
        partitions used. The minimum value of the data set is not a key and
        the maximum value of the data set is always a key. The number of
        entries for a particular partition key is the number of items that
        are greater than the previous partition key and less than or equal
        to the current partition key. As an example,

            $stat->add_data(1,1.5,2,2.5,3,3.5,4);
            $f = $stat->frequency_distribution_ref(2);
            for (sort {$a <=> $b} keys %$f) {
                print "key = $_, count = $f->{$_}\n";
            }

        prints

            key = 2.5, count = 4
            key = 4, count = 3

        since there are four items less than or equal to 2.5, and three
        items greater than 2.5 and less than or equal to 4.

        `frequency_distribution_ref(\@bins)' provides the bins that are to
        be used for the distribution.
        This allows for non-uniform distributions as well as trimmed or
        sample distributions to be found. `@bins' must be monotonic and must
        contain at least one element. Note that unless the set of bins
        contains the full range of the data, the total counts returned will
        be less than the sample size.

        Calling `frequency_distribution_ref()' with no arguments returns the
        last distribution calculated, if such exists.

    my %hash = $stat->frequency_distribution($partitions);
    my %hash = $stat->frequency_distribution(\@bins);
    my %hash = $stat->frequency_distribution();
        Same as `frequency_distribution_ref()' except that it returns the
        hash flattened into the return list. Kept for compatibility with
        previous versions of Statistics::Descriptive::Discrete; using it is
        discouraged.

        Note: in earlier versions of Statistics::Descriptive::Discrete,
        `frequency_distribution()' behaved differently from the
        Statistics::Descriptive implementation. Any code that uses this
        function should be carefully checked to ensure compatibility with
        the current implementation.

    $stat->get_data();
        Returns a copy of the data array. Note: This array could be very
        large and would thus defeat the purpose of using this module. Make
        sure you really need it before using `get_data()'.

        The returned array contains the values sorted by value. It does not
        preserve the order in which the values were added. Preserving order
        would defeat the purpose of this module, which gives up order
        preservation in exchange for speed and lower memory usage. If order
        is important, use Statistics::Descriptive.

    $stat->clear();
        Clears all data and resets the instance as if it were newly created.

        Effectively the same as

            my $class = ref($stat);
            undef $stat;
            $stat = $class->new();

NOTE
    The interface for this module strives to be identical to
    Statistics::Descriptive.
    Any differences are noted in the description for each method.

BUGS
    *   Code for calculating mode is not as robust as it should be.

TODO
    *   Add rest of methods (at least ones that don't depend on original
        order of data) from Statistics::Descriptive

AUTHOR
    Rhet Turnbull, rturnbull+cpan@gmail.com

CREDIT
    Thanks to the following individuals for finding bugs, providing
    feedback, and submitting changes:

    *   Peter Dienes for finding and fixing a bug in the variance
        calculation.

    *   Bill Dueber for suggesting the add_data_tuple method.

COPYRIGHT
    Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Portions of this code are from Statistics::Descriptive, which is under
    the following copyrights:

    Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    Copyright (c) 1994,1995 Jason Kastner. All rights reserved. This
    program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

SEE ALSO
    Statistics::Descriptive

    Statistics::Discrete