package Statistics::Descriptive;
$Statistics::Descriptive::VERSION = '3.0800';
use strict;
use warnings;

##This module draws heavily from perltoot v0.4 from Tom Christiansen.

use 5.006;

use vars (qw($Tolerance $Min_samples_number));

$Tolerance = 0.0;
$Min_samples_number = 4;

use Statistics::Descriptive::Sparse ();
use Statistics::Descriptive::Full ();

package Statistics::Descriptive;

##All modules return true.
1;

__END__

=pod

=encoding UTF-8

=head1 NAME

Statistics::Descriptive - Module of basic descriptive statistical functions.

=head1 VERSION

version 3.0800

=head1 SYNOPSIS

    use Statistics::Descriptive;
    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1,2,3,4);
    my $mean = $stat->mean();
    my $var = $stat->variance();
    my $tm = $stat->trimmed_mean(.25);
    $Statistics::Descriptive::Tolerance = 1e-10;

=head1 DESCRIPTION

This module provides basic functions used in descriptive statistics.
It has an object oriented design and supports two different types of
data storage and calculation objects: sparse and full. With the sparse
method, none of the data is stored and only a few statistical measures
are available. Using the full method, the entire data set is retained
and additional functions are available.

Whenever a division by zero may occur, the denominator is checked to be
greater than the value C<$Statistics::Descriptive::Tolerance>, which
defaults to 0.0. You may want to change this value to some small
positive value such as 1e-24 in order to obtain error messages in case
of very small denominators.

Many of the methods (both Sparse and Full) cache values so that subsequent
calls with the same arguments are faster.

=head1 METHODS

=head2 Sparse Methods

=over 5

=item $stat = Statistics::Descriptive::Sparse->new();

Create a new sparse statistics object.
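
As a minimal sketch of how a sparse object might be used (the variable
name C<$sparse> and the data values are illustrative assumptions, not
taken from the module's own examples):

    use strict;
    use warnings;
    use Statistics::Descriptive;

    # A sparse object keeps running statistics but not the data itself.
    my $sparse = Statistics::Descriptive::Sparse->new();
    $sparse->add_data(2, 4, 4, 6);

    # Four values were added and their mean is 4.
    printf "count=%d mean=%g\n", $sparse->count(), $sparse->mean();

Because a sparse object does not retain the data, methods that need the
whole data set, such as C<median()>, are only available on
C<Statistics::Descriptive::Full> objects.
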

=item $stat->clear();

Effectively the same as

    my $class = ref($stat);
    undef $stat;
    $stat = new $class;

except more efficient.

=item $stat->add_data(1,2,3);

Adds data to the statistics variable. The cached statistical values are
updated automatically.

=item $stat->count();

Returns the number of data items.

=item $stat->mean();

Returns the mean of the data.

=item $stat->sum();

Returns the sum of the data.

=item $stat->variance();

Returns the variance of the data. Division by n-1 is used.

=item $stat->standard_deviation();

Returns the standard deviation of the data. Division by n-1 is used.

=item $stat->min();

Returns the minimum value of the data set.

=item $stat->mindex();

Returns the index of the minimum value of the data set.

=item $stat->max();

Returns the maximum value of the data set.

=item $stat->maxdex();

Returns the index of the maximum value of the data set.

=item $stat->sample_range();

Returns the sample range (max - min) of the data set.

=back

=head2 Full Methods

Similar to the Sparse Methods above, any Full Method that is called caches
the current result so that it doesn't have to be recalculated. In some
cases, several values can be cached at the same time.

=over 5

=item $stat = Statistics::Descriptive::Full->new();

Create a new statistics object that inherits from
Statistics::Descriptive::Sparse so that it contains all the methods
described above.

=item $stat->add_data(1,2,4,5);

Adds data to the statistics variable. All of the sparse statistical
values are updated and cached. Cached values from Full methods are
deleted since they are no longer valid.

I<Note: Calling add_data with an empty array will delete all of your
Full method cached values!
Cached values for the sparse methods are
not changed.>

=item $stat->add_data_with_samples([{1 => 10}, {2 => 20}, {3 => 30},]);

Add data to the statistics variable and set the number of samples each value
has been built with. The data is the key of each element of the input array
ref, while the value is the number of samples: [{data1 => samples1}, {data2 =>
samples2}, ...].

B<NOTE:> The number of samples is only used by the smoothing function and is
ignored otherwise. It is not equivalent to a repeat count. To repeat
a certain datum more than once, call add_data() like this:

    my $value = 5;
    my $repeat_count = 10;
    $stat->add_data(
        [ ($value) x $repeat_count ]
    );

=item $stat->get_data();

Returns a copy of the data array.

=item $stat->get_data_without_outliers();

Returns a copy of the data array without outliers. The minimum number of
samples needed to apply the outlier filtering is
C<$Statistics::Descriptive::Min_samples_number>, 4 by default.

A function to detect outliers needs to be defined (see C<set_outlier_filter>),
otherwise the function will return an undef value.

The filtering will act only on the most extreme value of the data set
(i.e.: the value with the highest absolute standard deviation from the mean).

If more than one outlier needs to be removed, the filtering
needs to be re-run for the next most extreme value with the initial outlier removed.

This is not always needed since the test (for example Grubbs' test) usually can only detect
the most extreme value. If there is more than one extreme case in a set,
then the standard deviation will be high enough to make neither case an outlier.

=item $stat->set_outlier_filter($code_ref);

Set the function to filter out the outlier.

C<$code_ref> is the reference to the subroutine implementing the filtering
function.

Returns C<undef> for invalid values of C<$code_ref> (i.e.: not defined or not a
code reference), C<1> otherwise.

=over 4

=item

Example #1: Undefined code reference

    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1, 2, 3, 4, 5);

    print $stat->set_outlier_filter(); # => undef

=item

Example #2: Valid code reference

    sub outlier_filter { return $_[1] > 1; }

    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data( 1, 1, 1, 100, 1, );

    print $stat->set_outlier_filter( \&outlier_filter ); # => 1
    my @filtered_data = $stat->get_data_without_outliers();
    # @filtered_data is (1, 1, 1, 1)

In this example both the series and the outlier filter function are very simple.
For more complex series the outlier filter function might be more complex
(see Grubbs' test for outliers).

The outlier filter function receives the Statistics::Descriptive::Full object as its
first parameter and the value of the candidate outlier as its second. Having the object
in the function might be useful for complex filters where statistical properties are
needed (again see Grubbs' test for outliers).

=back

=item $stat->set_smoother({ method => 'exponential', coeff => 0, });

Set the method used to smooth the data and the smoothing coefficient.
See C<Statistics::Smoother> for more details.

=item $stat->get_smoothed_data();

Returns a copy of the smoothed data array.

The smoothing method and coefficient need to be defined (see C<set_smoother>),
otherwise the function will return an undef value.

=item $stat->sort_data();

Sort the stored data and update the mindex and maxdex methods. This
method uses perl's internal sort.
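
A short sketch of the effect of sorting, assuming illustrative data
values (the variable names are hypothetical):

    use strict;
    use warnings;
    use Statistics::Descriptive;

    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(3, 1, 2);

    # Sort the stored data in place, then fetch a copy of it.
    $stat->sort_data();
    my @sorted = $stat->get_data();    # (1, 2, 3)

    # Inspect the indices of the minimum and maximum after sorting.
    printf "mindex=%d maxdex=%d\n", $stat->mindex(), $stat->maxdex();
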

=item $stat->presorted(1);

=item $stat->presorted();

If called with a non-zero argument, this method sets a flag that says
the data is already sorted and need not be sorted again. Since some of
the methods in this class require sorted data, this saves some time.
If you supply sorted data to the object, call this method to prevent
the data from being sorted again. The flag is cleared whenever add_data
is called. Calling the method without an argument returns the value of
the flag.

=item $stat->skewness();

Returns the skewness of the data.
A value of zero is no skew, negative is a left skewed tail,
positive is a right skewed tail.
This is consistent with Excel.

=item $stat->kurtosis();

Returns the kurtosis of the data.
Positive is peaked, negative is flattened.

=item $x = $stat->percentile(25);

=item ($x, $index) = $stat->percentile(25);

Sorts the data and returns the value that corresponds to the
percentile as defined in RFC2330:

=over 4

=item

For example, given the 6 measurements:

-2, 7, 7, 4, 18, -5

Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6, F(7) =
5/6, F(18) = 1, F(239) = 1.

Note that we can recover the different measured values and how many
times each occurred from F(x) -- no information regarding the range
in values is lost. Summarizing measurements using histograms, on the
other hand, in general loses information about the different values
observed, so the EDF is preferred.

Using either the EDF or a histogram, however, we do lose information
regarding the order in which the values were observed. Whether this
loss is potentially significant will depend on the metric being
measured.

We will use the term "percentile" to refer to the smallest value of x
for which F(x) >= a given percentage.
So the 50th percentile of the
example above is 4, since F(4) = 3/6 = 50%; the 25th percentile is
-2, since F(-5) = 1/6 < 25%, and F(-2) = 2/6 >= 25%; the 100th
percentile is 18; and the 0th percentile is -infinity, as is the 15th
percentile, which for ease of handling and backward compatibility is returned
as undef() by the function.

Care must be taken when using percentiles to summarize a sample,
because they can lend an unwarranted appearance of more precision
than is really available. Any such summary must include the sample
size N, because any percentile difference finer than 1/N is below the
resolution of the sample.

=back

(Taken from:
I<RFC2330 - Framework for IP Performance Metrics>,
Section 11.3. Defining Statistical Distributions.
RFC2330 is available from:
L<http://www.ietf.org/rfc/rfc2330.txt> .)

If the percentile method is called in a list context then it will
also return the index of the percentile.

=item $x = $stat->quantile($Type);

Sorts the data and returns estimates of underlying distribution quantiles based on one
or two order statistics from the supplied elements.

This method uses the same algorithm as Excel and the R language (quantile B<type 7>).

The generic function quantile produces sample quantiles corresponding to the given probabilities.

B<$Type> is an integer value between 0 and 4:

    0 => zero quartile (Q0)   : minimal value
    1 => first quartile (Q1)  : lower quartile = lowest cut off (25%) of data = 25th percentile
    2 => second quartile (Q2) : median = it cuts data set in half = 50th percentile
    3 => third quartile (Q3)  : upper quartile = highest cut off (25%) of data, or lowest 75% = 75th percentile
    4 => fourth quartile (Q4) : maximal value

Example:

    my @data = (1..10);
    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(@data);
    print $stat->quantile(0); # => 1
    print $stat->quantile(1); # => 3.25
    print $stat->quantile(2); # => 5.5
    print $stat->quantile(3); # => 7.75
    print $stat->quantile(4); # => 10

=item $stat->median();

Sorts the data and returns the median value of the data.

=item $stat->harmonic_mean();

Returns the harmonic mean of the data. Since the mean is undefined
if any of the data are zero or if the sum of the reciprocals is zero,
it will return undef for both of those cases.

=item $stat->geometric_mean();

Returns the geometric mean of the data.

=item my $mode = $stat->mode();

Returns the mode of the data. The mode is the most commonly occurring datum.
See L<http://en.wikipedia.org/wiki/Mode_%28statistics%29> . If all values
occur only once, then mode() will return undef.

=item $stat->trimmed_mean(ltrim[,utrim]);

C<trimmed_mean(ltrim)> returns the mean with a fraction C<ltrim>
of entries at each end dropped. C<trimmed_mean(ltrim,utrim)>
returns the mean after a fraction C<ltrim> has been removed from the
lower end of the data and a fraction C<utrim> has been removed from the
upper end of the data. This method sorts the data before beginning
to analyze it.

All calls to trimmed_mean() are cached so that they don't have to be
calculated a second time.
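
The two calling forms described above can be sketched as follows. The
data values and variable names are illustrative assumptions, chosen so
that the trimmed fraction corresponds to a whole number of entries:

    use strict;
    use warnings;
    use Statistics::Descriptive;

    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1, 2, 3, 4, 5, 6, 7, 100);

    # Symmetric trim: drop 25% (two of the eight values) from each end,
    # leaving 3, 4, 5, 6, whose mean is 4.5.
    my $symmetric = $stat->trimmed_mean(0.25);

    # Asymmetric trim: keep the lower end, drop 25% from the upper end only,
    # leaving 1 .. 6, whose mean is 3.5.
    my $upper_only = $stat->trimmed_mean(0, 0.25);

    print "symmetric=$symmetric upper_only=$upper_only\n";

Trimming the upper end removes the value 100, which would otherwise
dominate the plain mean of this data set.
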

=item $stat->frequency_distribution_ref($num_partitions);

=item $stat->frequency_distribution_ref(\@bins);

=item $stat->frequency_distribution_ref();

C<frequency_distribution_ref($num_partitions)> slices the data into
C<$num_partitions> sets (where $num_partitions is greater than 1) and counts
the number of items that fall into each partition. It returns a reference to a
hash where the keys are the numerical values of the partitions used. The
minimum value of the data set is not a key and the maximum value of the data
set is always a key. The number of entries for a particular partition key is
the number of items which are greater than the previous partition key and less
than or equal to the current partition key. As an example,

    $stat->add_data(1,1.5,2,2.5,3,3.5,4);
    my $f = $stat->frequency_distribution_ref(2);
    for (sort {$a <=> $b} keys %$f) {
        print "key = $_, count = $f->{$_}\n";
    }

prints

    key = 2.5, count = 4
    key = 4, count = 3

since there are four items less than or equal to 2.5, and three items
greater than 2.5 and less than or equal to 4.

C<frequency_distribution_ref(\@bins)> provides the bins that are to be used
for the distribution. This allows for non-uniform distributions as
well as trimmed or sample distributions to be found. C<@bins> must
be monotonic and must contain at least one element. Note that unless the
set of bins contains the full range of the data, the total counts returned will
be less than the sample size.

Calling C<frequency_distribution_ref()> with no arguments returns the last
distribution calculated, if such exists.

=item my %hash = $stat->frequency_distribution($partitions);

=item my %hash = $stat->frequency_distribution(\@bins);

=item my %hash = $stat->frequency_distribution();

Same as C<frequency_distribution_ref()> except that it returns the hash
clobbered into the return list.
Kept for compatibility reasons with previous
versions of Statistics::Descriptive and using it is discouraged.

=item $stat->least_squares_fit();

=item $stat->least_squares_fit(@x);

C<least_squares_fit()> performs a least squares fit on the data,
assuming a domain of C<@x> or a default of 1..$stat->count(). It
returns an array of four elements C<($q, $m, $r, $rms)> where

=over 4

=item C<$q and $m>

satisfy the equation C<$y = $m*$x + $q>.

=item C<$r>

is the Pearson linear correlation coefficient.

=item C<$rms>

is the root-mean-square error.

=back

In case of error or division by zero, the empty list is returned.

The array that is returned can be "coerced" into a hash structure
by doing the following:

    my %hash = ();
    @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();

Because calling C<least_squares_fit()> with no arguments defaults
to using the current range, there is no caching of the results.

=back

=head1 REPORTING ERRORS

I read my email frequently, but since adopting this module I've added 2
children and 1 dog to my family, so please be patient about my response
times. When reporting errors, please include the following to help
me out:

=over 4

=item *

Your version of perl. This can be obtained by typing C<perl -v> at
the command line.

=item *

Which version of Statistics::Descriptive you're using. As you can
see below, I do make mistakes. Unfortunately for me, right now
there are thousands of CD's with the version of this module with
the bugs in it. Fortunately for you, I'm a very patient module
maintainer.

=item *

Details about what the error is. Try to narrow down the scope
of the problem and send me code that I can run to verify and
track it down.

=back

=head1 AUTHOR

Current maintainer:

Shlomi Fish, L<http://www.shlomifish.org/> , C<shlomif@cpan.org>

Previously:

Colin Kuskie

My email address can be found at http://www.perl.com under Who's Who
or at: https://metacpan.org/author/COLINK .

=head1 CONTRIBUTORS

Fabio Ponciroli & Adzuna Ltd. team (outliers handling)

=head1 REFERENCES

RFC2330, Framework for IP Performance Metrics

The Art of Computer Programming, Volume 2, Donald Knuth.

Handbook of Mathematical Functions, Milton Abramowitz and Irene Stegun.

Probability and Statistics for Engineering and the Sciences, Jay Devore.

=head1 COPYRIGHT

Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.

Copyright (c) 1994,1995 Jason Kastner. All rights
reserved. This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=head1 LICENSE

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=for :stopwords cpan testmatrix url bugtracker rt cpants kwalitee diff irc mailto metadata placeholders metacpan

=head1 SUPPORT

=head2 Websites

The following websites have more information about this module, and may be of help to you. As always,
in addition to those websites please use your favorite search engine to discover more resources.

=over 4

=item *

MetaCPAN

A modern, open-source CPAN search engine, useful to view POD in HTML format.

L<https://metacpan.org/release/Statistics-Descriptive>

=item *

RT: CPAN's Bug Tracker

The RT ( Request Tracker ) website is the default bug/issue tracking system for CPAN.

L<https://rt.cpan.org/Public/Dist/Display.html?Name=Statistics-Descriptive>

=item *

CPANTS

The CPANTS is a website that analyzes the Kwalitee ( code metrics ) of a distribution.

L<http://cpants.cpanauthors.org/dist/Statistics-Descriptive>

=item *

CPAN Testers

The CPAN Testers is a network of smoke testers who run automated tests on uploaded CPAN distributions.

L<http://www.cpantesters.org/distro/S/Statistics-Descriptive>

=item *

CPAN Testers Matrix

The CPAN Testers Matrix is a website that provides a visual overview of the test results for a distribution on various Perls/platforms.

L<http://matrix.cpantesters.org/?dist=Statistics-Descriptive>

=item *

CPAN Testers Dependencies

The CPAN Testers Dependencies is a website that shows a chart of the test results of all dependencies for a distribution.

L<http://deps.cpantesters.org/?module=Statistics::Descriptive>

=back

=head2 Bugs / Feature Requests

Please report any bugs or feature requests by email to C<bug-statistics-descriptive at rt.cpan.org>, or through
the web interface at L<https://rt.cpan.org/Public/Bug/Report.html?Queue=Statistics-Descriptive>. You will be automatically notified of any
progress on the request by the system.

=head2 Source Code

The code is open to the world, and available for you to hack on. Please feel free to browse it and play
with it, or whatever.
If you want to contribute patches, please send me a diff or prod me to pull
from your repository :)

L<https://github.com/shlomif/perl-Statistics-Descriptive>

    git clone git://github.com/shlomif/perl-Statistics-Descriptive.git

=head1 AUTHOR

Shlomi Fish <shlomif@cpan.org>

=head1 BUGS

Please report any bugs or feature requests on the bugtracker website
L<https://github.com/shlomif/perl-Statistics-Descriptive/issues>

When submitting a bug or request, please include a test-file or a
patch to an existing test-file that illustrates the bug or desired
feature.

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 1997 by Jason Kastner, Andrea Spinelli, Colin Kuskie, and others.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut