package Statistics::Descriptive;
$Statistics::Descriptive::VERSION = '3.0800';
use strict;
use warnings;

##This module draws heavily from perltoot v0.4 from Tom Christiansen.

use 5.006;

use vars (qw($Tolerance $Min_samples_number));

$Tolerance          = 0.0;
$Min_samples_number = 4;

use Statistics::Descriptive::Sparse ();
use Statistics::Descriptive::Full   ();

package Statistics::Descriptive;

##All modules return true.
1;

__END__

=pod

=encoding UTF-8

=head1 NAME

Statistics::Descriptive - Module of basic descriptive statistical functions.

=head1 VERSION

version 3.0800

=head1 SYNOPSIS

    use Statistics::Descriptive;
    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1,2,3,4);
    my $mean = $stat->mean();
    my $var = $stat->variance();
    my $tm = $stat->trimmed_mean(.25);
    $Statistics::Descriptive::Tolerance = 1e-10;

=head1 DESCRIPTION

This module provides basic functions used in descriptive statistics.
It has an object oriented design and supports two different types of
data storage and calculation objects: sparse and full. With the sparse
method, none of the data is stored and only a few statistical measures
are available. Using the full method, the entire data set is retained
and additional functions are available.

Whenever a division by zero may occur, the denominator is checked to be
greater than the value C<$Statistics::Descriptive::Tolerance>, which
defaults to 0.0. You may want to change this value to some small
positive value such as 1e-24 in order to obtain error messages in case
of very small denominators.
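
The tolerance check can be pictured with a short, self-contained sketch.
The helper C<safe_divide> and its C<$tolerance> argument are hypothetical
names used only for illustration. They are not part of this module, and the
real methods may report the problem differently (this sketch simply returns
C<undef>).

    use strict;
    use warnings;

    # Hypothetical helper, not part of Statistics::Descriptive: divide only
    # when the denominator is strictly greater than the tolerance.
    sub safe_divide {
        my ($numerator, $denominator, $tolerance) = @_;
        return unless $denominator > $tolerance;
        return $numerator / $denominator;
    }

    my $ok      = safe_divide(1, 4,     0.0);      # 0.25
    my $refused = safe_divide(1, 1e-30, 1e-24);    # undef: denominator too small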

Many of the methods (both Sparse and Full) cache values so that subsequent
calls with the same arguments are faster.

=head1 METHODS

=head2 Sparse Methods

=over 5

=item $stat = Statistics::Descriptive::Sparse->new();

Create a new sparse statistics object.

=item $stat->clear();

Effectively the same as

  my $class = ref($stat);
  undef $stat;
  $stat = $class->new;

except more efficient.

=item $stat->add_data(1,2,3);

Adds data to the statistics variable. The cached statistical values are
updated automatically.

=item $stat->count();

Returns the number of data items.

=item $stat->mean();

Returns the mean of the data.

=item $stat->sum();

Returns the sum of the data.

=item $stat->variance();

Returns the variance of the data.  Division by n-1 is used.

=item $stat->standard_deviation();

Returns the standard deviation of the data. Division by n-1 is used.
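
As an illustration of how these values can be maintained without storing
the data, here is a minimal sketch that keeps only a count, a running sum
and a running sum of squares, and derives the variance and standard
deviation with the n-1 divisor. It is not the module's own code, and every
variable name is an assumption made for the example.

    use strict;
    use warnings;

    my ($count, $sum, $sum_sq) = (0, 0, 0);

    # Fold each value into the running totals; the value itself is discarded.
    for my $x (1, 2, 3, 4) {
        $count++;
        $sum    += $x;
        $sum_sq += $x * $x;
    }

    my $mean = $sum / $count;
    # Sample variance: (sum of squares - n * mean^2) / (n - 1)
    my $variance = ($sum_sq - $count * $mean**2) / ($count - 1);
    my $std_dev  = sqrt($variance);

    printf "mean=%.4f variance=%.4f sd=%.4f\n", $mean, $variance, $std_dev;
    # prints: mean=2.5000 variance=1.6667 sd=1.2910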

=item $stat->min();

Returns the minimum value of the data set.

=item $stat->mindex();

Returns the index of the minimum value of the data set.

=item $stat->max();

Returns the maximum value of the data set.

=item $stat->maxdex();

Returns the index of the maximum value of the data set.

=item $stat->sample_range();

Returns the sample range (max - min) of the data set.

=back

=head2 Full Methods

Similar to the Sparse Methods above, any Full Method that is called caches
the current result so that it doesn't have to be recalculated.  In some
cases, several values can be cached at the same time.

=over 5

=item $stat = Statistics::Descriptive::Full->new();

Create a new statistics object that inherits from
Statistics::Descriptive::Sparse so that it contains all the methods
described above.

=item $stat->add_data(1,2,4,5);

Adds data to the statistics variable.  All of the sparse statistical
values are updated and cached.  Cached values from Full methods are
deleted since they are no longer valid.

I<Note:  Calling add_data with an empty array will delete all of your
Full method cached values!  Cached values for the sparse methods are
not changed>

=item $stat->add_data_with_samples([{1 => 10}, {2 => 20}, {3 => 30},]);

Add data to the statistics variable and set the number of samples each value
has been built with. The data is the key of each element of the input array
ref, while the value is the number of samples: [{data1 => samples1}, {data2 =>
samples2}, ...].

B<NOTE:> The number of samples is only used by the smoothing function and is
ignored otherwise. It is not equivalent to repeat count. In order to repeat
a certain datum more than one time call add_data() like this:

    my $value = 5;
    my $repeat_count = 10;
    $stat->add_data(
        [ ($value) x $repeat_count ]
    );

=item $stat->get_data();

Returns a copy of the data array.

=item $stat->get_data_without_outliers();

Returns a copy of the data array without outliers. The minimum number of
samples required to apply the outlier filtering is
C<$Statistics::Descriptive::Min_samples_number>, 4 by default.

A function to detect outliers needs to be defined (see C<set_outlier_filter>),
otherwise the function will return an undef value.

The filtering will act only on the most extreme value of the data set
(i.e., the value with the largest absolute deviation from the mean).

If more than one outlier needs to be removed, the filtering
needs to be re-run for the next most extreme value with the initial outlier removed.

This is not always needed since the test (for example Grubbs' test) usually can only detect
the most extreme value. If there is more than one extreme case in a set,
then the standard deviation will be high enough to make neither case an outlier.

=item $stat->set_outlier_filter($code_ref);

Set the function to filter out the outlier.

C<$code_ref> is the reference to the subroutine implementing the filtering
function.

Returns C<undef> for invalid values of C<$code_ref> (i.e.: not defined or not a
code reference), C<1> otherwise.

=over 4

=item

Example #1: Undefined code reference

    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data(1, 2, 3, 4, 5);

    print $stat->set_outlier_filter(); # => undef

=item

Example #2: Valid code reference

    sub outlier_filter { return $_[1] > 1; }

    my $stat = Statistics::Descriptive::Full->new();
    $stat->add_data( 1, 1, 1, 100, 1, );

    print $stat->set_outlier_filter( \&outlier_filter ); # => 1
    my @filtered_data = $stat->get_data_without_outliers();
    # @filtered_data is (1, 1, 1, 1)

In this example both the series and the outlier filter function are very simple.
For more complex series the outlier filter function might be more complex
(see Grubbs' test for outliers).

The outlier filter function receives the Statistics::Descriptive::Full object as its first parameter
and the value of the candidate outlier as its second. Having the object in the function
can be useful for complex filters that need statistical properties of the data (again, see Grubbs' test for outliers).

=back
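
A filter that uses the object passed as its first argument might look like
the following sketch. It flags the candidate when it lies more than a fixed
number of standard deviations from the mean. The name C<zscore_filter> and
the threshold of 1.5 are assumptions chosen for illustration only; a real
Grubbs' test would take its critical value from a table that depends on the
sample size and the significance level.

    use strict;
    use warnings;

    # Hypothetical threshold; not a Grubbs' critical value.
    my $THRESHOLD = 1.5;

    # Called with the statistics object and the candidate outlier.
    # Returns 1 when the candidate should be treated as an outlier.
    sub zscore_filter {
        my ($stat, $candidate) = @_;
        my $sd = $stat->standard_deviation();
        return 0 unless defined $sd && $sd > 0;
        return abs($candidate - $stat->mean()) / $sd > $THRESHOLD ? 1 : 0;
    }

    # Usage, given a Statistics::Descriptive::Full object in $stat:
    #   $stat->set_outlier_filter( \&zscore_filter );
    #   my @filtered = $stat->get_data_without_outliers();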

=item $stat->set_smoother({ method => 'exponential', coeff => 0, });

Set the method used to smooth the data and the smoothing coefficient.
See C<Statistics::Descriptive::Smoother> for more details.

=item $stat->get_smoothed_data();

Returns a copy of the smoothed data array.

The smoothing method and coefficient need to be defined (see C<set_smoother>),
otherwise the function will return an undef value.

=item $stat->sort_data();

Sort the stored data and update the mindex and maxdex methods.  This
method uses perl's internal sort.

=item $stat->presorted(1);

=item $stat->presorted();

If called with a non-zero argument, this method sets a flag that says
the data is already sorted and need not be sorted again.  Since some of
the methods in this class require sorted data, this saves some time.
If you supply sorted data to the object, call this method to prevent
the data from being sorted again. The flag is cleared whenever add_data
is called.  Calling the method without an argument returns the value of
the flag.

=item $stat->skewness();

Returns the skewness of the data.
A value of zero is no skew, negative is a left skewed tail,
positive is a right skewed tail.
This is consistent with Excel.

=item $stat->kurtosis();

Returns the kurtosis of the data.
Positive is peaked, negative is flattened.

=item $x = $stat->percentile(25);

=item ($x, $index) = $stat->percentile(25);

Sorts the data and returns the value that corresponds to the
percentile as defined in RFC2330:

=over 4

=item

For example, given the 6 measurements:

-2, 7, 7, 4, 18, -5

Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6, F(7) =
5/6, F(18) = 1, F(239) = 1.

Note that we can recover the different measured values and how many
times each occurred from F(x) -- no information regarding the range
in values is lost.  Summarizing measurements using histograms, on the
other hand, in general loses information about the different values
observed, so the EDF is preferred.

Using either the EDF or a histogram, however, we do lose information
regarding the order in which the values were observed.  Whether this
loss is potentially significant will depend on the metric being
measured.

We will use the term "percentile" to refer to the smallest value of x
for which F(x) >= a given percentage.  So the 50th percentile of the
example above is 4, since F(4) = 3/6 = 50%; the 25th percentile is
-2, since F(-5) = 1/6 < 25%, and F(-2) = 2/6 >= 25%; the 100th
percentile is 18; and the 0th percentile is -infinity, as is the 15th
percentile, which for ease of handling and backward compatibility is returned
as undef() by the function.

Care must be taken when using percentiles to summarize a sample,
because they can lend an unwarranted appearance of more precision
than is really available.  Any such summary must include the sample
size N, because any percentile difference finer than 1/N is below the
resolution of the sample.

=back

(Taken from:
I<RFC2330 - Framework for IP Performance Metrics>,
Section 11.3.  Defining Statistical Distributions.
RFC2330 is available from:
L<http://www.ietf.org/rfc/rfc2330.txt> .)

If the percentile method is called in a list context then it will
also return the index of the percentile.
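
The rule "smallest value of x for which F(x) >= a given percentage" can be
sketched in a few lines. This is a direct reading of the quoted definition
and not the module's own code. The name C<edf_percentile> is hypothetical,
and edge cases in the real method may be handled differently.

    use strict;
    use warnings;
    use POSIX qw(ceil);

    # Smallest measurement x for which F(x) >= $percent, where F is the
    # empirical distribution function of @data. For 0 percent the answer
    # is -infinity, which this sketch reports as undef.
    sub edf_percentile {
        my ($percent, @data) = @_;
        my @sorted = sort { $a <=> $b } @data;
        # F(x) >= percent/100 first holds at the k-th smallest value,
        # where k = ceil(percent * n / 100).
        my $k = ceil($percent * scalar(@sorted) / 100);
        return if $k < 1;
        return $sorted[$k - 1];
    }

    my @measurements = (-2, 7, 7, 4, 18, -5);
    print edf_percentile(50,  @measurements), "\n";    # 4
    print edf_percentile(25,  @measurements), "\n";    # -2
    print edf_percentile(100, @measurements), "\n";    # 18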

=item $x = $stat->quantile($Type);

Sorts the data and returns estimates of underlying distribution quantiles based on one
or two order statistics from the supplied elements.

This method uses the same algorithm as Excel and the R language (quantile B<type 7>).

The generic function quantile produces sample quantiles corresponding to the given probabilities.

B<$Type> is an integer value between 0 and 4:

  0 => zero quartile (Q0) : minimal value
  1 => first quartile (Q1) : lower quartile = lowest cut off (25%) of data = 25th percentile
  2 => second quartile (Q2) : median = it cuts data set in half = 50th percentile
  3 => third quartile (Q3) : upper quartile = highest cut off (25%) of data, or lowest 75% = 75th percentile
  4 => fourth quartile (Q4) : maximal value

Example :

  my @data = (1..10);
  my $stat = Statistics::Descriptive::Full->new();
  $stat->add_data(@data);
  print $stat->quantile(0); # => 1
  print $stat->quantile(1); # => 3.25
  print $stat->quantile(2); # => 5.5
  print $stat->quantile(3); # => 7.75
  print $stat->quantile(4); # => 10
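
For readers unfamiliar with the type 7 rule, the following stand-alone
sketch shows the interpolation it performs. It is written from the usual
definition of type 7 quantiles rather than taken from this module, and the
name C<quantile_type7> is hypothetical. It takes a probability between 0
and 1 instead of the quartile number used by C<quantile()>.

    use strict;
    use warnings;

    # Type 7 quantile: interpolate linearly between the two order
    # statistics that surround position (n - 1) * p in the sorted data.
    sub quantile_type7 {
        my ($p, @data) = @_;
        my @sorted = sort { $a <=> $b } @data;
        my $h      = (scalar(@sorted) - 1) * $p;
        my $lower  = int($h);
        return $sorted[-1] if $lower >= $#sorted;
        my $fraction = $h - $lower;
        return $sorted[$lower]
            + $fraction * ($sorted[$lower + 1] - $sorted[$lower]);
    }

    print quantile_type7($_ / 4, 1 .. 10), "\n" for 0 .. 4;
    # prints 1, 3.25, 5.5, 7.75 and 10, one per line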

=item $stat->median();

Sorts the data and returns the median value of the data.

=item $stat->harmonic_mean();

Returns the harmonic mean of the data.  Since the mean is undefined
if any of the data are zero or if the sum of the reciprocals is zero,
it will return undef for both of those cases.
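
The two undefined cases are easiest to see in code. The sketch below
computes the harmonic mean as the number of values divided by the sum of
their reciprocals. It is an illustration only; the name
C<harmonic_mean_sketch> is hypothetical, and the module's own method may
apply C<$Statistics::Descriptive::Tolerance> instead of the exact
comparisons with zero used here.

    use strict;
    use warnings;

    # Harmonic mean: n divided by the sum of reciprocals.
    # Returns undef in the two undefined cases.
    sub harmonic_mean_sketch {
        my @data = @_;
        return unless @data;
        my $reciprocal_sum = 0;
        for my $x (@data) {
            return if $x == 0;            # 1/0 is undefined
            $reciprocal_sum += 1 / $x;
        }
        return if $reciprocal_sum == 0;   # e.g. the data (1, -1)
        return scalar(@data) / $reciprocal_sum;
    }

    my $hm = harmonic_mean_sketch(1, 2, 4);    # 3 / 1.75, about 1.714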

=item $stat->geometric_mean();

Returns the geometric mean of the data.

=item my $mode = $stat->mode();

Returns the mode of the data. The mode is the most commonly occurring datum.
See L<http://en.wikipedia.org/wiki/Mode_%28statistics%29> . If all values
occur only once, then mode() will return undef.

=item $stat->trimmed_mean(ltrim[,utrim]);

C<trimmed_mean(ltrim)> returns the mean with a fraction C<ltrim>
of entries at each end dropped. C<trimmed_mean(ltrim,utrim)>
returns the mean after a fraction C<ltrim> has been removed from the
lower end of the data and a fraction C<utrim> has been removed from the
upper end of the data.  This method sorts the data before beginning
to analyze it.

All calls to trimmed_mean() are cached so that they don't have to be
calculated a second time.

=item $stat->frequency_distribution_ref($num_partitions);

=item $stat->frequency_distribution_ref(\@bins);

=item $stat->frequency_distribution_ref();

C<frequency_distribution_ref($num_partitions)> slices the data into
C<$num_partitions> sets (where $num_partitions is greater than 1) and counts
the number of items that fall into each partition. It returns a reference to a
hash where the keys are the numerical values of the partitions used. The
minimum value of the data set is not a key and the maximum value of the data
set is always a key. The number of entries for a particular partition key is
the number of items which are greater than the previous partition key and less
than or equal to the current partition key. As an example,

   $stat->add_data(1,1.5,2,2.5,3,3.5,4);
   $f = $stat->frequency_distribution_ref(2);
   for (sort {$a <=> $b} keys %$f) {
      print "key = $_, count = $f->{$_}\n";
   }

prints

   key = 2.5, count = 4
   key = 4, count = 3

since there are four items less than or equal to 2.5, and three items
greater than 2.5 and less than or equal to 4.

C<frequency_distribution_ref(\@bins)> provides the bins that are to be used
for the distribution.  This allows for non-uniform distributions as
well as trimmed or sample distributions to be found.  C<@bins> must
be monotonic and must contain at least one element.  Note that unless the
set of bins contains the full range of the data, the total counts returned will
be less than the sample size.

Calling C<frequency_distribution_ref()> with no arguments returns the last
distribution calculated, if such exists.
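
The counting rule for explicit bins can be sketched without the module.
The helper name C<bin_counts> is hypothetical. The sketch assumes that a
value is counted in the first bin whose limit is greater than or equal to
it, so values below the smallest limit land in the first bin and values
above the largest limit are dropped.

    use strict;
    use warnings;

    # Count each value in the first bin whose limit is >= the value.
    # Values above the largest limit are not counted at all.
    sub bin_counts {
        my ($limits, @data) = @_;
        my @sorted_limits = sort { $a <=> $b } @$limits;
        my %count = map { $_ => 0 } @sorted_limits;
        VALUE:
        for my $value (@data) {
            for my $limit (@sorted_limits) {
                if ($value <= $limit) {
                    $count{$limit}++;
                    next VALUE;
                }
            }
        }
        return \%count;
    }

    my $f = bin_counts([2.5, 4], 1, 1.5, 2, 2.5, 3, 3.5, 4);
    print "key = $_, count = $f->{$_}\n" for sort { $a <=> $b } keys %$f;
    # prints:
    #   key = 2.5, count = 4
    #   key = 4, count = 3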

=item my %hash = $stat->frequency_distribution($partitions);

=item my %hash = $stat->frequency_distribution(\@bins);

=item my %hash = $stat->frequency_distribution();

Same as C<frequency_distribution_ref()> except that it returns the hash
flattened into the return list. It is kept for compatibility with previous
versions of Statistics::Descriptive, and using it is discouraged.

=item $stat->least_squares_fit();

=item $stat->least_squares_fit(@x);

C<least_squares_fit()> performs a least squares fit on the data,
assuming a domain of C<@x> or a default of C<< 1..$stat->count() >>.  It
returns an array of four elements C<($q, $m, $r, $rms)> where

=over 4

=item C<$q> and C<$m>

satisfy the equation C<$y = $m*$x + $q>.

=item C<$r>

is the Pearson linear correlation coefficient.

=item C<$rms>

is the root-mean-square error.

=back

In case of error or division by zero, the empty list is returned.

The array that is returned can be "coerced" into a hash structure
by doing the following:

  my %hash = ();
  @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();

Because calling C<least_squares_fit()> with no arguments defaults
to using the current range, there is no caching of the results.
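
To make the four returned values concrete, here is a stand-alone sketch of
an ordinary least squares fit using the textbook formulas. It is not the
module's implementation. The name C<fit_line> is hypothetical, and dividing
the squared error by n (rather than, say, n-2) when computing C<$rms> is an
assumption.

    use strict;
    use warnings;

    # Fit y = m*x + q by ordinary least squares.
    # Returns ($q, $m, $r, $rms), or the empty list on a zero denominator.
    sub fit_line {
        my ($x, $y) = @_;    # two array references of equal length
        my $n = @$x;
        my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
        for my $i (0 .. $n - 1) {
            $sx  += $x->[$i];
            $sy  += $y->[$i];
            $sxx += $x->[$i]**2;
            $syy += $y->[$i]**2;
            $sxy += $x->[$i] * $y->[$i];
        }
        my $var_x = $n * $sxx - $sx**2;
        my $var_y = $n * $syy - $sy**2;
        return if $n == 0 || $var_x == 0 || $var_y == 0;

        my $m = ($n * $sxy - $sx * $sy) / $var_x;
        my $q = ($sy - $m * $sx) / $n;
        my $r = ($n * $sxy - $sx * $sy) / sqrt($var_x * $var_y);

        # Root-mean-square of the residuals y - (m*x + q).
        my $squared_error = 0;
        $squared_error += ($y->[$_] - ($m * $x->[$_] + $q))**2 for 0 .. $n - 1;
        my $rms = sqrt($squared_error / $n);

        return ($q, $m, $r, $rms);
    }

    my ($q, $m, $r, $rms) = fit_line([1 .. 4], [3, 5, 7, 9]);
    # $q = 1, $m = 2, $r = 1, $rms = 0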

=back

=head1 REPORTING ERRORS

I read my email frequently, but since adopting this module I've added 2
children and 1 dog to my family, so please be patient about my response
times.  When reporting errors, please include the following to help
me out:

=over 4

=item *

Your version of perl.  This can be obtained by typing C<perl -v> at
the command line.

=item *

Which version of Statistics::Descriptive you're using.  As you can
see below, I do make mistakes.  Unfortunately for me, right now
there are thousands of CDs with the version of this module with
the bugs in it.  Fortunately for you, I'm a very patient module
maintainer.

=item *

Details about what the error is.  Try to narrow down the scope
of the problem and send me code that I can run to verify and
track it down.

=back

=head1 AUTHOR

Current maintainer:

Shlomi Fish, L<http://www.shlomifish.org/> , C<shlomif@cpan.org>

Previously:

Colin Kuskie

My email address can be found at http://www.perl.com under Who's Who
or at: https://metacpan.org/author/COLINK .

=head1 CONTRIBUTORS

Fabio Ponciroli & Adzuna Ltd. team (outliers handling)

=head1 REFERENCES

RFC2330, Framework for IP Performance Metrics

The Art of Computer Programming, Volume 2, Donald Knuth.

Handbook of Mathematical Functions, Milton Abramowitz and Irene Stegun.

Probability and Statistics for Engineering and the Sciences, Jay Devore.

=head1 COPYRIGHT

Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.

Copyright (c) 1994,1995 Jason Kastner. All rights
reserved.  This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=head1 LICENSE

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=for :stopwords cpan testmatrix url bugtracker rt cpants kwalitee diff irc mailto metadata placeholders metacpan

=head1 SUPPORT

=head2 Websites

The following websites have more information about this module, and may be of help to you. As always,
in addition to those websites please use your favorite search engine to discover more resources.

=over 4

=item *

MetaCPAN

A modern, open-source CPAN search engine, useful to view POD in HTML format.

L<https://metacpan.org/release/Statistics-Descriptive>

=item *

RT: CPAN's Bug Tracker

The RT ( Request Tracker ) website is the default bug/issue tracking system for CPAN.

L<https://rt.cpan.org/Public/Dist/Display.html?Name=Statistics-Descriptive>

=item *

CPANTS

The CPANTS is a website that analyzes the Kwalitee ( code metrics ) of a distribution.

L<http://cpants.cpanauthors.org/dist/Statistics-Descriptive>

=item *

CPAN Testers

The CPAN Testers is a network of smoke testers who run automated tests on uploaded CPAN distributions.

L<http://www.cpantesters.org/distro/S/Statistics-Descriptive>

=item *

CPAN Testers Matrix

The CPAN Testers Matrix is a website that provides a visual overview of the test results for a distribution on various Perls/platforms.

L<http://matrix.cpantesters.org/?dist=Statistics-Descriptive>

=item *

CPAN Testers Dependencies

The CPAN Testers Dependencies is a website that shows a chart of the test results of all dependencies for a distribution.

L<http://deps.cpantesters.org/?module=Statistics::Descriptive>

=back

=head2 Bugs / Feature Requests

Please report any bugs or feature requests by email to C<bug-statistics-descriptive at rt.cpan.org>, or through
the web interface at L<https://rt.cpan.org/Public/Bug/Report.html?Queue=Statistics-Descriptive>. You will be automatically notified of any
progress on the request by the system.

=head2 Source Code

The code is open to the world, and available for you to hack on. Please feel free to browse it and play
with it, or whatever. If you want to contribute patches, please send me a diff or prod me to pull
from your repository :)

L<https://github.com/shlomif/perl-Statistics-Descriptive>

  git clone https://github.com/shlomif/perl-Statistics-Descriptive.git

=head1 AUTHOR

Shlomi Fish <shlomif@cpan.org>

=head1 BUGS

Please report any bugs or feature requests on the bugtracker website
L<https://github.com/shlomif/perl-Statistics-Descriptive/issues>

When submitting a bug or request, please include a test-file or a
patch to an existing test-file that illustrates the bug or desired
feature.

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 1997 by Jason Kastner, Andrea Spinelli, Colin Kuskie, and others.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut