README

See the file INSTALL for installation instructions.

Contents:
    NAME
    SYNOPSIS
    DESCRIPTION
    ALGORITHM
    LIMITATIONS
    EXAMPLES
    METHODS
    SEE ALSO
    AUTHOR
    LICENSE
    DISCLAIMER

NAME
    Statistics::LineFit - Least squares line fit, weighted or unweighted

SYNOPSIS
     use Statistics::LineFit;
     $lineFit = Statistics::LineFit->new();
     $lineFit->setData (\@xValues, \@yValues) or die "Invalid data";
     ($intercept, $slope) = $lineFit->coefficients();
     defined $intercept or die "Can't fit line if x values are all equal";
     $rSquared = $lineFit->rSquared();
     $meanSquaredError = $lineFit->meanSqError();
     $durbinWatson = $lineFit->durbinWatson();
     $sigma = $lineFit->sigma();
     ($tStatIntercept, $tStatSlope) = $lineFit->tStatistics();
     @predictedYs = $lineFit->predictedYs();
     @residuals = $lineFit->residuals();
     ($varianceIntercept, $varianceSlope) = $lineFit->varianceOfEstimates();

DESCRIPTION
    The Statistics::LineFit module does weighted or unweighted least-squares
    line fitting to two-dimensional data (y = a + b * x). (This is also
    called linear regression.) In addition to the slope and y-intercept, the
    module can return the square of the correlation coefficient (R squared),
    the Durbin-Watson statistic, the mean squared error, sigma, the t
    statistics, the variance of the estimates of the slope and y-intercept,
    the predicted y values and the residuals of the y values. (See the
    METHODS section for a description of these statistics.)

    The module accepts input data in separate x and y arrays or a single 2-D
    array (an array of arrayrefs). The optional weights are input in a
    separate array. The module can optionally verify that the input data and
    weights are valid numbers. If weights are input, the line fit minimizes
    the weighted sum of the squared errors and the following statistics are
    weighted: the correlation coefficient, the Durbin-Watson statistic, the
    mean squared error, sigma and the t statistics.

    The module is state-oriented and caches its results. Once you call the
    setData() method, you can call the other methods in any order or call a
    method several times without invoking redundant calculations. After
    calling setData(), you can modify the input data or weights without
    affecting the module's results.

    The decision whether to use weighting can be based on a priori
    knowledge of the data or on supplemental data. If the data is sparse or
    contains non-random noise, weighting can degrade the solution.
    Weighting is a good option if some points are suspect or less relevant
    (e.g., older terms in a time series, points that are known to have more
    noise).

ALGORITHM
    The least-squares line is the line that minimizes the sum of the squares
    of the y residuals:

     Minimize SUM((y[i] - (a + b * x[i])) ** 2)

    Setting the partial derivatives with respect to a and b to zero yields
    a solution that can be expressed in terms of the means, variances and
    covariances of x and y:

     b = SUM((x[i] - meanX) * (y[i] - meanY)) / SUM((x[i] - meanX) ** 2)

     a = meanY - b * meanX

    Note that a and b are undefined if all the x values are the same.

    If you use weights, each term in the above sums is multiplied by the
    value of the weight for that index. The program normalizes the weights
    (after copying the input values) so that the sum of the weights equals
    the number of points. This minimizes the differences between the
    weighted and unweighted equations.

    Statistics::LineFit uses equations that are mathematically equivalent to
    the above equations and computationally more efficient. The module runs
    in O(N) (linear time).

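    The closed-form solution above is easy to check numerically. The
    following sketch is in Python (not Perl) purely for illustration; the
    line_fit() helper and its sample data are hypothetical and not part of
    the module:

```python
def line_fit(x, y, weights=None):
    """Return (intercept a, slope b) minimizing SUM(w[i] * (y[i] - (a + b*x[i]))**2)."""
    n = len(x)
    if weights is None:
        w = [1.0] * n
    else:
        # Normalize the weights (after copying them) so they sum to the
        # number of points, mirroring what the module's documentation describes.
        total = sum(weights)
        w = [wi * n / total for wi in weights]
    sum_w = sum(w)  # equals n after normalization
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        return None, None  # all x values equal: a and b are undefined
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = line_fit([1, 2, 3, 4], [2, 4, 6, 8])  # exact line y = 2x
```

    Because the weights are normalized, uniform weights of any magnitude
    reproduce the unweighted fit exactly.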
LIMITATIONS
    The regression fails if the input x values are all equal or if the only
    unequal x values have zero weights. This is an inherent limitation of
    fitting a line of the form y = a + b * x. In this case, the module
    issues an error message and methods that return statistical values will
    return undefined values. You can also use the return value of the
    regress() method to check the status of the regression.

    As the sum of the squared deviations of the x values approaches zero,
    the module's results become sensitive to the precision of floating
    point operations on the host system.

    If the x values are not all the same and the apparent "best fit" line
    is vertical, the module will fit a horizontal line. For example, an
    input of (1, 1), (1, 7), (2, 3), (2, 5) returns a slope of zero, an
    intercept of 4 and an R squared of zero. This is correct behavior,
    because this line is the best least-squares fit to the data for the
    given parameterization (y = a + b * x).

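    A quick numeric check (in Python, not Perl) confirms the example above:
    for these four points the cross term SUM((x[i] - meanX) * (y[i] - meanY))
    is exactly zero, so the slope and R squared are both zero.

```python
x = [1, 1, 2, 2]
y = [1, 7, 3, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
# slope == 0.0 and intercept == 4.0; since sxy == 0, R squared is also 0.
```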
    On a 32-bit system the results are accurate to about 11 significant
    digits, depending on the input data. Many of the installation tests will
    fail on a system with word lengths of 16 bits or fewer. (You might want
    to upgrade your old 80286 IBM PC.)

EXAMPLES
  Alternate calling sequence:
     use Statistics::LineFit;
     $lineFit = Statistics::LineFit->new();
     $lineFit->setData(\@x, \@y) or die "Invalid regression data\n";
     if (defined $lineFit->rSquared()
         and $lineFit->rSquared() > $threshold)
     {
         ($intercept, $slope) = $lineFit->coefficients();
         print "Slope: $slope  Y-intercept: $intercept\n";
     }

  Multiple calls with same object, validate input, suppress error messages:
     use Statistics::LineFit;
     $lineFit = Statistics::LineFit->new(1, 1);
     while (1) {
         @xy = read2Dxy();  # User-supplied subroutine
         $lineFit->setData(\@xy);
         ($intercept, $slope) = $lineFit->coefficients();
         if (defined $intercept) {
             print "Slope: $slope  Y-intercept: $intercept\n";
         }
     }

METHODS
    The module is state-oriented and caches its results. Once you call the
    setData() method, you can call the other methods in any order or call a
    method several times without invoking redundant calculations.

    The regression fails if the x values are all the same. In this case, the
    module issues an error message and methods that return statistical
    values will return undefined values. You can also use the return value
    of the regress() method to check the status of the regression.

  new() - create a new Statistics::LineFit object
     $lineFit = Statistics::LineFit->new();
     $lineFit = Statistics::LineFit->new($validate);
     $lineFit = Statistics::LineFit->new($validate, $hush);

     $validate = 1 -> Verify input data is numeric (slower execution)
                 0 -> Don't verify input data (default, faster execution)
     $hush = 1 -> Suppress error messages
           = 0 -> Enable error messages (default)

  coefficients() - Return the slope and y intercept
     ($intercept, $slope) = $lineFit->coefficients();

    The returned list is undefined if the regression fails.

  durbinWatson() - Return the Durbin-Watson statistic
     $durbinWatson = $lineFit->durbinWatson();

    The Durbin-Watson test is a test for first-order autocorrelation in the
    residuals of a time series regression. The Durbin-Watson statistic has a
    range of 0 to 4; a value of 2 indicates there is no autocorrelation.

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted Durbin-Watson statistic.

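    The unweighted statistic is conventionally defined as the sum of squared
    differences of successive residuals over the sum of squared residuals.
    As a hedged sketch (in Python, not Perl; the durbin_watson() helper is
    hypothetical, not the module's code):

```python
def durbin_watson(residuals):
    """Unweighted Durbin-Watson statistic:
    SUM((e[i] - e[i-1])**2) / SUM(e[i]**2)."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Identical residuals (perfect positive autocorrelation) give 0;
# alternating residuals push the statistic toward 4.
durbin_watson([1, -1, 1, -1])  # returns 3.0 for this short series
```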
  meanSqError() - Return the mean squared error
     $meanSquaredError = $lineFit->meanSqError();

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted mean squared error.

  predictedYs() - Return the predicted y values
     @predictedYs = $lineFit->predictedYs();

    The returned list is undefined if the regression fails.

  regress() - Do the least squares line fit (if not already done)
     $lineFit->regress() or die "Regression failed";

    You don't need to call this method because it is invoked by the other
    methods as needed. After you call setData(), you can call regress() at
    any time to get the status of the regression for the current data.

  residuals() - Return predicted y values minus input y values
     @residuals = $lineFit->residuals();

    The returned list is undefined if the regression fails.

  rSquared() - Return the square of the correlation coefficient
     $rSquared = $lineFit->rSquared();

    R squared, also called the square of the Pearson product-moment
    correlation coefficient, is a measure of goodness-of-fit. It is the
    fraction of the variation in Y that can be attributed to the variation
    in X. A perfect fit will have an R squared of 1; fitting a line to the
    vertices of a regular polygon will yield an R squared of zero. Graphical
    displays of data with an R squared of less than about 0.1 do not show a
    visible linear trend.

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted R squared.

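    For a simple line fit, R squared can be written in terms of the same
    sums used in the ALGORITHM section. A sketch (in Python, not Perl; the
    r_squared() helper is an illustration, not the module's code):

```python
def r_squared(x, y):
    """Square of the Pearson correlation: sxy**2 / (sxx * syy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

r_squared([1, 2, 3], [2, 4, 6])      # perfect fit: returns 1.0
r_squared([0, 0, 1, 1], [0, 1, 0, 1])  # square's vertices: returns 0.0
```

    The second call illustrates the regular-polygon remark above: the
    cross term is zero, so R squared is exactly zero.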
  setData() - Initialize (x,y) values and optional weights
     $lineFit->setData(\@x, \@y) or die "Invalid regression data";
     $lineFit->setData(\@x, \@y, \@weights) or die "Invalid regression data";
     $lineFit->setData(\@xy) or die "Invalid regression data";
     $lineFit->setData(\@xy, \@weights) or die "Invalid regression data";

    @xy is an array of arrayrefs; x values are $xy[$i][0], y values are
    $xy[$i][1]. (The module does not access any indices greater than
    $xy[$i][1], so the arrayrefs can point to arrays that are longer than
    two elements.) The method distinguishes the first calling signature
    from the fourth by examining the first argument.

    The optional weights array must be the same length as the data array(s).
    The weights must be non-negative numbers, and at least two of the
    weights must be nonzero. Only the relative size of the weights is
    significant: the program normalizes the weights (after copying the
    input values) so that the sum of the weights equals the number of
    points. If you want to do multiple line fits using the same weights,
    the weights must be passed to each call to setData().

    The method will return zero if the array lengths don't match, there are
    fewer than two data points, any weights are negative or fewer than two
    of the weights are nonzero. If the new() method was called with
    validate = 1, the method will also verify that the data and weights are
    valid numbers. Once you successfully call setData(), the next call to
    any method other than new() or setData() invokes the regression. You
    can modify the input data or weights after calling setData() without
    affecting the module's results.

  sigma() - Return the standard error of the estimate
     $sigma = $lineFit->sigma();

    Sigma is an estimate of the homoscedastic standard deviation of the
    error. Sigma is also known as the standard error of the estimate.

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted standard error.

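    For an unweighted fit, the standard error of the estimate is
    conventionally the root of the sum of squared residuals divided by the
    degrees of freedom, n - 2 (two parameters were fit). A sketch (in
    Python, not Perl; the sigma() helper is an illustration, not the
    module's code):

```python
import math

def sigma(residuals):
    """Standard error of the estimate: sqrt(SUM(e[i]**2) / (n - 2)),
    where n - 2 is the degrees of freedom after fitting two parameters."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))

sigma([1, -1, 1, -1])  # sqrt(4 / 2) = sqrt(2)
```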
  tStatistics() - Return the t statistics
     ($tStatIntercept, $tStatSlope) = $lineFit->tStatistics();

    The t statistic, also called the t ratio or Wald statistic, is used to
    accept or reject a hypothesis using a table of cutoff values computed
    from the t distribution. The t statistic suggests that the estimated
    value is (reasonable, too small, too large) when the t statistic is
    (close to zero, large and positive, large and negative).

    The returned list is undefined if the regression fails. If weights are
    input, the returned values are the weighted t statistics.

  varianceOfEstimates() - Return variances of estimates of intercept, slope
     ($varianceIntercept, $varianceSlope) = $lineFit->varianceOfEstimates();

    Assuming the data are noisy or inaccurate, the intercept and slope
    returned by the coefficients() method are only estimates of the true
    intercept and slope. The varianceOfEstimates() method returns the
    variances of the estimates of the intercept and slope, respectively.
    See Numerical Recipes in C, section 15.2 (Fitting Data to a Straight
    Line), equation 15.2.9.

    The returned list is undefined if the regression fails. If weights are
    input, the returned values are the weighted variances.

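    For the unweighted case, the textbook formulas (algebraically
    equivalent to Numerical Recipes equation 15.2.9 with unit measurement
    errors, scaled by the mean squared residual) can be sketched as follows.
    This is a Python illustration under those assumptions, not the module's
    own code:

```python
def variance_of_estimates(x, residuals):
    """Variances of the intercept and slope estimates for an unweighted fit:
    var(a) = s2 * (1/n + meanX**2 / sxx),  var(b) = s2 / sxx,
    where s2 = SUM(e[i]**2) / (n - 2) and sxx = SUM((x[i] - meanX)**2)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    s2 = sum(e * e for e in residuals) / (n - 2)
    return s2 * (1.0 / n + mean_x ** 2 / sxx), s2 / sxx

# An exact fit (all residuals zero) has zero variance in both estimates.
variance_of_estimates([1, 2, 3, 4], [0, 0, 0, 0])  # returns (0.0, 0.0)
```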
SEE ALSO
     Mendenhall, W., and Sincich, T.L., 2003, A Second Course in Statistics:
       Regression Analysis, 6th ed., Prentice Hall.
     Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T., 1992,
       Numerical Recipes in C: The Art of Scientific Computing, 2nd ed.,
       Cambridge University Press.
     The man page for perl(1).
     The CPAN modules Statistics::OLS, Statistics::GaussHelmert and
       Statistics::Regression.

    Statistics::LineFit is simpler to use than Statistics::GaussHelmert or
    Statistics::Regression. Statistics::LineFit was inspired by and borrows
    some ideas from the venerable Statistics::OLS module.

    The significant differences between Statistics::LineFit and
    Statistics::OLS (version 0.07) are:

    Statistics::LineFit is more robust.
        Statistics::OLS returns incorrect results for certain input
        datasets. Statistics::OLS does not deep copy its input arrays, which
        can lead to subtle bugs. The Statistics::OLS installation test has
        only one test and does not verify that the regression returns
        correct results. In contrast, Statistics::LineFit has over 200
        installation tests that use various datasets/calling sequences to
        verify the accuracy of the regression to within 1.0e-10.

    Statistics::LineFit is faster.
        For a sequence of calls to new(), setData(\@x, \@y) and regress(),
        Statistics::LineFit is faster than Statistics::OLS by factors of
        2.0, 1.6 and 2.4 for array lengths of 5, 100 and 10000,
        respectively.

    Statistics::LineFit can do weighted or unweighted regression.
        Statistics::OLS lacks this option.

    Statistics::LineFit has a better interface.
        Once you call the Statistics::LineFit::setData() method, you can
        call the other methods in any order and call methods multiple times
        without invoking redundant calculations. Statistics::LineFit lets
        you enable or disable data verification or error messages.

    Statistics::LineFit has better code and documentation.
        The code in Statistics::LineFit is more readable, more object
        oriented and more compliant with Perl coding standards than the code
        in Statistics::OLS. The documentation for Statistics::LineFit is
        more detailed and complete.

AUTHOR
    Richard Anderson, cpan(AT)richardanderson(DOT)org,
    http://www.richardanderson.org

LICENSE
    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    in the distribution and available in the CPAN listing for
    Statistics::LineFit (see www.cpan.org or search.cpan.org).

DISCLAIMER
    To the maximum extent permitted by applicable law, the author of this
    module disclaims all warranties, either express or implied, including
    but not limited to implied warranties of merchantability and fitness for
    a particular purpose, with regard to the software and the accompanying
    documentation.