See the file INSTALL for installation instructions.

Contents:
    NAME
    SYNOPSIS
    DESCRIPTION
    ALGORITHM
    LIMITATIONS
    EXAMPLES
    METHODS
    SEE ALSO
    AUTHOR
    LICENSE
    DISCLAIMER

NAME
    Statistics::LineFit - Least squares line fit, weighted or unweighted

SYNOPSIS
      use Statistics::LineFit;
      $lineFit = Statistics::LineFit->new();
      $lineFit->setData(\@xValues, \@yValues) or die "Invalid data";
      ($intercept, $slope) = $lineFit->coefficients();
      defined $intercept or die "Can't fit line if x values are all equal";
      $rSquared = $lineFit->rSquared();
      $meanSquaredError = $lineFit->meanSqError();
      $durbinWatson = $lineFit->durbinWatson();
      $sigma = $lineFit->sigma();
      ($tStatIntercept, $tStatSlope) = $lineFit->tStatistics();
      @predictedYs = $lineFit->predictedYs();
      @residuals = $lineFit->residuals();
      ($varianceIntercept, $varianceSlope) = $lineFit->varianceOfEstimates();

DESCRIPTION
    The Statistics::LineFit module does weighted or unweighted least-squares
    line fitting to two-dimensional data (y = a + b * x). (This is also
    called linear regression.) In addition to the slope and y-intercept, the
    module can return the square of the correlation coefficient (R squared),
    the Durbin-Watson statistic, the mean squared error, sigma, the t
    statistics, the variance of the estimates of the slope and y-intercept,
    the predicted y values and the residuals of the y values. (See the
    METHODS section for a description of these statistics.)

    The module accepts input data in separate x and y arrays or in a single
    2-D array (an array of arrayrefs). The optional weights are input in a
    separate array. The module can optionally verify that the input data and
    weights are valid numbers. If weights are input, the line fit minimizes
    the weighted sum of the squared errors and the following statistics are
    weighted: the correlation coefficient, the Durbin-Watson statistic, the
    mean squared error, sigma and the t statistics.

    The module is state-oriented and caches its results. Once you call the
    setData() method, you can call the other methods in any order or call a
    method several times without invoking redundant calculations. After
    calling setData(), you can modify the input data or weights without
    affecting the module's results.

    The decision whether to use weighting can be based on a priori knowledge
    of the data or on supplemental data. If the data are sparse or contain
    non-random noise, weighting can degrade the solution. Weighting is a
    good option if some points are suspect or less relevant (e.g., older
    terms in a time series, points that are known to have more noise).

ALGORITHM
    The least-squares line is the line that minimizes the sum of the squares
    of the y residuals:

      Minimize SUM((y[i] - (a + b * x[i])) ** 2)

    Setting the partial derivatives with respect to a and b to zero yields a
    solution that can be expressed in terms of the means, variances and
    covariances of x and y:

      b = SUM((x[i] - meanX) * (y[i] - meanY)) / SUM((x[i] - meanX) ** 2)

      a = meanY - b * meanX

    Note that a and b are undefined if all the x values are the same.

    If you use weights, each term in the above sums is multiplied by the
    value of the weight for that index. The program normalizes the weights
    (after copying the input values) so that the sum of the weights equals
    the number of points. This minimizes the differences between the
    weighted and unweighted equations.

    Statistics::LineFit uses equations that are mathematically equivalent to
    the above equations and computationally more efficient. The module runs
    in O(N) (linear time).
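
    For illustration only (Python rather than Perl, and not the module's
    actual source), the closed-form solution above, including the weight
    normalization described here, can be sketched as:

```python
def line_fit(x, y, weights=None):
    """Least-squares fit of y = a + b*x; returns (a, b), or None if all
    x values are equal (no unique line exists)."""
    n = len(x)
    if weights is None:
        w = [1.0] * n
    else:
        # Normalize so the weights sum to the number of points,
        # mirroring the normalization described above.
        total = sum(weights)
        w = [wi * n / total for wi in weights]
    sum_w = sum(w)  # equals n after normalization
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:  # degenerate: a and b are undefined
        return None
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

    With equal weights the weighted and unweighted fits coincide, which is
    an easy sanity check on the normalization.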

LIMITATIONS
    The regression fails if the input x values are all equal or the only
    unequal x values have zero weights. This is an inherent limitation of
    fitting a line of the form y = a + b * x. In this case, the module
    issues an error message and methods that return statistical values
    return undefined values. You can also use the return value of the
    regress() method to check the status of the regression.

    As the sum of the squared deviations of the x values approaches zero,
    the module's results become sensitive to the precision of floating
    point operations on the host system.

    If the x values are not all the same and the apparent "best fit" line
    is vertical, the module will fit a horizontal line. For example, an
    input of (1, 1), (1, 7), (2, 3), (2, 5) returns a slope of zero, an
    intercept of 4 and an R squared of zero. This is correct behavior
    because this line is the best least-squares fit to the data for the
    given parameterization (y = a + b * x).
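
    This example can be checked with a small sketch (Python, for
    illustration; R squared is computed here from the usual 1 - SSE/SST
    definition, which is an assumption, not taken from the module's source):

```python
def fit_and_r_squared(x, y):
    """Unweighted fit of y = a + b*x, plus R squared = 1 - SSE/SST."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1 - sse / sst
```

    For the four points above, the x deviations and y deviations cancel
    term by term (sxy is zero), so the fitted line is horizontal at the
    mean y value of 4 and explains none of the variation in y.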

    On a 32-bit system the results are accurate to about 11 significant
    digits, depending on the input data. Many of the installation tests will
    fail on a system with word lengths of 16 bits or fewer. (You might want
    to upgrade your old 80286 IBM PC.)

EXAMPLES
  Alternate calling sequence:
      use Statistics::LineFit;
      $lineFit = Statistics::LineFit->new();
      $lineFit->setData(\@x, \@y) or die "Invalid regression data\n";
      if (defined $lineFit->rSquared()
          and $lineFit->rSquared() > $threshold)
      {
          ($intercept, $slope) = $lineFit->coefficients();
          print "Slope: $slope Y-intercept: $intercept\n";
      }

  Multiple calls with same object, validate input, suppress error messages:
      use Statistics::LineFit;
      $lineFit = Statistics::LineFit->new(1, 1);
      while (1) {
          @xy = read2Dxy();  # User-supplied subroutine
          $lineFit->setData(\@xy);
          ($intercept, $slope) = $lineFit->coefficients();
          if (defined $intercept) {
              print "Slope: $slope Y-intercept: $intercept\n";
          }
      }

METHODS
    The module is state-oriented and caches its results. Once you call the
    setData() method, you can call the other methods in any order or call a
    method several times without invoking redundant calculations.

    The regression fails if the x values are all the same. In this case, the
    module issues an error message and methods that return statistical
    values will return undefined values. You can also use the return value
    of the regress() method to check the status of the regression.

  new() - create a new Statistics::LineFit object
      $lineFit = Statistics::LineFit->new();
      $lineFit = Statistics::LineFit->new($validate);
      $lineFit = Statistics::LineFit->new($validate, $hush);

      $validate = 1 -> Verify input data is numeric (slower execution)
                  0 -> Don't verify input data (default, faster execution)
      $hush = 1 -> Suppress error messages
            = 0 -> Enable error messages (default)

  coefficients() - Return the y intercept and slope
      ($intercept, $slope) = $lineFit->coefficients();

    The returned list is undefined if the regression fails.

  durbinWatson() - Return the Durbin-Watson statistic
      $durbinWatson = $lineFit->durbinWatson();

    The Durbin-Watson test is a test for first-order autocorrelation in the
    residuals of a time series regression. The Durbin-Watson statistic has a
    range of 0 to 4; a value of 2 indicates there is no autocorrelation.

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted Durbin-Watson statistic.
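
    A minimal sketch of the standard unweighted Durbin-Watson definition
    (Python, for illustration; the module's weighted variant is not
    reproduced here):

```python
def durbin_watson(residuals):
    """Standard Durbin-Watson statistic: the sum of squared successive
    residual differences divided by the sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den
```

    Residuals with the same sign throughout drive the statistic toward 0,
    while strictly alternating residuals drive it toward 4.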

  meanSqError() - Return the mean squared error
      $meanSquaredError = $lineFit->meanSqError();

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted mean squared error.

  predictedYs() - Return the predicted y values
      @predictedYs = $lineFit->predictedYs();

    The returned list is undefined if the regression fails.

  regress() - Do the least squares line fit (if not already done)
      $lineFit->regress() or die "Regression failed";

    You don't need to call this method because it is invoked by the other
    methods as needed. After you call setData(), you can call regress() at
    any time to get the status of the regression for the current data.

  residuals() - Return predicted y values minus input y values
      @residuals = $lineFit->residuals();

    The returned list is undefined if the regression fails.

  rSquared() - Return the square of the correlation coefficient
      $rSquared = $lineFit->rSquared();

    R squared, also called the square of the Pearson product-moment
    correlation coefficient, is a measure of goodness-of-fit. It is the
    fraction of the variation in y that can be attributed to the variation
    in x. A perfect fit will have an R squared of 1; fitting a line to the
    vertices of a regular polygon will yield an R squared of zero. Graphical
    displays of data with an R squared of less than about 0.1 do not show a
    visible linear trend.

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted correlation coefficient.

  setData() - Initialize (x,y) values and optional weights
      $lineFit->setData(\@x, \@y) or die "Invalid regression data";
      $lineFit->setData(\@x, \@y, \@weights) or die "Invalid regression data";
      $lineFit->setData(\@xy) or die "Invalid regression data";
      $lineFit->setData(\@xy, \@weights) or die "Invalid regression data";

    @xy is an array of arrayrefs; x values are $xy[$i][0], y values are
    $xy[$i][1]. (The module does not access any indices greater than
    $xy[$i][1], so the arrayrefs can point to arrays that are longer than
    two elements.) The method distinguishes the separate-array calling
    signatures from the 2-D array calling signatures by examining the first
    argument.

    The optional weights array must be the same length as the data array(s).
    The weights must be non-negative numbers; at least two of the weights
    must be nonzero. Only the relative size of the weights is significant:
    the program normalizes the weights (after copying the input values) so
    that the sum of the weights equals the number of points. If you want to
    do multiple line fits using the same weights, the weights must be passed
    to each call to setData().

    The method will return zero if the array lengths don't match, there are
    fewer than two data points, any weights are negative or fewer than two
    of the weights are nonzero. If the new() method was called with
    $validate = 1, the method will also verify that the data and weights
    are valid numbers. Once you successfully call setData(), the next call
    to any method other than new() or setData() invokes the regression. You
    can modify the input data or weights after calling setData() without
    affecting the module's results.
  sigma() - Return the standard error of the estimate
      $sigma = $lineFit->sigma();

    Sigma is an estimate of the homoscedastic standard deviation of the
    error. Sigma is also known as the standard error of the estimate.

    The return value is undefined if the regression fails. If weights are
    input, the return value is the weighted standard error.
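
    As a sketch of the textbook unweighted definition (Python, for
    illustration; sqrt(SSE / (n - 2)) is the conventional standard error of
    the estimate for a two-parameter line fit and is assumed here, not taken
    from the module's source):

```python
import math

def standard_error(residuals):
    """Unweighted standard error of the estimate: sqrt(SSE / (n - 2)).
    The divisor n - 2 reflects the two fitted parameters (a and b)."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / (n - 2))
```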

  tStatistics() - Return the t statistics
      ($tStatIntercept, $tStatSlope) = $lineFit->tStatistics();

    The t statistic, also called the t ratio or Wald statistic, is used to
    accept or reject a hypothesis using a table of cutoff values computed
    from the t distribution. The t statistic suggests that the estimated
    value is reasonable, too small or too large when the t statistic is
    close to zero, large and positive, or large and negative, respectively.

    The returned list is undefined if the regression fails. If weights are
    input, the returned values are the weighted t statistics.

  varianceOfEstimates() - Return variances of estimates of intercept, slope
      ($varianceIntercept, $varianceSlope) = $lineFit->varianceOfEstimates();

    Assuming the data are noisy or inaccurate, the intercept and slope
    returned by the coefficients() method are only estimates of the true
    intercept and slope. The varianceOfEstimates() method returns the
    variances of the estimates of the intercept and slope, respectively.
    See Numerical Recipes in C, section 15.2 (Fitting Data to a Straight
    Line), equation 15.2.9.

    The returned list is undefined if the regression fails. If weights are
    input, the returned values are the weighted variances.

SEE ALSO
    Mendenhall, W., and Sincich, T.L., 2003, A Second Course in Statistics:
      Regression Analysis, 6th ed., Prentice Hall.
    Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T.,
      1992, Numerical Recipes in C: The Art of Scientific Computing, 2nd
      ed., Cambridge University Press.
    The man page for perl(1).
    The CPAN modules Statistics::OLS, Statistics::GaussHelmert and
      Statistics::Regression.

    Statistics::LineFit is simpler to use than Statistics::GaussHelmert or
    Statistics::Regression. Statistics::LineFit was inspired by and borrows
    some ideas from the venerable Statistics::OLS module.

    The significant differences between Statistics::LineFit and
    Statistics::OLS (version 0.07) are:

  Statistics::LineFit is more robust.
      Statistics::OLS returns incorrect results for certain input datasets.
      Statistics::OLS does not deep copy its input arrays, which can lead
      to subtle bugs. The Statistics::OLS installation test has only one
      test and does not verify that the regression returns correct results.
      In contrast, Statistics::LineFit has over 200 installation tests that
      use various datasets/calling sequences to verify the accuracy of the
      regression to within 1.0e-10.

  Statistics::LineFit is faster.
      For a sequence of calls to new(), setData(\@x, \@y) and regress(),
      Statistics::LineFit is faster than Statistics::OLS by factors of 2.0,
      1.6 and 2.4 for array lengths of 5, 100 and 10000, respectively.

  Statistics::LineFit can do weighted or unweighted regression.
      Statistics::OLS lacks this option.

  Statistics::LineFit has a better interface.
      Once you call the Statistics::LineFit::setData() method, you can call
      the other methods in any order and call methods multiple times
      without invoking redundant calculations. Statistics::LineFit lets you
      enable or disable data verification or error messages.

  Statistics::LineFit has better code and documentation.
      The code in Statistics::LineFit is more readable, more object
      oriented and more compliant with Perl coding standards than the code
      in Statistics::OLS. The documentation for Statistics::LineFit is more
      detailed and complete.

AUTHOR
    Richard Anderson, cpan(AT)richardanderson(DOT)org,
    http://www.richardanderson.org

LICENSE
    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    in the distribution and available in the CPAN listing for
    Statistics::LineFit (see www.cpan.org or search.cpan.org).

DISCLAIMER
    To the maximum extent permitted by applicable law, the author of this
    module disclaims all warranties, either express or implied, including
    but not limited to implied warranties of merchantability and fitness
    for a particular purpose, with regard to the software and the
    accompanying documentation.