@c PSPP - a program for statistical analysis.
@c Copyright (C) 2017 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c
@node Data Selection
@chapter Selecting data for analysis

This chapter documents @pspp{} commands that temporarily or permanently
select data records from the active dataset for analysis.

@menu
* FILTER::                      Exclude cases based on a variable.
* N OF CASES::                  Limit the size of the active dataset.
* SAMPLE::                      Select a specified proportion of cases.
* SELECT IF::                   Permanently delete unselected cases.
* SPLIT FILE::                  Do multiple analyses with one command.
* TEMPORARY::                   Make transformations' effects temporary.
* WEIGHT::                      Weight cases by a variable.
@end menu

@node FILTER
@section FILTER
@vindex FILTER

@display
FILTER BY @var{var_name}.
FILTER OFF.
@end display

@cmd{FILTER} allows a boolean-valued variable to be used to select
cases from the data stream for processing.

To set up filtering, specify @subcmd{BY} and a variable name.  Keyword
@subcmd{BY} is optional but recommended.  Cases that have a zero,
system-missing, or user-missing value are excluded from analysis, but
not deleted from the data stream.  Cases with other values are analyzed.
To filter based on a different condition, use
transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
filter variable of the required form, then specify that variable on
@cmd{FILTER}.

@code{FILTER OFF} turns off case filtering.

Filtering takes place immediately before cases pass to a procedure for
analysis.  Only one filter variable may be active at a time.  Normally,
case filtering continues until it is explicitly turned off with @code{FILTER
OFF}.  However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
the next procedure or procedure-like command.
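
As a minimal sketch of filtering on a computed condition, the following
commands build a filter variable with @cmd{COMPUTE} and then filter on
it.  The variable names @code{age}, @code{income}, and @code{adult} are
hypothetical; the sketch assumes an active dataset that already contains
numeric variables @code{age} and @code{income}.

@example
* Hypothetical variables, age and income are assumed to exist.
* The new variable adult is 1 when age is at least 18.
* It is 0 when age is lower and missing when age is missing.
COMPUTE adult = (age >= 18).
FILTER BY adult.

* Only cases with a nonzero, nonmissing value of adult are analyzed.
DESCRIPTIVES income.

* Resume analysis of all cases.
FILTER OFF.
@end example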

@node N OF CASES
@section N OF CASES
@vindex N OF CASES

@display
N [OF CASES] @var{num_of_cases} [ESTIMATED].
@end display

@cmd{N OF CASES} limits the number of cases processed by any
procedures that follow it in the command stream.  @code{N OF CASES
100}, for example, tells @pspp{} to disregard all cases after the first
100.

When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
only the next procedure (@pxref{TEMPORARY}).  Otherwise, cases beyond
the limit specified are not processed by any later procedure.

If the limit specified on @cmd{N OF CASES} is greater than the number
of cases in the active dataset, it has no effect.

When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
IF}, the case limit is applied to the cases obtained after sampling or
case selection, regardless of how @cmd{N OF CASES} is placed relative
to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file.  Thus, the
commands @code{N OF CASES 100} and @code{SAMPLE .5}, given together in
either order, randomly sample approximately half of the active
dataset's cases, then select the first 100 of those sampled.

@cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
number of cases before @cmd{DATA LIST} or another command to read in
data.  @code{ESTIMATED} never limits the number of cases processed by
procedures.  @pspp{} currently does not make use of case count estimates.
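
The interaction between @cmd{N OF CASES} and @cmd{TEMPORARY} can be
sketched as follows.  The variable name @code{income} is hypothetical,
and the comments restate the behavior described in this section rather
than output from an actual run.

@example
* Hypothetical variable, income is assumed to exist.
TEMPORARY.
N OF CASES 100.

* This procedure analyzes only the first 100 cases.
DESCRIPTIVES income.

* The temporary limit no longer applies, so all cases are analyzed.
DESCRIPTIVES income.
@end example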

@node SAMPLE
@section SAMPLE
@vindex SAMPLE

@display
SAMPLE @var{num1} [FROM @var{num2}].
@end display

@cmd{SAMPLE} randomly samples a proportion of the cases in the active
dataset.  Unless it follows @cmd{TEMPORARY}, it operates as a
transformation, permanently removing cases from the active dataset.

The proportion to sample can be expressed as a single number between 0
and 1.  If @var{k} is the number specified, and @var{N} is the number
of currently-selected cases in the active dataset, then after
@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be
selected.

The proportion to sample can also be specified in the style @subcmd{SAMPLE
@var{m} FROM @var{N}}.  With this style, cases are selected as follows:

@enumerate
@item
If @var{N} is equal to the number of currently-selected cases in the
active dataset, exactly @var{m} cases will be selected.

@item
If @var{N} is greater than the number of currently-selected cases in the
active dataset, an equivalent proportion of cases will be selected.

@item
If @var{N} is less than the number of currently-selected cases in the
active dataset, exactly @var{m} cases will be selected @emph{from the first
@var{N} cases in the active dataset.}
@end enumerate
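
To illustrate the three cases above, suppose, hypothetically, that the
active dataset has 200 currently-selected cases.  The commands below are
alternatives rather than a sequence, because each @cmd{SAMPLE} changes
the number of cases that remain.  The counts in the comments follow from
the rules above, reading ``equivalent proportion'' as
@var{m}/@var{N}; they are not output from an actual run.

@example
* Alternative 1, N equals the case count, so exactly 50 cases are selected.
SAMPLE 50 FROM 200.

* Alternative 2, N exceeds the case count, so the proportion 50/400 applies.
* Roughly 25 of the 200 cases are selected.
SAMPLE 50 FROM 400.

* Alternative 3, N is below the case count.
* Exactly 50 cases are selected from the first 100 cases.
SAMPLE 50 FROM 100.
@end example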

@cmd{SAMPLE} and @cmd{SELECT IF} are performed in
the order specified by the syntax file.

@cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
of ordering in the syntax file (@pxref{N OF CASES}).

The same values for @cmd{SAMPLE} may result in different samples.  To
obtain the same sample, use the @code{SET} command to set the random
number seed to the same value before each @cmd{SAMPLE}.  Different
samples may still result when the file is processed on systems with
differing endianness or floating-point formats.  By default, the
random number seed is based on the system time.
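
A minimal sketch of a repeatable sample, using the @code{SEED}
setting of the @code{SET} command.  The seed value 54321 is
arbitrary and chosen only for illustration.

@example
* Fix the seed so that reruns on the same system draw the same sample.
SET SEED=54321.

* Keep approximately one quarter of the currently-selected cases.
SAMPLE .25.
@end example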

@node SELECT IF
@section SELECT IF
@vindex SELECT IF

@display
SELECT IF @var{expression}.
@end display

@cmd{SELECT IF} selects cases for analysis based on the value of
@var{expression}.  Cases not selected are permanently eliminated
from the active dataset, unless @cmd{TEMPORARY} is in effect
(@pxref{TEMPORARY}).

Specify a boolean expression (@pxref{Expressions}).  If the value of the
expression is true for a particular case, the case will be analyzed.  If
the expression has a false or missing value, then the case will be
deleted from the data stream.

Place @cmd{SELECT IF} as early in the command file as
possible.  Cases that are deleted early can be processed more
efficiently in time and space.

When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
(@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
(@pxref{LAG}).
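
As a short sketch of permanent case selection, the commands below keep
only the cases for which the expression is true.  The variable names
@code{age} and @code{income} are hypothetical and are assumed to exist
in the active dataset.

@example
* Hypothetical variables, age and income are assumed to exist.
* Cases where age is below 18 or missing are deleted.
SELECT IF (age >= 18).

* This and every later procedure sees only the remaining cases.
DESCRIPTIVES income.
@end example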

@node SPLIT FILE
@section SPLIT FILE
@vindex SPLIT FILE

@display
SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
SPLIT FILE OFF.
@end display

@cmd{SPLIT FILE} allows multiple sets of data present in one data
file to be analyzed separately using single statistical procedure
commands.

Specify a list of variable names to analyze multiple sets of
data separately.  Groups of adjacent cases having the same values for these
variables are analyzed by statistical procedure commands as one group.
An independent analysis is carried out for each group of cases, and the
variable values for the group are printed along with the analysis.

When a list of variable names is specified, one of the keywords
@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified.  If provided,
the keyword is ignored.

Groups are formed only by @emph{adjacent} cases.  To create a split
using a variable where like values are not adjacent in the active dataset,
you should first sort the data by that variable (@pxref{SORT CASES}).

Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
entire active dataset as a single group of data.

When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).
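
The usual pattern of sorting before splitting can be sketched as
follows.  The variable names @code{region} and @code{income} are
hypothetical and are assumed to exist in the active dataset.

@example
* Hypothetical variables, region and income are assumed to exist.
* Sort first so that cases with equal region values are adjacent.
SORT CASES BY region.
SPLIT FILE BY region.

* One independent analysis is produced for each group of cases.
DESCRIPTIVES income.

* Return to analyzing the dataset as a single group.
SPLIT FILE OFF.
@end example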

@node TEMPORARY
@section TEMPORARY
@vindex TEMPORARY

@display
TEMPORARY.
@end display

@cmd{TEMPORARY} is used to make the effects of transformations
following its execution temporary.  These transformations will
affect only the execution of the next procedure or procedure-like
command.  Their effects will not be saved to the active dataset.

The only specification on @cmd{TEMPORARY} is the command name.

@cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
construct.  It may appear only once between procedures and
procedure-like commands.

Scratch variables cannot be used following @cmd{TEMPORARY}.

An example may help to clarify:

@example
DATA LIST /X 1-2.
BEGIN DATA.
 2
 4
10
15
20
24
END DATA.

COMPUTE X=X/2.

TEMPORARY.
COMPUTE X=X+3.

DESCRIPTIVES X.
DESCRIPTIVES X.
@end example

The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
10.5, 13, 15.  The data read by the second @cmd{DESCRIPTIVES} are 1, 2,
5, 7.5, 10, 12.

@node WEIGHT
@section WEIGHT
@vindex WEIGHT

@display
WEIGHT BY @var{var_name}.
WEIGHT OFF.
@end display

@cmd{WEIGHT} assigns cases varying weights,
changing the frequency distribution of the active dataset.  Execution of
@cmd{WEIGHT} is delayed until data have been read.

If a variable name is specified, @cmd{WEIGHT} causes the values of that
variable to be used as weighting factors for subsequent statistical
procedures.  Use of keyword @subcmd{BY} is optional but recommended.  Weighting
variables must be numeric.  Scratch variables may not be used for
weighting (@pxref{Scratch Variables}).

When @subcmd{OFF} is specified, subsequent statistical procedures will weight all
cases equally.

A positive integer weighting factor @var{w} on a case will yield the
same statistical output as would replicating the case @var{w} times.
A weighting factor of 0 is treated for statistical purposes as if the
case did not exist in the input.  Weighting values need not be
integers, but negative and system-missing values for the weighting
variable are interpreted as weighting factors of 0.  User-missing
values are not treated specially.

When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).

@cmd{WEIGHT} does not cause cases in the active dataset to be
replicated in memory.
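
A small sketch of weighting by a count variable.  The data and the
variable names @code{score} and @code{count} are invented for
illustration; the comments restate the weighting rules given above
rather than output from an actual run.

@example
* Hypothetical data, each row holds a score and how often it occurred.
DATA LIST LIST /score count.
BEGIN DATA.
1 3
2 0
3 2
END DATA.

WEIGHT BY count.

* Treated as the scores 1, 1, 1, 3, 3.  The case with weight 0 is ignored.
FREQUENCIES /VARIABLES=score.

* Weight all cases equally again.
WEIGHT OFF.
@end example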