@c PSPP - a program for statistical analysis.
@c Copyright (C) 2017 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c
@node Data Selection
@chapter Selecting data for analysis

This chapter documents @pspp{} commands that temporarily or permanently
select data records from the active dataset for analysis.

@menu
* FILTER::                      Exclude cases based on a variable.
* N OF CASES::                  Limit the size of the active dataset.
* SAMPLE::                      Select a specified proportion of cases.
* SELECT IF::                   Permanently delete selected cases.
* SPLIT FILE::                  Do multiple analyses with one command.
* TEMPORARY::                   Make transformations' effects temporary.
* WEIGHT::                      Weight cases by a variable.
@end menu

@node FILTER
@section FILTER
@vindex FILTER

@display
FILTER BY @var{var_name}.
FILTER OFF.
@end display

@cmd{FILTER} allows a boolean-valued variable to be used to select
cases from the data stream for processing.

To set up filtering, specify @subcmd{BY} and a variable name. Keyword
BY is optional but recommended. Cases which have a zero or system- or
user-missing value are excluded from analysis, but not deleted from the
data stream. Cases with other values are analyzed.
To filter based on a different condition, use
transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
filter variable of the required form, then specify that variable on
@cmd{FILTER}.

@code{FILTER OFF} turns off case filtering.

Filtering takes place immediately before cases pass to a procedure for
analysis. Only one filter variable may be active at a time.
Normally,
case filtering continues until it is explicitly turned off with @code{FILTER
OFF}. However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
the next procedure or procedure-like command.

@node N OF CASES
@section N OF CASES
@vindex N OF CASES

@display
N [OF CASES] @var{num_of_cases} [ESTIMATED].
@end display

@cmd{N OF CASES} limits the number of cases processed by any
procedures that follow it in the command stream. @code{N OF CASES
100}, for example, tells @pspp{} to disregard all cases after the first
100.

When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
only the next procedure (@pxref{TEMPORARY}). Otherwise, cases beyond
the limit specified are not processed by any later procedure.

If the limit specified on @cmd{N OF CASES} is greater than the number
of cases in the active dataset, it has no effect.

When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
IF}, the case limit is applied to the cases obtained after sampling or
case selection, regardless of how @cmd{N OF CASES} is placed relative
to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the
commands @code{N OF CASES 100} and @code{SAMPLE .5} together randomly
sample approximately half of the active dataset's cases, then select the
first 100 of those sampled, regardless of their order in the command
file.

@cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
number of cases before @cmd{DATA LIST} or another command to read in
data. @code{ESTIMATED} never limits the number of cases processed by
procedures. @pspp{} currently does not make use of case count estimates.

@node SAMPLE
@section SAMPLE
@vindex SAMPLE

@display
SAMPLE @var{num1} [FROM @var{num2}].
@end display

@cmd{SAMPLE} randomly samples a proportion of the cases in the active
dataset.
Unless it follows @cmd{TEMPORARY}, it operates as a
transformation, permanently removing cases from the active dataset.

The proportion to sample can be expressed as a single number between 0
and 1. If @var{k} is the number specified, and @var{N} is the number
of currently-selected cases in the active dataset, then after
@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be
selected.

The proportion to sample can also be specified in the style @subcmd{SAMPLE
@var{m} FROM @var{N}}. With this style, cases are selected as follows:

@enumerate
@item
If @var{N} is equal to the number of currently-selected cases in the
active dataset, exactly @var{m} cases will be selected.

@item
If @var{N} is greater than the number of currently-selected cases in the
active dataset, an equivalent proportion of cases will be selected.

@item
If @var{N} is less than the number of currently-selected cases in the
active dataset, exactly @var{m} cases will be selected @emph{from the first
@var{N} cases in the active dataset.}
@end enumerate

@cmd{SAMPLE} and @cmd{SELECT IF} are performed in
the order specified by the syntax file.

@cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
of ordering in the syntax file (@pxref{N OF CASES}).

The same values for @cmd{SAMPLE} may result in different samples. To
obtain the same sample, use the @code{SET} command to set the random
number seed to the same value before each @cmd{SAMPLE}. Different
samples may still result when the file is processed on systems with
differing endianness or floating-point formats. By default, the
random number seed is based on the system time.

@node SELECT IF
@section SELECT IF
@vindex SELECT IF

@display
SELECT IF @var{expression}.
@end display

@cmd{SELECT IF} selects cases for analysis based on the value of
@var{expression}.
Cases not selected are permanently eliminated
from the active dataset, unless @cmd{TEMPORARY} is in effect
(@pxref{TEMPORARY}).

Specify a boolean expression (@pxref{Expressions}). If the value of the
expression is true for a particular case, the case will be analyzed. If
the expression has a false or missing value, then the case will be
deleted from the data stream.

Place @cmd{SELECT IF} as early in the command file as
possible. Cases that are deleted early can be processed more
efficiently in time and space.

When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
(@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
(@pxref{LAG}).

@node SPLIT FILE
@section SPLIT FILE
@vindex SPLIT FILE

@display
SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
SPLIT FILE OFF.
@end display

@cmd{SPLIT FILE} allows multiple sets of data present in one data
file to be analyzed separately with a single statistical procedure
command.

Specify a list of variable names to analyze multiple sets of
data separately. Groups of adjacent cases having the same values for these
variables are analyzed by statistical procedure commands as one group.
An independent analysis is carried out for each group of cases, and the
variable values for the group are printed along with the analysis.

When a list of variable names is specified, one of the keywords
@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either
keyword is ignored.

Groups are formed only by @emph{adjacent} cases. To create a split
using a variable where like values are not adjacent in the working file,
you should first sort the data by that variable (@pxref{SORT CASES}).

Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
entire active dataset as a single group of data.
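
As a minimal sketch of the points above, the following syntax sorts the
data so that like values are adjacent, then runs one analysis per
group. The variable names @code{group} and @code{score} and the data
values are hypothetical, chosen only for illustration:

@example
DATA LIST LIST /group score.
BEGIN DATA.
2 10
1 12
2 14
1 9
END DATA.

SORT CASES BY group.
SPLIT FILE BY group.
DESCRIPTIVES score.
SPLIT FILE OFF.
@end example

Without @cmd{SORT CASES}, each of these four cases would form its own
group, because no two adjacent cases share a value of @code{group}.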

When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).

@node TEMPORARY
@section TEMPORARY
@vindex TEMPORARY

@display
TEMPORARY.
@end display

@cmd{TEMPORARY} is used to make the effects of transformations
following its execution temporary. These transformations will
affect only the execution of the next procedure or procedure-like
command. Their effects will not be saved to the active dataset.

The only specification on @cmd{TEMPORARY} is the command name.

@cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
construct. It may appear only once between procedures and
procedure-like commands.

Scratch variables cannot be used following @cmd{TEMPORARY}.

An example may help to clarify:

@example
DATA LIST /X 1-2.
BEGIN DATA.
 2
 4
10
15
20
24
END DATA.

COMPUTE X=X/2.

TEMPORARY.
COMPUTE X=X+3.

DESCRIPTIVES X.
DESCRIPTIVES X.
@end example

The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
10.5, 13, 15. The data read by the second @cmd{DESCRIPTIVES} are 1, 2,
5, 7.5, 10, 12.

@node WEIGHT
@section WEIGHT
@vindex WEIGHT

@display
WEIGHT BY @var{var_name}.
WEIGHT OFF.
@end display

@cmd{WEIGHT} assigns cases varying weights,
changing the frequency distribution of the active dataset. Execution of
@cmd{WEIGHT} is delayed until data have been read.

If a variable name is specified, @cmd{WEIGHT} causes the values of that
variable to be used as weighting factors for subsequent statistical
procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting
variables must be numeric. Scratch variables may not be used for
weighting (@pxref{Scratch Variables}).

When @subcmd{OFF} is specified, subsequent statistical procedures will weight all
cases equally.

A positive integer weighting factor @var{w} on a case will yield the
same statistical output as would replicating the case @var{w} times.
A weighting factor of 0 is treated for statistical purposes as if the
case did not exist in the input. Weighting values need not be
integers, but negative and system-missing values for the weighting
variable are interpreted as weighting factors of 0. User-missing
values are not treated specially.

When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).

@cmd{WEIGHT} does not cause cases in the active dataset to be
replicated in memory.
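
The following is a minimal sketch of frequency weighting. The
variable names @code{score} and @code{freq} and the data values are
hypothetical, chosen only for illustration:

@example
DATA LIST LIST /score freq.
BEGIN DATA.
1 3
2 1
3 0
END DATA.

WEIGHT BY freq.
FREQUENCIES /VARIABLES=score.
WEIGHT OFF.
@end example

Under the rules described above, @cmd{FREQUENCIES} should count the
value 1 three times and the value 2 once, and should disregard the case
whose weighting factor is 0.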