\chapter{What is a dataset?}
\label{chap:dataset}

A dataset is a memory area designed to hold the data you want to work
on, if any. It may be thought of as a big global variable, containing a
(possibly huge) matrix of data and a hefty collection of metadata.

\app{R} users may think that a dataset is similar to what you get when
you \texttt{attach} a data frame in \app{R}. Not really: in hansl, you
cannot have more than one dataset open at the same time. That's why we
talk about \emph{the} dataset.

When a dataset is present in memory (that is, ``open''), a number of
objects become available to your hansl script in a transparent and
convenient way. First of all, the data themselves: the columns of the
dataset matrix are called \emph{series}, which will be described in
section \ref{sec:series}; sometimes, you will want to organize one or
more series into a \emph{list} (section \ref{sec:lists}). Additionally,
you can use, as read-only global variables, certain scalars or
matrices, such as the number of observations, the number of
variables, the nature of your dataset (cross-sectional, time series or
panel), and so on. These are called \emph{accessors}, and will be
discussed in section \ref{sec:accessors}.

You can open a dataset by reading data from a disk file, via the
\cmd{open} command, or by creating one from scratch.

\section{Creating a dataset from scratch}

The primary commands in this context are \cmd{nulldata} and
\cmd{setobs}. For example:
\begin{code}
set echo off
set messages off

set seed 443322  # initialize the random number generator
nulldata 240     # stipulate how long your series will be
setobs 12 1995:1 # define as monthly dataset, starting Jan 1995
\end{code}

For more details see \GUG, and the \GCR\ for the \cmd{nulldata} and
\cmd{setobs} commands.
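As a minimal sketch of what might follow the snippet above, once the
empty dataset exists you can fill it with artificial series via the
built-in random number functions (the series names \texttt{x} and
\texttt{y} here are arbitrary):
\begin{code}
nulldata 240
setobs 12 1995:1

series x = normal()   # standard normal pseudo-random draws
series y = uniform()  # uniform draws on (0,1)
\end{code}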
The only important thing to say at this point,
however, is that you can resize your dataset and/or change some of its
characteristics, such as its periodicity, at nearly any point in
your script if necessary.

Once your dataset is in place, you can start populating it with
series, either by reading them from files or by generating them via
appropriate commands and functions.

\section{Reading a dataset from a file}

The primary commands here are \cmd{open}, \cmd{append} and \cmd{join}.

The \cmd{open} command is what you'll want to use in most cases. It
transparently handles a wide variety of formats (native, CSV,
spreadsheet, and data files produced by other packages such as
\textsf{Stata}, \textsf{Eviews}, \textsf{SPSS} and \textsf{SAS}) and
also takes care of setting up the dataset for you automatically.
\begin{code}
  open mydata.gdt    # native format
  open yourdata.dta  # Stata format
  open theirdata.xls # Excel format
\end{code}

The \cmd{open} command can also be used to read data off the
Internet, by using a URL instead of a filename, as in
\begin{code}
  open http://someserver.com/somedata.csv
\end{code}

The \textit{Gretl User's Guide} describes the requirements that plain
text data files of the ``CSV'' type must meet for direct importation
by gretl. It also describes gretl's native data formats (XML-based and
binary).

The \cmd{append} and \cmd{join} commands can be used to add further
series from file to a previously opened dataset. The \cmd{join}
command is extremely flexible and has a chapter to itself in
\GUG.

\section{Saving datasets}

The \cmd{store} command is used to write the current dataset (or a
subset) out to a file. Besides writing in gretl's native formats,
\cmd{store} can also be used to export data as CSV or in the format of
\textsf{R}. Series can be written out as matrices using the
\texttt{mwrite} function.
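A brief sketch of both saving routes follows; the file names are
arbitrary, and the series \texttt{x} and \texttt{y} are assumed to
exist in the current dataset:
\begin{code}
store mydata.gdt     # save in gretl's native format
store mydata.csv     # export as CSV (format inferred from extension)

matrix X = {x, y}    # collect two series into a matrix
mwrite(X, "X.mat")   # write the matrix to file
\end{code}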
If you have special requirements that are
not met by \cmd{store} or \cmd{mwrite}, it is possible to use
\cmd{outfile} plus \cmd{printf} (see chapter~\ref{chap:formatting})
to gain full control over the way data are saved.


\section{The \cmd{smpl} command}

Once you have opened a dataset somehow, the \cmd{smpl} command allows
you to discard observations selectively, so that your series will
contain only the observations you want (automatically changing the
dimension of the dataset in the process). See chapter 4 in \GUG\ for
further information.\footnote{Users with a Stata background may find
  the hansl way of doing things a little disconcerting at first. In
  hansl, you first restrict your sample through the \cmd{smpl}
  command, which applies until further notice, then you do what you
  have to. There is no equivalent to Stata's \texttt{if} clause
  attached to commands.}

There are basically three variants of the \cmd{smpl} command:
\begin{enumerate}
\item Selecting a contiguous subset of observations: this will be
  mostly useful with time-series datasets. For example:
  \begin{code}
  smpl 4 122        # select observations from 4 to 122
  smpl 1984:1 2008:4 # the so-called "Great Moderation" period
  smpl 2008-01-01 ; # from January 1st, 2008 onwards
  \end{code}
\item Selecting observations on the basis of some criterion: this is
  typically what you want with cross-sectional datasets. Example:
  \begin{code}
  smpl male == 1 --restrict # males only
  smpl male == 1 && age < 30 --restrict # just the young guys
  smpl employed --dummy # via a dummy variable
  \end{code}
  Note that, in this context, restrictions go ``on top of'' previous
  ones. In order to start from scratch, you either reset the full
  sample via \texttt{smpl full} or use the \option{replace} option
  along with \option{restrict}.
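  As a minimal sketch of how restrictions stack and how to reset them
  (assuming series \texttt{male} and \texttt{age} exist):
  \begin{code}
  smpl male == 1 --restrict           # males only
  smpl age < 30 --restrict            # stacks: now young males only
  smpl age < 30 --restrict --replace  # afresh: all young people
  smpl full                           # restore the full sample
  \end{code}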
\item Restricting the active dataset to some observations so that a
  certain effect is achieved automatically: for example, drawing a
  random subsample, or ensuring that all rows that have missing
  observations are automatically excluded. This is achieved via the
  \option{no-missing}, \option{contiguous}, and \option{random}
  options.
\end{enumerate}

In the context of panel datasets, some extra qualifications have to be
made; see \GUG.

\section{Dataset accessors}
\label{sec:accessors}

Several characteristics of the current dataset can be determined by
reference to built-in accessor (``dollar'') variables. The main ones,
which all return scalar values, are shown in
Table~\ref{tab:dataset-accessors}.

\begin{table}[htbp]
  \centering
  \begin{tabular}{lp{0.7\textwidth}}
    \textbf{Accessor} & \textbf{Value returned} \\ \hline
    \verb|$datatype| & Coding for the type of dataset:
    0 = no data; 1 = cross-sectional (undated); 2 = time-series;
    3 = panel \\
    \verb|$nobs| & The number of observations in the current
    sample range \\
    \verb|$nvars| & The number of series (including the constant)\\
    \verb|$pd| & The data frequency (1 for cross-sectional, 4 for
    quarterly, and so on) \\
    \verb|$t1| & 1-based index of the first observation in the
    current sample \\
    \verb|$t2| & 1-based index of the last observation in the
    current sample \\
    \hline
  \end{tabular}
  \caption{The principal dataset accessors}
  \label{tab:dataset-accessors}
\end{table}

In addition, there are a few more specialized accessors:
\dollar{obsdate}, \dollar{obsmajor}, \dollar{obsminor},
\dollar{obsmicro} and \dollar{unit}. These are specific to time-series
and/or panel data, and they all return series. See the \GCR{} for
details.



%%% Local Variables:
%%% mode: latex
%%% TeX-master: "hansl-primer"
%%% End: