1\chapter{What is a dataset?}
2\label{chap:dataset}
3
4A dataset is a memory area designed to hold the data you want to work
on, if any. It may be thought of as a big global variable, containing a
6(possibly huge) matrix of data and a hefty collection of metadata.
7
8\app{R} users may think that a dataset is similar to what you get when
9you \texttt{attach} a data frame in \app{R}. Not really: in hansl, you
10cannot have more than one dataset open at the same time. That's why we
11talk about \emph{the} dataset.
12
13When a dataset is present in memory (that is, ``open''), a number of
14objects become available for your hansl script in a transparent and
convenient way. First of all, the data themselves: the columns of the
16dataset matrix are called \emph{series}, which will be described in
17section \ref{sec:series}; sometimes, you will want to organize one or
18more series in a \emph{list} (section \ref{sec:lists}). Additionally,
19you have the possibility of using, as read-only global variables, some
20scalars or matrices, such as the number of observations, the number of
21variables, the nature of your dataset (cross-sectional, time series or
22panel), and so on. These are called \emph{accessors}, and will be
23discussed in section \ref{sec:accessors}.
24
25You can open a dataset by reading data from a disk file, via the
26\cmd{open} command, or by creating one from scratch.
27
28\section{Creating a dataset from scratch}
29
30The primary commands in this context are \cmd{nulldata} and
31\cmd{setobs}.  For example:
\begin{code}
set echo off
set messages off

set seed 443322           # initialize the random number generator
nulldata 240              # stipulate how long your series will be
setobs 12 1995:1          # define as monthly dataset, starting Jan 1995
\end{code}
40
For more details see \GUG, and the \GCR\ for the \cmd{nulldata} and
\cmd{setobs} commands. The important point for now is that you can
resize your dataset and/or change some of its characteristics, such
as its periodicity, at nearly any point inside your script if
necessary.
46
47Once your dataset is in place, you can start populating it with
48series, either by reading them from files or by generating them via
49appropriate commands and functions.
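
As a minimal sketch of the second route, the following script
generates a few artificial series once the dataset is in place. The
series names \texttt{e}, \texttt{y} and \texttt{u} are arbitrary
choices made purely for illustration.
\begin{code}
set seed 443322
nulldata 240
setobs 12 1995:1

series e = normal()       # standard normal white noise
series y = cum(e)         # cumulated sum of e: a random walk
series u = uniform()      # uniform draws on (0,1)
summary e y u             # basic descriptive statistics
\end{code}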
50
51\section{Reading a dataset from a file}
52
53The primary commands here are \cmd{open}, \cmd{append} and \cmd{join}.
54
55The \cmd{open} command is what you'll want to use in most cases. It
56handles transparently a wide variety of formats (native, CSV,
57spreadsheet, data files produced by other packages such as
58\textsf{Stata}, \textsf{Eviews}, \textsf{SPSS} and \textsf{SAS}) and
59also takes care of setting up the dataset for you automatically.
\begin{code}
  open mydata.gdt    # native format
  open yourdata.dta  # Stata format
  open theirdata.xls # Excel format
\end{code}
65
The \cmd{open} command can also be used to read data from the
Internet, by using a URL instead of a filename, as in
\begin{code}
  open http://someserver.com/somedata.csv
\end{code}
71
\GUG\ describes the requirements on plain text data files of the
``CSV'' type for direct importation by gretl. It also describes
gretl's native data formats (XML-based and binary).
76
77The \cmd{append} and \cmd{join} commands can be used to add further
78series from file to a previously opened dataset. The \cmd{join}
79command is extremely flexible and has a chapter to itself in
80\GUG.
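
By way of illustration, a script using these two commands might look
as follows. All the file and series names here (\texttt{people.csv},
\texttt{extra.csv}, \texttt{hholds.csv}, \texttt{xh}, \texttt{hid})
are hypothetical, and the files are simply assumed to exist with
suitable contents. The \option{ikey} option names the key series on
which rows are matched.
\begin{code}
open people.csv                 # hypothetical base dataset
append extra.csv                # add series from a conformable file
join hholds.csv xh --ikey=hid   # import xh, matching rows on the key hid
\end{code}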
81
82\section{Saving datasets}
83
84The \cmd{store} command is used to write the current dataset (or a
85subset) out to file. Besides writing in gretl's native formats,
86\cmd{store} can also be used to export data as CSV or in the format of
87\textsf{R}. Series can be written out as matrices using the
88\texttt{mwrite} function. If you have special requirements that are
not met by \cmd{store} or \texttt{mwrite}, it is possible to use
90\cmd{outfile} plus \cmd{printf} (see chapter~\ref{chap:formatting})
91to gain full control over the way data are saved.
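
The following sketch shows the first two approaches on a small
artificial dataset. The output file names are arbitrary; it is assumed
here that \cmd{store} infers the CSV format from the \texttt{.csv}
suffix and that, in reasonably recent versions of gretl,
\texttt{mwrite} returns zero on success.
\begin{code}
nulldata 50
series x = normal()
series y = uniform()

store alldata.gdt          # whole dataset, native format
store xy.csv x y           # just x and y, as CSV

# write the two series as a 50 x 2 matrix
scalar err = mwrite({x, y}, "xy.mat")
\end{code}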
92
93
94\section{The \cmd{smpl} command}
95
96Once you have opened a dataset somehow, the \cmd{smpl} command allows
97you to discard observations selectively, so that your series will
98contain only the observations you want (automatically changing the
99dimension of the dataset in the process). See chapter 4 in \GUG\ for
100further information.\footnote{Users with a Stata background may find
101  the hansl way of doing things a little disconcerting at first. In
102  hansl, you first restrict your sample through the \cmd{smpl}
103  command, which applies until further notice, then you do what you
104  have to. There is no equivalent to Stata's \texttt{if} clause to
105  commands.}
106
There are basically three variants of the \cmd{smpl} command:
108\begin{enumerate}
109\item Selecting a contiguous subset of observations: this will be
110  mostly useful with time-series datasets. For example:
  \begin{code}
    smpl 4 122            # select observations from 4 to 122
    smpl 1984:1 2008:4    # the so-called "Great Moderation" period
    smpl 2008-01-01 ;     # from January 1st, 2008 onwards
  \end{code}
116\item Selecting observations on the basis of some criterion: this is
117  typically what you want with cross-sectional datasets. Example:
  \begin{code}
    smpl male == 1 --restrict                # males only
    smpl male == 1 && age < 30 --restrict    # just the young guys
    smpl employed --dummy                    # via a dummy variable
  \end{code}
123  Note that, in this context, restrictions go ``on top of'' previous
124  ones. In order to start from scratch, you either reset the full
125  sample via \texttt{smpl full} or use the \option{replace} option
126  along with \option{restrict}.
127\item Restricting the active dataset to some observations so that a
128  certain effect is achieved automatically: for example, drawing a
129  random subsample, or ensuring that all rows that have missing
130  observations are automatically excluded. This is achieved via the
131  \option{no-missing}, \option{contiguous}, and \option{random}
132  options.
133\end{enumerate}
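
The way restrictions cumulate, and the effect of the \option{replace}
option, can be sketched on an artificial cross-sectional dataset. The
series \texttt{male} and \texttt{age} are generated at random for the
purpose of the example and have no empirical content.
\begin{code}
set seed 443322
nulldata 200
series male = uniform() > 0.5
series age = floor(18 + 48 * uniform())   # integers from 18 to 65

smpl male == 1 --restrict             # males only
smpl age < 30 --restrict              # cumulates: males under 30
printf "young males: %d\n", $nobs

smpl age < 30 --restrict --replace    # starts afresh: everyone under 30
printf "all under 30: %d\n", $nobs

smpl full                             # back to all 200 observations
smpl 50 --random                      # random subsample of 50 observations
printf "random subsample: %d\n", $nobs
\end{code}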
134
135In the context of panel datasets, some extra qualifications have to be
136made; see \GUG.
137
138\section{Dataset accessors}
139\label{sec:accessors}
140
141Several characteristics of the current dataset can be determined by
142reference to built-in accessor (``dollar'') variables. The main ones,
143which all return scalar values, are shown in
144Table~\ref{tab:dataset-accessors}.
145
146\begin{table}[htbp]
147  \centering
148  \begin{tabular}{lp{0.7\textwidth}}
149    \textbf{Accessor} & \textbf{Value returned} \\ \hline
150    \verb|$datatype| & Coding for the type of dataset:
151    0 = no data; 1 = cross-sectional (undated); 2 = time-series;
152    3 = panel \\
153    \verb|$nobs| & The number of observations in the current
154    sample range \\
155    \verb|$nvars| & The number of series (including the constant)\\
156    \verb|$pd| & The data frequency (1 for cross-sectional, 4 for
157    quarterly, and so on) \\
158    \verb|$t1| & 1-based index of the first observation in the
159    current sample \\
160    \verb|$t2| & 1-based index of the last observation in the
161    current sample \\
162    \hline
163  \end{tabular}
164  \caption{The principal dataset accessors}
165  \label{tab:dataset-accessors}
166\end{table}
167
168In addition there are a few more specialized accessors:
169\dollar{obsdate}, \dollar{obsmajor}, \dollar{obsminor},
170\dollar{obsmicro} and \dollar{unit}. These are specific to time-series
171and/or panel data, and they all return series. See the \GCR{} for
172details.
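
As a brief sketch of how these accessors may be used, the script below
sets up an artificial monthly dataset, restricts the sample, and then
inspects the result. The series names \texttt{yr} and \texttt{mo} are
arbitrary.
\begin{code}
nulldata 240
setobs 12 1995:1
smpl 1996:1 2000:12

printf "datatype = %d, pd = %d\n", $datatype, $pd
printf "nobs = %d, t1 = %d, t2 = %d\n", $nobs, $t1, $t2

series yr = $obsmajor     # year of each observation
series mo = $obsminor     # month of each observation
print yr mo --byobs
\end{code}
With the sample set as above, one would expect \verb|$datatype| to be
2 and \verb|$pd| to be 12, with the 60 observations in the current
sample running from index 13 to index 72.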
173
174
175
176%%% Local Variables:
177%%% mode: latex
178%%% TeX-master: "hansl-primer"
179%%% End:
180