
\chapter{What is a dataset?}
\label{chap:dataset}

A dataset is a memory area designed to hold the data you want to work
on, if any. It may be thought of as a big global variable, containing a
(possibly huge) matrix of data and a hefty collection of metadata.

\app{R} users may think that a dataset is similar to what you get when
you \texttt{attach} a data frame in \app{R}. Not really: in hansl, you
cannot have more than one dataset open at the same time. That's why we
talk about \emph{the} dataset.

When a dataset is present in memory (that is, ``open''), a number of
objects become available for your hansl script in a transparent and
convenient way. First, of course, there are the data themselves: the columns of the
dataset matrix are called \emph{series}, which will be described in
section \ref{sec:series}; sometimes, you will want to organize one or
more series in a \emph{list} (section \ref{sec:lists}). Additionally,
you have the possibility of using, as read-only global variables, some
scalars or matrices, such as the number of observations, the number of
variables, the nature of your dataset (cross-sectional, time series or
panel), and so on. These are called \emph{accessors}, and will be
discussed in section \ref{sec:accessors}.

You can open a dataset by reading data from a disk file, via the
\cmd{open} command, or by creating one from scratch.

\section{Creating a dataset from scratch}

The primary commands in this context are \cmd{nulldata} and
\cmd{setobs}.  For example:
\begin{code}
set echo off
set messages off

set seed 443322           # initialize the random number generator
nulldata 240              # stipulate how long your series will be
setobs 12 1995:1          # define as monthly dataset, starting Jan 1995
\end{code}

For more details see \GUG, and the \GCR\ for the \cmd{nulldata} and
\cmd{setobs} commands. The main point to note here is that you can
resize your dataset and/or change some of its characteristics, such as
its periodicity, at nearly any point inside your script if necessary.
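
As a minimal sketch of such a mid-script change, the following
fragment extends a monthly dataset by one year and then compacts it to
quarterly frequency via the \cmd{dataset} command. The series
\texttt{x} is purely illustrative, and the compaction method is left
at its default.
\begin{code}
nulldata 240
setobs 12 1995:1
series x = normal()       # illustrative series only

dataset addobs 12         # append 12 empty observations at the end
printf "frequency %d, %d observations\n", $pd, $nobs

dataset compact 4         # convert from monthly to quarterly
printf "frequency %d, %d observations\n", $pd, $nobs
\end{code}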

Once your dataset is in place, you can start populating it with
series, either by reading them from files or by generating them via
appropriate commands and functions.
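
For instance, a few artificial series might be generated along the
following lines. This is only a sketch: the names \texttt{e},
\texttt{x}, \texttt{y} and \texttt{y1} are arbitrary, and the
coefficients have no special meaning.
\begin{code}
nulldata 240
setobs 12 1995:1
set seed 443322

series e = normal()        # standard normal pseudo-random draws
series x = uniform()       # uniform draws on the unit interval
series y = 1 + 0.5*x + e   # a linear function of x plus noise
series y1 = y(-1)          # first lag of y (missing at the first observation)
genr time                  # adds a linear trend series named "time"
\end{code}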

\section{Reading a dataset from a file}

The primary commands here are \cmd{open}, \cmd{append} and \cmd{join}.

The \cmd{open} command is what you'll want to use in most cases. It
handles transparently a wide variety of formats (native, CSV,
spreadsheet, data files produced by other packages such as
\textsf{Stata}, \textsf{Eviews}, \textsf{SPSS} and \textsf{SAS}) and
also takes care of setting up the dataset for you automatically.
\begin{code}
  open mydata.gdt    # native format
  open yourdata.dta  # Stata format
  open theirdata.xls # Excel format
\end{code}

The \cmd{open} command can also be used to read data from the
Internet, by using a URL instead of a filename, as in
\begin{code}
  open http://someserver.com/somedata.csv
\end{code}

\GUG\ describes the requirements that plain text data files of the
``CSV'' type must meet for direct importation by gretl. It also
describes gretl's native data formats (XML-based and binary).

The \cmd{append} and \cmd{join} commands can be used to add further
series from file to a previously opened dataset. The \cmd{join}
command is extremely flexible and has a chapter to itself in
\GUG.
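
A rough sketch of how the two commands might be used follows. All the
file names and the series names \texttt{price} and \texttt{id} are
hypothetical; the example assumes that \texttt{moredata.csv} conforms
with the open dataset and that \texttt{id} is a key series present
both in the dataset and in \texttt{prices.csv}.
\begin{code}
  open mydata.gdt                  # hypothetical main dataset
  append moredata.csv              # add the series found in a conformable file
  join prices.csv price --ikey=id  # import "price", matching rows on "id"
\end{code}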

\section{Saving datasets}

The \cmd{store} command is used to write the current dataset (or a
subset) out to file. Besides writing in gretl's native formats,
\cmd{store} can also be used to export data as CSV or in the format of
\textsf{R}. Series can be written out as matrices using the
\texttt{mwrite} function. If you have special requirements that are
not met by \cmd{store} or \texttt{mwrite}, it is possible to use
\cmd{outfile} plus \cmd{printf} (see chapter~\ref{chap:formatting})
to gain full control over the way data are saved.
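
The first two approaches can be sketched as follows. The dataset is
artificial and the output file names are arbitrary; the CSV format is
assumed to be selected via the \texttt{.csv} filename extension.
\begin{code}
nulldata 50
series x = normal()
series y = 1 + x + normal()

store mydata.gdt              # whole dataset, native format
store subset.csv y x          # selected series only, as CSV

matrix X = {y, x}             # series as columns of a matrix
err = mwrite(X, "yx.mat")     # nonzero return value signals failure
\end{code}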


\section{The \cmd{smpl} command}

Once you have opened a dataset somehow, the \cmd{smpl} command allows
you to discard observations selectively, so that your series will
contain only the observations you want (automatically changing the
dimension of the dataset in the process). See chapter 4 in \GUG\ for
further information.\footnote{Users with a Stata background may find
  the hansl way of doing things a little disconcerting at first. In
  hansl, you first restrict your sample through the \cmd{smpl}
  command, which applies until further notice, then you do what you
  have to. There is no equivalent to Stata's \texttt{if} clause to
  commands.}

There are basically three variants of the \cmd{smpl} command:
\begin{enumerate}
\item Selecting a contiguous subset of observations: this will be
  mostly useful with time-series datasets. For example:
  \begin{code}
    smpl 4 122            # select observations from 4 to 122
    smpl 1984:1 2008:4    # the so-called "Great Moderation" period
    smpl 2008-01-01 ;     # from January 1st, 2008 onwards
  \end{code}
\item Selecting observations on the basis of some criterion: this is
  typically what you want with cross-sectional datasets. Example:
  \begin{code}
    smpl male == 1 --restrict                # males only
    smpl male == 1 && age < 30 --restrict    # just the young guys
    smpl employed --dummy                    # via a dummy variable
  \end{code}
  Note that, in this context, restrictions go ``on top of'' previous
  ones. In order to start from scratch, you either reset the full
  sample via \texttt{smpl full} or use the \option{replace} option
  along with \option{restrict}.
\item Restricting the active dataset to some observations so that a
  certain effect is achieved automatically: for example, drawing a
  random subsample, or ensuring that all rows that have missing
  observations are automatically excluded. This is achieved via the
  \option{no-missing}, \option{contiguous}, and \option{random}
  options.
\end{enumerate}
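
The cumulative nature of restrictions, and the third variant, can be
illustrated with a small artificial dataset. In this sketch the
series \texttt{x} and \texttt{male} are made up for the purpose, and
\dollar{nobs} is used to report the size of the current sample.
\begin{code}
nulldata 100
set seed 1234
series x = normal()
series male = uniform() > 0.5     # artificial 0/1 dummy
x[5] = NA                         # plant one missing value

smpl male == 1 --restrict
printf "males: %d\n", $nobs
smpl x > 0 --restrict             # cumulative: males with x > 0
printf "males with x > 0: %d\n", $nobs
smpl x > 0 --restrict --replace   # starts afresh: everyone with x > 0
printf "all with x > 0: %d\n", $nobs

smpl full                         # back to the full dataset
smpl --no-missing                 # drop rows with any missing value
smpl full
smpl 20 --random                  # random subsample of 20 observations
\end{code}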

In the context of panel datasets, some extra qualifications have to be
made; see \GUG.

\section{Dataset accessors}
\label{sec:accessors}

Several characteristics of the current dataset can be determined by
reference to built-in accessor (``dollar'') variables. The main ones,
which all return scalar values, are shown in
Table~\ref{tab:dataset-accessors}.

\begin{table}[htbp]
  \centering
  \begin{tabular}{lp{0.7\textwidth}}
    \textbf{Accessor} & \textbf{Value returned} \\ \hline
    \verb|$datatype| & Coding for the type of dataset:
    0 = no data; 1 = cross-sectional (undated); 2 = time-series;
    3 = panel \\
    \verb|$nobs| & The number of observations in the current
    sample range \\
    \verb|$nvars| & The number of series (including the constant)\\
    \verb|$pd| & The data frequency (1 for cross-sectional, 4 for
    quarterly, and so on) \\
    \verb|$t1| & 1-based index of the first observation in the
    current sample \\
    \verb|$t2| & 1-based index of the last observation in the
    current sample \\
    \hline
  \end{tabular}
  \caption{The principal dataset accessors}
  \label{tab:dataset-accessors}
\end{table}

In addition there are a few more specialized accessors:
\dollar{obsdate}, \dollar{obsmajor}, \dollar{obsminor},
\dollar{obsmicro} and \dollar{unit}. These are specific to time-series
and/or panel data, and they all return series. See the \GCR{} for
details.
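
As a brief illustration, the following sketch sets up an empty monthly
dataset and prints some of the accessors from
Table~\ref{tab:dataset-accessors} before and after restricting the
sample. The series names \texttt{yr} and \texttt{mo} are arbitrary.
\begin{code}
nulldata 240
setobs 12 1995:1
printf "datatype = %d, pd = %d, nobs = %d\n", $datatype, $pd, $nobs

series yr = $obsmajor      # for monthly data, the year of each observation
series mo = $obsminor      # for monthly data, the month of each observation

smpl 1996:1 1996:12        # restrict the sample to calendar year 1996
printf "t1 = %d, t2 = %d, nobs = %d\n", $t1, $t2, $nobs
smpl full
\end{code}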


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "hansl-primer"
%%% End: