\chapter{Cheat sheet}
\label{chap:cheatsheet}

This chapter explains how to perform some common---and some not so
common---tasks in gretl's scripting language, hansl.  Some but not
all of the techniques listed here are also available through the
graphical interface.  Although the graphical interface may be more
intuitive and less intimidating at first, we encourage users to take
advantage of the power of gretl's scripting language as soon as
they feel comfortable with the program.

\section{Dataset handling}

\subsection{``Weird'' periodicities}

\emph{Problem:} You have data sampled every 3 minutes from 9am
onwards; since an hour then comprises 20 observations, you'll probably
want to declare the data as having periodicity 20.

\emph{Solution:}
\begin{code}
setobs 20 9:1 --special
\end{code}

\emph{Comment:} Now functions like \cmd{sdiff()} (``seasonal''
difference) or estimation methods like seasonal ARIMA will work as
expected.

\subsection{Generating a panel dataset of given dimensions}

\emph{Problem:} You want to generate via \cmd{nulldata} a panel
dataset and specify in advance the number of units and the time length
of your series via two scalar variables.

\emph{Solution:}
\begin{code}
scalar n_units = 100
scalar T = 12
scalar NT = T * n_units

nulldata NT --preserve
setobs T 1:1 --stacked-time-series
\end{code}

\emph{Comment:} The essential ingredient that we use here is the
\option{preserve} option: it protects existing scalars (and matrices,
for that matter) from being trashed by \cmd{nulldata}, thus making it
possible to use the scalar $T$ in the \cmd{setobs} command.

\subsection{Help, my data are backwards!}

\emph{Problem:} Gretl expects time series data to be in chronological
order (most recent observation last), but you have imported
third-party data that are in reverse order (most recent first).

\emph{Solution:}
\begin{code}
setobs 1 1 --cross-section
series sortkey = -obs
dataset sortby sortkey
setobs 1 1950 --time-series
\end{code}

\emph{Comment:} The first line is required only if the data currently
have a time series interpretation: it removes that interpretation,
because (for fairly obvious reasons) the \cmd{dataset sortby}
operation is not allowed for time series data.  The following two
lines reverse the data, using the negative of the built-in index
variable \texttt{obs}.  The last line is just illustrative: it
establishes the data as annual time series, starting in 1950.

If you have a dataset that is mostly the right way round, but a
particular variable is wrong, you can reverse that variable as
follows:
\begin{code}
x = sortby(-obs, x)
\end{code}

\subsection{Dropping missing observations selectively}

\emph{Problem:} You have a dataset with many variables and want to
restrict the sample to those observations for which there are no
missing observations for the variables \texttt{x1}, \texttt{x2} and
\texttt{x3}.

\begin{samepage}
\emph{Solution:}
\begin{code}
list X = x1 x2 x3
smpl --no-missing X
\end{code}
\end{samepage}

\emph{Comment:} You can save the file via a \cmd{store} command
to preserve a subsampled version of the dataset. Alternative solutions
based on the \cmd{ok} function, such as
\begin{code}
list X = x1 x2 x3
series sel = ok(X)
smpl sel --restrict
\end{code}
are perhaps less obvious, but more flexible. Pick your poison.

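For instance, to keep the subsampled version on disk (the filename
\texttt{nomiss.gdt} is just for illustration):
\begin{code}
list X = x1 x2 x3
smpl --no-missing X
store nomiss.gdt
\end{code}
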
\subsection{``By'' operations}

\emph{Problem:} You have a discrete variable \texttt{d} and you want
to run some commands (for example, estimate a model) by splitting the
sample according to the values of \texttt{d}.

\emph{Solution:}
\begin{code}
matrix vd = values(d)
m = rows(vd)
loop i=1..m
  scalar sel = vd[i]
  smpl d==sel --restrict --replace
  ols y const x
endloop
smpl --full
\end{code}

\emph{Comment:} The main ingredient here is a loop.  You can have
gretl perform as many instructions as you want for each value of
\texttt{d}, as long as they are allowed inside a loop. Note, however,
that if all you want is descriptive statistics, the \cmd{summary}
command does have a \option{by} option.

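In the latter case a single line suffices; something like the
following sketch, where \texttt{y} is the series whose statistics you
want per group:
\begin{code}
summary y --by=d --simple
\end{code}
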
\subsection{Adding a time series to a panel}

\emph{Problem:} You have a panel dataset (comprising observations of
$n$ individuals in each of $T$ periods) and you want to add a variable
which is available in straight time-series form.  For example, you
want to add annual CPI data to a panel in order to deflate nominal
income figures.

In gretl a panel is represented in stacked time-series format, so in
effect the task is to create a new variable which holds $n$ stacked
copies of the original time series.  Let's say the panel comprises 500
individuals observed in the years 1990, 1995 and 2000 ($n=500$,
$T=3$), and we have these CPI data in the ASCII file \texttt{cpi.txt}:

\begin{code}
date cpi
1990 130.658
1995 152.383
2000 172.192
\end{code}

What we need is for the CPI variable in the panel to repeat these
three values 500 times.

\emph{Solution:} Simple!  With the panel dataset open in gretl,
\begin{code}
append cpi.txt
\end{code}

\emph{Comment:} If the length of the time series is the same as the
length of the time dimension in the panel (3 in this example), gretl
will perform the stacking automatically.  Rather than using the
\cmd{append} command you could use the ``Append data'' item under
the \textsf{File} menu in the GUI program.

If the length of your time series does not exactly match the $T$
dimension of your panel dataset, \cmd{append} will not work but you
can use the \cmd{join} command, which is able to pick just the
observations with matching time periods. On selecting ``Append data''
in the GUI you are given a choice between plain ``append'' and
``join'' modes, and if you choose the latter you get a dialog window
allowing you to specify the key(s) for the join operation. For native
gretl data files you can use built-in series that identify the time
periods, such as \dollar{obsmajor}, for your outer key to match the
dates. In the example above, if the CPI data were in gretl format
\dollar{obsmajor} would give you the year of the observations.

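In script terms, the \cmd{join} route might look like the following
sketch, in which we assume that the CPI data are stored in gretl
format as \texttt{cpi.gdt} and that the panel contains a series
\texttt{year} identifying the period of each observation:
\begin{code}
# match the panel's year series against the dates of the
# outer (annual time-series) file
join cpi.gdt cpi --ikey=year --okey=obsmajor
\end{code}
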
\subsection{Time averaging of panel datasets}

\emph{Problem:} You have a panel dataset (comprising observations of
$n$ individuals in each of $T$ periods) and you want to lower the time
frequency by averaging. This is commonly done in empirical growth
economics, where annual data are turned into 3- or 4- or 5-year
averages \citep[see for example][]{Islam1995}.

\emph{Solution:} In a panel dataset, gretl functions that deal
with time are aware of the panel structure, so they will automatically
``do the right thing''. Therefore, all you have to do is use the
\cmd{movavg()} function for computing moving averages and then just
drop the years you don't need. An example with artificial data
follows:
\begin{code}
nulldata 36
set seed 61218
setobs 12 1:1 --stacked-time-series

###
### generate simulated yearly data
###
series year = 2000 + time
series y = round(normal())
series x = round(3*uniform())
list X = y x
print year X -o

###
### now re-cast as 4-year averages
###
# a dummy for endpoints
series endpoint = (year % 4 == 0)
# id variable
series id = $unit

# compute averages
loop foreach i X
    series $i = movavg($i, 4)
endloop

# drop extra observations
smpl endpoint --dummy --permanent

# restore panel structure
setobs id year --panel-vars
print id year X -o
\end{code}
% $
Running the above script produces (among other output):
\begin{code}
? print year X -o

             year            y            x

1:01         2001            1            1
1:02         2002           -1            1
1:03         2003           -1            0
1:04         2004            0            1
1:05         2005            1            2
1:06         2006            1            2
1:07         2007           -1            0
1:08         2008            1            1
1:09         2009            0            3
1:10         2010           -1            1
1:11         2011            1            1
1:12         2012            0            1
...
3:09         2009            0            1
3:10         2010            1            1
3:11         2011            0            2
3:12         2012            1            2

? print id year X -o

              id         year            y            x

1:1            1         2004        -0.25         0.75
1:2            1         2008         0.50         1.25
1:3            1         2012         0.00         1.50
...
3:3            3         2012         0.50         1.50
\end{code}

\subsection{Turning observation-marker strings into a series}

\emph{Problem:} Here's one that might turn up in the context of the
\cmd{join} command (see chapter~\ref{chap:join}). The current dataset
contains a string-valued series that you'd like to use as a key for
matching observations, perhaps the two-letter codes for the names of
US states. The file from which you wish to add data contains that same
information, but \textit{not} in the form of a string-valued series;
rather, it exists in the form of ``observation markers''. Such markers
cannot be used as a key directly, but is there a way to parlay them
into a string-valued series? Why, of course there is!

\emph{Solution:}
We'll illustrate with the Ramanathan data file \texttt{data4-10.gdt},
which contains private school enrollment data and covariates for the
50 US states plus Washington, D.C. ($n = 51$).
\begin{code}
open data4-10.gdt
markers --to-array="state_codes"
genr index
stringify(index, state_codes)
store joindata.gdt
\end{code}

\emph{Comment:} The \texttt{markers} command saves the observation
markers to an array of strings. The command \texttt{genr index}
creates a series that goes 1, 2, 3, \dots{}, and we attach the state
codes to this series via \texttt{stringify()}. After saving the result
we have a datafile that contains a series, \texttt{index}, that can be
matched with whatever series holds the state code strings in the
target dataset.

Suppose the relevant string-valued key series in the target dataset is
called \texttt{state}. We might prefer to avoid the need to specify a
distinct ``outer'' key (again see chapter~\ref{chap:join}). In that
case, in place of
\begin{code}
genr index
stringify(index, state_codes)
\end{code}
we could do
\begin{code}
genr index
series state = index
stringify(state, state_codes)
\end{code}
and the two datafiles will contain a comparable string-valued
\texttt{state} series.

\section{Creating/modifying variables}

\subsection{Generating a dummy variable for a specific observation}

\emph{Problem:} Generate $d_t = 0$ for all observations but one, for
which $d_t = 1$.

\emph{Solution:}
\begin{code}
  series d = (t=="1984:2")
\end{code}

\emph{Comment:} The internal variable \texttt{t} is used to refer to
observations in string form, so if you have a cross-section sample you
may just use \texttt{d = (t=="123")}. If the dataset has observation
labels you can use the corresponding label. For example, if you open the
dataset \texttt{mrw.gdt}, supplied with gretl among the
examples, a dummy variable for Italy could be generated via
\begin{code}
  series DIta = (t=="Italy")
\end{code}

Note that this method does not require scripting at all. In fact, you
might as well use the GUI menu item ``Add/Define new variable'' for the
same purpose, with the same syntax.

\subsection{Generating a discrete variable out of a set of dummies}

\emph{Problem:} The \cmd{dummify} function (also available as a
command) generates a set of mutually exclusive dummies from a discrete
variable. The reverse functionality, however, seems to be absent.

\emph{Solution:}
\begin{code}
series x = lincomb(D, seq(1, nelem(D)))
\end{code}

\emph{Comment:} Suppose you have a list \texttt{D} of mutually
exclusive dummies, that is a full set of 0/1 variables coding for the
value of some characteristic, such that the sum of the values of the
elements of \texttt{D} is 1 at each observation. This is, by the way,
exactly what the \cmd{dummify} command produces.  The reverse job of
\cmd{dummify} can be performed neatly by using the \cmd{lincomb}
function.

The code above multiplies the first dummy variable in the list
\texttt{D} by 1, the second one by 2, and so on. Hence, the return
value is a series whose value is $i$ if and only if the $i$-th member
of \texttt{D} has value 1.

If you want your coding to start from 0 instead of 1, you'll have to
modify the code snippet above into
\begin{code}
series x = lincomb(D, seq(0, nelem(D)-1))
\end{code}

\subsection{Easter}

\emph{Problem:} I have a 7-day daily dataset. How do I create an
``Easter'' dummy?

\emph{Solution:} We have the \cmd{easterday()} function, which returns
month and day of Easter given the year. The following is an example
script which uses this function and a few string magic tricks:

\begin{code}
series Easter = 0
loop y=2011..2016
    a = easterday(y)
    m = floor(a)
    d = round(100*(a-m))
    ed_str = sprintf("%04d-%02d-%02d", y, m, d)
    Easter["@ed_str"] = 1
endloop
\end{code}

\emph{Comment:} The \cmd{round()} function is necessary for the
``day'' component because otherwise floating-point problems may
ensue. Try the year 2015, for example.

\subsection{Recoding a variable}

\emph{Problem:} You want to perform a 1-to-1 recode on a variable. For
example, consider tennis points: you may have a variable $x$ holding
values 1 to 3 and you want to recode it to 15, 30, 40.

\emph{Solution 1:}
\begin{code}
series x = replace(x, 1, 15)
series x = replace(x, 2, 30)
series x = replace(x, 3, 40)
\end{code}

\emph{Solution 2:}
\begin{code}
matrix tennis = {15, 30, 40}
series x = replace(x, seq(1,3), tennis)
\end{code}

\emph{Comment:} There are many equivalent ways to achieve the same
effect, but for simple cases such as this, the \cmd{replace} function
is simple and transparent. If you don't mind using matrices, scripts
using \cmd{replace} can also be remarkably compact. Note that
\cmd{replace} also performs $n$-to-1 (``surjective'') replacements,
such as
\begin{code}
series x = replace(z, {2, 3, 5, 11, 22, 33}, 1)
\end{code}
which would turn all entries equal to 2, 3, 5, 11, 22 or 33 into 1 and
leave the other ones unchanged.

\subsection{Generating a ``subset of values'' dummy}

\emph{Problem:} You have a dataset which contains a fine-grained
coding for some qualitative variable and you want to ``collapse'' this
to a relatively small set of dummy variables. Examples: you have place
of work by US state and you want a small set of regional dummies; or
you have detailed occupational codes from a census dataset and you
want a manageable number of occupational category dummies.

Let's call the source series \texttt{src} and one of the target dummies
\texttt{D1}. And let's say that the values of \texttt{src} to be grouped
under \texttt{D1} are 2, 13, 14 and 25. We'll consider three possible
solutions: ``Longhand,'' ``Clever,'' and ``Proper.''

\emph{``Longhand'' solution:}
\begin{code}
series D1 = src==2 || src==13 || src==14 || src==25
\end{code}

\emph{Comment:} The above works fine if the number of distinct values
in the source to be condensed into each dummy variable is fairly
small, but it becomes cumbersome if a single dummy must comprise
dozens of source values.

\emph{Clever solution:}
\begin{code}
matrix sel = {2,13,14,25}
series D1 = maxr({src} .= vec(sel)') .> 0
\end{code}

\emph{Comment:} The subset of values to be grouped together can be
written out as a matrix relatively compactly (first line). The magic
that turns this into the desired series (second line) relies on the
versatility of the ``dot'' (element-wise) matrix operators. The
expression ``\texttt{\{src\}}'' gets a column-vector version of the
input series---call this $x$---and ``\texttt{vec(sel)'}'' flattens the
selection matrix into a row vector, whatever its original shape---call
this $s$. If $x$ is
$n\times1$ and $s$ is $1\times m$, the ``\texttt{.=}'' operator
produces an $n\times m$ result, each element $(i,j)$ of which equals 1
if $x_i = s_j$, otherwise 0. The \texttt{maxr()} function along with
the ``\verb|.>|'' operator (see chapter~\ref{chap:matrices} for both)
then produces the result we want.

Of course, whichever procedure you use, you have to repeat it for each
of the dummy series you want to create (but keep reading---the
``proper'' solution is probably what you want if you plan to create
several dummies).

\emph{Further comment:} Note that the clever solution depends on
converting what is ``naturally'' a vector result into a series. This
will fail if there are missing values in \texttt{src}, since (by
default) missing values will be skipped when converting \texttt{src}
to $x$, and so the number of rows in the result will fall short of the
number of observations in the dataset. One fix is then to subsample
the dataset to exclude missing values before employing this method;
another is to adjust the \texttt{skip\_missing} setting via the
\cmd{set} command (see the \GCR).

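The second fix might be sketched as follows (we assume \texttt{src}
and \texttt{sel} are defined as above):
\begin{code}
# keep NAs in place when casting the series to a matrix, so
# the result has one row per observation in the dataset
set skip_missing off
series D1 = maxr({src} .= vec(sel)') .> 0
set skip_missing on
\end{code}
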
\emph{Proper solution:}

The best solution, in terms of both computational efficiency and code
clarity, would be using a ``conversion table'' and the \cmd{replace}
function, to produce a series on which the \cmd{dummify} command can
be used. For example, suppose we want to convert from a series called
\texttt{fips} holding FIPS codes\footnote{FIPS is the Federal
  Information Processing Standard: it assigns numeric codes from 1 to
  56 to the US states and outlying areas.} for the 50 US states plus
the District of Columbia to a series holding codes for the four
standard US regions. We could create a $2 \times 51$ matrix---call it
\texttt{srmap}---with the 51 FIPS codes on the first row and the
corresponding region codes on the second, and then do
\begin{code}
series region = replace(fips, srmap[1,], srmap[2,])
\end{code}

\subsection{Generating an ARMA(1,1)}

\emph{Problem:} Generate $y_t = 0.9 y_{t-1} + \varepsilon_t - 0.5
\varepsilon_{t-1}$, with $\varepsilon_t \sim N\!I\!I\!D(0,1)$.

\emph{Recommended solution:}
\begin{code}
alpha = 0.9
theta = -0.5
series y = filter(normal(), {1, theta}, alpha)
\end{code}

\emph{``Bread and butter'' solution:}
\begin{code}
alpha = 0.9
theta = -0.5
series e = normal()
series y = 0
series y = alpha * y(-1) + e + theta * e(-1)
\end{code}

\emph{Comment:} The \cmd{filter} function is specifically designed for
this purpose so in most cases you'll want to take advantage of its
speed and flexibility. That said, in some cases you may want to
generate the series in a manner which is more transparent (maybe for
teaching purposes).

In the second solution, the statement \cmd{series y = 0} is necessary
because the next statement evaluates \texttt{y} recursively, so
\texttt{y[1]} must be set.  Note that you must use the keyword
\texttt{series} here instead of writing \texttt{genr y = 0} or simply
\texttt{y = 0}, to ensure that \texttt{y} is a series and
not a scalar.

\subsection{Recoding a variable by classes}

\emph{Problem:} You want to recode a variable by classes. For example,
you have the age of a sample of individuals ($x_i$) and you need to
compute age classes ($y_i$) as
\begin{eqnarray*}
  y_i = 1 & \mathrm{for} & x_i < 18 \\
  y_i = 2 & \mathrm{for} & 18 \le x_i < 65 \\
  y_i = 3 & \mathrm{for} & x_i \ge 65
\end{eqnarray*}

\emph{Solution:}
\begin{code}
series y = 1 + (x >= 18) + (x >= 65)
\end{code}

\emph{Comment:} True and false expressions are evaluated as 1 and 0
respectively, so they can be manipulated algebraically as any other
number. The same result could also be achieved by using the
conditional assignment operator (see below), but in most cases it
would probably lead to more convoluted constructs.

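For the record, here is what a conditional-assignment version might
look like (assuming, as in C, that the ternary operator associates to
the right):
\begin{code}
series y = (x < 18) ? 1 : (x < 65) ? 2 : 3
\end{code}
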
\subsection{Conditional assignment}

\emph{Problem:} Generate $y_t$ via the following rule:
\[
  y_t = \left\{
    \begin{array}{ll}
      x_t & \mathrm{for} \quad d_t > a \\
      z_t & \mathrm{for} \quad d_t \le a
    \end{array}
    \right.
\]

\emph{Solution:}
\begin{code}
series y = (d > a) ? x : z
\end{code}

\emph{Comment:} There are several alternatives to the one presented
above. One is a brute force solution using loops. Another one, more
efficient but still suboptimal, would be
\begin{code}
series y = (d>a)*x + (d<=a)*z
\end{code}
However, the ternary conditional assignment operator is not only the
most efficient way to accomplish what we want, it is also remarkably
transparent to read when one gets used to it. Some readers may find it
helpful to note that the conditional assignment operator works exactly
the same way as the \texttt{=IF()} function in spreadsheets.

\subsection{Generating a time index for panel datasets}

\emph{Problem:} gretl has a \texttt{\$unit} accessor, but not
the equivalent for time. What should I use?

\emph{Solution:}
\begin{code}
series x = time
\end{code}

\emph{Comment:} The special construct \cmd{genr time} and its variants
are aware of whether a dataset is a panel.

\subsection{Sanitizing a list of regressors}

\emph{Problem:} I noticed that built-in commands like \cmd{ols}
automatically drop collinear variables and put the constant first. How
can I achieve the same result for an estimator I'm writing?

\emph{Solution:}
No worries. The function below does just that:
\begin{code}
function list sanitize(list X)
    list R = X - const
    if nelem(R) < nelem(X)
        R = const R
    endif
    return dropcoll(R)
end function
\end{code}
so for example the code below
\begin{code}
nulldata 20
x = normal()
y = normal()
z = x + y # collinear
list A = x y const z
list B = sanitize(A)

list print A
list print B
\end{code}
returns
\begin{code}
? list print A
x y const z
? list print B
const x y
\end{code}

Besides: it has been brought to our attention that some mischievous
programs out there put the constant \emph{last}, instead of first,
like God intended. We are not amused by this utter disrespect of
econometric tradition, but if you want to pursue the way of evil, it
is rather simple to adapt the script above to that effect.

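Should you wish to join the heretics, a minimal adaptation might look
like this (the function name is, of course, ours):
\begin{code}
function list sanitize_evil(list X)
    list R = X - const
    if nelem(R) < nelem(X)
        # append, rather than prepend, the constant
        R = R const
    endif
    return dropcoll(R)
end function
\end{code}
Note that if the constant happens to be involved in a collinear
relationship, \cmd{dropcoll} may not be the one to keep it, so caveat
emptor.
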
636
637\subsection{Generating the ``hat'' values after an OLS regression}
638\label{sec:hatvalues}
639
640\emph{Problem:} I've just run an OLS regression, and now I need the
641so-called the leverage values (also known as the ``hat'' values). I
642know you can access residuals and fitted values through ``dollar''
643accessors, but nothing like that seems to be available for ``hat''
644values.
645
646\emph{Solution:}
647``Hat'' values are can be thought of as the diagonal of the projection
648matrix $P_X$, or more explicitly as
649\[
650h_i = \mathbf{x}_i' (X'X)^{-1} \mathbf{x}_i
651\]
652where $X$ is the matrix of regressors and $\mathbf{x}_i'$ is its
653$i$-th row.
654
The reader is invited to study the code below, which offers four
different solutions to the problem:

\begin{code}
open data4-1.gdt --quiet
list X = const sqft bedrms baths
ols price X

# method 1
leverage --save --quiet
series h1 = lever

# these are necessary for what comes next
matrix mX  = {X}
matrix iXX = invpd(mX'mX)

# method 2
series h2 = diag(qform(mX, iXX))
# method 3
series h3 = sumr(mX .* (mX*iXX))
# method 4
series h4 = NA
loop i=1..$nobs
    matrix x = mX[i,]'
    h4[i] = x'iXX*x
endloop

# verify
print h1 h2 h3 h4 --byobs
\end{code}
%$

\emph{Comment:} Solution 1 is the preferable one: it relies on the
built-in \cmd{leverage} command, which computes the requested series
quite efficiently, taking care of missing values, possible
restrictions to the sample, etc.

However, three more are shown for didactic purposes, mainly to show
the user how to manipulate matrices. Solution 2 first constructs the
$P_X$ matrix explicitly, via the \cmd{qform} function, and then takes
its diagonal; this is definitely \emph{not} recommended (despite its
compactness), since you generate a much bigger matrix than you
actually need and waste a lot of memory and CPU cycles in the
process. It doesn't matter very much in the present case, since the
sample size is very small, but with a big dataset this could be a very
bad idea.

Solution 3 is more clever, and relies on the fact that, if you
define $Z = X\cdot(X'X)^{-1}$, then $h_i$ could also
be written as
\[
 h_i = \mathbf{x}_i' \mathbf{z}_i = \sum_{j=1}^k x_{ij} z_{ij}
\]
which is in turn equivalent to the sum of the elements of the $i$-th
row of $X \odot Z$, where $\odot$ is the element-by-element
product. In this case, your clever usage of matrix algebra would
produce a solution computationally much superior to solution 2.

Solution 4 is the most old-fashioned one, and employs an indexed
loop. While this wastes practically no memory and employs no more CPU
cycles in algebraic operations than strictly necessary, it imposes a
much greater burden on the hansl interpreter, since handling a loop is
conceptually more complex than a single operation. In practice, you'll
find that for any realistically-sized problem, solution 4 is much
slower than solution 3.

\subsection{Moving functions for time series}
\label{sec:movfun}

\emph{Problem:} gretl provides native functions for moving
averages, but I need to compute a different statistic on a sliding
data window. Is there a way to do this without using loops?

\emph{Solution:} One of the nice things about the \texttt{list} data
type is that, if you define a list, then several functions that would
normally apply ``vertically'' to elements of a series apply
``horizontally'' across the list. So for example, the following piece
of code
\begin{code}
open bjg.gdt
order = 12
list L = lg || lags(order-1, lg)
smpl +order ;
series movmin = min(L)
series movmax = max(L)
series movmed = median(L)
smpl full
\end{code}
computes the moving minimum, maximum and median of the \texttt{lg}
series. Plotting the four series would produce something similar to
figure \ref{fig:movfun}.

\begin{figure}[hbtp]
  \centering
  \includegraphics{figures/movfun}
  \caption{``Moving'' functions}
  \label{fig:movfun}
\end{figure}

\subsection{Generating data with a prescribed correlation structure}
\label{sec:CholeskyTrick}

\emph{Problem:} I'd like to generate a bunch of normal random variates
whose covariance matrix is exactly equal to a given matrix
$\Sigma$. How can I do this in gretl?

\emph{Solution:} The Cholesky decomposition is your friend. If you
want to generate data with a given \emph{population} covariance
matrix, then all you have to do is post-multiply your pseudo-random
data by the Cholesky factor (transposed) of the matrix you want. For
example:
\begin{code}
set seed 123
S = {2,1;1,1}
T = 1000
X = mnormal(T, rows(S))
X = X * cholesky(S)'
eval mcov(X)
\end{code}
should give you
\begin{code}
? eval mcov(X)
      2.0016       1.0157
      1.0157       1.0306
\end{code}

If, instead, you want your simulated data to have a given \emph{sample}
covariance matrix, you have to apply the same technique twice: once to
standardize the data, and again to impose the covariance structure you
want. Example:
\begin{code}
S = {2,1;1,1}
T = 1000
X = mnormal(T, rows(S))
X = X * (cholesky(S)/cholesky(mcov(X)))'
eval mcov(X)
\end{code}
gives you
\begin{code}
? eval mcov(X)
   2   1
   1   1
\end{code}
as required.

\section{Neat tricks}
\label{sec:cheat-neat}

\subsection{Interaction dummies}

\emph{Problem:} You want to estimate the model $y_i = \mathbf{x}_i
\beta_1 + \mathbf{z}_i \beta_2 + d_i \beta_3 + (d_i \cdot \mathbf{z}_i)
\beta_4 + \varepsilon_i$, where $d_i$ is a dummy variable while
$\mathbf{x}_i$ and $\mathbf{z}_i$ are vectors of explanatory
variables.

\emph{Solution:} As of version 1.9.12, gretl provides the
\verb|^| operator to make this operation easy. See section
\ref{sec:transform-lists} for details (especially example script
\ref{ex:interact_list}). But back in my day, we used loops to do that!
Here's how:

\begin{code}
list X = x1 x2 x3
list Z = z1 z2
list dZ = null
loop foreach i Z
  series d$i = d * $i
  list dZ = dZ d$i
endloop

ols y X Z d dZ
\end{code}
%$

\emph{Comment:} It's amazing what string substitution can do for
you, isn't it?

\subsection{Realized volatility}

\emph{Problem:} Given data by the minute, you want to compute the
``realized volatility'' for the hour as $RV_t = \frac{1}{60}
\sum_{\tau=1}^{60} y_{t:\tau}^2$. Imagine your sample starts at time 1:1.

\emph{Solution:}
\begin{code}
smpl --full
genr time
series minute = int(time/60) + 1
series second = time % 60
setobs minute second --panel-vars
series rv = psd(y)^2
setobs 1 1
smpl second==1 --restrict
store foo rv
\end{code}

\emph{Comment:} Here we trick gretl into thinking that our
dataset is a panel dataset, where the minutes are the ``units'' and
the seconds are the ``time''; this way, we can take advantage of the
special function \cmd{psd()}, panel standard deviation.  Then we
simply drop all observations but one per minute and save the resulting
data (\texttt{store foo rv} translates as ``store in the gretl
datafile \texttt{foo.gdt} the series \texttt{rv}'').

\subsection{Looping over two paired lists}

\emph{Problem:} Suppose you have two lists with the same number of
elements, and you want to apply some command to corresponding elements
over a loop.

\emph{Solution:}
\begin{code}
list L1 = a b c
list L2 = x y z

k1 = 1
loop foreach i L1
    k2 = 1
    loop foreach j L2
        if k1 == k2
            ols $i 0 $j
        endif
        k2++
    endloop
    k1++
endloop
\end{code}

\emph{Comment:} The simplest way to achieve the result is to loop over
all possible combinations and filter out the unneeded ones via an
\texttt{if} condition, as above. That said, in some cases variable
names can help. For example, if
\begin{code}
  list Lx = x1 x2 x3
  list Ly = y1 y2 y3
\end{code}
then we could just loop over the integers---quite intuitive and certainly
more elegant:
\begin{code}
  loop i=1..3
    ols y$i const x$i
  endloop
\end{code}
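For comparison, this ``paired'' iteration is what Python's built-in \texttt{zip} does natively; a hypothetical sketch with made-up element names:

```python
# Hypothetical illustration: pairing two equal-length lists,
# which is what the hansl double loop with the k1 == k2
# filter achieves.
L1 = ["a", "b", "c"]
L2 = ["x", "y", "z"]

pairs = list(zip(L1, L2))
print(pairs)  # -> [('a', 'x'), ('b', 'y'), ('c', 'z')]
```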
899
\subsection{Convolution / polynomial multiplication}

\emph{Problem:} How do I multiply polynomials? There's no dedicated
function to do that, and yet it's a fairly basic mathematical task.

\emph{Solution:} Never fear! We have the \cmd{conv2d} function, which
is a tool for a more general problem, but includes polynomial
multiplication as a special case.

Suppose you want to multiply two finite-order polynomials $P(x) =
\sum_{i=0}^m p_i x^i$ and $Q(x) = \sum_{i=0}^n q_i x^i$. What you want
is the sequence of coefficients of the polynomial
\[
  R(x) = P(x) \cdot Q(x) = \sum_{k=0}^{m+n} r_k x^k
\]
where
\[
  r_k = \sum_{i=0}^k p_i q_{k-i}
\]
is the \emph{convolution} of the $p_i$ and $q_i$ coefficients. The
same operation can be performed via the FFT, but in most cases using
\cmd{conv2d} is quicker and more natural.

As an example, we'll use the same one we used in Section
\ref{sec:genr-fft}: consider the multiplication of two polynomials:
\begin{eqnarray*}
  P(x) & = & 1 + 0.5 x \\
  Q(x) & = & 1 + 0.3 x - 0.8 x^2 \\
  R(x) = P(x) \cdot Q(x) & = & 1 + 0.8 x - 0.65 x^2 - 0.4 x^3
\end{eqnarray*}
The following code snippet performs all the necessary calculations:
\begin{code}
p = {1; 0.5}
q = {1; 0.3; -0.8}
r = conv2d(p, q)
print r
\end{code}
Running the above produces
\begin{code}
r (4 x 1)

      1
    0.8
  -0.65
   -0.4
\end{code}
which is indeed the desired result. Note that the same computation
could also be performed via the \cmd{filter} function, at the price of
slightly more elaborate syntax.
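Should you want to double-check the result outside gretl, numpy's \texttt{convolve} function computes exactly the $r_k$ sums defined above:

```python
import numpy as np

# Coefficients of P and Q, lowest order first, as in the
# hansl snippet above.
p = [1.0, 0.5]
q = [1.0, 0.3, -0.8]

# convolve() computes r_k = sum_i p_i q_{k-i}.
r = np.convolve(p, q)
print(r)  # -> 1, 0.8, -0.65, -0.4
```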
949
\subsection{Comparing two lists}

\emph{Problem:} How can I tell if two lists contain the same variables
(not necessarily in the same order)?

\emph{Solution:} In many respects, lists are like sets, so it makes
sense to use the so-called ``symmetric difference'' operator, which is
defined as
\[
A\,\triangle\,B = (A \setminus B) \cup (B \setminus A)
\]
where in this context backslash represents the relative complement
operator, such that
\[
A \setminus B = \left\{ x \in A \,|\, x \not \in B \right\}
\]

In practice we first check if there are any series in $A$ but not in
$B$, then we perform the reverse check. If the union of the two
results is an empty set, then the lists must contain the same
variables. The hansl syntax for this would be something like

\begin{code}
scalar NotTheSame = nelem((A-B) || (B-A)) > 0
\end{code}
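The same set logic can be illustrated with Python's built-in sets, which provide a symmetric-difference operator directly (a hypothetical sketch; the element names are made up):

```python
# Hypothetical illustration with Python sets standing in for
# gretl lists; element names are made up.
A = {"x1", "x2", "x3"}
B = {"x3", "x1", "x2"}
C = {"x1", "x4"}

# A ^ B is (A \ B) | (B \ A): empty iff the sets coincide.
print(len(A ^ B) > 0)  # -> False (same elements)
print(len(A ^ C) > 0)  # -> True
```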
975
\subsection{Reordering list elements}

\emph{Problem:} Is there a way to reorder list elements?

\emph{Solution:} You can use the fact that a list can be cast into
a vector of integers and then manipulated via ordinary matrix
syntax. So, for example, if you wanted to ``flip'' a list you could
just use the \cmd{mreverse} function:
\begin{code}
open AWM.gdt --quiet
list X = 3 6 9 12
matrix tmp = X
list revX = mreverse(tmp')

list X print
list revX print
\end{code}
will produce
\begin{code}
? list X print
D1 D872 EEN_DIS GCD
? list revX print
GCD EEN_DIS D872 D1
\end{code}
1000
\subsection{Plotting an asymmetric confidence interval}

\emph{Problem:} ``I like the look of the \option{band} option to the
\cmd{gnuplot} and \cmd{plot} commands, but it's set up for plotting a
symmetric interval and I want to show an asymmetric one.''

\emph{Solution:} Any interval is by construction symmetrical about its
mean at each observation. So you just need to perform a little
tweak. Say you want to plot a series \texttt{x} along with a band
defined by the two series \texttt{top} and \texttt{bot}. Here we go:
\begin{code}
# create series for mid-point and deviation
series mid = (top + bot)/2
series dev = top - mid
gnuplot x --band=mid,dev --time-series --with-lines --output=display
\end{code}
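To see why the trick is exact, note that \texttt{mid + dev} and \texttt{mid - dev} recover \texttt{top} and \texttt{bot}. A quick numeric check in numpy, with made-up band limits:

```python
import numpy as np

# Made-up asymmetric band limits for five observations.
top = np.array([2.0, 2.5, 3.0, 2.8, 2.2])
bot = np.array([1.0, 1.1, 1.4, 1.6, 0.8])

# The mid-point / deviation re-parametrization used above.
mid = (top + bot) / 2
dev = top - mid

# mid +/- dev reproduces the original band exactly.
print(np.allclose(mid + dev, top), np.allclose(mid - dev, bot))
```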
1017
\subsection{Cross-validation}
\label{sec:xvalid}

\emph{Problem:} ``I'd like to compute the so-called \emph{leave-one-out
cross-validation criterion} for my regression. Is there a command in
gretl?''

\emph{Solution:} If you have a sample with $n$ observations, the
``leave-one-out'' cross-validation criterion can be mechanically
computed by running $n$ regressions in which one observation at a time
is omitted and all the other ones are used to forecast its value. The
sum of the $n$ squared forecast errors is the statistic we want.
Fortunately, there is no need to do so. It is possible to prove that
the same statistic can be computed as
\[
CV = \sum_{i=1}^n [\hat{u}_{i}/(1-h_{i})]^2,
\]
where $h_i$ is the $i$-th element of the ``hat'' matrix (see section
\ref{sec:hatvalues}) from a regression on the whole sample.
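The identity can be verified numerically outside gretl as well. The following numpy sketch, on simulated data, computes $CV$ both by brute-force leave-one-out regressions and via the residual/hat-matrix shortcut:

```python
import numpy as np

# Simulated regression data (made-up coefficients).
rng = np.random.default_rng(7)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

# Full-sample OLS: residuals and the diagonal of the hat matrix.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
cv_short = np.sum((u / (1 - h)) ** 2)

# Brute force: drop one observation at a time and forecast it.
cv_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    cv_loo += (y[i] - X[i] @ b_i) ** 2

print(np.isclose(cv_short, cv_loo))  # -> True
```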

This method is natively provided by gretl as a side benefit of the
\cmd{leverage} command, which stores the CV criterion in the
\dollar{test} accessor. The following script shows the equivalence of
the two approaches:
\begin{code}
set verbose off
open data4-1.gdt
list X = const sqft bedrms baths

# compute the CV criterion the silly way

scalar CV = 0
matrix mX = {X}
loop i = 1 .. $nobs
    xi = mX[i,]
    yi = price[i]
    smpl obs != i --restrict
    ols price X --quiet
    smpl full
    scalar fe = yi - xi * $coeff
    CV += fe^2
endloop

printf "CV = %g\n", CV

# the smart way

ols price X --quiet
leverage --quiet
printf "CV = %g\n", $test
\end{code}
1069
\subsection{Is my matrix result broken?}
\label{sec:brokenmat}

\emph{Problem:} ``Most of the matrix manipulation functions available
in gretl flag an error if something goes wrong, but there's no
guarantee that every matrix computation will return an entirely finite
matrix, containing no infinities or \texttt{NaN}s. So how do I tell if
I've got a fully valid matrix?''

\emph{Solution}: Given a matrix \texttt{m}, the call
``\texttt{ok(m)}'' returns a matrix with the same dimensions as
\texttt{m}, with elements 1 for finite values and 0 for infinities or
\texttt{NaN}s. A matrix as a whole is OK if it has no elements which
fail this test, so here's a suitable check for a ``broken'' matrix,
using the logical NOT operator, ``\texttt{!}'':
\begin{code}
sumc(sumr(!ok(m))) > 0
\end{code}
If this gives a non-zero return value you know that \texttt{m}
contains at least one non-finite element.
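The same check has a close analogue in numpy, where \texttt{isfinite} plays the role of \texttt{ok()} (a hypothetical sketch with a made-up matrix):

```python
import numpy as np

# Hypothetical analogue of gretl's ok(): np.isfinite() returns
# a same-shaped boolean matrix, True for finite entries.
m = np.array([[1.0, 2.0], [np.nan, np.inf]])

# The matrix is "broken" if any entry fails the finiteness test.
broken = not np.isfinite(m).all()
print(broken)  # -> True
```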


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: