\chapter{Cheat sheet}
\label{chap:cheatsheet}

This chapter explains how to perform some common---and some not so
common---tasks in gretl's scripting language, hansl. Some but not
all of the techniques listed here are also available through the
graphical interface. Although the graphical interface may be more
intuitive and less intimidating at first, we encourage users to take
advantage of the power of gretl's scripting language as soon as
they feel comfortable with the program.

\section{Dataset handling}

\subsection{``Weird'' periodicities}

\emph{Problem:} You have data sampled every 3 minutes from 9am
onwards; you'll probably want to specify the hour as 20 periods.

\emph{Solution:}
\begin{code}
setobs 20 9:1 --special-time-series
\end{code}

\emph{Comment:} Now functions like \cmd{sdiff()} (``seasonal''
difference) or estimation methods like seasonal ARIMA will work as
expected.

\subsection{Generating a panel dataset of given dimensions}

\emph{Problem:} You want to generate via \cmd{nulldata} a panel
dataset, specifying in advance the number of units and the time length
of your series via two scalar variables.

\emph{Solution:}
\begin{code}
scalar n_units = 100
scalar T = 12
scalar NT = T * n_units

nulldata NT --preserve
setobs T 1:1 --stacked-time-series
\end{code}

\emph{Comment:} The essential ingredient here is the
\option{preserve} option: it protects existing scalars (and matrices,
for that matter) from being trashed by \cmd{nulldata}, thus making it
possible to use the scalar $T$ in the \cmd{setobs} command.

\subsection{Help, my data are backwards!}

\emph{Problem:} Gretl expects time series data to be in chronological
order (most recent observation last), but you have imported
third-party data that are in reverse order (most recent first).
\emph{Solution:}
\begin{code}
setobs 1 1 --cross-section
series sortkey = -obs
dataset sortby sortkey
setobs 1 1950 --time-series
\end{code}

\emph{Comment:} The first line is required only if the data currently
have a time series interpretation: it removes that interpretation,
because (for fairly obvious reasons) the \cmd{dataset sortby}
operation is not allowed for time series data. The following two
lines reverse the data, using the negative of the built-in index
variable \texttt{obs}. The last line is just illustrative: it
establishes the data as annual time series, starting in 1950.

If you have a dataset that is mostly the right way round, but a
particular variable is wrong, you can reverse that variable as
follows:
\begin{code}
x = sortby(-obs, x)
\end{code}

\subsection{Dropping missing observations selectively}

\emph{Problem:} You have a dataset with many variables and want to
restrict the sample to those observations for which there are no
missing values for the variables \texttt{x1}, \texttt{x2} and
\texttt{x3}.

\begin{samepage}
\emph{Solution:}
\begin{code}
list X = x1 x2 x3
smpl --no-missing X
\end{code}
\end{samepage}

\emph{Comment:} You can save the file via a \cmd{store} command
to preserve a subsampled version of the dataset. Alternative solutions
based on the \cmd{ok} function, such as
\begin{code}
list X = x1 x2 x3
series sel = ok(X)
smpl sel --restrict
\end{code}
are perhaps less obvious, but more flexible. Pick your poison.

\subsection{``By'' operations}

\emph{Problem:} You have a discrete variable \texttt{d} and you want
to run some commands (for example, estimate a model) by splitting the
sample according to the values of \texttt{d}.
\emph{Solution:}
\begin{code}
matrix vd = values(d)
m = rows(vd)
loop i=1..m
    scalar sel = vd[i]
    smpl d==sel --restrict --replace
    ols y const x
endloop
smpl --full
\end{code}

\emph{Comment:} The main ingredient here is a loop. You can have
gretl perform as many instructions as you want for each value of
\texttt{d}, as long as they are allowed inside a loop. Note, however,
that if all you want is descriptive statistics, the \cmd{summary}
command does have a \option{by} option.

\subsection{Adding a time series to a panel}

\emph{Problem:} You have a panel dataset (comprising observations of
$n$ individuals in each of $T$ periods) and you want to add a variable
which is available in straight time-series form. For example, you
want to add annual CPI data to a panel in order to deflate nominal
income figures.

In gretl a panel is represented in stacked time-series format, so in
effect the task is to create a new variable which holds $n$ stacked
copies of the original time series. Let's say the panel comprises 500
individuals observed in the years 1990, 1995 and 2000 ($n=500$,
$T=3$), and we have these CPI data in the ASCII file \texttt{cpi.txt}:

\begin{code}
date cpi
1990 130.658
1995 152.383
2000 172.192
\end{code}

What we need is for the CPI variable in the panel to repeat these
three values 500 times.

\emph{Solution:} Simple! With the panel dataset open in gretl,
\begin{code}
append cpi.txt
\end{code}

\emph{Comment:} If the length of the time series is the same as the
length of the time dimension in the panel (3 in this example), gretl
will perform the stacking automatically. Rather than using the
\cmd{append} command you could use the ``Append data'' item under
the \textsf{File} menu in the GUI program.
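What \cmd{append} does here amounts to tiling the $T$-vector of CPI
values $n$ times down the stacked panel. A minimal cross-check of that
logic in Python/NumPy, using the figures from \texttt{cpi.txt} above
(the variable names are our own, purely illustrative):

```python
import numpy as np

n, T = 500, 3  # panel dimensions from the example
cpi = np.array([130.658, 152.383, 172.192])  # values from cpi.txt

# stacked time-series layout: unit 1's T rows, then unit 2's, etc.,
# so the panel version of CPI is just the series repeated n times
cpi_panel = np.tile(cpi, n)

print(cpi_panel.shape)  # (1500,) = n*T rows
print(cpi_panel[:4])    # unit 1's three values, then 1990 again for unit 2
```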
If the length of your time series does not exactly match the $T$
dimension of your panel dataset, \cmd{append} will not work but you
can use the \cmd{join} command, which is able to pick just the
observations with matching time periods. On selecting ``Append data''
in the GUI you are given a choice between plain ``append'' and
``join'' modes, and if you choose the latter you get a dialog window
allowing you to specify the key(s) for the join operation. For native
gretl data files you can use built-in series that identify the time
periods, such as \dollar{obsmajor}, for your outer key to match the
dates. In the example above, if the CPI data were in gretl format
\dollar{obsmajor} would give you the year of the observations.

\subsection{Time averaging of panel datasets}

\emph{Problem:} You have a panel dataset (comprising observations of
$n$ individuals in each of $T$ periods) and you want to lower the time
frequency by averaging. This is commonly done in empirical growth
economics, where annual data are turned into 3-, 4- or 5-year
averages \citep[see for example][]{Islam1995}.

\emph{Solution:} In a panel dataset, gretl functions that deal
with time are aware of the panel structure, so they will automatically
``do the right thing''. Therefore, all you have to do is use the
\cmd{movavg()} function for computing moving averages and then just
drop the years you don't need.
An example with artificial data follows:
\begin{code}
nulldata 36
set seed 61218
setobs 12 1:1 --stacked-time-series

###
### generate simulated yearly data
###
series year = 2000 + time
series y = round(normal())
series x = round(3*uniform())
list X = y x
print year X -o

###
### now re-cast as 4-year averages
###
# a dummy for endpoints
series endpoint = (year % 4 == 0)
# id variable
series id = $unit

# compute averages
loop foreach i X
    series $i = movavg($i, 4)
endloop

# drop extra observations
smpl endpoint --dummy --permanent

# restore panel structure
setobs id year --panel-vars
print id year X -o
\end{code}
% $
Running the above script produces (among other output):
\begin{code}
? print year X -o

                year            y            x

1:01            2001            1            1
1:02            2002           -1            1
1:03            2003           -1            0
1:04            2004            0            1
1:05            2005            1            2
1:06            2006            1            2
1:07            2007           -1            0
1:08            2008            1            1
1:09            2009            0            3
1:10            2010           -1            1
1:11            2011            1            1
1:12            2012            0            1
...
3:09            2009            0            1
3:10            2010            1            1
3:11            2011            0            2
3:12            2012            1            2

? print id year X -o

                  id         year            y            x

1:1                1         2004        -0.25         0.75
1:2                1         2008         0.50         1.25
1:3                1         2012         0.00         1.50
...
3:3                3         2012         0.50         1.50
\end{code}

\subsection{Turning observation-marker strings into a series}

\emph{Problem:} Here's one that might turn up in the context of the
\cmd{join} command (see chapter~\ref{chap:join}). The current dataset
contains a string-valued series that you'd like to use as a key for
matching observations, perhaps the two-letter codes for the names of
US states. The file from which you wish to add data contains that same
information, but \textit{not} in the form of a string-valued series;
rather, it exists in the form of ``observation markers''. Such markers
cannot be used as a key directly, but is there a way to parlay them
into a string-valued series?
Why, of course there is!

\emph{Solution:}
We'll illustrate with the Ramanathan data file \texttt{data4-10.gdt},
which contains private school enrollment data and covariates for the
50 US states plus Washington, D.C. ($n = 51$).
\begin{code}
open data4-10.gdt
markers --to-array="state_codes"
genr index
stringify(index, state_codes)
store joindata.gdt
\end{code}

\emph{Comment:} The \texttt{markers} command saves the observation
markers to an array of strings. The command \texttt{genr index}
creates a series that goes 1, 2, 3, \dots{}, and we attach the state
codes to this series via \texttt{stringify()}. After saving the result
we have a datafile that contains a series, \texttt{index}, that can be
matched with whatever series holds the state code strings in the
target dataset.

Suppose the relevant string-valued key series in the target dataset is
called \texttt{state}. We might prefer to avoid the need to specify a
distinct ``outer'' key (again see chapter~\ref{chap:join}). In that
case, in place of
\begin{code}
genr index
stringify(index, state_codes)
\end{code}
we could do
\begin{code}
genr index
series state = index
stringify(state, state_codes)
\end{code}
and the two datafiles will contain a comparable string-valued
\texttt{state} series.

\section{Creating/modifying variables}

\subsection{Generating a dummy variable for a specific observation}

\emph{Problem:} Generate $d_t = 0$ for all observations but one, for
which $d_t = 1$.

\emph{Solution:}
\begin{code}
series d = (t=="1984:2")
\end{code}

\emph{Comment:} The internal variable \texttt{t} is used to refer to
observations in string form, so if you have a cross-section sample you
may just use \texttt{d = (t=="123")}. If the dataset has observation
labels you can use the corresponding label.
For example, if you open the
dataset \texttt{mrw.gdt}, supplied with gretl among the
examples, a dummy variable for Italy could be generated via
\begin{code}
series DIta = (t=="Italy")
\end{code}

Note that this method does not require scripting at all. In fact, you
might as well use the GUI menu item ``Add/Define new variable'' for the
same purpose, with the same syntax.

\subsection{Generating a discrete variable out of a set of dummies}

\emph{Problem:} The \cmd{dummify} function (also available as a
command) generates a set of mutually exclusive dummies from a discrete
variable. The reverse functionality, however, seems to be absent.

\emph{Solution:}
\begin{code}
series x = lincomb(D, seq(1, nelem(D)))
\end{code}

\emph{Comment:} Suppose you have a list \texttt{D} of mutually
exclusive dummies, that is, a full set of 0/1 variables coding for the
value of some characteristic, such that the sum of the values of the
elements of \texttt{D} is 1 at each observation. This is, by the way,
exactly what the \cmd{dummify} command produces. The reverse job of
\cmd{dummify} can be performed neatly by using the \cmd{lincomb}
function.

The code above multiplies the first dummy variable in the list
\texttt{D} by 1, the second one by 2, and so on. Hence, the return
value is a series whose value is $i$ if and only if the $i$-th member
of \texttt{D} has value 1.

If you want your coding to start from 0 instead of 1, you'll have to
modify the code snippet above into
\begin{code}
series x = lincomb(D, seq(0, nelem(D)-1))
\end{code}

\subsection{Easter}

\emph{Problem:} I have a 7-day daily dataset. How do I create an
``Easter'' dummy?

\emph{Solution:} We have the \cmd{easterday()} function, which returns
the month and day of Easter given the year.
The following is an example
script which uses this function and a few string magic tricks:

\begin{code}
series Easter = 0
loop y=2011..2016
    a = easterday(y)
    m = floor(a)
    d = round(100*(a-m))
    ed_str = sprintf("%04d-%02d-%02d", y, m, d)
    Easter["@ed_str"] = 1
endloop
\end{code}

\emph{Comment:} The \cmd{round()} function is necessary for the
``day'' component because otherwise floating-point problems may
ensue. Try the year 2015, for example.

\subsection{Recoding a variable}

\emph{Problem:} You want to perform a 1-to-1 recode on a variable. For
example, consider tennis points: you may have a variable $x$ holding
values 1 to 3 and you want to recode it to 15, 30, 40.

\emph{Solution 1:}
\begin{code}
series x = replace(x, 1, 15)
series x = replace(x, 2, 30)
series x = replace(x, 3, 40)
\end{code}

\emph{Solution 2:}
\begin{code}
matrix tennis = {15, 30, 40}
series x = replace(x, seq(1,3), tennis)
\end{code}

\emph{Comment:} There are many equivalent ways to achieve the same
effect, but for simple cases such as this the \cmd{replace} function
is simple and transparent. If you don't mind using matrices, scripts
using \cmd{replace} can also be remarkably compact. Note that
\cmd{replace} also performs $n$-to-1 (``surjective'') replacements,
such as
\begin{code}
series x = replace(z, {2, 3, 5, 11, 22, 33}, 1)
\end{code}
which would turn all entries equal to 2, 3, 5, 11, 22 or 33 into 1 and
leave the others unchanged.

\subsection{Generating a ``subset of values'' dummy}

\emph{Problem:} You have a dataset which contains a fine-grained
coding for some qualitative variable and you want to ``collapse'' this
to a relatively small set of dummy variables.
Examples: you have place
of work by US state and you want a small set of regional dummies; or
you have detailed occupational codes from a census dataset and you
want a manageable number of occupational category dummies.

Let's call the source series \texttt{src} and one of the target dummies
\texttt{D1}. And let's say that the values of \texttt{src} to be grouped
under \texttt{D1} are 2, 13, 14 and 25. We'll consider three possible
solutions: ``Longhand,'' ``Clever,'' and ``Proper.''

\emph{``Longhand'' solution:}
\begin{code}
series D1 = src==2 || src==13 || src==14 || src==25
\end{code}

\emph{Comment:} The above works fine if the number of distinct values
in the source to be condensed into each dummy variable is fairly
small, but it becomes cumbersome if a single dummy must comprise
dozens of source values.

\emph{``Clever'' solution:}
\begin{code}
matrix sel = {2,13,14,25}
series D1 = maxr({src} .= vec(sel)') .> 0
\end{code}

\emph{Comment:} The subset of values to be grouped together can be
written out as a matrix relatively compactly (first line). The magic
that turns this into the desired series (second line) relies on the
versatility of the ``dot'' (element-wise) matrix operators. The
expression ``\texttt{\{src\}}'' gets a column-vector version of the
input series---call this $x$---and ``\texttt{vec(sel)'}'' gets the
input matrix as a row vector, in case it's a column vector or a matrix
with both dimensions greater than 1---call this $s$. If $x$ is
$n\times1$ and $s$ is $1\times m$, the ``\texttt{.=}'' operator
produces an $n\times m$ result, each element $(i,j)$ of which equals 1
if $x_i = s_j$, otherwise 0. The \texttt{maxr()} function along with
the ``\verb|.>|'' operator (see chapter~\ref{chap:matrices} for both)
then produces the result we want.
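The broadcasting logic behind the ``clever'' solution is not specific
to hansl. As a sketch of the same computation in Python/NumPy (the
series values below are invented purely for illustration):

```python
import numpy as np

src = np.array([2, 7, 13, 14, 1, 25, 3])  # made-up source codes
sel = np.array([2, 13, 14, 25])           # values to group under D1

# compare an (n,1) column against a (1,m) row: broadcasting yields an
# (n,m) boolean matrix, just like hansl's ".=" dot-operator
hits = src[:, None] == sel[None, :]

# the row-wise "any match" plays the role of maxr() followed by ".> 0"
D1 = hits.any(axis=1).astype(int)

print(D1.tolist())  # -> [1, 0, 1, 1, 0, 1, 0]
```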
Of course, whichever procedure you use, you have to repeat it for each
of the dummy series you want to create (but keep reading---the
``proper'' solution is probably what you want if you plan to create
several dummies).

\emph{Further comment:} Note that the clever solution depends on
converting what is ``naturally'' a vector result into a series. This
will fail if there are missing values in \texttt{src}, since (by
default) missing values will be skipped when converting \texttt{src}
to $x$, and so the number of rows in the result will fall short of the
number of observations in the dataset. One fix is then to subsample
the dataset to exclude missing values before employing this method;
another is to adjust the \texttt{skip\_missing} setting via the
\cmd{set} command (see the \GCR).

\emph{``Proper'' solution:}

The best solution, in terms of both computational efficiency and code
clarity, is to use a ``conversion table'' and the \cmd{replace}
function, producing a series on which the \cmd{dummify} command can
be used. For example, suppose we want to convert from a series called
\texttt{fips} holding FIPS codes\footnote{FIPS is the Federal
  Information Processing Standard: it assigns numeric codes from 1 to
  56 to the US states and outlying areas.} for the 50 US states plus
the District of Columbia to a series holding codes for the four
standard US regions. We could create a $2 \times 51$ matrix---call it
\texttt{srmap}---with the 51 FIPS codes on the first row and the
corresponding region codes on the second, and then do
\begin{code}
series region = replace(fips, srmap[1,], srmap[2,])
\end{code}

\subsection{Generating an ARMA(1,1)}

\emph{Problem:} Generate $y_t = 0.9 y_{t-1} + \varepsilon_t - 0.5
\varepsilon_{t-1}$, with $\varepsilon_t \sim N\!I\!I\!D(0,1)$.
\emph{Recommended solution:}
\begin{code}
alpha = 0.9
theta = -0.5
series y = filter(normal(), {1, theta}, alpha)
\end{code}

\emph{``Bread and butter'' solution:}
\begin{code}
alpha = 0.9
theta = -0.5
series e = normal()
series y = 0
series y = alpha * y(-1) + e + theta * e(-1)
\end{code}

\emph{Comment:} The \cmd{filter} function is specifically designed for
this purpose, so in most cases you'll want to take advantage of its
speed and flexibility. That said, in some cases you may want to
generate the series in a manner which is more transparent (maybe for
teaching purposes).

In the second solution, the statement \cmd{series y = 0} is necessary
because the next statement evaluates \texttt{y} recursively, so
\texttt{y[1]} must be set. Note that you must use the keyword
\texttt{series} here instead of writing \texttt{genr y = 0} or simply
\texttt{y = 0}, to ensure that \texttt{y} is a series and
not a scalar.

\subsection{Recoding a variable by classes}

\emph{Problem:} You want to recode a variable by classes. For example,
you have the age of a sample of individuals ($x_i$) and you need to
compute age classes ($y_i$) as
\begin{eqnarray*}
  y_i = 1 & \mathrm{for} & x_i < 18 \\
  y_i = 2 & \mathrm{for} & 18 \le x_i < 65 \\
  y_i = 3 & \mathrm{for} & x_i \ge 65
\end{eqnarray*}

\emph{Solution:}
\begin{code}
series y = 1 + (x >= 18) + (x >= 65)
\end{code}

\emph{Comment:} True and false expressions are evaluated as 1 and 0
respectively, so they can be manipulated algebraically like any other
number. The same result could also be achieved by using the
conditional assignment operator (see below), but in most cases it
would probably lead to more convoluted constructs.
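Since the trick is pure Boolean arithmetic, it carries over to any
language with 0/1 truth values. A quick numerical check in
Python/NumPy (the ages are invented for illustration):

```python
import numpy as np

x = np.array([10, 18, 17, 40, 65, 80])  # made-up ages

# True/False count as 1/0, so summing the two threshold tests
# yields class 1, 2 or 3 directly, as in the hansl one-liner
y = 1 + (x >= 18) + (x >= 65)

print(y.tolist())  # -> [1, 2, 1, 2, 3, 3]
```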
\subsection{Conditional assignment}

\emph{Problem:} Generate $y_t$ via the following rule:
\[
  y_t = \left\{
    \begin{array}{ll}
      x_t & \mathrm{for} \quad d_t > a \\
      z_t & \mathrm{for} \quad d_t \le a
    \end{array}
    \right.
\]

\emph{Solution:}
\begin{code}
series y = (d > a) ? x : z
\end{code}

\emph{Comment:} There are several alternatives to the one presented
above. One is a brute-force solution using loops. Another, more
efficient but still suboptimal, would be
\begin{code}
series y = (d>a)*x + (d<=a)*z
\end{code}
However, the ternary conditional assignment operator is not only the
most efficient way to accomplish what we want, it is also remarkably
transparent to read when one gets used to it. Some readers may find it
helpful to note that the conditional assignment operator works exactly
the same way as the \texttt{=IF()} function in spreadsheets.

\subsection{Generating a time index for panel datasets}

\emph{Problem:} gretl has a \texttt{\$unit} accessor, but not
the equivalent for time. What should I use?

\emph{Solution:}
\begin{code}
series x = time
\end{code}

\emph{Comment:} The special construct \cmd{genr time} and its variants
are aware of whether a dataset is a panel.

\subsection{Sanitizing a list of regressors}

\emph{Problem:} I noticed that built-in commands like \cmd{ols}
automatically drop collinear variables and put the constant first. How
can I achieve the same result for an estimator I'm writing?

\emph{Solution:}
No worry.
The function below does just that:
\begin{code}
function list sanitize(list X)
    list R = X - const
    if nelem(R) < nelem(X)
        R = const R
    endif
    return dropcoll(R)
end function
\end{code}
so, for example, the code below
\begin{code}
nulldata 20
x = normal()
y = normal()
z = x + y # collinear
list A = x y const z
list B = sanitize(A)

list A print
list B print
\end{code}
returns
\begin{code}
? list A print
x y const z
? list B print
const x y
\end{code}

Besides: it has been brought to our attention that some mischievous
programs out there put the constant \emph{last}, instead of first,
as God intended. We are not amused by this utter disrespect of
econometric tradition, but if you want to pursue the way of evil, it
is rather simple to adapt the script above to that effect.

\subsection{Generating the ``hat'' values after an OLS regression}
\label{sec:hatvalues}

\emph{Problem:} I've just run an OLS regression, and now I need the
so-called leverage values (also known as the ``hat'' values). I
know you can access residuals and fitted values through ``dollar''
accessors, but nothing like that seems to be available for ``hat''
values.

\emph{Solution:}
``Hat'' values can be thought of as the diagonal of the projection
matrix $P_X$, or more explicitly as
\[
h_i = \mathbf{x}_i' (X'X)^{-1} \mathbf{x}_i
\]
where $X$ is the matrix of regressors and $\mathbf{x}_i'$ is its
$i$-th row.
The reader is invited to study the code below, which offers four
different solutions to the problem:

\begin{code}
open data4-1.gdt --quiet
list X = const sqft bedrms baths
ols price X

# method 1
leverage --save --quiet
series h1 = lever

# these are necessary for what comes next
matrix mX = {X}
matrix iXX = invpd(mX'mX)

# method 2
series h2 = diag(qform(mX, iXX))
# method 3
series h3 = sumr(mX .* (mX*iXX))
# method 4
series h4 = NA
loop i=1..$nobs
    matrix x = mX[i,]'
    h4[i] = x'iXX*x
endloop

# verify
print h1 h2 h3 h4 --byobs
\end{code}
%$

\emph{Comment:} Solution 1 is the preferred one: it relies on the
built-in \cmd{leverage} command, which computes the requested series
quite efficiently, taking care of missing values, possible
restrictions to the sample, etc.

However, three more are shown for didactic purposes, mainly to show
the user how to manipulate matrices. Solution 2 first constructs the
$P_X$ matrix explicitly, via the \cmd{qform} function, and then takes
its diagonal; this is definitely \emph{not} recommended (despite its
compactness), since you generate a much bigger matrix than you
actually need and waste a lot of memory and CPU cycles in the
process. It doesn't matter very much in the present case, since the
sample size is very small, but with a big dataset this could be a very
bad idea.

Solution 3 is more clever, and relies on the fact that, if you
define $Z = X\cdot(X'X)^{-1}$, then $h_i$ can also
be written as
\[
  h_i = \mathbf{x}_i' \mathbf{z}_i = \sum_{j=1}^k x_{ij} z_{ij}
\]
which is in turn equivalent to the sum of the elements of the $i$-th
row of $X \odot Z$, where $\odot$ is the element-by-element
product. In this case, your clever usage of matrix algebra would
produce a solution computationally much superior to solution 2.
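The algebra behind solutions 2 and 3 is easy to verify numerically
outside gretl too. A sketch in Python/NumPy with a small random
regressor matrix (our own toy data, not the Ramanathan file):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 20, 3
# constant column plus two random regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

iXX = np.linalg.inv(X.T @ X)

# method 2: diagonal of the full projection matrix P_X (wasteful: n x n)
h2 = np.diag(X @ iXX @ X.T)

# method 3: row sums of X .* Z, with Z = X (X'X)^{-1} (only n x k)
h3 = (X * (X @ iXX)).sum(axis=1)

print(np.allclose(h2, h3))       # -> True
print(abs(h2.sum() - k) < 1e-8)  # trace of P_X = number of regressors
```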
Solution 4 is the most old-fashioned one, and employs an indexed
loop. While this wastes practically no memory and employs no more CPU
cycles in algebraic operations than strictly necessary, it imposes a
much greater burden on the hansl interpreter, since handling a loop is
conceptually more complex than a single operation. In practice, you'll
find that for any realistically-sized problem, solution 4 is much
slower than solution 3.

\subsection{Moving functions for time series}
\label{sec:movfun}

\emph{Problem:} gretl provides native functions for moving
averages, but I need to compute a different statistic on a sliding
data window. Is there a way to do this without using loops?

\emph{Solution:} One of the nice things about the \texttt{list} data
type is that, if you define a list, then several functions that would
normally apply ``vertically'' to elements of a series apply
``horizontally'' across the list. So, for example, the following piece
of code
\begin{code}
open bjg.gdt
order = 12
list L = lg || lags(order-1, lg)
smpl +order ;
series movmin = min(L)
series movmax = max(L)
series movmed = median(L)
smpl full
\end{code}
computes the moving minimum, maximum and median of the \texttt{lg}
series. Plotting the four series would produce something similar to
figure~\ref{fig:movfun}.

\begin{figure}[hbtp]
  \centering
  \includegraphics{figures/movfun}
  \caption{``Moving'' functions}
  \label{fig:movfun}
\end{figure}

\subsection{Generating data with a prescribed correlation structure}
\label{sec:CholeskyTrick}

\emph{Problem:} I'd like to generate a bunch of normal random variates
whose covariance matrix is exactly equal to a given matrix
$\Sigma$. How can I do this in gretl?

\emph{Solution:} The Cholesky decomposition is your friend.
If you
want to generate data with a given \emph{population} covariance
matrix, then all you have to do is post-multiply your pseudo-random
data by the Cholesky factor (transposed) of the matrix you want. For
example:
\begin{code}
set seed 123
S = {2,1;1,1}
T = 1000
X = mnormal(T, rows(S))
X = X * cholesky(S)'
eval mcov(X)
\end{code}
should give you
\begin{code}
? eval mcov(X)
      2.0016       1.0157
      1.0157       1.0306
\end{code}

If, instead, you want your simulated data to have a given \emph{sample}
covariance matrix, you have to apply the same technique twice: once for
standardizing the data, once for giving it the covariance
structure you want. Example:
\begin{code}
S = {2,1;1,1}
T = 1000
X = mnormal(T, rows(S))
X = X * (cholesky(S)/cholesky(mcov(X)))'
eval mcov(X)
\end{code}
gives you
\begin{code}
? eval mcov(X)
   2   1
   1   1
\end{code}
as required.

\section{Neat tricks}
\label{sec:cheat-neat}

\subsection{Interaction dummies}

\emph{Problem:} You want to estimate the model $y_i = \mathbf{x}_i
\beta_1 + \mathbf{z}_i \beta_2 + d_i \beta_3 + (d_i \cdot \mathbf{z}_i)
\beta_4 + \varepsilon_i$, where $d_i$ is a dummy variable while
$\mathbf{x}_i$ and $\mathbf{z}_i$ are vectors of explanatory
variables.

\emph{Solution:} As of version 1.9.12, gretl provides the
\verb|^| operator to make this operation easy. See section
\ref{sec:transform-lists} for details (especially example script
\ref{ex:interact_list}). But back in my day, we used loops to do that!
Here's how:

\begin{code}
list X = x1 x2 x3
list Z = z1 z2
list dZ = null
loop foreach i Z
    series d$i = d * $i
    list dZ = dZ d$i
endloop

ols y X Z d dZ
\end{code}
%$

\emph{Comment:} It's amazing what string substitution can do for
you, isn't it?
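For readers who think in matrix terms, the loop above just builds the
columns $d_i \cdot z_{ij}$. A sketch of the same construction in
Python/NumPy (toy data of our own, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
d = rng.integers(0, 2, size=n)  # 0/1 dummy
Z = rng.normal(size=(n, 2))     # two z-regressors

# multiply every column of Z by the dummy in one broadcast step,
# mirroring the hansl loop "series d$i = d * $i"
dZ = d[:, None] * Z

# rows where d == 0 are zeroed out; rows where d == 1 copy Z
print(np.allclose(dZ[d == 0], 0))          # -> True
print(np.allclose(dZ[d == 1], Z[d == 1]))  # -> True
```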
\subsection{Realized volatility}

\emph{Problem:} Given data by the minute, you want to compute the
``realized volatility'' for the hour as $RV_t = \frac{1}{60}
\sum_{\tau=1}^{60} y_{t:\tau}^2$. Imagine your sample starts at time
1:1.

\emph{Solution:}
\begin{code}
smpl --full
genr time
series minute = int(time/60) + 1
series second = time % 60
setobs minute second --panel
series rv = psd(y)^2
setobs 1 1
smpl second==1 --restrict
store foo rv
\end{code}

\emph{Comment:} Here we trick gretl into thinking that our
dataset is a panel dataset, where the minutes are the ``units'' and
the seconds are the ``time''; this way, we can take advantage of the
special function \cmd{psd()}, panel standard deviation. Then we
simply drop all observations but one per minute and save the resulting
data (\texttt{store foo rv} translates as ``store in the gretl
datafile \texttt{foo.gdt} the series \texttt{rv}'').

\subsection{Looping over two paired lists}

\emph{Problem:} Suppose you have two lists with the same number of
elements, and you want to apply some command to corresponding elements
over a loop.

\emph{Solution:}
\begin{code}
list L1 = a b c
list L2 = x y z

k1 = 1
loop foreach i L1
    k2 = 1
    loop foreach j L2
        if k1 == k2
            ols $i 0 $j
        endif
        k2++
    endloop
    k1++
endloop
\end{code}

\emph{Comment:} The simplest way to achieve the result is to loop over
all possible combinations and filter out the unneeded ones via an
\texttt{if} condition, as above. That said, in some cases variable
names can help.
For example, if
\begin{code}
list Lx = x1 x2 x3
list Ly = y1 y2 y3
\end{code}
then we could just loop over the integers---quite intuitive and certainly
more elegant:
\begin{code}
loop i=1..3
    ols y$i const x$i
endloop
\end{code}

\subsection{Convolution / polynomial multiplication}

\emph{Problem:} How do I multiply polynomials? There's no dedicated
function to do that, and yet it's a fairly basic mathematical task.

\emph{Solution:} Never fear! We have the \cmd{conv2d} function, which
is a tool for a more general problem, but includes polynomial
multiplication as a special case.

Suppose you want to multiply two finite-order polynomials $P(x) =
\sum_{i=0}^m p_i x^i$ and $Q(x) = \sum_{i=0}^n q_i x^i$. What you want
is the sequence of coefficients of the polynomial
\[
  R(x) = P(x) \cdot Q(x) = \sum_{k=0}^{m+n} r_k x^k
\]
where
\[
  r_k = \sum_{i=0}^k p_i q_{k-i}
\]
is the \emph{convolution} of the $p_i$ and $q_i$ coefficients. The
same operation can be performed via the FFT, but in most cases using
\cmd{conv2d} is quicker and more natural.

As an example, we'll use the same one as in Section
\ref{sec:genr-fft}: consider the multiplication of the two polynomials
\begin{eqnarray*}
  P(x) & = & 1 + 0.5 x \\
  Q(x) & = & 1 + 0.3 x - 0.8 x^2 \\
  R(x) = P(x) \cdot Q(x) & = & 1 + 0.8 x - 0.65 x^2 - 0.4 x^3
\end{eqnarray*}
The following code snippet performs all the necessary calculations:
\begin{code}
p = {1; 0.5}
q = {1; 0.3; -0.8}
r = conv2d(p, q)
print r
\end{code}
Running the above produces
\begin{code}
r (4 x 1)

      1
    0.8
  -0.65
   -0.4
\end{code}
which is indeed the desired result. Note that the same computation
could also be performed via the \cmd{filter} function, at the price of
slightly more elaborate syntax.
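The same convolution can be cross-checked in Python, where NumPy's
\texttt{convolve} plays a role analogous to \cmd{conv2d} for
one-dimensional inputs:

```python
import numpy as np

p = [1, 0.5]        # P(x) = 1 + 0.5x
q = [1, 0.3, -0.8]  # Q(x) = 1 + 0.3x - 0.8x^2

# full convolution of the coefficient sequences = product polynomial
r = np.convolve(p, q)

# matches the coefficients of R(x) computed in the text
print(np.allclose(r, [1, 0.8, -0.65, -0.4]))  # -> True
```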
\subsection{Comparing two lists}

\emph{Problem:} How can I tell if two lists contain the same variables
(not necessarily in the same order)?

\emph{Solution:} In many respects, lists are like sets, so it makes
sense to use the so-called ``symmetric difference'' operator, which is
defined as
\[
A\,\triangle\,B = (A \setminus B) \cup (B \setminus A)
\]
where in this context the backslash represents the relative complement
operator, such that
\[
A \setminus B = \left\{ x \in A \,|\, x \not\in B \right\}
\]

In practice we first check whether there are any series in $A$ but not
in $B$, then we perform the reverse check. If the union of the two
results is an empty set, then the lists must contain the same
variables. The hansl syntax for this would be something like

\begin{code}
scalar NotTheSame = nelem((A-B) || (B-A)) > 0
\end{code}

\subsection{Reordering list elements}

\emph{Problem:} Is there a way to reorder list elements?

\emph{Solution:} You can use the fact that a list can be cast into a
vector of integers and then manipulated via ordinary matrix syntax.
So, for example, if you want to ``flip'' a list you can just use the
\cmd{mreverse} function:
\begin{code}
open AWM.gdt --quiet
list X = 3 6 9 12
matrix tmp = X
list revX = mreverse(tmp')

list X print
list revX print
\end{code}
This will produce
\begin{code}
? list X print
D1 D872 EEN_DIS GCD
? list revX print
GCD EEN_DIS D872 D1
\end{code}

\subsection{Plotting an asymmetric confidence interval}

\emph{Problem:} ``I like the look of the \option{band} option to the
\cmd{gnuplot} and \cmd{plot} commands, but it's set up for plotting a
symmetric interval and I want to show an asymmetric one.''

\emph{Solution:} Any interval is by construction symmetrical about its
mean at each observation.
So you just need to perform a little tweak. Say you want to plot a
series \texttt{x} along with a band defined by the two series
\texttt{top} and \texttt{bot}. Here we go:
\begin{code}
# create series for mid-point and deviation
series mid = (top + bot)/2
series dev = top - mid
gnuplot x --band=mid,dev --time-series --with-lines --output=display
\end{code}

\subsection{Cross-validation}
\label{sec:xvalid}

\emph{Problem:} ``I'd like to compute the so-called \emph{leave-one-out
cross-validation criterion} for my regression. Is there a command in
gretl?''

\emph{Solution:} If you have a sample with $n$ observations, the
``leave-one-out'' cross-validation criterion can be mechanically
computed by running $n$ regressions in which one observation at a time
is omitted and all the other ones are used to forecast its value. The
sum of the $n$ squared forecast errors is the statistic we want.
Fortunately, there is no need to actually run $n$ regressions: it is
possible to prove that the same statistic can be computed as
\[
CV = \sum_{i=1}^n [\hat{u}_{i}/(1-h_{i})]^2,
\]
where $h_i$ is the $i$-th element of the ``hat'' matrix (see section
\ref{sec:hatvalues}) from a regression on the whole sample.

This method is natively provided by gretl as a side benefit of the
\cmd{leverage} command, which stores the CV criterion into the
\dollar{test} accessor. The following script shows the equivalence of
the two approaches:
\begin{code}
set verbose off
open data4-1.gdt
list X = const sqft bedrms baths

# compute the CV criterion the silly way

scalar CV = 0
matrix mX = {X}
loop i = 1 .. $nobs
    xi = mX[i,]
    yi = price[i]
    smpl obs != i --restrict
    ols price X --quiet
    smpl full
    scalar fe = yi - xi * $coeff
    CV += fe^2
endloop

printf "CV = %g\n", CV

# the smart way

ols price X --quiet
leverage --quiet
printf "CV = %g\n", $test
\end{code}

\subsection{Is my matrix result broken?}
\label{sec:brokenmat}

\emph{Problem:} ``Most of the matrix manipulation functions available
in gretl flag an error if something goes wrong, but there's no
guarantee that every matrix computation will return an entirely finite
matrix, containing no infinities or \texttt{NaN}s. So how do I tell if
I've got a fully valid matrix?''

\emph{Solution:} Given a matrix \texttt{m}, the call
``\texttt{ok(m)}'' returns a matrix with the same dimensions as
\texttt{m}, with elements 1 for finite values and 0 for infinities or
\texttt{NaN}s. A matrix as a whole is OK if it has no elements which
fail this test, so here's a suitable check for a ``broken'' matrix,
using the logical NOT operator, ``\texttt{!}'':
\begin{code}
sumc(sumr(!ok(m))) > 0
\end{code}
If this gives a non-zero return value you know that \texttt{m}
contains at least one non-finite element.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: