\chapter{String-valued series}
\label{chap:strval-series}

\section{Introduction}

Gretl's support for data series with string values has gone through
three phases:
\begin{enumerate}
\item No support: we simply rejected non-numerical values when reading
  data from file.
\item Numeric encoding only: we would read a string-valued series from
  a delimited text data file (provided the series didn't mix numerical
  values and strings) but the representation of the data within gretl
  was purely numerical. We printed a ``string table'' showing the
  mapping between the original strings and gretl's encoding and it was
  up to the user to keep track of this mapping.
\item Preservation of string values: the string table that we
  construct in reading a string-valued series is now stored as a
  component of the dataset so it's possible to display and manipulate
  these values within gretl.
\end{enumerate}

The third phase has now been in effect for several years, with a
series of gradual refinements. This chapter gives an account of the
status quo. It explains how to create string-valued series and
describes the operations that are supported for such series.

\section{Creating a string-valued series}

This can be done in two ways: first, by reading such a series from a
suitable source file and second, by taking a suitable numerical series
within gretl and adding string values using the \cmd{stringify()}
function. In either case string values will be preserved when such
a series is saved in a gretl-native data file.

\subsection{Reading string-valued series}
\label{sec:reading}

The primary ``suitable source'' for string-valued series is a
delimited text data file (but see section~\ref{sec:other-imports}
below). Here's a little example.
The following is the content of a
file named \texttt{gc.csv}:
%
\begin{code}
city,year
"Bilbao",2009
"Toruń",2011
"Oklahoma City",2013
"Berlin",2015
"Athens",2017
"Naples",2019
\end{code}
%
and here's a script:
%
\begin{code}
open gc.csv --quiet
print --byobs
print city --byobs --numeric
printf "The third gretl conference took place in %s.\n", city[3]
\end{code}

The output from the script is:
%
\begin{code}
? print --byobs

          city         year

1       Bilbao         2009
2        Toruń         2011
3 Oklahoma C..         2013
4       Berlin         2015
5       Athens         2017
6       Naples         2019

? print city --byobs --numeric

          city

1            1
2            2
3            3
4            4
5            5
6            6

The third gretl conference took place in Oklahoma City.
\end{code}

From this we can see a few things.
\begin{itemize}
\item By default the \cmd{print} command shows us the string values
  of the series \texttt{city}, and it handles non-ASCII characters
  provided they're in UTF-8 (but it doesn't handle longer strings
  very elegantly).
\item The \verb|--numeric| option to \cmd{print} exposes the
  numeric codes for a string-valued series.
\item The syntax \texttt{seriesname[obs]} gives a string when a series
  is string-valued.
\end{itemize}

Suppose you want to access the numeric code for a particular
string-valued observation: you can get that by ``casting'' the series
to a vector. Thus
\begin{code}
printf "The code for '%s' is %d.\n", city[3], {city}[3]
\end{code}
gives
\begin{code}
The code for 'Oklahoma City' is 3.
\end{code}

The numeric codes for string-valued series are always assigned thus:
reading the data file row by row, the first string value is assigned
1, the next \textit{distinct} string value is assigned 2, and so on.
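
To illustrate the coding rule, here is a small sketch using a
hypothetical file \texttt{fruit.csv} (invented for this illustration)
in which one string occurs twice:
\begin{code}
fruit,qty
"apple",3
"pear",5
"apple",2
"kiwi",7
\end{code}
A script along the following lines prints the codes:
\begin{code}
# fruit.csv is a hypothetical file, assumed to be in the working directory
open fruit.csv --quiet
print fruit --byobs --numeric
\end{code}
By the rule just stated the codes should be 1, 2, 1, 3: the second
occurrence of \texttt{apple} reuses code 1, and \texttt{kiwi}, being
the third distinct string, gets code 3.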

\subsection{Assigning string values to an existing series}
\label{sec:stringify}

This is done via the \cmd{stringify()} function, which takes two
arguments, the name of a series and an array of strings. For this to
work two conditions must be met:

\begin{enumerate}
\item The series must have only integer values and the smallest value
  must be 1 or greater.
\item The array of strings must have at least $n$ distinct members,
  where $n$ is the largest value found in the series.
\end{enumerate}

The logic of these conditions is that we're looking to create a
mapping as described above, from a 1-based sequence of integers to a
set of strings. However, we're allowing for the possibility that the
series in question is an incomplete sample from an associated
population. Suppose we have a series that goes 2, 3, 5, 9, 10. This is
taken to be a sample from a population that has at least 10 discrete
values, 1, 2, \dots{}, 10, and so requires at least 10 value-strings.

Here's (a simplified version of) an example that one of the authors
has had cause to use: deriving US-style ``letter grades'' from a
series containing percentage scores for students. Call the percentage
series $x$, and say we want to create a series with values \texttt{A}
for $x \geq 90$, \texttt{B} for $80 \leq x < 90$, and so on down to
\texttt{F} for $x < 60$. Then we can do:
\begin{code}
series grade = 1 # F, the least value
grade += x >= 60 # D
grade += x >= 70 # C
grade += x >= 80 # B
grade += x >= 90 # A
stringify(grade, strsplit("F D C B A"))
\end{code}
%
The way the \texttt{grade} series is constructed is not the most
compact, but it's nice and explicit, and easy to amend if one wants to
adjust the threshold values.
Note the use of \cmd{strsplit()} to
create an on-the-fly array of strings from a string literal; this is
convenient when the array contains a moderate number of elements with
no embedded spaces. An alternative way to get the same result is to
define the array of strings via the \cmd{defarray()} function, as in
\begin{code}
stringify(grade, defarray("F","D","C","B","A"))
\end{code}

The inverse operation of \cmd{stringify()} can be performed by the
\cmd{strvals()} function: this retrieves the array of string values
from a series (or returns an empty array if the series is not
string-valued).

\section{Permitted operations}

One question that arises with string-valued series is, what are you
allowed to do with them and what is banned? This is a debatable point,
but here we set out the current state of things.

\subsection{Setting values per observation}

You can set particular values in a string-valued series either by
string or numeric code. For example, suppose (in relation to the
example in section~\ref{sec:stringify}) that for some reason student
number 31 with a percentage score of 88 nonetheless merits an
\texttt{A} grade. We could do
\begin{code}
grade[31] = "A"
\end{code}
or, if we're confident about the mapping,
\begin{code}
grade[31] = 5
\end{code}
Or to raise the student's grade by one letter:
\begin{code}
grade[31] += 1
\end{code}

What you're \textit{not} allowed to do here is make a numerical
adjustment that would put the value out of bounds in relation to the
set of string values. For example, if we tried \texttt{grade[31] = 6}
we'd get an error.

On the other hand, you \textit{can} implicitly extend the set of
string values. This wouldn't make sense for the letter grades example
but it might for, say, city names.
Returning to the example in
section~\ref{sec:reading} suppose we try
%
\begin{code}
dataset addobs 1
year[7] = 2021
city[7] = "London?"
\end{code}
%
This will work: we're implicitly adding another member to the string
table for \texttt{city}; the associated numeric code will be the next
available integer.\footnote{Admittedly there is a downside to this
  feature: one may inadvertently add a new string value by mistyping a
  string that's already present.}

\subsection{Logical product of two string-valued series}

The operator \verb|^| can be used to produce what we might call the
logical product of two string-valued series, as in
\begin{code}
series sv3 = sv1 ^ sv2
\end{code}
The result is another string-valued series with value $s_i.s_j$ at
observations where \texttt{sv1} has value $s_i$ and \texttt{sv2} has
value $s_j$. For example, if at a given observation \texttt{sv1} has
value ``\texttt{A}'' and \texttt{sv2} has value ``\texttt{X}'', then
\texttt{sv3} will have value ``\texttt{A.X}''. The set of strings
attached to the resulting series will include all such string
combinations even if they are not all represented in the given sample.

\subsection{Assignment to an entire series}

Other than the ``logical product'' case described above, this is
disallowed at present: you can't execute an assignment with the name
of a string-valued series \textit{per se} on the left-hand side. Put
differently, you cannot overwrite an entire string-valued series at
once. While this is debatable, it's the easiest way of ensuring that
we never end up with a broken mapping. It's possible this restriction
may be relaxed in future.

Besides assigning an out-of-bounds numerical value to a particular
observation, this sort of assignment is in fact the only operation
that is banned for string-valued series.
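
If the aim is to obtain a series carrying a revised set of strings,
one possible workaround is to leave the original series alone and
build a new one from its numeric codes. The following is a sketch
only, based on the \texttt{city} series from
section~\ref{sec:reading}; the names \texttt{city2} and \texttt{S}
and the replacement label are illustrative.
\begin{code}
series city2 = city        # a new series holding the numeric codes
strings S = strvals(city)  # the strings attached to city
S[3] = "OKC"               # revise one label; labels must stay distinct
stringify(city2, S)        # attach the revised strings to the new series
\end{code}
Nothing is assigned to \texttt{city} itself here, so its mapping
between codes and strings is untouched.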

\subsection{Missing values}

We support one exception to the general rule, never break the mapping
between strings and numeric codes for string-valued series: you can
mark particular observations as missing. This is done in the usual
way, e.g.,
\begin{code}
grade[31] = NA
\end{code}
Note, however, that on importing a string series from a delimited text
file any non-blank strings (including ``NA'') will be interpreted as
valid values; any missing values in such a file should therefore be
represented by blank cells.

\subsection{Copying a string-valued series}

If you make a copy of a string-valued series, as in
\begin{code}
series foo = city
\end{code}
the string values are \textit{not} copied over: you get a purely
numerical series holding the codes of the original series. But if you
want a full copy with the string values that can easily be arranged:
\begin{code}
series citycopy = city
stringify(citycopy, strvals(city))
\end{code}

\subsection{String-valued series in other contexts}

String-valued series can be used on the right-hand side of assignment
statements at will, and in that context their numerical values are
taken. For example,
%
\begin{code}
series y = sqrt(city)
\end{code}
%
will elicit no complaint and generate a numerical series 1, 1.41421,
\dots{}. It's up to the user to judge whether this sort of thing
makes any sense.

Similarly, it's up to the user to decide if it makes sense to use a
string-valued series ``as is'' in a regression model, whether as
regressand or regressor---again, the numerical values of the series
are taken. Often this will not make sense, but sometimes it may: the
numerical values may by design form an ordinal, or even a cardinal,
scale (as in the ``grade'' example in section~\ref{sec:stringify}).
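
As a minimal sketch of the ``as is'' case, suppose (hypothetically)
that the dataset holding \texttt{grade} also contains a numerical
series \texttt{hours}, recording weekly study hours. Since
\texttt{grade} runs from 1 for \texttt{F} to 5 for \texttt{A}, it
could be entered directly as the dependent variable:
\begin{code}
# hours is a hypothetical numerical series, assumed for illustration
ols grade const hours
\end{code}
The coefficient on \texttt{hours} is then measured in units of the
numeric code, that is, in letter-grade steps per additional hour,
which presumes that the steps between letters are equally spaced.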

More likely, one would want to use \cmd{dummify} on a string-valued
series before using it in statistical modeling. In that context
gretl's series labels are suitably informative. For example, suppose
we have a series \texttt{race} with numerical values 1, 2 and 3 and
associated strings ``White'', ``Black'' and ``Other''. Then the hansl
code
\begin{code}
list D = dummify(race)
labels
\end{code}
will show these labels:
\begin{code}
Drace_2: dummy for race = 'Black'
Drace_3: dummy for race = 'Other'
\end{code}

Given such a series you can use string values in a sample restriction,
as in
\begin{code}
smpl race == "Black" --restrict
\end{code}
(although \texttt{race == 2} would also be acceptable).

There may be other contexts that we haven't yet thought of where it
would be good to have string values displayed and/or accepted on
input; suggestions are welcome.

\section{String-valued series and functions}

User-defined hansl functions can deal with string-valued series,
although there are a few points to note.

If you supply such a series as an argument to a hansl function its
string values will be accessible within the function. One can test
whether a given series \texttt{arg} is string-valued as follows:
\begin{code}
if nelem(strvals(arg)) > 0
   # yes
else
   # no
endif
\end{code}

Now suppose one wanted to put something like the code that generated
the \texttt{grade} series in section~\ref{sec:stringify} into a
function.
That can be done, but \textit{not} in the form of a function
that directly returns the desired series---that is, something like
\begin{code}
function series letter_grade (series x)
    series grade
    # define grade based on x and stringify it, as shown above
    return grade
end function
\end{code}
%
Unfortunately the above will \emph{not} work: the caller will get the
\texttt{grade} series OK but it won't be string-valued. At first sight
this may seem to be a bug but it's defensible as a consequence of the
way series work in gretl.

The point is that series have, so to speak, two grades of
existence. They can exist as fully-fledged members of a dataset, or
they can have a fleeting existence as simply anonymous arrays of
numbers that are of the same length as dataset series. Consider the
statement
\begin{code}
series rootx1 = sqrt(x+1)
\end{code}
On the right-hand side we have the ``series'' \texttt{x+1}, which is
called into existence as part of a calculation but has no name and
cannot have string values. Similarly, consider
\begin{code}
series grade = letter_grade(x)
\end{code}
The return value from \verb|letter_grade()| is likewise an anonymous
array,\footnote{A proper named series, with string values, existed
  while the function was executing but it ceased to exist as soon as
  the function was finished.} incapable of holding string values
\textit{until} it gets assigned to the named series
\texttt{grade}. The solution is to define \texttt{grade} as a series,
at the level of the caller, before calling \verb|letter_grade()|, as
in
%
\begin{code}
function void letter_grade (series x, series *grade)
    # define grade based on x and stringify it
    # this version will work!
end function

# caller
...
series grade
letter_grade(x, &grade)
\end{code}

As you'll see from the account above, we don't offer any very fancy
facilities for string-valued series. We'll read them from suitable
sources and we'll create them natively via \cmd{stringify}---and
we'll try to ensure that they retain their integrity---but we don't,
for example, take the specification of a string-valued series as a
regressor as an implicit request to include the dummification of its
distinct values. Besides laziness, this reflects the fact that in
gretl a string-valued series \textit{may} be usable ``as is'',
depending on how it's defined; you can use \cmd{dummify} if you
need it.

\section{Other import formats}
\label{sec:other-imports}

In section~\ref{sec:reading} we illustrated the reading of
string-valued series with reference to a delimited text data
file. Gretl can also handle several other sources of string-valued
data, including the spreadsheet formats \texttt{xls}, \texttt{xlsx},
\texttt{gnumeric} and \texttt{ods} and (to a degree) the formats of
\textsf{Stata}, \textsf{SAS} and \textsf{SPSS}.

\subsection{Stata files}

Stata supports two relevant sorts of variables: (1) those that are of
``string type'' and (2) variables of one or other numeric type that
have ``value labels'' defined. Neither of these is exactly equivalent
to what we call a ``string-valued series'' in gretl.

Stata variables of string type have no numeric representation; their
values are literally strings, and that's all. Stata's numeric
variables with value labels do not have to be integer-valued and their
least value does not have to be 1; however, you can't define a label
for a value that is not an integer.
Thus in Stata you can have a
series that comprises both integer and non-integer values, but only
the integer values can be labeled.\footnote{Verified in Stata 12.}

This means that on import to gretl we can readily handle variables of
string type from Stata's \texttt{dta} files. We give them a 1-based
numeric encoding; this is arbitrary but does not conflict with any
information in the \texttt{dta} file. On the other hand, in general
we're not able to handle Stata's numeric variables with value labels;
currently we report the value labels to the user but do not attempt to
store them in the gretl dataset. We could check such variables and
import them as string-valued series if they satisfy the criteria
stated in section~\ref{sec:stringify} but we don't at present.

\subsection{SAS and SPSS files}

Gretl is able to read and preserve string values associated with
variables from SAS ``export'' (\texttt{xpt}) files, and also from SPSS
\texttt{sav} files. Such variables seem to be on the same pattern as
Stata variables of string type.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: