\chapter{String-valued series}
\label{chap:strval-series}

\section{Introduction}

Gretl's support for data series with string values has gone through
three phases:

\begin{enumerate}
\item No support: we simply rejected non-numerical values when reading
  data from file.
\item Numeric encoding only: we would read a string-valued series from
  a delimited text data file (provided the series didn't mix numerical
  values and strings) but the representation of the data within gretl
  was purely numerical. We printed a ``string table'' showing the
  mapping between the original strings and gretl's encoding and it was
  up to the user to keep track of this mapping.
\item Preservation of string values: the string table that we
  construct in reading a string-valued series is now stored as a
  component of the dataset so it's possible to display and manipulate
  these values within gretl.
\end{enumerate}

The third phase has now been in effect for several years, with a
series of gradual refinements. This chapter gives an account of the
status quo. It explains how to create string-valued series and
describes the operations that are supported for such series.

\section{Creating a string-valued series}

This can be done in two ways: first, by reading such a series from a
suitable source file and second, by taking a suitable numerical series
within gretl and adding string values using the \cmd{stringify()}
function. In either case string values will be preserved when such a
series is saved in a gretl-native data file.

\subsection{Reading string-valued series}
\label{sec:reading}

The primary ``suitable source'' for string-valued series is a
delimited text data file (but see section~\ref{sec:other-imports}
below). Here's a little example.
The following is the content of a file named \texttt{gc.csv}:
%
\begin{code}
city,year
"Bilbao",2009
"Toruń",2011
"Oklahoma City",2013
"Berlin",2015
"Athens",2017
"Naples",2019
\end{code}
%
and here's a script:
%
\begin{code}
open gc.csv --quiet
print --byobs
print city --byobs --numeric
printf "The third gretl conference took place in %s.\n", city[3]
\end{code}

The output from the script is:
%
\begin{code}
? print --byobs

          city         year

1       Bilbao         2009
2        Toruń         2011
3 Oklahoma C..         2013
4       Berlin         2015
5       Athens         2017
6       Naples         2019

? print city --byobs --numeric

          city

1            1
2            2
3            3
4            4
5            5
6            6

The third gretl conference took place in Oklahoma City.
\end{code}

From this we can see a few things.
\begin{itemize}
\item By default the \cmd{print} command shows us the string values of
  the series \texttt{city}, and it handles non-ASCII characters
  provided they're in UTF-8 (but it doesn't handle longer strings very
  elegantly).
\item The \verb|--numeric| option to \cmd{print} exposes the numeric
  codes for a string-valued series.
\item The syntax \texttt{seriesname[obs]} gives a string when a series
  is string-valued.
\end{itemize}

Suppose you want to access the numeric code for a particular
string-valued observation: you can get that by ``casting'' the series
to a vector. Thus
\begin{code}
printf "The code for '%s' is %d.\n", city[3], {city}[3]
\end{code}
gives
\begin{code}
The code for 'Oklahoma City' is 3.
\end{code}

The numeric codes for string-valued series are always assigned thus:
reading the data file row by row, the first string value is assigned
1, the next \textit{distinct} string value is assigned 2, and so on.

\subsection{Assigning string values to an existing series}
\label{sec:stringify}

This is done via the \cmd{stringify()} function, which takes two
arguments, the name of a series and an array of strings. For this to
work two conditions must be met:

\begin{enumerate}
\item The series must have only integer values and the smallest value
  must be 1 or greater.
\item The array of strings must have at least $n$ distinct members,
  where $n$ is the largest value found in the series.
\end{enumerate}

The logic of these conditions is that we're looking to create a
mapping as described above, from a 1-based sequence of integers to a
set of strings. However, we're allowing for the possibility that the
series in question is an incomplete sample from an associated
population. Suppose we have a series that goes 2, 3, 5, 9, 10. This is
taken to be a sample from a population that has at least 10 discrete
values, 1, 2, \dots{}, 10, and so requires at least 10 value-strings.

Here's (a simplified version of) an example that one of the authors
has had cause to use: deriving US-style ``letter grades'' from a
series containing percentage scores for students. Call the percentage
series $x$, and say we want to create a series with values \texttt{A}
for $x \geq 90$, \texttt{B} for $80 \leq x < 90$, and so on down to
\texttt{F} for $x < 60$. Then we can do:
\begin{code}
series grade = 1 # F, the least value
grade += x >= 60 # D
grade += x >= 70 # C
grade += x >= 80 # B
grade += x >= 90 # A
stringify(grade, strsplit("F D C B A"))
\end{code}
%
The way the \texttt{grade} series is constructed is not the most
compact, but it's nice and explicit, and easy to amend if one wants to
adjust the threshold values. Note the use of \cmd{strsplit()} to
create an on-the-fly array of strings from a string literal; this is
convenient when the array contains a moderate number of elements with
no embedded spaces.

An alternative way to get the same result is to define the array of
strings via the \cmd{defarray()} function, as in
\begin{code}
stringify(grade, defarray("F","D","C","B","A"))
\end{code}

The inverse operation of \cmd{stringify()} can be performed by the
\cmd{strvals()} function: this retrieves the array of string values
from a series (or returns an empty array if the series is not
string-valued).
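
As a minimal sketch of the round trip between \cmd{stringify()} and
\cmd{strvals()}, the following script builds a tiny artificial dataset
(the five scores in \texttt{x} are made up for illustration), attaches
the letter grades and then retrieves them. It assumes that a column
vector of the right length can be assigned directly to a series; going
by the description above, the two \cmd{printf} statements should
report 5 and 0 string values respectively.
\begin{code}
# artificial data: five hypothetical percentage scores
nulldata 5
series x = {55, 65, 75, 85, 95}'

# build the 1-based codes and attach the letter grades, as above
series grade = 1
grade += x >= 60
grade += x >= 70
grade += x >= 80
grade += x >= 90
stringify(grade, strsplit("F D C B A"))

# retrieve the string table from the string-valued series
strings S = strvals(grade)
printf "grade has %d string values\n", nelem(S)

# a purely numerical series should yield an empty array
strings T = strvals(x)
printf "x has %d string values\n", nelem(T)
\end{code}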
\section{Permitted operations}

One question that arises with string-valued series is, what are you
allowed to do with them and what is banned? This is a debatable point,
but here we set out the current state of things.

\subsection{Setting values per observation}

You can set particular values in a string-valued series either by
string or numeric code. For example, suppose (in relation to the
example in section~\ref{sec:stringify}) that for some reason student
number 31 with a percentage score of 88 nonetheless merits an
\texttt{A} grade. We could do
\begin{code}
grade[31] = "A"
\end{code}
or, if we're confident about the mapping,
\begin{code}
grade[31] = 5
\end{code}
Or to raise the student's grade by one letter:
\begin{code}
grade[31] += 1
\end{code}

What you're \textit{not} allowed to do here is make a numerical
adjustment that would put the value out of bounds in relation to the
set of string values. For example, if we tried \texttt{grade[31] = 6}
we'd get an error.

On the other hand, you \textit{can} implicitly extend the set of
string values. This wouldn't make sense for the letter grades example
but it might for, say, city names. Returning to the example in
section~\ref{sec:reading}, suppose we try
%
\begin{code}
dataset addobs 1
year[7] = 2021
city[7] = "London?"
\end{code}
%
This will work: we're implicitly adding another member to the string
table for \texttt{city}; the associated numeric code will be the next
available integer.\footnote{Admittedly there is a downside to this
  feature: one may inadvertently add a new string value by mistyping a
  string that's already present.}

\subsection{Logical product of two string-valued series}

The operator \verb|^| can be used to produce what we might call the
logical product of two string-valued series, as in
\begin{code}
series sv3 = sv1 ^ sv2
\end{code}
The result is another string-valued series with value $s_i.s_j$ at
observations where \texttt{sv1} has value $s_i$ and \texttt{sv2} has
value $s_j$.
For example, if at a given observation \texttt{sv1} has value
``\texttt{A}'' and \texttt{sv2} has value ``\texttt{X}'', then
\texttt{sv3} will have value ``\texttt{A.X}''. The set of strings
attached to the resulting series will include all such string
combinations even if they are not all represented in the given sample.

\subsection{Assignment to an entire series}

Other than the ``logical product'' case described above, this is
disallowed at present: you can't execute an assignment with the name
of a string-valued series \textit{per se} on the left-hand side. Put
differently, you cannot overwrite an entire string-valued series at
once. While this is debatable, it's the easiest way of ensuring that
we never end up with a broken mapping. It's possible this restriction
may be relaxed in future.

Besides assigning an out-of-bounds numerical value to a particular
observation, this sort of assignment is in fact the only operation
that is banned for string-valued series.

\subsection{Missing values}

We support one exception to the general rule (never break the mapping
between strings and numeric codes for string-valued series): you can
mark particular observations as missing. This is done in the usual
way, e.g.,
\begin{code}
grade[31] = NA
\end{code}
Note, however, that on importing a string series from a delimited text
file any non-blank strings (including ``NA'') will be interpreted as
valid values; any missing values in such a file should therefore be
represented by blank cells.

\subsection{Copying a string-valued series}

If you make a copy of a string-valued series, as in
\begin{code}
series foo = city
\end{code}
the string values are \textit{not} copied over: you get a purely
numerical series holding the codes of the original series.
But if you want a full copy with the string values, that can easily be
arranged:
\begin{code}
series citycopy = city
stringify(citycopy, strvals(city))
\end{code}

\subsection{String-valued series in other contexts}

String-valued series can be used on the right-hand side of assignment
statements at will, and in that context their numerical values are
taken. For example,
%
\begin{code}
series y = sqrt(city)
\end{code}
%
will elicit no complaint and generate a numerical series 1, 1.41421,
\dots{}. It's up to the user to judge whether this sort of thing makes
any sense.

Similarly, it's up to the user to decide if it makes sense to use a
string-valued series ``as is'' in a regression model, whether as
regressand or regressor---again, the numerical values of the series
are taken. Often this will not make sense, but sometimes it may: the
numerical values may by design form an ordinal, or even a cardinal,
scale (as in the ``grade'' example in section~\ref{sec:stringify}).

More likely, one would want to use \cmd{dummify} on a string-valued
series before using it in statistical modeling. In that context
gretl's series labels are suitably informative. For example, suppose
we have a series \texttt{race} with numerical values 1, 2 and 3 and
associated strings ``White'', ``Black'' and ``Other''. Then the hansl
code
\begin{code}
list D = dummify(race)
labels
\end{code}
will show these labels:
\begin{code}
Drace_2: dummy for race = 'Black'
Drace_3: dummy for race = 'Other'
\end{code}

Given such a series you can use string values in a sample restriction,
as in
\begin{code}
smpl race == "Black" --restrict
\end{code}
(although \texttt{race == 2} would also be acceptable).

There may be other contexts that we haven't yet thought of where it
would be good to have string values displayed and/or accepted on
input; suggestions are welcome.

\section{String-valued series and functions}

User-defined hansl functions can deal with string-valued series,
although there are a few points to note.
If you supply such a series as an argument to a hansl function its
string values will be accessible within the function. One can test
whether a given series \texttt{arg} is string-valued as follows:
\begin{code}
if nelem(strvals(arg)) > 0
    # yes
else
    # no
endif
\end{code}

Now suppose one wanted to put something like the code that generated
the \texttt{grade} series in section~\ref{sec:stringify} into a
function. That can be done, but \textit{not} in the form of a function
that directly returns the desired series---that is, something like
\begin{code}
function series letter_grade (series x)
    series grade
    # define grade based on x and stringify it, as shown above
    return grade
end function
\end{code}
%
Unfortunately the above will \emph{not} work: the caller will get the
\texttt{grade} series OK but it won't be string-valued.

At first sight this may seem to be a bug but it's defensible as a
consequence of the way series work in gretl. The point is that series
have, so to speak, two grades of existence. They can exist as
fully-fledged members of a dataset, or they can have a fleeting
existence as simply anonymous arrays of numbers that are of the same
length as dataset series. Consider the statement
\begin{code}
series rootx1 = sqrt(x+1)
\end{code}
On the right-hand side we have the ``series'' \texttt{x+1}, which is
called into existence as part of a calculation but has no name and
cannot have string values. Similarly, consider
\begin{code}
series grade = letter_grade(x)
\end{code}
The return value from \verb|letter_grade()| is likewise an anonymous
array,\footnote{A proper named series, with string values, existed
  while the function was executing but it ceased to exist as soon as
  the function was finished.} incapable of holding string values
\textit{until} it gets assigned to the named series \texttt{grade}.
The solution is to define \texttt{grade} as a series, at the level of
the caller, before calling \verb|letter_grade()|, as in
%
\begin{code}
function void letter_grade (series x, series *grade)
    # define grade based on x and stringify it
    # this version will work!
end function

# caller
...
series grade
letter_grade(x, &grade)
\end{code}

As you'll see from the account above, we don't offer any very fancy
facilities for string-valued series. We'll read them from suitable
sources and we'll create them natively via \cmd{stringify}---and we'll
try to ensure that they retain their integrity---but we don't, for
example, take the specification of a string-valued series as a
regressor as an implicit request to include the dummification of its
distinct values. Besides laziness, this reflects the fact that in
gretl a string-valued series \textit{may} be usable ``as is'',
depending on how it's defined; you can use \cmd{dummify} if you need
it.

\section{Other import formats}
\label{sec:other-imports}

In section~\ref{sec:reading} we illustrated the reading of
string-valued series with reference to a delimited text data file.
Gretl can also handle several other sources of string-valued data,
including the spreadsheet formats \texttt{xls}, \texttt{xlsx},
\texttt{gnumeric} and \texttt{ods} and (to a degree) the formats of
\textsf{Stata}, \textsf{SAS} and \textsf{SPSS}.

\subsection{Stata files}

Stata supports two relevant sorts of variables: (1) those that are of
``string type'' and (2) variables of one or other numeric type that
have ``value labels'' defined. Neither of these is exactly equivalent
to what we call a ``string-valued series'' in gretl.

Stata variables of string type have no numeric representation; their
values are literally strings, and that's all. Stata's numeric
variables with value labels do not have to be integer-valued and their
least value does not have to be 1; however, you can't define a label
for a value that is not an integer.
Thus in Stata you can have a series that comprises both integer and
non-integer values, but only the integer values can be
labeled.\footnote{Verified in Stata 12.}

This means that on import to gretl we can readily handle variables of
string type from Stata's \texttt{dta} files. We give them a 1-based
numeric encoding; this is arbitrary but does not conflict with any
information in the \texttt{dta} file. On the other hand, in general
we're not able to handle Stata's numeric variables with value labels;
currently we report the value labels to the user but do not attempt to
store them in the gretl dataset. We could check such variables and
import them as string-valued series if they satisfy the criteria
stated in section~\ref{sec:stringify} but we don't at present.

\subsection{SAS and SPSS files}

Gretl is able to read and preserve string values associated with
variables from SAS ``export'' (\texttt{xpt}) files, and also from SPSS
\texttt{sav} files. Such variables seem to be on the same pattern as
Stata variables of string type.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: