\chapter{String-valued series}
\label{chap:strval-series}

\section{Introduction}

Gretl's support for data series with string values has gone through
three phases:
\begin{enumerate}
\item No support: we simply rejected non-numerical values when reading
  data from file.
\item Numeric encoding only: we would read a string-valued series from
  a delimited text data file (provided the series didn't mix numerical
  values and strings) but the representation of the data within gretl
  was purely numerical. We printed a ``string table'' showing the
  mapping between the original strings and gretl's encoding and it was
  up to the user to keep track of this mapping.
\item Preservation of string values: the string table that we
  construct in reading a string-valued series is now stored as a
  component of the dataset so it's possible to display and manipulate
  these values within gretl.
\end{enumerate}

The third phase has now been in effect for several years, with a
series of gradual refinements. This chapter gives an account of the
status quo. It explains how to create string-valued series and
describes the operations that are supported for such series.

\section{Creating a string-valued series}

This can be done in two ways: first, by reading such a series from a
suitable source file and second, by taking a suitable numerical series
within gretl and adding string values using the \cmd{stringify()}
function. In either case string values will be preserved when such
a series is saved in a gretl-native data file.

\subsection{Reading string-valued series}
\label{sec:reading}

The primary ``suitable source'' for string-valued series is a
delimited text data file (but see section~\ref{sec:other-imports}
below). Here's a little example. The following is the content of a
file named \texttt{gc.csv}:
%
\begin{code}
city,year
"Bilbao",2009
"Toruń",2011
"Oklahoma City",2013
"Berlin",2015
"Athens",2017
"Naples",2019
\end{code}
%
and here's a script:
%
\begin{code}
open gc.csv --quiet
print --byobs
print city --byobs --numeric
printf "The third gretl conference took place in %s.\n", city[3]
\end{code}

The output from the script is:
%
\begin{code}
? print --byobs

          city         year

1       Bilbao         2009
2        Toruń         2011
3 Oklahoma C..         2013
4       Berlin         2015
5       Athens         2017
6       Naples         2019

? print city --byobs --numeric

          city

1            1
2            2
3            3
4            4
5            5
6            6

The third gretl conference took place in Oklahoma City.
\end{code}

From this we can see a few things.
\begin{itemize}
\item By default the \cmd{print} command shows us the string values
  of the series \texttt{city}, and it handles non-ASCII characters
  provided they're in UTF-8 (although longer strings are truncated,
  as with ``Oklahoma C..'' above).
\item The \verb|--numeric| option to \cmd{print} exposes the
  numeric codes for a string-valued series.
\item The syntax \texttt{seriesname[obs]} gives a string when a series
  is string-valued.
\end{itemize}

Suppose you want to access the numeric code for a particular
string-valued observation: you can get that by ``casting'' the series
to a vector. Thus
\begin{code}
printf "The code for '%s' is %d.\n", city[3], {city}[3]
\end{code}
gives
\begin{code}
The code for 'Oklahoma City' is 3.
\end{code}

The numeric codes for string-valued series are always assigned thus:
reading the data file row by row, the first string value is assigned
1, the next \textit{distinct} string value is assigned 2, and so on.
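
To make the rule concrete, here is a sketch using a hypothetical file
\texttt{fruit.csv} (invented for illustration; it is not distributed
with gretl) in which some strings repeat:
\begin{code}
fruit,qty
"pear",3
"apple",5
"pear",2
"fig",7
"apple",1
\end{code}
Going by the rule just stated, \texttt{pear} should get code 1,
\texttt{apple} code 2 and \texttt{fig} code 3, so the codes by row
would be 1, 2, 1, 3, 2. A script along the following lines could be
used to check this:
\begin{code}
open fruit.csv --quiet
# numeric copy of the codes, to print next to the strings
series code = fruit
print fruit code --byobs
# walk through the string table in code order
strings S = strvals(fruit)
loop i=1..nelem(S)
    printf "%d -> %s\n", i, S[i]
endloop
\end{code}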

\subsection{Assigning string values to an existing series}
\label{sec:stringify}

This is done via the \cmd{stringify()} function, which takes two
arguments, the name of a series and an array of strings. For this to
work two conditions must be met:

\begin{enumerate}
\item The series must have only integer values and the smallest value
  must be 1 or greater.
\item The array of strings must have at least $n$ distinct members,
  where $n$ is the largest value found in the series.
\end{enumerate}

The logic of these conditions is that we're looking to create a
mapping as described above, from a 1-based sequence of integers to a
set of strings. However, we're allowing for the possibility that the
series in question is an incomplete sample from an associated
population. Suppose we have a series that goes 2, 3, 5, 9, 10. This is
taken to be a sample from a population that has at least 10 discrete
values, 1, 2, \dots{}, 10, and so requires at least 10 value-strings.

Here's (a simplified version of) an example that one of the authors
has had cause to use: deriving US-style ``letter grades'' from a
series containing percentage scores for students. Call the percentage
series $x$, and say we want to create a series with values \texttt{A}
for $x \geq 90$, \texttt{B} for $80 \leq x < 90$, and so on down to
\texttt{F} for $x < 60$. Then we can do:
\begin{code}
series grade = 1 # F, the least value
grade += x >= 60 # D
grade += x >= 70 # C
grade += x >= 80 # B
grade += x >= 90 # A
stringify(grade, strsplit("F D C B A"))
\end{code}
%
The way the \texttt{grade} series is constructed is not the most
compact, but it's nice and explicit, and easy to amend if one wants to
adjust the threshold values. Note the use of \cmd{strsplit()} to
create an on-the-fly array of strings from a string literal; this is
convenient when the array contains a moderate number of elements with
no embedded spaces. An alternative way to get the same result is to
define the array of strings via the \cmd{defarray()} function, as in
\begin{code}
stringify(grade, defarray("F","D","C","B","A"))
\end{code}

The inverse operation of \cmd{stringify()} can be performed by the
\cmd{strvals()} function: this retrieves the array of string values
from a series (or returns an empty array if the series is not
string-valued).
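
The two conditions for \cmd{stringify()} and the round trip via
\cmd{strvals()} can be tried out on the ``incomplete sample'' series
2, 3, 5, 9, 10 mentioned earlier in this section. The following is a
minimal sketch only: the dataset is artificial, the names \texttt{s},
\texttt{S} and \texttt{T} and the ten placeholder strings are
invented for illustration, and we assume that \cmd{stringify()}
returns zero on success.
\begin{code}
nulldata 5
# integer values, smallest >= 1, largest = 10
series s = {2; 3; 5; 9; 10}
# so at least 10 distinct strings are required
strings S = strsplit("a b c d e f g h i j")
scalar err = stringify(s, S)
printf "stringify returned %d\n", err
print s --byobs
# retrieve the string table attached to s
strings T = strvals(s)
printf "s carries %d string values\n", nelem(T)
\end{code}
Supplying fewer than 10 strings here should violate the second
condition, in which case we would expect \texttt{s} to remain purely
numerical.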

\section{Permitted operations}

One question that arises with string-valued series is, what are you
allowed to do with them and what is banned? This is a debatable point,
but here we set out the current state of things.

\subsection{Setting values per observation}

You can set particular values in a string-valued series either by
string or numeric code. For example, suppose (in relation to the
example in section~\ref{sec:stringify}) that for some reason student
number 31 with a percentage score of 88 nonetheless merits an
\texttt{A} grade. We could do
\begin{code}
grade[31] = "A"
\end{code}
or, if we're confident about the mapping,
\begin{code}
grade[31] = 5
\end{code}
Or to raise the student's grade by one letter:
\begin{code}
grade[31] += 1
\end{code}

What you're \textit{not} allowed to do here is make a numerical
adjustment that would put the value out of bounds in relation to the
set of string values. For example, if we tried \texttt{grade[31] = 6}
we'd get an error.

On the other hand, you \textit{can} implicitly extend the set of
string values. This wouldn't make sense for the letter grades example
but it might for, say, city names. Returning to the example in
section~\ref{sec:reading} suppose we try
%
\begin{code}
dataset addobs 1
year[7] = 2021
city[7] = "London?"
\end{code}
%
This will work: we're implicitly adding another member to the string
table for \texttt{city}; the associated numeric code will be the next
available integer.\footnote{Admittedly there is a downside to this
  feature: one may inadvertently add a new string value by mistyping a
  string that's already present.}
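
One way to confirm the effect of such an extension is to inspect the
string table afterwards. The lines below are a sketch meant to be run
directly after the three statements above; the array name \texttt{S}
is our own. If the description above holds, both the count of string
values and the code at observation 7 should be 7.
\begin{code}
# size of the string table after the implicit extension
strings S = strvals(city)
printf "city now has %d string values\n", nelem(S)
# the new observation: its string and its numeric code
printf "obs 7: '%s' (code %d)\n", city[7], {city}[7]
\end{code}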

\subsection{Logical product of two string-valued series}

The operator \verb|^| can be used to produce what we might call the
logical product of two string-valued series, as in
\begin{code}
series sv3 = sv1 ^ sv2
\end{code}
The result is another string-valued series with value $s_i.s_j$ at
observations where \texttt{sv1} has value $s_i$ and \texttt{sv2} has
value $s_j$. For example, if at a given observation \texttt{sv1} has
value ``\texttt{A}'' and \texttt{sv2} has value ``\texttt{X}'', then
\texttt{sv3} will have value ``\texttt{A.X}''. The set of strings
attached to the resulting series will include all such string
combinations even if they are not all represented in the given sample.
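
A self-contained way to experiment with this is to build two small
string-valued series by hand. The following sketch uses an artificial
four-observation dataset; the numeric values and the strings are
invented for illustration.
\begin{code}
nulldata 4
# numeric codes for two factors with two levels each
series sv1 = {1; 1; 2; 2}
series sv2 = {1; 2; 1; 2}
stringify(sv1, defarray("A", "B"))
stringify(sv2, defarray("X", "Y"))
# logical product of the two string-valued series
series sv3 = sv1 ^ sv2
print sv1 sv2 sv3 --byobs
\end{code}
Going by the description above, the first observation of
\texttt{sv3} should read \texttt{A.X}, since \texttt{sv1} is
\texttt{A} and \texttt{sv2} is \texttt{X} at that point.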

\subsection{Assignment to an entire series}

Other than the ``logical product'' case described above, this is
disallowed at present: you can't execute an assignment with the name
of a string-valued series \textit{per se} on the left-hand side. Put
differently, you cannot overwrite an entire string-valued series at
once. While this is debatable, it's the easiest way of ensuring that
we never end up with a broken mapping. It's possible this restriction
may be relaxed in future.

Besides assigning an out-of-bounds numerical value to a particular
observation, this sort of assignment is in fact the only operation
that is banned for string-valued series.

\subsection{Missing values}

We support one exception to the general rule that the mapping
between strings and numeric codes must never be broken for
string-valued series: you can mark particular observations as
missing. This is done in the usual way, e.g.,
\begin{code}
grade[31] = NA
\end{code}
Note, however, that on importing a string series from a delimited text
file any non-blank strings (including ``NA'') will be interpreted as
valid values; any missing values in such a file should therefore be
represented by blank cells.
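
As a quick check on the first point, one can count the missing
observations and confirm that the string table is still in place.
This sketch assumes the \texttt{grade} series from
section~\ref{sec:stringify} is present in the current dataset.
\begin{code}
grade[31] = NA
# missing() gives 1 where grade is NA, 0 otherwise
printf "missing grades: %d\n", sum(missing(grade))
# the set of string values should be unaffected
printf "string values retained: %d\n", nelem(strvals(grade))
\end{code}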

\subsection{Copying a string-valued series}

If you make a copy of a string-valued series, as in
\begin{code}
series foo = city
\end{code}
the string values are \textit{not} copied over: you get a purely
numerical series holding the codes of the original series. But if you
want a full copy with the string values that can easily be arranged:
\begin{code}
series citycopy = city
stringify(citycopy, strvals(city))
\end{code}

\subsection{String-valued series in other contexts}

String-valued series can be used on the right-hand side of assignment
statements at will, and in that context their numerical values are
taken. For example,
%
\begin{code}
series y = sqrt(city)
\end{code}
%
will elicit no complaint and generate a numerical series 1, 1.41421,
\dots{}. It's up to the user to judge whether this sort of thing
makes any sense.

Similarly, it's up to the user to decide if it makes sense to use a
string-valued series ``as is'' in a regression model, whether as
regressand or regressor---again, the numerical values of the series
are taken. Often this will not make sense, but sometimes it may: the
numerical values may by design form an ordinal, or even a cardinal,
scale (as in the ``grade'' example in section~\ref{sec:stringify}).

More likely, one would want to use \cmd{dummify} on a string-valued
series before using it in statistical modeling. In that context
gretl's series labels are suitably informative. For example, suppose
we have a series \texttt{race} with numerical values 1, 2 and 3 and
associated strings ``White'', ``Black'' and ``Other''. Then the hansl
code
\begin{code}
list D = dummify(race)
labels
\end{code}
will show these labels:
\begin{code}
Drace_2: dummy for race = 'Black'
Drace_3: dummy for race = 'Other'
\end{code}

Given such a series you can use string values in a sample restriction,
as in
\begin{code}
smpl race == "Black" --restrict
\end{code}
(although \texttt{race == 2} would also be acceptable).

There may be other contexts that we haven't yet thought of where it
would be good to have string values displayed and/or accepted on
input; suggestions are welcome.

\section{String-valued series and functions}

User-defined hansl functions can deal with string-valued series,
although there are a few points to note.

If you supply such a series as an argument to a hansl function its
string values will be accessible within the function. One can test
whether a given series \texttt{arg} is string-valued as follows:
\begin{code}
if nelem(strvals(arg)) > 0
  # yes
else
  # no
endif
\end{code}

Now suppose one wanted to put something like the code that generated
the \texttt{grade} series in section~\ref{sec:stringify} into a
function. That can be done, but \textit{not} in the form of a function
that directly returns the desired series---that is, something like
\begin{code}
function series letter_grade (series x)
  series grade
  # define grade based on x and stringify it, as shown above
  return grade
end function
\end{code}
%
Unfortunately the above will \emph{not} work: the caller will get the
\texttt{grade} series OK but it won't be string-valued. At first sight
this may seem to be a bug but it's defensible as a consequence of the
way series work in gretl.

The point is that series have, so to speak, two grades of
existence. They can exist as fully-fledged members of a dataset, or
they can have a fleeting existence as simply anonymous arrays of
numbers that are of the same length as dataset series. Consider the
statement
\begin{code}
series rootx1 = sqrt(x+1)
\end{code}
On the right-hand side we have the ``series'' \texttt{x+1}, which is
called into existence as part of a calculation but has no name and
cannot have string values. Similarly, consider
\begin{code}
series grade = letter_grade(x)
\end{code}
The return value from \verb|letter_grade()| is likewise an anonymous
array,\footnote{A proper named series, with string values, existed
  while the function was executing but it ceased to exist as soon as
  the function was finished.} incapable of holding string values
\textit{until} it gets assigned to the named series
\texttt{grade}. The solution is to define \texttt{grade} as a series,
at the level of the caller, before calling \verb|letter_grade()|, as
in
%
\begin{code}
function void letter_grade (series x, series *grade)
  # define grade based on x and stringify it
  # this version will work!
  grade = 1        # F, the least value
  grade += x >= 60 # D
  grade += x >= 70 # C
  grade += x >= 80 # B
  grade += x >= 90 # A
  stringify(grade, strsplit("F D C B A"))
end function

# caller
...
series grade
letter_grade(x, &grade)
\end{code}

As you'll see from the account above, we don't offer any very fancy
facilities for string-valued series. We'll read them from suitable
sources and we'll create them natively via \cmd{stringify}---and
we'll try to ensure that they retain their integrity---but we don't,
for example, take the specification of a string-valued series as a
regressor as an implicit request to include the dummification of its
distinct values. Besides laziness, this reflects the fact that in
gretl a string-valued series \textit{may} be usable ``as is'',
depending on how it's defined; you can use \cmd{dummify} if you
need it.

\section{Other import formats}
\label{sec:other-imports}

In section~\ref{sec:reading} we illustrated the reading of
string-valued series with reference to a delimited text data
file. Gretl can also handle several other sources of string-valued
data, including the spreadsheet formats \texttt{xls}, \texttt{xlsx},
\texttt{gnumeric} and \texttt{ods} and (to a degree) the formats of
\textsf{Stata}, \textsf{SAS} and \textsf{SPSS}.

\subsection{Stata files}

Stata supports two relevant sorts of variables: (1) those that are of
``string type'' and (2) variables of one or other numeric type that
have ``value labels'' defined. Neither of these is exactly equivalent
to what we call a ``string-valued series'' in gretl.

Stata variables of string type have no numeric representation; their
values are literally strings, and that's all. Stata's numeric
variables with value labels do not have to be integer-valued and their
least value does not have to be 1; however, you can't define a label
for a value that is not an integer. Thus in Stata you can have a
series that comprises both integer and non-integer values, but only
the integer values can be labeled.\footnote{Verified in Stata 12.}

This means that on import to gretl we can readily handle variables of
string type from Stata's \texttt{dta} files. We give them a 1-based
numeric encoding; this is arbitrary but does not conflict with any
information in the \texttt{dta} file. On the other hand, in general
we're not able to handle Stata's numeric variables with value labels;
currently we report the value labels to the user but do not attempt to
store them in the gretl dataset. We could check such variables and
import them as string-valued series if they satisfy the criteria
stated in section~\ref{sec:stringify} but we don't at present.

\subsection{SAS and SPSS files}

Gretl is able to read and preserve string values associated with
variables from SAS ``export'' (\texttt{xpt}) files, and also from SPSS
\texttt{sav} files. Such variables appear to follow the same pattern
as Stata variables of string type.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: