%\VignetteIndexEntry{OGR shapefile encoding}
%\VignetteDepends{}
%\VignetteKeywords{spatial}
%\VignettePackage{rgdal}
\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
%\usepackage[dvips]{graphicx,color}
\usepackage{times}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage[english]{babel}
\usepackage{xspace}

\usepackage{Sweave}
\usepackage{mathptm}
\usepackage{natbib}

\setkeys{Gin}{width=0.95\textwidth}
\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\let\pkg=\strong
\RequirePackage{alltt}
\newenvironment{example}{\begin{alltt}}{\end{alltt}}
\newenvironment{smallexample}{\begin{alltt}\small}{\end{alltt}}
\newcommand{\code}[1]{\texttt{\small #1}}
\def\RR{\textsf{R}\xspace}
\def\SP{\texttt{S-PLUS}\xspace}
\def\SS{\texttt{S}\xspace}

\title{OGR shapefile encoding}
\author{Roger Bivand}

\begin{document}

\maketitle

\section{Introduction}

OGR, the part of GDAL that handles vector data, has changed the way it treats character strings, both the contents of string attribute fields and the names of fields. A discussion document covers the principles recognised in work on encoding strings.\footnote{\url{http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode}.} The document states that ``it is proposed to implement the CPLRecode() method using the iconv() and related functions when available.'' These mechanisms are similar to, but not identical with, those used in \RR itself.

The implementation\footnote{\url{http://trac.osgeo.org/gdal/browser/trunk/gdal/port/cpl_recode.cpp}.} distinguishes between the use of iconv mechanisms in OGR, when GDAL is built with iconv support, and a fallback setting when iconv is not available. In both settings, UTF-8 encoding is intended to be used internally. Conversion from the layer in the data source to what we see inside \RR will differ depending on whether iconv support is available in GDAL, and on the encoding used in the \RR session.

This has had a particular impact on the ESRI Shapefile driver, because this OGR format uses DBF files for storing attribute data. Impacts on other drivers are not known at present. A subsection has been created in the \RR wiki\footnote{\url{http://rwiki.sciviews.org/doku.php\#ogr_string_encodings}.} to gather user experiences.

Recoding support for the ESRI Shapefile driver was introduced in GDAL/OGR 1.9.0. Two mechanisms are described in the driver documentation.\footnote{\url{http://www.gdal.org/ogr/drv_shapefile.html}.}

We can read that ``the SHAPE\_ENCODING configuration option may be used to override the encoding interpretation of the shapefile with any encoding supported by CPLRecode or to "" to avoid any recoding.'' Reference is made here to option values set in a number of ways, possibly by an environment (shell) variable, but maintained within the running instance of GDAL.\footnote{\url{http://trac.osgeo.org/gdal/browser/trunk/gdal/port/cpl_conv.cpp}, functions CPLSetConfigOption() and CPLGetConfigOption().} It emerged during discussion of a ticket on this issue\footnote{\url{http://trac.osgeo.org/gdal/ticket/4920}.} that the use of shell variables may not be portable, so functions in \pkg{rgdal} set and get the configuration options through compiled code, in particular the SHAPE\_ENCODING configuration option.

The second mechanism does not seem very robust, and is based on storing and retrieving values set in the LDID byte of the DBF file header. The driver documentation says ``an attempt is made to read the LDID/codepage setting from the .dbf file and use it to translate string fields to UTF-8 on read, and back when writing. LDID "87 / 0x57" is treated as ISO8859\_1 which may not be appropriate.'' In the driver code, a listing is provided,\footnote{\url{http://trac.osgeo.org/gdal/browser/trunk/gdal/ogr/ogrsf_frmts/shape/ogrshapelayer.cpp}, in function ConvertCodePage(), after line 180.} referring to a Russian website\footnote{\url{http://www.autopark.ru/ASBProgrammerGuide/DBFSTRUC.HTM}, see Table 9.} as authority. Crucially, the most common case for ESRI Shapefiles written by ArcGIS under Windows is LDID 87, which this function treats as ISO8859\_1. In addition, users of this format may provide an extra file with the extension CPG, which may or may not be respected.

This mechanism is related to the rather incomplete description provided by ESRI in their knowledge base,\footnote{\url{http://support.esri.com/en/knowledgebase/techarticles/detail/21106}.} which may nevertheless contain hints that are helpful in exchanging shapefiles between \RR/\pkg{rgdal} and other software via the OGR ESRI Shapefile driver. It is made plain that it is the user's responsibility to divine the appropriate encoding.

For obvious reasons, it is very difficult to give a sensible representation in this vignette of the character strings involved, because the vignette itself contains encoded strings. I have encapsulated results on two platforms available to me, and have consulted with a user of OSX. Some results are given using \code{charToRaw} to provide a neutral framework permitting encodings to be compared across platforms. The key string used in this example may be found at this location: \url{http://goo.gl/maps/PNRgX} (Střítež nad Ludinou), using a shapefile provided by Lukáš Marek as a result of work provoked by Jeff Ranara reporting contorted Swedish strings during a course in Bergen.

\section{Does GDAL have iconv support?}

When \pkg{rgdal} is loaded, a set of startup messages is displayed (unless suppressed). They may include the line:

\begin{verbatim}
GDAL does not use iconv for recoding strings.
\end{verbatim}

\noindent
which is generated if:

<<echo=TRUE,eval=TRUE,results=hide>>=
library(rgdal)
@
<<echo=TRUE,eval=FALSE>>=
.Call("RGDAL_CPL_RECODE_ICONV", PACKAGE="rgdal")
@

\noindent
is \code{FALSE}. This test is not absolutely trustworthy when GDAL is dynamically linked to \pkg{rgdal}, because it reports the state of the GDAL configure variables set when GDAL and \pkg{rgdal} were built. It is trustworthy for the CRAN Windows and OSX binaries, because they are built with static linking to GDAL and its dependencies. In other cases, if the GDAL runtime binaries have been updated but the header files have not been, or \pkg{rgdal} has not been re-installed, the value returned may be misleading.

Depending on the outcome of this test, and concentrating here on the ESRI Shapefile driver, we can fork between two cases.

\subsection{LDID and codepage values}

First we will see how the LDID value may be retrieved. The \code{ogrInfo} function reports the LDID for the ESRI Shapefile driver:

<<echo=TRUE,eval=TRUE>>=
dsn <- system.file("etc", package="rgdal")
layer <- "point"
oI <- ogrInfo(dsn, layer)
attr(oI, "LDID")
@

\noindent
In this case, Lukáš Marek also provided a CPG file, as this is said to be common practice to inform ArcGIS of the appropriate codepage for the DBF file:

<<echo=TRUE,eval=TRUE>>=
scan(paste(dsn, .Platform$file.sep, layer, ".cpg", sep=""),
 "character")
@

\noindent
This value does not match the LDID (which suits CP1252, the so-called ANSI codepage), but does match the Czech OEM codepage, for which the LDID should possibly be 31 (or 200). Most LDID values observed are 0 or 87. It is also possible that the value in the CPG file should be CP1250, rather than that used here.

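
The LDID reported by \code{ogrInfo} can also be cross-checked without GDAL. The following is a minimal sketch, assuming the usual dBASE header layout in which the language driver ID is the single byte at offset 29 (the 30th byte) of the DBF file; the helper name \code{read\_ldid} is hypothetical and is not part of \pkg{rgdal}:

<<echo=TRUE,eval=FALSE>>=
# Hypothetical helper: return the LDID byte of a DBF file as an integer.
# Assumes the language driver ID is stored in the 30th byte of the header.
read_ldid <- function(dbf_path) {
  con <- file(dbf_path, open = "rb")
  on.exit(close(con))
  # the fixed part of the DBF header is 32 bytes long
  header <- readBin(con, what = "raw", n = 32L)
  stopifnot(length(header) == 32L)
  as.integer(header[30L])
}
@

\noindent
For the sample layer, the helper would be called on the file \code{point.dbf} in the \code{dsn} directory, and the value compared with \code{attr(oI, "LDID")}.
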
To set the LDID value when writing a vector \code{Spatial*DataFrame} using the ESRI Shapefile driver, one may use the \code{layer\_options=} argument to \code{writeOGR}:

<<echo=TRUE,eval=FALSE>>=
writeOGR(mySDF, dsn, layer, driver="ESRI Shapefile",
 layer_options='ENCODING="LDID/31"')
@

\noindent
but note that the OGR format documentation says: ``The default value is "LDID/87". It is not clear what other values may be appropriate.''

\subsection{String representation in \RR}

The encoding in use in the \RR session may be seen by:

<<echo=TRUE,eval=TRUE>>=
Sys.getlocale("LC_CTYPE")
unlist(l10n_info())
@

\noindent
where \code{l10n\_info} may also return a \code{codepage} component (on Windows). Even when \RR is not running in a multibyte locale, as in the Windows GUI in some cases, it may be able to display marked multibyte charsets. OGR does not seem to mark multibyte charsets. The \RR manuals document the ways in which internationalization, locales, and encoding are handled in general,\footnote{\url{http://cran.r-project.org/doc/manuals/r-release/R-admin.html\#Internationalization}.} and for data import and export.\footnote{\url{http://cran.r-project.org/doc/manuals/r-release/R-data.html\#Encodings}.} In addition, the OSX\footnote{\url{http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html\#Internationalization-of-the-R_002eapp}.} and Windows\footnote{\url{http://cran.r-project.org/bin/windows/base/rw-FAQ.html\#Languages-and-Internationalization}.} FAQs describe platform-specific questions.

\section{GDAL with iconv}

Despite the uncertainty present in the LDID and codepage definitions, it is possible to assert control both when GDAL is built with iconv and when it is built without. We begin with the case of GDAL built with iconv. The data displayed here are pre-stored from this system:

<<echo=FALSE,eval=TRUE>>=
load(paste(dsn, .Platform$file.sep, "point_LinuxGDAL.RData", sep=""))
@
<<echo=TRUE,eval=FALSE>>=
sessionInfo()
@
<<echo=FALSE,eval=TRUE>>=
sI_1
@
\noindent
This system has \pkg{rgdal} installed from source, using GDAL also installed from source. Similar behaviour will be found on comparable systems, including \pkg{rgdal} installed from source under OSX, built against the Kyngchaos GDAL framework. Here, we can check that iconv is available in GDAL:

<<echo=TRUE,eval=FALSE>>=
.Call("RGDAL_CPL_RECODE_ICONV", PACKAGE="rgdal")
@
<<echo=FALSE,eval=TRUE>>=
cpliconv_1
@

\noindent
We proceed by checking the CPL configuration option:

<<echo=TRUE,eval=FALSE>>=
getCPLConfigOption("SHAPE_ENCODING")
@
<<echo=FALSE,eval=TRUE>>=
NULL
@

\noindent
It may be set using \code{setCPLConfigOption("SHAPE\_ENCODING", value)}, where \code{value} is the appropriate encoding, or \code{NULL} to unset \code{SHAPE\_ENCODING}. If \code{value} is the empty string, recoding is turned off in GDAL. Since we know that the encoding of the sample file is CP1250, we can import into \RR recoding to UTF-8 (the charset of the \RR session and the charset used internally by GDAL) in three ways (using \code{stringsAsFactors=FALSE} to make access to the string values easier):

<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "CP1250")
pt1 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
setCPLConfigOption("SHAPE_ENCODING", NULL)
charToRaw(pt1$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt2cr_1
@
<<echo=TRUE,eval=FALSE>>=
pt2 <- readOGR(dsn, layer, stringsAsFactors=FALSE, encoding="CP1250")
charToRaw(pt2$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt2cr_1
@
<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "")
pt3 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
setCPLConfigOption("SHAPE_ENCODING", NULL)
pt1i_1 <- iconv(pt3$NAZEV[1], from="CP1250", to="UTF-8")
charToRaw(pt1i_1)
@
<<echo=FALSE,eval=TRUE>>=
pt1icr_1
@

\noindent
The three methods are: to set the CPL configuration option directly, import the data, and then unset it; to use the \code{encoding=} argument to \code{readOGR}, which sets and unsets the CPL configuration option internally; or to turn off GDAL recoding and use \code{iconv} inside \RR. The final approach needs extra care if the field names also need recoding; with it, the string data are stored in the \RR object in their original charset. This, however, may be desirable if the same object is to be exported back to, for example, ArcGIS. A fourth method is to turn off GDAL recoding and use iconv within \RR inside \code{readOGR} to recode from the given encoding to the session charset, here UTF-8:

<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "")
pt4 <- readOGR(dsn, layer, stringsAsFactors=FALSE, use_iconv=TRUE,
 encoding="CP1250")
setCPLConfigOption("SHAPE_ENCODING", NULL)
charToRaw(pt4$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt1icr_1
@

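
The set, read, unset pattern used in these chunks leaves \code{SHAPE\_ENCODING} changed if the read fails part way through. One way to guard against this is sketched below; \code{with\_shape\_encoding} is a hypothetical helper, not part of \pkg{rgdal}, and its \code{get\_opt} and \code{set\_opt} arguments exist only so that the logic can be exercised without GDAL. The sketch assumes, as shown above, that \code{getCPLConfigOption} returns \code{NULL} for an unset option and that \code{setCPLConfigOption} with \code{NULL} unsets it:

<<echo=TRUE,eval=FALSE>>=
# Hypothetical helper: evaluate expr with SHAPE_ENCODING set to value,
# restoring the previous setting afterwards, even if expr fails.
with_shape_encoding <- function(value, expr,
                                get_opt = rgdal::getCPLConfigOption,
                                set_opt = rgdal::setCPLConfigOption) {
  old <- get_opt("SHAPE_ENCODING")  # NULL when the option is unset
  on.exit(set_opt("SHAPE_ENCODING", old), add = TRUE)
  set_opt("SHAPE_ENCODING", value)
  force(expr)                       # expr is evaluated only at this point
}
# Possible use, equivalent in intent to the first method above:
# pt <- with_shape_encoding("CP1250",
#   readOGR(dsn, layer, stringsAsFactors=FALSE))
@
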
\noindent
Note that \code{ogrInfo} also takes \code{encoding=} and \code{use\_iconv=} arguments. If we do not set the encoding, GDAL recodes from CP1252 rather than CP1250, and the string is wrongly rendered in UTF-8:

<<echo=TRUE,eval=FALSE>>=
pt5 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
charToRaw(pt5$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
ptcr_1
@
<<echo=TRUE,eval=FALSE>>=
all.equal(charToRaw(pt5$NAZEV[1]), charToRaw(pt4$NAZEV[1]))
@
<<echo=FALSE,eval=TRUE>>=
all.equal(ptcr_1, pt1icr_1)
@

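
The effect of choosing the wrong source encoding can be reproduced in base \RR without any shapefile. The sketch below is an illustration only, not taken from the stored results above: it uses the single byte \code{0xF8}, which stands for ř in CP1250 but for ø in CP1252, and assumes that the iconv implementation used by \RR recognises the names \code{CP1250} and \code{CP1252}:

<<echo=TRUE,eval=FALSE>>=
# One byte as it would be stored in the DBF file: r-caron in CP1250
x <- rawToChar(as.raw(0xf8))
# Recoding from the correct and from the wrong source encoding
right <- iconv(x, from = "CP1250", to = "UTF-8")
wrong <- iconv(x, from = "CP1252", to = "UTF-8")
charToRaw(right)  # c5 99, UTF-8 for U+0159 (r-caron)
charToRaw(wrong)  # c3 b8, UTF-8 for U+00F8 (o-slash)
identical(charToRaw(right), charToRaw(wrong))  # FALSE
@

\noindent
Both conversions succeed without warning, which is why the raw bytes, rather than the absence of errors, are compared in this vignette.
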
\section{GDAL without iconv}

Next we turn to the GDAL without iconv case. The data displayed here are pre-stored from this system:

<<echo=FALSE,eval=TRUE>>=
load(paste(dsn, .Platform$file.sep, "point_WinCRAN.RData", sep=""))
@
<<echo=TRUE,eval=FALSE>>=
sessionInfo()
@
<<echo=FALSE,eval=TRUE>>=
sI
@
<<echo=TRUE,eval=FALSE>>=
unlist(l10n_info())
@
<<echo=FALSE,eval=TRUE>>=
unlist(l10n)
@
\noindent
This system has \pkg{rgdal} installed using the CRAN binary. Here, we can check whether iconv is available in GDAL:

<<echo=TRUE,eval=FALSE>>=
.Call("RGDAL_CPL_RECODE_ICONV", PACKAGE="rgdal")
@
<<echo=FALSE,eval=TRUE>>=
cpliconv
@

\noindent
This means that if GDAL recoding is not turned off, default non-iconv stub recoding will be used. It will often not be sensible to permit this to happen, so users may protect the raw data by either of these two methods:

<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "")
pt6 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
setCPLConfigOption("SHAPE_ENCODING", NULL)
charToRaw(pt6$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt1cr
@
<<echo=TRUE,eval=FALSE>>=
pt7 <- readOGR(dsn, layer, stringsAsFactors=FALSE, encoding="")
charToRaw(pt7$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt1cr
@

The assumption here is that there may not be a suitable recoding matching the imported strings and the \RR console; users will need to experiment to see how \code{iconv} may be used within \RR.

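
Such experimentation might start from a sketch like the following, which recodes one unconverted string from several candidate source encodings so that the results can be compared. The helper name \code{try\_encodings} and the three-byte sample are hypothetical; in practice the input would be a value such as \code{pt6\$NAZEV[1]}, and the candidate names must be ones that the local iconv implementation recognises:

<<echo=TRUE,eval=FALSE>>=
# Hypothetical helper: recode a single string from each candidate encoding.
# iconv returns NA where it judges the conversion to be invalid.
try_encodings <- function(x, candidates, to = "UTF-8") {
  stopifnot(length(x) == 1L)
  vapply(candidates,
         function(from) iconv(x, from = from, to = to),
         character(1))
}
# Sample bytes: "St" followed by 0xF8 (r-caron if the source is CP1250)
bytes <- rawToChar(as.raw(c(0x53, 0x74, 0xf8)))
res <- try_encodings(bytes, c("CP1250", "CP1252", "UTF-8"))
# Compare the candidates byte by byte, independently of the console
lapply(res, function(s) if (is.na(s)) NA else charToRaw(s))
@

\noindent
A candidate that yields \code{NA} can be ruled out, but several candidates may convert without error, so the choice between them still rests on knowing what the strings ought to say.
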
\end{document}