%\VignetteIndexEntry{OGR shapefile encoding}
%\VignetteDepends{}
%\VignetteKeywords{spatial}
%\VignettePackage{rgdal}
\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
%\usepackage[dvips]{graphicx,color}
\usepackage{times}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage[english]{babel}
\usepackage{xspace}

\usepackage{Sweave}
\usepackage{mathptm}

\setkeys{Gin}{width=0.95\textwidth}
\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\let\pkg=\strong
\RequirePackage{alltt}
\newenvironment{example}{\begin{alltt}}{\end{alltt}}
\newenvironment{smallexample}{\begin{alltt}\small}{\end{alltt}}
\newcommand{\code}[1]{\texttt{\small #1}}
\def\RR{\textsf{R}\xspace}
\def\SP{\texttt{S-PLUS}\xspace}
\def\SS{\texttt{S}\xspace}

\title{OGR shapefile encoding}
\author{Roger Bivand}

\begin{document}

\maketitle

\section{Introduction}

Changes have taken place in the way that OGR, the part of GDAL that handles vector data, treats character strings: both the contents of string attribute fields and the names of fields. A discussion document covers the principles recognised in work on encoding strings.\footnote{\url{http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode}.} The document states that ``it is proposed to implement the CPLRecode() method using the iconv() and related functions when available.'' These mechanisms are similar to, but not identical with, those used in \RR itself.

The implementation\footnote{\url{http://trac.osgeo.org/gdal/browser/trunk/gdal/port/cpl_recode.cpp}.} distinguishes between the use of iconv mechanisms in OGR, when GDAL is built with iconv support, and a fallback setting when iconv is not available. In both settings, UTF-8 encoding is intended to be used internally.
Conversion from the layer in the data source to what we see inside \RR will differ depending on whether iconv support is available in GDAL, and on the encoding used in the \RR session.

This has had particular impact on the ESRI Shapefile driver, because this OGR format uses DBF files for storing attribute data. Impacts on other drivers are not known at present. A subsection has been created in the \RR wiki\footnote{\url{http://rwiki.sciviews.org/doku.php\#ogr_string_encodings}.} to gather user experiences.

Recoding support for the ESRI Shapefile driver was introduced in GDAL/OGR 1.9.0. Two mechanisms are described in the driver documentation.\footnote{\url{http://www.gdal.org/ogr/drv_shapefile.html}.}

We can read that ``the SHAPE\_ENCODING configuration option may be used to override the encoding interpretation of the shapefile with any encoding supported by CPLRecode or to "" to avoid any recoding.'' This refers to configuration option values, which may be set in a number of ways, possibly through an environment (shell) variable, but which are maintained within the running instance of GDAL.\footnote{\url{http://trac.osgeo.org/gdal/browser/trunk/gdal/port/cpl_conv.cpp}, functions CPLSetConfigOption() and CPLGetConfigOption().} It emerged during discussion of a ticket on this issue\footnote{\url{http://trac.osgeo.org/gdal/ticket/4920}.} that the use of shell variables may not be portable, so functions in \pkg{rgdal} set and get the configuration options through compiled code, in particular the SHAPE\_ENCODING configuration option.

The second mechanism, which does not seem very robust, is based on storing and retrieving the value of the LDID (language driver ID) byte in the DBF file header. The driver documentation says ``an attempt is made to read the LDID/codepage setting from the .dbf file and use it to translate string fields to UTF-8 on read, and back when writing.
LDID "87 / 0x57" is treated as ISO8859\_1 which may not be appropriate.'' In the driver code, a listing is provided,\footnote{\url{http://trac.osgeo.org/gdal/browser/trunk/gdal/ogr/ogrsf_frmts/shape/ogrshapelayer.cpp}, in function ConvertCodePage(), after line 180.} referring to a Russian website\footnote{\url{http://www.autopark.ru/ASBProgrammerGuide/DBFSTRUC.HTM}, see Table 9.} as authority. Crucially, the most common case for ESRI Shapefiles written by ArcGIS under Windows is LDID 87, which this function treats as ISO8859\_1. In addition, users of this format may provide an extra file with the extension CPG, which may or may not be respected.

This mechanism is related to the rather incomplete description provided by ESRI in their knowledge base,\footnote{\url{http://support.esri.com/en/knowledgebase/techarticles/detail/21106}.} which may nevertheless contain hints that are helpful when shapefiles are exchanged with \RR/\pkg{rgdal} via the OGR ESRI Shapefile driver. It is made plain that it is the user's responsibility to divine the appropriate encoding.

For obvious reasons, it is very difficult to give a sensible representation in this vignette of the character strings involved, because the vignette itself contains encoded strings. I have stored results from two platforms available to me, and have consulted with a user of OSX. Some results are given using \code{charToRaw} to provide a neutral framework permitting encodings to be compared across platforms. The key string used in this example may be found at this location: \url{http://goo.gl/maps/PNRgX} (Střítež nad Ludinou). It comes from a shapefile provided by Lukáš Marek, as a result of work provoked by Jeff Ranara reporting contorted Swedish strings during a course in Bergen.

\section{Does GDAL have iconv support?}

When \pkg{rgdal} is loaded, a set of startup messages is displayed (unless suppressed). They may include the line:

\begin{verbatim}
GDAL does not use iconv for recoding strings.
\end{verbatim}

\noindent
which is generated if:

<<echo=TRUE,eval=TRUE,results=hide>>=
library(rgdal)
@
<<echo=TRUE,eval=FALSE>>=
.Call("RGDAL_CPL_RECODE_ICONV", PACKAGE="rgdal")
@

\noindent
is \code{FALSE}. This test is not absolutely trustworthy when \pkg{rgdal} is dynamically linked to GDAL, because it reports the state of the GDAL configure variables set when GDAL and \pkg{rgdal} were built. It is trustworthy for the CRAN Windows and OSX binaries, because they are built with static linking to GDAL and its dependencies. In other cases, if the GDAL runtime binaries have been updated but the header files have not been, or \pkg{rgdal} has not been re-installed, the value returned may be misleading.

Depending on the outcome of this test, and concentrating here on the ESRI Shapefile driver, we can fork between two cases.

\subsection{LDID and codepage values}

First we will see how the LDID value may be retrieved. The \code{ogrInfo} function reports the LDID for the ESRI Shapefile driver:

<<echo=TRUE,eval=TRUE>>=
dsn <- system.file("etc", package="rgdal")
layer <- "point"
oI <- ogrInfo(dsn, layer)
attr(oI, "LDID")
@

\noindent
In this case, Lukáš Marek also provided a CPG file, as this is said to be common practice to inform ArcGIS of the appropriate codepage for the DBF file:

<<echo=TRUE,eval=TRUE>>=
scan(paste(dsn, .Platform$file.sep, layer, ".cpg", sep=""),
 "character")
@

\noindent
This value does not match the LDID (which suits CP1252, the so-called ANSI codepage), but does match the Czech OEM codepage, for which the LDID should possibly be 31 (or 200). Most LDID values observed are 0 or 87. It is also possible that the value in the CPG file should be CP1250, rather than that used here.
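
The LDID reported by \code{ogrInfo} can be cross-checked without GDAL, since it is a single byte in the DBF header. The following is a minimal sketch in base \RR, not evaluated here; the helper name \code{read\_dbf\_ldid} is hypothetical and is not part of \pkg{rgdal}, and the file name \code{point.dbf} in the usage comment is assumed from the layer name. It relies on the standard dBASE header layout, in which the language driver ID is the byte at offset 29 (counting from zero):

<<echo=TRUE,eval=FALSE>>=
# Hypothetical helper (not part of rgdal): read the LDID byte directly
# from the header of a DBF file.
read_dbf_ldid <- function(dbf_path) {
  con <- file(dbf_path, open = "rb")
  on.exit(close(con))
  # The fixed part of a DBF header is 32 bytes long
  header <- readBin(con, what = "raw", n = 32L)
  if (length(header) < 32L) stop("file too short to hold a DBF header")
  # Offset 29 from zero is element 30 in R's 1-based indexing
  as.integer(header[30L])
}

# Usage (assumes the sample layer is stored as point.dbf):
# read_dbf_ldid(file.path(system.file("etc", package = "rgdal"),
#                         "point.dbf"))
@

\noindent
A value of 87 would correspond to the LDID that the driver treats as ISO8859\_1; this sketch only reads the byte and makes no attempt to map it to a codepage.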

To set the LDID value when writing a vector \code{Spatial*DataFrame} object using the ESRI Shapefile driver, one may use the \code{layer\_options=} argument to \code{writeOGR}:

<<echo=TRUE,eval=FALSE>>=
writeOGR(mySDF, dsn, layer, driver="ESRI Shapefile",
 layer_options='ENCODING="LDID/31"')
@

\noindent
but note that the OGR format documentation says: ``The default value is "LDID/87". It is not clear what other values may be appropriate.''

\subsection{String representation in \RR}

The encoding in use in the \RR session may be seen by:

<<echo=TRUE,eval=TRUE>>=
Sys.getlocale("LC_CTYPE")
unlist(l10n_info())
@

\noindent
where \code{l10n\_info} may also return a \code{codepage} component (on Windows). Even when \RR is not running in a multibyte locale, as in the Windows GUI in some cases, it may be able to display marked multibyte charsets. OGR does not seem to mark multibyte charsets. \RR manuals document the ways in which internationalization, locales, and encoding are handled in general,\footnote{\url{http://cran.r-project.org/doc/manuals/r-release/R-admin.html\#Internationalization}.} and for data import and export.\footnote{\url{http://cran.r-project.org/doc/manuals/r-release/R-data.html\#Encodings}.} In addition, the OSX\footnote{\url{http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html\#Internationalization-of-the-R_002eapp}.} and Windows\footnote{\url{http://cran.r-project.org/bin/windows/base/rw-FAQ.html\#Languages-and-Internationalization}.} FAQs describe platform-specific questions.

\section{GDAL with iconv}

Despite the uncertainty present in the LDID and codepage definitions, it is possible to assert control both when GDAL is built with iconv and when it is built without iconv. We begin with the case of GDAL built with iconv.
The data displayed here are pre-stored from this system:

<<echo=FALSE,eval=TRUE>>=
load(paste(dsn, .Platform$file.sep, "point_LinuxGDAL.RData", sep=""))
@
<<echo=TRUE,eval=FALSE>>=
sessionInfo()
@
<<echo=FALSE,eval=TRUE>>=
sI_1
@
\noindent
This system has \pkg{rgdal} installed from source, using GDAL also installed from source. Similar behaviour will be found on comparable systems, including \pkg{rgdal} installed from source under OSX and built against the Kyngchaos GDAL framework. Here, we can check that iconv is available in GDAL:

<<echo=TRUE,eval=FALSE>>=
.Call("RGDAL_CPL_RECODE_ICONV", PACKAGE="rgdal")
@
<<echo=FALSE,eval=TRUE>>=
cpliconv_1
@

\noindent
We proceed by checking the CPL configuration option:

<<echo=TRUE,eval=FALSE>>=
getCPLConfigOption("SHAPE_ENCODING")
@
<<echo=FALSE,eval=TRUE>>=
NULL
@

\noindent
It may be set using \code{setCPLConfigOption("SHAPE\_ENCODING", value)}, where \code{value} is the appropriate encoding, or \code{NULL} to unset \code{SHAPE\_ENCODING}. If \code{value} is the empty string, recoding is turned off in GDAL.
Since we know that the encoding of the sample file is CP1250, we can import it into \RR, recoding to UTF-8 (the charset of the \RR session and the charset used internally by GDAL), in three ways (using \code{stringsAsFactors=FALSE} to make access to the string values easier):

<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "CP1250")
pt1 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
setCPLConfigOption("SHAPE_ENCODING", NULL)
charToRaw(pt1$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt2cr_1
@
<<echo=TRUE,eval=FALSE>>=
pt2 <- readOGR(dsn, layer, stringsAsFactors=FALSE, encoding="CP1250")
charToRaw(pt2$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt2cr_1
@
<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "")
pt3 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
setCPLConfigOption("SHAPE_ENCODING", NULL)
pt1i_1 <- iconv(pt3$NAZEV[1], from="CP1250", to="UTF-8")
charToRaw(pt1i_1)
@
<<echo=FALSE,eval=TRUE>>=
pt1icr_1
@

\noindent
The three methods are: to set the CPL configuration option directly, import the data, and then unset the option; to use the \code{encoding=} argument to \code{readOGR}, which sets and unsets the CPL configuration option internally; or to turn off GDAL recoding and use \code{iconv} inside \RR. With the final approach, the string data are stored in the \RR object in their original charset, and extra care is needed if the field names also need recoding. Keeping the original charset may, however, be desirable if the same object is to be exported back to, for example, ArcGIS.
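
The extra care mentioned for the third method can be sketched in base \RR as follows. This is an illustration under stated assumptions, not evaluated here: the helper name \code{recode\_strings} is hypothetical and not part of \pkg{rgdal}, and it simply applies \code{iconv} to every character column of a data frame and to its column names, returning a recoded copy so that the original object stays in its source charset:

<<echo=TRUE,eval=FALSE>>=
# Hypothetical helper (not part of rgdal): return a copy of a data
# frame with character columns and column names recoded by iconv().
recode_strings <- function(df, from, to = "UTF-8") {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = to)
  }
  names(df) <- iconv(names(df), from = from, to = to)
  df
}

# Usage (assumes pt3 was read with GDAL recoding turned off):
# pt3_utf8 <- pt3
# pt3_utf8@data <- recode_strings(pt3@data, from = "CP1250")
@

\noindent
Working on a copy keeps the unrecoded object available for export, at the cost of holding the attribute table twice in memory.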
A fourth method is to turn off GDAL recoding and use \code{iconv} within \RR inside \code{readOGR} to recode from the given encoding to the session charset, here UTF-8:

<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "")
pt4 <- readOGR(dsn, layer, stringsAsFactors=FALSE, use_iconv=TRUE,
 encoding="CP1250")
setCPLConfigOption("SHAPE_ENCODING", NULL)
charToRaw(pt4$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt1icr_1
@

\noindent
Note that \code{ogrInfo} also takes \code{encoding=} and \code{use\_iconv=} arguments. If we do not set the encoding, GDAL recodes from CP1252, not CP1250, and the strings are wrongly rendered in UTF-8:

<<echo=TRUE,eval=FALSE>>=
pt5 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
charToRaw(pt5$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
ptcr_1
@
<<echo=TRUE,eval=FALSE>>=
all.equal(charToRaw(pt5$NAZEV[1]), charToRaw(pt4$NAZEV[1]))
@
<<echo=FALSE,eval=TRUE>>=
all.equal(ptcr_1, pt1icr_1)
@

\section{GDAL without iconv}

Next we turn to the case of GDAL built without iconv. The data displayed here are pre-stored from this system:

<<echo=FALSE,eval=TRUE>>=
load(paste(dsn, .Platform$file.sep, "point_WinCRAN.RData", sep=""))
@
<<echo=TRUE,eval=FALSE>>=
sessionInfo()
@
<<echo=FALSE,eval=TRUE>>=
sI
@
<<echo=TRUE,eval=FALSE>>=
unlist(l10n_info())
@
<<echo=FALSE,eval=TRUE>>=
unlist(l10n)
@
\noindent
This system has \pkg{rgdal} installed using the CRAN binary. Here, we can check whether iconv is available in GDAL:

<<echo=TRUE,eval=FALSE>>=
.Call("RGDAL_CPL_RECODE_ICONV", PACKAGE="rgdal")
@
<<echo=FALSE,eval=TRUE>>=
cpliconv
@

\noindent
This means that if GDAL recoding is not turned off, default non-iconv stub recoding will be used.
It will often not be sensible to permit this to happen, so users may protect the raw data by either of these two methods:

<<echo=TRUE,eval=FALSE>>=
setCPLConfigOption("SHAPE_ENCODING", "")
pt6 <- readOGR(dsn, layer, stringsAsFactors=FALSE)
setCPLConfigOption("SHAPE_ENCODING", NULL)
charToRaw(pt6$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt1cr
@
<<echo=TRUE,eval=FALSE>>=
pt7 <- readOGR(dsn, layer, stringsAsFactors=FALSE, encoding="")
charToRaw(pt7$NAZEV[1])
@
<<echo=FALSE,eval=TRUE>>=
pt1cr
@

The assumption here is that there may not be a suitable recoding matching the imported strings and the \RR console; users will need to experiment to see how \code{iconv} may be used within \RR.

\end{document}