.. _rfc-23:

================================================================================
RFC 23.1: Unicode support in OGR
================================================================================

Authors: Frank Warmerdam

Contact: warmerdam@pobox.com

Status: Adopted (implemented)

Summary
-------

This document proposes preliminary steps towards GDAL/OGR handling
strings internally in UTF-8, and supporting conversion between different
encodings.

Main concepts
-------------

GDAL should be modified to support the following three main ideas:

1. C functions will be provided to support a variety of encoding
   conversions, including conversion between representations (e.g. UTF-8
   to UCS-2/wchar_t).
2. Character encodings will be identified by iconv() style strings.
3. OFTString/OFTStringList feature attributes in OGR will be treated as
   being in UTF-8.

This RFC specifically does not attempt to address issues of using
non-ASCII filenames. It also does not attempt to define the encoding of
other strings used in GDAL/OGR (such as field names, metadata, etc.).
These would presumably be addressed in a later RFC building on this one.

CPLRecode API
-------------

The following three C-callable functions will be introduced for recoding
strings, and for converting between wchar_t (wide character) and char
(multi-byte) formats:

::

   char *CPLRecode( const char *pszSource,
                    const char *pszSrcEncoding, const char *pszDstEncoding );

   char *CPLRecodeFromWChar( const wchar_t *pwszSource,
                             const char *pszSrcEncoding,
                             const char *pszDstEncoding );

   wchar_t *CPLRecodeToWChar( const char *pszSource,
                              const char *pszSrcEncoding,
                              const char *pszDstEncoding );

In each case the returned string is zero terminated, as is the input
string, and the returned string should be deallocated with CPLFree(). In
case of error the returned string will be NULL, and the function will
issue a CPLError(). The functions will be marked with CPL_DLL and
considered part of the public GDAL/OGR API for use by applications as
well as for internal use.

Encoding Names
--------------

It is proposed that the encoding names will be the same sorts of names
used by iconv(), for example "UTF-8", "LATIN5", "CP850" and
"ISO_8859-1". These encoding names do not appear to be a 1:1 match with
C library locale names (like "en_CA.utf8" for instance), which may cause
some issues.

Some particular names of interest:

-  "": The current locale. Use this when converting from/to the user's
   locale.
-  "UTF-8": Unicode in multi-byte encoding. Most of the time this will
   be our internal lingua franca.
-  "POSIX": I think this is roughly ASCII (perhaps with some extended
   characters?).
-  "UCS-2": Two-byte Unicode. This is a wide character format and only
   suitable for use with the wchar_t functions.

On some systems you can use "iconv --list" to get a list of supported
encodings.
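To illustrate the gap between locale names and encoding names, the sketch below asks the C library for the character set of the user's locale rather than parsing a name such as "en_CA.utf8". It assumes a POSIX system that provides ``nl_langinfo(CODESET)``; the helper name ``current_locale_encoding`` is hypothetical and not part of GDAL, and the returned name is not guaranteed to be one that every iconv() implementation accepts.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not GDAL API): report the character set name of
 * the user's locale, e.g. "UTF-8" or "ISO-8859-1". */
const char *current_locale_encoding(void)
{
    /* Adopt the locale from the environment (LANG, LC_ALL, ...). If the
     * call fails, the previously active locale stays in effect, which is
     * the "C" locale unless the program changed it earlier. */
    (void)setlocale(LC_CTYPE, "");

    /* CODESET is the POSIX langinfo item naming the locale's encoding.
     * The returned buffer may be overwritten by later setlocale() or
     * nl_langinfo() calls, so copy it if it must be kept. */
    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}
```

On many systems the returned name can be handed directly to iconv_open(), but that pairing is a convention rather than a guarantee, which is one reason the empty string "" is proposed as the portable spelling of "the current locale".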

iconv()
-------

It is proposed to implement the CPLRecode() function using the iconv()
and related functions when available.

There is an excellent implementation of this API in GNU libiconv. Some
operating systems also provide the iconv() API as part of the C library
(on Linux the GNU C library includes its own implementation; all unix?);
however, the system iconv() often has a restricted set of conversions
supported so it may be desirable to use libiconv in preference to the
system iconv() even when it is available.
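As a rough illustration of how a CPLRecode()-style function could sit on top of iconv(), here is a minimal sketch. It is not GDAL's implementation: the name ``recode_with_iconv`` is hypothetical, it allocates with malloc() (so the caller uses free() rather than CPLFree()), and it reports errors only by returning NULL instead of issuing a CPLError(). Like the proposed API, it assumes the destination encoding has no embedded zero bytes.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not GDAL API): recode a zero terminated string.
 * Returns a malloc()ed zero terminated string, or NULL on error. */
char *recode_with_iconv(const char *src, const char *src_enc,
                        const char *dst_enc)
{
    assert(src != NULL && src_enc != NULL && dst_enc != NULL);

    /* Note the argument order: iconv_open() takes the destination first. */
    iconv_t cd = iconv_open(dst_enc, src_enc);
    if (cd == (iconv_t)-1)
        return NULL;                  /* unsupported conversion */

    size_t in_left = strlen(src);
    size_t cap = in_left * 4 + 16;    /* initial guess, grown on demand */
    char *dst = malloc(cap);
    if (dst == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)src;           /* iconv() takes a non-const char ** */
    char *out = dst;
    size_t out_left = cap - 1;        /* keep one byte for the terminator */

    while (in_left > 0) {
        if (iconv(cd, &in, &in_left, &out, &out_left) != (size_t)-1)
            break;                    /* all input consumed */
        if (errno != E2BIG) {         /* invalid or truncated input */
            free(dst);
            iconv_close(cd);
            return NULL;
        }
        /* Output buffer full: double it and continue where we stopped. */
        size_t used = (size_t)(out - dst);
        cap *= 2;
        char *bigger = realloc(dst, cap);
        if (bigger == NULL) {
            free(dst);
            iconv_close(cd);
            return NULL;
        }
        dst = bigger;
        out = dst + used;
        out_left = cap - 1 - used;
    }

    /* Flush any pending shift sequence of a stateful destination encoding. */
    if (iconv(cd, NULL, NULL, &out, &out_left) == (size_t)-1) {
        free(dst);
        iconv_close(cd);
        return NULL;
    }

    *out = '\0';
    iconv_close(cd);
    return dst;
}
```

A caller would write something like ``recode_with_iconv(value, "ISO-8859-1", "UTF-8")`` and free() the result. Whether a given encoding name is accepted depends on the iconv() implementation in use, which is the portability concern raised above.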

If iconv() is not available, a stub implementation of the recode
services will be provided which:

-  Implements UCS-2 / UTF-8 interconversion using either mbtowc/wctomb,
   or an implementation derived from
   `http://www.cl.cam.ac.uk/~mgk25/unicode.html <http://www.cl.cam.ac.uk/~mgk25/unicode.html>`__.
-  Implements recoding between "" and "UTF-8" by doing nothing, but
   issuing a warning on the first use if the current locale does not
   appear to be the "C" locale.
-  Implements recoding from "ASCII" to "UTF-8" as a null operation.
-  Implements recoding from "UTF-8" to "ASCII" by turning all non-ASCII
   multi-byte characters into '?'.

This hopefully gives us a weak operational status when built without
iconv(), but full operation when it is available.
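Two of the stub behaviours listed above are small enough to sketch directly. The functions below are hypothetical illustrations, not the code GDAL ships: ``utf8_to_ascii_lossy`` replaces each non-ASCII character (not each byte) with a single '?', and ``ucs2_to_utf8`` hand-encodes wide characters up to U+FFFF as UTF-8, assuming each wchar_t element holds one UCS-2 code unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* UTF-8 -> ASCII: every non-ASCII character becomes one '?'.
 * Returns a malloc()ed string, or NULL if allocation fails. */
char *utf8_to_ascii_lossy(const char *utf8)
{
    assert(utf8 != NULL);
    char *ascii = malloc(strlen(utf8) + 1);   /* output never grows */
    if (ascii == NULL)
        return NULL;

    char *out = ascii;
    for (const unsigned char *p = (const unsigned char *)utf8; *p; ++p) {
        if (*p < 0x80)
            *out++ = (char)*p;                /* plain ASCII byte */
        else if ((*p & 0xC0) != 0x80)
            *out++ = '?';                     /* lead byte of a sequence */
        /* continuation bytes (10xxxxxx) are dropped silently */
    }
    *out = '\0';
    return ascii;
}

/* UCS-2 -> UTF-8 for a zero terminated wide string.
 * Returns NULL for values above U+FFFF or if allocation fails. */
char *ucs2_to_utf8(const wchar_t *wide)
{
    assert(wide != NULL);
    size_t n = wcslen(wide);
    char *utf8 = malloc(n * 3 + 1);           /* at most 3 bytes per unit */
    if (utf8 == NULL)
        return NULL;

    char *out = utf8;
    for (size_t i = 0; i < n; ++i) {
        unsigned long c = (unsigned long)wide[i];
        if (c < 0x80) {
            *out++ = (char)c;
        } else if (c < 0x800) {
            *out++ = (char)(0xC0 | (c >> 6));
            *out++ = (char)(0x80 | (c & 0x3F));
        } else if (c <= 0xFFFF) {
            *out++ = (char)(0xE0 | (c >> 12));
            *out++ = (char)(0x80 | ((c >> 6) & 0x3F));
            *out++ = (char)(0x80 | (c & 0x3F));
        } else {
            free(utf8);                       /* outside the UCS-2 range */
            return NULL;
        }
    }
    *out = '\0';
    return utf8;
}
```

Skipping continuation bytes is what keeps the lossy conversion at one '?' per character; a byte-by-byte replacement would emit two or three question marks for a single accented letter or symbol.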

The --with-iconv= option will be added to configure. The argument can be
the path to a libiconv installation or the special value 'system'
indicating that the system lib should be used. Alternatively,
--without-iconv can be used to avoid using iconv.

OFTString/OFTStringList Fields
------------------------------

It is declared that OGR string attribute values will be in UTF-8. This
means that OGR drivers are responsible for translating format specific
representations to UTF-8 when reading, and back to the format specific
representation when writing. In many cases (of simple ASCII text) this
requires no transformation.

This implies that the arguments to methods like
OGRFeature::SetField( int i, const char \*) should be UTF-8, and that
GetFieldAsString() will return UTF-8.

The same issues apply to OFTStringList lists of strings. Each string
will be assumed to be UTF-8.
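As an illustration of the read-side translation a driver is expected to perform, the sketch below converts a raw attribute value to UTF-8 for a hypothetical format that stores its text as ISO-8859-1. The function name ``latin1_field_to_utf8`` and the choice of source encoding are assumptions made for the example; a real driver would more likely call CPLRecode() with whatever encoding its format declares.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver-side helper (not GDAL API): convert a raw
 * ISO-8859-1 attribute value into the UTF-8 form OGR expects.
 * Returns a malloc()ed string, or NULL if allocation fails. */
char *latin1_field_to_utf8(const char *raw)
{
    assert(raw != NULL);

    /* Worst case: every input byte expands to two output bytes. */
    char *utf8 = malloc(strlen(raw) * 2 + 1);
    if (utf8 == NULL)
        return NULL;

    char *out = utf8;
    for (const unsigned char *p = (const unsigned char *)raw; *p; ++p) {
        if (*p < 0x80) {
            /* ASCII is identical in both encodings: no transformation. */
            *out++ = (char)*p;
        } else {
            /* ISO-8859-1 bytes 0x80..0xFF map to U+0080..U+00FF, which
             * UTF-8 encodes as two bytes. */
            *out++ = (char)(0xC0 | (*p >> 6));
            *out++ = (char)(0x80 | (*p & 0x3F));
        }
    }
    *out = '\0';
    return utf8;
}
```

For pure ASCII input the result is byte-for-byte identical to the source, which is the "no transformation" case mentioned above. The write path would apply the inverse mapping and needs a policy for characters the target format cannot represent.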

OLCStringsAsUTF8 Capability Flag
--------------------------------

Some drivers (e.g. CSV) cannot reliably know the encoding of their
inputs, so it isn't always practical to guarantee conversion to UTF-8.
Therefore a new layer level capability called "StringsAsUTF8",
represented by the macro "OLCStringsAsUTF8", will be testable at the
layer level with TestCapability(). Drivers which are certain to return
string attributes as UTF-8 should return TRUE, while drivers that do not
know the encoding they return should return FALSE. Any driver which
knows its encoding should convert to UTF-8.

OGR Driver Updates
------------------

The following OGR drivers could benefit immediately from support for
recoding to UTF-8 in one way or another.

-  ODBC (add support for wchar_t / NVARCHAR fields)
-  Shapefile
-  GML (I'm not sure how the XML encoding values all map to our concept
   of encoding)
-  Postgres

I'm sure a number of the other drivers, particularly the RDBMS drivers,
could benefit from an update.

Implementation
--------------

Frank Warmerdam will implement the core iconv() capabilities, the
CPLRecode() additions and update the ODBC driver. Other OGR drivers
would be updated as time and demand mandates to conform to the
definitions in this RFC by interested developers.

The core work will be completed for GDAL/OGR 1.6.0 release.

References
----------

-  `The Unicode Standard, Version 4.0 - Implementation
   Guidelines <http://unicode.org/versions/Unicode4.0.0/ch05.pdf>`__ -
   Chapter 5 (PDF)
-  FAQ on how to use Unicode in software:
   `http://www.cl.cam.ac.uk/~mgk25/unicode.html <http://www.cl.cam.ac.uk/~mgk25/unicode.html>`__
-  FLTK implementation of string conversion functions:
   `http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c <http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c>`__
-  `http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html <http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html>`__
-  Ticket #1494 : UTF-8 encoding for GML output.
-  Libiconv:
   `http://www.gnu.org/software/libiconv/ <http://www.gnu.org/software/libiconv/>`__
-  ICU (another i18n library):
   `http://www.icu-project.org/ <http://www.icu-project.org/>`__
