.. _rfc-5:

=======================================================================================
RFC 5: Unicode support in GDAL
=======================================================================================

Author: Andrey Kiselev

Contact: dron@ak4719.spb.edu

Status: Development

Summary
-------

This document proposes how to make the GDAL core locale independent
while preserving support for native character sets.

Main concepts
-------------

GDAL should be modified to support the following three main ideas:

1. Users work in localized environments using their native languages.
   That means we cannot assume the ASCII character set when working
   with string data passed to GDAL.
2. GDAL uses UTF-8 encoding internally when working with strings.
3. GDAL uses the Unicode versions of third-party APIs whenever possible.

So all strings used in GDAL are in UTF-8, not in plain ASCII. That
means we should convert user input from the local encoding to UTF-8
during interactive sessions, and do the opposite for GDAL output. For
example, when a user passes a filename as a command-line parameter to a
GDAL utility, that filename should be converted to UTF-8 immediately and
only afterwards passed to functions like GDALOpen() or OGROpen(). All
functions that take character strings as parameters assume UTF-8 (except
for the few that convert between encodings, see Implementation). The
same holds for output. The output functions embedded in GDAL
(CPLError/CPLDebug) should convert all strings from UTF-8 to the local
encoding just before printing them. Custom error handlers should be
aware that they receive UTF-8 and should transform the strings passed to
them accordingly.

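The round trip described above can be sketched for one concrete case.
The two helpers below are hypothetical (they are not part of GDAL or of
this proposal) and assume that the local encoding is ISO-8859-1; a real
implementation would have to handle whatever encoding the locale
reports. Characters that cannot be represented on output are replaced
with '?', mirroring the behaviour proposed for CPLString::recode().

.. code-block:: cpp

   #include <cstddef>
   #include <string>

   // Hypothetical helper: convert ISO-8859-1 bytes (assumed local
   // encoding) to UTF-8, as would be done for user input.
   std::string Latin1ToUtf8(const std::string &osLatin1)
   {
       std::string osUtf8;
       for (unsigned char ch : osLatin1)
       {
           if (ch < 0x80)
           {
               osUtf8 += static_cast<char>(ch);  // ASCII is unchanged
           }
           else
           {
               // U+0080..U+00FF become a two-byte UTF-8 sequence.
               osUtf8 += static_cast<char>(0xC0 | (ch >> 6));
               osUtf8 += static_cast<char>(0x80 | (ch & 0x3F));
           }
       }
       return osUtf8;
   }

   // Hypothetical helper: convert UTF-8 back to ISO-8859-1 for output.
   // Anything outside U+0000..U+00FF is replaced with '?'.
   std::string Utf8ToLatin1(const std::string &osUtf8)
   {
       std::string osLatin1;
       std::size_t i = 0;
       while (i < osUtf8.size())
       {
           const unsigned char ch = static_cast<unsigned char>(osUtf8[i]);
           if (ch < 0x80)
           {
               osLatin1 += static_cast<char>(ch);
               ++i;
           }
           else if ((ch == 0xC2 || ch == 0xC3) && i + 1 < osUtf8.size() &&
                    (static_cast<unsigned char>(osUtf8[i + 1]) & 0xC0) == 0x80)
           {
               const unsigned char chNext =
                   static_cast<unsigned char>(osUtf8[i + 1]);
               osLatin1 += static_cast<char>(((ch & 0x03) << 6) | (chNext & 0x3F));
               i += 2;
           }
           else
           {
               // Not representable: emit '?' and skip the whole sequence.
               osLatin1 += '?';
               ++i;
               while (i < osUtf8.size() &&
                      (static_cast<unsigned char>(osUtf8[i]) & 0xC0) == 0x80)
                   ++i;
           }
       }
       return osLatin1;
   }

A utility would call the first helper on each command-line argument
before handing it to GDAL, and an error handler would call the second
one just before printing.
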
The string encoding comes up again when GDAL needs to call a third-party
API: UTF-8 should be converted to the encoding suitable for that API. In
particular, that means we should convert UTF-8 to UTF-16 before calling
the CreateFile() function in the Windows implementation of VSIFOpenL().
Another example is the PostgreSQL API. PostgreSQL stores strings in
UTF-8 encoding internally (when the database is created with that
encoding), so we should notify the server that the passed string is
already in UTF-8, and it will be stored as is, without any conversion or
loss.

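As an illustration of the UTF-8 to UTF-16 step, here is a minimal
portable sketch. The function name Utf8ToUtf16 is hypothetical and not
part of GDAL. Malformed input is replaced with U+FFFD, and overlong
forms are not rejected, so this is a sketch under simplifying
assumptions rather than a validating converter.

.. code-block:: cpp

   #include <cstddef>
   #include <string>

   // Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
   std::u16string Utf8ToUtf16(const std::string &osUtf8)
   {
       std::u16string osUtf16;
       std::size_t i = 0;
       while (i < osUtf8.size())
       {
           const unsigned char chLead = static_cast<unsigned char>(osUtf8[i++]);
           char32_t nCodePoint = 0xFFFD;  // replacement character by default
           int nTrail = 0;                // number of continuation bytes
           if (chLead < 0x80)
               nCodePoint = chLead;
           else if ((chLead & 0xE0) == 0xC0)
           {
               nCodePoint = chLead & 0x1F;
               nTrail = 1;
           }
           else if ((chLead & 0xF0) == 0xE0)
           {
               nCodePoint = chLead & 0x0F;
               nTrail = 2;
           }
           else if ((chLead & 0xF8) == 0xF0)
           {
               nCodePoint = chLead & 0x07;
               nTrail = 3;
           }

           for (; nTrail > 0; --nTrail)
           {
               if (i >= osUtf8.size() ||
                   (static_cast<unsigned char>(osUtf8[i]) & 0xC0) != 0x80)
               {
                   nCodePoint = 0xFFFD;  // truncated or malformed sequence
                   break;
               }
               nCodePoint = (nCodePoint << 6) |
                            (static_cast<unsigned char>(osUtf8[i++]) & 0x3F);
           }
           if (nCodePoint > 0x10FFFF)
               nCodePoint = 0xFFFD;

           if (nCodePoint < 0x10000)
           {
               osUtf16 += static_cast<char16_t>(nCodePoint);
           }
           else
           {
               // Code points above the BMP need a surrogate pair.
               nCodePoint -= 0x10000;
               osUtf16 += static_cast<char16_t>(0xD800 + (nCodePoint >> 10));
               osUtf16 += static_cast<char16_t>(0xDC00 + (nCodePoint & 0x3FF));
           }
       }
       return osUtf16;
   }

On Windows the same conversion is also available from the system through
MultiByteToWideChar() with the CP_UTF8 code page, which may be
preferable to hand-written decoding.
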
For file format drivers the string representation should be worked out
on a per-driver basis. Not all file formats support non-ASCII
characters. For example, various .HDR labeled rasters are just 7-bit
ASCII text files, and it is not a good idea to write 8-bit strings to
such files. When strings extracted from such a file need to be passed
outside the driver (e.g., in a SetMetadata() call), they should be
converted to UTF-8. Strings that are only used internally in the driver
need no conversion.

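A driver for a 7-bit format could guard its writes with a check like the
one below. This is only a sketch: IsPlainAscii is a hypothetical helper
name, and what the driver does when the check fails (reject, escape or
replace characters) is left open here.

.. code-block:: cpp

   #include <string>

   // Hypothetical helper: true if every byte is 7-bit ASCII, i.e. the
   // string is safe to write to a plain ASCII label file. Such a string
   // is also valid UTF-8 without any conversion.
   bool IsPlainAscii(const std::string &osValue)
   {
       for (unsigned char ch : osValue)
       {
           if (ch >= 0x80)
               return false;
       }
       return true;
   }
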
In some cases the file encoding can differ from the local system
encoding, and we have no way to know the file encoding other than asking
the user (for example, imagine that someone added an 8-bit non-ASCII
string field to the plain text .HDR file mentioned above). In that case
we cannot convert from the local encoding to UTF-8; we must convert from
the file encoding to UTF-8. So we need a way to obtain the file encoding
on a per-datasource basis. The natural solution is to introduce an
optional open parameter "ENCODING" to the GDALOpen/OGROpen functions.
Unfortunately, those functions do not accept options; that should be
introduced in another RFC. Fortunately, there is no need to add the
encoding parameter immediately, because it is independent of the general
i18n process. We can add UTF-8 support as defined in this RFC and add
support for forcing a per-datasource encoding later, when open options
are introduced.

Implementation
--------------

-  New character conversion functions will be introduced in the
   CPLString class. Objects of that class always contain a UTF-8 string
   internally.

::

   // Get the string in the local encoding (or the one named by nEncoding)
   // from the internal UTF-8 encoded string.
   // Out-of-range characters are replaced with '?' in the output string.
   // nEncoding: code name of the encoding. If 0, the local system
   // encoding will be used.
   char* CPLString::recode( int nEncoding = 0 );

   // Construct a UTF-8 string object from a string in another encoding.
   // nEncoding: code name of the encoding. If 0, the local system
   // encoding will be used.
   CPLString::CPLString( const char*, int nEncoding );

   // Construct a UTF-8 string object from an array of wchar_t elements.
   // The source encoding is system specific.
   CPLString::CPLString( wchar_t* );

   // Convert the UTF-8 string into an array of wchar_t elements.
   // The destination encoding is system specific.
   operator wchar_t* (void) const;

-  In order to use non-ASCII characters in user input, every application
   should call setlocale(LC_ALL, "") right after the entry point.

-  Code example. Let's look at how the GDAL utilities and core code
   should be changed with regard to Unicode.

For input, instead of

::

   pszFilename = argv[i];
   if( pszFilename )
       hDataset = GDALOpen( pszFilename, GA_ReadOnly );

we should do

::

   CPLString oFilename(argv[i], 0); // <-- Conversion from local encoding to UTF-8
   hDataset = GDALOpen( oFilename.c_str(), GA_ReadOnly );

For output, instead of

::

   printf( "Description = %s\n", GDALGetDescription(hBand) );

we should do

::

   CPLString oDescription( GDALGetDescription(hBand) );
   // Conversion from UTF-8 to the local encoding happens in recode().
   printf( "Description = %s\n", oDescription.recode( 0 ) );

The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
above will be further processed in the GDAL core. On Windows, instead of

::

   hFile = CreateFile( pszFilename, dwDesiredAccess,
       FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, dwCreationDisposition,
       dwFlagsAndAttributes, NULL );

we do

::

   CPLString oFilename( pszFilename );
   // I prefer to call the wide-character version explicitly
   // rather than rely on the UNICODE/_UNICODE compile-time switches.
   hFile = CreateFileW( (wchar_t *)oFilename, dwDesiredAccess,
           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
           dwCreationDisposition,  dwFlagsAndAttributes, NULL );

-  The actual implementation of the character conversion functions is
   not specified in this document yet; it needs additional discussion.
   The main problem is that we need not only local<->UTF-8 encoding
   conversions, but *arbitrary*\ <->UTF-8 ones. That requires
   significant support on the software side.

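One possible source of that support on POSIX systems is the iconv
library. The sketch below is an assumption about how an arbitrary to
UTF-8 conversion could look, not a decision made by this RFC. The
function name RecodeToUtf8 is hypothetical, the set of accepted encoding
names depends on the iconv implementation, and some platforms declare
the input buffer argument of iconv() as ``const char **`` instead of
``char **``.

.. code-block:: cpp

   #include <cerrno>
   #include <cstddef>
   #include <iconv.h>
   #include <stdexcept>
   #include <string>

   // Hypothetical helper: convert a string from a named encoding to
   // UTF-8 using POSIX iconv. Throws on unknown encodings or bad input.
   std::string RecodeToUtf8(const std::string &osSource,
                            const char *pszSrcEncoding)
   {
       iconv_t hConv = iconv_open("UTF-8", pszSrcEncoding);
       if (hConv == (iconv_t)-1)
           throw std::runtime_error("unsupported source encoding");

       std::string osResult;
       // iconv() does not modify the input bytes, only the pointer.
       char *pszIn = const_cast<char *>(osSource.data());
       std::size_t nInLeft = osSource.size();
       char szBuf[256];
       while (nInLeft > 0)
       {
           char *pszOut = szBuf;
           std::size_t nOutLeft = sizeof(szBuf);
           const std::size_t nRet =
               iconv(hConv, &pszIn, &nInLeft, &pszOut, &nOutLeft);
           const int nErr = errno;  // save before any other call
           osResult.append(szBuf, sizeof(szBuf) - nOutLeft);
           // E2BIG only means the output buffer was full: loop again.
           if (nRet == (std::size_t)-1 && nErr != E2BIG)
           {
               iconv_close(hConv);
               throw std::runtime_error("invalid or incomplete input sequence");
           }
       }
       iconv_close(hConv);
       return osResult;
   }

For example, RecodeToUtf8(osValue, "ISO-8859-1") would convert a Latin-1
string read from a file. Windows has no iconv in the base system, so a
portable solution would need either a bundled library or a second code
path.
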
Backward Compatibility
----------------------

This new functionality breaks GDAL/OGR backward compatibility in the way
8-bit characters are handled. Previously, users could rely on all 8-bit
character strings being passed through GDAL/OGR unchanged, containing
exactly the same data all the way. Now that is only true for 7-bit ASCII
and 8-bit UTF-8 encoded strings. Note that if you used only the ASCII
subset with GDAL, you are not affected by these changes.

From The Unicode Standard, chapter 5:

*The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text.*

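The caveat quoted above can be checked for a given compiler. The helpers
below are hypothetical names used only for illustration; they report the
width of wchar_t, which is 16 bits on Windows and typically 32 bits on
Linux and macOS.

.. code-block:: cpp

   #include <climits>

   // Width of wchar_t in bits for the current compiler.
   constexpr int WideCharBits()
   {
       return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
   }

   // True when wchar_t is wide enough to hold one UTF-16 code unit,
   // which the wide-character Windows APIs expect.
   constexpr bool WideCharHoldsUtf16()
   {
       return WideCharBits() >= 16;
   }

Code that converts to wchar_t for a system API could guard the
conversion with a static_assert on WideCharHoldsUtf16() and use
fixed-width types such as char16_t everywhere else.
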
References
----------

-  `The Unicode Standard, Version 4.0 - Implementation
   Guidelines <http://unicode.org/versions/Unicode4.0.0/ch05.pdf>`__ -
   Chapter 5 (PDF)
-  FAQ on how to use Unicode in software:
   `http://www.cl.cam.ac.uk/~mgk25/unicode.html <http://www.cl.cam.ac.uk/~mgk25/unicode.html>`__
-  FLTK implementation of string conversion functions:
   `http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c <http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c>`__
-  `http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html <http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html>`__
-  Ticket #1494: UTF-8 encoding for GML output.
-  Filenames are also covered in RFC 30 (rfc30_utf8_filenames).
