.. _rfc-5:

=======================================================================================
RFC 5: Unicode support in GDAL
=======================================================================================

Author: Andrey Kiselev

Contact: dron@ak4719.spb.edu

Status: Development

Summary
-------

This document contains a proposal on how to make the GDAL core
locale-independent while preserving support for native character sets.

Main concepts
-------------

GDAL should be modified to support the following three main ideas:

1. Users work in a localized environment using their native languages.
   That means we cannot assume the ASCII character set when working with
   string data passed to GDAL.
2. GDAL uses UTF-8 encoding internally when working with strings.
3. GDAL uses the Unicode version of third-party APIs whenever possible.

So all strings used in GDAL are in UTF-8, not in plain ASCII. That
means we should convert the user's input from the local encoding to UTF-8
during interactive sessions. The opposite should be done for GDAL
output. For example, when a user passes a filename as a command-line
parameter to the GDAL utilities, that filename should be immediately
converted to UTF-8 and only afterwards passed to functions like
GDALOpen() or OGROpen(). All functions that take character strings as
parameters assume UTF-8 (with the exception of several that perform the
conversion between different encodings, see Implementation). The same is
valid for output functions. The output functions embedded in GDAL
(CPLError/CPLDebug) should convert all strings from UTF-8 to the local
encoding just before printing them. Custom error handlers should be
aware of the UTF-8 issue and do the proper transformation of the strings
passed to them.

The string encoding comes up again when GDAL needs to call a
third-party API. UTF-8 should be converted to an encoding suitable for that
API.
In particular, that means we should convert UTF-8 to UTF-16 before
calling the CreateFile() function in the Windows implementation of
VSIFOpenL(). Another example is the PostgreSQL API. PostgreSQL stores
strings in UTF-8 encoding internally, so we should notify the server that
the passed string is already in UTF-8; it will then be stored as is,
without any conversion or loss.

For file format drivers the string representation should be worked out
on a per-driver basis. Not all file formats support non-ASCII characters.
For example, various .HDR labeled rasters are just 7-bit ASCII text
files and it is not a good idea to write 8-bit strings to such files.
When we need to pass strings extracted from such a file outside the
driver (e.g., in a SetMetadata() call), we should convert them to UTF-8.
If the extracted strings are used only internally in the driver, no
conversion is needed.

In some cases the file encoding can differ from the local system
encoding, and we have no way to know the file encoding other than to
ask the user (for example, imagine a case when someone added an 8-bit
non-ASCII string field to the plain text .HDR file mentioned above). That
means we cannot convert from the local encoding to UTF-8; we must convert
from the file encoding to UTF-8 instead. So we need a way to get the file
encoding on a per-datasource basis. The natural solution to the problem is
to introduce an optional open parameter "ENCODING" to the GDALOpen/OGROpen
functions. Unfortunately, those functions do not accept options. That
should be introduced in another RFC. Fortunately, there is no need to
add the encoding parameter immediately, because it is independent of the
general i18n process. We can add UTF-8 support as it is defined in this
RFC and add support for forcing a per-datasource encoding later, when
open options are introduced.
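
As an illustration of the file-encoding case described above, the
following is a minimal sketch of recoding a string from a known file
encoding to UTF-8 before it leaves a driver. It assumes the file is known
to be ISO-8859-1 (Latin-1), the one 8-bit encoding whose byte values map
directly to Unicode code points. The function name ``Latin1ToUTF8`` is
hypothetical and is not part of GDAL or of the API proposed in this RFC.

```cpp
#include <string>

// Hypothetical helper (not a GDAL function): recode ISO-8859-1 bytes
// to UTF-8. Each Latin-1 byte value equals its Unicode code point, so
// bytes below 0x80 are copied unchanged and bytes 0x80..0xFF become a
// two-byte UTF-8 sequence.
std::string Latin1ToUTF8(const std::string &osLatin1)
{
    std::string osUTF8;
    osUTF8.reserve(osLatin1.size() * 2);
    for (unsigned char ch : osLatin1)
    {
        if (ch < 0x80)
        {
            // 7-bit ASCII is identical in both encodings.
            osUTF8 += static_cast<char>(ch);
        }
        else
        {
            // Lead byte 110xxxxx carries the top two bits,
            // continuation byte 10xxxxxx carries the low six bits.
            osUTF8 += static_cast<char>(0xC0 | (ch >> 6));
            osUTF8 += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return osUTF8;
}
```

A driver could apply such a function to a field read from the file before
passing it to a call like SetMetadata(). Other 8-bit encodings need a
lookup table or a conversion library, which is the open question discussed
under Implementation.
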

Implementation
--------------

- New character conversion functions will be introduced in the CPLString
  class. Objects of that class always contain a UTF-8 string internally.

::

   // Get the string in another encoding from the internal UTF-8 encoded string.
   // Out-of-range characters are replaced with '?' in the output string.
   // nEncoding  A code name of the encoding. If 0, the local system
   //            encoding will be used.
   char* CPLString::recode( int nEncoding = 0 );

   // Construct a UTF-8 string object from a string in another encoding.
   // nEncoding  A code name of the encoding. If 0, the local system
   //            encoding will be used.
   CPLString::CPLString( const char*, int nEncoding );

   // Construct a UTF-8 string object from an array of wchar_t elements.
   // The source encoding is system specific.
   CPLString::CPLString( wchar_t* );

   // Convert the string from UTF-8 into an array of wchar_t elements.
   // The destination encoding is system specific.
   operator wchar_t* (void) const;

- In order to use non-ASCII characters in user input, every application
  should call the setlocale(LC_ALL, "") function right after the entry
  point.

- Code example. Let's look at how the GDAL utilities and core code should
  be changed with regard to Unicode.

For input, instead of

::

   pszFilename = argv[i];
   if( pszFilename )
       hDataset = GDALOpen( pszFilename, GA_ReadOnly );

we should do

::

   CPLString oFilename(argv[i], 0); // <-- Conversion from local encoding to UTF-8
   hDataset = GDALOpen( oFilename.c_str(), GA_ReadOnly );

For output, instead of

::

   printf( "Description = %s\n", GDALGetDescription(hBand) );

we should do

::

   CPLString oDescription( GDALGetDescription(hBand) );
   printf( "Description = %s\n", oDescription.recode( 0 ) ); // <-- Conversion
                                                             // from UTF-8 to local

The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
above will be further processed in the GDAL core. On Windows, instead of

::

   hFile = CreateFile( pszFilename, dwDesiredAccess,
       FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, dwCreationDisposition,
       dwFlagsAndAttributes, NULL );

we do

::

   CPLString oFilename( pszFilename );
   // I prefer to call the wide character version explicitly
   // rather than specify the _UNICODE switch.
   hFile = CreateFileW( (wchar_t *)oFilename, dwDesiredAccess,
           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
           dwCreationDisposition, dwFlagsAndAttributes, NULL );

- The actual implementation of the character conversion functions is
  not specified in this document yet. It needs additional discussion.
  The main problem is that we need not only local<->UTF-8 encoding
  conversions, but *arbitrary*\ <->UTF-8 ones. That requires
  significant support on the software side.

Backward Compatibility
----------------------

This new functionality breaks GDAL/OGR backward compatibility in the way
8-bit characters are handled. Previously, users could rely on all 8-bit
character strings being passed through GDAL/OGR without change and
containing exactly the same data all the way.
Now this is true only for 7-bit ASCII and 8-bit UTF-8 encoded
strings. Note that if you used only the ASCII subset with GDAL, you are
not affected by these changes.

From The Unicode Standard, chapter 5:

*The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text.*

References
----------

- `The Unicode Standard, Version 4.0 - Implementation
  Guidelines <http://unicode.org/versions/Unicode4.0.0/ch05.pdf>`__ -
  Chapter 5 (PDF)
- FAQ on how to use Unicode in software:
  `http://www.cl.cam.ac.uk/~mgk25/unicode.html <http://www.cl.cam.ac.uk/~mgk25/unicode.html>`__
- FLTK implementation of string conversion functions:
  `http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c <http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c>`__
- `http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html <http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html>`__
- Ticket #1494: UTF-8 encoding for GML output.
- Filenames are also covered in [[wiki:rfc30_utf8_filenames]]