.. _rfc-30:

================================================================================
RFC 30: Unicode Filenames
================================================================================

Authors: Frank Warmerdam

Contact: warmerdam@pobox.com

Status: Adopted

Summary
-------

This document describes steps to generally handle filenames as UTF-8
strings in GDAL/OGR. In brief, it will be assumed that filenames passed
into and returned by GDAL/OGR interfaces are UTF-8. On some operating
systems, notably Windows, this will require use of "wide character"
interfaces in the low level VSI*L API.

Key Interfaces
--------------

VSI*L API
~~~~~~~~~

All filenames in the VSI*L API will be treated as UTF-8, which means the
cpl_vsil_win32.cpp implementation will need substantial updates to use
wide character interfaces.

- VSIFOpenL()
- VSIFStatL()
- VSIReadDir()
- VSIMkdir()
- VSIRmdir()
- VSIUnlink()
- VSIRename()

Old (small file) VSI API
~~~~~~~~~~~~~~~~~~~~~~~~

The old VSIFOpen() function will be adapted to use \_wfopen() on Windows
instead of fopen() so that UTF-8 filenames will be supported.

- VSIFOpen()
- VSIStat()

Filename Parsing
~~~~~~~~~~~~~~~~

Because the path/extension delimiter characters '.', '\\', '/' and ':'
never appear as bytes within the multibyte (non-ASCII) portion of UTF-8
strings, we can safely leave the existing path parsing functions working
as they do now. They do not need to be aware of the real character
boundaries for exotic characters in UTF-8 paths. The following will be
left unchanged.

- CPLGetPath()
- CPLGetDirname()
- CPLGetFilename()
- CPLGetBasename()
- CPLGetExtension()
- CPLResetExtension()

Other
~~~~~

- CPLStat()
- CPLGetCurrentDir()
- GDALDataset::GetFileList()

These will also all need to treat filenames as UTF-8.
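
The Filename Parsing argument above rests on a property of UTF-8: every
byte of a multibyte sequence has its high bit set, so it can never equal
an ASCII delimiter. The following Python sketch illustrates this.
``split_extension`` and ``multibyte_bytes_are_non_ascii`` are hypothetical
helpers written only for this illustration; they are not part of the CPL
API, and the byte-level split is only a rough approximation of what a
function such as CPLGetExtension() does.

```python
def multibyte_bytes_are_non_ascii(text):
    """Return True if every byte of each non-ASCII character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


def split_extension(utf8_path):
    """Split UTF-8 bytes at the last '.' following the last path separator.

    Hypothetical helper: it works purely on bytes and knows nothing about
    character boundaries.
    """
    last_sep = max(utf8_path.rfind(b"/"),
                   utf8_path.rfind(b"\\"),
                   utf8_path.rfind(b":"))
    dot = utf8_path.rfind(b".")
    if dot <= last_sep:
        # No '.' after the last separator: there is no extension.
        return utf8_path, b""
    return utf8_path[:dot], utf8_path[dot + 1:]


# Same kind of filename as the example under Python Changes, with a directory added.
filename = "dir/xx\u4e2d\u6587.\u4e2d\u6587"
stem, ext = split_extension(filename.encode("utf-8"))
```

Because the split only ever cuts at an ASCII delimiter byte, both halves
remain valid UTF-8 and can be decoded again without error.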

Windows
-------

Currently the Windows cpl_vsil_win32.cpp module uses CreateFile() with
ASCII filenames. It needs to be converted to use CreateFileW() and other
wide character functions for stat(), rename, mkdir, etc. A prototype
implementation has already been developed (r20620).

.. _linux--unix--macos-x:

Linux / Unix / MacOS X
----------------------

On modern Linux, Unix and MacOS operating systems the fopen(), stat() and
readdir() functions already support UTF-8 strings. It is not currently
anticipated that any work will be needed on Linux/Unix/MacOS X, though
there is some question about this. It is considered permissible under
the definition of this RFC for old and substandard operating systems
(WinCE?) to support only ASCII, not UTF-8, filenames.

Metadata
--------

There are a variety of places where general text may contain filenames.
One obvious case is the subdataset filenames returned from the
SUBDATASET domain. Previously these were just exposed as plain text and
interpretation of the character set was undefined. As part of this RFC
we state that such filenames should be considered to be in UTF-8 format.

Python Changes
--------------

I observe with Python 2.6 that functions like gdal.Open() do not accept
unicode strings, but they do accept UTF-8 string objects. One possible
solution is to update the bindings in selective places to identify
unicode strings passed in, and transform them to UTF-8 strings.

e.g.

::

   # A unicode filename containing non-ASCII (CJK) characters.
   filename = u'xx\u4E2D\u6587.\u4E2D\u6587'
   # If the caller passed a unicode object, encode it to a UTF-8 string.
   if type(filename) == type(u'a'):
       filename = filename.encode('utf-8')

I'm not sure what the easiest way is to accomplish this in the bindings.
The key entries are:

- gdal.Open()
- ogr.Open()
- gdal.ReadDir()
- gdal.PushFinderLocation()
- gdal.FindFile()
- gdal.Unlink()

Similarly all interfaces (e.g.
gdal.ReadDir()) that return filenames will
hereafter return unicode objects rather than string objects.

Also note that in Python 3.x strings are always unicode.

C# Changes
----------

Tamas notes that in C# we normally convert the unicode C# strings into C
strings with the PtrToStringAnsi marshaller. Presumably we will need to
use a UTF-8 converter for all interface strings considered to be
filenames. I would note this should also apply to OGR string attribute
values, which are also intended to be treated as UTF-8.

(It is unclear who will take care of this aspect since the primary
author (FrankW) is not C#-binding-competent.)

Perl Changes
------------

The general rule in Perl is that all strings should be decoded before
giving them to Perl and encoded when they are output. In practice things
usually just work. To be sure, I (Ari) have added an explicit decode
from utf8 to FindFile and ReadDir (#20800).

Java Changes
------------

No changes are needed for Java. Java strings are unicode, and they are
already converted to UTF-8 in the Java SWIG bindings. That is, the Java
bindings already assume passing and receiving UTF-8 strings to/from
GDAL/OGR.

Commandline Issues
------------------

On Windows, argv[] as passed into main() will not generally be able to
represent exotic filenames that can't be represented in the locale
charset. It is possible to fetch the commandline and parse it as wide
characters using GetCommandLineW() and CommandLineToArgvW() to capture
UTF-16 filenames (easily converted to UTF-8); however, this interferes
with the use of setargv.obj to expand wildcards on Windows.

I have not been able to come up with a good solution, so for now I am
not intending to make any changes to the GDAL/OGR commandline utilities
to allow passing exotic filenames.
So this RFC is mainly aimed at
ensuring that other applications using GDAL/OGR can utilize exotic
filenames.

File Formats
------------

The proposed implementation really only addresses file format drivers
that use VSIFOpenL(), VSIFOpen() and related functions. Some drivers
dependent on external libraries (e.g. netcdf) do not have a way to hook
the file IO API and may not support UTF-8 filenames. It might be nice to
be able to distinguish these.

At the very least any driver marked with GDAL_DCAP_VIRTUALIO as "YES"
will support UTF-8. Perhaps this opportunity ought to be used to more
uniformly apply this driver metadata (done).

Test Suite
----------

We will need to introduce some test suite tests with multibyte UTF-8
filenames. In support of that, aspects of the VSI*L API (particularly
the rename, mkdir and rmdir functions, and VSIFOpenL itself) have been
exposed in Python.

Documentation
-------------

Appropriate API entry points will be documented as taking and returning
UTF-8 strings.

Implementation
--------------

Implementation is underway and being tracked in ticket #3766.