.. _rfc-30:

================================================================================
RFC 30: Unicode Filenames
================================================================================

Authors: Frank Warmerdam

Contact: warmerdam@pobox.com

Status: Adopted

Summary
-------

This document describes steps to handle filenames as UTF-8 strings
throughout GDAL/OGR. In brief, filenames passed into and returned by
GDAL/OGR interfaces will be assumed to be UTF-8. On some operating
systems, notably Windows, this will require use of "wide character"
interfaces in the low-level VSI*L API.

Key Interfaces
--------------

VSI*L API
~~~~~~~~~

All filenames in the VSI*L API will be treated as UTF-8, which means the
cpl_vsil_win32.cpp implementation will need substantial updates to use
wide character interfaces.

-  VSIFOpenL()
-  VSIFStatL()
-  VSIReadDir()
-  VSIMkdir()
-  VSIRmdir()
-  VSIUnlink()
-  VSIRename()

Old (small file) VSI API
~~~~~~~~~~~~~~~~~~~~~~~~

The old VSIFOpen() function will be adapted to use \_wfopen() on Windows
instead of fopen() so that UTF-8 filenames will be supported.

-  VSIFOpen()
-  VSIStat()

Filename Parsing
~~~~~~~~~~~~~~~~

Every byte of a multi-byte UTF-8 sequence has its high bit set, so the
path/extension delimiter characters '.', '\\', '/' and ':' will never
appear in the non-ASCII portion of UTF-8 strings. We can therefore
safely leave the existing path parsing functions working as they do now.
They do not need to be aware of the real character boundaries for exotic
characters in UTF-8 paths. The following will be left unchanged.

-  CPLGetPath()
-  CPLGetDirname()
-  CPLGetFilename()
-  CPLGetBasename()
-  CPLGetExtension()
-  CPLResetExtension()
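The property that makes byte-oriented parsing safe can be checked
directly. The following is a minimal Python sketch, not GDAL code: the
helper name ``split_extension`` and the sample filename are hypothetical,
and the helper only mimics the spirit of CPLGetExtension() by splitting
at the last '.' without any knowledge of character boundaries.

```python
# ASCII delimiters that the path parsing functions look for.
DELIMITERS = b"./\\:"


def split_extension(path_utf8):
    """Split a UTF-8 encoded filename at its last '.' using bytes only."""
    stem, dot, ext = path_utf8.rpartition(b".")
    if not dot:
        # No '.' present: the whole name is the stem.
        return path_utf8, b""
    return stem, ext


name = u"xx\u4e2d\u6587.\u4e2d\u6587"
encoded = name.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so none of them
# can be mistaken for an ASCII delimiter.
multibyte = u"\u4e2d\u6587".encode("utf-8")
assert all(byte >= 0x80 for byte in bytearray(multibyte))
assert not any(byte in bytearray(DELIMITERS) for byte in bytearray(multibyte))

stem, ext = split_extension(encoded)
```

Both halves returned by the split are still valid UTF-8, because the
split can only ever land on a genuine ASCII '.' byte.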

Other
~~~~~

-  CPLStat()
-  CPLGetCurrentDir()
-  GDALDataset::GetFileList()

These will also need to treat filenames as UTF-8.

Windows
-------

Currently the Windows cpl_vsil_win32.cpp module uses CreateFile() with
ASCII filenames. It needs to be converted to use CreateFileW() and other
wide character functions for stat(), rename, mkdir, etc. A prototype
implementation has already been developed (r20620).
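The conversion this implies can be illustrated without any Windows API
calls. The sketch below, written in Python for brevity, shows the UTF-8
to UTF-16 step that a wide character call such as CreateFileW() depends
on. The function name ``utf8_to_wide`` is hypothetical, and this is not
the code from the prototype.

```python
def utf8_to_wide(filename_utf8):
    """Return little-endian UTF-16 bytes for a UTF-8 encoded filename.

    A two-byte NUL terminator is appended, as wide character C APIs expect.
    """
    text = filename_utf8.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


utf8_name = u"xx\u4e2d\u6587.tif".encode("utf-8")
wide_name = utf8_to_wide(utf8_name)

# The name has 8 characters, all in the Basic Multilingual Plane, so it
# needs 8 UTF-16 code units plus one terminating NUL code unit.
assert len(wide_name) == (8 + 1) * 2
```

In C on Windows the same step is normally performed with
MultiByteToWideChar() using the CP_UTF8 code page.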

.. _linux--unix--macos-x:

Linux / Unix / MacOS X
----------------------

On modern Linux, Unix and MacOS X operating systems the fopen(), stat()
and readdir() functions already support UTF-8 strings. It is not
currently anticipated that any work will be needed on Linux/Unix/MacOS X,
though there is some question about this. It is considered permissible
under the definition of this RFC for old and substandard operating
systems (WinCE?) to support only ASCII filenames, not UTF-8.

Metadata
--------

There are a variety of places where general text may contain filenames.
One obvious case is the subdataset filenames returned from the
SUBDATASETS domain. Previously these were just exposed as plain text and
interpretation of the character set was undefined. As part of this RFC
we state that such filenames should be considered to be in UTF-8.

Python Changes
--------------

I observe with Python 2.6 that functions like gdal.Open() do not accept
unicode strings, but they do accept UTF-8 encoded string objects. One
possible solution is to update the bindings in selected places to detect
unicode strings passed in and transform them to UTF-8 strings.

For example:

::

   filename = u'xx\u4E2D\u6587.\u4E2D\u6587'
   # Encode unicode objects to UTF-8 before handing them to GDAL.
   if isinstance(filename, type(u'')):
       filename = filename.encode('utf-8')

I'm not sure what the easiest way is to accomplish this in the bindings.
The key entries are:

-  gdal.Open()
-  ogr.Open()
-  gdal.ReadDir()
-  gdal.PushFinderLocation()
-  gdal.FindFile()
-  gdal.Unlink()

Similarly, all interfaces that return filenames (e.g. gdal.ReadDir())
will hereafter return unicode objects rather than string objects.

Also note that in Python 3.x strings are always unicode.
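One way the two directions of conversion could be wrapped is sketched
below. The helper names ``to_utf8_filename`` and ``from_utf8_filename``
are hypothetical and are not part of the GDAL bindings; the sketch only
shows the encode-on-input and decode-on-output pattern described in this
section.

```python
def to_utf8_filename(filename):
    """Encode a unicode filename as UTF-8 bytes; pass byte strings through."""
    if isinstance(filename, type(u"")):
        return filename.encode("utf-8")
    return filename


def from_utf8_filename(raw):
    """Decode a UTF-8 byte string returned by the C API into unicode."""
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return raw


name = u"xx\u4e2d\u6587.\u4e2d\u6587"

# A filename survives the trip into and back out of the C layer unchanged.
assert from_utf8_filename(to_utf8_filename(name)) == name
```

Passing byte strings through untouched keeps existing callers that
already supply UTF-8 encoded strings working.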

C# Changes
----------

Tamas notes that in C# we normally convert the unicode C# strings into C
strings with the PtrToStringAnsi marshaller. Presumably we will need to
use a UTF-8 converter for all interface strings considered to be
filenames. I would note this should also apply to OGR string attribute
values, which are also intended to be treated as UTF-8.

(It is unclear who will take care of this aspect since the primary
author (FrankW) is not C#-binding-competent.)

Perl Changes
------------

The general rule in Perl is that all strings should be decoded before
giving them to Perl and encoded when they are output. In practice things
usually just work. To be sure, I (Ari) have added an explicit decode
from UTF-8 to FindFile and ReadDir (#20800).

Java Changes
------------

No changes are needed for Java. Java strings are unicode, and they are
already converted to UTF-8 in the Java SWIG bindings. That is, the Java
bindings already assume that UTF-8 strings are passed to and received
from GDAL/OGR.

Commandline Issues
------------------

On Windows, argv[] as passed into main() will not generally be able to
represent exotic filenames that can't be represented in the locale
charset. It is possible to fetch the commandline and parse it as wide
characters using GetCommandLineW() and CommandLineToArgvW() to capture
UTF-16 filenames (easily converted to UTF-8); however, this interferes
with the use of setargv.obj to expand wildcards on Windows.
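The limitation of a locale-encoded argv[] can be reproduced without
Windows. In the Python sketch below the cp1252 code page stands in for a
typical Western locale charset (an assumption chosen for illustration),
and the second half shows the UTF-16 to UTF-8 step that would follow a
call to CommandLineToArgvW().

```python
name = u"xx\u4e2d\u6587.tif"

# A locale charset such as cp1252 has no encoding for these characters,
# so a narrow argv[] entry cannot carry the filename intact.
try:
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

# Wide (UTF-16) arguments, by contrast, convert to UTF-8 without loss.
wide_arg = name.encode("utf-16-le")
utf8_arg = wide_arg.decode("utf-16-le").encode("utf-8")
```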

I have not been able to come up with a good solution, so for now I do
not intend to make any changes to the GDAL/OGR commandline utilities
to allow passing exotic filenames. This RFC is therefore mainly aimed at
ensuring that other applications using GDAL/OGR can utilize exotic
filenames.

File Formats
------------

The proposed implementation really only addresses file format drivers
that use VSIFOpenL(), VSIFOpen() and related functions. Some drivers
dependent on external libraries (e.g. netCDF) do not have a way to hook
the file IO API and may not support UTF-8 filenames. It might be nice to
be able to distinguish these.

At the very least, any driver marked with GDAL_DCAP_VIRTUALIO as "YES"
will support UTF-8. Perhaps this opportunity ought to be used to apply
this driver metadata more uniformly (done).
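A caller could use that metadata item to decide whether a driver is
known to accept UTF-8 filenames. The sketch below assumes a driver
object exposing GetMetadataItem(), as the Driver class in the GDAL Python
bindings does; ``DCAP_VIRTUALIO`` is the string value of the
GDAL_DCAP_VIRTUALIO macro. ``known_to_support_utf8`` and ``FakeDriver``
are hypothetical names, the latter a stand-in so the example runs
without GDAL installed.

```python
def known_to_support_utf8(driver):
    """Return True only when the driver advertises virtual IO support.

    A False result means "unknown", not "unsupported": a driver without
    the flag may still handle UTF-8 filenames correctly.
    """
    return driver.GetMetadataItem("DCAP_VIRTUALIO") == "YES"


class FakeDriver(object):
    """Hypothetical stand-in for a GDAL driver object."""

    def __init__(self, metadata):
        self._metadata = metadata

    def GetMetadataItem(self, name):
        # Mirrors the bindings in returning None for a missing item.
        return self._metadata.get(name)


virtual_io = FakeDriver({"DCAP_VIRTUALIO": "YES"})
external_lib = FakeDriver({})
```

With a real installation, the object returned by
gdal.GetDriverByName() could be passed in place of ``FakeDriver``.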

Test Suite
----------

We will need to introduce some test suite tests with multibyte UTF-8
filenames. In support of that, aspects of the VSI*L API (particularly
the rename, mkdir and rmdir functions, and VSIFOpenL() itself) have been
exposed in Python.
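The shape of such a test is sketched below using only the Python 3
standard library on a temporary directory. It does not call GDAL: in the
real test suite the os functions would be replaced by the exposed VSI*L
wrappers, whose Python names are not given in this document. The
function name ``roundtrip_utf8_filename`` is hypothetical.

```python
import os
import shutil
import tempfile


def roundtrip_utf8_filename(name_utf8):
    """Create, rename, list and delete a file whose name is UTF-8 bytes.

    Returns the directory listing taken after the rename.
    """
    base = os.fsencode(tempfile.mkdtemp())
    try:
        subdir = os.path.join(base, name_utf8)
        os.mkdir(subdir)
        original = os.path.join(subdir, name_utf8)
        with open(original, "wb") as handle:
            handle.write(b"test")
        renamed = os.path.join(subdir, b"renamed_" + name_utf8)
        os.rename(original, renamed)
        listing = os.listdir(subdir)
        os.unlink(renamed)
        os.rmdir(subdir)
        return listing
    finally:
        # Clean up even if one of the steps above failed.
        shutil.rmtree(base, ignore_errors=True)


name = u"xx\u4e2d\u6587.\u4e2d\u6587".encode("utf-8")
listing = roundtrip_utf8_filename(name)
```

Using byte paths keeps the sketch independent of the interpreter's
filesystem encoding, which matches the RFC's view of filenames as UTF-8
byte strings.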

Documentation
-------------

Appropriate API entry points will be documented as taking and returning
UTF-8 strings.

Implementation
--------------

Implementation is underway and being tracked in ticket #3766.