.. _rfc-11:

================================================================================
RFC 11: Fast Format Identification
================================================================================

Author: Frank Warmerdam

Contact: warmerdam@pobox.com

Status: Adopted (and Implemented)

Summary
-------

This RFC aims to add the ability for applications to quickly identify
which files in the file system are in GDAL supported formats without
necessarily opening any of them. It is mainly intended to support GUI
file browsers that present files by type.

This is accomplished by extending the GDALOpenInfo structure to hold
more directory context, and by adding an Identify() method on the
GDALDriver which a driver can implement to quickly identify that a file
is of a given format without doing a more expensive Open() operation.

GDALOpenInfo
------------

The Open() (or Identify()) methods of many drivers need to probe for
files associated with the target file in order to open or identify a
file as being of a particular format. For instance, in order to open an
ESRI BIL file (EHdr driver) it is necessary to probe for a file with
the same basename as the target file, but the extension .hdr. Currently
this is typically accomplished with VSIStatL() calls or similar, which
can be fairly expensive.

In order to reduce the need for such searches to touch the operating
system file system machinery, the GDALOpenInfo structure will be
extended to hold an optional list of files. This is the list of all
files at the same level in the file system as the target file, including
the target file. The filenames will *not* include any path components,
and are essentially just the output of CPLReadDir() on the parent
directory. If the target object does not have filesystem semantics then
the file list should be NULL.

The following is added to GDALOpenInfo:

::

       GDALOpenInfo( const char * pszFile, GDALAccess eAccessIn, char **papszSiblings );
       char **papszSiblingFiles;

The new constructor allows the file list to be passed in to populate the
papszSiblingFiles member (the argument will be copied). The existing
two-argument constructor will use CPLGetDirname() to get the directory
of the passed pszFile, and CPLReadDir() to read the corresponding file
list. The new constructor is primarily aimed at efficient implementation
of the GDALIdentifyDriver() function described later, avoiding
re-reading the file list for each file to be tested.
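
The benefit of carrying the sibling list can be illustrated with a small
standalone sketch. It does not use GDAL itself: ``OpenInfoSketch``,
``EqualNoCase()`` and ``HasSidecar()`` are hypothetical names invented
for this illustration, and the target filename is assumed to carry no
path components. The point is only that a sidecar lookup becomes an
in-memory search once the directory has been listed.

.. code-block:: cpp

   #include <algorithm>
   #include <cctype>
   #include <string>
   #include <vector>

   // Hypothetical stand-in for GDALOpenInfo: a path-free target name plus
   // the path-free names of every file in the same directory.
   struct OpenInfoSketch
   {
       std::string osFilename;
       std::vector<std::string> aosSiblingFiles;
   };

   // Case-insensitive comparison of two file names.
   static bool EqualNoCase( const std::string &osA, const std::string &osB )
   {
       return osA.size() == osB.size() &&
              std::equal( osA.begin(), osA.end(), osB.begin(),
                          []( unsigned char chA, unsigned char chB )
                          { return std::tolower( chA ) == std::tolower( chB ); } );
   }

   // Return true if a sibling with the same basename and the requested
   // extension is listed. No file system call is made.
   static bool HasSidecar( const OpenInfoSketch &oInfo, const std::string &osExt )
   {
       // substr( 0, npos ) keeps the whole name when there is no dot.
       const std::string::size_type nDot = oInfo.osFilename.rfind( '.' );
       const std::string osWanted = oInfo.osFilename.substr( 0, nDot ) + "." + osExt;

       for( const std::string &osName : oInfo.aosSiblingFiles )
       {
           if( EqualNoCase( osName, osWanted ) )
               return true;
       }
       return false;
   }

A caller that tests many files in one directory would list the directory
once and hand the same list to every ``OpenInfoSketch``, which is the
saving the new constructor is meant to enable.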

Identify()
----------

The GDALDriver class will be extended with the following function
pointer:

::

     int      (*pfnIdentify)( GDALOpenInfo * );

When implemented by a driver, the function is intended to return TRUE
(non-zero) if the driver determines that the file passed in via
GDALOpenInfo appears to be of the format the driver is implemented for.
To make use of this, applications should call the new function:

::

     GDALDriverH GDALIdentifyDriver( const char *pszDatasource, const char **papszDirFiles );

Internally GDALIdentifyDriver() will do the following:

1. A GDALOpenInfo structure will be initialized based on pszDatasource
   and papszDirFiles.
2. It will iterate over all drivers similarly to GDALOpen(). For each
   driver it will use the pfnIdentify function if available, otherwise
   it will use the pfnOpen() method to establish if the driver supports
   the file.
3. It will return the driver handle for the first driver to respond
   positively or NULL if none accept it.
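
The three steps above can be sketched in standalone C++ as follows. This
is a simplified model, not GDAL code: ``DriverSketch`` and
``IdentifyDriverSketch()`` are hypothetical names, the open information
is reduced to a header string, and ``std::function`` members stand in
for the pfnIdentify and pfnOpen function pointers.

.. code-block:: cpp

   #include <functional>
   #include <string>
   #include <vector>

   // Hypothetical, simplified driver record. In GDAL the corresponding
   // members are function pointers taking a GDALOpenInfo *.
   struct DriverSketch
   {
       std::string osName;
       // Cheap test; may be left empty by drivers that were not updated.
       std::function<bool( const std::string & )> pfnIdentify;
       // Stands in for a full (expensive) open that succeeds or fails.
       std::function<bool( const std::string & )> pfnOpen;
   };

   // Return the first driver that accepts osHeader, or nullptr if none does.
   static const DriverSketch *IdentifyDriverSketch(
       const std::vector<DriverSketch> &aoDrivers, const std::string &osHeader )
   {
       for( const DriverSketch &oDriver : aoDrivers )
       {
           bool bAccepted = false;

           if( oDriver.pfnIdentify )
               bAccepted = oDriver.pfnIdentify( osHeader );  // preferred path
           else if( oDriver.pfnOpen )
               bAccepted = oDriver.pfnOpen( osHeader );      // fallback path

           if( bAccepted )
               return &oDriver;
       }
       return nullptr;
   }

Because the loop stops at the first positive answer, driver order matters
in this sketch just as it does for GDALOpen(), and a driver that only
provides the open callback is still considered.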

Driver Changes
--------------

In theory it is not necessary for any drivers to be modified, since
GDALIdentifyDriver() will fall back to using the pfnOpen function to
test. But in practice, no optimization is achieved unless at least some
drivers (hopefully those for which Open can be very expensive) are
updated. Part of the ongoing effort then is to implement identify
functions for GDAL drivers.

Generally speaking it should be easy to craft an identify function from
the initial test logic in the open function. For instance, the GeoTIFF
driver might be changed like this:

::

   int GTiffDataset::Identify( GDALOpenInfo * poOpenInfo )

   {
   /* -------------------------------------------------------------------- */
   /*      We have a special hook for handling opening a specific          */
   /*      directory of a TIFF file.                                       */
   /* -------------------------------------------------------------------- */
       if( EQUALN(poOpenInfo->pszFilename,"GTIFF_DIR:",10) )
           return TRUE;

   /* -------------------------------------------------------------------- */
   /*      First we check to see if the file has the expected header       */
   /*      bytes.                                                          */
   /* -------------------------------------------------------------------- */
       // Bytes 0 through 3 are examined below, so require at least four.
       if( poOpenInfo->nHeaderBytes < 4 )
           return FALSE;

       // Byte order mark: "II" (little endian) or "MM" (big endian).
       if( (poOpenInfo->pabyHeader[0] != 'I' || poOpenInfo->pabyHeader[1] != 'I')
           && (poOpenInfo->pabyHeader[0] != 'M' || poOpenInfo->pabyHeader[1] != 'M'))
           return FALSE;

       // We can't support BigTIFF files for now.
       if( poOpenInfo->pabyHeader[2] == 43 && poOpenInfo->pabyHeader[3] == 0 )
           return FALSE;

       // Classic TIFF version number 42 (0x2A), in either byte order.
       if( (poOpenInfo->pabyHeader[2] != 0x2A || poOpenInfo->pabyHeader[3] != 0)
           && (poOpenInfo->pabyHeader[3] != 0x2A || poOpenInfo->pabyHeader[2] != 0) )
           return FALSE;

       return TRUE;
   }

The open function might then be modified to use the identify function to
avoid duplicating the test logic.

::

   GDALDataset *GTiffDataset::Open( GDALOpenInfo * poOpenInfo )

   {
       TIFF    *hTIFF;

       if( !Identify( poOpenInfo ) )
           return NULL;

   /* -------------------------------------------------------------------- */
   /*      We have a special hook for handling opening a specific          */
   /*      directory of a TIFF file.                                       */
   /* -------------------------------------------------------------------- */
       if( EQUALN(poOpenInfo->pszFilename,"GTIFF_DIR:",10) )
           return OpenDir( poOpenInfo->pszFilename );

       GTiffOneTimeInit();
   ...

Drivers which require header files, such as the EHdr driver, might
implement Identify() like this:

::

   int EHdrDataset::Identify( GDALOpenInfo * poOpenInfo )

   {
   /* -------------------------------------------------------------------- */
   /*      We assume the user is pointing to the binary (i.e. .bil) file.  */
   /* -------------------------------------------------------------------- */
       if( poOpenInfo->nHeaderBytes < 2 )
           return FALSE;

   /* -------------------------------------------------------------------- */
   /*      Now we need to tear apart the filename to form a .HDR           */
   /*      filename.                                                       */
   /* -------------------------------------------------------------------- */
       CPLString osBasename = CPLGetBasename( poOpenInfo->pszFilename );
       const char *pszHDRFilename = CPLFormCIFilename( "", osBasename, "hdr" );

       // CSLFindString() returns the index of the match, or -1 if absent,
       // so the result must be compared against zero.
       if( CSLFindString( poOpenInfo->papszSiblingFiles, pszHDRFilename ) >= 0 )
           return TRUE;
       else
           return FALSE;
   }

During the initial implementation a variety of drivers will be updated,
including those listed below. In addition, some performance and file
system activity logging will be done to identify drivers that are
currently expensive.

-  HFA
-  GTiff
-  JPEG
-  PNG
-  GIF
-  HDF4
-  DTED
-  USGS DEM
-  MrSID
-  JP2KAK
-  ECW
-  EHdr
-  RST

CPLReadDir()
------------

Currently the VSIMemFilesystemHandler implemented in cpl_vsi_mem.cpp,
which provides "filesystem like" access to objects in memory, does not
implement directory reading services. In order to properly populate the
directory listing this will need to be added.

To do this the CPLReadDir() function will also need to be reimplemented
to use VSIFilesystemHandler::ReadDir() instead of being implemented
directly in cpl_dir.cpp. The win32 and unix/posix implementations of
VSIFilesystemHandler::ReadDir() already exist. This should essentially
complete the virtualization of filesystem access services.

CPLReadDir() will also be renamed VSIReadDir(), but with a stub under
the old name available for backward compatibility.
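
As a rough illustration of what directory reading means for an in-memory
file system, the following standalone sketch lists the direct children of
a directory from a flat set of full paths. It is not the
VSIMemFilesystemHandler implementation: ``FilesystemHandlerSketch`` and
``MemFilesystemSketch`` are hypothetical classes, and storing the files
as a sorted set of path strings is an assumption made for brevity.

.. code-block:: cpp

   #include <set>
   #include <string>
   #include <vector>

   // Hypothetical miniature of a virtual file system handler interface.
   class FilesystemHandlerSketch
   {
     public:
       virtual ~FilesystemHandlerSketch() = default;

       // Return the path-free names of the entries directly inside osDir.
       virtual std::vector<std::string>
           ReadDir( const std::string &osDir ) const = 0;
   };

   // In-memory handler: files are stored by full path, e.g. "/vsimem/a.tif".
   class MemFilesystemSketch : public FilesystemHandlerSketch
   {
     public:
       void AddFile( const std::string &osPath ) { oFiles.insert( osPath ); }

       std::vector<std::string>
           ReadDir( const std::string &osDir ) const override
       {
           const std::string osPrefix = osDir + "/";
           std::vector<std::string> aosNames;

           for( const std::string &osPath : oFiles )
           {
               // Keep direct children only: the path must start with the
               // directory prefix and contain no further '/' after it.
               if( osPath.size() > osPrefix.size() &&
                   osPath.compare( 0, osPrefix.size(), osPrefix ) == 0 &&
                   osPath.find( '/', osPrefix.size() ) == std::string::npos )
               {
                   aosNames.push_back( osPath.substr( osPrefix.size() ) );
               }
           }
           return aosNames;
       }

     private:
       std::set<std::string> oFiles;  // iterated in sorted order
   };

With a handler of this shape behind a common interface, a single
directory reading entry point can serve both on-disk and in-memory
paths, which is the kind of virtualization this section describes.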

Compatibility
-------------

There are no anticipated backward compatibility problems. However,
forward compatibility will be affected, in that drivers updated in trunk
with the Identify function will not be able to be ported back into 1.4
builds and used there. Unmodified drivers and externally maintained
drivers should not be impacted by this development.

SWIG Implications
-----------------

The GDALIdentifyDriver() and VSIReadDir() functions will need to be
exposed via SWIG.

Regression Testing
------------------

A test script for the Identify() function will be added to the
autotest/gcore directory. It will include testing of identify in a
/vsimem memory collection.

Implementation Plan
-------------------

The new features will be implemented by Frank Warmerdam in *trunk* for
the GDAL/OGR 1.5.0 release.

Performance Tests
-----------------

A very quick test that used Identify() without actually opening the
files reduced the time to identify all files in a directory with 70 TIFF
files (on an NFS share) from 2 seconds to 0.5 seconds. So saving the
overhead of actually opening files can be significant for some formats,
including very common ones like GeoTIFF.