.. _rfc-11:

================================================================================
RFC 11: Fast Format Identification
================================================================================

Author: Frank Warmerdam

Contact: warmerdam@pobox.com

Status: Adopted (and Implemented)

Summary
-------

This RFC aims to add the ability for applications to quickly identify
which files in the file system are in GDAL supported formats without
necessarily opening any of them. It is mainly intended to support GUI
file browsers that filter files by type.

This is accomplished by extending the GDALOpenInfo structure to hold
more directory context, and by adding an Identify() method on the
GDALDriver which a driver can implement to quickly identify that a file
is of a given format without doing a more expensive Open() operation.

GDALOpenInfo
------------

The Open() (or Identify()) methods of many drivers need to probe for
files associated with the target file in order to open or identify a
file as being of a particular format. For instance, in order to open an
ESRI BIL file (EHdr driver) it is necessary to probe for a file with
the same basename as the target file, but the extension .hdr. Currently
this is typically accomplished with VSIStatL() calls or similar, which
can be fairly expensive.

In order to reduce the need for such searches to touch the operating
system's file system machinery, the GDALOpenInfo structure will be
extended to hold an optional list of files. This is the list of all
files at the same level in the file system as the target file, including
the target file. The filenames will *not* include any path components,
and are essentially just the output of CPLReadDir() on the parent
directory. If the target object does not have filesystem semantics then
the file list should be NULL.
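
To illustrate why a sibling list helps, the following is a minimal,
GDAL-free sketch of checking for a companion ``.hdr`` file purely in
memory. The names ``EqualNoCase``, ``Basename`` and ``HasSibling`` are
hypothetical helpers invented for this example and are not part of GDAL;
a real driver would use papszSiblingFiles together with the CPL string
functions instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: case-insensitive comparison of two file names.
static bool EqualNoCase(const std::string &a, const std::string &b)
{
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y)
                      { return std::tolower(x) == std::tolower(y); });
}

// Hypothetical helper: strip any directory part and the final extension.
static std::string Basename(const std::string &path)
{
    const std::size_t slash = path.find_last_of("/\\");
    const std::string name =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    const std::size_t dot = name.find_last_of('.');
    return (dot == std::string::npos) ? name : name.substr(0, dot);
}

// Returns true if the sibling list (bare file names, no path components)
// contains "<basename of target>.<ext>", ignoring case. No file system
// call is made: the answer comes entirely from the in-memory list.
bool HasSibling(const std::string &target, const std::string &ext,
                const std::vector<std::string> &siblings)
{
    assert(!ext.empty());  // an extension to look for is required
    const std::string wanted = Basename(target) + "." + ext;
    return std::any_of(siblings.begin(), siblings.end(),
                       [&](const std::string &s)
                       { return EqualNoCase(s, wanted); });
}
```

Because the directory is read once and the list is reused, testing many
files in the same directory costs one directory read rather than one
stat call per candidate companion file.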

The following is added to GDALOpenInfo:

::

    GDALOpenInfo( const char * pszFile, GDALAccess eAccessIn, char **papszSiblings );
    char **papszSiblingFiles;

The new constructor allows the file list to be passed in to populate the
papszSiblingFiles member (the argument will be copied). The existing
constructor will use CPLGetDirname() to get the directory of the
passed pszFile, and CPLReadDir() to read the corresponding file list.
The new constructor is primarily aimed at efficient implementation of
the GDALIdentifyDriver() function described below, avoiding re-reading
the file list for each file to be tested.

Identify()
----------

The GDALDriver class will be extended with the following function
pointer:

::

    int (*pfnIdentify)( GDALOpenInfo * );

When implemented by a driver, the function is intended to return TRUE
(non-zero) if the driver determines that the file passed in via
GDALOpenInfo appears to be of the format the driver is implemented for.
To invoke it, applications should call the new function:

::

    GDALDriverH *GDALIdentifyDriver( const char *pszDatasource, const char **papszDirFiles );

Internally GDALIdentifyDriver() will do the following:

1. A GDALOpenInfo structure will be initialized based on pszDatasource
   and papszDirFiles.
2. It will iterate over all drivers similarly to GDALOpen(). For each
   driver it will use the pfnIdentify function if available, otherwise
   it will use the pfnOpen() method to establish if the driver supports
   the file.
3. It will return the driver handle for the first driver to respond
   positively or NULL if none accept it.

Driver Changes
--------------

In theory it is not necessary for any drivers to be modified, since
GDALIdentifyDriver() will fall back to using the pfnOpen function to
test.
But in practice, no optimization is achieved unless at least some
drivers (hopefully those for which Open can be very expensive) are
updated. Part of the ongoing effort, then, is to implement identify
functions for GDAL drivers.

Generally speaking it should be easy to craft an identify function from
the initial test logic in the open function. For instance, the GeoTIFF
driver might be changed like this:

::

    int GTiffDataset::Identify( GDALOpenInfo * poOpenInfo )

    {
    /* -------------------------------------------------------------------- */
    /*      We have a special hook for handling opening a specific          */
    /*      directory of a TIFF file.                                       */
    /* -------------------------------------------------------------------- */
        if( EQUALN(poOpenInfo->pszFilename,"GTIFF_DIR:",10) )
            return TRUE;

    /* -------------------------------------------------------------------- */
    /*      First we check to see if the file has the expected header       */
    /*      bytes. Four bytes are needed: the byte order mark and the       */
    /*      version number tested below.                                    */
    /* -------------------------------------------------------------------- */
        if( poOpenInfo->nHeaderBytes < 4 )
            return FALSE;

        if( (poOpenInfo->pabyHeader[0] != 'I' || poOpenInfo->pabyHeader[1] != 'I')
            && (poOpenInfo->pabyHeader[0] != 'M' || poOpenInfo->pabyHeader[1] != 'M'))
            return FALSE;

        // We can't support BigTIFF files for now.
        if( poOpenInfo->pabyHeader[2] == 43 && poOpenInfo->pabyHeader[3] == 0 )
            return FALSE;

        // The version number must be 42 (0x2A) in either byte order.
        if( (poOpenInfo->pabyHeader[2] != 0x2A || poOpenInfo->pabyHeader[3] != 0)
            && (poOpenInfo->pabyHeader[3] != 0x2A || poOpenInfo->pabyHeader[2] != 0) )
            return FALSE;

        return TRUE;
    }

The open might then be modified to use the identify function to avoid
duplicating the test logic.

::

    GDALDataset *GTiffDataset::Open( GDALOpenInfo * poOpenInfo )

    {
        TIFF    *hTIFF;

        if( !Identify( poOpenInfo ) )
            return NULL;

    /* -------------------------------------------------------------------- */
    /*      We have a special hook for handling opening a specific          */
    /*      directory of a TIFF file.                                       */
    /* -------------------------------------------------------------------- */
        if( EQUALN(poOpenInfo->pszFilename,"GTIFF_DIR:",10) )
            return OpenDir( poOpenInfo->pszFilename );

        GTiffOneTimeInit();
        ...

Drivers which require header files, such as the EHdr driver, might
implement Identify() like this:

::

    int EHdrDataset::Identify( GDALOpenInfo * poOpenInfo )

    {
        const char *pszHDRFilename;

    /* -------------------------------------------------------------------- */
    /*      We assume the user is pointing to the binary (ie. .bil) file.   */
    /* -------------------------------------------------------------------- */
        if( poOpenInfo->nHeaderBytes < 2 )
            return FALSE;

    /* -------------------------------------------------------------------- */
    /*      Now we need to tear apart the filename to form a .HDR           */
    /*      filename.                                                       */
    /* -------------------------------------------------------------------- */
        CPLString osBasename = CPLGetBasename( poOpenInfo->pszFilename );
        pszHDRFilename = CPLFormCIFilename( "", osBasename, "hdr" );

        // CSLFindString() returns the index of the match, or -1 if the
        // string is not in the list, so test for a non-negative result.
        if( CSLFindString( poOpenInfo->papszSiblingFiles, pszHDRFilename ) >= 0 )
            return TRUE;
        else
            return FALSE;
    }

Some performance and file system activity logging will also be done to
identify drivers that are currently expensive. During the initial
implementation a variety of drivers will be updated, including the
following:

- HFA
- GTiff
- JPEG
- PNG
- GIF
- HDF4
- DTED
- USGS DEM
- MrSID
- JP2KAK
- ECW
- EHdr
- RST

CPLReadDir()
------------

Currently the VSIMemFilesystemHandler implemented in cpl_vsi_mem.cpp,
which provides "filesystem like" access to objects in memory, does not
implement directory reading services. In order to properly populate the
directory listing this will need to be added.

To do this the CPLReadDir() function will also need to be reimplemented
to use VSIFilesystemHandler::ReadDir() instead of the direct
implementation in cpl_dir.cpp. The win32 and unix/posix implementations
of VSIFilesystemHandler::ReadDir() already exist. This should
essentially complete the virtualization of filesystem access services.

CPLReadDir() will also be renamed VSIReadDir(), with a stub under the
old name kept for backward compatibility.

Compatibility
-------------

There are no anticipated backward compatibility problems. However,
forward compatibility will be affected, in that drivers updated in trunk
with the Identify function cannot be ported back into 1.4 builds and
used there. Unmodified drivers and externally maintained drivers should
not be impacted by this development.

SWIG Implications
-----------------

The GDALIdentifyDriver() and VSIReadDir() functions will need to be
exposed via SWIG.

Regression Testing
------------------

A test script for the Identify() function will be added to the
autotest/gcore directory. It will include testing of identify in a
/vsimem memory collection.

Implementation Plan
-------------------

The new features will be implemented by Frank Warmerdam in *trunk* for
the GDAL/OGR 1.5.0 release.

Performance Tests
-----------------

A very quick test using Identify() without actually opening the files
reduced the time to identify all files in a directory with 70 TIFF files
(on an NFS share) from 2 seconds to 0.5 seconds. So saving the overhead
of actually opening files can be significant for some formats, including
very common ones like GeoTIFF.
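
The saving comes from drivers answering through Identify() rather than
Open(). The dispatch described in the Identify() section can be modeled
with the following simplified, GDAL-free sketch. ``OpenInfo``,
``Driver`` and ``IdentifyDriver`` are hypothetical stand-ins invented
for this example (for GDALOpenInfo, GDALDriver and
GDALIdentifyDriver() respectively) and do not reproduce the real GDAL
types or signatures.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for GDALOpenInfo: the file name, the leading
// header bytes, and the bare names of the files in the same directory.
struct OpenInfo
{
    std::string filename;
    std::string header;
    std::vector<std::string> siblings;
};

// Hypothetical stand-in for a driver. 'identify' may be left empty for
// drivers that have not been updated; 'open' models the expensive path
// and simply reports whether the open would have succeeded.
struct Driver
{
    std::string name;
    std::function<bool(const OpenInfo &)> identify;
    std::function<bool(const OpenInfo &)> open;
};

// Walk the drivers in registration order. Prefer the cheap identify
// callback; fall back to open only when a driver has no identify.
// Returns the first accepting driver, or nullptr if none accepts.
const Driver *IdentifyDriver(const std::vector<Driver> &drivers,
                             const OpenInfo &info)
{
    for (const Driver &d : drivers)
    {
        assert(d.identify || d.open);  // a driver needs at least one callback
        const bool accepted = d.identify ? d.identify(info) : d.open(info);
        if (accepted)
            return &d;
    }
    return nullptr;
}
```

In this model a driver that supplies ``identify`` never has its ``open``
callback invoked during identification, which is where the time is
saved; drivers without it still work, only more slowly.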