1.. _rfc-63: 2 3======================================================================================= 4RFC 63 : Sparse datasets improvements 5======================================================================================= 6 7Author: Even Rouault 8 9Contact: even.rouault at spatialys.com 10 11Status: Adopted, implemented 12 13Version: 2.2 14 15Summary 16------- 17 18This RFC covers an improvement to manage sparse datasets, that is to say 19datasets that contain substantial empty regions. 20 21Approach 22-------- 23 24There are use cases where one needs to read or generate a dataset that 25covers a large spatial extent, but in which significant parts are not 26covered by data. There is no way in the GDAL API to quickly know which 27areas are covered or not by data, hence requiring to process all pixels, 28which is rather inefficient. Whereas some formats like GeoTIFF, VRT or 29GeoPackage can potentially give such an information without processing 30pixels. 31 32It is thus proposed to add a new method GetDataCoverageStatus() in the 33GDALRasterBand class, that takes as input a window of interest and 34returns whether it is made of data, empty blocks or a mix of them. 35 36This method will be used by the GDALDatasetCopyWholeRaster() method 37(used by CreateCopy() / gdal_translate) to avoid processing sparse 38regions when the output driver instructs it to do so. 39 40C++ API 41------- 42 43In GDALRasterBand class, a new virtual method is added : 44 45:: 46 47 virtual int IGetDataCoverageStatus( int nXOff, int nYOff, 48 int nXSize, int nYSize, 49 int nMaskFlagStop, 50 double* pdfDataPct); 51 52 53 /** 54 * \brief Get the coverage status of a sub-window of the raster. 55 * 56 * Returns whether a sub-window of the raster contains only data, only empty 57 * blocks or a mix of both. This function can be used to determine quickly 58 * if it is worth issuing RasterIO / ReadBlock requests in datasets that may 59 * be sparse. 60 * 61 * Empty blocks are blocks that contain only pixels whose value is the nodata 62 * value when it is set, or whose value is 0 when the nodata value is not set. 63 * 64 * The query is done in an efficient way without reading the actual pixel 65 * values. If not possible, or not implemented at all by the driver, 66 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED | GDAL_DATA_COVERAGE_STATUS_DATA will 67 * be returned. 68 * 69 * The values that can be returned by the function are the following, 70 * potentially combined with the binary or operator : 71 * <ul> 72 * <li>GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED : the driver does not implement 73 * GetDataCoverageStatus(). This flag should be returned together with 74 * GDAL_DATA_COVERAGE_STATUS_DATA.</li> 75 * <li>GDAL_DATA_COVERAGE_STATUS_DATA: There is (potentially) data in the queried 76 * window.</li> 77 * <li>GDAL_DATA_COVERAGE_STATUS_EMPTY: There is nodata in the queried window. 78 * This is typically identified by the concept of missing block in formats that 79 * supports it. 80 * </li> 81 * </ul> 82 * 83 * Note that GDAL_DATA_COVERAGE_STATUS_DATA might have false positives and 84 * should be interpreted more as hint of potential presence of data. For example 85 * if a GeoTIFF file is created with blocks filled with zeroes (or set to the 86 * nodata value), instead of using the missing block mechanism, 87 * GDAL_DATA_COVERAGE_STATUS_DATA will be returned. On the contrary, 88 * GDAL_DATA_COVERAGE_STATUS_EMPTY should have no false positives. 89 * 90 * The nMaskFlagStop should be generally set to 0. It can be set to a 91 * binary-or'ed mask of the above mentioned values to enable a quick exiting of 92 * the function as soon as the computed mask matches the nMaskFlagStop. For 93 * example, you can issue a request on the whole raster with nMaskFlagStop = 94 * GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon as one missing block is encountered, 95 * the function will exit, so that you can potentially refine the requested area 96 * to find which particular region(s) have missing blocks. 97 * 98 * @see GDALGetDataCoverageStatus() 99 * 100 * @param nXOff The pixel offset to the top left corner of the region 101 * of the band to be queried. This would be zero to start from the left side. 102 * 103 * @param nYOff The line offset to the top left corner of the region 104 * of the band to be queried. This would be zero to start from the top. 105 * 106 * @param nXSize The width of the region of the band to be queried in pixels. 107 * 108 * @param nYSize The height of the region of the band to be queried in lines. 109 * 110 * @param nMaskFlagStop 0, or a binary-or'ed mask of possible values 111 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED, 112 * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon 113 * as the computation of the coverage matches the mask, the computation will be 114 * stopped. *pdfDataPct will not be valid in that case. 115 * 116 * @param pdfDataPct Optional output parameter whose pointed value will be set 117 * to the (approximate) percentage in [0,100] of pixels in the queried 118 * sub-window that have valid values. The implementation might not always be 119 * able to compute it, in which case it will be set to a negative value. 120 * 121 * @return a binary-or'ed combination of possible values 122 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED, 123 * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY 124 * 125 * @note Added in GDAL 2.2 126 */ 127 128This method has a dumb default implementation that returns 129GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED \| 130GDAL_DATA_COVERAGE_STATUS_DATA 131 132The public API is made of : 133 134:: 135 136 137 /** Flag returned by GDALGetDataCoverageStatus() when the driver does not 138 * implement GetDataCoverageStatus(). This flag should be returned together 139 * with GDAL_DATA_COVERAGE_STATUS_DATA */ 140 #define GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED 0x01 141 142 /** Flag returned by GDALGetDataCoverageStatus() when there is (potentially) 143 * data in the queried window. Can be combined with the binary or operator 144 * with GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED or 145 * GDAL_DATA_COVERAGE_STATUS_EMPTY */ 146 #define GDAL_DATA_COVERAGE_STATUS_DATA 0x02 147 148 /** Flag returned by GDALGetDataCoverageStatus() when there is nodata in the 149 * queried window. This is typically identified by the concept of missing block 150 * in formats that supports it. 151 * Can be combined with the binary or operator with 152 * GDAL_DATA_COVERAGE_STATUS_DATA */ 153 #define GDAL_DATA_COVERAGE_STATUS_EMPTY 0x04 154 155 156 C++ : 157 158 int GDALRasterBand::GetDataCoverageStatus( int nXOff, 159 int nYOff, 160 int nXSize, 161 int nYSize, 162 int nMaskFlagStop, 163 double* pdfDataPct) 164 165 C : 166 int GDALGetDataCoverageStatus( GDALRasterBandH hBand, 167 int nXOff, int nYOff, 168 int nXSize, 169 int nYSize, 170 int nMaskFlagStop, 171 double* pdfDataPct); 172 173GDALRasterBand::GetDataCoverageStatus() does basic checks on the 174validity of the window before calling IGetDataCoverageStatus() 175 176Changes 177------- 178 179GDALDatasetCopyWholeRaster() and GDALRasterBandCopyWholeRaster() accepts 180a SKIP_HOLES option that can be set to YES by the output driver to cause 181GetDataCoverageStatus() to be called on each chunk of the source dataset 182to determine if contains only holes or not. 183 184Drivers 185------- 186 187This RFC upgrades the GeoTIFF and VRT drivers to implement the 188IGetDataCoverageStatus() method. 189 190The GeoTIFF driver has also receive a number of prior enhancements, 191related to that topic, for example to accept the SPARSE_OK=YES creation 192option in CreateCopy() mode (or the SPARSE_OK open option in update 193mode). 194 195Extract of the documentation of the driver: 196 197:: 198 199 GDAL makes a special interpretation of a TIFF tile or strip whose offset 200 and byte count are set to 0, that is to say a tile or strip that has no corresponding 201 allocated physical storage. On reading, such tiles or strips are considered to 202 be implicitly set to 0 or to the nodata value when it is defined. On writing, it 203 is possible to enable generating such files through the Create() interface by setting 204 the SPARSE_OK creation option to YES. Then, blocks that are never written 205 through the IWriteBlock()/IRasterIO() interfaces will have their offset and 206 byte count set to 0. This is particularly useful to save disk space and time when 207 the file must be initialized empty before being passed to a further processing 208 stage that will fill it. 209 To avoid ambiguities with another sparse mechanism discussed in the next paragraphs, 210 we will call such files with implicit tiles/strips "TIFF sparse files". They will 211 be likely *not* interoperable with TIFF readers that are not GDAL based and 212 would consider such files with implicit tiles/strips as defective. 213 214 Starting with GDAL 2.2, this mechanism is extended to the CreateCopy() and 215 Open() interfaces (for update mode) as well. If the SPARSE_OK creation option 216 (or the SPARSE_OK open option for Open()) is set to YES, even an attempt to 217 write a all 0/nodata block will be detected so that the tile/strip is not 218 allocated (if it was already allocated, then its content will be replaced by 219 the 0/nodata content). 220 221 Starting with GDAL 2.2, in the case where SPARSE_OK is *not* defined (or set 222 to its default value FALSE), for uncompressed files whose nodata value is not 223 set, or set to 0, in Create() and CreateCopy() mode, the driver will delay the 224 allocation of 0-blocks until file closing, so as to be able to write them at 225 the very end of the file, and in a way compatible of the filesystem sparse file 226 mechanisms (to be distinguished from the TIFF sparse file extension discussed 227 earlier). That is that all the empty blocks will be seen as properly allocated 228 from the TIFF point of view (corresponding strips/tiles will have valid offsets 229 and byte counts), but will have no corresponding physical storage. Provided that 230 the filesystem supports such sparse files, which is the case for most Linux 231 popular filesystems (ext2/3/4, xfs, btfs, ...) or NTFS on Windows. If the file 232 system does not support sparse files, physical storage will be 233 allocated and filled with zeros. 234 235Bindings 236-------- 237 238The Python bindings has a mapping of GDALGetDataCoverageStatus(). Other 239bindings could be updated (need to figure out how to return both a 240status flag and a percentage) 241 242Utilities 243--------- 244 245No direct changes in utilities. 246 247Results 248------- 249 250With this new capability, a VRT of size 200 000 x 200 000 pixels that 251contains 2 regions of 20x20 pixels each can be gdal_translated as a 252sparse tiled GeoTIFF in 2 seconds. The resulting GeoTIFF can be itself 253translated into another sparse tiled GeoTIFF in the same time. 254 255Future work 256----------- 257 258Future work using the new capability could be done in overview building 259or warping. Other drivers could also benefit from that new capability: 260GeoPackage, ERDAS Imagine, ... 261 262Documentation 263------------- 264 265The new method is documented. 266 267Test Suite 268---------- 269 270Tests of the VRT and GeoTIFF drivers are enhanced to test their 271IGetDataCoverageStatus() implementation. 272 273Compatibility Issues 274-------------------- 275 276C++ ABI change. No functional incompatibility foreseen. 277 278Implementation 279-------------- 280 281The implementation will be done by Even Rouault. 282 283The proposed implementation is in 284`https://github.com/rouault/gdal2/tree/sparse_datasets <https://github.com/rouault/gdal2/tree/sparse_datasets>`__ 285 286Changes can be seen with 287`https://github.com/OSGeo/gdal/compare/trunk...rouault:sparse_datasets?expand=1 <https://github.com/OSGeo/gdal/compare/trunk...rouault:sparse_datasets?expand=1>`__ 288 289Voting history 290-------------- 291 292+1 from EvenR and DanielM 293