1.. _rfc-63:
2
3=======================================================================================
4RFC 63 : Sparse datasets improvements
5=======================================================================================
6
7Author: Even Rouault
8
9Contact: even.rouault at spatialys.com
10
11Status: Adopted, implemented
12
13Version: 2.2
14
15Summary
16-------
17
18This RFC covers an improvement to manage sparse datasets, that is to say
19datasets that contain substantial empty regions.
20
21Approach
22--------
23
24There are use cases where one needs to read or generate a dataset that
25covers a large spatial extent, but in which significant parts are not
26covered by data. There is no way in the GDAL API to quickly know which
27areas are covered or not by data, hence requiring to process all pixels,
28which is rather inefficient. Whereas some formats like GeoTIFF, VRT or
29GeoPackage can potentially give such an information without processing
30pixels.
31
32It is thus proposed to add a new method GetDataCoverageStatus() in the
33GDALRasterBand class, that takes as input a window of interest and
34returns whether it is made of data, empty blocks or a mix of them.
35
36This method will be used by the GDALDatasetCopyWholeRaster() method
37(used by CreateCopy() / gdal_translate) to avoid processing sparse
38regions when the output driver instructs it to do so.
39
40C++ API
41-------
42
43In GDALRasterBand class, a new virtual method is added :
44
45::
46
47    virtual int IGetDataCoverageStatus( int nXOff, int nYOff,
48                                        int nXSize, int nYSize,
49                                        int nMaskFlagStop,
50                                        double* pdfDataPct);
51
52
53   /**
54    * \brief Get the coverage status of a sub-window of the raster.
55    *
56    * Returns whether a sub-window of the raster contains only data, only empty
57    * blocks or a mix of both. This function can be used to determine quickly
58    * if it is worth issuing RasterIO / ReadBlock requests in datasets that may
59    * be sparse.
60    *
61    * Empty blocks are blocks that contain only pixels whose value is the nodata
62    * value when it is set, or whose value is 0 when the nodata value is not set.
63    *
64    * The query is done in an efficient way without reading the actual pixel
65    * values. If not possible, or not implemented at all by the driver,
66    * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED | GDAL_DATA_COVERAGE_STATUS_DATA will
67    * be returned.
68    *
69    * The values that can be returned by the function are the following,
70    * potentially combined with the binary or operator :
71    * <ul>
72    * <li>GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED : the driver does not implement
73    * GetDataCoverageStatus(). This flag should be returned together with
74    * GDAL_DATA_COVERAGE_STATUS_DATA.</li>
75    * <li>GDAL_DATA_COVERAGE_STATUS_DATA: There is (potentially) data in the queried
76    * window.</li>
77    * <li>GDAL_DATA_COVERAGE_STATUS_EMPTY: There is nodata in the queried window.
78    * This is typically identified by the concept of missing block in formats that
79    * supports it.
80    * </li>
81    * </ul>
82    *
83    * Note that GDAL_DATA_COVERAGE_STATUS_DATA might have false positives and
84    * should be interpreted more as hint of potential presence of data. For example
85    * if a GeoTIFF file is created with blocks filled with zeroes (or set to the
86    * nodata value), instead of using the missing block mechanism,
87    * GDAL_DATA_COVERAGE_STATUS_DATA will be returned. On the contrary,
88    * GDAL_DATA_COVERAGE_STATUS_EMPTY should have no false positives.
89    *
90    * The nMaskFlagStop should be generally set to 0. It can be set to a
91    * binary-or'ed mask of the above mentioned values to enable a quick exiting of
92    * the function as soon as the computed mask matches the nMaskFlagStop. For
93    * example, you can issue a request on the whole raster with nMaskFlagStop =
94    * GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon as one missing block is encountered,
95    * the function will exit, so that you can potentially refine the requested area
96    * to find which particular region(s) have missing blocks.
97    *
98    * @see GDALGetDataCoverageStatus()
99    *
100    * @param nXOff The pixel offset to the top left corner of the region
101    * of the band to be queried. This would be zero to start from the left side.
102    *
103    * @param nYOff The line offset to the top left corner of the region
104    * of the band to be queried. This would be zero to start from the top.
105    *
106    * @param nXSize The width of the region of the band to be queried in pixels.
107    *
108    * @param nYSize The height of the region of the band to be queried in lines.
109    *
110    * @param nMaskFlagStop 0, or a binary-or'ed mask of possible values
111    * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED,
112    * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon
113    * as the computation of the coverage matches the mask, the computation will be
114    * stopped. *pdfDataPct will not be valid in that case.
115    *
116    * @param pdfDataPct Optional output parameter whose pointed value will be set
117    * to the (approximate) percentage in [0,100] of pixels in the queried
118    * sub-window that have valid values. The implementation might not always be
119    * able to compute it, in which case it will be set to a negative value.
120    *
121    * @return a binary-or'ed combination of possible values
122    * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED,
123    * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY
124    *
125    * @note Added in GDAL 2.2
126    */
127
128This method has a dumb default implementation that returns
129GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED \|
130GDAL_DATA_COVERAGE_STATUS_DATA
131
132The public API is made of :
133
134::
135
136
137   /** Flag returned by GDALGetDataCoverageStatus() when the driver does not
138    * implement GetDataCoverageStatus(). This flag should be returned together
139    * with GDAL_DATA_COVERAGE_STATUS_DATA */
140   #define GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED 0x01
141
142   /** Flag returned by GDALGetDataCoverageStatus() when there is (potentially)
143    * data in the queried window. Can be combined with the binary or operator
144    * with GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED or
145    * GDAL_DATA_COVERAGE_STATUS_EMPTY */
146   #define GDAL_DATA_COVERAGE_STATUS_DATA          0x02
147
148   /** Flag returned by GDALGetDataCoverageStatus() when there is nodata in the
149    * queried window. This is typically identified by the concept of missing block
150    * in formats that supports it.
151    * Can be combined with the binary or operator with
152    * GDAL_DATA_COVERAGE_STATUS_DATA */
153   #define GDAL_DATA_COVERAGE_STATUS_EMPTY         0x04
154
155
156   C++ :
157
158   int  GDALRasterBand::GetDataCoverageStatus( int nXOff,
159                                               int nYOff,
160                                               int nXSize,
161                                               int nYSize,
162                                               int nMaskFlagStop,
163                                               double* pdfDataPct)
164
165   C :
166   int GDALGetDataCoverageStatus( GDALRasterBandH hBand,
167                                  int nXOff, int nYOff,
168                                  int nXSize,
169                                  int nYSize,
170                                  int nMaskFlagStop,
171                                  double* pdfDataPct);
172
173GDALRasterBand::GetDataCoverageStatus() does basic checks on the
174validity of the window before calling IGetDataCoverageStatus()
175
176Changes
177-------
178
179GDALDatasetCopyWholeRaster() and GDALRasterBandCopyWholeRaster() accepts
180a SKIP_HOLES option that can be set to YES by the output driver to cause
181GetDataCoverageStatus() to be called on each chunk of the source dataset
182to determine if contains only holes or not.
183
184Drivers
185-------
186
187This RFC upgrades the GeoTIFF and VRT drivers to implement the
188IGetDataCoverageStatus() method.
189
190The GeoTIFF driver has also receive a number of prior enhancements,
191related to that topic, for example to accept the SPARSE_OK=YES creation
192option in CreateCopy() mode (or the SPARSE_OK open option in update
193mode).
194
195Extract of the documentation of the driver:
196
197::
198
199   GDAL makes a special interpretation of a TIFF tile or strip whose offset
200   and byte count are set to 0, that is to say a tile or strip that has no corresponding
201   allocated physical storage. On reading, such tiles or strips are considered to
202   be implicitly set to 0 or to the nodata value when it is defined. On writing, it
203   is possible to enable generating such files through the Create() interface by setting
204   the SPARSE_OK creation option to YES. Then, blocks that are never written
205   through the IWriteBlock()/IRasterIO() interfaces will have their offset and
206   byte count set to 0. This is particularly useful to save disk space and time when
207   the file must be initialized empty before being passed to a further processing
208   stage that will fill it.
209   To avoid ambiguities with another sparse mechanism discussed in the next paragraphs,
210   we will call such files with implicit tiles/strips "TIFF sparse files". They will
211   be likely *not* interoperable with TIFF readers that are not GDAL based and
212   would consider such files with implicit tiles/strips as defective.
213
214   Starting with GDAL 2.2, this mechanism is extended to the CreateCopy() and
215   Open() interfaces (for update mode) as well. If the SPARSE_OK creation option
216   (or the SPARSE_OK open option for Open()) is set to YES, even an attempt to
217   write a all 0/nodata block will be detected so that the tile/strip is not
218   allocated (if it was already allocated, then its content will be replaced by
219   the 0/nodata content).
220
221   Starting with GDAL 2.2, in the case where SPARSE_OK is *not* defined (or set
222   to its default value FALSE), for uncompressed files whose nodata value is not
223   set, or set to 0, in Create() and CreateCopy() mode, the driver will delay the
224   allocation of 0-blocks until file closing, so as to be able to write them at
225   the very end of the file, and in a way compatible of the filesystem sparse file
226   mechanisms (to be distinguished from the TIFF sparse file extension discussed
227   earlier). That is that all the empty blocks will be seen as properly allocated
228   from the TIFF point of view (corresponding strips/tiles will have valid offsets
229   and byte counts), but will have no corresponding physical storage. Provided that
230   the filesystem supports such sparse files, which is the case for most Linux
231   popular filesystems (ext2/3/4, xfs, btfs, ...) or NTFS on Windows. If the file
232   system does not support sparse files, physical storage will be
233   allocated and filled with zeros.
234
235Bindings
236--------
237
238The Python bindings has a mapping of GDALGetDataCoverageStatus(). Other
239bindings could be updated (need to figure out how to return both a
240status flag and a percentage)
241
242Utilities
243---------
244
245No direct changes in utilities.
246
247Results
248-------
249
250With this new capability, a VRT of size 200 000 x 200 000 pixels that
251contains 2 regions of 20x20 pixels each can be gdal_translated as a
252sparse tiled GeoTIFF in 2 seconds. The resulting GeoTIFF can be itself
253translated into another sparse tiled GeoTIFF in the same time.
254
255Future work
256-----------
257
258Future work using the new capability could be done in overview building
259or warping. Other drivers could also benefit from that new capability:
260GeoPackage, ERDAS Imagine, ...
261
262Documentation
263-------------
264
265The new method is documented.
266
267Test Suite
268----------
269
270Tests of the VRT and GeoTIFF drivers are enhanced to test their
271IGetDataCoverageStatus() implementation.
272
273Compatibility Issues
274--------------------
275
276C++ ABI change. No functional incompatibility foreseen.
277
278Implementation
279--------------
280
281The implementation will be done by Even Rouault.
282
283The proposed implementation is in
284`https://github.com/rouault/gdal2/tree/sparse_datasets <https://github.com/rouault/gdal2/tree/sparse_datasets>`__
285
286Changes can be seen with
287`https://github.com/OSGeo/gdal/compare/trunk...rouault:sparse_datasets?expand=1 <https://github.com/OSGeo/gdal/compare/trunk...rouault:sparse_datasets?expand=1>`__
288
289Voting history
290--------------
291
292+1 from EvenR and DanielM
293