.. _rfc-76:

================================================================================
RFC 76: OGR Python drivers
================================================================================

============== ============================
Author:        Even Rouault
Contact:       even.rouault @ spatialys.com
Started:       2019-Nov-5
Last updated:  2019-Nov-15
Status:        Adopted, implemented in GDAL 3.1
============== ============================

Summary
-------

This RFC adds the capability to write OGR/vector drivers in Python.

Motivation
----------

For some use cases that do not require lightning speed, or to deal with very
niche formats (possibly in-house formats), it might be faster and more efficient
to write a vector driver in Python rather than a GDAL C++ driver, as currently
required, or an ad-hoc converter.

.. note::

    QGIS now has a way to create Python-based providers, such as
    in https://github.com/qgis/QGIS/blob/master/tests/src/python/provider_python.py
    Having a way to do this in GDAL itself also allows the rest of the
    GDAL/OGR-based tools to use the OGR Python driver.

How does that work?
--------------------

Driver registration
+++++++++++++++++++

The driver registration mechanism is extended to look for .py scripts in a
dedicated directory:

* the directory pointed to by the ``GDAL_PYTHON_DRIVER_PATH`` configuration option
  (there may be several paths, separated by ``:`` on Unix or ``;`` on Windows)
* if not defined, the directory pointed to by the ``GDAL_DRIVER_PATH`` configuration
  option.
* if not defined, the directory (hardcoded at compilation time on Unix builds)
  where native plugins are located.

These Python scripts must set the following directives in their first lines:

- ``# gdal: DRIVER_NAME = "short_name"``
- ``# gdal: DRIVER_SUPPORTED_API_VERSION = 1``. Currently only 1 is supported.
  If the interface changes in a backward-incompatible way, we will increment
  the supported API version number internally. This item enables us to check
  whether we are able to "safely" load a Python driver. If a Python driver
  supports several API versions (it is not clear whether that is really
  possible at this point), it may use an array syntax to indicate that,
  like ``[1,2]``
- ``# gdal: DRIVER_DCAP_VECTOR = "YES"``
- ``# gdal: DRIVER_DMD_LONGNAME = "my super plugin"``

Optional metadata such as ``# gdal: DRIVER_DMD_EXTENSIONS`` or
``# gdal: DRIVER_DMD_HELPTOPIC`` can be defined (basically, any driver metadata
key string prefixed by ``# gdal: DRIVER_``).

These directives will be parsed in a purely textual way, without invoking the
Python interpreter, both for efficiency and because we want to delay the search
for, or launch of, the Python interpreter as much as possible (the typical use
case is GDAL used by QGIS: we want to make sure that QGIS has itself started
Python, so that this Python interpreter is reused).

From this short metadata, the driver registration code can instantiate GDALDriver
C++ objects. When the Identify() or Open() method is invoked on such an object,
the C++ code will:

* if not already done, find Python symbols, or start Python (see the section
  below for more details)
* if not already done, load the .py file as a Python module
* if not already done, create an instance of the Python class of the module
  deriving from ``gdal_python_driver.BaseDriver``
* call the ``identify`` or ``open`` method, depending on the originating API call.

The ``open`` method will return a Python ``BaseDataset`` object with required and
optional methods that will be invoked by the corresponding GDAL API calls. The
same goes for the ``BaseLayer`` object. See the example_.
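
The purely textual pre-parsing of the ``# gdal: DRIVER_`` directives can be
illustrated with a short Python sketch. This is only an illustration: the actual
parsing is done by C++ code inside GDAL, and the ``parse_driver_directives()``
and ``is_loadable()`` helpers, the ``REQUIRED_KEYS`` tuple, the ``max_lines``
limit and the use of ``ast.literal_eval()`` to decode literal values are
assumptions made for this example, not part of GDAL.

.. code-block:: python

    import ast
    import re

    # Matches lines such as:  # gdal: DRIVER_NAME = "DUMMY"
    # A commented-out directive ("# # gdal: ...") does not match.
    _DIRECTIVE_RE = re.compile(r'^#\s*gdal:\s*DRIVER_(\w+)\s*=\s*(.+?)\s*$')

    # Hypothetical list of required items, taken from the bullet list above.
    REQUIRED_KEYS = ("NAME", "SUPPORTED_API_VERSION", "DCAP_VECTOR", "DMD_LONGNAME")


    def parse_driver_directives(source_text, max_lines=1000):
        """Collect DRIVER_xxx directives without executing the script."""
        directives = {}
        for line in source_text.splitlines()[:max_lines]:
            match = _DIRECTIVE_RE.match(line.strip())
            if match is None:
                continue
            key, raw_value = match.groups()
            try:
                # Only literal values are accepted, never expressions.
                directives[key] = ast.literal_eval(raw_value)
            except (ValueError, SyntaxError):
                continue
        return directives


    def is_loadable(directives, supported_api_version=1):
        """Check required items and the declared API version(s)."""
        if any(key not in directives for key in REQUIRED_KEYS):
            return False
        versions = directives["SUPPORTED_API_VERSION"]
        if not isinstance(versions, list):
            versions = [versions]  # accept both `1` and `[1]`
        return supported_api_version in versions

Applied to the header of the example_ driver, such a sketch would collect the
four required items and skip the commented-out ``# # gdal:`` lines, all without
importing the script.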

Connection with the Python interpreter
++++++++++++++++++++++++++++++++++++++

The logic is shared with the "VRT pixel functions written in Python" functionality.
It relies on runtime linking to the Python symbols already available in the process (for
example the python executable, or a binary embedding Python and using GDAL, such
as QGIS), or on loading the Python library in case no Python symbols are found,
rather than on compile-time linking.
The reason is that we do not know in advance with which Python version GDAL can
potentially be linked, and we do not want gdal.so/gdal.dll to be explicitly linked
with a particular Python library.

This is both embedding and extending Python.

The steps are:

1. through dlopen() + dlsym() on Unix and EnumProcessModules() + GetProcAddress()
   on Windows, look for Python symbols. If found, use them. This is for example
   the case if GDAL is used from a Python module (GDAL Python bindings, rasterio, etc.)
   or an application like QGIS that starts a Python interpreter.
2. otherwise, look for the PYTHONSO environment variable, which should point to
   a pythonX.Y[...].so/.dll
3. otherwise, look for the python binary in the path and try to identify the
   corresponding Python .so/.dll
4. otherwise, try to load with dlopen()/LoadLibrary() well-known names of
   Python .so/.dll

Impacts on GDAL core
--------------------

They are minimal. The GDALAllRegister() method has an added call to
GDALDriverManager::AutoLoadPythonDrivers(), which implements the above-mentioned
logic. The GDALDriver class has been extended to support a new function
pointer, IdentifyEx(), which is used by the C++ shim that loads the Python code.

.. code-block:: c++

    int (*pfnIdentifyEx)( GDALDriver*, GDALOpenInfo * );

This extended IdentifyEx() function pointer, which adds the GDALDriver* argument,
is used in priority by the GDALIdentify() and GDALOpen() methods.
The reason for the extra argument is mundane. For normal C++ drivers, there is
no need to pass the driver, as there is a one-to-one correspondence between a
driver and the function that implements it. But for the Python drivers, there
is a single C++ method that interfaces with the Python ``identify`` method of
several Python drivers, hence the need for a GDALDriver* argument to forward
the call to the appropriate driver.

.. _example:

Example of such a driver
------------------------

Note that prefixing the connection string with the driver name is absolutely
not a requirement, but something specific to this particular driver, which is a
bit artificial. The CityJSON driver mentioned below does not need it.

.. code-block:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # This code is in the public domain, so as to serve as a template for
    # real-world plugins.
    # or, at the choice of the licensee,
    # Copyright 2019 Even Rouault
    # SPDX-License-Identifier: MIT

    # Metadata parsed by GDAL C++ code at driver pre-loading, starting with '# gdal: '
    # Required and with that exact syntax since it is parsed by non-Python
    # aware code. So just literal values, no expressions, etc.
    # gdal: DRIVER_NAME = "DUMMY"
    # API version(s) supported. Must include 1 currently
    # gdal: DRIVER_SUPPORTED_API_VERSION = [1]
    # gdal: DRIVER_DCAP_VECTOR = "YES"
    # gdal: DRIVER_DMD_LONGNAME = "my super plugin"

    # Optional driver metadata items.
    # # gdal: DRIVER_DMD_EXTENSIONS = "ext1 ext2"
    # # gdal: DRIVER_DMD_HELPTOPIC = "http://example.com/my_help.html"

    # The gdal_python_driver module is defined by the GDAL library at runtime
    from gdal_python_driver import BaseDriver, BaseDataset, BaseLayer

    class Layer(BaseLayer):
        def __init__(self):

            # Reserved attribute names. Either those or the corresponding method
            # must be defined
            self.name = 'my_layer'  # Required, or name() method

            self.fid_name = 'my_fid'  # Optional

            self.fields = [{'name': 'boolField', 'type': 'Boolean'},
                           {'name': 'int16Field', 'type': 'Integer16'},
                           {'name': 'int32Field', 'type': 'Integer'},
                           {'name': 'int64Field', 'type': 'Integer64'},
                           {'name': 'realField', 'type': 'Real'},
                           {'name': 'floatField', 'type': 'Float'},
                           {'name': 'strField', 'type': 'String'},
                           {'name': 'strNullField', 'type': 'String'},
                           {'name': 'strUnsetField', 'type': 'String'},
                           {'name': 'binaryField', 'type': 'Binary'},
                           {'name': 'timeField', 'type': 'Time'},
                           {'name': 'dateField', 'type': 'Date'},
                           {'name': 'datetimeField', 'type': 'DateTime'}]  # Required, or fields() method

            self.geometry_fields = [{'name': 'geomField',
                                     'type': 'Point',  # optional
                                     'srs': 'EPSG:4326'  # optional
                                     }]  # Required, or geometry_fields() method

            self.metadata = {'foo': 'bar'}  # optional

            # uncomment if __iter__() honours self.attribute_filter
            #self.iterator_honour_attribute_filter = True

            # uncomment if __iter__() honours self.spatial_filter
            #self.iterator_honour_spatial_filter = True

            # uncomment if feature_count() honours self.attribute_filter
            #self.feature_count_honour_attribute_filter = True

            # uncomment if feature_count() honours self.spatial_filter
            #self.feature_count_honour_spatial_filter = True

            # End of reserved attribute names

            self.count = 5

        # Required, unless self.name attribute is defined
        # def name(self):
        #    return 'my_layer'

        # Optional. If not defined, fid name is 'fid'
        # def fid_name(self):
        #    return 'my_fid'

        # Required, unless self.geometry_fields attribute is defined
        # def geometry_fields(self):
        #    return [...]

        # Required, unless self.fields attribute is defined
        # def fields(self):
        #    return [...]

        # Optional. Only to be used if self.metadata field is not defined
        # def metadata(self, domain):
        #    if domain is None:
        #        return {'foo': 'bar'}
        #    return None

        # Optional. Called when self.attribute_filter is changed by GDAL
        # def attribute_filter_changed(self):
        #     # You may change self.iterator_honour_attribute_filter
        #     # or feature_count_honour_attribute_filter
        #     pass

        # Optional. Called when self.spatial_filter is changed by GDAL
        # def spatial_filter_changed(self):
        #     # You may change self.iterator_honour_spatial_filter
        #     # or feature_count_honour_spatial_filter
        #     pass

        # Optional
        def test_capability(self, cap):
            if cap == BaseLayer.FastGetExtent:
                return True
            if cap == BaseLayer.StringsAsUTF8:
                return True
            # if cap == BaseLayer.FastSpatialFilter:
            #    return False
            # if cap == BaseLayer.RandomRead:
            #    return False
            if cap == BaseLayer.FastFeatureCount:
                return self.attribute_filter is None and self.spatial_filter is None
            return False

        # Optional
        def extent(self, force_computation):
            return [2.1, 49, 3, 50]  # minx, miny, maxx, maxy

        # Optional.
        def feature_count(self, force_computation):
            # As we did not declare feature_count_honour_attribute_filter and
            # feature_count_honour_spatial_filter, the below case cannot happen
            # But this is to illustrate that you can call back the default
            # implementation if needed
            # if self.attribute_filter is not None or \
            #   self.spatial_filter is not None:
            #    return super(Layer, self).feature_count(force_computation)

            return self.count

        # Required. You do not need to handle the case of simultaneous iterators on
        # the same Layer object.
        def __iter__(self):
            for i in range(self.count):
                properties = {
                    'boolField': True,
                    'int16Field': 32767,
                    'int32Field': i + 2,
                    'int64Field': 1234567890123,
                    'realField': 1.23,
                    'floatField': 1.2,
                    'strField': 'foo',
                    'strNullField': None,
                    'binaryField': b'\x01\x00\x02',
                    'timeField': '12:34:56.789',
                    'dateField': '2017-04-26',
                    'datetimeField': '2017-04-26T12:34:56.789Z'}

                yield {"type": "OGRFeature",
                       "id": i + 1,
                       "fields": properties,
                       "geometry_fields": {"geomField": "POINT(2 49)"},
                       "style": "SYMBOL(a:0)" if i % 2 == 0 else None,
                       }

        # Optional
        # def feature_by_id(self, fid):
        #    return {}


    class Dataset(BaseDataset):

        # Optional, but implementations will generally need it
        def __init__(self, filename):
            # If the layers member is set, layer_count() and layer() will not be used
            self.layers = [Layer()]
            self.metadata = {'foo': 'bar'}

        # Optional, called on native object destruction
        def __del__(self):
            pass

        # Optional. Only to be used if self.metadata field is not defined
        # def metadata(self, domain):
        #    if domain is None:
        #        return {'foo': 'bar'}
        #    return None

        # Required, unless a layers attribute is set in __init__
        # def layer_count(self):
        #    return len(self.layers)

        # Required, unless a layers attribute is set in __init__
        # def layer(self, idx):
        #    return self.layers[idx]


    # Required: class deriving from BaseDriver
    class Driver(BaseDriver):

        # Optional. Called the first time the driver is loaded
        def __init__(self):
            pass

        # Required
        def identify(self, filename, first_bytes, open_flags, open_options={}):
            return filename == 'DUMMY:'

        # Required
        def open(self, filename, first_bytes, open_flags, open_options={}):
            if not self.identify(filename, first_bytes, open_flags):
                return None
            return Dataset(filename)


Other examples:

* a PASSTHROUGH driver that forwards calls to the GDAL SWIG Python API:
  https://github.com/OSGeo/gdal/blob/master/gdal/examples/pydrivers/ogr_PASSTHROUGH.py
* a driver implementing a simple parser for `CityJSON <https://www.cityjson.org/>`_:
  https://github.com/OSGeo/gdal/blob/master/gdal/examples/pydrivers/ogr_CityJSON.py

Limitations and scope
---------------------

- Vector and read-only for now. This could later be extended, of course.

- No connection between the Python code of the plugin and the OGR Python API
  that is built on top of SWIG. This does not appear to be doable in a
  reasonable way. Nothing prevents people from using the GDAL/OGR/OSR Python
  API, but the objects exchanged between the OGR core and the Python code will
  not be OGR Python SWIG objects. A typical example is that a plugin will
  return its CRS as a string (WKT, PROJJSON, or deprecated PROJ.4 string), but
  not as an osgeo.osr.SpatialReference object. It is nevertheless possible to
  use the osgeo.osr.SpatialReference API to generate this WKT string.

- This RFC does not try to cover the management of Python dependencies. It is
  up to the user to do the needed "pip install", or to use whatever Python
  package management solution they prefer.

- The Python "Global Interpreter Lock" is held in the Python drivers, as required
  for safe use of Python. Consequently, the scaling of such drivers is limited.

- Given the above restrictions, this will remain an "experimental" feature,
  and the GDAL project will not accept such Python drivers for inclusion in
  the GDAL repository. This is similar to the situation of the QGIS project,
  which allows Python plugins outside of the main QGIS repository. If a QGIS plugin
  is to be moved into the main repository, it has to be converted to C++.
  The rationale for this is that the correctness of Python code can mostly be
  checked only at runtime, whereas C++ benefits from static analysis (at compile
  time, and through other checkers). In the context of GDAL, this rationale also
  applies. GDAL drivers are also stress-tested by the OSS-Fuzz infrastructure,
  and that requires them to be written in C++.

- The interface between the C++ and Python code might break between GDAL feature
  releases. In that case we will increment the expected API version number to
  avoid loading incompatible Python drivers. We will likely not make any effort
  to deal with plugins using an incompatible (previous) API version.


SWIG binding changes
--------------------

None

Security implications
---------------------

Similar to the existing native code plugin mechanism of GDAL. If the user
defines the GDAL_PYTHON_DRIVER_PATH or GDAL_DRIVER_PATH environment variable
and puts .py scripts in the corresponding directories (or in
{prefix}/lib/gdalplugins/python as a fallback), they will be executed.

However, opening a .py file with GDALOpen() or similar mechanisms will not
lead to its execution, so this is safe for normal GDAL usage.

The GDAL_NO_AUTOLOAD compile-time #define, already used to disable the loading
of native plugins, is also honoured to disable the loading of Python plugins.

Performance impact
------------------

If no .py script exists in the searched locations, the performance impact on
GDALAllRegister() should be within the noise.

Backward compatibility
----------------------

No backward incompatibility. Only added functionality.

Documentation
-------------

A tutorial will be added to explain how to write such a Python driver:
https://github.com/rouault/gdal/blob/pythondrivers/gdal/doc/source/tutorials/vector_python_driver.rst

Testing
-------

The gdalautotest suite will be extended with the above test Python driver, and
a few error cases:
https://github.com/rouault/gdal/blob/pythondrivers/autotest/ogr/ogr_pythondrivers.py

Previous discussions
--------------------

This topic has been discussed in the past in:

- https://lists.osgeo.org/pipermail/gdal-dev/2017-April/thread.html#46526
- https://lists.osgeo.org/pipermail/gdal-dev/2018-November/thread.html#49294

Implementation
--------------

A candidate implementation is available at
https://github.com/rouault/gdal/tree/pythondrivers

https://github.com/OSGeo/gdal/compare/master...rouault:pythondrivers

Voting history
--------------

* +1 from EvenR, JukkaR, MateuzL, DanielM
* -0 from SeanG
* +0 from HowardB

Credits
-------

Sponsored by OpenGeoGroep