.. _rfc-76:

================================================================================
RFC 76: OGR Python drivers
================================================================================

============== ============================
Author:        Even Rouault
Contact:       even.rouault @ spatialys.com
Started:       2019-Nov-5
Last updated:  2019-Nov-15
Status:        Adopted, implemented in GDAL 3.1
============== ============================

Summary
-------

This RFC adds the capability to write OGR/vector drivers in Python.

Motivation
----------

For some use cases that do not require lightning speed, or to deal with very
niche formats (possibly in-house formats), it might be faster and more efficient
to write a vector driver in Python rather than a GDAL C++ driver, as currently
required, or an ad-hoc converter.

.. note::

    QGIS now has a way to create Python-based providers, such as
    in https://github.com/qgis/QGIS/blob/master/tests/src/python/provider_python.py
    Having a way to do this in GDAL itself also allows the rest of the
    GDAL/OGR-based tools to use the OGR Python driver.

How does that work?
-------------------

Driver registration
+++++++++++++++++++

The driver registration mechanism is extended to look for .py scripts in a
dedicated directory:

* the directory pointed to by the ``GDAL_PYTHON_DRIVER_PATH`` configuration option
  (there may be several paths, separated by ``:`` on Unix or ``;`` on Windows)
* if not defined, the directory pointed to by the ``GDAL_DRIVER_PATH`` configuration
  option
* if not defined, the directory (hardcoded at compilation time on Unix builds)
  where native plugins are located
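
The lookup order above can be sketched in Python as follows. This is only an
illustration of the priority rules, not GDAL's C++ code: the function name
``python_driver_search_dirs``, the ``config`` dictionary standing in for GDAL
configuration options, and the ``builtin_plugin_dir`` argument are assumptions
made for the example, as is splitting ``GDAL_DRIVER_PATH`` on the path separator.

.. code-block:: python

    import os


    def python_driver_search_dirs(config, builtin_plugin_dir=None):
        """Return the directories to scan for .py drivers, in priority order.

        ``config`` is a plain dict standing in for GDAL configuration options
        (hypothetical helper, not a GDAL API).
        """
        # First choice: GDAL_PYTHON_DRIVER_PATH; second: GDAL_DRIVER_PATH.
        value = config.get("GDAL_PYTHON_DRIVER_PATH") or config.get("GDAL_DRIVER_PATH")
        if value:
            # os.pathsep is ':' on Unix and ';' on Windows.
            return [path for path in value.split(os.pathsep) if path]
        # Last resort: the directory where native plugins are located.
        return [builtin_plugin_dir] if builtin_plugin_dir else []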

Those Python scripts must set at least the following directives in their first lines:

- ``# gdal: DRIVER_NAME = "short_name"``
- ``# gdal: DRIVER_SUPPORTED_API_VERSION = 1``. Currently only 1 is supported. If the
  interface changes in a backward-incompatible way, we will increment the
  supported API version number internally. This item enables us to check whether
  we are able to "safely" load a Python driver. If a Python driver supports several
  API versions (it is not clear whether that is really possible at this point), it
  might use an array syntax to indicate that, like ``[1,2]``
- ``# gdal: DRIVER_DCAP_VECTOR = "YES"``
- ``# gdal: DRIVER_DMD_LONGNAME = "my super plugin"``

Optional metadata such as ``# gdal: DRIVER_DMD_EXTENSIONS`` or
``# gdal: DRIVER_DMD_HELPTOPIC`` can be defined (basically, any driver metadata
key string prefixed by ``# gdal: DRIVER_``).

These directives are parsed in a purely textual way, without invoking the Python
interpreter, both for efficiency and because we want to
delay the search for, or launch of, the Python interpreter as much as possible
(the typical use case is GDAL used by QGIS: we want to make sure that QGIS
has itself started Python, so that we can reuse that Python interpreter).
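
A minimal sketch of such a textual scan is shown below. It is written in Python
for readability, whereas GDAL performs this step in C++ before any interpreter
is available. The names ``parse_driver_directives``, ``is_loadable`` and the
``max_lines`` cut-off are hypothetical, and treating the four directives listed
above as required is an assumption of this sketch. ``ast.literal_eval`` only
accepts literal values, which mirrors the "no expressions" constraint.

.. code-block:: python

    import ast
    import re

    # Matches lines such as:  # gdal: DRIVER_NAME = "DUMMY"
    _DIRECTIVE = re.compile(r"^# gdal: DRIVER_(\w+)\s*=\s*(.+?)\s*$")

    SUPPORTED_API_VERSION = 1  # the only version currently supported
    REQUIRED_KEYS = ("NAME", "SUPPORTED_API_VERSION", "DCAP_VECTOR", "DMD_LONGNAME")


    def parse_driver_directives(text, max_lines=50):
        """Collect ``# gdal: DRIVER_xxx = value`` directives from the first lines.

        ``max_lines`` is a hypothetical cut-off chosen for this sketch.
        """
        metadata = {}
        for line in text.splitlines()[:max_lines]:
            match = _DIRECTIVE.match(line)
            if match:
                key, raw_value = match.groups()
                # Only literal values are accepted, so no code is evaluated here.
                metadata[key] = ast.literal_eval(raw_value)
        return metadata


    def is_loadable(metadata):
        """Check the listed directives and the declared API version(s)."""
        if any(key not in metadata for key in REQUIRED_KEYS):
            return False
        versions = metadata["SUPPORTED_API_VERSION"]
        if isinstance(versions, int):
            versions = [versions]  # scalar syntax: DRIVER_SUPPORTED_API_VERSION = 1
        return SUPPORTED_API_VERSION in versions

Directives that are commented out a second time (``# # gdal: ...``) do not match
the pattern and are therefore ignored.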

From this short metadata, the driver registration code can instantiate GDALDriver
C++ objects. When the Identify() or Open() method is invoked on such an object,
the C++ code will:

* if not already done, find Python symbols, or start Python (see the paragraph
  below for more details)
* if not already done, load the .py file as a Python module
* if not already done, create an instance of the Python class of the module
  deriving from ``gdal_python_driver.BaseDriver``
* call the ``identify`` or ``open`` method, depending on the originating API call.

The ``open`` method returns a Python ``BaseDataset`` object with required and
optional methods that will be invoked by the corresponding GDAL API calls. The
same applies to the ``BaseLayer`` object. See the example_.
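
The deferred loading described above can be modelled in pure Python as follows.
This is a sketch under stated assumptions, not the actual C++ shim: the
``LazyPythonDriver`` class, the ``lazy_gdal_driver`` module name and the
``base_class`` parameter (standing in for ``gdal_python_driver.BaseDriver``)
are all hypothetical.

.. code-block:: python

    import importlib.util
    import inspect


    class LazyPythonDriver:
        """Defer loading of a driver script until identify() or open() is called."""

        def __init__(self, path, base_class):
            self.path = path
            self.base_class = base_class
            self._instance = None  # nothing is imported at registration time

        def _load(self):
            if self._instance is None:
                # Load the .py file as a Python module.
                spec = importlib.util.spec_from_file_location("lazy_gdal_driver", self.path)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
                # Instantiate the class of the module deriving from the base class.
                for _, obj in inspect.getmembers(module, inspect.isclass):
                    if issubclass(obj, self.base_class) and obj is not self.base_class:
                        self._instance = obj()
                        break
                else:
                    raise ImportError("no driver class found in " + self.path)
            return self._instance

        def identify(self, *args):
            return self._load().identify(*args)

        def open(self, *args):
            return self._load().open(*args)

The module is imported and the driver class instantiated only once, on the
first ``identify`` or ``open`` call; later calls reuse the same instance.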

Connection with the Python interpreter
++++++++++++++++++++++++++++++++++++++

The logic is shared with the functionality for VRT pixel functions written in Python.
It relies on runtime linking to the Python symbols already available in the process (for
example the python executable or a binary embedding Python and using GDAL, such
as QGIS), or on loading the Python library if no Python symbols are found,
rather than on compile-time linking.
The reason is that we do not know in advance which Python version GDAL may
be used with, and we do not want gdal.so/gdal.dll to be explicitly linked
against a particular Python library.

This is both embedding and extending Python.

The steps are:

1. through dlopen() + dlsym() on Unix and EnumProcessModules() + GetProcAddress()
   on Windows, look for Python symbols. If found, use them. This is for example
   the case if GDAL is used from a Python module (GDAL Python bindings, rasterio, etc.)
   or an application like QGIS that starts a Python interpreter.
2. otherwise, look for the PYTHONSO environment variable, which should point to
   a pythonX.Y[...].so/.dll
3. otherwise, look for the python binary in the path and try to identify the
   corresponding Python .so/.dll
4. otherwise, try to load well-known names of the Python .so/.dll with
   dlopen()/LoadLibrary()
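
Step 1 has a close analogue in Python's own ``ctypes`` module, which can help
to picture what "runtime linking to symbols already in the process" means. The
sketch below is not GDAL code (GDAL does the equivalent from C++ with
dlsym()/GetProcAddress()); it assumes a CPython interpreter, and the function
name ``in_process_python_version`` is hypothetical.

.. code-block:: python

    import ctypes


    def in_process_python_version():
        """Resolve Py_GetVersion() from the running process and call it."""
        # ctypes.pythonapi exposes the Python C API symbols of the current
        # process, resolved at runtime rather than at link time.
        get_version = ctypes.pythonapi.Py_GetVersion
        get_version.restype = ctypes.c_char_p
        get_version.argtypes = []
        return get_version().decode("utf-8", errors="replace")

No Python library is named anywhere in this code: the symbol is simply looked
up in whatever interpreter is already loaded, which is the situation of step 1.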

Impacts on GDAL core
--------------------

They are minimal. The GDALAllRegister() function has an added call to
GDALDriverManager::AutoLoadPythonDrivers(), which implements the above-mentioned
logic. The GDALDriver class has been extended to support a new function
pointer, IdentifyEx(), which is used by the C++ shim that loads the Python code.

.. code-block:: c++

    int                 (*pfnIdentifyEx)( GDALDriver*, GDALOpenInfo * );

This extended IdentifyEx() function pointer, which adds the GDALDriver* argument,
is used in priority by the GDALIdentify() and GDALOpen() functions. The reason for
this is mundane. For normal C++ drivers, there is no need to pass the driver,
as there is a one-to-one correspondence between a driver and the function that
implements it. But for Python drivers, a single C++ function
interfaces with the Python ``identify`` method of several Python drivers,
hence the need for a GDALDriver* argument to forward the call to the appropriate
driver.
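
The dispatch problem can be illustrated with a small Python model. It is a
conceptual sketch only: ``DriverModel``, ``identify_with`` and
``python_shim_identify`` are hypothetical names, and a plain filename string
stands in for ``GDALOpenInfo``.

.. code-block:: python

    class DriverModel:
        """Toy stand-in for GDALDriver with the two identify callbacks."""

        def __init__(self, name, identify=None, identify_ex=None):
            self.name = name
            self.identify = identify        # models pfnIdentify(open_info)
            self.identify_ex = identify_ex  # models pfnIdentifyEx(driver, open_info)


    def identify_with(driver, open_info):
        """Use the extended callback in priority, as described above."""
        if driver.identify_ex is not None:
            return driver.identify_ex(driver, open_info)
        if driver.identify is not None:
            return driver.identify(open_info)
        return False


    # One shared shim serves every Python driver: without the ``driver``
    # argument it could not tell which script to forward the call to.
    PYTHON_IDENTIFY = {
        "DUMMY": lambda filename: filename == "DUMMY:",
        "OTHER": lambda filename: filename.endswith(".other"),
    }


    def python_shim_identify(driver, open_info):
        return PYTHON_IDENTIFY[driver.name](open_info)


    dummy = DriverModel("DUMMY", identify_ex=python_shim_identify)
    other = DriverModel("OTHER", identify_ex=python_shim_identify)
    native = DriverModel("NATIVE", identify=lambda filename: filename.endswith(".nat"))

Both ``dummy`` and ``other`` share the same callback, whereas ``native`` owns
its function and needs no driver argument.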

.. _example:

Example of such a driver
------------------------

Note that prefixing the connection string with the driver name is absolutely
not a requirement, but something specific to this particular driver, which is a
bit artificial. The CityJSON driver mentioned below does not need it.

.. code-block:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # This code is in the public domain, so as to serve as a template for
    # real-world plugins.
    # or, at the choice of the licensee,
    # Copyright 2019 Even Rouault
    # SPDX-License-Identifier: MIT

    # Metadata parsed by GDAL C++ code at driver pre-loading, starting with '# gdal: '
    # Required and with that exact syntax since it is parsed by non-Python
    # aware code. So just literal values, no expressions, etc.
    # gdal: DRIVER_NAME = "DUMMY"
    # API version(s) supported. Must include 1 currently
    # gdal: DRIVER_SUPPORTED_API_VERSION = [1]
    # gdal: DRIVER_DCAP_VECTOR = "YES"
    # gdal: DRIVER_DMD_LONGNAME = "my super plugin"

    # Optional driver metadata items.
    # # gdal: DRIVER_DMD_EXTENSIONS = "ext1 ext2"
    # # gdal: DRIVER_DMD_HELPTOPIC = "http://example.com/my_help.html"

    # The gdal_python_driver module is defined by the GDAL library at runtime
    from gdal_python_driver import BaseDriver, BaseDataset, BaseLayer

    class Layer(BaseLayer):
        def __init__(self):

            # Reserved attribute names. Either those or the corresponding method
            # must be defined
            self.name = 'my_layer'  # Required, or name() method

            self.fid_name = 'my_fid'  # Optional

            self.fields = [{'name': 'boolField', 'type': 'Boolean'},
                           {'name': 'int16Field', 'type': 'Integer16'},
                           {'name': 'int32Field', 'type': 'Integer'},
                           {'name': 'int64Field', 'type': 'Integer64'},
                           {'name': 'realField', 'type': 'Real'},
                           {'name': 'floatField', 'type': 'Float'},
                           {'name': 'strField', 'type': 'String'},
                           {'name': 'strNullField', 'type': 'String'},
                           {'name': 'strUnsetField', 'type': 'String'},
                           {'name': 'binaryField', 'type': 'Binary'},
                           {'name': 'timeField', 'type': 'Time'},
                           {'name': 'dateField', 'type': 'Date'},
                           {'name': 'datetimeField', 'type': 'DateTime'}]  # Required, or fields() method

            self.geometry_fields = [{'name': 'geomField',
                                     'type': 'Point',  # optional
                                     'srs': 'EPSG:4326'  # optional
                                     }]  # Required, or geometry_fields() method

            self.metadata = {'foo': 'bar'}  # optional

            # uncomment if __iter__() honours self.attribute_filter
            #self.iterator_honour_attribute_filter = True

            # uncomment if __iter__() honours self.spatial_filter
            #self.iterator_honour_spatial_filter = True

            # uncomment if feature_count() honours self.attribute_filter
            #self.feature_count_honour_attribute_filter = True

            # uncomment if feature_count() honours self.spatial_filter
            #self.feature_count_honour_spatial_filter = True

            # End of reserved attribute names

            self.count = 5

        # Required, unless self.name attribute is defined
        # def name(self):
        #    return 'my_layer'

        # Optional. If not defined, fid name is 'fid'
        # def fid_name(self):
        #    return 'my_fid'

        # Required, unless self.geometry_fields attribute is defined
        # def geometry_fields(self):
        #    return [...]

        # Required, unless self.fields attribute is defined
        # def fields(self):
        #    return [...]

        # Optional. Only to be used if self.metadata field is not defined
        # def metadata(self, domain):
        #    if domain is None:
        #        return {'foo': 'bar'}
        #    return None

        # Optional. Called when self.attribute_filter is changed by GDAL
        # def attribute_filter_changed(self):
        #     # You may change self.iterator_honour_attribute_filter
        #     # or feature_count_honour_attribute_filter
        #     pass

        # Optional. Called when self.spatial_filter is changed by GDAL
        # def spatial_filter_changed(self):
        #     # You may change self.iterator_honour_spatial_filter
        #     # or feature_count_honour_spatial_filter
        #     pass

        # Optional
        def test_capability(self, cap):
            if cap == BaseLayer.FastGetExtent:
                return True
            if cap == BaseLayer.StringsAsUTF8:
                return True
            # if cap == BaseLayer.FastSpatialFilter:
            #    return False
            # if cap == BaseLayer.RandomRead:
            #    return False
            if cap == BaseLayer.FastFeatureCount:
                return self.attribute_filter is None and self.spatial_filter is None
            return False

        # Optional
        def extent(self, force_computation):
            return [2.1, 49, 3, 50]  # minx, miny, maxx, maxy

        # Optional.
        def feature_count(self, force_computation):
            # As we did not declare feature_count_honour_attribute_filter and
            # feature_count_honour_spatial_filter, the case below cannot happen.
            # But this illustrates that you can call back the default
            # implementation if needed
            # if self.attribute_filter is not None or \
            #   self.spatial_filter is not None:
            #    return super(Layer, self).feature_count(force_computation)

            return self.count

        # Required. You do not need to handle the case of simultaneous iterators on
        # the same Layer object.
        def __iter__(self):
            for i in range(self.count):
                properties = {
                    'boolField': True,
                    'int16Field': 32767,
                    'int32Field': i + 2,
                    'int64Field': 1234567890123,
                    'realField': 1.23,
                    'floatField': 1.2,
                    'strField': 'foo',
                    'strNullField': None,
                    'binaryField': b'\x01\x00\x02',
                    'timeField': '12:34:56.789',
                    'dateField': '2017-04-26',
                    'datetimeField': '2017-04-26T12:34:56.789Z'}

                yield {"type": "OGRFeature",
                       "id": i + 1,
                       "fields": properties,
                       "geometry_fields": {"geomField": "POINT(2 49)"},
                       "style": "SYMBOL(a:0)" if i % 2 == 0 else None,
                       }

        # Optional
        # def feature_by_id(self, fid):
        #    return {}


    class Dataset(BaseDataset):

        # Optional, but implementations will generally need it
        def __init__(self, filename):
            # If the layers member is set, layer_count() and layer() will not be used
            self.layers = [Layer()]
            self.metadata = {'foo': 'bar'}

        # Optional, called on native object destruction
        def __del__(self):
            pass

        # Optional. Only to be used if self.metadata field is not defined
        # def metadata(self, domain):
        #    if domain is None:
        #        return {'foo': 'bar'}
        #    return None

        # Required, unless a layers attribute is set in __init__
        # def layer_count(self):
        #    return len(self.layers)

        # Required, unless a layers attribute is set in __init__
        # def layer(self, idx):
        #    return self.layers[idx]


    # Required: class deriving from BaseDriver
    class Driver(BaseDriver):

        # Optional. Called the first time the driver is loaded
        def __init__(self):
            pass

        # Required
        def identify(self, filename, first_bytes, open_flags, open_options={}):
            return filename == 'DUMMY:'

        # Required
        def open(self, filename, first_bytes, open_flags, open_options={}):
            if not self.identify(filename, first_bytes, open_flags):
                return None
            return Dataset(filename)
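
Once such a script is saved in a driver directory, the dataset can be read
through the regular GDAL/OGR API. The following usage sketch assumes that the
GDAL Python bindings (GDAL 3.1 or later) are installed and that the script is
stored in ``driver_dir``, for instance as ``ogr_DUMMY.py`` (this file name
follows the naming of the example scripts linked below and is an assumption
here). The helper name ``list_dummy_features`` is hypothetical.

.. code-block:: python

    def list_dummy_features(driver_dir):
        """Open the DUMMY: dataset through the OGR API and list its features.

        Assumes the GDAL Python bindings are installed and that ``driver_dir``
        contains the example driver script.
        """
        from osgeo import gdal  # imported lazily: requires the GDAL bindings

        gdal.SetConfigOption("GDAL_PYTHON_DRIVER_PATH", driver_dir)
        gdal.AllRegister()  # registration scans the directory for .py drivers

        ds = gdal.OpenEx("DUMMY:", gdal.OF_VECTOR)
        if ds is None:
            raise RuntimeError("the DUMMY driver could not open the dataset")
        layer = ds.GetLayer(0)
        return [(feature.GetFID(), feature.GetField("strField")) for feature in layer]

The calling code uses only the standard API; it does not need to know that the
driver behind ``DUMMY:`` is written in Python.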


Other examples:

* a PASSTHROUGH driver that forwards calls to the GDAL SWIG Python API:
  https://github.com/OSGeo/gdal/blob/master/gdal/examples/pydrivers/ogr_PASSTHROUGH.py
* a driver implementing simple parsing of `CityJSON <https://www.cityjson.org/>`_:
  https://github.com/OSGeo/gdal/blob/master/gdal/examples/pydrivers/ogr_CityJSON.py

Limitations and scope
---------------------

- Vector and read-only for now. This could of course be extended later.

- No connection between the Python code of the plugin and the OGR Python API
  that is built on top of SWIG. This does not appear to be doable in a
  reasonable way. Nothing prevents people from using the GDAL/OGR/OSR Python
  API, but the objects exchanged between the OGR core and the Python code will
  not be OGR Python SWIG objects. A
  typical example is that a plugin will return its CRS as a string (WKT, PROJJSON,
  or deprecated PROJ.4 string), but not as an osgeo.osr.SpatialReference object.
  It is however possible to use the osgeo.osr.SpatialReference API to generate this
  WKT string.

- This RFC does not try to cover the management of Python dependencies. It is
  up to users to do the needed "pip install" or use whatever Python package
  management solution they prefer.

- The Python "Global Interpreter Lock" is held in the Python drivers, as required
  for safe use of Python. Consequently, scaling of such drivers is limited.

- Given the above restrictions, this will remain an "experimental" feature
  and the GDAL project will not accept such Python drivers for inclusion in
  the GDAL repository. This is similar to the situation of the QGIS project,
  which allows Python plugins outside of the main QGIS repository. If a QGIS plugin
  is to be moved into the main repository, it has to be converted to C++.
  The rationale for this is that the correctness of Python code can mostly only be
  checked at runtime, whereas C++ benefits from static analysis (at compile time,
  and from other checkers). In the context of GDAL, this rationale also applies. GDAL
  drivers are also stress-tested by the OSS-Fuzz infrastructure, and that requires
  them to be written in C++.

- The interface between the C++ and Python code might break between GDAL feature
  releases. In that case we will increment the expected API version number to
  avoid loading incompatible Python drivers. We will likely not make any effort
  to deal with plugins written for an incompatible (previous) API version.


SWIG binding changes
--------------------

None

Security implications
---------------------

Similar to the existing native code plugin mechanism of GDAL. If the user
defines the GDAL_PYTHON_DRIVER_PATH or GDAL_DRIVER_PATH environment variable
and puts .py scripts in the corresponding directories (or in
{prefix}/lib/gdalplugins/python as a fallback), they will be executed.

However, opening a .py file with GDALOpen() or similar mechanisms will not
lead to its execution, so this is safe for normal GDAL usage.

The GDAL_NO_AUTOLOAD compile-time #define, already used to disable loading
of native plugins, is also honoured to disable the loading of Python plugins.

Performance impact
------------------

If no .py script exists in the searched locations, the performance impact on
GDALAllRegister() should be within the noise.

Backward compatibility
----------------------

No backward incompatibility. Only functionality addition.

Documentation
-------------

A tutorial will be added to explain how to write such a Python driver:
https://github.com/rouault/gdal/blob/pythondrivers/gdal/doc/source/tutorials/vector_python_driver.rst

Testing
-------

The gdalautotest suite will be extended with the above test Python driver, and
a few error cases:
https://github.com/rouault/gdal/blob/pythondrivers/autotest/ogr/ogr_pythondrivers.py

Previous discussions
--------------------

This topic has been discussed in the past in:

- https://lists.osgeo.org/pipermail/gdal-dev/2017-April/thread.html#46526
- https://lists.osgeo.org/pipermail/gdal-dev/2018-November/thread.html#49294

Implementation
--------------

A candidate implementation is available at
https://github.com/rouault/gdal/tree/pythondrivers

https://github.com/OSGeo/gdal/compare/master...rouault:pythondrivers

Voting history
--------------

* +1 from EvenR, JukkaR, MateuzL, DanielM
* -0 from SeanG
* +0 from HowardB

Credits
-------

Sponsored by OpenGeoGroep