1.. _NEP22:
2
3===========================================================
4NEP 22 — Duck typing for NumPy arrays – high level overview
5===========================================================
6
7:Author: Stephan Hoyer <shoyer@google.com>, Nathaniel J. Smith <njs@pobox.com>
8:Status: Final
9:Type: Informational
10:Created: 2018-03-22
11:Resolution: https://mail.python.org/pipermail/numpy-discussion/2018-September/078752.html
12
13Abstract
14--------
15
16We outline a high-level vision for how NumPy will approach handling
17“duck arrays”. This is an Informational-class NEP; it doesn’t
18prescribe full details for any particular implementation. In brief, we
19propose developing a number of new protocols for defining
20implementations of multi-dimensional arrays with high-level APIs
21matching NumPy.
22
23
24Detailed description
25--------------------
26
27Traditionally, NumPy’s ``ndarray`` objects have provided two things: a
high-level API for expressing operations on homogeneously-typed,
29arbitrary-dimensional, array-structured data, and a concrete
30implementation of the API based on strided in-RAM storage. The API is
31powerful, fairly general, and used ubiquitously across the scientific
32Python stack. The concrete implementation, on the other hand, is
33suitable for a wide range of uses, but has limitations: as data sets
34grow and NumPy becomes used in a variety of new environments, there
35are increasingly cases where the strided in-RAM storage strategy is
36inappropriate, and users find they need sparse arrays, lazily
37evaluated arrays (as in dask), compressed arrays (as in blosc), arrays
38stored in GPU memory, arrays stored in alternative formats such as
39Arrow, and so forth – yet users still want to work with these arrays
40using the familiar NumPy APIs, and re-use existing code with minimal
41(ideally zero) porting overhead. As a working shorthand, we call these
42“duck arrays”, by analogy with Python’s “duck typing”: a “duck array”
43is a Python object which “quacks like” a numpy array in the sense that
44it has the same or similar Python API, but doesn’t share the C-level
45implementation.
46
47This NEP doesn’t propose any specific changes to NumPy or other
48projects; instead, it gives an overview of how we hope to extend NumPy
49to support a robust ecosystem of projects implementing and relying
50upon its high level API.
51
52Terminology
53~~~~~~~~~~~
54
55“Duck array” works fine as a placeholder for now, but it’s pretty
56jargony and may confuse new users, so we may want to pick something
57else for the actual API functions. Unfortunately, “array-like” is
58already taken for the concept of “anything that can be coerced into an
59array” (including e.g. list objects), and “anyarray” is already taken
60for the concept of “something that shares ndarray’s implementation,
61but has different semantics”, which is the opposite of a duck array
62(e.g., np.matrix is an “anyarray”, but is not a “duck array”). This is
63a classic bike-shed so for now we’re just using “duck array”. Some
64possible options though include: arrayish, pseudoarray, nominalarray,
65ersatzarray, arraymimic, ...
66
67
68General approach
69~~~~~~~~~~~~~~~~
70
71At a high level, duck array support requires working through each of
72the API functions provided by NumPy, and figuring out how it can be
73extended to work with duck array objects. In some cases this is easy
74(e.g., methods/attributes on ndarray itself); in other cases it’s more
75difficult. Here are some principles we’ve found useful so far:
76
77
78Principle 1: Focus on “full” duck arrays, but don’t rule out “partial” duck arrays
79^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
80
81We can distinguish between two classes:
82
83* “full” duck arrays, which aspire to fully implement np.ndarray’s
84  Python-level APIs and work essentially anywhere that np.ndarray
85  works
86
87* “partial” duck arrays, which intentionally implement only a subset
88  of np.ndarray’s API.
89
90Full duck arrays are, well, kind of boring. They have exactly the same
91semantics as ndarray, with differences being restricted to
92under-the-hood decisions about how the data is actually stored. The
93kind of people that are excited about making numpy more extensible are
94also, unsurprisingly, excited about changing or extending numpy’s
95semantics. So there’s been a lot of discussion of how to best support
partial duck arrays. We've been guilty of this ourselves.
97
At this point though, we think the best general strategy is to focus
our efforts primarily on supporting full duck arrays, and to worry
about partial duck arrays only as much as needed to make sure we don't
accidentally rule them out for no reason.
102
103Why focus on full duck arrays? Several reasons:
104
105First, there are lots of very clear use cases. Potential consumers of
106the full duck array interface include almost every package that uses
107numpy (scipy, sklearn, astropy, ...), and in particular packages that
108provide array-wrapping-classes that handle multiple types of arrays,
109such as xarray and dask.array. Potential implementers of the full duck
110array interface include: distributed arrays, sparse arrays, masked
111arrays, arrays with units (unless they switch to using dtypes),
112labeled arrays, and so forth. Clear use cases lead to good and
113relevant APIs.
114
115Second, the Anna Karenina principle applies here: full duck arrays are
116all alike, but every partial duck array is partial in its own way:
117
118* ``xarray.DataArray`` is mostly a duck array, but has incompatible
119  broadcasting semantics.
120* ``xarray.Dataset`` wraps multiple arrays in one object; it still
121  implements some array interfaces like ``__array_ufunc__``, but
122  certainly not all of them.
123* ``pandas.Series`` has methods with similar behavior to numpy, but
124  unique null-skipping behavior.
* scipy’s ``LinearOperator``\s support matrix multiplication and nothing else.
126* h5py and similar libraries for accessing array storage have objects
127  that support numpy-like slicing and conversion into a full array,
128  but not computation.
129* Some classes may be similar to ndarray, but without supporting the
130  full indexing semantics.
131
132And so forth.
133
134Despite our best attempts, we haven't found any clear, unique way of
135slicing up the ndarray API into a hierarchy of related types that
136captures these distinctions; in fact, it’s unlikely that any single
137person even understands all the distinctions. And this is important,
138because we have a *lot* of APIs that we need to add duck array support
to (both in numpy and in all the projects that depend on numpy!). By
definition, these already work for ``ndarray``, and full duck arrays
act like ``ndarray``, so getting them to work for full duck arrays
hopefully shouldn’t be so hard. It’d be very cumbersome to have to go
through each function and identify the exact subset of the ndarray API
that it needs, then figure out which partial array types can/should
support it. Once we have things working for full duck arrays, we can
go back later and refine the required APIs as needed. Focusing on full
duck arrays allows us to start making progress immediately.
149
150In the future, it might be useful to identify specific use cases for
151duck arrays and standardize narrower interfaces targeted just at those
152use cases. For example, it might make sense to have a standard “array
153loader” interface that file access libraries like h5py, netcdf, pydap,
154zarr, ... all implement, to make it easy to switch between these
155libraries. But that’s something that we can do as we go, and it
156doesn’t necessarily have to involve the NumPy devs at all. For an
157example of what this might look like, see the documentation for
158`dask.array.from_array
159<http://dask.pydata.org/en/latest/array-api.html#dask.array.from_array>`__.
160
161
162Principle 2: Take advantage of duck typing
163^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
164
165``ndarray`` has a very large API surface area::
166
    In [1]: len(set(dir(np.ndarray)) - set(dir(object)))
    Out[1]: 138
169
170And this is a huge **under**\estimate, because there are also many
171free-standing functions in NumPy and other libraries which currently
172use the NumPy C API and thus only work on ``ndarray`` objects. In type
173theory, a type is defined by the operations you can perform on an
174object; thus, the actual type of ``ndarray`` includes not just its
175methods and attributes, but *all* of these functions. For duck arrays
176to be successful, they’ll need to implement a large proportion of the
177``ndarray`` API – but not all of it. (For example,
178``dask.array.Array`` does not provide an equivalent to the
179``ndarray.ptp`` method, presumably because no-one has ever noticed or
180cared about its absence. But this doesn’t seem to have stopped people
181from using dask.)
182
183This means that realistically, we can’t hope to define the whole duck
184array API up front, or that anyone will be able to implement it all in
185one go; this will be an incremental process. It also means that even
186the so-called “full” duck array interface is somewhat fuzzily defined
187at the borders; there are parts of the ``np.ndarray`` API that duck
188arrays won’t have to implement, but we aren’t entirely sure what those
189are.
190
191And ultimately, it isn’t really up to the NumPy developers to define
192what does or doesn’t qualify as a duck array. If we want scikit-learn
193functions to work on dask arrays (for example), then that’s going to
194require negotiation between those two projects to discover
195incompatibilities, and when an incompatibility is discovered it will
196be up to them to negotiate who should change and how. The NumPy
197project can provide technical tools and general advice to help resolve
198these disagreements, but we can’t force one group or another to take
199responsibility for any given bug.
200
201Therefore, even though we’re focusing on “full” duck arrays, we
202*don’t* attempt to define a normative “array ABC” – maybe this will be
203useful someday, but right now, it’s not. And as a convenient
204side-effect, the lack of a normative definition leaves partial duck
205arrays room to experiment.
206
207But, we do provide some more detailed advice for duck array
208implementers and consumers below.
209
210Principle 3: Focus on protocols
211^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
212
213Historically, numpy has had lots of success at interoperating with
214third-party objects by defining *protocols*, like ``__array__`` (asks
215an arbitrary object to convert itself into an array),
216``__array_interface__`` (a precursor to Python’s buffer protocol), and
217``__array_ufunc__`` (allows third-party objects to support ufuncs like
218``np.exp``).
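
As a minimal sketch of what participating in these protocols can look
like, the hypothetical ``Wrapped`` class below implements ``__array__``
and ``__array_ufunc__``. The class name and its ``_values`` attribute
are inventions for illustration, not part of NumPy, and the sketch
assumes the data is held internally as a plain ``ndarray``. It
deliberately declines anything other than ordinary ufunc calls without
an ``out`` argument.

.. code-block:: python

    import numpy as np


    class Wrapped:
        """Hypothetical container that opts in to two NumPy protocols."""

        def __init__(self, values):
            self._values = np.asarray(values)

        def __array__(self, dtype=None, copy=None):
            # Called by np.asarray(obj): convert ourselves to an ndarray.
            arr = self._values if dtype is None else self._values.astype(dtype)
            return arr.copy() if copy else arr

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            # Called by ufuncs such as np.exp: unwrap, compute, re-wrap.
            if method != "__call__" or kwargs.get("out") is not None:
                return NotImplemented
            unwrapped = [x._values if isinstance(x, Wrapped) else x
                         for x in inputs]
            return Wrapped(ufunc(*unwrapped, **kwargs))

With this in place, ``np.asarray(Wrapped([0.0, 1.0]))`` yields a plain
``ndarray``, while ``np.exp(Wrapped([0.0, 1.0]))`` returns another
``Wrapped`` instance.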
219
220`NEP 16 <https://github.com/numpy/numpy/pull/10706>`_ took a
221different approach: we need a duck-array equivalent of
222``asarray``, and it proposed to do this by defining a version of
223``asarray`` that would let through objects which implemented a new
224AbstractArray ABC. As noted above, we now think that trying to define
225an ABC is a bad idea for other reasons. But when this NEP was
226discussed on the mailing list, we realized that even on its own
227merits, this idea is not so great. A better approach is to define a
228*method* that can be called on an arbitrary object to ask it to
229convert itself into a duck array, and then define a version of
230``asarray`` that calls this method.
231
232This is strictly more powerful: if an object is already a duck array,
233it can simply ``return self``. It allows more correct semantics: NEP
23416 assumed that ``asarray(obj, dtype=X)`` is the same as
235``asarray(obj).astype(X)``, but this isn’t true. And it supports more
236use cases: if h5py supported sparse arrays, it might want to provide
237an object which is not itself a sparse array, but which can be
238automatically converted into a sparse array. See NEP <XX, to be
239written> for full details.
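
To make the idea concrete, here is one way such a conversion method
could look. Everything named here is an assumption made for
illustration: the method name ``__as_duck_array__``, the function
``as_duck_array``, and the classes ``Duck`` and ``LazyDataset`` are
all hypothetical, and the actual protocol is left to the future NEP.
The sketch shows the three behaviors described above: a duck array
returning itself, the ``dtype`` request being handed to the object
rather than applied afterwards, and a non-array object converting
itself into a duck array.

.. code-block:: python

    import numpy as np


    def as_duck_array(obj, dtype=None):
        """Hypothetical duck-array-aware counterpart to np.asarray."""
        # Ask the object to convert itself (method name is invented here).
        method = getattr(type(obj), "__as_duck_array__", None)
        if method is not None:
            return method(obj, dtype)
        # Objects that do not opt in are coerced to a plain ndarray.
        return np.asarray(obj, dtype=dtype)


    class Duck:
        """Hypothetical duck array wrapping an ndarray."""

        def __init__(self, data):
            self.data = np.asarray(data)

        def __as_duck_array__(self, dtype=None):
            # Already a duck array: return self unless a new dtype is needed.
            if dtype is None or np.dtype(dtype) == self.data.dtype:
                return self
            return Duck(self.data.astype(dtype))


    class LazyDataset:
        """Hypothetical storage handle: not a duck array, but convertible."""

        def __init__(self, values):
            self._values = values

        def __as_duck_array__(self, dtype=None):
            # The object itself decides how to honor the dtype request.
            return Duck(np.asarray(self._values, dtype=dtype))

Under these assumptions, ``as_duck_array(d)`` returns ``d`` itself for
a ``Duck`` instance, returns a new ``Duck`` for a ``LazyDataset``, and
falls back to ``np.asarray`` for a plain list.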
240
241The protocol approach is also more consistent with core Python
242conventions: for example, see the ``__iter__`` method for coercing
243objects to iterators, or the ``__index__`` protocol for safe integer
244coercion. And finally, focusing on protocols leaves the door open for
245partial duck arrays, which can pick and choose which subset of the
protocols they want to participate in, each of which has well-defined
247semantics.
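
For readers less familiar with the core Python conventions mentioned
above, the short example below shows both in action. The class names
``Countdown`` and ``Offset`` are made up for illustration; the
``__iter__`` and ``__index__`` methods themselves are standard Python.

.. code-block:: python

    class Countdown:
        """Coerced to an iterator through __iter__."""

        def __init__(self, start):
            self.start = start

        def __iter__(self):
            return iter(range(self.start, 0, -1))


    class Offset:
        """Usable wherever Python needs an exact integer, through __index__."""

        def __init__(self, value):
            self.value = value

        def __index__(self):
            return self.value


    print(list(Countdown(3)))    # [3, 2, 1]
    print("abcdef"[Offset(2)])   # c

Neither class inherits from a common base class; each simply
implements the one method that the consuming code asks for.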
248
249Conclusion: protocols are one honking great idea – let’s do more of
250those.
251
252Principle 4: Reuse existing methods when possible
253^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
254
255It’s tempting to try to define cleaned up versions of ndarray methods
256with a more minimal interface to allow for easier implementation. For
257example, ``__array_reshape__`` could drop some of the strange
258arguments accepted by ``reshape`` and ``__array_basic_getitem__``
259could drop all the `strange edge cases
260<http://www.numpy.org/neps/nep-0021-advanced-indexing.html>`__ of
261NumPy’s advanced indexing.
262
263But as discussed above, we don’t really know what APIs we need for
264duck-typing ndarray. We would inevitably end up with a very long list
265of new special methods. In contrast, existing methods like ``reshape``
266and ``__getitem__`` have the advantage of already being widely
267used/exercised by libraries that use duck arrays, and in practice, any
268serious duck array type is going to have to implement them anyway.
269
270Principle 5: Make it easy to do the right thing
271^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
272
273Making duck arrays work well is going to be a community effort.
274Documentation helps, but only goes so far. We want to make it easy to
275implement duck arrays that do the right thing.
276
277One way NumPy can help is by providing mixin classes for implementing
278large groups of related functionality at once.
279``NDArrayOperatorsMixin`` is a good example: it allows for
280implementing arithmetic operators implicitly via the
281``__array_ufunc__`` method. It’s not complete, and we’ll want more
282helpers like that (e.g. for reductions).
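
A simplified sketch of how the mixin is used follows. The class name
``ArrayLike`` and its ``value`` attribute are illustrative choices,
and the type check is intentionally narrow; a real duck array would
likely need to handle ``out`` arguments and other ufunc methods too.

.. code-block:: python

    import numbers

    import numpy as np
    from numpy.lib.mixins import NDArrayOperatorsMixin


    class ArrayLike(NDArrayOperatorsMixin):
        """Illustrative wrapper that gets its operators from the mixin."""

        def __init__(self, value):
            self.value = np.asarray(value)

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            # Only plain ufunc calls without ``out`` are handled here.
            if method != "__call__" or "out" in kwargs:
                return NotImplemented
            for x in inputs:
                if not isinstance(x, (np.ndarray, numbers.Number, ArrayLike)):
                    return NotImplemented
            # Unwrap our own instances, apply the ufunc, and re-wrap.
            args = [x.value if isinstance(x, ArrayLike) else x for x in inputs]
            return ArrayLike(ufunc(*args, **kwargs))

Because the mixin routes ``+``, ``*``, unary ``-`` and the other
operators through the corresponding ufuncs, expressions such as
``a + 1``, ``2 * a`` and ``-a`` all return ``ArrayLike`` instances
without any operator methods being written by hand.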
283
284(We initially thought that the importance of these mixins might be an
285argument for providing an array ABC, since that’s the standard way to
286do mixins in modern Python. But in discussion around NEP 16 we
287realized that partial duck arrays also wanted to take advantage of
288these mixins in some cases, so even if we did have an array ABC then
289the mixins would still need some sort of separate existence. So never
290mind that argument.)
291
292Tentative duck array guidelines
293~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
294
295As a general rule, libraries using duck arrays should insist upon the
296minimum possible requirements, and libraries implementing duck arrays
should provide as complete an API as possible. This will ensure
maximum compatibility. For example, consumers should prefer to rely on
299``.transpose()`` rather than ``.swapaxes()`` (which can be implemented
300in terms of transpose), but duck array authors should ideally
301implement both.
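
The parenthetical claim that ``.swapaxes()`` can be built from
``.transpose()`` can be checked with a small helper. The function name
``swapaxes_via_transpose`` is hypothetical; it assumes only that the
object provides ``.ndim`` and a ``.transpose()`` method accepting a
tuple of axes.

.. code-block:: python

    import numpy as np


    def swapaxes_via_transpose(arr, axis1, axis2):
        """Swap two axes using only ``.ndim`` and ``.transpose()``."""
        ndim = arr.ndim
        for axis in (axis1, axis2):
            if not -ndim <= axis < ndim:
                raise ValueError(
                    f"axis {axis} is out of bounds for ndim {ndim}")
        # Normalize negative axes, then exchange the two positions.
        axis1 %= ndim
        axis2 %= ndim
        order = list(range(ndim))
        order[axis1], order[axis2] = order[axis2], order[axis1]
        return arr.transpose(tuple(order))


    x = np.arange(24).reshape(2, 3, 4)
    y = swapaxes_via_transpose(x, 0, -1)
    print(y.shape)                               # (4, 3, 2)
    print(np.array_equal(y, x.swapaxes(0, -1)))  # True

A consuming library written this way works with any duck array that
implements ``transpose``, whether or not it also offers ``swapaxes``.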
302
303If you are trying to implement a duck array, then you should strive to
304implement everything. You certainly need ``.shape``, ``.ndim`` and
305``.dtype``, but also your dtype attribute should actually be a
306``numpy.dtype`` object, weird fancy indexing edge cases should ideally
307work, etc. Only details related to NumPy’s specific ``np.ndarray``
308implementation (e.g., ``strides``, ``data``, ``view``) are explicitly
309out of scope.
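
As a rough illustration of the most basic of these requirements, the
helper below checks the three attributes named above. The function
name ``check_basic_duck_attributes`` is hypothetical, and passing it
is nowhere near sufficient for being a duck array; it only catches the
simplest mistakes, such as storing a dtype as a string.

.. code-block:: python

    import numpy as np


    def check_basic_duck_attributes(obj):
        """Return a list of problems with the basic attributes."""
        problems = []
        for name in ("shape", "ndim", "dtype"):
            if not hasattr(obj, name):
                problems.append(f"missing .{name}")
        # The dtype should be a real numpy.dtype, not a string or a class.
        if hasattr(obj, "dtype") and not isinstance(obj.dtype, np.dtype):
            problems.append(".dtype is not a numpy.dtype instance")
        # For ndarray, ndim always equals the length of shape.
        if hasattr(obj, "shape") and hasattr(obj, "ndim"):
            if len(obj.shape) != obj.ndim:
                problems.append(".ndim does not match len(.shape)")
        return problems

An ``ndarray`` produces an empty list, while an object whose ``dtype``
is the string ``"float64"`` is reported as not carrying a
``numpy.dtype`` instance.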
310
311A (very) rough sketch of future plans
312~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
313
314The proposals discussed so far – ``__array_ufunc__`` and some kind of
315``asarray`` protocol – are clearly necessary but not sufficient for
316full duck typing support. We expect the need for additional protocols
317to support (at least) these features:
318
319* **Concatenating** duck arrays, which would be used internally by other
320  array combining methods like stack/vstack/hstack. The implementation
321  of concatenate will need to be negotiated among the list of array
322  arguments. We expect to use an ``__array_concatenate__`` protocol
323  like ``__array_ufunc__`` instead of multiple dispatch.
324* **Ufunc-like functions** that currently aren’t ufuncs. Many NumPy
325  functions like median, percentile, sort, where and clip could be
326  written as generalized ufuncs but currently aren’t. Either these
327  functions should be written as ufuncs, or we should consider adding
328  another generic wrapper mechanism that works similarly to ufuncs but
329  makes fewer guarantees about how the implementation is done.
330* **Random number generation** with duck arrays, e.g.,
331  ``np.random.randn()``. For example, we might want to add new APIs
332  like ``random_like()`` for generating new arrays with a matching
333  shape *and* type – though we'll need to look at some real examples
334  of how these functions are used to figure out what would be helpful.
335* **Miscellaneous other functions** such as ``np.einsum``,
336  ``np.zeros_like``, and ``np.broadcast_to`` that don’t fall into any
337  of the above categories.
338* **Checking mutability** on duck arrays, which would imply that they
339  support assignment with ``__setitem__`` and the out argument to
340  ufuncs. Many otherwise fine duck arrays are not easily mutable (for
341  example, because they use some kinds of sparse or compressed
342  storage, or are in read-only shared memory), and it turns out that
343  frequently-used code like the default implementation of ``np.mean``
344  needs to check this (to decide whether it can re-use temporary
345  arrays).
346
347We intentionally do not describe exactly how to add support for these
348types of duck arrays here. These will be the subject of future NEPs.
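
Purely to illustrate what “negotiated among the list of array
arguments” could mean for concatenation, the sketch below follows the
same pattern as ``__array_ufunc__``: each distinct argument type is
offered the operation in turn and may decline by returning
``NotImplemented``. Only the name ``__array_concatenate__`` comes from
the list above; its signature, the dispatcher ``duck_concatenate`` and
the ``Tagged`` class are assumptions, and a future NEP may well choose
a different design.

.. code-block:: python

    import numpy as np


    def duck_concatenate(arrays, axis=0):
        """Hypothetical dispatcher; not an existing NumPy function."""
        arrays = list(arrays)
        seen = set()
        any_implementer = False
        for arr in arrays:
            cls = type(arr)
            if cls in seen:
                continue
            seen.add(cls)
            method = getattr(cls, "__array_concatenate__", None)
            if method is None:
                continue
            any_implementer = True
            # Assumed signature: (self, arrays, axis).
            result = method(arr, arrays, axis)
            if result is not NotImplemented:
                return result
        if any_implementer:
            raise TypeError("no argument could handle this concatenation")
        # No argument opted in: fall back to NumPy's own implementation.
        return np.concatenate(arrays, axis=axis)


    class Tagged:
        """Hypothetical duck array that can absorb plain ndarrays."""

        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_concatenate__(self, arrays, axis):
            if not all(isinstance(a, (Tagged, np.ndarray)) for a in arrays):
                return NotImplemented
            parts = [a.data if isinstance(a, Tagged) else a for a in arrays]
            return Tagged(np.concatenate(parts, axis=axis))

Here ``duck_concatenate([np.zeros(2), Tagged([1.0, 2.0])])`` returns a
``Tagged`` result, while a list of plain ``ndarray`` objects is handled
by ``np.concatenate`` as usual.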
349
350
351Copyright
352---------
353
354This document has been placed in the public domain.
355