.. _NEP22:

===========================================================
NEP 22 — Duck typing for NumPy arrays – high level overview
===========================================================

:Author: Stephan Hoyer <shoyer@google.com>, Nathaniel J. Smith <njs@pobox.com>
:Status: Final
:Type: Informational
:Created: 2018-03-22
:Resolution: https://mail.python.org/pipermail/numpy-discussion/2018-September/078752.html

Abstract
--------

We outline a high-level vision for how NumPy will approach handling
“duck arrays”. This is an Informational-class NEP; it doesn’t
prescribe full details for any particular implementation. In brief, we
propose developing a number of new protocols for defining
implementations of multi-dimensional arrays with high-level APIs
matching NumPy.


Detailed description
--------------------

Traditionally, NumPy’s ``ndarray`` objects have provided two things: a
high-level API for expressing operations on homogeneously-typed,
arbitrary-dimensional, array-structured data, and a concrete
implementation of the API based on strided in-RAM storage. The API is
powerful, fairly general, and used ubiquitously across the scientific
Python stack. The concrete implementation, on the other hand, is
suitable for a wide range of uses, but has limitations: as data sets
grow and NumPy is used in a variety of new environments, there are
increasingly cases where the strided in-RAM storage strategy is
inappropriate, and users find they need sparse arrays, lazily
evaluated arrays (as in dask), compressed arrays (as in blosc), arrays
stored in GPU memory, arrays stored in alternative formats such as
Arrow, and so forth – yet users still want to work with these arrays
using the familiar NumPy APIs, and reuse existing code with minimal
(ideally zero) porting overhead.
As a working shorthand, we call these
“duck arrays”, by analogy with Python’s “duck typing”: a “duck array”
is a Python object which “quacks like” a numpy array in the sense that
it has the same or similar Python API, but doesn’t share the C-level
implementation.

This NEP doesn’t propose any specific changes to NumPy or other
projects; instead, it gives an overview of how we hope to extend NumPy
to support a robust ecosystem of projects implementing and relying
upon its high-level API.

Terminology
~~~~~~~~~~~

“Duck array” works fine as a placeholder for now, but it’s pretty
jargony and may confuse new users, so we may want to pick something
else for the actual API functions. Unfortunately, “array-like” is
already taken for the concept of “anything that can be coerced into an
array” (including e.g. list objects), and “anyarray” is already taken
for the concept of “something that shares ndarray’s implementation,
but has different semantics”, which is the opposite of a duck array
(e.g., ``np.matrix`` is an “anyarray”, but is not a “duck array”). This is
a classic bikeshed, so for now we’re just using “duck array”. Some
possible options though include: arrayish, pseudoarray, nominalarray,
ersatzarray, arraymimic, ...


General approach
~~~~~~~~~~~~~~~~

At a high level, duck array support requires working through each of
the API functions provided by NumPy, and figuring out how it can be
extended to work with duck array objects. In some cases this is easy
(e.g., methods/attributes on ``ndarray`` itself); in other cases it’s more
difficult.
Here are some principles we’ve found useful so far:


Principle 1: Focus on “full” duck arrays, but don’t rule out “partial” duck arrays
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can distinguish between two classes:

* “full” duck arrays, which aspire to fully implement ``np.ndarray``’s
  Python-level APIs and work essentially anywhere that ``np.ndarray``
  works

* “partial” duck arrays, which intentionally implement only a subset
  of ``np.ndarray``’s API.

Full duck arrays are, well, kind of boring. They have exactly the same
semantics as ``ndarray``, with differences being restricted to
under-the-hood decisions about how the data is actually stored. The
kind of people who are excited about making NumPy more extensible are
also, unsurprisingly, excited about changing or extending NumPy’s
semantics. So there’s been a lot of discussion of how to best support
partial duck arrays. We’ve been guilty of this ourselves.

At this point, though, we think the best general strategy is to focus
our efforts primarily on supporting full duck arrays, and only worry
about partial duck arrays as much as we need to make sure we don’t
accidentally rule them out for no reason.

Why focus on full duck arrays? Several reasons:

First, there are lots of very clear use cases. Potential consumers of
the full duck array interface include almost every package that uses
numpy (scipy, sklearn, astropy, ...), and in particular packages that
provide array-wrapping classes that handle multiple types of arrays,
such as xarray and dask.array. Potential implementers of the full duck
array interface include: distributed arrays, sparse arrays, masked
arrays, arrays with units (unless they switch to using dtypes),
labeled arrays, and so forth. Clear use cases lead to good and
relevant APIs.

Second, the Anna Karenina principle applies here: full duck arrays are
all alike, but every partial duck array is partial in its own way:

* ``xarray.DataArray`` is mostly a duck array, but has incompatible
  broadcasting semantics.
* ``xarray.Dataset`` wraps multiple arrays in one object; it still
  implements some array interfaces like ``__array_ufunc__``, but
  certainly not all of them.
* ``pandas.Series`` has methods with similar behavior to numpy, but
  unique null-skipping behavior.
* scipy’s ``LinearOperator``\s support matrix multiplication and nothing else.
* h5py and similar libraries for accessing array storage have objects
  that support numpy-like slicing and conversion into a full array,
  but not computation.
* Some classes may be similar to ndarray, but without supporting the
  full indexing semantics.

And so forth.

Despite our best attempts, we haven’t found any clear, unique way of
slicing up the ndarray API into a hierarchy of related types that
captures these distinctions; in fact, it’s unlikely that any single
person even understands all the distinctions. And this is important,
because we have a *lot* of APIs that we need to add duck array support
to (both in numpy and in all the projects that depend on numpy!). By
definition, these already work for ``ndarray``, so hopefully getting
them to work for full duck arrays shouldn’t be so hard, since by
definition full duck arrays act like ``ndarray``. It’d be very
cumbersome to have to go through each function, identify the exact
subset of the ndarray API that it needs, and then figure out which
partial array types can/should support it. Once we have things working
for full duck arrays, we can go back later and refine the APIs further
as needed. Focusing on full duck arrays allows us to start making
progress immediately.
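To make the full/partial distinction concrete, here is a small hypothetical sketch (the ``SliceOnlyStore`` class and ``normalize`` function are invented for illustration, not proposed APIs): generic code written against the ndarray API runs unchanged on any full duck array, while a partial duck array in the h5py style supports only shape inspection and slicing, so computation requires first converting it into a real array.

```python
import numpy as np

def normalize(arr):
    # Generic code written against the ndarray API. By definition it
    # works on np.ndarray and, unchanged, on any "full" duck array.
    return (arr - arr.mean()) / arr.std()

class SliceOnlyStore:
    """A hypothetical "partial" duck array, h5py-style: it knows its
    shape and supports slicing (which returns real ndarrays), but
    deliberately implements no computation methods."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.shape = self._data.shape

    def __getitem__(self, key):
        return self._data[key]

store = SliceOnlyStore([1.0, 2.0, 3.0])
result = normalize(store[:])  # fine: slicing yields a real ndarray first
# normalize(store) would raise AttributeError: SliceOnlyStore has no .mean()
```

The point of the sketch is that ``normalize`` never needed to know which kind of array it received, which is exactly the property that makes full duck arrays cheap to support.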

In the future, it might be useful to identify specific use cases for
duck arrays and standardize narrower interfaces targeted just at those
use cases. For example, it might make sense to have a standard “array
loader” interface that file access libraries like h5py, netcdf, pydap,
zarr, ... all implement, to make it easy to switch between these
libraries. But that’s something that we can do as we go, and it
doesn’t necessarily have to involve the NumPy devs at all. For an
example of what this might look like, see the documentation for
`dask.array.from_array
<http://dask.pydata.org/en/latest/array-api.html#dask.array.from_array>`__.


Principle 2: Take advantage of duck typing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``ndarray`` has a very large API surface area::

    In [1]: len(set(dir(np.ndarray)) - set(dir(object)))
    Out[1]: 138

And this is a huge **under**\estimate, because there are also many
free-standing functions in NumPy and other libraries which currently
use the NumPy C API and thus only work on ``ndarray`` objects. In type
theory, a type is defined by the operations you can perform on an
object; thus, the actual type of ``ndarray`` includes not just its
methods and attributes, but *all* of these functions. For duck arrays
to be successful, they’ll need to implement a large proportion of the
``ndarray`` API – but not all of it. (For example,
``dask.array.Array`` does not provide an equivalent to the
``ndarray.ptp`` method, presumably because no one has ever noticed or
cared about its absence. But this doesn’t seem to have stopped people
from using dask.)

This means that realistically, we can’t hope to define the whole duck
array API up front, or that anyone will be able to implement it all in
one go; this will be an incremental process.
It also means that even
the so-called “full” duck array interface is somewhat fuzzily defined
at the borders; there are parts of the ``np.ndarray`` API that duck
arrays won’t have to implement, but we aren’t entirely sure what those
are.

And ultimately, it isn’t really up to the NumPy developers to define
what does or doesn’t qualify as a duck array. If we want scikit-learn
functions to work on dask arrays (for example), then that’s going to
require negotiation between those two projects to discover
incompatibilities, and when an incompatibility is discovered it will
be up to them to negotiate who should change and how. The NumPy
project can provide technical tools and general advice to help resolve
these disagreements, but we can’t force one group or another to take
responsibility for any given bug.

Therefore, even though we’re focusing on “full” duck arrays, we
*don’t* attempt to define a normative “array ABC” – maybe this will be
useful someday, but right now, it’s not. And as a convenient
side-effect, the lack of a normative definition leaves partial duck
arrays room to experiment.

But we do provide some more detailed advice for duck array
implementers and consumers below.

Principle 3: Focus on protocols
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Historically, numpy has had lots of success at interoperating with
third-party objects by defining *protocols*, like ``__array__`` (asks
an arbitrary object to convert itself into an array),
``__array_interface__`` (a precursor to Python’s buffer protocol), and
``__array_ufunc__`` (allows third-party objects to support ufuncs like
``np.exp``).

`NEP 16 <https://github.com/numpy/numpy/pull/10706>`_ took a
different approach: we need a duck-array equivalent of
``asarray``, and it proposed to do this by defining a version of
``asarray`` that would let through objects which implemented a new
``AbstractArray`` ABC.
As noted above, we now think that trying to define
an ABC is a bad idea for other reasons. But when this NEP was
discussed on the mailing list, we realized that even on its own
merits, this idea is not so great. A better approach is to define a
*method* that can be called on an arbitrary object to ask it to
convert itself into a duck array, and then define a version of
``asarray`` that calls this method.

This is strictly more powerful: if an object is already a duck array,
it can simply ``return self``. It allows more correct semantics: NEP
16 assumed that ``asarray(obj, dtype=X)`` is the same as
``asarray(obj).astype(X)``, but this isn’t true. And it supports more
use cases: if h5py supported sparse arrays, it might want to provide
an object which is not itself a sparse array, but which can be
automatically converted into a sparse array. See NEP <XX, to be
written> for full details.

The protocol approach is also more consistent with core Python
conventions: for example, see the ``__iter__`` method for coercing
objects to iterators, or the ``__index__`` protocol for safe integer
coercion. And finally, focusing on protocols leaves the door open for
partial duck arrays, which can pick and choose which subset of the
protocols they want to participate in, each of which has well-defined
semantics.

Conclusion: protocols are one honking great idea – let’s do more of
those.

Principle 4: Reuse existing methods when possible
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It’s tempting to try to define cleaned-up versions of ndarray methods
with a more minimal interface to allow for easier implementation.
For
example, ``__array_reshape__`` could drop some of the strange
arguments accepted by ``reshape``, and ``__array_basic_getitem__``
could drop all the `strange edge cases
<http://www.numpy.org/neps/nep-0021-advanced-indexing.html>`__ of
NumPy’s advanced indexing.

But as discussed above, we don’t really know what APIs we need for
duck-typing ndarray. We would inevitably end up with a very long list
of new special methods. In contrast, existing methods like ``reshape``
and ``__getitem__`` have the advantage of already being widely
used/exercised by libraries that use duck arrays, and in practice, any
serious duck array type is going to have to implement them anyway.

Principle 5: Make it easy to do the right thing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Making duck arrays work well is going to be a community effort.
Documentation helps, but only goes so far. We want to make it easy to
implement duck arrays that do the right thing.

One way NumPy can help is by providing mixin classes for implementing
large groups of related functionality at once.
``NDArrayOperatorsMixin`` is a good example: it allows for
implementing arithmetic operators implicitly via the
``__array_ufunc__`` method. It’s not complete, and we’ll want more
helpers like that (e.g. for reductions).

(We initially thought that the importance of these mixins might be an
argument for providing an array ABC, since that’s the standard way to
do mixins in modern Python. But in discussion around NEP 16 we
realized that partial duck arrays also wanted to take advantage of
these mixins in some cases, so even if we did have an array ABC then
the mixins would still need some sort of separate existence. So never
mind that argument.)
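As a concrete illustration of the mixin approach, here is a minimal sketch (the ``WrappedArray`` class is invented for this example, and its ``__array_ufunc__`` is deliberately simplified: it ignores the ``out=`` keyword, tuple-valued results, and ufunc methods such as ``at`` that a serious implementation would need to handle):

```python
import numbers
import numpy as np

class WrappedArray(np.lib.mixins.NDArrayOperatorsMixin):
    """A hypothetical minimal duck array: it stores an ndarray and
    inherits all arithmetic/comparison operators from the mixin, which
    routes them through __array_ufunc__."""

    def __init__(self, value):
        self.value = np.asarray(value)

    # Types we handle in __array_ufunc__, besides WrappedArray itself.
    _HANDLED_TYPES = (np.ndarray, numbers.Number)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Defer to operand types we don't recognize.
        if not all(isinstance(x, self._HANDLED_TYPES + (WrappedArray,))
                   for x in inputs):
            return NotImplemented
        # Unwrap, run the real ufunc, and re-wrap the result.
        unwrapped = tuple(x.value if isinstance(x, WrappedArray) else x
                          for x in inputs)
        result = getattr(ufunc, method)(*unwrapped, **kwargs)
        return type(self)(result)

a = WrappedArray([1.0, 2.0, 3.0])
b = a + 1      # provided by NDArrayOperatorsMixin, via np.add
c = np.exp(a)  # ufuncs dispatch to __array_ufunc__ directly
```

Because the mixin defines every arithmetic and comparison special method in terms of the corresponding ufunc, implementing ``__array_ufunc__`` once gives ``WrappedArray`` working ``+``, ``*``, ``<``, and so on for free.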

Tentative duck array guidelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As a general rule, libraries using duck arrays should insist upon the
minimum possible requirements, and libraries implementing duck arrays
should provide as complete an API as possible. This will ensure
maximum compatibility. For example, users should prefer to rely on
``.transpose()`` rather than ``.swapaxes()`` (which can be implemented
in terms of transpose), but duck array authors should ideally
implement both.

If you are trying to implement a duck array, then you should strive to
implement everything. You certainly need ``.shape``, ``.ndim`` and
``.dtype``, but also your dtype attribute should actually be a
``numpy.dtype`` object, weird fancy indexing edge cases should ideally
work, etc. Only details related to NumPy’s specific ``np.ndarray``
implementation (e.g., ``strides``, ``data``, ``view``) are explicitly
out of scope.

A (very) rough sketch of future plans
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proposals discussed so far – ``__array_ufunc__`` and some kind of
``asarray`` protocol – are clearly necessary but not sufficient for
full duck typing support. We expect the need for additional protocols
to support (at least) these features:

* **Concatenating** duck arrays, which would be used internally by other
  array-combining methods like stack/vstack/hstack. The implementation
  of concatenate will need to be negotiated among the list of array
  arguments. We expect to use an ``__array_concatenate__`` protocol
  like ``__array_ufunc__`` instead of multiple dispatch.
* **Ufunc-like functions** that currently aren’t ufuncs. Many NumPy
  functions like median, percentile, sort, where and clip could be
  written as generalized ufuncs but currently aren’t.
Either these 327 functions should be written as ufuncs, or we should consider adding 328 another generic wrapper mechanism that works similarly to ufuncs but 329 makes fewer guarantees about how the implementation is done. 330* **Random number generation** with duck arrays, e.g., 331 ``np.random.randn()``. For example, we might want to add new APIs 332 like ``random_like()`` for generating new arrays with a matching 333 shape *and* type – though we'll need to look at some real examples 334 of how these functions are used to figure out what would be helpful. 335* **Miscellaneous other functions** such as ``np.einsum``, 336 ``np.zeros_like``, and ``np.broadcast_to`` that don’t fall into any 337 of the above categories. 338* **Checking mutability** on duck arrays, which would imply that they 339 support assignment with ``__setitem__`` and the out argument to 340 ufuncs. Many otherwise fine duck arrays are not easily mutable (for 341 example, because they use some kinds of sparse or compressed 342 storage, or are in read-only shared memory), and it turns out that 343 frequently-used code like the default implementation of ``np.mean`` 344 needs to check this (to decide whether it can re-use temporary 345 arrays). 346 347We intentionally do not describe exactly how to add support for these 348types of duck arrays here. These will be the subject of future NEPs. 349 350 351Copyright 352--------- 353 354This document has been placed in the public domain. 355