.. highlight:: cython

.. _numpy_tutorial:

**************************
Cython for NumPy users
**************************

This tutorial is aimed at NumPy users who have no experience with Cython at
all. If you have some knowledge of Cython you may want to skip to the
*Efficient indexing with memoryviews* section.

The main scenario considered is NumPy end-use rather than NumPy/SciPy
development. The reason is that Cython is not (yet) able to support functions
that are generic with respect to the number of dimensions in a
high-level fashion. This restriction is much more severe for SciPy development
than for more specific, "end-user" functions. See the last section for more
information on this.

The style of this tutorial will not fit everybody, so you can also consider:

* Kurt Smith's `video tutorial of Cython at SciPy 2015
  <https://www.youtube.com/watch?v=gMvkiQ-gOW8&t=4730s&ab_channel=Enthought>`_.
  The slides and notebooks of this talk are `on github
  <https://github.com/kwmsmith/scipy-2015-cython-tutorial>`_.
* Basic Cython documentation (see the `Cython front page
  <https://cython.readthedocs.io/en/latest/index.html>`_).

Cython at a glance
==================

Cython is a compiler which compiles Python-like code files to C code. Still,
*Cython is not a Python to C translator*. That is, it doesn't take your full
program and "turn it into C" -- rather, the result makes full use of the
Python runtime environment. A way of looking at it may be that your code is
still Python in that it runs within the Python runtime environment, but rather
than compiling to interpreted Python bytecode one compiles to native machine
code (with the addition of extra syntax for easy embedding of faster
C-like code).

This has two important consequences:

* Speed. How much you gain depends very much on the program involved.
  Typical Python numerical programs tend to gain very little, as most of the
  time is spent in lower-level C code that is used in a high-level fashion.
  However, for-loop-style programs can gain many orders of magnitude once
  typing information is added (which is what makes Cython a realistic
  alternative for such code).
* Easy calling into C code. One of Cython's purposes is to allow easy wrapping
  of C libraries. When writing code in Cython you can call into C code as
  easily as into Python code (see the small sketch after this list).

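As a small illustrative sketch (not part of the tutorial's example files), this
is how a C library function can be declared and called from Cython; here we
wrap ``sin`` from the C math library::

    cdef extern from "math.h":
        double sin(double x)   # declare the C function we want to call

    def py_sin(double x):
        # Calling the C function looks just like calling a Python function.
        return sin(x)
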
Very few Python constructs are not yet supported, and making Cython compile all
Python code is a stated goal. You can see the differences with Python in
:ref:`limitations <cython-limitations>`.

Your Cython environment
=======================

Using Cython consists of these steps:

1. Write a :file:`.pyx` source file
2. Run the Cython compiler to generate a C file
3. Run a C compiler to generate a compiled library
4. Run the Python interpreter and ask it to import the module

However, there are several options to automate these steps:

1. The `SAGE <http://sagemath.org>`_ mathematics software system provides
   excellent support for using Cython and NumPy from an interactive command
   line or through a notebook interface (like
   Maple/Mathematica). See `this documentation
   <http://doc.sagemath.org/html/en/developer/coding_in_cython.html>`_.
2. Cython can be used as an extension within a Jupyter notebook,
   making it easy to compile and use Cython code with just a ``%%cython``
   at the top of a cell. For more information see
   :ref:`Using the Jupyter Notebook <jupyter-notebook>`.
3. A version of pyximport is shipped with Cython,
   so that you can import pyx-files dynamically into Python and
   have them compiled automatically (see :ref:`pyximport`).
4. Cython supports distutils so that you can very easily create build scripts
   which automate the process; this is the preferred method for
   Cython-implemented libraries and packages.
   See :ref:`Basic setup.py <basic_setup.py>` and the minimal sketch
   after this list.
5. Manual compilation (see below)

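As a minimal sketch of such a build script (assuming the extension module from
this tutorial, built from :file:`compute_cy.pyx`; see :ref:`Basic setup.py
<basic_setup.py>` for details)::

    # setup.py -- build in place with:  python setup.py build_ext --inplace
    from setuptools import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("compute_cy.pyx"))
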
.. Note::
    If you use an interactive command line environment other than SAGE, like
    IPython or Python itself, it is important that you restart the process
    when you recompile the module. It is not enough to issue an ``import``
    statement again.

Installation
============

If you already have a C compiler, just do::

   pip install Cython

otherwise, see :ref:`the installation page <install>`.


As of this writing SAGE comes with an older release of Cython than required
for this tutorial. So if using SAGE you should download the newest Cython and
then execute ::

    $ cd path/to/cython-distro
    $ path-to-sage/sage -python setup.py install

This will install the newest Cython into SAGE.

Manual compilation
====================

As it is always important to know what is going on, I'll describe the manual
method here. First Cython is run::

    $ cython yourmod.pyx

This creates :file:`yourmod.c`, which is the C source for a Python extension
module. A useful additional switch is ``-a``, which will generate an HTML
document (:file:`yourmod.html`) that shows which Cython code translates to
which C code line by line.

Then we compile the C file. This may vary according to your system, but the C
file should be built like Python was built. Python documentation for writing
extensions should have some details. On Linux this often means something
like the following (adjust the Python include path to the version you are
building for)::

    $ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python3.11 -o yourmod.so yourmod.c

``gcc`` should have access to the NumPy C header files, so if they are not
installed at :file:`/usr/include/numpy` or similar, you may need to pass
another option for those. You only need to provide the NumPy headers if you
write::

    cimport numpy

in your Cython code.
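
One way to find the directory containing the NumPy headers (a small sketch,
assuming NumPy is installed in the Python environment you compile for) is to
ask NumPy itself and pass the result to the C compiler with ``-I``::

    import numpy
    print(numpy.get_include())  # e.g. .../site-packages/numpy/core/include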

This creates :file:`yourmod.so` in the same directory, which Python can import
with a normal ``import yourmod`` statement.

The first Cython program
==========================

You can easily execute the code of this tutorial by
downloading `the Jupyter notebook <https://github.com/cython/cython/blob/master/docs/examples/userguide/numpy_tutorial/numpy_and_cython.ipynb>`_.

The code below does the equivalent of this function in NumPy::

    def compute_np(array_1, array_2, a, b, c):
        return np.clip(array_1, 2, 10) * a + array_2 * b + c

We'll say that ``array_1`` and ``array_2`` are 2D NumPy arrays of integer type
and ``a``, ``b`` and ``c`` are three Python integers.

This function uses NumPy and is already really fast, so it might be a bit
overkill to do it again with Cython. This is for demonstration purposes.
Nonetheless, we will show that we achieve better speed and memory efficiency
than NumPy at the cost of more verbosity.

This code computes the function with explicit loops over the two dimensions.
It is both valid Python and valid Cython code. I'll refer to it as both
:file:`compute_py.py` for the Python version and :file:`compute_cy.pyx` for the
Cython version -- Cython uses ``.pyx`` as its file suffix (but it can also
compile ``.py`` files).

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_py.py

This should be compiled to produce :file:`compute_cy.so` on Linux systems
(on Windows systems, this will be a ``.pyd`` file). We
run a Python session to test both the Python version (imported from the
``.py`` file) and the compiled Cython module.

.. sourcecode:: ipython

    In [1]: import numpy as np
    In [2]: array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
    In [3]: array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
    In [4]: a = 4
    In [5]: b = 3
    In [6]: c = 9
    In [7]: def compute_np(array_1, array_2, a, b, c):
       ...:     return np.clip(array_1, 2, 10) * a + array_2 * b + c
    In [8]: %timeit compute_np(array_1, array_2, a, b, c)
    103 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [9]: import compute_py
    In [10]: %timeit compute_py.compute(array_1, array_2, a, b, c)
    1min 10s ± 844 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [11]: import compute_cy
    In [12]: %timeit compute_cy.compute(array_1, array_2, a, b, c)
    56.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There's not such a huge difference yet, because the C code still does exactly
what the Python interpreter does (meaning, for instance, that a new object is
allocated for each number used).

You can look at the Python interaction and the generated C
code by using ``-a`` when calling Cython from the command
line, ``%%cython -a`` when using a Jupyter Notebook, or by using
``cythonize('compute_cy.pyx', annotate=True)`` when using a ``setup.py``.
Look at the generated HTML file and see what
is needed for even the simplest statements. You get the point quickly. We need
to give Cython more information; we need to add types.


Adding types
============

To add types we use custom Cython syntax, so we are now breaking Python source
compatibility. Here's :file:`compute_typed.pyx`. *Read the comments!*

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_typed.pyx

.. figure:: compute_typed_html.jpg

At this point, have a look at the generated C code for :file:`compute_cy.pyx`
and :file:`compute_typed.pyx`. Click on the lines to expand them and see the
corresponding C code.

Especially have a look at the ``for`` loops: in :file:`compute_cy.c`, these
take ~20 lines of C code to set up, while in :file:`compute_typed.c` a normal
C for loop is used.

After building this and continuing my (very informal) benchmarks, I get:

.. sourcecode:: ipython

    In [13]: %timeit compute_typed.compute(array_1, array_2, a, b, c)
    26.5 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So adding types does make the code faster, but it is still nowhere near the
speed of NumPy. Why?

What happened is that most of the time spent in this code is spent in the
following lines, and those lines are slower to execute than in pure Python::

    tmp = clip(array_1[x, y], 2, 10)
    tmp = tmp * a + array_2[x, y] * b
    result[x, y] = tmp + c

So what made those lines so much slower than in the pure Python version?

``array_1`` and ``array_2`` are still NumPy arrays, so Python objects, and
expect Python integers as indices. Here we pass C int values. So every time
Cython reaches these lines, it has to convert all the C integers to Python
int objects. Since these lines are executed very often, this outweighs the
speed benefits of the pure C loops that were created from the ``range()``
earlier.

Furthermore, ``tmp * a + array_2[x, y] * b`` returns a Python integer
and ``tmp`` is a C integer, so Cython has to do type conversions again.
In the end, those type conversions add up and make our computation really
slow. But this problem can be solved easily by using memoryviews.

Efficient indexing with memoryviews
===================================

There are still two bottlenecks that degrade the performance: the array
lookups and assignments, as well as the C/Python type conversions.
The ``[]``-operator still uses full Python operations --
what we would like to do instead is to access the data buffer directly at C
speed.

What we need to do then is to type the contents of the :obj:`ndarray` objects.
We do this with a memoryview. There is :ref:`a page in the Cython documentation
<memoryviews>` dedicated to it.

In short, memoryviews are C structures that can hold a pointer to the data
of a NumPy array and all the necessary buffer metadata to provide efficient
and safe access: dimensions, strides, item size, item type information, etc.
They also support slices, so they work even if
the NumPy array isn't contiguous in memory.
They can be indexed by C integers, thus allowing fast access to the
NumPy array data.

Here is how to declare a memoryview of integers::

    cdef int [:] foo         # 1D memoryview
    cdef int [:, :] foo      # 2D memoryview
    cdef int [:, :, :] foo   # 3D memoryview
    ...                      # You get the idea.

No data is copied from the NumPy array to the memoryview in our example.
As the name implies, it is only a "view" of the memory. So we can use the
view ``result_view`` for efficient indexing and at the end return the real
NumPy array ``result`` that holds the data that we operated on.
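
As a minimal sketch of that pattern (a simplified illustration, not the full
example file), the function allocates the NumPy array, does its indexing work
through a memoryview on it, and returns the array itself::

    import numpy as np

    def example(int x_max, int y_max):
        result = np.zeros((x_max, y_max), dtype=np.intc)  # the real NumPy array
        cdef int[:, :] result_view = result               # view used for fast indexing
        result_view[0, 0] = 1                             # C-speed element access
        return result                                     # return the NumPy array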

Here is how to use them in our code:

:file:`compute_memview.pyx`

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_memview.pyx

Let's see how much faster accessing is now.

.. sourcecode:: ipython

    In [22]: %timeit compute_memview.compute(array_1, array_2, a, b, c)
    22.9 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note the importance of this change.
We're now 3081 times faster than an interpreted version of Python and 4.5 times
faster than NumPy.

Memoryviews can be used with slices too, or even
with Python arrays. Check out the :ref:`memoryview page <memoryviews>` to
see what they can do for you.

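For instance, here is a small sketch (illustrative only; see the memoryview
page for the full story) of a memoryview taken over a Python ``array.array``
and over a non-contiguous NumPy slice::

    import array
    import numpy as np

    def memoryview_sources():
        cdef int[:] from_py_array = array.array('i', [1, 2, 3])        # Python array
        cdef int[:] from_np_slice = np.arange(10, dtype=np.intc)[::2]  # strided NumPy slice
        return from_py_array[0] + from_np_slice[1]
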
Tuning indexing further
========================

The array lookups are still slowed down by two factors:

1. Bounds checking is performed.
2. Negative indices are checked for and handled correctly.  The code above is
   explicitly coded so that it doesn't use negative indices, and it
   (hopefully) always accesses within bounds.

With decorators, we can deactivate those checks::

    ...
    cimport cython
    @cython.boundscheck(False)  # Deactivate bounds checking
    @cython.wraparound(False)   # Deactivate negative indexing.
    def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):
    ...

Now bounds checking is not performed (and, as a side-effect, if you *do*
happen to access out of bounds you will in the best case crash your program
and in the worst case corrupt data). It is possible to switch bounds-checking
mode in many ways; see :ref:`compiler-directives` for more
information.

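For example, the same directives can also be enabled for a whole module with a
special comment at the very top of the :file:`.pyx` file (a small illustrative
sketch; see :ref:`compiler-directives` for all the available options and ways
to set them)::

    # cython: boundscheck=False
    # cython: wraparound=False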

.. sourcecode:: ipython

    In [23]: %timeit compute_index.compute(array_1, array_2, a, b, c)
    16.8 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We're faster than the NumPy version (6.2x). NumPy is really well written,
but does not perform operations lazily, which results in a lot
of intermediate copy operations in memory. Our version is
very memory efficient and cache friendly because we
can execute the operations in a single run over the data.

.. Warning::

    Speed comes with some cost. In particular, it can be dangerous to set typed
    objects (like ``array_1``, ``array_2`` and ``result_view`` in our sample code) to ``None``.
    Setting such objects to ``None`` is entirely legal, but all you can do with them
    is check whether they are ``None``. All other use (attribute lookup or indexing)
    can potentially segfault or corrupt data (rather than raising exceptions as
    they would in Python).

    The actual rules are a bit more complicated, but the main message is clear: do
    not use typed objects without knowing that they are not set to ``None``.

Declaring the NumPy arrays as contiguous
========================================

For extra speed gains, if you know that the NumPy arrays you are
providing are contiguous in memory, you can declare the
memoryview as contiguous.

We give an example for an array that has 3 dimensions.
If you want to give Cython the information that the data is C-contiguous
you have to declare the memoryview like this::

    cdef int [:, :, ::1] a

If you want to give Cython the information that the data is Fortran-contiguous
you have to declare the memoryview like this::

    cdef int [::1, :, :] a

If all this makes no sense to you, you can skip this part; just know that
declaring arrays as contiguous constrains the usage of your functions, as it
makes them reject (non-contiguous) array slices as input.
If you still want to understand what contiguous arrays are
all about, you can see `this answer on StackOverflow
<https://stackoverflow.com/questions/26998223/what-is-the-difference-between-contiguous-and-non-contiguous-arrays>`_.
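
If a caller only has a non-contiguous array (for instance a slice), one option
is to copy it into a contiguous buffer first. Here is a small sketch using
plain NumPy; it is not part of the original example files::

    import numpy as np

    arr = np.arange(100, dtype=np.intc).reshape(10, 10)
    sliced = arr[:, ::2]                        # not C-contiguous
    compatible = np.ascontiguousarray(sliced)   # C-contiguous copy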

For the sake of giving numbers, here are the speed gains that you should
get by declaring the memoryviews as contiguous:

.. sourcecode:: ipython

    In [23]: %timeit compute_contiguous.compute(array_1, array_2, a, b, c)
    11.1 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We're now around nine times faster than the NumPy version, and 6300 times
faster than the pure Python version!

Making the function cleaner
===========================

Declaring types can make your code quite verbose. If you don't mind
Cython inferring the C types of your variables, you can use
the ``infer_types=True`` compiler directive at the top of the file.
It will save you quite a bit of typing.
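
As a small sketch, enabling the directive is just a comment at the very top of
the :file:`.pyx` file (it must come before any code)::

    # cython: infer_types=True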

Note that since type declarations must happen at the top indentation level,
Cython won't infer the type of variables that are declared for the first time
in other indentation levels; doing so would change the meaning of our code too
much. This is why we must still manually declare the types of the
``tmp``, ``x`` and ``y`` variables.

In fact, manually giving the type of the ``tmp`` variable will
be useful later, when we use fused types.

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_infer_types.pyx

We now do a speed test:

.. sourcecode:: ipython

    In [24]: %timeit compute_infer_types.compute(array_1, array_2, a, b, c)
    11.5 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Lo and behold, the speed has not changed.

More generic code
==================

All those speed gains are nice, but adding types constrains our code.
At the moment, it would mean that our function can only work with
NumPy arrays with the ``np.intc`` type. Is it possible to make our
code work for multiple NumPy data types?

Yes, with the help of a new feature called fused types.
You can learn more about it at :ref:`this section of the documentation
<fusedtypes>`.
It is similar to C++ templates. It generates multiple function declarations
at compile time, and then chooses the right one at run-time based on the
types of the arguments provided. By comparing types in if-conditions, it
is also possible to execute entirely different code paths depending
on the specific data type.

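As a simplified sketch of the idea (the full version is in the example file
below), a fused type and such an ``if``-check can look like this::

    import numpy as np

    ctypedef fused my_type:
        int
        double
        float

    def output_dtype_for(my_type[:, :] arr):
        # Each specialization of this function knows my_type at compile time.
        if my_type is int:
            return np.intc
        elif my_type is double:
            return np.double
        else:
            return np.float32
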
In our example, since we no longer have access to the NumPy dtype
of our input arrays, we use those ``if-else`` statements to
know what NumPy data type we should use for our output array.

In this case, our function now works for ints, doubles and floats.

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_fused_types.pyx

We can check that the output type is the right one::

    >>> compute(array_1, array_2, a, b, c).dtype
    dtype('int32')
    >>> compute(array_1.astype(np.double), array_2.astype(np.double), a, b, c).dtype
    dtype('float64')

We now do a speed test:

.. sourcecode:: ipython

    In [25]: %timeit compute_fused_types.compute(array_1, array_2, a, b, c)
    11.5 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

More versions of the function are created at compile time and the right one is
picked at run time, so it makes sense that the speed doesn't change when
executing this function with integers as before.

Using multiple threads
======================

Cython has support for OpenMP.  It also has some nice wrappers around it,
like the function :func:`prange`. You can see more information about Cython and
parallelism in :ref:`parallel`. Since we do elementwise operations, we can easily
distribute the work among multiple threads. It's important not to forget to pass the
correct arguments to the compiler to enable OpenMP. When using the Jupyter notebook,
you should use the cell magic like this::

    %%cython --force
    # distutils: extra_compile_args=-fopenmp
    # distutils: extra_link_args=-fopenmp

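When building with a ``setup.py`` instead, the equivalent is to pass the same
flags to the extension. This is a minimal sketch, assuming GCC-style flags
(MSVC uses ``/openmp``)::

    from setuptools import Extension, setup
    from Cython.Build import cythonize

    ext = Extension(
        "compute_prange",
        ["compute_prange.pyx"],
        extra_compile_args=["-fopenmp"],
        extra_link_args=["-fopenmp"],
    )
    setup(ext_modules=cythonize([ext]))
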
The GIL must be released (see :ref:`Releasing the GIL <nogil>`), which is why we
declare our :func:`clip` function ``nogil``.

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_prange.pyx

We can have substantial speed gains for minimal effort:

.. sourcecode:: ipython

    In [25]: %timeit compute_prange.compute(array_1, array_2, a, b, c)
    9.33 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We're now 7558 times faster than the pure Python version and 11.1 times faster
than NumPy!

Where to go from here?
======================

* If you want to learn how to make use of `BLAS <http://www.netlib.org/blas/>`_
  or `LAPACK <http://www.netlib.org/lapack/>`_ with Cython, you can watch
  `the presentation of Ian Henriksen at SciPy 2015
  <https://www.youtube.com/watch?v=R4yB-8tB0J0&t=693s&ab_channel=Enthought>`_.
* If you want to learn how to use Pythran as a backend in Cython, you
  can see how in :ref:`Pythran as a NumPy backend <numpy-pythran>`.
  Note that using Pythran only works with the
  :ref:`old buffer syntax <working-numpy>` and not yet with memoryviews.