.. highlight:: cython

.. _numpy_tutorial:

**************************
Cython for NumPy users
**************************

This tutorial is aimed at NumPy users who have no experience with Cython at
all. If you have some knowledge of Cython you may want to skip to the
*Efficient indexing* section.

The main scenario considered is NumPy end-use rather than NumPy/SciPy
development. The reason is that Cython is not (yet) able to support functions
that are generic with respect to the number of dimensions in a
high-level fashion. This restriction is much more severe for SciPy development
than for more specific, "end-user" functions. See the last section for more
information on this.

The style of this tutorial will not fit everybody, so you can also consider:

* Kurt Smith's `video tutorial of Cython at SciPy 2015
  <https://www.youtube.com/watch?v=gMvkiQ-gOW8&t=4730s&ab_channel=Enthought>`_.
  The slides and notebooks of this talk are `on github
  <https://github.com/kwmsmith/scipy-2015-cython-tutorial>`_.
* Basic Cython documentation (see `Cython front page
  <https://cython.readthedocs.io/en/latest/index.html>`_).

Cython at a glance
==================

Cython is a compiler which compiles Python-like code files to C code. Still,
*Cython is not a Python to C translator*. That is, it doesn't take your full
program and "turn it into C" -- rather, the result makes full use of the
Python runtime environment. A way of looking at it may be that your code is
still Python in that it runs within the Python runtime environment, but rather
than compiling to interpreted Python bytecode one compiles to native machine
code (but with the addition of extra syntax for easy embedding of faster
C-like code).

This has two important consequences:

* Speed. How much depends very much on the program involved, though. Typical
  Python numerical programs would tend to gain very little, as most time is
  spent in lower-level C that is used in a high-level fashion. However,
  for-loop-style programs can gain many orders of magnitude when typing
  information is added (which is what makes Cython a realistic alternative
  for such code).
* Easy calling into C code. One of Cython's purposes is to allow easy wrapping
  of C libraries. When writing code in Cython you can call into C code as
  easily as into Python code.

Very few Python constructs are not yet supported, and making Cython compile all
Python code is a stated goal. You can see the differences with Python in
:ref:`limitations <cython-limitations>`.

Your Cython environment
=======================

Using Cython consists of these steps:

1. Write a :file:`.pyx` source file
2. Run the Cython compiler to generate a C file
3. Run a C compiler to generate a compiled library
4. Run the Python interpreter and ask it to import the module

However, there are several options to automate these steps:

1. The `SAGE <http://sagemath.org>`_ mathematics software system provides
   excellent support for using Cython and NumPy from an interactive command
   line or through a notebook interface (like
   Maple/Mathematica). See `this documentation
   <http://doc.sagemath.org/html/en/developer/coding_in_cython.html>`_.
2. Cython can be used as an extension within a Jupyter notebook,
   making it easy to compile and use Cython code with just a ``%%cython``
   at the top of a cell. For more information see
   :ref:`Using the Jupyter Notebook <jupyter-notebook>`.
3. A version of pyximport is shipped with Cython,
   so that you can import pyx-files dynamically into Python and
   have them compiled automatically (see :ref:`pyximport`).
4. Cython supports distutils so that you can very easily create build scripts
   which automate the process; this is the preferred method for
   libraries and packages implemented in Cython.
   See :ref:`Basic setup.py <basic_setup.py>` and the sketch after this list.
5. Manual compilation (see below)
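As a minimal sketch of option 4 (it assumes your source file is named
:file:`compute_cy.pyx` -- adapt the name to your project -- and uses
setuptools, which wraps distutils):

.. code-block:: python

    # setup.py -- a minimal, illustrative build script for a Cython module
    from setuptools import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("compute_cy.pyx"))

Running ``python setup.py build_ext --inplace`` then builds the extension
module next to the source file, ready to be imported.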
.. Note::
    If you use an interactive command line environment other than SAGE, such
    as IPython or plain Python, it is important that you restart the process
    when you recompile the module. It is not enough to issue an "import"
    statement again.

Installation
============

If you already have a C compiler, just do::

    pip install Cython

otherwise, see :ref:`the installation page <install>`.


As of this writing SAGE comes with an older release of Cython than required
for this tutorial. So if using SAGE you should download the newest Cython and
then execute ::

    $ cd path/to/cython-distro
    $ path-to-sage/sage -python setup.py install

This will install the newest Cython into SAGE.

Manual compilation
==================

As it is always important to know what is going on, I'll describe the manual
method here. First Cython is run::

    $ cython yourmod.pyx

This creates :file:`yourmod.c`, which is the C source for a Python extension
module. A useful additional switch is ``-a``, which will generate a document
(:file:`yourmod.html`) that shows which Cython code translates to which C code
line by line.

Then we compile the C file. This may vary according to your system, but the C
file should be built the same way that Python itself was built. The Python
documentation on writing extensions has more details. On Linux this often
means something like::

    $ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o yourmod.so yourmod.c

(adjust the include path to match your Python version).

``gcc`` should have access to the NumPy C header files, so if they are not
installed at :file:`/usr/include/numpy` or similar you may need to pass another
option for those. You only need to provide the NumPy headers if you write::

    cimport numpy

in your Cython code.

This creates :file:`yourmod.so` in the same directory, which is importable by
Python by using a normal ``import yourmod`` statement.
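If you are unsure where the NumPy headers live, NumPy itself can tell you:
``numpy.get_include()`` is part of NumPy's public API and returns the
directory containing its C headers. A small sketch::

    $ python -c "import numpy; print(numpy.get_include())"

The printed directory can then be passed to ``gcc`` with an extra ``-I``
option.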
The first Cython program
========================

You can easily execute the code of this tutorial by
downloading `the Jupyter notebook <https://github.com/cython/cython/blob/master/docs/examples/userguide/numpy_tutorial/numpy_and_cython.ipynb>`_.

The code below does the equivalent of this function in numpy::

    def compute_np(array_1, array_2, a, b, c):
        return np.clip(array_1, 2, 10) * a + array_2 * b + c

We'll say that ``array_1`` and ``array_2`` are 2D NumPy arrays of integer type and
``a``, ``b`` and ``c`` are three Python integers.

This function uses NumPy and is already really fast, so it might be a bit
overkill to do it again with Cython. This is for demonstration purposes.
Nonetheless, we will show that we achieve a better speed and memory efficiency
than NumPy at the cost of more verbosity.

This code computes the function with the loops over the two dimensions written
out explicitly. It is both valid Python and valid Cython code. I'll refer to it
as both :file:`compute_py.py` for the Python version and :file:`compute_cy.pyx`
for the Cython version -- Cython uses ``.pyx`` as its file suffix (but it can
also compile ``.py`` files).

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_py.py

This should be compiled to produce :file:`compute_cy.so` on Linux systems
(on Windows systems, this will be a ``.pyd`` file). We
run a Python session to test both the Python version (imported from the
``.py`` file) and the compiled Cython module.

.. sourcecode:: ipython

    In [1]: import numpy as np

    In [2]: array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)

    In [3]: array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)

    In [4]: a = 4

    In [5]: b = 3

    In [6]: c = 9

    In [7]: def compute_np(array_1, array_2, a, b, c):
       ...:     return np.clip(array_1, 2, 10) * a + array_2 * b + c

    In [8]: %timeit compute_np(array_1, array_2, a, b, c)
    103 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [9]: import compute_py

    In [10]: %timeit compute_py.compute(array_1, array_2, a, b, c)
    1min 10s ± 844 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [11]: import compute_cy

    In [12]: %timeit compute_cy.compute(array_1, array_2, a, b, c)
    56.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There's not such a huge difference yet, because the C code still does exactly
what the Python interpreter does (meaning, for instance, that a new object is
allocated for each number used).

You can look at the Python interaction and the generated C
code by using ``-a`` when calling Cython from the command
line, ``%%cython -a`` when using a Jupyter Notebook, or by using
``cythonize('compute_cy.pyx', annotate=True)`` when using a ``setup.py``.
Look at the generated html file and see what
is needed for even the simplest statements. You get the point quickly. We need
to give Cython more information; we need to add types.
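As a taste of what that means, here is a small sketch (not the tutorial's
actual file, which follows in the next section) of the kind of declarations
involved::

    # A sketch only: the arguments a, b, c become C ints, and the loop
    # variables are declared as C integer types.
    def compute(array_1, array_2, int a, int b, int c):
        # Py_ssize_t is the proper C type for Python array indices.
        cdef Py_ssize_t x, y
        cdef int tmp
        ...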
Adding types
============

To add types we use custom Cython syntax, so we are now breaking Python source
compatibility. Here's :file:`compute_typed.pyx`. *Read the comments!*

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_typed.pyx

.. figure:: compute_typed_html.jpg

At this point, have a look at the generated C code for :file:`compute_cy.pyx`
and :file:`compute_typed.pyx`. Click on the lines to expand them and see the
corresponding C.

Especially have a look at the ``for`` loops: in :file:`compute_cy.c`, these take
~20 lines of C code to set up, while in :file:`compute_typed.c` a normal C for
loop is used.

After building this and continuing my (very informal) benchmarks, I get:

.. sourcecode:: ipython

    In [13]: %timeit compute_typed.compute(array_1, array_2, a, b, c)
    26.5 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So adding types does make the code faster, but it is nowhere near the speed of
NumPy. Why is that?

Most of the time spent in this code is spent in the following lines, and those
lines are slower to execute than in pure Python::

    tmp = clip(array_1[x, y], 2, 10)
    tmp = tmp * a + array_2[x, y] * b
    result[x, y] = tmp + c

So what made those lines so much slower than in the pure Python version?

``array_1`` and ``array_2`` are still NumPy arrays, and therefore Python
objects, which expect Python integers as indices. Here we pass C int values.
So every time Cython reaches these lines, it has to convert all the C integers
to Python int objects. Since these lines are executed very often, this
outweighs the speed benefits of the pure C loops that were created from the
``range()`` earlier.

Furthermore, ``tmp * a + array_2[x, y] * b`` returns a Python integer
and ``tmp`` is a C integer, so Cython has to do type conversions again.
In the end those type conversions add up and make our computation really
slow. But this problem can be solved easily by using memoryviews.

Efficient indexing with memoryviews
===================================

There are still two bottlenecks that degrade the performance: the array lookups
and assignments, and the C/Python type conversions.
The ``[]``-operator still uses full Python operations --
what we would like to do instead is to access the data buffer directly at C
speed.

What we need to do then is to type the contents of the :obj:`ndarray` objects.
We do this with a memoryview. There is :ref:`a page in the Cython documentation
<memoryviews>` dedicated to it.

In short, memoryviews are C structures that can hold a pointer to the data
of a NumPy array and all the necessary buffer metadata to provide efficient
and safe access: dimensions, strides, item size, item type information, etc.
They also support slices, so they work even if
the NumPy array isn't contiguous in memory.
They can be indexed by C integers, thus allowing fast access to the
NumPy array data.

Here is how to declare a memoryview of integers::

    cdef int [:] foo         # 1D memoryview
    cdef int [:, :] foo      # 2D memoryview
    cdef int [:, :, :] foo   # 3D memoryview
    ...                      # You get the idea.

No data is copied from the NumPy array to the memoryview in our example.
As the name implies, it is only a "view" of the memory. So we can use the
view ``result_view`` for efficient indexing and at the end return the real NumPy
array ``result`` that holds the data that we operated on.

Here is how to use them in our code:

:file:`compute_memview.pyx`

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_memview.pyx

Let's see how much faster accessing is now.

.. sourcecode:: ipython

    In [22]: %timeit compute_memview.compute(array_1, array_2, a, b, c)
    22.9 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note the importance of this change.
We're now 3081 times faster than an interpreted version of Python and 4.5 times
faster than NumPy.

Memoryviews can be used with slices too, or even
with Python arrays. Check out the :ref:`memoryview page <memoryviews>` to
see what they can do for you.
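To make the "view" behaviour concrete, here is a small self-contained sketch
(the function name is made up for illustration). A NumPy array passed to a
memoryview parameter is not copied, so writes through the view are visible in
the original array::

    import numpy as np

    def fill_diagonal(int[:, :] view, int value):
        # Indexing with C integers compiles to direct buffer access.
        cdef Py_ssize_t i
        for i in range(min(view.shape[0], view.shape[1])):
            view[i, i] = value

    arr = np.zeros((4, 4), dtype=np.intc)
    fill_diagonal(arr, 7)   # arr now has 7 on its diagonal -- no copy was made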
Tuning indexing further
=======================

The array lookups are still slowed down by two factors:

1. Bounds checking is performed.
2. Negative indices are checked for and handled correctly. The code above is
   explicitly coded so that it doesn't use negative indices, and it
   (hopefully) always accesses within bounds.

With decorators, we can deactivate those checks::

    ...
    cimport cython
    @cython.boundscheck(False)  # Deactivate bounds checking
    @cython.wraparound(False)   # Deactivate negative indexing.
    def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):
        ...

Now bounds checking is not performed (and, as a side effect, if you *do*
happen to access out of bounds you will in the best case crash your program
and in the worst case corrupt data). It is possible to switch bounds-checking
mode in many ways; see :ref:`compiler-directives` for more
information.


.. sourcecode:: ipython

    In [23]: %timeit compute_index.compute(array_1, array_2, a, b, c)
    16.8 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We're faster than the NumPy version (6.2x). NumPy is really well written,
but does not perform operations lazily, which results in a lot
of intermediate copy operations in memory. Our version is
very memory efficient and cache friendly because we
can execute the operations in a single run over the data.

.. Warning::

    Speed comes with some cost. Especially it can be dangerous to set typed
    objects (like ``array_1``, ``array_2`` and ``result_view`` in our sample code) to ``None``.
    Setting such objects to ``None`` is entirely legal, but all you can do with them
    is check whether they are None. All other use (attribute lookup or indexing)
    can potentially segfault or corrupt data (rather than raising exceptions as
    they would in Python).

    The actual rules are a bit more complicated, but the main message is clear: do
    not use typed objects without knowing that they are not set to ``None``.

Declaring the NumPy arrays as contiguous
========================================

For extra speed gains, if you know that the NumPy arrays you are
providing are contiguous in memory, you can declare the
memoryview as contiguous.

We give an example on an array that has 3 dimensions.
If you want to give Cython the information that the data is C-contiguous
you have to declare the memoryview like this::

    cdef int [:, :, ::1] a

If you want to give Cython the information that the data is Fortran-contiguous
you have to declare the memoryview like this::

    cdef int [::1, :, :] a

If all this makes no sense to you, you can skip this part; declaring
arrays as contiguous constrains the usage of your functions, as it rejects
array slices as input.
If you still want to understand what contiguous arrays are
all about, you can see `this answer on StackOverflow
<https://stackoverflow.com/questions/26998223/what-is-the-difference-between-contiguous-and-non-contiguous-arrays>`_.

For the sake of giving numbers, here are the speed gains that you should
get by declaring the memoryviews as contiguous:

.. sourcecode:: ipython

    In [23]: %timeit compute_contiguous.compute(array_1, array_2, a, b, c)
    11.1 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We're now around nine times faster than the NumPy version, and 6300 times
faster than the pure Python version!
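If you are unsure whether a given array is contiguous, NumPy can tell you, and
``np.ascontiguousarray`` (a standard NumPy function) makes a C-contiguous copy
when needed. A small sketch:

.. code-block:: python

    import numpy as np

    arr = np.ones((3000, 2000), dtype=np.intc)
    print(arr.flags['C_CONTIGUOUS'])       # True: freshly created arrays are contiguous

    sliced = arr[:, ::2]                   # slicing with a step breaks contiguity
    print(sliced.flags['C_CONTIGUOUS'])    # False

    contig = np.ascontiguousarray(sliced)  # makes a C-contiguous copy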
Making the function cleaner
===========================

Declaring types can make your code quite verbose. If you don't mind
Cython inferring the C types of your variables, you can use
the ``infer_types=True`` compiler directive at the top of the file.
It will save you quite a bit of typing.

Note that since type declarations must happen at the top indentation level,
Cython won't infer the type of variables that are first assigned at a deeper
indentation level; doing so would change the meaning of our code too much.
This is why we must still manually declare the types of the ``tmp``, ``x``
and ``y`` variables.

And actually, manually giving the type of the ``tmp`` variable will
be useful when using fused types.

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_infer_types.pyx

We now do a speed test:

.. sourcecode:: ipython

    In [24]: %timeit compute_infer_types.compute(array_1, array_2, a, b, c)
    11.5 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Lo and behold, the speed has not changed.

More generic code
=================

All those speed gains are nice, but adding types constrains our code.
At the moment, it would mean that our function can only work with
NumPy arrays with the ``np.intc`` type. Is it possible to make our
code work for multiple NumPy data types?

Yes, with the help of a new feature called fused types.
You can learn more about it in :ref:`this section of the documentation
<fusedtypes>`.
It is similar to C++'s templates. It generates multiple function declarations
at compile time, and then chooses the right one at run time based on the
types of the arguments provided. By comparing types in if-conditions, it
is also possible to execute entirely different code paths depending
on the specific data type.

In our example, since we no longer have access to the NumPy dtype
of our input arrays, we use those ``if-else`` statements to
decide what NumPy data type to use for our output array.

In this case, our function now works for ints, doubles and floats.
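Before looking at the tutorial's actual file below, here is a minimal sketch
of the mechanism (``my_number`` is a made-up name used only for this
illustration)::

    # One definition, several C types: Cython generates one specialization
    # per member of the fused type at compile time.
    ctypedef fused my_number:
        int
        double
        float

    cdef my_number triple(my_number x):
        return 3 * x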
.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_fused_types.pyx

We can check that the output type is the right one::

    >>> compute(array_1, array_2, a, b, c).dtype
    dtype('int32')
    >>> compute(array_1.astype(np.double), array_2.astype(np.double), a, b, c).dtype
    dtype('float64')

We now do a speed test:

.. sourcecode:: ipython

    In [25]: %timeit compute_fused_types.compute(array_1, array_2, a, b, c)
    11.5 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

More versions of the function are created at compile time, so it makes
sense that the speed doesn't change when executing this function with
integers, as before.

Using multiple threads
======================

Cython has support for OpenMP. It also has some nice wrappers around it,
like the function :func:`prange`. You can find more information about Cython and
parallelism in :ref:`parallel`. Since we do elementwise operations, we can easily
distribute the work among multiple threads. It's important not to forget to pass the
correct arguments to the compiler to enable OpenMP. When using the Jupyter notebook,
you should use the cell magic like this::

    %%cython --force
    # distutils: extra_compile_args=-fopenmp
    # distutils: extra_link_args=-fopenmp

The GIL must be released (see :ref:`Releasing the GIL <nogil>`), which is why
we declare our :func:`clip` function ``nogil``.

.. literalinclude:: ../../examples/userguide/numpy_tutorial/compute_prange.pyx

We can have substantial speed gains for minimal effort:

.. sourcecode:: ipython

    In [25]: %timeit compute_prange.compute(array_1, array_2, a, b, c)
    9.33 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We're now 7558 times faster than the pure Python version and 11.1 times faster
than NumPy!

Where to go from here?
======================

* If you want to learn how to make use of `BLAS <http://www.netlib.org/blas/>`_
  or `LAPACK <http://www.netlib.org/lapack/>`_ with Cython, you can watch
  `the presentation of Ian Henriksen at SciPy 2015
  <https://www.youtube.com/watch?v=R4yB-8tB0J0&t=693s&ab_channel=Enthought>`_.
* If you want to learn how to use Pythran as a backend in Cython, you
  can see how in :ref:`Pythran as a NumPy backend <numpy-pythran>`.
  Note that using Pythran only works with the
  :ref:`old buffer syntax <working-numpy>` and not yet with memoryviews.