.. highlight:: cython

.. _profiling:

*********
Profiling
*********

This part describes the profiling abilities of Cython. If you are familiar
with profiling pure Python code, you need only read the first section
(:ref:`profiling_basics`). If you are not familiar with Python profiling, you
should also read the tutorial (:ref:`profiling_tutorial`), which takes you
through a complete example step by step.

.. _profiling_basics:

Cython Profiling Basics
=======================

Profiling in Cython is controlled by a compiler directive.
It can be set either for an entire file or on a per-function basis
via a Cython decorator.

Enabling profiling for a complete source file
---------------------------------------------

Profiling is enabled for a complete source file via a global directive to the
Cython compiler at the top of a file::

    # cython: profile=True

Note that profiling adds a slight overhead to each function call, thereby making
your program a little slower (or a lot, if you call some small functions very
often).

Once enabled, your Cython code will behave just like Python code when called
from the cProfile module. This means you can profile your Cython code
together with your Python code using the same tools as for Python code alone.

Disabling profiling for a single function
-----------------------------------------

If your profile is cluttered by the call overhead of some small functions that
you would rather not see in it - either because you plan to inline them anyway
or because you are sure that you can't make them any faster - you can use a
special decorator to disable profiling for one function only (regardless of
whether it is globally enabled or not):

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.pyx

Enabling line tracing
---------------------

To get more detailed trace information (for tools that can make use of it),
you can enable line tracing::

    # cython: linetrace=True

This will also enable profiling support, so the above ``profile=True`` option
is not needed. Line tracing is needed for coverage analysis, for example.

Note that even if line tracing is enabled via the compiler directive, it is
not used by default. As the runtime slowdown can be substantial, it must
additionally be compiled in by the C compiler by setting the C macro definition
``CYTHON_TRACE=1``. To include ``nogil`` functions in the trace, set
``CYTHON_TRACE_NOGIL=1`` (which implies ``CYTHON_TRACE=1``). C macros can be
defined either in the extension definition of the ``setup.py`` script or by
setting the respective distutils options in the source file with the following
file header comment (if ``cythonize()`` is used for compilation)::

    # distutils: define_macros=CYTHON_TRACE_NOGIL=1


Enabling coverage analysis
--------------------------

Since Cython 0.23, line tracing (see above) also enables support for coverage
reporting with the `coverage.py <http://coverage.readthedocs.io/>`_ tool.
To make the coverage analysis understand Cython modules, you also need to enable
Cython's coverage plugin in your ``.coveragerc`` file as follows:

.. code-block:: ini

    [run]
    plugins = Cython.Coverage

With this plugin, your Cython source files should show up normally in the
coverage reports.

To include the coverage report in the Cython annotated HTML file, you need
to first run the coverage.py tool to generate an XML result file. Pass
this file into the ``cython`` command as follows:

.. code-block:: bash

    $ cython --annotate-coverage coverage.xml package/mymodule.pyx

This will recompile the Cython module and generate one HTML output
file next to each Cython source file it processes, containing colour
markers for lines that were contained in the coverage report.


.. _profiling_tutorial:

Profiling Tutorial
==================

This will be a complete tutorial, start to finish, of profiling Python code,
turning it into Cython code and profiling it further until it is fast enough.

As a toy example, we would like to evaluate the summation of the reciprocals
of squares up to a certain integer :math:`n` in order to approximate
:math:`\pi`. The relation we want to use was proven by Euler in 1735 and is
known as the `Basel problem <https://en.wikipedia.org/wiki/Basel_problem>`_.


.. math::
   \pi^2 = 6 \sum_{k=1}^{\infty} \frac{1}{k^2} =
   6 \lim_{k \to \infty} \big( \frac{1}{1^2} +
   \frac{1}{2^2} + \dots + \frac{1}{k^2} \big) \approx
   6 \big( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{n^2} \big)

A simple Python code for evaluating the truncated sum looks like this:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi.py

On my box, this needs approximately 4 seconds to run the function with the
default n. The higher we choose n, the better the approximation of
:math:`\pi` will be. An experienced Python programmer will already see plenty
of places to optimize this code. But remember the golden rule of optimization:
never optimize without having profiled. Let me repeat this: **never** optimize
without having profiled your code. Your thoughts about which part of your
code takes too much time are wrong. At least, mine are always wrong. So let's
write a short script to profile our code:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile.py

Running this on my box gives the following output:

.. code-block:: none

   Sat Nov  7 17:40:54 2009    Profile.prof

            10000004 function calls in 6.211 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    3.243    3.243    6.211    6.211 calc_pi.py:7(approx_pi)
    10000000    2.526    0.000    2.526    0.000 calc_pi.py:4(recip_square)
           1    0.442    0.442    0.442    0.442 {range}
           1    0.000    0.000    6.211    6.211 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This tells us that the code runs for 6.2 CPU seconds. Note that the code got
slower by 2 seconds because it ran inside the cProfile module. The table
contains the really valuable information. You might want to check the Python
`profiling documentation <https://docs.python.org/library/profile.html>`_
for the nitty-gritty details. The most important columns here are tottime
(total time spent in this function, **not** counting functions that were
called by this function) and cumtime (total time spent in this function,
**also** counting the functions called by this function). Looking at the
tottime column, we see that approximately half the time is spent in approx_pi
and the other half is spent in recip_square. Also, half a second is spent in
range ... of course we should have used xrange for such a big iteration. And
in fact, just changing range to xrange makes the code run in 5.8 seconds.

We could optimize a lot in the pure Python version, but since we are interested
in Cython, let's move forward and bring this module to Cython. We would do this
anyway at some point to make the loop run faster. Here is our first Cython
version:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.pyx

Note the first line: we have to tell Cython that profiling should be enabled.
This makes the Cython code slightly slower, but without it we would not get
meaningful output from the cProfile module. The rest of the code is mostly
unchanged; I only typed some variables, which will likely speed things up a bit.

We also need to modify our profiling script to import the Cython module directly.
Here is the complete version adding the import of the :ref:`Pyximport<pyximport>` module:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile_2.py

We only added two lines; the rest stays completely the same. Alternatively, we
could also manually compile our code into an extension; then we wouldn't need
to change the profile script at all. The script now outputs the following:

.. code-block:: none

   Sat Nov  7 18:02:33 2009    Profile.prof

            10000004 function calls in 4.406 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    3.305    3.305    4.406    4.406 calc_pi.pyx:7(approx_pi)
    10000000    1.101    0.000    1.101    0.000 calc_pi.pyx:4(recip_square)
           1    0.000    0.000    4.406    4.406 {calc_pi.approx_pi}
           1    0.000    0.000    4.406    4.406 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

We gained 1.8 seconds. Not too shabby. Comparing the output to the previous
run, we see that the recip_square function got faster while the approx_pi
function has not changed much. Let's concentrate on the recip_square function
a bit more. First, note that this function is not meant to be called from code
outside of our module, so it would be wise to turn it into a cdef function to
reduce call overhead.
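The example modules above are pulled in from the Cython sources rather than
reproduced inline. As a plain-Python reference for what the module computes
(a sketch that merely assumes the same function names, recip_square and
approx_pi, as in the example files), the following can be used to check each
optimization step for correctness:

.. code-block:: python

    from math import pi

    def recip_square(i):
        # The summand of the truncated Basel series: 1 / i**2.
        return 1.0 / i ** 2

    def approx_pi(n=10000000):
        # pi ~= sqrt(6 * (1/1**2 + 1/2**2 + ... + 1/n**2))
        val = 0.0
        for k in range(1, n + 1):
            val += recip_square(k)
        return (6 * val) ** 0.5

    # Quick sanity check with a smaller n; the truncation error
    # shrinks roughly like 3 / (pi * n).
    print(abs(approx_pi(10000) - pi))

Keeping such a reference around makes it easy to assert, after every Cython
change, that the optimized module still returns the same values.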
We should also get rid of the power operator: it is turned into a pow(i, 2)
function call by Cython, whereas we could instead just write i*i, which could
be faster. The whole function is also a good candidate for inlining. Let's
look at the necessary changes for these ideas:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.pyx

Now running the profile script yields:

.. code-block:: none

   Sat Nov  7 18:10:11 2009    Profile.prof

            10000004 function calls in 2.622 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    1.782    1.782    2.622    2.622 calc_pi.pyx:7(approx_pi)
    10000000    0.840    0.000    0.840    0.000 calc_pi.pyx:4(recip_square)
           1    0.000    0.000    2.622    2.622 {calc_pi.approx_pi}
           1    0.000    0.000    2.622    2.622 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

That bought us another 1.8 seconds. Not the dramatic change we might have
expected. And why is recip_square still in this table? It is supposed to be
inlined, isn't it? The reason is that Cython still generates profiling code
even if the function call itself is eliminated. Let's tell it not to profile
recip_square any more; we couldn't get the function to be much faster anyway:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.pyx

Running this shows an interesting result:

.. code-block:: none

   Sat Nov  7 18:15:02 2009    Profile.prof

            4 function calls in 0.089 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.089    0.089    0.089    0.089 calc_pi.pyx:10(approx_pi)
           1    0.000    0.000    0.089    0.089 {calc_pi.approx_pi}
           1    0.000    0.000    0.089    0.089 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

First note the tremendous speed gain: this version takes only 1/50 of the time
of our first Cython version. Also note that recip_square has vanished from the
table as we wanted. But the most peculiar and important change is that
approx_pi also got much faster. This is a problem with all profiling: calling a
function in a profile run adds a certain overhead to the function call. This
overhead is **not** added to the time spent in the called function, but to the
time spent in the **calling** function. In this example, approx_pi didn't
really need 2.622 seconds in the last run; but it called recip_square
10000000 times, each call taking a little while to set up profiling for it.
This adds up to the massive time loss of around 2.6 seconds. Having disabled
profiling for the often-called function now reveals realistic timings for
approx_pi; we could continue optimizing it now if needed.

This concludes the profiling tutorial. There is still some room for
improvement in this code. We could try to replace the power operator in
approx_pi with a call to sqrt from the C stdlib; but this is not necessarily
faster than calling pow(x, 0.5).

Even so, the result we achieved here is quite satisfactory: we came up with a
solution that is much faster than our original Python version while retaining
functionality and readability.
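To close, the measurement workflow used throughout the tutorial can be
reproduced for any module with a few lines of pure Python. This is a sketch,
not the tutorial's actual profile script (which is included from the example
sources); it profiles a plain-Python approx_pi and prints a table like the
ones shown above:

.. code-block:: python

    import cProfile
    import io
    import pstats

    def recip_square(i):
        return 1.0 / i ** 2

    def approx_pi(n=100000):
        val = 0.0
        for k in range(1, n + 1):
            val += recip_square(k)
        return (6 * val) ** 0.5

    # Run the function under the profiler and collect the statistics.
    profiler = cProfile.Profile()
    result = profiler.runcall(approx_pi)

    # Sort by internal time, as in the tutorial's output above.
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats()
    report = stream.getvalue()
    print(report)

To profile a compiled Cython module the same way, simply import it and pass
its functions to ``runcall``; with ``profile=True`` enabled they show up in
the report just like the Python functions here.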