.. highlight:: cython

.. _profiling:

*********
Profiling
*********

This part describes the profiling abilities of Cython. If you are familiar
with profiling pure Python code, you may only need to read the first section
(:ref:`profiling_basics`). If you are not familiar with Python profiling, you
should also read the tutorial (:ref:`profiling_tutorial`), which takes you
through a complete example step by step.

.. _profiling_basics:

Cython Profiling Basics
=======================

Profiling in Cython is controlled by a compiler directive.
It can be set either for an entire file or on a per-function basis
via a Cython decorator.

Enabling profiling for a complete source file
---------------------------------------------

Profiling is enabled for a complete source file via a global directive to the
Cython compiler at the top of a file::

   # cython: profile=True

Note that profiling adds a slight overhead to each function call, therefore
making your program a little slower (or a lot, if you call some small
functions very often).

Once enabled, your Cython code will behave just like Python code when called
from the cProfile module. This means you can profile your Cython code
together with your Python code using the same tools as for Python code alone.

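Concretely, once a module is compiled with ``profile=True``, the standard
cProfile/pstats workflow applies unchanged. A minimal sketch (the plain
Python function below is just a stand-in for a function from a compiled
Cython module):

.. code-block:: python

   import cProfile
   import pstats

   def compute(n):
       # stand-in for a function from a Cython module that was
       # compiled with "# cython: profile=True"
       return sum(1.0 / k for k in range(1, n + 1))

   profiler = cProfile.Profile()
   profiler.enable()
   compute(100000)
   profiler.disable()

   # print the usual cProfile table, sorted by internal time
   pstats.Stats(profiler).sort_stats("tottime").print_stats()
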
Disabling profiling function-wise
---------------------------------

If your profiling results are distorted by the call overhead of some small
functions that you would rather not see in your profile - either because you
plan to inline them anyway or because you are sure that you cannot make them
any faster - you can use a special decorator to disable profiling for one
function only (regardless of whether it is globally enabled or not):

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/often_called.pyx

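The decorator in question is ``cython.profile(False)``. A sketch of such a
file (the shipped example may differ in detail):

.. code-block:: cython

   # cython: profile=True
   cimport cython

   @cython.profile(False)
   def often_called():
       # excluded from profiles even though profiling is
       # enabled for the rest of the module
       pass
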
Enabling line tracing
---------------------

To get more detailed trace information (for tools that can make use of it),
you can enable line tracing::

   # cython: linetrace=True

This will also enable profiling support, so the above ``profile=True`` option
is not needed.  Line tracing is needed for coverage analysis, for example.

Note that even if line tracing is enabled via the compiler directive, it is
not used by default.  As the runtime slowdown can be substantial, it must
additionally be compiled in by the C compiler by setting the C macro definition
``CYTHON_TRACE=1``.  To include nogil functions in the trace, set
``CYTHON_TRACE_NOGIL=1`` (which implies ``CYTHON_TRACE=1``).  C macros can be
defined either in the extension definition of the ``setup.py`` script or by
setting the respective distutils options in the source file with the following
file header comment (if ``cythonize()`` is used for compilation)::

   # distutils: define_macros=CYTHON_TRACE_NOGIL=1

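When using a ``setup.py`` script instead, the macro goes into the extension
definition. A sketch (``mymodule`` and its source file are placeholder names;
the commented lines indicate where ``cythonize()`` would be invoked):

.. code-block:: python

   from setuptools import Extension

   extensions = [
       Extension(
           "mymodule",                 # placeholder module name
           ["mymodule.pyx"],
           define_macros=[("CYTHON_TRACE_NOGIL", "1")],
       )
   ]

   # from Cython.Build import cythonize
   # setup(ext_modules=cythonize(
   #     extensions, compiler_directives={'linetrace': True}))
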
Enabling coverage analysis
--------------------------

Since Cython 0.23, line tracing (see above) also enables support for coverage
reporting with the `coverage.py <http://coverage.readthedocs.io/>`_ tool.
To make the coverage analysis understand Cython modules, you also need to enable
Cython's coverage plugin in your ``.coveragerc`` file as follows:

.. code-block:: ini

   [run]
   plugins = Cython.Coverage

With this plugin, your Cython source files should show up normally in the
coverage reports.

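A typical workflow then looks like this (``tests.py`` stands for whatever
script or test runner exercises your module):

.. code-block:: bash

   $ coverage run tests.py    # execute the code under coverage.py
   $ coverage report          # plain-text summary, Cython modules included
   $ coverage html            # or generate an HTML report
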
To include the coverage report in the Cython annotated HTML file, you need
to first run the coverage.py tool to generate an XML result file.  Pass
this file into the ``cython`` command as follows:

.. code-block:: bash

   $ cython  --annotate-coverage coverage.xml  package/mymodule.pyx

This will recompile the Cython module and generate one HTML output
file next to each Cython source file it processes, containing colour
markers for lines that were contained in the coverage report.


.. _profiling_tutorial:

Profiling Tutorial
==================

This will be a complete tutorial, start to finish, of profiling Python code,
turning it into Cython code and profiling it further until it is fast enough.

As a toy example, we would like to evaluate the summation of the reciprocals
of squares up to a certain integer :math:`n` in order to approximate
:math:`\pi`. The relation we want to use was proven by Euler in 1735 and is
known as the `Basel problem <https://en.wikipedia.org/wiki/Basel_problem>`_.

.. math::
   \pi^2 = 6 \sum_{k=1}^{\infty} \frac{1}{k^2} =
   6 \lim_{k \to \infty} \big( \frac{1}{1^2} +
         \frac{1}{2^2} + \dots + \frac{1}{k^2}  \big) \approx
   6 \big( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{n^2}  \big)

A simple Python code for evaluating the truncated sum looks like this:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi.py

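For reference, the example file boils down to the following (a sketch; the
shipped file may differ in detail):

.. code-block:: python

   # calc_pi.py -- sketch of the example module

   def recip_square(i):
       return 1. / i ** 2

   def approx_pi(n=10000000):
       val = 0.
       for k in range(1, n + 1):
           val += recip_square(k)
       return (6 * val) ** .5
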
On my box, this needs approximately 4 seconds to run the function with the
default n. The higher we choose n, the better the approximation of
:math:`\pi` will be. An experienced Python programmer will already see plenty
of places to optimize this code. But remember the golden rule of optimization:
Never optimize without having profiled. Let me repeat this: **Never** optimize
without having profiled your code. Your thoughts about which part of your
code takes too much time are wrong. At least, mine are always wrong. So let's
write a short script to profile our code:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile.py

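The script follows this pattern (shown here with an inline stand-in for the
``calc_pi`` import, so that the snippet is self-contained):

.. code-block:: python

   import cProfile
   import pstats

   def approx_pi(n=100000):
       # inline stand-in for "import calc_pi"; the real script
       # profiles calc_pi.approx_pi() instead
       return (6 * sum(1. / k ** 2 for k in range(1, n + 1))) ** .5

   # run the call under the profiler and dump the data to a file
   cProfile.runctx("approx_pi()", globals(), locals(), "Profile.prof")

   # load the dump and print a table sorted by internal time
   stats = pstats.Stats("Profile.prof")
   stats.strip_dirs().sort_stats("time").print_stats()
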
Running this on my box gives the following output:

.. code-block:: none

   Sat Nov  7 17:40:54 2009    Profile.prof

            10000004 function calls in 6.211 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    3.243    3.243    6.211    6.211 calc_pi.py:7(approx_pi)
    10000000    2.526    0.000    2.526    0.000 calc_pi.py:4(recip_square)
           1    0.442    0.442    0.442    0.442 {range}
           1    0.000    0.000    6.211    6.211 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This tells us that the code ran in 6.2 CPU seconds. Note that the code got
about 2 seconds slower because it ran inside the cProfile module. The table
contains the really valuable information.  You might want to check the
Python `profiling documentation <https://docs.python.org/library/profile.html>`_
for the nitty gritty details. The most important columns here are tottime
(total time spent in this function **not** counting functions that were
called by this function) and cumtime (total time spent in this function
**also** counting the functions called by this function). Looking at the
tottime column, we see that approximately half the time is spent in approx_pi
and the other half is spent in recip_square. Also half a second is spent in
range ... of course we should have used xrange for such a big iteration. And
in fact, just changing range to xrange makes the code run in 5.8 seconds.

We could optimize a lot in the pure Python version, but since we are
interested in Cython, let's move forward and bring this module to Cython. We
would do this anyway at some point to get the loop to run faster. Here is our
first Cython version:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_2.pyx

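The typed version reads roughly as follows (a sketch based on the description
above; the shipped example may differ in detail):

.. code-block:: cython

   # cython: profile=True

   def recip_square(int i):
       return 1. / i ** 2

   def approx_pi(int n=10000000):
       cdef double val = 0.
       cdef int k
       for k in range(1, n + 1):
           val += recip_square(k)
       return (6 * val) ** .5
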
Note the first line: we have to tell Cython that profiling should be enabled.
This makes the Cython code slightly slower, but without this we would not get
meaningful output from the cProfile module. The rest of the code is mostly
unchanged; I only typed some variables, which will likely speed things up a
bit.

We also need to modify our profiling script to import the Cython module
directly. Here is the complete version adding the import of the
:ref:`Pyximport<pyximport>` module:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/profile_2.py

We only added two lines; the rest stays completely the same. Alternatively,
we could also manually compile our code into an extension; we wouldn't need
to change the profile script then at all. The script now outputs the
following:

.. code-block:: none

   Sat Nov  7 18:02:33 2009    Profile.prof

            10000004 function calls in 4.406 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    3.305    3.305    4.406    4.406 calc_pi.pyx:7(approx_pi)
    10000000    1.101    0.000    1.101    0.000 calc_pi.pyx:4(recip_square)
           1    0.000    0.000    4.406    4.406 {calc_pi.approx_pi}
           1    0.000    0.000    4.406    4.406 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

We gained 1.8 seconds. Not too shabby. Comparing the output to the previous
run, we see that the recip_square function got faster while the approx_pi
function has not changed much. Let's concentrate on the recip_square function
a bit more. First, note that this function is not meant to be called from
code outside of our module, so it would be wise to turn it into a cdef
function to reduce call overhead. We should also get rid of the power
operator: it is turned into a pow(i,2) function call by Cython, whereas
simply writing i*i could be faster. The whole function is also a good
candidate for inlining.  Let's look at the necessary changes for these ideas:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_3.pyx

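Put together, the three changes look roughly like this (again a sketch of the
example file):

.. code-block:: cython

   # cython: profile=True

   cdef inline double recip_square(int i):
       return 1. / (i * i)

   def approx_pi(int n=10000000):
       cdef double val = 0.
       cdef int k
       for k in range(1, n + 1):
           val += recip_square(k)
       return (6 * val) ** .5
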
Now running the profile script yields:

.. code-block:: none

   Sat Nov  7 18:10:11 2009    Profile.prof

            10000004 function calls in 2.622 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    1.782    1.782    2.622    2.622 calc_pi.pyx:7(approx_pi)
    10000000    0.840    0.000    0.840    0.000 calc_pi.pyx:4(recip_square)
           1    0.000    0.000    2.622    2.622 {calc_pi.approx_pi}
           1    0.000    0.000    2.622    2.622 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

That bought us another 1.8 seconds. Not the dramatic change we might have
expected. And why is recip_square still in this table; it is supposed to be
inlined, isn't it?  The reason is that Cython still generates profiling code
even if the function call is eliminated. Let's tell it not to profile
recip_square any more; we couldn't get the function to be much faster anyway:

.. literalinclude:: ../../examples/tutorial/profiling_tutorial/calc_pi_4.pyx

Running this shows an interesting result:

.. code-block:: none

   Sat Nov  7 18:15:02 2009    Profile.prof

            4 function calls in 0.089 CPU seconds

      Ordered by: internal time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.089    0.089    0.089    0.089 calc_pi.pyx:10(approx_pi)
           1    0.000    0.000    0.089    0.089 {calc_pi.approx_pi}
           1    0.000    0.000    0.089    0.089 <string>:1(<module>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

First, note the tremendous speed gain: this version takes only 1/50 of the
time of our first Cython version. Also note that recip_square has vanished
from the table, as we wanted. But the most peculiar and important change is
that approx_pi also got much faster. This is a problem with all profiling:
calling a function in a profile run adds a certain overhead to the function
call. This overhead is **not** added to the time spent in the called
function, but to the time spent in the **calling** function. In this example,
approx_pi didn't really need 2.622 seconds in the last run; rather, it called
recip_square 10000000 times, each call taking a little time to set up
profiling. This adds up to a massive time loss of around 2.6 seconds. Having
disabled profiling for the often called function now reveals realistic
timings for approx_pi; we could continue optimizing it now if needed.

This concludes our profiling tutorial. There is still some room for
improvement in this code. We could try to replace the power operator in
approx_pi with a call to sqrt from the C stdlib; but this is not necessarily
faster than calling pow(x,0.5).

Even so, the result we achieved here is quite satisfactory: we came up with a
solution that is much faster than our original Python version while retaining
functionality and readability.