:doc:`/index`

.. _optimizations_label:

Optimizations
=============

If you are dealing with large nested objects and ignore_order=True, chances are DeepDiff takes a while to calculate the diff. Here are some tips that may help you with optimization and progress reporting.
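
For instance, here is a minimal sketch of turning on the progress report and reading the stats afterwards, assuming the log_frequency_in_sec progress parameter and using made-up t1/t2 placeholders (substitute your own objects):

    >>> from deepdiff import DeepDiff
    >>> t1 = [{'id': i, 'values': list(range(50))} for i in range(200)]   # placeholder data
    >>> t2 = [{'id': i, 'values': list(range(1, 51))} for i in range(200)]
    >>> # log a progress line roughly every 10 seconds while the diff runs
    >>> diff = DeepDiff(t1, t2, ignore_order=True, log_frequency_in_sec=10)
    >>> stats = diff.get_stats()  # number of passes, diffs and cache hits that took place
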


Max Passes
----------

:ref:`max_passes_label` comes with a default of 10000000.
If you don't need to pinpoint the difference exactly and can get away with a less granular report, you can reduce the number of passes. It is recommended to get a diff of your objects with the default max_passes and take a look at the stats by running :ref:`get_stats_label` before deciding to reduce this number. In many cases reducing this number does not yield faster results.

A new pass is started each time 2 iterables are compared in a way that every single item that is different in the first iterable is compared to every single item that is different in the second iterable.
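
For example, here is a minimal sketch of capping the passes and then checking whether the cap was hit (t1 and t2 are made-up placeholders):

    >>> from deepdiff import DeepDiff
    >>> t1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    >>> t2 = [[3, 2, 0], [6, 5, 0], [9, 8, 0]]
    >>> diff = DeepDiff(t1, t2, ignore_order=True, max_passes=2)  # cap the number of passes
    >>> diff.get_stats()['MAX PASS LIMIT REACHED']  # True if DeepDiff stopped starting new passes
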

.. _max_diffs_label:

Max Diffs
---------

max_diffs: Integer, default = None
    max_diffs defines the maximum number of diffs to run on objects to pinpoint what exactly is different. This is only used when ignore_order=True. Every time 2 individual items are compared, a diff is counted. The default value of None means there is no limit on the number of diffs that will take place. Any positive integer makes DeepDiff stop the calculations upon reaching that max_diffs count.

You can run diffs and then :ref:`get_stats_label` to see how many diffs and passes have happened.

    >>> from deepdiff import DeepDiff
    >>> diff = DeepDiff(1, 2)
    >>> diff
    {'values_changed': {'root': {'new_value': 2, 'old_value': 1}}}
    >>> diff.get_stats()
    {'PASSES COUNT': 0, 'DIFF COUNT': 1, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False}
    >>> diff = DeepDiff([[1, 2]], [[2, 3, 1]])
    >>> diff.get_stats()
    {'PASSES COUNT': 0, 'DIFF COUNT': 8, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False}
    >>> diff = DeepDiff([[1, 2]], [[2, 3, 1]], ignore_order=True)
    >>> diff.get_stats()
    {'PASSES COUNT': 3, 'DIFF COUNT': 6, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False}

.. note::
    Compare :ref:`max_diffs_label` with :ref:`max_passes_label`.
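
Along the same lines, here is a minimal sketch of capping the number of individual diffs (the objects are made-up placeholders; in practice, pick the cap based on the stats above):

    >>> from deepdiff import DeepDiff
    >>> t1 = [[1, 2], [3, 4], [5, 6]]
    >>> t2 = [[2, 3, 1], [4, 5, 3], [6, 7, 5]]
    >>> diff = DeepDiff(t1, t2, ignore_order=True, max_diffs=10)  # stop after 10 individual comparisons
    >>> diff.get_stats()['MAX DIFF LIMIT REACHED']  # True if the cap stopped the calculation early
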


.. _cache_size_label:

Cache Size
----------

cache_size : int >= 0, default=0
    Cache size to be used to improve the performance. A cache size of zero means the cache is disabled.
    Using cache_size can dramatically improve the diff performance, especially for nested objects, at the cost of more memory usage. However, if the cache hit rate is very low, having a cache actually reduces the performance.
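
For example, a rough sketch of enabling the cache and checking whether it is paying off (placeholder data; benchmark with your own objects to pick a good size):

    >>> from deepdiff import DeepDiff
    >>> t1 = [{'x': i, 'y': list(range(50))} for i in range(200)]
    >>> t2 = [{'x': i, 'y': list(range(1, 51))} for i in range(200)]
    >>> diff = DeepDiff(t1, t2, ignore_order=True, cache_size=5000)
    >>> diff.get_stats()['DISTANCE CACHE HIT COUNT']  # if this stays near zero, the cache is not helping
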

**************
Cache Examples
**************

For example, let's take a look at the performance of benchmark_deeply_nested_a in the `DeepDiff-Benchmark repo <https://github.com/seperman/deepdiff-benchmark/blob/master/benchmark.py>`_.

No Cache
^^^^^^^^

With the no cache option we have the following stats:

    {'PASSES COUNT': 11234, 'DIFF COUNT': 107060, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 10}

Yes, it has taken 10 seconds to do the diff!

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=0

   cache_size=0

Cache Size 500
^^^^^^^^^^^^^^

With a cache size of 500, we are doing the same diff in 2.5 seconds! And the memory usage has not changed; it is still hovering around 100 MB.

    {'PASSES COUNT': 3960, 'DIFF COUNT': 19469, 'DISTANCE CACHE HIT COUNT': 11847, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 2}

As you can see, the number of passes and diffs has gone down while the distance cache hit count has gone up.

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=500__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=500

   cache_size=500


Cache Size 500 and Cache Tuning Sample Size 500
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With a cache size of 500, we set the :ref:`cache_tuning_sample_size_label` to 500 as well. We get a slight improvement: the same diff now runs in 2 seconds. The memory usage has not changed; it is still hovering around 100 MB.

    {'PASSES COUNT': 3960, 'DIFF COUNT': 19469, 'DISTANCE CACHE HIT COUNT': 11847, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 2}

As you can see, in this case none of the stats have changed compared to the previous run.

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=500__cache_tuning_sample_size=500__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=500 cache_tuning_sample_size=500

   cache_size=500 cache_tuning_sample_size=500


Cache Size of 5000
^^^^^^^^^^^^^^^^^^

Let's pay a little attention to our stats, particularly to 'DISTANCE CACHE HIT COUNT': 11847 and the fact that the memory usage has not changed so far. What if we bump the cache_size to 5000 and disable cache_tuning_sample_size?

    {'PASSES COUNT': 1486, 'DIFF COUNT': 6637, 'DISTANCE CACHE HIT COUNT': 3440, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 0}

We get the result calculated in under 1 second! And the memory usage is only slightly above 100 MB.

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=5000__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=5000

   cache_size=5000



.. _cache_tuning_sample_size_label:

Cache Tuning Sample Size
------------------------

cache_tuning_sample_size : int >= 0, default = 0
    cache_tuning_sample_size is an experimental feature. It works hand in hand with the :ref:`cache_size_label`. When cache_tuning_sample_size is set to anything above zero, DeepDiff samples the cache usage with the given sample size and decides whether to keep using the cache, turning it back on occasionally during the diffing process. This option can be useful if you are not sure whether you need a cache at all. However, you will usually gain better performance by keeping this parameter at zero and benchmarking your diff with different cache sizes to find the optimal cache size.

.. note::
    A good starting point for cache_tuning_sample_size is to set it to the size of your cache.
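
As a minimal sketch of combining the two parameters, following the note above (placeholder objects again):

    >>> from deepdiff import DeepDiff
    >>> t1 = [{'k': i, 'v': list(range(40))} for i in range(300)]
    >>> t2 = [{'k': i, 'v': list(range(2, 42))} for i in range(300)]
    >>> # let DeepDiff sample the cache usage and switch the cache off if it is not being hit
    >>> diff = DeepDiff(t1, t2, ignore_order=True, cache_size=500, cache_tuning_sample_size=500)
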


.. _diffing_numbers_optimizations_label:

Optimizations for Diffing Numbers
---------------------------------

If you are diffing lists of Python numbers, you can get a performance improvement just by installing Numpy. DeepDiff will use Numpy behind the scenes to improve the performance.
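
For example, the sketch below diffs two lists of floats; the call is identical whether or not Numpy is installed, since DeepDiff picks the faster code path behind the scenes (the data here is made up):

    >>> from deepdiff import DeepDiff
    >>> t1 = [i * 1.5 for i in range(2000)]
    >>> t2 = [i * 1.5 + 0.1 for i in range(2000)]
    >>> diff = DeepDiff(t1, t2, ignore_order=True)
    >>> diff.get_stats()['DIFF COUNT']  # compare this stat with and without Numpy installed
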

For example, let's take a look at the performance of benchmark_array_no_numpy vs. benchmark_numpy_array in the `DeepDiff-Benchmark repo <https://github.com/seperman/deepdiff-benchmark/blob/master/benchmark.py>`_.

In this specific test, we have 2 lists of numbers that have nothing in common: `mat1 <https://github.com/seperman/deepdiff-benchmark/blob/master/data/mat1.txt>`_ and `mat2 <https://github.com/seperman/deepdiff-benchmark/blob/master/data/mat2.txt>`_.

No Cache and No Numpy
^^^^^^^^^^^^^^^^^^^^^

With the no cache option and no Numpy installed we have the following stats:

    {'PASSES COUNT': 1, 'DIFF COUNT': 439944, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 30}

Yes, it has taken 30 seconds to do the diff!

.. figure:: _static/benchmark_array_no_numpy__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=0 and no Numpy

   cache_size=0 and no Numpy

Cache Size 10000 and No Numpy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What if we increase the cache size to 10000?

    {'PASSES COUNT': 1, 'DIFF COUNT': 439944, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 35}

Not only did it not help, it increased the diff time by more than 15%!

Worse, if you look at the stats you will see that the cache hit count is zero. This happened because the 2 lists have nothing in common, and hence caching the results does not improve the performance.


.. figure:: _static/benchmark_array_no_numpy__3.8__ignore_order=True__cache_size=10000__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=10000 and no Numpy

   cache_size=10000 and no Numpy

No Cache and Numpy
^^^^^^^^^^^^^^^^^^

Let's install Numpy now, set cache_size=0, and run the diff again.

Yay, the same diff is done in 5 seconds!

    {'PASSES COUNT': 1, 'DIFF COUNT': 1348, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 5}

As you can see, the memory usage has gone up from around 500 MB to around 630 MB.

.. figure:: _static/benchmark_numpy_array__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: Numpy but no cache

   Numpy but no cache


Pypy
----

If you are diffing big blobs of data that do not mainly consist of numbers, you may gain some performance improvement by running DeepDiff on Pypy3 instead of CPython.

For example, let's take a look at the performance of benchmark_big_jsons in the `DeepDiff-Benchmark repo <https://github.com/seperman/deepdiff-benchmark/blob/master/benchmark.py>`_.

First we will run it on CPython 3.8.

It takes around 17.5 seconds and 40 MB of memory:

.. figure:: _static/benchmark_big_jsons__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__max_diffs=300000__max_passes=40000__cutoff_intersection_for_pairs=1.png
   :alt: Nested blob of text diffed in Python3.8

   Nested blob of text diffed in Python3.8

Then we run it on Pypy3.6-7.3.0. It now takes 12 seconds but uses around 110 MB of memory.

.. figure:: _static/benchmark_big_jsons__pypy3.6__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__max_diffs=300000__max_passes=40000__cutoff_intersection_for_pairs=1.png
   :alt: Nested blob of text diffed in Pypy3.6-7.3.0

   Nested blob of text diffed in Pypy3.6-7.3.0

.. note::
    Note that if you are diffing numbers and have Numpy installed as recommended, CPython will have better performance than Pypy. But if you are diffing blobs of mixed strings and some numbers, Pypy will have better CPU performance but worse memory usage.


Cutoff Intersection For Pairs
-----------------------------

:ref:`cutoff_intersection_for_pairs_label`, which is only used when ignore_order=True, can have a huge effect on the granularity of the results and on the performance. A value of zero essentially stops DeepDiff from doing passes, while a value of 1 forces DeepDiff to do passes on iterables even when they are very different. Running passes is an expensive operation.
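
A minimal sketch of trading granularity for speed with this parameter (the values and data below are illustrative, not recommendations):

    >>> from deepdiff import DeepDiff
    >>> t1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
    >>> t2 = [[1, 2, 3, 10], [5, 6, 7, 12]]
    >>> # a lower cutoff means fewer expensive passes and a coarser report
    >>> coarse = DeepDiff(t1, t2, ignore_order=True, cutoff_intersection_for_pairs=0.5)
    >>> granular = DeepDiff(t1, t2, ignore_order=True, cutoff_intersection_for_pairs=1)
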

As an example of how much this parameter can affect the results in deeply nested objects, please take a look at :ref:`distance_and_diff_granularity_label`.

.. _cache_purge_level:

Cache Purge Level
-----------------

cache_purge_level: int, 0, 1, or 2. default=1
    cache_purge_level defines what objects in DeepDiff should be deleted to free the memory once the diff object is calculated. A value of zero preserves everything, including the cache and hashes that were calculated during the diff calculations. A value of 1, the default, purges the cache but preserves all the functionalities of the diff object. A value of 2 removes most of the functionality of the diff object and releases the most memory. In most cases the user does not need those objects to remain in the diff object, except for investigation purposes.
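
As a rough sketch, when memory matters more than being able to dig into the diff object afterwards, you can raise the purge level and keep only the plain result mapping (placeholder data; the dict(diff) step is only for illustration):

    >>> from deepdiff import DeepDiff
    >>> t1 = [{'a': list(range(100))} for _ in range(100)]
    >>> t2 = [{'a': list(range(1, 101))} for _ in range(100)]
    >>> diff = DeepDiff(t1, t2, ignore_order=True, cache_purge_level=2)  # free as much memory as possible
    >>> result = dict(diff)  # keep the report as a plain mapping
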


Back to :doc:`/index`