:doc:`/index`

.. _optimizations_label:

Optimizations
=============

If you are dealing with large nested objects and ignore_order=True, chances are DeepDiff takes a while to calculate the diff. Here are some tips that may help you with optimizations and progress reporting.


Max Passes
----------

:ref:`max_passes_label` comes with a default of 10000000.
If you don't need to pinpoint the difference exactly and can get away with a less granular report, you can reduce the number of passes. It is recommended to first diff your objects with the default max_passes and look at the stats by running :ref:`get_stats_label` before deciding to reduce this number. In many cases reducing this number does not yield faster results.

A new pass is started each time 2 iterables are compared in a way that every single item that is different in the first iterable is compared to every single item that is different in the second iterable.

.. _max_diffs_label:

Max Diffs
---------

max_diffs: Integer, default = None
    max_diffs defines the maximum number of diffs to run on objects to pinpoint what exactly is different. This is only used when ignore_order=True. Every time 2 individual items are compared, a diff is counted. The default value of None means there is no limit on the number of diffs that will take place. Setting it to any positive integer makes DeepDiff stop doing the calculations upon reaching that max_diffs count.

You can run diffs and then :ref:`get_stats_label` to see how many diffs and passes have happened.


    >>> from deepdiff import DeepDiff
    >>> diff = DeepDiff(1, 2)
    >>> diff
    {'values_changed': {'root': {'new_value': 2, 'old_value': 1}}}
    >>> diff.get_stats()
    {'PASSES COUNT': 0, 'DIFF COUNT': 1, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False}
    >>> diff = DeepDiff([[1, 2]], [[2, 3, 1]])
    >>> diff.get_stats()
    {'PASSES COUNT': 0, 'DIFF COUNT': 8, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False}
    >>> # Passes are only counted when ignore_order=True
    >>> diff = DeepDiff([[1, 2]], [[2, 3, 1]], ignore_order=True)
    >>> diff.get_stats()
    {'PASSES COUNT': 3, 'DIFF COUNT': 6, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False}

.. note::
    Compare :ref:`max_diffs_label` with :ref:`max_passes_label`.


.. _cache_size_label:

Cache Size
----------

cache_size : int >= 0, default=0
    Cache size to be used to improve the performance. A cache size of zero means it is disabled.
    Using the cache_size can dramatically improve the diff performance, especially for nested objects, at the cost of more memory usage. However, if the cache hit rate is very low, having a cache actually reduces the performance.

**************
Cache Examples
**************

For example, let's take a look at the performance of benchmark_deeply_nested_a in the `DeepDiff-Benchmark repo <https://github.com/seperman/deepdiff-benchmark/blob/master/benchmark.py>`_.

No Cache
^^^^^^^^

With no cache we have the following stats:

    {'PASSES COUNT': 11234, 'DIFF COUNT': 107060, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 10}

Yes, it took 10 seconds to do the diff!

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=0

   cache_size=0

Cache Size 500
^^^^^^^^^^^^^^

With a cache size of 500, the same diff takes 2.5 seconds, and the memory usage has not changed. It is still hovering around 100MB.

    {'PASSES COUNT': 3960, 'DIFF COUNT': 19469, 'DISTANCE CACHE HIT COUNT': 11847, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 2}

As you can see, the number of passes and the diff count have gone down, while the distance cache hit count has gone up.

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=500__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=500

   cache_size=500


Cache Size 500 and Cache Tuning Sample Size 500
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Keeping the cache size at 500, we set the :ref:`cache_tuning_sample_size_label` to 500 too. This gives a slight improvement: the same diff now takes 2 seconds. The memory usage has not changed and is still hovering around 100MB.

    {'PASSES COUNT': 3960, 'DIFF COUNT': 19469, 'DISTANCE CACHE HIT COUNT': 11847, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 2}

As you can see, in this case none of the stats have changed compared to the previous run.

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=500__cache_tuning_sample_size=500__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=500 cache_tuning_sample_size=500

   cache_size=500 cache_tuning_sample_size=500


Cache Size of 5000
^^^^^^^^^^^^^^^^^^

Let's pay a little attention to our stats, particularly to 'DISTANCE CACHE HIT COUNT': 11847 and the fact that the memory usage has not changed so far.
What if we bump the cache_size to 5000 and disable cache_tuning_sample_size?

    {'PASSES COUNT': 1486, 'DIFF COUNT': 6637, 'DISTANCE CACHE HIT COUNT': 3440, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 0}

The result is calculated in under 1 second, and the memory usage is only slightly above 100MB.

.. figure:: _static/benchmark_deeply_nested_a__3.8__ignore_order=True__cache_size=5000__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=5000

   cache_size=5000



.. _cache_tuning_sample_size_label:

Cache Tuning Sample Size
------------------------

cache_tuning_sample_size : int >= 0, default = 0
    cache_tuning_sample_size is an experimental feature. It works hand in hand with the :ref:`cache_size_label`. When cache_tuning_sample_size is set to anything above zero, DeepDiff samples the cache usage with the passed sample size and decides whether to use the cache or not. If the cache is turned off, it is occasionally turned back on during the diffing process. This option can be useful if you are not sure whether you need any cache. However, you will gain much better performance by keeping this parameter at zero, running your diff with different cache sizes, and benchmarking to find the optimal cache size.

.. note::
    A good start with cache_tuning_sample_size is to set it to the size of your cache.


.. _diffing_numbers_optimizations_label:

Optimizations for Diffing Numbers
---------------------------------

If you are diffing lists of Python numbers, you could get a performance improvement just by installing NumPy. DeepDiff will use NumPy to improve the performance behind the scenes.

For example, let's take a look at the performance of benchmark_array_no_numpy vs. benchmark_numpy_array in the `DeepDiff-Benchmark repo <https://github.com/seperman/deepdiff-benchmark/blob/master/benchmark.py>`_.


In this specific test, we have 2 lists of numbers that have nothing in common: `mat1 <https://github.com/seperman/deepdiff-benchmark/blob/master/data/mat1.txt>`_ and `mat2 <https://github.com/seperman/deepdiff-benchmark/blob/master/data/mat2.txt>`_.

No Cache and No NumPy
^^^^^^^^^^^^^^^^^^^^^

With no cache and NumPy not installed we have the following stats:

    {'PASSES COUNT': 1, 'DIFF COUNT': 439944, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 30}

Yes, it took 30 seconds to do the diff!

.. figure:: _static/benchmark_array_no_numpy__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=0 and no Numpy

   cache_size=0 and no NumPy

Cache Size 10000 and No NumPy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What if we increase the cache size to 10000?

    {'PASSES COUNT': 1, 'DIFF COUNT': 439944, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 35}

Not only did it not help, it increased the diff time by 15%!

Worse, if you look at the stats you see that the cache hit count is zero. This is because the 2 lists have no items in common, so caching the results does not improve the performance.


.. figure:: _static/benchmark_array_no_numpy__3.8__ignore_order=True__cache_size=10000__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: cache_size=10000 and no Numpy

   cache_size=10000 and no NumPy

No Cache and NumPy
^^^^^^^^^^^^^^^^^^

Let's install NumPy now, set cache_size=0 and run the diff again.

Yay, the same diff is done in 5 seconds!


    {'PASSES COUNT': 1, 'DIFF COUNT': 1348, 'DISTANCE CACHE HIT COUNT': 0, 'MAX PASS LIMIT REACHED': False, 'MAX DIFF LIMIT REACHED': False, 'DURATION SEC': 5}

As you can see, the memory usage has gone up from around 500MB to around 630MB.

.. figure:: _static/benchmark_numpy_array__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__cutoff_intersection_for_pairs=1.png
   :alt: Numpy but no cache

   NumPy but no cache


PyPy
----

If you are diffing big blobs of data that do not mainly consist of numbers, you may gain some performance improvement by running DeepDiff on PyPy3 instead of CPython.

For example, let's take a look at the performance of benchmark_big_jsons in the `DeepDiff-Benchmark repo <https://github.com/seperman/deepdiff-benchmark/blob/master/benchmark.py>`_.

First we run it on CPython 3.8. It takes around 17.5 seconds and 40MB of memory:

.. figure:: _static/benchmark_big_jsons__3.8__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__max_diffs=300000__max_passes=40000__cutoff_intersection_for_pairs=1.png
   :alt: Nested blob of text diffed in Python3.8

   Nested blob of text diffed in Python3.8

Then we run it on PyPy3.6-7.3.0. It now takes 12 seconds but around 110MB of memory.

.. figure:: _static/benchmark_big_jsons__pypy3.6__ignore_order=True__cache_size=0__cache_tuning_sample_size=0__max_diffs=300000__max_passes=40000__cutoff_intersection_for_pairs=1.png
   :alt: Nested blob of text diffed in Pypy3.6-7.3.0

   Nested blob of text diffed in PyPy3.6-7.3.0

.. note::
    If you are diffing numbers and have NumPy installed as recommended, CPython will perform better than PyPy. But if you are diffing blobs of mixed strings and some numbers, PyPy will have better CPU performance and worse memory usage.

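
To compare interpreters on your own data, a small timing harness along these lines may help. It is only a sketch: ``time_call``, ``describe_interpreter`` and ``run_benchmark`` are hypothetical helper names, the sample data is made up, and a single wall-clock measurement is much coarser than the CPU and memory profiles shown in the figures.

.. code-block:: python

    import platform
    import time


    def time_call(func, *args, **kwargs):
        """Call func once and return (result, elapsed wall-clock seconds)."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        return result, time.perf_counter() - start


    def describe_interpreter():
        """Return a label such as 'CPython 3.8.10' or 'PyPy 3.6.9'."""
        return "{} {}".format(
            platform.python_implementation(), platform.python_version()
        )


    def run_benchmark():
        """Time one ignore_order diff on made-up data and print the stats."""
        # Imported here so the helpers above only need the standard library.
        from deepdiff import DeepDiff

        # Made-up sample data: mostly strings with a few numbers.
        t1 = [
            {"id": i, "name": "item-%d" % i, "tags": [str(i), str(i + 1)]}
            for i in range(100)
        ]
        # Same items in reverse order, each with a changed name.
        t2 = [dict(item, name=item["name"].upper()) for item in reversed(t1)]

        diff, seconds = time_call(DeepDiff, t1, t2, ignore_order=True)
        print(describe_interpreter(), "took %.2f seconds" % seconds)
        print(diff.get_stats())

Running ``run_benchmark()`` under both CPython and PyPy, with your own objects in place of the sample data, gives a rough first comparison before reaching for a profiler.
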


Cutoff Intersection For Pairs
-----------------------------

:ref:`cutoff_intersection_for_pairs_label`, which is only used when ignore_order=True, can have a huge effect on the granularity of the results and on the performance. A value of zero essentially stops DeepDiff from doing passes, while a value of 1 forces DeepDiff to do passes on iterables even when they are very different. Running passes is an expensive operation.

As an example of how much this parameter can affect the results in deeply nested objects, please take a look at :ref:`distance_and_diff_granularity_label`.

.. _cache_purge_level:

Cache Purge Level
-----------------

cache_purge_level: int, 0, 1, or 2. default=1
    cache_purge_level defines what objects in DeepDiff should be deleted to free the memory once the diff object is calculated. If this value is set to zero, most of the functionality of the diff object is removed and the most memory is released. A value of 1 preserves all the functionalities of the diff object. A value of 2 also preserves the cache and hashes that were calculated during the diff calculations. In most cases the user does not need those objects to remain in the diff, except for investigation purposes.


Back to :doc:`/index`