<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements.  See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership.  The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License.  You may obtain a copy of the License at -->

<!---   http://www.apache.org/licenses/LICENSE-2.0 -->

<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied.  See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

# MXNet Operator Performance Benchmarks

A Python utility for benchmarking and profiling individual MXNet operator execution.

With this utility, for each MXNet operator you can get the following details:

**Timing**
1. Forward execution time
2. Backward execution time

**Memory**
1. Average and max memory allocated

NOTE: This is the `pool memory`. It does not reflect the exact memory requested by the operator.

# Motivation

Benchmarks are usually done end-to-end for a given network architecture. For example: ResNet-50 benchmarks on ImageNet data. This is a good measurement of the overall performance and health of a deep learning framework. However, it is important to note the following factors:
1. Users rely on many operators that are not part of a standard network like ResNet, e.g. tensor manipulation operators such as mean, max, topk, argmax, and sort.
2. A standard network architecture like ResNet-50 is made up of many operators, e.g. Convolution2D, Softmax, Dense, and more. Consider the following scenarios:
    1. Suppose we improve the performance of the Convolution2D operator but, due to a bug, Softmax performance degrades. End-to-end benchmarks may still look fine, so we can miss the degradation of a single operator, which can accumulate and become hard to trace.
    2. You need to see which operator in a given network takes the most time and plan optimization work accordingly. With end-to-end benchmarks, it is hard to get such fine-grained numbers at the operator level.
3. We need to know how different operators perform on different hardware infrastructure (e.g. CPU with MKLDNN, GPU with NVIDIA CUDA and cuDNN). With these details, we can plan optimization work at the operator level, which can significantly boost end-to-end performance.
4. You may want nightly performance tests across all operators in a deep learning framework to catch regressions early.
5. We can integrate this framework with a CI/CD system to run per-operator performance tests for PRs. Example: when a PR modifies the kernel of TransposeConv2D, we can run benchmarks of the TransposeConv2D operator to verify performance.

Hence, this utility provides the functionality to let users and developers of deep learning frameworks easily run benchmarks for individual operators.

# How to use

## Prerequisites

Provided you have MXNet installed (any version >= 1.5.1), all you need to do to use the opperf utility is add the path to your cloned MXNet repository to the PYTHONPATH.

Note:
To install MXNet, refer to the [Installing MXNet page](https://mxnet.apache.org/versions/master/install/index.html)

```
export PYTHONPATH=$PYTHONPATH:/path/to/incubator-mxnet/
```
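
You can quickly verify the setup with an import check (a minimal sketch; it only confirms that the `benchmark.opperf` package is importable from the path added above):

```
python -c "from benchmark.opperf.utils.benchmark_utils import run_performance_test; print('opperf is ready')"
```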

## Usecase 1 - Run benchmarks for all the operators

The command below runs benchmarks for all MXNet (NDArray) operators with default inputs and saves the final result as JSON in the given file.

```
python incubator-mxnet/benchmark/opperf/opperf.py --output-format json --output-file mxnet_operator_benchmark_results.json
```

**Other Supported Options:**

1. **output-format** : `json` or `md` for markdown file output.

2. **ctx** : `cpu` or `gpu`. By default, `cpu` on a CPU machine and `gpu(0)` on a GPU machine. You can override this and set the global context for all operator benchmarks. Example: `--ctx gpu(2)`.

3. **dtype** : By default, `float32`. You can override this and set the global dtype for all operator benchmarks. Example: `--dtype float64`. These options can be combined, as shown below.
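
For example, to run all operator benchmarks on GPU with `float64` and save a markdown report (the output file name here is illustrative):

```
python incubator-mxnet/benchmark/opperf/opperf.py --ctx gpu --dtype float64 --output-format md --output-file mxnet_operator_benchmark_results.md
```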

## Usecase 2 - Run benchmarks for all the operators in a specific category

For example, suppose you want to run benchmarks for all NDArray broadcast binary operators (broadcast_add, broadcast_mod, broadcast_pow, etc.). Just run the following Python script.

```
#!/usr/bin/python
from benchmark.opperf.nd_operations.binary_operators import run_mx_binary_broadcast_operators_benchmarks

# Run all binary broadcast operations benchmarks with default input values
print(run_mx_binary_broadcast_operators_benchmarks())
```

Output for the above benchmark run, on a CPU machine, would look something like this:

```
{'broadcast_mod': [{'avg_time_forward_broadcast_mod': 28.7063, 'avg_time_mem_alloc_cpu/0': 4194.3042,
                    'avg_time_backward_broadcast_mod': 12.0954, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}},
                   {'avg_time_forward_broadcast_mod': 2.7332, 'avg_time_mem_alloc_cpu/0': 400.0,
                    'avg_time_backward_broadcast_mod': 1.1288, 'inputs': {'lhs': (10000, 10), 'rhs': (10000, 10)}},
                   {'avg_time_forward_broadcast_mod': 30.5322, 'avg_time_mem_alloc_cpu/0': 4000.0,
                    'avg_time_backward_broadcast_mod': 225.0255, 'inputs': {'lhs': (10000, 1), 'rhs': (10000, 100)}}],
 'broadcast_power': [{'avg_time_backward_broadcast_power': 49.5871, 'avg_time_forward_broadcast_power': 18.0954,
                      'avg_time_mem_alloc_cpu/0': 4194.3042, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}},
                     {'avg_time_backward_broadcast_power': 4.6623, 'avg_time_forward_broadcast_power': 1.8283,
                      'avg_time_mem_alloc_cpu/0': 400.0, 'inputs': {'lhs': (10000, 10), 'rhs': (10000, 10)}},
                     {'avg_time_backward_broadcast_power': 279.922, 'avg_time_forward_broadcast_power': 24.4621,
                      'avg_time_mem_alloc_cpu/0': 4000.0, 'inputs': {'lhs': (10000, 1), 'rhs': (10000, 100)}}],
.....
.....
```
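
Since the benchmarks return a plain Python dict (operator name mapped to a list of per-input results), you can post-process it directly. A minimal sketch, with key names taken from the sample output above:

```
#!/usr/bin/python
from benchmark.opperf.nd_operations.binary_operators import run_mx_binary_broadcast_operators_benchmarks

results = run_mx_binary_broadcast_operators_benchmarks()

# Print the average forward time for each (operator, input) combination
for op_name, op_runs in results.items():
    for run in op_runs:
        fwd_times = [v for k, v in run.items() if k.startswith('avg_time_forward')]
        print(op_name, run['inputs'], fwd_times[0])
```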

## Usecase 3 - Run benchmarks for a specific operator
For example, suppose you want to run benchmarks for the `nd.add` operator in MXNet. Just run the following Python script.

```
#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=10, runs=25)
```

Output for the above benchmark run, on a CPU machine, would look something like this:

```
{'add': [{'avg_time_mem_alloc_cpu/0': 102760.4453,
          'avg_time_forward_broadcast_add': 4.0372,
          'avg_time_backward_broadcast_add': 5.3841,
          'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}}]}
```
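
Since `inputs` is a list, you can also benchmark the same operator over several input shapes in a single call (the shapes below are illustrative); one result entry is produced per input dict:

```
#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

# Benchmark nd.add over multiple input shapes in one call
add_res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)},
                                       {"lhs": (10000, 10), "rhs": (10000, 10)}],
                               warmup=10, runs=25)
print(add_res)
```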

## Usecase 4 - Run benchmarks for a group of operators with the same input
For example, suppose you want to run benchmarks for the `nd.add` and `nd.subtract` operators in MXNet with the same set of inputs. Just run the following Python script.

```
#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test([nd.add, nd.subtract], run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=10, runs=25)
```

Output for the above benchmark run, on a CPU machine, would look something like this:

```
{'add': [{'avg_time_mem_alloc_cpu/0': 102760.4453,
          'avg_time_forward_broadcast_add': 4.0372,
          'avg_time_backward_broadcast_add': 5.3841,
          'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}}],
 'subtract': [{'avg_time_forward_broadcast_sub': 5.5137,
               'avg_time_mem_alloc_cpu/0': 207618.0469,
               'avg_time_backward_broadcast_sub': 7.2976,
               'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}}
             ]}
```

# How does it work under the hood?

Under the hood, opperf executes each NDArray operator on randomly generated data and uses the MXNet profiler to get a summary of the operator execution, covering:
1. Memory
2. Computation time (forward, backward)
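
For reference, here is a minimal sketch of that profiler-based flow using MXNet's public profiler API (the operator and shapes are chosen for illustration; opperf's actual implementation lives under `utils/`):

```
#!/usr/bin/python
import mxnet as mx

# Enable the MXNet profiler and aggregate per-operator statistics
mx.profiler.set_config(profile_all=True, aggregate_stats=True)
mx.profiler.set_state('run')

# Execute an operator on randomly generated data
lhs = mx.nd.random.uniform(shape=(1024, 1024))
rhs = mx.nd.random.uniform(shape=(1024, 1024))
out = mx.nd.add(lhs, rhs)
mx.nd.waitall()  # block until asynchronous execution finishes

mx.profiler.set_state('stop')
# Dump the aggregated summary (operator-level time and memory)
print(mx.profiler.dumps())
```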

See the design proposal document for more details - https://cwiki.apache.org/confluence/display/MXNET/MXNet+Operator+Benchmarks

**NOTE:**

This utility queries the MXNet operator registry to fetch all operators registered with MXNet, generates inputs, and runs the benchmarks.
However, fully automated tests are enabled only for simpler operators such as broadcast and element-wise operators. For readability and to give users more control, complex operators such as convolution (2D, 3D), pooling, and recurrent operators are not fully automated but expressed as default rules.
See `utils/op_registry_utils.py` for more details.

## Use python timer
Optionally, you can use Python's `time` package as the profiler engine to measure the runtime of each operator.
To use the python timer for all operators, pass the argument `--profiler 'python'`:
```
python incubator-mxnet/benchmark/opperf/opperf.py --profiler='python'
```

To use the python timer for a specific operator, pass the `profiler` argument to the `run_performance_test` method:
```
#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test([nd.add, nd.subtract], run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=10, runs=25, profiler='python')
```
By default, the MXNet profiler is used as the profiler engine.

# TODO

All contributions are welcome. Below is the list of desired features:

1. Cover all MXNet operators.
2. Enhance the MXNet profiler with additional APIs to programmatically fetch and process profiler data.
3. Integration with a CI/CD system to run operator benchmarks for PR builds and nightly builds.
4. Dashboards and other modes of presenting results for analyzing and planning tasks such as operator performance improvements.
5. Randomized tensor shape generation for profiling to identify bottlenecks in the operators.