High Performance Configuration
==============================

NIC
---

One of the major dependencies for Suricata's performance is the Network
Interface Card. There are many vendors and possibilities. Some NICs have, and
require, their own specific instructions and tools for setting up the NIC,
which ensure the greatest benefit when running Suricata. Vendors like
Napatech, Netronome, Accolade, Myricom include those tools and documentation
as part of their sources.

For Intel, Mellanox and commodity NICs the suggestions below can be utilized.

It is recommended that the latest available stable NIC drivers are used. In
general, when changing the NIC settings, it is advisable to use the latest
``ethtool`` version. Some NICs ship with their own ``ethtool`` that is
recommended to be used. Here is an example of how to build and install
``ethtool`` from source if needed:

::

 wget https://mirrors.edge.kernel.org/pub/software/network/ethtool/ethtool-5.2.tar.xz
 tar -xf ethtool-5.2.tar.xz
 cd ethtool-5.2
 ./configure && make clean && make && make install
 /usr/local/sbin/ethtool --version

When doing high performance optimisation make sure ``irqbalance`` is off and
not running:

::

  service irqbalance stop

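On systemd based distributions the service can additionally be stopped and
disabled so that it does not come back after a reboot (a small sketch; the
service name is assumed to be ``irqbalance``):

::

  systemctl stop irqbalance
  systemctl disable irqbalance
  systemctl status irqbalance
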
Depending on the NIC's available queues (for example Intel's x710/i40 has 64
available per port/interface) the worker threads can be set up accordingly.
Usually the available queues can be seen by running:

::

 /usr/local/sbin/ethtool -l eth1

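The output has the following general form (illustrative values; the
``Combined`` lines show the maximum supported and the currently configured
number of queues):

::

 Channel parameters for eth1:
 Pre-set maximums:
 RX:             0
 TX:             0
 Other:          1
 Combined:       64
 Current hardware settings:
 RX:             0
 TX:             0
 Other:          1
 Combined:       1
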
Some NICs - generally lower end 1Gbps - do not support symmetric hashing, see
:doc:`packet-capture`. On those systems, due to considerations for out of
order packets, the following setup with af-packet is suggested (the example
below uses ``eth1``):

::

 /usr/local/sbin/ethtool -L eth1 combined 1

Then set up af-packet with the desired number of worker threads ``threads: auto``
(auto by default will use the number of CPUs available) and
``cluster-type: cluster_flow`` (also the default setting), as in the sketch below.

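A minimal af-packet snippet in suricata.yaml for that case could look like
this (``eth1`` as in the example above; the ``cluster-id`` is just an example
value):

::

 af-packet:
   - interface: eth1
     threads: auto
     cluster-id: 99
     cluster-type: cluster_flow
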
For higher end systems/NICs a better and more performant solution could be
utilizing the NIC itself a bit more. x710/i40 and similar Intel NICs or
Mellanox MT27800 Family [ConnectX-5] for example can easily be set up to do
a bigger chunk of the work using more RSS queues and symmetric hashing in order
to allow for increased performance on the Suricata side by using af-packet
with ``cluster-type: cluster_qm`` mode. In that mode with af-packet all packets
linked by the network card to an RSS queue are sent to the same socket. Below
is an example of a suggested config set up based on a 16 core, single CPU/NUMA
node socket system using x710:

::

 rmmod i40e && modprobe i40e
 ifconfig eth1 down
 /usr/local/sbin/ethtool -L eth1 combined 16
 /usr/local/sbin/ethtool -K eth1 rxhash on
 /usr/local/sbin/ethtool -K eth1 ntuple on
 ifconfig eth1 up
 /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
 /usr/local/sbin/ethtool -A eth1 rx off
 /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
 /usr/local/sbin/ethtool -G eth1 rx 1024

The commands above can be reviewed in detail in the ``ethtool`` help or man
pages. In brief the sequence makes sure the NIC is reset, the number of RSS
queues is set to 16, load balancing is enabled for the NIC, a low entropy
Toeplitz key is inserted to allow for symmetric hashing, receive flow control
(pause frames) is disabled, adaptive interrupt coalescing is disabled for the
lowest possible latency and, last but not least, the ring rx descriptor size
is set to 1024.
Make sure the RSS hash function is Toeplitz:

::

 /usr/local/sbin/ethtool -X eth1 hfunc toeplitz

Let the NIC balance as much as possible:

::

 for proto in tcp4 udp4 tcp6 udp6; do
    /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
 done

In some cases:

::

 /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sd

might be enough or even better, depending on the type of traffic. However, not
all NICs allow it. ``sd`` specifies the multi queue hashing algorithm of the
NIC (for the particular protocol) to use the src IP and dst IP only. ``sdfn``
allows the tuple src IP, dst IP, src port, dst port to be used for the
hashing algorithm.
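
The applied key, indirection table and configured hash fields can be verified
afterwards with read-only ``ethtool`` queries, for example:

::

 /usr/local/sbin/ethtool -x eth1                    # RSS hash key/function and indirection table
 /usr/local/sbin/ethtool -n eth1 rx-flow-hash tcp4  # hash fields used for tcp4
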
In the af-packet section of suricata.yaml:

::

 af-packet:
  - interface: eth1
    threads: 16
    cluster-id: 99
    cluster-type: cluster_qm
    ...
    ...

CPU affinity and NUMA
---------------------

Intel based systems
~~~~~~~~~~~~~~~~~~~

If the system has more than one NUMA node there are some more possibilities.
In those cases it is generally recommended to use as many worker threads as
there are CPU cores available/possible - from the same NUMA node. The example
below uses a 72 core machine with the sniffing NIC that Suricata uses located
on NUMA node 1. In such 2 socket configurations it is recommended to have
Suricata and the sniffing NIC running on and residing on the second NUMA node,
as by default CPU 0 is widely used by many services in Linux. In a case where
this is not possible it is recommended that (via the cpu affinity config
section in suricata.yaml and the irq affinity script for the NIC) CPU 0 is
never used.

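The NUMA node a NIC is attached to, and the CPU ranges per node, can be
checked quickly via sysfs and ``lscpu`` (assuming ``eth1`` is the sniffing
interface):

::

 cat /sys/class/net/eth1/device/numa_node
 lscpu | grep NUMA
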
In the case below 36 worker threads are used out of NUMA node 1's CPUs,
together with the af-packet runmode and ``cluster-type: cluster_qm``.

If the CPU's NUMA setup is as follows:

::

    lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              72
    On-line CPU(s) list: 0-71
    Thread(s) per core:  2
    Core(s) per socket:  18
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               79
    Model name:          Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
    Stepping:            1
    CPU MHz:             1199.724
    CPU max MHz:         3600.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4589.92
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            256K
    L3 cache:            46080K
    NUMA node0 CPU(s):   0-17,36-53
    NUMA node1 CPU(s):   18-35,54-71

It is recommended that 36 worker threads are used and the NIC set up could be
as follows:

::

    rmmod i40e && modprobe i40e
    ifconfig eth1 down
    /usr/local/sbin/ethtool -L eth1 combined 36
    /usr/local/sbin/ethtool -K eth1 rxhash on
    /usr/local/sbin/ethtool -K eth1 ntuple on
    ifconfig eth1 up
    ./set_irq_affinity local eth1
    /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 36
    /usr/local/sbin/ethtool -A eth1 rx off tx off
    /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
    /usr/local/sbin/ethtool -G eth1 rx 1024
    for proto in tcp4 udp4 tcp6 udp6; do
        echo "/usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn"
        /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
    done

In the example above the ``set_irq_affinity`` script is used from the NIC
driver's sources.
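
The resulting IRQ to CPU mapping can be verified afterwards, for example (a
small sketch; interrupt naming depends on the driver):

::

 grep eth1 /proc/interrupts
 for irq in $(awk '/eth1/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
     echo -n "IRQ $irq -> CPUs "
     cat /proc/irq/$irq/smp_affinity_list
 done
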
In the cpu affinity section of suricata.yaml config:

::

 # Suricata is multi-threaded. Here the threading can be influenced.
 threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ "1-10" ]  # include only these CPUs in affinity settings
    - receive-cpu-set:
        cpu: [ "0-10" ]  # include only these CPUs in affinity settings
    - worker-cpu-set:
        cpu: [ "18-35", "54-71" ]
        mode: "exclusive"
        prio:
          low: [ 0 ]
          medium: [ "1" ]
          high: [ "18-35","54-71" ]
          default: "high"

In the af-packet section of suricata.yaml config:

::

  - interface: eth1
    # Number of receive threads. "auto" uses the number of cores
    threads: 18
    cluster-id: 99
    cluster-type: cluster_qm
    defrag: no
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576
  - interface: eth1
    # Number of receive threads. "auto" uses the number of cores
    threads: 18
    cluster-id: 99
    cluster-type: cluster_qm
    defrag: no
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576

That way 36 worker threads can be mapped (18 per af-packet interface section)
in total to NUMA node 1's CPU range - 18-35,54-71. That mapping is done via
the ``worker-cpu-set`` affinity settings. ``ring-size`` and ``block-size`` in
the config section above are decent default values to start with. Those can be
further adjusted if needed as explained in :doc:`tuning-considerations`.
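
Whether traffic is actually being spread across all RSS queues can be sanity
checked from the NIC statistics (per-queue counter names vary per driver; the
grep below is only a starting point):

::

 /usr/local/sbin/ethtool -S eth1 | grep -iE 'rx.*packets'
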

AMD based systems
~~~~~~~~~~~~~~~~~

Another example can be using an AMD based system where the architecture and
design of the system itself plus the NUMA nodes' interaction is different as
it is based on the HyperTransport (HT) technology. In that case per NUMA node
thread pinning/locking would not be needed. The example below shows a
suggestion for such a configuration utilising af-packet,
``cluster-type: cluster_flow``. The Mellanox NIC is located on NUMA 0.

The CPU setup is as follows:

::

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                128
    On-line CPU(s) list:   0-127
    Thread(s) per core:    2
    Core(s) per socket:    32
    Socket(s):             2
    NUMA node(s):          8
    Vendor ID:             AuthenticAMD
    CPU family:            23
    Model:                 1
    Model name:            AMD EPYC 7601 32-Core Processor
    Stepping:              2
    CPU MHz:               1200.000
    CPU max MHz:           2200.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              4391.55
    Virtualization:        AMD-V
    L1d cache:             32K
    L1i cache:             64K
    L2 cache:              512K
    L3 cache:              8192K
    NUMA node0 CPU(s):     0-7,64-71
    NUMA node1 CPU(s):     8-15,72-79
    NUMA node2 CPU(s):     16-23,80-87
    NUMA node3 CPU(s):     24-31,88-95
    NUMA node4 CPU(s):     32-39,96-103
    NUMA node5 CPU(s):     40-47,104-111
    NUMA node6 CPU(s):     48-55,112-119
    NUMA node7 CPU(s):     56-63,120-127

The ``ethtool``, ``show_irq_affinity.sh`` and ``set_irq_affinity_cpulist.sh``
tools are provided with the official driver sources.
Set up the NIC, including offloading and load balancing:

::

 ifconfig eno6 down
 /opt/mellanox/ethtool/sbin/ethtool -L eno6 combined 15
 /opt/mellanox/ethtool/sbin/ethtool -K eno6 rxhash on
 /opt/mellanox/ethtool/sbin/ethtool -K eno6 ntuple on
 ifconfig eno6 up
 /sbin/set_irq_affinity_cpulist.sh 1-7,64-71 eno6
 /opt/mellanox/ethtool/sbin/ethtool -X eno6 hfunc toeplitz
 /opt/mellanox/ethtool/sbin/ethtool -X eno6 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A

In the example above (1-7,64-71 for the irq affinity) CPU 0 is skipped as it
is usually used by default on Linux systems by many applications/tools.
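
The resulting IRQ affinity can be reviewed with the ``show_irq_affinity.sh``
script mentioned above (its location and exact usage depend on how the
Mellanox driver tools were installed; the invocation below is an assumption
for this setup):

::

 /sbin/show_irq_affinity.sh eno6
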
Let the NIC balance as much as possible:

::

 for proto in tcp4 udp4 tcp6 udp6; do
    /opt/mellanox/ethtool/sbin/ethtool -N eno6 rx-flow-hash $proto sdfn
 done

In the cpu affinity section of suricata.yaml config:

::

 # Suricata is multi-threaded. Here the threading can be influenced.
 threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ "120-127" ]  # include only these cpus in affinity settings
    - receive-cpu-set:
        cpu: [ 0 ]  # include only these cpus in affinity settings
    - worker-cpu-set:
        cpu: [ "8-55" ]
        mode: "exclusive"
        prio:
          high: [ "8-55" ]
          default: "high"

In the af-packet section of suricata.yaml config:

::

  - interface: eno6
    # Number of receive threads. "auto" uses the number of cores
    threads: 48 # 48 worker threads on cpus "8-55" above
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: no
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576


In the example above there are 15 RSS queues pinned to cores 1-7,64-71 on NUMA
node 0 and 48 worker threads using other CPUs on different NUMA nodes. The
reason CPU 0 is skipped in this setup is that on Linux systems it is very
common for CPU 0 to be used by default by many tools/services. The NIC itself
in this config is positioned on NUMA 0, so starting with 15 RSS queues on that
NUMA node and keeping those CPUs away from other tools in the system could
offer the best advantage.

.. note:: Performance and optimization of the whole system can be affected by regular NIC driver and package/kernel upgrades, so it should be monitored regularly and tested out in QA/test environments first. As a general suggestion it is always recommended to run the latest stable firmware and drivers as instructed and provided by the particular NIC vendor.

Other considerations
~~~~~~~~~~~~~~~~~~~~

Another advanced option to consider is the ``isolcpus`` kernel boot parameter,
which is a way of isolating CPU cores from the general system processes and
the scheduler. That ensures total dedication of those CPUs/ranges to the
Suricata process only.

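As an illustrative sketch matching the Intel example above, the worker CPU
range could be isolated via the kernel command line (on a GRUB based system;
exact files and the regeneration command differ per distribution):

::

 # /etc/default/grub
 GRUB_CMDLINE_LINUX="... isolcpus=18-35,54-71"
 # then regenerate the GRUB config (e.g. grub2-mkconfig or update-grub) and reboot
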
``stream.wrong_thread`` / ``tcp.pkt_on_wrong_thread`` are counters available
in ``stats.log`` or ``eve.json`` as ``event_type: stats`` that indicate issues
with the load balancing. The causes could also be related to the traffic
itself or to the NIC settings. If the counter values are very high or keep
increasing heavily, it is recommended to experiment with a different load
balancing method, either via the NIC or for example using XDP/eBPF. There is
an open issue, https://redmine.openinfosecfoundation.org/issues/2725, that is
a placeholder for feedback and findings.

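A quick way to keep an eye on those counters (a sketch; log locations and
exact field paths depend on the local configuration and Suricata version):

::

 grep wrong_thread /var/log/suricata/stats.log | tail -n 5
 tail -n 1000 /var/log/suricata/eve.json | \
    jq 'select(.event_type=="stats") | .stats.tcp.pkt_on_wrong_thread'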