High Performance Configuration
==============================

NIC
---

One of the major dependencies for Suricata's performance is the Network
Interface Card. There are many vendors and possibilities. Some NICs have and
require their own specific instructions and tools on how to set up the NIC.
This ensures the greatest benefit when running Suricata. Vendors like
Napatech, Netronome, Accolade and Myricom include those tools and
documentation as part of their sources.

For Intel, Mellanox and commodity NICs the suggestions below could be
utilized.

It is recommended that the latest available stable NIC drivers are used. In
general when changing the NIC settings it is advisable to use the latest
``ethtool`` version. Some NICs ship with their own ``ethtool`` that is
recommended to be used. Here is an example of how to set up ``ethtool``
if needed:

::

  wget https://mirrors.edge.kernel.org/pub/software/network/ethtool/ethtool-5.2.tar.xz
  tar -xf ethtool-5.2.tar.xz
  cd ethtool-5.2
  ./configure && make clean && make && make install
  /usr/local/sbin/ethtool --version

When doing high performance optimisation make sure ``irqbalance`` is off and
not running:

::

  service irqbalance stop

Depending on the NIC's available queues (for example Intel's x710/i40 has 64
available per port/interface) the worker threads can be set up accordingly.
Usually the available queues can be seen by running:

::

  /usr/local/sbin/ethtool -l eth1

Some NICs - generally lower end 1Gbps - do not support symmetric hashing, see
:doc:`packet-capture`. On those systems, due to considerations for out of
order packets, the following setup with af-packet is suggested (the example
below uses ``eth1``):

::

  /usr/local/sbin/ethtool -L eth1 combined 1

Then set up af-packet with the desired number of worker threads
``threads: auto`` (auto will by default use the number of CPUs available) and
``cluster-type: cluster_flow`` (also the default setting).
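For reference, a minimal af-packet snippet for this lower end case could look
as follows in suricata.yaml (the interface name and ``cluster-id`` are just
illustrative values):

::

  af-packet:
    - interface: eth1
      threads: auto
      cluster-id: 99
      cluster-type: cluster_flow
      defrag: yes
      use-mmap: yes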
For higher end systems/NICs a better and more performant solution could be
utilizing the NIC itself a bit more. x710/i40 and similar Intel NICs or
Mellanox MT27800 Family [ConnectX-5] for example can easily be set up to do
a bigger chunk of the work using more RSS queues and symmetric hashing,
allowing for increased performance on the Suricata side by using af-packet
with ``cluster-type: cluster_qm`` mode. In that mode all packets linked by
the network card to an RSS queue are sent to the same af-packet socket. Below
is an example of a suggested config based on a 16 core, single CPU/NUMA node
socket system using an x710:

::

  rmmod i40e && modprobe i40e
  ifconfig eth1 down
  /usr/local/sbin/ethtool -L eth1 combined 16
  /usr/local/sbin/ethtool -K eth1 rxhash on
  /usr/local/sbin/ethtool -K eth1 ntuple on
  ifconfig eth1 up
  /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
  /usr/local/sbin/ethtool -A eth1 rx off
  /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
  /usr/local/sbin/ethtool -G eth1 rx 1024

The commands above can be reviewed in detail in the help or manpages of
``ethtool``. In brief the sequence makes sure the NIC is reset, the number of
RSS queues is set to 16, load balancing is enabled for the NIC, a low entropy
Toeplitz key is inserted to allow for symmetric hashing, receive offloading is
disabled, the adaptive control is disabled for lowest possible latency and,
last but not least, the ring rx descriptor size is set to 1024.

Make sure the RSS hash function is Toeplitz:

::

  /usr/local/sbin/ethtool -X eth1 hfunc toeplitz

Let the NIC balance as much as possible:

::

  for proto in tcp4 udp4 tcp6 udp6; do
     /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
  done

In some cases:

::

  /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sd

might be enough or even better depending on the type of traffic. However, not
all NICs allow it. The ``sd`` specifies the multi queue hashing algorithm of
the NIC (for the particular proto) to use src IP and dst IP only. The ``sdfn``
allows the tuple src IP, dst IP, src port, dst port to be used for the hashing
algorithm.

In the af-packet section of suricata.yaml:

::

  af-packet:
    - interface: eth1
      threads: 16
      cluster-id: 99
      cluster-type: cluster_qm
      ...
      ...

CPU affinity and NUMA
---------------------

Intel based systems
~~~~~~~~~~~~~~~~~~~

If the system has more than one NUMA node there are some more possibilities.
In those cases it is generally recommended to use as many worker threads as
CPU cores available/possible - from the same NUMA node. The example below uses
a 72 core machine where the sniffing NIC that Suricata uses is located on NUMA
node 1. In such 2 socket configurations it is recommended to have Suricata and
the sniffing NIC running and residing on the second NUMA node, as by default
CPU 0 is widely used by many services in Linux. In a case where this is not
possible it is recommended that (via the cpu affinity config section in
suricata.yaml and the irq affinity script for the NIC) CPU 0 is never used.

In the case below 36 worker threads are used out of NUMA node 1's CPUs, with
the af-packet runmode and ``cluster-type: cluster_qm``.
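Which NUMA node a given NIC is attached to can be double checked via sysfs
before pinning anything, for example (``eth1`` is again just the example
interface name):

::

  cat /sys/class/net/eth1/device/numa_node

A value of ``0`` or ``1`` identifies the NUMA node; ``-1`` means the platform
does not report a NUMA affinity for that device.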
If the CPU's NUMA set up is as follows:

::

  lscpu
  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
  CPU(s):                72
  On-line CPU(s) list:   0-71
  Thread(s) per core:    2
  Core(s) per socket:    18
  Socket(s):             2
  NUMA node(s):          2
  Vendor ID:             GenuineIntel
  CPU family:            6
  Model:                 79
  Model name:            Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
  Stepping:              1
  CPU MHz:               1199.724
  CPU max MHz:           3600.0000
  CPU min MHz:           1200.0000
  BogoMIPS:              4589.92
  Virtualization:        VT-x
  L1d cache:             32K
  L1i cache:             32K
  L2 cache:              256K
  L3 cache:              46080K
  NUMA node0 CPU(s):     0-17,36-53
  NUMA node1 CPU(s):     18-35,54-71

It is recommended that 36 worker threads are used and the NIC set up could be
as follows:

::

  rmmod i40e && modprobe i40e
  ifconfig eth1 down
  /usr/local/sbin/ethtool -L eth1 combined 36
  /usr/local/sbin/ethtool -K eth1 rxhash on
  /usr/local/sbin/ethtool -K eth1 ntuple on
  ifconfig eth1 up
  ./set_irq_affinity local eth1
  /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 36
  /usr/local/sbin/ethtool -A eth1 rx off tx off
  /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
  /usr/local/sbin/ethtool -G eth1 rx 1024
  for proto in tcp4 udp4 tcp6 udp6; do
     echo "/usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn"
     /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
  done

In the example above the ``set_irq_affinity`` script from the NIC driver's
sources is used.

In the cpu affinity section of the suricata.yaml config (note that
``set-cpu-affinity: yes`` is needed for the cpu-affinity settings to take
effect):

::

  # Suricata is multi-threaded. Here the threading can be influenced.
  threading:
    set-cpu-affinity: yes
    cpu-affinity:
      - management-cpu-set:
          cpu: [ "1-10" ]  # include only these CPUs in affinity settings
      - receive-cpu-set:
          cpu: [ "0-10" ]  # include only these CPUs in affinity settings
      - worker-cpu-set:
          cpu: [ "18-35", "54-71" ]
          mode: "exclusive"
          prio:
            low: [ 0 ]
            medium: [ "1" ]
            high: [ "18-35","54-71" ]
            default: "high"

In the af-packet section of the suricata.yaml config:

::

  - interface: eth1
    # Number of receive threads. "auto" uses the number of cores
    threads: 18
    cluster-id: 99
    cluster-type: cluster_qm
    defrag: no
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576
  - interface: eth1
    # Number of receive threads. "auto" uses the number of cores
    threads: 18
    cluster-id: 99
    cluster-type: cluster_qm
    defrag: no
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576

That way 36 worker threads can be mapped (18 per af-packet interface entry) in
total onto NUMA node 1's CPU range - 18-35,54-71. That part is done via the
``worker-cpu-set`` affinity settings. ``ring-size`` and ``block-size`` in the
config section above are decent default values to start with. Those can be
further adjusted if needed as explained in :doc:`tuning-considerations`.
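Once Suricata is running, the actual placement of the worker threads can be
spot checked, for example like this (thread names may differ slightly between
Suricata versions):

::

  ps -T -p $(pidof suricata) -o spid,psr,comm | grep "W#"

The ``psr`` column shows the CPU each worker thread currently runs on and,
with ``mode: "exclusive"``, should stay within the ``worker-cpu-set`` range
(18-35,54-71 in this example).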
AMD based systems
~~~~~~~~~~~~~~~~~

Another example can be using an AMD based system where the architecture and
design of the system itself plus the NUMA nodes' interaction is different as
it is based on the HyperTransport (HT) technology. In that case pinning the
threads per NUMA node is not needed. The example below shows a suggestion for
such a configuration utilising af-packet with ``cluster-type: cluster_flow``.
The Mellanox NIC is located on NUMA node 0.

The CPU set up is as follows:

::

  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
  CPU(s):                128
  On-line CPU(s) list:   0-127
  Thread(s) per core:    2
  Core(s) per socket:    32
  Socket(s):             2
  NUMA node(s):          8
  Vendor ID:             AuthenticAMD
  CPU family:            23
  Model:                 1
  Model name:            AMD EPYC 7601 32-Core Processor
  Stepping:              2
  CPU MHz:               1200.000
  CPU max MHz:           2200.0000
  CPU min MHz:           1200.0000
  BogoMIPS:              4391.55
  Virtualization:        AMD-V
  L1d cache:             32K
  L1i cache:             64K
  L2 cache:              512K
  L3 cache:              8192K
  NUMA node0 CPU(s):     0-7,64-71
  NUMA node1 CPU(s):     8-15,72-79
  NUMA node2 CPU(s):     16-23,80-87
  NUMA node3 CPU(s):     24-31,88-95
  NUMA node4 CPU(s):     32-39,96-103
  NUMA node5 CPU(s):     40-47,104-111
  NUMA node6 CPU(s):     48-55,112-119
  NUMA node7 CPU(s):     56-63,120-127

The ``ethtool``, ``show_irq_affinity.sh`` and ``set_irq_affinity_cpulist.sh``
tools are provided with the official driver sources.

Set up the NIC, including offloading and load balancing:

::

  ifconfig eno6 down
  /opt/mellanox/ethtool/sbin/ethtool -L eno6 combined 15
  /opt/mellanox/ethtool/sbin/ethtool -K eno6 rxhash on
  /opt/mellanox/ethtool/sbin/ethtool -K eno6 ntuple on
  ifconfig eno6 up
  /sbin/set_irq_affinity_cpulist.sh 1-7,64-71 eno6
  /opt/mellanox/ethtool/sbin/ethtool -X eno6 hfunc toeplitz
  /opt/mellanox/ethtool/sbin/ethtool -X eno6 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A

In the example above (1-7,64-71 for the irq affinity) CPU 0 is skipped as it
is usually used by default on Linux systems by many applications/tools.

Let the NIC balance as much as possible (using the same interface, ``eno6``,
and the driver provided ``ethtool``):

::

  for proto in tcp4 udp4 tcp6 udp6; do
     /opt/mellanox/ethtool/sbin/ethtool -N eno6 rx-flow-hash $proto sdfn
  done

In the cpu affinity section of the suricata.yaml config:

::

  # Suricata is multi-threaded. Here the threading can be influenced.
  threading:
    set-cpu-affinity: yes
    cpu-affinity:
      - management-cpu-set:
          cpu: [ "120-127" ]  # include only these CPUs in affinity settings
      - receive-cpu-set:
          cpu: [ 0 ]  # include only these CPUs in affinity settings
      - worker-cpu-set:
          cpu: [ "8-55" ]
          mode: "exclusive"
          prio:
            high: [ "8-55" ]
            default: "high"

In the af-packet section of the suricata.yaml config:

::

  - interface: eno6
    # Number of receive threads. "auto" uses the number of cores
    threads: 48 # 48 worker threads on cpus "8-55" above
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: no
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576

In the example above there are 15 RSS queues pinned to cores 1-7,64-71 on NUMA
node 0 and 48 worker threads using other CPUs on different NUMA nodes. The
reason why CPU 0 is skipped in this set up is that on Linux systems it is very
common for CPU 0 to be used by default by many tools/services. The NIC itself
in this config is positioned on NUMA node 0, so starting with 15 RSS queues on
that NUMA node and keeping those CPUs off limits for other tools in the system
could offer the best advantage.
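After applying the settings, the channel count and the IRQ placement can be
sanity checked, for example (the exact paths of the driver provided tools may
differ per installation):

::

  /opt/mellanox/ethtool/sbin/ethtool -l eno6    # should report 15 combined channels
  grep eno6 /proc/interrupts                    # per queue interrupt counters per CPU
  show_irq_affinity.sh eno6                     # IRQ to CPU mapping set earlier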
.. note:: Performance and optimization of the whole system can be affected by
   regular NIC driver and pkg/kernel upgrades so it should be monitored
   regularly and tested out in QA/test environments first. As a general
   suggestion it is always recommended to run the latest stable firmware and
   drivers as instructed and provided by the particular NIC vendor.

Other considerations
~~~~~~~~~~~~~~~~~~~~

Another advanced option to consider is the ``isolcpus`` kernel boot parameter,
which is a way of isolating CPU cores from the general system scheduler and
processes. That way total dedication of those CPUs/ranges to the Suricata
process only can be ensured.

``stream.wrong_thread`` / ``tcp.pkt_on_wrong_thread`` are counters available
in ``stats.log`` or ``eve.json`` as ``event_type: stats`` that indicate issues
with the load balancing. The causes could also be related to the traffic or to
the NIC settings. In the case of very high or heavily increasing counter
values it is recommended to experiment with a different load balancing method,
either via the NIC or for example using XDP/eBPF. There is an open issue,
https://redmine.openinfosecfoundation.org/issues/2725, that is a placeholder
for feedback and findings.
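A quick way to keep an eye on those counters is to query the periodic stats
output, for example (assuming EVE JSON output is enabled and ``jq`` is
installed; field paths may vary slightly between Suricata versions):

::

  # from stats.log
  grep -E "wrong_thread" stats.log | tail -5

  # from eve.json stats events
  tail -n 1000 eve.json | jq 'select(.event_type=="stats") | .stats.tcp.pkt_on_wrong_thread'

Ideally the counters stay at 0 or grow only marginally compared to the overall
packet counters.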