.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD: head/share/man/man4/netmap.4 228017 2011-11-27 06:55:57Z gjb $
.\"
.Dd December 26, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the
.Nm
API
to send and receive raw packets through physical interfaces
or ports of the
.Xr vale 4
switch.
.Pp
.Xr vale 4
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
.Xr ioctl 2
is used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Xr select 2
and
.Xr poll 2 .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
.Xr socket 2 ,
.Xr bpf 4 ,
or
.Xr pcap 3 ) .
.Pp
For a list of devices with native
.Nm
support, see section
.Sx SUPPORTED INTERFACES
at the end of this manual page.
.Sh OPERATING THE API
.Nm
clients must first issue the following code to open the device
node and to bind the file descriptor to a specific interface or port:
.Bd -literal -offset indent
fd = open("/dev/netmap", O_RDWR);
ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Ed
.Pp
.Nm
has multiple modes of operation controlled by the
content of the
.Vt struct nmreq
passed to
.Xr ioctl 2 .
In particular, the
.Va nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Xr vale 4
switch, as indicated below.
Additional fields in the
.Vt struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Sy Interface name (e.g. 'em0', 'eth1', ...)
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Sy valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a
.Xr vale 4
switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not already exist.
.It Sy valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected or disconnected from the
.Xr vale 4
switch named XXX.
.Pp
In this case
.Xr ioctl 2
is used only for configuring the
.Xr vale 4
switch, typically through the
.Cm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be passed to
.Xr close 2
after issuing
.Xr ioctl 2 .
.El
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Xr close 2
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Xr mmap 2
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Xr ioctl 2
commands, whereas the file descriptor can be passed to
.Xr select 2
and
.Xr poll 2
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.In net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
.Bl -tag -width XXXX
.It Sy struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Nm
rings associated with the interface.
.Pp
.Vt struct netmap_if
is located in the shared memory region at the offset given by the
.Va nr_offset
field of the structure returned by
.Dv NIOCREGIF .
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface. */
    const u_int   ni_version;        /* API version */
    const u_int   ni_rx_rings;       /* number of rx ring pairs */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Sy struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Nm
slots describing the buffers.
.Va reserved
is used in receive rings to tell the kernel the number of slots before
.Va cur
that are still in use by the application and cannot yet be released.
.Pp
Each physical interface has one
.Vt struct netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;   /* see details */
    const uint32_t num_slots; /* number of slots in the ring */
    uint32_t       avail;     /* number of usable slots */
    uint32_t       cur;       /* 'current' read/write index */
    uint32_t       reserved;  /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002   /* set timestamp on *sync() */
#define NR_FORWARD   0x0004   /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008   /* set rx timestamp in slots */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call
.Va cur
indicates the first slot that can be used for transmissions, and
.Va avail
reports how many of them are available.
Before the next
.Nm Ns -related
system call on the file
descriptor, the application should fill buffers and
slots with data, and update
.Va cur
and
.Va avail
accordingly, as shown in the figure below:
.Bd -literal
             cur
              |----- avail ---|   (after syscall)
              v
    TX  [*****aaaaaaaaaaaaaaaaa**]
    TX  [*****TTTTTaaaaaaaaaaaa**]
                   ^
                   |-- avail --|   (before syscall)
                  cur
.Ed
.Pp
In receive rings, after a system call
.Va cur
indicates the first slot that contains a valid packet, and
.Va avail
reports how many of them are available.
Before the next
.Nm Ns -related
system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating
.Va cur
and
.Va avail
accordingly, as shown in the figure below.
Receive rings have an additional field called
.Va reserved
to indicate how many buffers before
.Va cur
cannot be released because they are still being processed.
.Bd -literal
                cur
           |-res-|-- avail --|   (after syscall)
                 v
    RX  [**rrrrrrRRRRRRRRRRRR******]
    RX  [**...........rrrrRRR******]
                      |res|--|<avail (before syscall)
                          ^
                         cur
.Ed
.It Sy struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index */
    uint16_t len;     /* packet length */
    uint16_t flags;   /* buf changed, etc. */
#define NS_BUF_CHANGED 0x0001 /* must resync, buffer changed */
#define NS_REPORT      0x0002 /* tell hw to report results,
                               * e.g. by generating an interrupt
                               */
#define NS_FORWARD     0x0004 /* pass packet to the other endpoint
                               * (host stack or device)
                               */
#define NS_NO_LEARN    0x0008
#define NS_INDIRECT    0x0010
#define NS_MOREFRAG    0x0020
#define NS_PORT_SHIFT  8
#define NS_PORT_MASK   (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot) (((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;     /* buffer address (indirect buffers) */
};
.Ed
.Pp
The flags control how the buffer associated with the slot
should be managed.
.It Sy packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
.El
.Pp
Addresses are computed through macros in order to
support access to objects in the shared memory region, e.g.:
.Bl -tag -width ".Fn NETMAP_BUF ring buf_idx"
.It Fn NETMAP_TXRING nifp i
Returns the address of the
.Va i Ns -th
transmit ring.
.It Fn NETMAP_RXRING nifp i
Returns the address of the
.Va i Ns -th
receive ring.
.It Fn NETMAP_BUF ring buf_idx
Returns the address of the buffer with index
.Va buf_idx
(which can be part of any ring for the given interface).
.El
.Ss FLAGS
Normally, buffers are associated with slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can, however, modify the mapping using the
following flags:
.Bl -tag -width ".Fn NS_RFRAGS slot"
.It Dv NS_BUF_CHANGED
indicates that the
.Va buf_idx
in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It Dv NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
may be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It Dv NS_FORWARD
When the device is opened in
.Sq transparent
mode, the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It Dv NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It Dv NS_INDIRECT
indicates that the packet's payload is not in the
.Nm Ns -supplied
buffer, but in a user-supplied buffer whose
user virtual address is in the
.Va ptr
field of the slot.
The size can reach 65535 bytes.
This is only supported on the transmit ring of virtual ports.
.It Dv NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag cleared.
The maximum length of a chain is 64 buffers.
This is only supported on virtual ports.
.It Fn NS_RFRAGS slot
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have
.Dv NS_MOREFRAG
set.
The length refers to the individual buffer;
there is no field for the total length.
.Pp
On transmit rings, if
.Dv NS_DST
is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
.Sh SYSTEM CALLS
.Nm
supports
.Xr ioctl 2
commands to synchronize the state of the rings
between the kernel and the user processes, as well as
to query and configure the interface.
The former do not require any argument, whereas the latter use a
.Vt struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
    char     nr_name[IFNAMSIZ];
    uint32_t nr_version;     /* API version */
#define NETMAP_API 4         /* current version */
    uint32_t nr_offset;      /* nifp offset in the shared region */
    uint32_t nr_memsize;     /* size of the shared region */
    uint32_t nr_tx_slots;    /* slots in tx rings */
    uint32_t nr_rx_slots;    /* slots in rx rings */
    uint16_t nr_tx_rings;    /* number of tx rings */
    uint16_t nr_rx_rings;    /* number of rx rings */
    uint16_t nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff  /* the actual ring number */
    uint16_t nr_cmd;
#define NETMAP_BDG_ATTACH     1  /* attach the NIC */
#define NETMAP_BDG_DETACH     2  /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG 3  /* register lookup function */
#define NETMAP_BDG_LIST       4  /* get bridge's info */
    uint16_t nr_arg1;
    uint16_t nr_arg2;
    uint32_t spare2[3];
};
.Ed
.Pp
A device descriptor obtained through
.Pa /dev/netmap
supports the
.Xr ioctl 2
command codes supported by network devices, as well as
specific command codes defined in
.In net/netmap.h .
These specific command codes are as follows:
.Bl -tag -width ".Dv NIOCTXSYNC"
.It Dv NIOCGINFO
returns
.Dv EINVAL
if the named device does not support
.Nm .
Otherwise, it returns zero and advisory information
about the interface.
Note that all the information below can change before the
interface is actually put into
.Nm
mode.
.Pp
.Va nr_memsize
indicates the size of the
.Nm
memory region.
Physical devices all share the same memory region, whereas
.Xr vale 4
ports may have independent regions for each port.
These sizes can be set through system-wide
.Xr sysctl 8
variables.
.Va nr_tx_slots
and
.Va nr_rx_slots
indicate the size of transmit and receive rings, respectively.
.Va nr_tx_rings
and
.Va nr_rx_rings
indicate the number of transmit and receive rings, respectively.
Both ring number and size may be configured at runtime
using interface-specific functions (e.g.\&
.Xr sysctl 8
on BSD, or
.Xr ethtool 8
on Linux).
.It Dv NIOCREGIF
puts the interface specified via
.Va nr_name
into
.Nm
mode, disconnecting it from the host stack, and/or defines which
rings are controlled through this file descriptor.
On return, it gives the same info as
.Dv NIOCGINFO ,
and
.Va nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for
.Va nr_ringid
are as follows:
.Bl -tag -width "Dv NETMAP_HW_RING + i"
.It 0
default; all hardware rings
.It Dv NETMAP_SW_RING
.Dq host rings
connecting to the host stack
.It Dv NETMAP_HW_RING + i
i-th hardware ring
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events were specified.
The feature can be disabled by OR-ing the flag
.Dv NETMAP_NO_TX_POLL
into
.Va nr_ringid .
Normally, you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if
.Dv NIOCTXSYNC
is called, or the send queue is full.
.Pp
.Dv NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) by setting the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields accordingly.
.It Dv NIOCTXSYNC
tells the hardware about new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware about consumed packets, and asks for newly available
packets.
.El
.Pp
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives; see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
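.Pp
As a minimal sketch of the
.Va nr_ringid
encoding described under
.Dv NIOCREGIF ,
the helpers below combine a ring selector with the optional
.Dv NETMAP_NO_TX_POLL
flag.
The constants are copied from the
.Vt struct nmreq
definition so that the fragment compiles on its own, without
.In net/netmap.h ;
the function names
.Fn ringid_hw ,
.Fn ringid_host
and
.Fn ringid_index
are hypothetical and are not part of the
.Nm
API.
.Bd -literal
```c
#include <stdint.h>

/* Constants copied from the struct nmreq definition in this page. */
#define NETMAP_HW_RING    0x4000  /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000  /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff   /* the actual ring number */

/*
 * Hypothetical helper: nr_ringid value selecting the i-th hardware ring.
 * OR-ing is equivalent to "NETMAP_HW_RING + i" because the ring number
 * is first confined to the low bits by NETMAP_RING_MASK.
 */
static uint16_t
ringid_hw(unsigned int i, int no_tx_poll)
{
    uint16_t id = NETMAP_HW_RING | (i & NETMAP_RING_MASK);

    if (no_tx_poll)
        id |= NETMAP_NO_TX_POLL;
    return (id);
}

/* Hypothetical helper: nr_ringid value selecting the host rings. */
static uint16_t
ringid_host(int no_tx_poll)
{
    uint16_t id = NETMAP_SW_RING;

    if (no_tx_poll)
        id |= NETMAP_NO_TX_POLL;
    return (id);
}

/* Hypothetical helper: extract the ring number from an nr_ringid value. */
static unsigned int
ringid_index(uint16_t id)
{
    return (id & NETMAP_RING_MASK);
}
```
.Ed
.Pp
A client would store the result in
.Va nr_ringid
before issuing
.Dv NIOCREGIF ;
a value of 0 keeps the default of binding all hardware rings.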
.Sh EXAMPLES
The following code implements a traffic generator:
.Bd -literal
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <net/if.h>
#include <net/netmap_user.h>

#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>

int
main(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct pollfd fds;
    struct nmreq nmr;
    void *p;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    if (fd < 0)
        return (1);
    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    if (ioctl(fd, NIOCREGIF, &nmr) < 0)
        return (1);
    p = mmap(0, nmr.nr_memsize, PROT_WRITE | PROT_READ,
        MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return (1);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;

    for (;;) {
        /* wait until transmit slots are available */
        poll(&fds, 1, -1);
        for (; ring->avail > 0; ring->avail--) {
            uint32_t i;
            void *buf;

            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            /* prepare packet in buf */
            ring->slot[i].len = 0; /* set to the packet length */
            ring->cur = NETMAP_RING_NEXT(ring, i);
        }
    }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
and
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Rs
.%A Luigi Rizzo
.%T Revisiting network I/O APIs: the netmap framework
.%J Communications of the ACM
.%V 55 (3)
.%P 45-51
.%D March 2012
.Re
.Rs
.%A Luigi Rizzo
.%T netmap: a novel framework for fast packet I/O
.%D June 2012
.%O USENIX ATC '12, Boston
.Re
.Pp
.Lk http://info.iet.unipi.it/~luigi/netmap/
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Xr vale 4
have been funded by the European Commission within the FP7 Projects
CHANGE (257422) and OPENLAB (287581).