.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD: head/share/man/man4/netmap.4 228017 2011-11-27 06:55:57Z gjb $
.\"
.Dd May 25, 2019
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the
.Nm
API
to send and receive raw packets through physical interfaces
or ports of the
.Xr vale 4
switch.
.Pp
.Xr vale 4
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
.Xr ioctl 2
is used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Xr select 2
and
.Xr poll 2 .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
.Xr socket 2 ,
.Xr bpf 4 ,
or
.Xr pcap 3 ) .
.Pp
For a list of devices with native
.Nm
support, see section
.Sx SUPPORTED INTERFACES
at the end of this manual page.
.Sh OPERATING THE API
.Nm
clients must first execute the following code to open the device
node and to bind the file descriptor to a specific interface or port:
.Bd -literal -offset indent
fd = open("/dev/netmap", O_RDWR);
ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Ed
.Pp
.Nm
has multiple modes of operation controlled by the
content of the
.Vt struct nmreq
passed to
.Xr ioctl 2 .
In particular, the
.Va nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Xr vale 4
switch, as indicated below.
Additional fields in the
.Vt struct nmreq
control the details of operation.
.Bl -tag -width XXXX
.It Sy Interface name (e.g. 'em0', 'eth1', ...)
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Sy valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a
.Xr vale 4
switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not already exist.
.It Sy valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
are connected to or disconnected from the
.Xr vale 4
switch named XXX.
.Pp
In this case
.Xr ioctl 2
is used only for configuring the
.Xr vale 4
switch, typically through the
.Cm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be passed to
.Xr close 2
after issuing
.Xr ioctl 2 .
.El
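.Pp
As a sketch of the naming rule for
.Xr vale 4
ports, the helper below checks a candidate
.Va nr_name
before it is passed to
.Dv NIOCREGIF .
It is illustrative only:
.Fn vale_name_ok
and
.Dv SKETCH_IFNAMSIZ
are hypothetical names that are not part of the
.Nm
API, the value 16 is an assumed
.Dv IFNAMSIZ ,
and requiring a non-empty port part is an assumption too.
Whether YYY clashes with an existing interface is not checked here.
.Bd -literal
```c
#include <ctype.h>
#include <string.h>

#define SKETCH_IFNAMSIZ 16	/* assumed value of IFNAMSIZ */

/*
 * Hypothetical helper, not part of netmap: return 1 if name looks
 * like valeXXX:YYY and fits in nr_name, 0 otherwise.
 */
static int
vale_name_ok(const char *name)
{
	const char *p, *colon;

	if (strncmp(name, "vale", 4) != 0)
		return (0);
	if (strlen(name) >= SKETCH_IFNAMSIZ)	/* keep room for the NUL */
		return (0);
	colon = strchr(name, ':');
	if (colon == NULL || colon[1] == 0)	/* port part must exist */
		return (0);
	/* Everything after the prefix, except the separator, is alphanumeric. */
	for (p = name + 4; *p != 0; p++) {
		if (p != colon && !isalnum((unsigned char)*p))
			return (0);
	}
	return (1);
}
```
.Ed
.Pp
Under these assumptions the helper accepts vale0:1 and rejects em0
(no vale prefix) and vale0 (no port part).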
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Xr close 2
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Xr mmap 2
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Xr ioctl 2
commands, whereas the file descriptor can be passed to
.Xr select 2
and
.Xr poll 2
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.In net/netmap/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
.Bl -tag -width XXXX
.It Sy struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Nm
rings associated with the interface.
.Pp
.Vt struct netmap_if
is located in the shared memory region at the offset indicated by the
.Va nr_offset
field in the structure returned by
.Dv NIOCREGIF .
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
    const u_int   ni_version;        /* API version               */
    const u_int   ni_rx_rings;       /* number of rx ring pairs   */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Sy struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Nm
slots describing the buffers.
.Va reserved
is used in receive rings to tell the kernel the number of slots before
.Va cur
that are still in use by the application.
.Pp
Each physical interface has one
.Vt struct netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;   /* see details                 */
    const uint32_t num_slots; /* number of slots in the ring */
    uint32_t       avail;     /* number of usable slots      */
    uint32_t       cur;       /* 'current' read/write index  */
    uint32_t       reserved;  /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002   /* set timestamp on *sync()    */
#define NR_FORWARD   0x0004   /* enable NS_FORWARD for ring  */
#define NR_RX_TSTMP  0x0008   /* set rx timestamp in slots   */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots            */
};
.Ed
.Pp
In transmit rings, after a system call
.Va cur
indicates the first slot that can be used for transmissions, and
.Va avail
reports how many of them are available.
Before the next
.Nm Ns -related
system call on the file
descriptor, the application should fill buffers and
slots with data, and update
.Va cur
and
.Va avail
accordingly, as shown in the figure below:
.Bd -literal
              cur
               |----- avail ---|   (after syscall)
               v
     TX  [*****aaaaaaaaaaaaaaaaa**]
     TX  [*****TTTTTaaaaaaaaaaaa**]
                    ^
                    |-- avail --|   (before syscall)
                   cur
.Ed
.Pp
In receive rings, after a system call
.Va cur
indicates the first slot that contains a valid packet, and
.Va avail
reports how many of them are available.
Before the next
.Nm Ns -related
system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating
.Va cur
and
.Va avail
accordingly, as shown in the figure below.
Receive rings have an additional field called
.Va reserved
to indicate how many buffers before
.Va cur
cannot be released because they are still being processed.
.Bd -literal
                 cur
            |-res-|-- avail --|   (after syscall)
                  v
     RX  [**rrrrrrRRRRRRRRRRRR******]
     RX  [**...........rrrrRRR******]
                       |res|--|<avail (before syscall)
                           ^
                          cur
.Ed
.It Sy struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx; /* buffer index */
    uint16_t len;   /* packet length */
    uint16_t flags; /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results,
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot)        (((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;   /* buffer address (indirect buffers) */
};
.Ed
.Pp
The flags control how the buffer associated with the slot
should be managed.
.It Sy packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
.El
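.Pp
The bookkeeping on
.Va cur
and
.Va avail
described for transmit rings can be modeled in isolation.
The following is a minimal sketch, not kernel code:
.Vt struct mock_ring ,
.Vt struct mock_slot ,
.Fn mock_ring_next
and
.Fn mock_tx_fill
are hypothetical stand-ins that keep only the fields needed here,
and the fixed ring size of 8 slots is an arbitrary assumption.
.Bd -literal
```c
#include <stdint.h>

#define MOCK_SLOTS 8	/* arbitrary ring size for this sketch */

/* Simplified stand-ins for the shared structures; not the real layout. */
struct mock_slot {
	uint16_t len;
};

struct mock_ring {
	uint32_t num_slots;
	uint32_t avail;
	uint32_t cur;
	struct mock_slot slot[MOCK_SLOTS];
};

/* Index following i, wrapping around at the end of the ring. */
static uint32_t
mock_ring_next(const struct mock_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots ? 0 : i + 1);
}

/*
 * Queue up to want packets of length len, as an application would
 * between two system calls.  Returns the number of slots consumed.
 */
static uint32_t
mock_tx_fill(struct mock_ring *r, uint32_t want, uint16_t len)
{
	uint32_t used = 0;

	while (used < want && r->avail > 0) {
		r->slot[r->cur].len = len;
		r->cur = mock_ring_next(r, r->cur);
		r->avail--;
		used++;
	}
	return (used);
}
```
.Ed
.Pp
Each queued packet advances
.Va cur
by one slot and decrements
.Va avail ,
so the helper stops on its own once the ring is full.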
.Pp
Addresses are computed through macros in order to
support access to objects in the shared memory region, e.g.:
.Bl -tag -width ".Fn NETMAP_BUF ring buf_idx"
.It Fn NETMAP_TXRING nifp i
Returns the address of the
.Va i Ns -th
transmit ring.
.It Fn NETMAP_RXRING nifp i
Returns the address of the
.Va i Ns -th
receive ring.
.It Fn NETMAP_BUF ring buf_idx
Returns the address of the buffer with index
.Va buf_idx
(which can be part of any ring for the given interface).
.El
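.Pp
To illustrate how relative offsets can be turned into pointers,
the sketch below resolves a ring from an interface descriptor and a
buffer from a ring.
It is a simplified model under stated assumptions:
.Vt struct sketch_if ,
.Vt struct sketch_ring ,
.Fn sketch_ring_at
and
.Fn sketch_buf_at
are hypothetical, and the actual macro definitions should be taken from
.In net/netmap/netmap_user.h .
.Bd -literal
```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real layouts are in net/netmap/netmap.h. */
struct sketch_if {
	ptrdiff_t ring_ofs[2];	/* offsets from this struct to each ring */
};

struct sketch_ring {
	ptrdiff_t buf_ofs;	/* offset from this ring to buffer 0 */
	uint16_t  buf_size;	/* size of each buffer in bytes */
};

/* Ring i is assumed to live ring_ofs[i] bytes after the descriptor. */
static struct sketch_ring *
sketch_ring_at(struct sketch_if *nifp, unsigned int i)
{
	char *base = (char *)nifp;

	return ((struct sketch_ring *)(void *)(base + nifp->ring_ofs[i]));
}

/* Buffer buf_idx is assumed to start buf_ofs + buf_idx * buf_size
 * bytes after the ring. */
static char *
sketch_buf_at(struct sketch_ring *ring, uint32_t buf_idx)
{
	return ((char *)ring + ring->buf_ofs +
	    (size_t)buf_idx * ring->buf_size);
}
```
.Ed
.Pp
Because only offsets are stored, the same region stays valid
whatever address
.Xr mmap 2
returns in each process.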
.Ss FLAGS
Normally, buffers are associated with slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can, however, modify the mapping using the
following flags:
.Bl -tag -width ".Fn NS_RFRAGS slot"
.It Dv NS_BUF_CHANGED
indicates that the
.Va buf_idx
in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It Dv NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
may be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It Dv NS_FORWARD
When the device is opened in
.Sq transparent
mode, the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It Dv NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It Dv NS_INDIRECT
indicates that the packet's payload is not in the
.Nm Ns -supplied
buffer, but in a user-supplied buffer whose
user virtual address is in the
.Va ptr
field of the slot.
The size can reach 65535 bytes.
This is only supported on the transmit ring of virtual ports.
.It Dv NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag cleared.
The maximum length of a chain is 64 buffers.
This is only supported on virtual ports.
.It Fn NS_RFRAGS slot
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have
.Dv NS_MOREFRAG
set.
The length refers to the individual buffer;
there is no field for the total length.
.Pp
On transmit rings, if
.Dv NS_DST
is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
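.Pp
The zero-copy forwarding mentioned under
.Dv NS_BUF_CHANGED
amounts to exchanging buffer indexes between two slots.
The sketch below shows one way to do it;
.Vt struct mock_slot
and
.Fn mock_zcopy
are hypothetical names, the structure keeps only the fields used here,
and the flag value is copied from the
.Vt struct netmap_slot
listing.
.Bd -literal
```c
#include <stdint.h>

#define NS_BUF_CHANGED 0x0001	/* value from the struct netmap_slot listing */

/* Only the fields used here; the real struct netmap_slot has more. */
struct mock_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/*
 * Hand the packet held by an rx slot to a tx slot without copying:
 * exchange the buffer indexes, carry the length over, and flag both
 * slots so that the kernel notices the new buffers.
 */
static void
mock_zcopy(struct mock_slot *rx, struct mock_slot *tx)
{
	uint32_t idx = tx->buf_idx;

	tx->buf_idx = rx->buf_idx;
	tx->len = rx->len;
	rx->buf_idx = idx;
	tx->flags |= NS_BUF_CHANGED;
	rx->flags |= NS_BUF_CHANGED;
}
```
.Ed
.Pp
The receive slot gets the old transmit buffer back, so neither ring
loses a buffer.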
.Sh SYSTEM CALLS
.Nm
supports
.Xr ioctl 2
commands to synchronize the state of the rings
between the kernel and the user processes, as well as
to query and configure the interface.
The former do not require any argument, whereas the latter use a
.Vt struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
        char      nr_name[IFNAMSIZ];
        uint32_t  nr_version;     /* API version */
#define NETMAP_API      4         /* current version */
        uint32_t  nr_offset;      /* nifp offset in the shared region */
        uint32_t  nr_memsize;     /* size of the shared region */
        uint32_t  nr_tx_slots;    /* slots in tx rings */
        uint32_t  nr_rx_slots;    /* slots in rx rings */
        uint16_t  nr_tx_rings;    /* number of tx rings */
        uint16_t  nr_rx_rings;    /* number of rx rings */
        uint16_t  nr_ringid;      /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000  /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000  /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff   /* the actual ring number */
        uint16_t  nr_cmd;
#define NETMAP_BDG_ATTACH       1 /* attach the NIC */
#define NETMAP_BDG_DETACH       2 /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG   3 /* register lookup function */
#define NETMAP_BDG_LIST         4 /* get bridge's info */
        uint16_t  nr_arg1;
        uint16_t  nr_arg2;
        uint32_t  spare2[3];
};
.Ed
.Pp
A device descriptor obtained through
.Pa /dev/netmap
supports the
.Xr ioctl 2
command codes supported by network devices, as well as
specific command codes defined in
.In net/netmap/netmap.h .
These specific command codes are as follows:
.Bl -tag -width ".Dv NIOCTXSYNC"
.It Dv NIOCGINFO
returns
.Dv EINVAL
if the named device does not support
.Nm .
Otherwise, it returns zero and advisory information
about the interface.
Note that all the information below can change before the
interface is actually put into
.Nm
mode.
.Pp
.Va nr_memsize
indicates the size of the
.Nm
memory region.
Physical devices all share the same memory region, whereas
.Xr vale 4
ports may have independent regions for each port.
These sizes can be set through system-wide
.Xr sysctl 8
variables.
.Va nr_tx_slots
and
.Va nr_rx_slots
indicate the size of transmit and receive rings, respectively.
.Va nr_tx_rings
and
.Va nr_rx_rings
indicate the number of transmit and receive rings, respectively.
Both ring number and size may be configured at runtime
using interface-specific functions (e.g.\&
.Xr sysctl 8
on BSD, or
.Xr ethtool 8
on Linux).
.It Dv NIOCREGIF
puts the interface specified via
.Va nr_name
into
.Nm
mode, disconnecting it from the host stack, and/or defines which
rings are controlled through this file descriptor.
On return, it gives the same info as
.Dv NIOCGINFO ,
and
.Va nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for
.Va nr_ringid
are as follows:
.Bl -tag -width "Dv NETMAP_HW_RING + i"
.It 0
default; all hardware rings
.It Dv NETMAP_SW_RING
.Dq host rings
connecting to the host stack
.It Dv NETMAP_HW_RING + i
i-th hardware ring
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events were specified.
The feature can be disabled by OR-ing the flag
.Dv NETMAP_NO_TX_POLL
into
.Va nr_ringid .
Normally, you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if
.Dv NIOCTXSYNC
is called, or the send queue is full.
.Pp
.Dv NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) by setting the
.Va nr_tx_rings
and
.Va nr_rx_rings
fields accordingly.
.It Dv NIOCTXSYNC
tells the hardware about new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware about consumed packets, and asks for newly available
packets.
.El
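.Pp
The values accepted in
.Va nr_ringid
can be composed as in the following sketch.
.Fn sketch_ringid
is a hypothetical helper that is not part of the
.Nm
API; the constants are copied from the
.Vt struct nmreq
listing above, and using a negative index to mean
.Dq all hardware rings
is a convention of this example only.
.Bd -literal
```c
#include <stdint.h>

/* Values copied from the struct nmreq listing. */
#define NETMAP_HW_RING    0x4000
#define NETMAP_SW_RING    0x2000
#define NETMAP_NO_TX_POLL 0x1000
#define NETMAP_RING_MASK  0xfff

/*
 * Hypothetical helper: select hardware ring i (or all hardware rings
 * if i is negative), optionally disabling the implicit txsync that
 * poll and select perform.
 */
static uint16_t
sketch_ringid(int i, int no_tx_poll)
{
	uint16_t id = 0;	/* 0 is the default: all hardware rings */

	if (i >= 0)
		id = (uint16_t)(NETMAP_HW_RING | (i & NETMAP_RING_MASK));
	if (no_tx_poll)
		id |= NETMAP_NO_TX_POLL;
	return (id);
}
```
.Ed
.Pp
The result would be stored in
.Va nr_ringid
before issuing
.Dv NIOCREGIF ;
the host rings would be selected with
.Dv NETMAP_SW_RING
instead.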
.Pp
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives; see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator:
.Bd -literal
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <net/netmap/netmap_user.h>

#include <err.h>
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>

int
main(void)
{
	struct netmap_if *nifp;
	struct netmap_ring *ring;
	struct pollfd fds;
	struct nmreq nmr;
	void *p;
	int fd;

	/* Open the device node and bind the descriptor to ix0. */
	fd = open("/dev/netmap", O_RDWR);
	if (fd < 0)
		err(1, "open");
	memset(&nmr, 0, sizeof(nmr));
	strlcpy(nmr.nr_name, "ix0", sizeof(nmr.nr_name));
	nmr.nr_version = NETMAP_API;
	if (ioctl(fd, NIOCREGIF, &nmr) < 0)
		err(1, "NIOCREGIF");

	/* Map the shared region and locate the first transmit ring. */
	p = mmap(NULL, nmr.nr_memsize, PROT_WRITE | PROT_READ,
	    MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	nifp = NETMAP_IF(p, nmr.nr_offset);
	ring = NETMAP_TXRING(nifp, 0);
	fds.fd = fd;
	fds.events = POLLOUT;

	for (;;) {
		/* Wait for free slots; poll also pushes out queued packets. */
		if (poll(&fds, 1, -1) < 0)
			err(1, "poll");
		for (; ring->avail > 0; ring->avail--) {
			uint32_t i;
			void *buf;

			i = ring->cur;
			buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
			/* prepare packet in buf */
			ring->slot[i].len = 0; /* packet length */
			ring->cur = NETMAP_RING_NEXT(ring, i);
		}
	}
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
and
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Rs
.%A Luigi Rizzo
.%T Revisiting network I/O APIs: the netmap framework
.%J Communications of the ACM
.%V 55 (3)
.%P 45-51
.%D March 2012
.Re
.Rs
.%A Luigi Rizzo
.%T netmap: a novel framework for fast packet I/O
.%D June 2012
.%O USENIX ATC '12, Boston
.Re
.Pp
.Lk http://info.iet.unipi.it/~luigi/netmap/
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Xr vale 4
have been funded by the European Commission within the FP7 Projects
CHANGE (257422) and OPENLAB (287581).