1.\" Copyright (c) 2007 Seccuris Inc.
2.\" All rights reserved.
3.\"
4.\" This software was developed by Robert N. M. Watson under contract to
5.\" Seccuris Inc.
6.\"
7.\" Redistribution and use in source and binary forms, with or without
8.\" modification, are permitted provided that the following conditions
9.\" are met:
10.\" 1. Redistributions of source code must retain the above copyright
11.\"    notice, this list of conditions and the following disclaimer.
12.\" 2. Redistributions in binary form must reproduce the above copyright
13.\"    notice, this list of conditions and the following disclaimer in the
14.\"    documentation and/or other materials provided with the distribution.
15.\"
16.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
17.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
18.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
19.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
20.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
21.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
22.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
23.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
24.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
25.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
26.\" SUCH DAMAGE.
27.\"
28.\" Copyright (c) 1990 The Regents of the University of California.
29.\" All rights reserved.
30.\"
31.\" Redistribution and use in source and binary forms, with or without
32.\" modification, are permitted provided that: (1) source code distributions
33.\" retain the above copyright notice and this paragraph in its entirety, (2)
34.\" distributions including binary code include the above copyright notice and
35.\" this paragraph in its entirety in the documentation or other materials
36.\" provided with the distribution, and (3) all advertising materials mentioning
37.\" features or use of this software display the following acknowledgement:
38.\" ``This product includes software developed by the University of California,
39.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
40.\" the University nor the names of its contributors may be used to endorse
41.\" or promote products derived from this software without specific prior
42.\" written permission.
43.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
44.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
45.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
46.\"
47.\" This document is derived in part from the enet man page (enet.4)
48.\" distributed with 4.3BSD Unix.
49.\"
50.Dd October 13, 2021
51.Dt BPF 4
52.Os
53.Sh NAME
54.Nm bpf
55.Nd Berkeley Packet Filter
56.Sh SYNOPSIS
57.Cd device bpf
58.Sh DESCRIPTION
59The Berkeley Packet Filter
60provides a raw interface to data link layers in a protocol
61independent fashion.
62All packets on the network, even those destined for other hosts,
63are accessible through this mechanism.
64.Pp
65The packet filter appears as a character special device,
66.Pa /dev/bpf .
67After opening the device, the file descriptor must be bound to a
68specific network interface with the
69.Dv BIOCSETIF
70ioctl.
71A given interface can be shared by multiple listeners, and the filter
72underlying each descriptor will see an identical packet stream.
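.Pp
As a minimal sketch of the open and bind steps described above, the
following hypothetical helper,
.Fn bpf_open_on ,
opens
.Pa /dev/bpf
and attaches the descriptor to a named interface with
.Dv BIOCSETIF .
The helper name, the choice of
.Dv O_RDWR ,
and the error handling are illustrative assumptions and not part of the
.Nm
interface.
.Bd -literal
#include <sys/types.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/bpf.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: open /dev/bpf and bind it to ifname.
 * Returns the descriptor on success and -1 on failure.
 */
static int
bpf_open_on(const char *ifname)
{
	struct ifreq ifr;
	int fd;

	fd = open("/dev/bpf", O_RDWR);
	if (fd == -1)
		return (-1);

	/* The interface is selected by name through ifr_name. */
	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
	if (ioctl(fd, BIOCSETIF, &ifr) == -1) {
		close(fd);
		return (-1);
	}
	return (fd);
}
.Ed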
73.Pp
74Associated with each open instance of a
75.Nm
76file is a user-settable packet filter.
77Whenever a packet is received by an interface,
78all file descriptors listening on that interface apply their filter.
79Each descriptor that accepts the packet receives its own copy.
80.Pp
81A packet can be sent out on the network by writing to a
82.Nm
83file descriptor.
84The writes are unbuffered, meaning only one packet can be processed per write.
85Currently, only writes to Ethernets and
86.Tn SLIP
87links are supported.
88.Sh BUFFER MODES
89.Nm
90devices deliver packet data to the application via memory buffers provided by
91the application.
92The buffer mode is set using the
93.Dv BIOCSETBUFMODE
94ioctl, and read using the
95.Dv BIOCGETBUFMODE
96ioctl.
97.Ss Buffered read mode
98By default,
99.Nm
100devices operate in the
101.Dv BPF_BUFMODE_BUFFER
102mode, in which packet data is copied explicitly from kernel to user memory
103using the
104.Xr read 2
105system call.
106The user process will declare a fixed buffer size that will be used both for
107sizing internal buffers and for all
108.Xr read 2
109operations on the file.
110This size is queried using the
111.Dv BIOCGBLEN
112ioctl, and is set using the
113.Dv BIOCSBLEN
114ioctl.
115Note that an individual packet larger than the buffer size is necessarily
116truncated.
117.Ss Zero-copy buffer mode
118.Nm
119devices may also operate in the
.Dv BPF_BUFMODE_ZBUF
121mode, in which packet data is written directly into two user memory buffers
122by the kernel, avoiding both system call and copying overhead.
Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of
the page size.
125The maximum zero-copy buffer size is returned by the
126.Dv BIOCGETZMAX
127ioctl.
128Note that an individual packet larger than the buffer size is necessarily
129truncated.
130.Pp
131The user process registers two memory buffers using the
132.Dv BIOCSETZBUF
133ioctl, which accepts a
134.Vt struct bpf_zbuf
135pointer as an argument:
.Bd -literal
struct bpf_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};
.Ed
143.Pp
144.Vt bz_bufa
145is a pointer to the userspace address of the first buffer that will be
146filled, and
147.Vt bz_bufb
148is a pointer to the second buffer.
149.Nm
150will then cycle between the two buffers as they fill and are acknowledged.
151.Pp
152Each buffer begins with a fixed-length header to hold synchronization and
153data length information for the buffer:
.Bd -literal
struct bpf_zbuf_header {
	volatile u_int  bzh_kernel_gen;	/* Kernel generation number. */
	volatile u_int  bzh_kernel_len;	/* Length of data in the buffer. */
	volatile u_int  bzh_user_gen;	/* User generation number. */
	/* ...padding for future use... */
};
.Ed
162.Pp
163The header structure of each buffer, including all padding, should be zeroed
164before it is configured using
165.Dv BIOCSETZBUF .
166Remaining space in the buffer will be used by the kernel to store packet
167data, laid out in the same format as with buffered read mode.
168.Pp
169The kernel and the user process follow a simple acknowledgement protocol via
170the buffer header to synchronize access to the buffer: when the header
171generation numbers,
172.Vt bzh_kernel_gen
173and
174.Vt bzh_user_gen ,
175hold the same value, the kernel owns the buffer, and when they differ,
176userspace owns the buffer.
177.Pp
178While the kernel owns the buffer, the contents are unstable and may change
179asynchronously; while the user process owns the buffer, its contents are
180stable and will not be changed until the buffer has been acknowledged.
181.Pp
182Initializing the buffer headers to all 0's before registering the buffer has
183the effect of assigning initial ownership of both buffers to the kernel.
184The kernel signals that a buffer has been assigned to userspace by modifying
185.Vt bzh_kernel_gen ,
186and userspace acknowledges the buffer and returns it to the kernel by setting
187the value of
188.Vt bzh_user_gen
189to the value of
190.Vt bzh_kernel_gen .
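.Pp
The ownership rule can be illustrated with a small single-threaded
simulation.
This is only a sketch: the structure and function names are hypothetical
stand-ins, the simulated kernel step assumes that the kernel modifies its
generation number by incrementing it, and no atomic operations or memory
barriers are used, so the code must not be applied to a live buffer.
.Bd -literal
/* Stand-in for the generation fields of struct bpf_zbuf_header. */
struct ex_zhdr {
	unsigned int kernel_gen;	/* kernel generation number */
	unsigned int kernel_len;	/* length of data in the buffer */
	unsigned int user_gen;		/* user generation number */
};

/* Userspace owns the buffer exactly when the generations differ. */
static int
ex_user_owns(const struct ex_zhdr *h)
{

	return (h->kernel_gen != h->user_gen);
}

/* Simulated kernel step: hand a buffer holding len bytes to userspace. */
static void
ex_kernel_assign(struct ex_zhdr *h, unsigned int len)
{

	h->kernel_len = len;
	h->kernel_gen++;	/* assumption: modification is an increment */
}

/* Userspace acknowledgement: return the buffer to the kernel. */
static void
ex_user_ack(struct ex_zhdr *h)
{

	h->user_gen = h->kernel_gen;
}
.Ed
.Pp
Starting from a zeroed header the kernel owns the buffer; after the
simulated assignment userspace owns it, and after the acknowledgement
ownership returns to the kernel.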
191.Pp
192In order to avoid caching and memory re-ordering effects, the user process
193must use atomic operations and memory barriers when checking for and
194acknowledging buffers:
.Bd -literal
#include <machine/atomic.h>

/*
 * Return ownership of a buffer to the kernel for reuse.
 */
static void
buffer_acknowledge(struct bpf_zbuf_header *bzh)
{

	atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
}

/*
 * Check whether a buffer has been assigned to userspace by the kernel.
 * Return true if userspace owns the buffer, and false otherwise.
 */
static int
buffer_check(struct bpf_zbuf_header *bzh)
{

	return (bzh->bzh_user_gen !=
	    atomic_load_acq_int(&bzh->bzh_kernel_gen));
}
.Ed
220.Pp
221The user process may force the assignment of the next buffer, if any data
222is pending, to userspace using the
223.Dv BIOCROTZBUF
224ioctl.
225This allows the user process to retrieve data in a partially filled buffer
226before the buffer is full, such as following a timeout; the process must
227recheck for buffer ownership using the header generation numbers, as the
228buffer will not be assigned to userspace if no data was present.
229.Pp
230As in the buffered read mode,
231.Xr kqueue 2 ,
232.Xr poll 2 ,
233and
234.Xr select 2
235may be used to sleep awaiting the availability of a completed buffer.
236They will return a readable file descriptor when ownership of the next buffer
237is assigned to user space.
238.Pp
239In the current implementation, the kernel may assign zero, one, or both
240buffers to the user process; however, an earlier implementation maintained
241the invariant that at most one buffer could be assigned to the user process
242at a time.
243In order to both ensure progress and high performance, user processes should
244acknowledge a completely processed buffer as quickly as possible, returning
245it for reuse, and not block waiting on a second buffer while holding another
246buffer.
247.Sh IOCTLS
248The
249.Xr ioctl 2
250command codes below are defined in
251.In net/bpf.h .
252All commands require
253these includes:
.Bd -literal
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/ioctl.h>
	#include <net/bpf.h>
.Ed
260.Pp
261Additionally,
262.Dv BIOCGETIF
263and
264.Dv BIOCSETIF
265require
266.In sys/socket.h
267and
268.In net/if.h .
269.Pp
270In addition to
271.Dv FIONREAD
272the following commands may be applied to any open
273.Nm
274file.
275The (third) argument to
276.Xr ioctl 2
277should be a pointer to the type indicated.
278.Bl -tag -width BIOCGETBUFMODE
279.It Dv BIOCGBLEN
280.Pq Li u_int
281Returns the required buffer length for reads on
282.Nm
283files.
284.It Dv BIOCSBLEN
285.Pq Li u_int
286Sets the buffer length for reads on
287.Nm
288files.
289The buffer must be set before the file is attached to an interface
290with
291.Dv BIOCSETIF .
292If the requested buffer size cannot be accommodated, the closest
293allowable size will be set and returned in the argument.
294A read call will result in
295.Er EINVAL
296if it is passed a buffer that is not this size.
297.It Dv BIOCGDLT
298.Pq Li u_int
299Returns the type of the data link layer underlying the attached interface.
300.Er EINVAL
301is returned if no interface has been specified.
302The device types, prefixed with
303.Dq Li DLT_ ,
304are defined in
305.In net/bpf.h .
306.It Dv BIOCGDLTLIST
307.Pq Li "struct bpf_dltlist"
308Returns an array of the available types of the data link layer
309underlying the attached interface:
.Bd -literal -offset indent
struct bpf_dltlist {
	u_int bfl_len;
	u_int *bfl_list;
};
.Ed
316.Pp
317The available types are returned in the array pointed to by the
318.Va bfl_list
319field while their length in u_int is supplied to the
320.Va bfl_len
321field.
322.Er ENOMEM
323is returned if there is not enough buffer space and
324.Er EFAULT
325is returned if a bad address is encountered.
326The
327.Va bfl_len
328field is modified on return to indicate the actual length in u_int
329of the array returned.
330If
331.Va bfl_list
332is
333.Dv NULL ,
334the
335.Va bfl_len
336field is set to indicate the required length of an array in u_int.
337.It Dv BIOCSDLT
338.Pq Li u_int
339Changes the type of the data link layer underlying the attached interface.
340.Er EINVAL
341is returned if no interface has been specified or the specified
342type is not available for the interface.
343.It Dv BIOCPROMISC
344Forces the interface into promiscuous mode.
345All packets, not just those destined for the local host, are processed.
346Since more than one file can be listening on a given interface,
347a listener that opened its interface non-promiscuously may receive
348packets promiscuously.
349This problem can be remedied with an appropriate filter.
350.Pp
351The interface remains in promiscuous mode until all files listening
352promiscuously are closed.
353.It Dv BIOCFLUSH
Flushes the buffer of incoming packets,
and resets the statistics that are returned by
.Dv BIOCGSTATS .
356.It Dv BIOCGETIF
357.Pq Li "struct ifreq"
358Returns the name of the hardware interface that the file is listening on.
The name is returned in the
.Li ifr_name
field of the
.Li ifreq
structure.
363All other fields are undefined.
364.It Dv BIOCSETIF
365.Pq Li "struct ifreq"
366Sets the hardware interface associated with the file.
367This
368command must be performed before any packets can be read.
369The device is indicated by name using the
370.Li ifr_name
371field of the
372.Li ifreq
373structure.
374Additionally, performs the actions of
375.Dv BIOCFLUSH .
376.It Dv BIOCSRTIMEOUT
377.It Dv BIOCGRTIMEOUT
378.Pq Li "struct timeval"
379Sets or gets the read timeout parameter.
380The argument
381specifies the length of time to wait before timing
382out on a read request.
383This parameter is initialized to zero by
384.Xr open 2 ,
385indicating no timeout.
386.It Dv BIOCGSTATS
387.Pq Li "struct bpf_stat"
388Returns the following structure of packet statistics:
.Bd -literal
struct bpf_stat {
	u_int bs_recv;    /* number of packets received */
	u_int bs_drop;    /* number of packets dropped */
};
.Ed
395.Pp
396The fields are:
397.Bl -hang -offset indent
398.It Li bs_recv
399the number of packets received by the descriptor since opened or reset
400(including any buffered since the last read call);
401and
402.It Li bs_drop
403the number of packets which were accepted by the filter but dropped by the
404kernel because of buffer overflows
405(i.e., the application's reads are not keeping up with the packet traffic).
406.El
407.It Dv BIOCIMMEDIATE
408.Pq Li u_int
409Enables or disables
410.Dq immediate mode ,
411based on the truth value of the argument.
412When immediate mode is enabled, reads return immediately upon packet
413reception.
414Otherwise, a read will block until either the kernel buffer
415becomes full or a timeout occurs.
416This is useful for programs like
417.Xr rarpd 8
418which must respond to messages in real time.
419The default for a new file is off.
420.It Dv BIOCSETF
421.It Dv BIOCSETFNR
422.Pq Li "struct bpf_program"
423Sets the read filter program used by the kernel to discard uninteresting
424packets.
425An array of instructions and its length is passed in using
426the following structure:
.Bd -literal
struct bpf_program {
	u_int bf_len;
	struct bpf_insn *bf_insns;
};
.Ed
433.Pp
434The filter program is pointed to by the
435.Li bf_insns
436field while its length in units of
437.Sq Li struct bpf_insn
438is given by the
439.Li bf_len
440field.
441See section
442.Sx "FILTER MACHINE"
443for an explanation of the filter language.
444The only difference between
445.Dv BIOCSETF
446and
447.Dv BIOCSETFNR
is that
449.Dv BIOCSETF
450performs the actions of
451.Dv BIOCFLUSH
452while
453.Dv BIOCSETFNR
454does not.
455.It Dv BIOCSETWF
456.Pq Li "struct bpf_program"
457Sets the write filter program used by the kernel to control what type of
458packets can be written to the interface.
459See the
460.Dv BIOCSETF
461command for more
462information on the
463.Nm
464filter program.
465.It Dv BIOCVERSION
466.Pq Li "struct bpf_version"
467Returns the major and minor version numbers of the filter language currently
468recognized by the kernel.
469Before installing a filter, applications must check
470that the current version is compatible with the running kernel.
471Version numbers are compatible if the major numbers match and the application minor
472is less than or equal to the kernel minor.
473The kernel version number is returned in the following structure:
.Bd -literal
struct bpf_version {
	u_short bv_major;
	u_short bv_minor;
};
.Ed
480.Pp
481The current version numbers are given by
482.Dv BPF_MAJOR_VERSION
483and
484.Dv BPF_MINOR_VERSION
485from
486.In net/bpf.h .
487An incompatible filter
488may result in undefined behavior (most likely, an error returned by
489.Fn ioctl
490or haphazard packet matching).
491.It Dv BIOCGRSIG
492.It Dv BIOCSRSIG
493.Pq Li u_int
494Sets or gets the receive signal.
495This signal will be sent to the process or process group specified by
496.Dv FIOSETOWN .
497It defaults to
498.Dv SIGIO .
499.It Dv BIOCSHDRCMPLT
500.It Dv BIOCGHDRCMPLT
501.Pq Li u_int
502Sets or gets the status of the
503.Dq header complete
504flag.
505Set to zero if the link level source address should be filled in automatically
506by the interface output routine.
507Set to one if the link level source
508address will be written, as provided, to the wire.
509This flag is initialized to zero by default.
510.It Dv BIOCSSEESENT
511.It Dv BIOCGSEESENT
512.Pq Li u_int
513These commands are obsolete but left for compatibility.
514Use
515.Dv BIOCSDIRECTION
516and
517.Dv BIOCGDIRECTION
518instead.
519Sets or gets the flag determining whether locally generated packets on the
520interface should be returned by BPF.
521Set to zero to see only incoming packets on the interface.
522Set to one to see packets originating locally and remotely on the interface.
523This flag is initialized to one by default.
524.It Dv BIOCSDIRECTION
525.It Dv BIOCGDIRECTION
526.Pq Li u_int
527Sets or gets the setting determining whether incoming, outgoing, or all packets
528on the interface should be returned by BPF.
529Set to
530.Dv BPF_D_IN
531to see only incoming packets on the interface.
532Set to
533.Dv BPF_D_INOUT
534to see packets originating locally and remotely on the interface.
535Set to
536.Dv BPF_D_OUT
537to see only outgoing packets on the interface.
538This setting is initialized to
539.Dv BPF_D_INOUT
540by default.
541.It Dv BIOCSTSTAMP
542.It Dv BIOCGTSTAMP
543.Pq Li u_int
544Set or get format and resolution of the time stamps returned by BPF.
545Set to
546.Dv BPF_T_MICROTIME ,
547.Dv BPF_T_MICROTIME_FAST ,
548.Dv BPF_T_MICROTIME_MONOTONIC ,
549or
550.Dv BPF_T_MICROTIME_MONOTONIC_FAST
551to get time stamps in 64-bit
552.Vt struct timeval
553format.
554Set to
555.Dv BPF_T_NANOTIME ,
556.Dv BPF_T_NANOTIME_FAST ,
557.Dv BPF_T_NANOTIME_MONOTONIC ,
558or
559.Dv BPF_T_NANOTIME_MONOTONIC_FAST
560to get time stamps in 64-bit
561.Vt struct timespec
562format.
563Set to
564.Dv BPF_T_BINTIME ,
565.Dv BPF_T_BINTIME_FAST ,
.Dv BPF_T_BINTIME_MONOTONIC ,
567or
568.Dv BPF_T_BINTIME_MONOTONIC_FAST
569to get time stamps in 64-bit
570.Vt struct bintime
571format.
572Set to
573.Dv BPF_T_NONE
574to ignore time stamp.
575All 64-bit time stamp formats are wrapped in
576.Vt struct bpf_ts .
577The
578.Dv BPF_T_MICROTIME_FAST ,
579.Dv BPF_T_NANOTIME_FAST ,
580.Dv BPF_T_BINTIME_FAST ,
581.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
582.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
583and
584.Dv BPF_T_BINTIME_MONOTONIC_FAST
are analogs of the corresponding formats without the _FAST suffix, but do not
perform a full time counter query, so their accuracy is one timer tick.
587The
588.Dv BPF_T_MICROTIME_MONOTONIC ,
589.Dv BPF_T_NANOTIME_MONOTONIC ,
590.Dv BPF_T_BINTIME_MONOTONIC ,
591.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
592.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
593and
594.Dv BPF_T_BINTIME_MONOTONIC_FAST
595store the time elapsed since kernel boot.
596This setting is initialized to
597.Dv BPF_T_MICROTIME
598by default.
599.It Dv BIOCFEEDBACK
600.Pq Li u_int
601Set packet feedback mode.
602This allows injected packets to be fed back as input to the interface when
603output via the interface is successful.
When the
.Dv BPF_D_INOUT
direction is set, an injected outgoing packet is not returned by BPF, to avoid
duplication.
608This flag is initialized to zero by default.
609.It Dv BIOCLOCK
610Set the locked flag on the
611.Nm
612descriptor.
613This prevents the execution of
614ioctl commands which could change the underlying operating parameters of
615the device.
616.It Dv BIOCGETBUFMODE
617.It Dv BIOCSETBUFMODE
618.Pq Li u_int
619Get or set the current
620.Nm
621buffering mode; possible values are
622.Dv BPF_BUFMODE_BUFFER ,
623buffered read mode, and
624.Dv BPF_BUFMODE_ZBUF ,
625zero-copy buffer mode.
626.It Dv BIOCSETZBUF
627.Pq Li struct bpf_zbuf
628Set the current zero-copy buffer locations; buffer locations may be
629set only once zero-copy buffer mode has been selected, and prior to attaching
630to an interface.
631Buffers must be of identical size, page-aligned, and an integer multiple of
632pages in size.
633The three fields
634.Vt bz_bufa ,
635.Vt bz_bufb ,
636and
637.Vt bz_buflen
638must be filled out.
639If buffers have already been set for this device, the ioctl will fail.
640.It Dv BIOCGETZMAX
641.Pq Li size_t
642Get the largest individual zero-copy buffer size allowed.
643As two buffers are used in zero-copy buffer mode, the limit (in practice) is
644twice the returned size.
645As zero-copy buffers consume kernel address space, conservative selection of
646buffer size is suggested, especially when there are multiple
647.Nm
648descriptors in use on 32-bit systems.
649.It Dv BIOCROTZBUF
Force ownership of the next buffer to be assigned to userspace, if any data
is present in the buffer.
652If no data is present, the buffer will remain owned by the kernel.
653This allows consumers of zero-copy buffering to implement timeouts and
654retrieve partially filled buffers.
655In order to handle the case where no data is present in the buffer and
656therefore ownership is not assigned, the user process must check
657.Vt bzh_kernel_gen
658against
659.Vt bzh_user_gen .
660.It Dv BIOCSETVLANPCP
661Set the VLAN PCP bits to the supplied value.
662.El
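.Pp
The compatibility rule given for
.Dv BIOCVERSION
above can be sketched as a small predicate.
The structure and function names below are hypothetical; a real program
would fill the kernel side from the
.Vt struct bpf_version
returned by the ioctl and the application side from
.Dv BPF_MAJOR_VERSION
and
.Dv BPF_MINOR_VERSION .
.Bd -literal
/* Stand-in for struct bpf_version. */
struct ex_version {
	unsigned short major;
	unsigned short minor;
};

/*
 * Return nonzero when a filter built against version app may be
 * installed on a kernel reporting version kern: the major numbers
 * must match and the application minor must not exceed the kernel
 * minor.
 */
static int
ex_version_ok(struct ex_version app, struct ex_version kern)
{

	return (app.major == kern.major && app.minor <= kern.minor);
}
.Ed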
663.Sh STANDARD IOCTLS
664.Nm
665now supports several standard
666.Xr ioctl 2 Ns 's
667which allow the user to do async and/or non-blocking I/O to an open
.Nm
669file descriptor.
670.Bl -tag -width SIOCGIFADDR
671.It Dv FIONREAD
672.Pq Li int
673Returns the number of bytes that are immediately available for reading.
674.It Dv SIOCGIFADDR
675.Pq Li "struct ifreq"
676Returns the address associated with the interface.
677.It Dv FIONBIO
678.Pq Li int
679Sets or clears non-blocking I/O.
680If arg is non-zero, then doing a
681.Xr read 2
682when no data is available will return -1 and
683.Va errno
684will be set to
685.Er EAGAIN .
686If arg is zero, non-blocking I/O is disabled.
687Note: setting this overrides the timeout set by
688.Dv BIOCSRTIMEOUT .
689.It Dv FIOASYNC
690.Pq Li int
691Enables or disables async I/O.
692When enabled (arg is non-zero), the process or process group specified by
693.Dv FIOSETOWN
694will start receiving
695.Dv SIGIO 's
696when packets arrive.
697Note that you must do an
698.Dv FIOSETOWN
699in order for this to take effect,
700as the system will not default this for you.
701The signal may be changed via
702.Dv BIOCSRSIG .
703.It Dv FIOSETOWN
704.It Dv FIOGETOWN
705.Pq Li int
706Sets or gets the process or process group (if negative) that should
707receive
708.Dv SIGIO
709when packets are available.
710The signal may be changed using
711.Dv BIOCSRSIG
712(see above).
713.El
714.Sh BPF HEADER
715One of the following structures is prepended to each packet returned by
716.Xr read 2
717or via a zero-copy buffer:
.Bd -literal
struct bpf_xhdr {
	struct bpf_ts	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};

struct bpf_hdr {
	struct timeval	bh_tstamp;     /* time stamp */
	uint32_t	bh_caplen;     /* length of captured portion */
	uint32_t	bh_datalen;    /* original length of packet */
	u_short		bh_hdrlen;     /* length of bpf header (this struct
					  plus alignment padding) */
};
.Ed
735.Pp
The fields, whose values are stored in host order, are:
737.Pp
738.Bl -tag -compact -width bh_datalen
739.It Li bh_tstamp
740The time at which the packet was processed by the packet filter.
741.It Li bh_caplen
742The length of the captured portion of the packet.
743This is the minimum of
744the truncation amount specified by the filter and the length of the packet.
745.It Li bh_datalen
746The length of the packet off the wire.
747This value is independent of the truncation amount specified by the filter.
748.It Li bh_hdrlen
749The length of the
750.Nm
751header, which may not be equal to
752.\" XXX - not really a function call
753.Fn sizeof "struct bpf_xhdr"
754or
755.Fn sizeof "struct bpf_hdr" .
756.El
757.Pp
758The
759.Li bh_hdrlen
760field exists to account for
761padding between the header and the link level protocol.
762The purpose here is to guarantee proper alignment of the packet
763data structures, which is required on alignment sensitive
764architectures and improves performance on many other architectures.
765The packet filter ensures that the
766.Vt bpf_xhdr ,
767.Vt bpf_hdr
768and the network layer
769header will be word aligned.
770Currently,
771.Vt bpf_hdr
772is used when the time stamp is set to
773.Dv BPF_T_MICROTIME ,
774.Dv BPF_T_MICROTIME_FAST ,
775.Dv BPF_T_MICROTIME_MONOTONIC ,
776.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
777or
778.Dv BPF_T_NONE
779for backward compatibility reasons.
780Otherwise,
781.Vt bpf_xhdr
782is used.
783However,
784.Vt bpf_hdr
785may be deprecated in the near future.
786Suitable precautions
787must be taken when accessing the link layer protocol fields on alignment
788restricted machines.
789(This is not a problem on an Ethernet, since
790the type field is a short falling on an even offset,
791and the addresses are probably accessed in a bytewise fashion).
792.Pp
793Additionally, individual packets are padded so that each starts
794on a word boundary.
795This requires that an application
796has some knowledge of how to get from packet to packet.
797The macro
798.Dv BPF_WORDALIGN
799is defined in
800.In net/bpf.h
801to facilitate
802this process.
803It rounds up its argument to the nearest word aligned value (where a word is
804.Dv BPF_ALIGNMENT
805bytes wide).
806.Pp
807For example, if
808.Sq Li p
809points to the start of a packet, this expression
810will advance it to the next packet:
811.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
812.Pp
813For the alignment mechanisms to work properly, the
814buffer passed to
815.Xr read 2
816must itself be word aligned.
817The
818.Xr malloc 3
819function
820will always return an aligned buffer.
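.Pp
The stepping rule above can be sketched as a loop over a buffer of records.
To keep the sketch self-contained it uses a hypothetical stand-in header,
.Vt struct ex_hdr ,
whose layout is not that of
.Vt struct bpf_hdr ,
and a local
.Dv EX_WORDALIGN
macro that assumes a word is
.Fn sizeof long
bytes wide; a real program would use the definitions from
.In net/bpf.h .
.Bd -literal
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	EX_ALIGNMENT	sizeof(long)
#define	EX_WORDALIGN(x)	(((x) + (EX_ALIGNMENT - 1)) & ~(EX_ALIGNMENT - 1))

/* Stand-in record header; only the length fields matter here. */
struct ex_hdr {
	uint32_t caplen;	/* captured bytes following the header */
	uint32_t datalen;	/* original length of the packet */
	uint16_t hdrlen;	/* header length including padding */
};

/*
 * Walk the first n bytes of buf, returning the number of records and
 * storing the sum of their captured lengths in *captured.
 */
static size_t
ex_walk(const unsigned char *buf, size_t n, size_t *captured)
{
	struct ex_hdr h;
	size_t count = 0, off = 0;

	*captured = 0;
	while (off < n && n - off >= sizeof(h)) {
		memcpy(&h, buf + off, sizeof(h));
		if (h.hdrlen < sizeof(h))
			break;		/* malformed record */
		/* Packet data for this record starts at off + h.hdrlen. */
		*captured += h.caplen;
		count++;
		/* Advance to the next word-aligned record. */
		off += EX_WORDALIGN((size_t)h.hdrlen + h.caplen);
	}
	return (count);
}
.Ed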
821.Sh FILTER MACHINE
822A filter program is an array of instructions, with all branches forwardly
823directed, terminated by a
824.Em return
825instruction.
826Each instruction performs some action on the pseudo-machine state,
827which consists of an accumulator, index register, scratch memory store,
828and implicit program counter.
829.Pp
830The following structure defines the instruction format:
.Bd -literal
struct bpf_insn {
	u_short     code;
	u_char      jt;
	u_char      jf;
	bpf_u_int32 k;
};
.Ed
839.Pp
840The
841.Li k
842field is used in different ways by different instructions,
843and the
844.Li jt
845and
846.Li jf
847fields are used as offsets
848by the branch instructions.
849The opcodes are encoded in a semi-hierarchical fashion.
850There are eight classes of instructions:
851.Dv BPF_LD ,
852.Dv BPF_LDX ,
853.Dv BPF_ST ,
854.Dv BPF_STX ,
855.Dv BPF_ALU ,
856.Dv BPF_JMP ,
857.Dv BPF_RET ,
858and
859.Dv BPF_MISC .
860Various other mode and
861operator bits are or'd into the class to give the actual instructions.
862The classes and modes are defined in
863.In net/bpf.h .
864.Pp
865Below are the semantics for each defined
866.Nm
867instruction.
868We use the convention that A is the accumulator, X is the index register,
869P[] packet data, and M[] scratch memory store.
870P[i:n] gives the data at byte offset
871.Dq i
872in the packet,
873interpreted as a word (n=4),
874unsigned halfword (n=2), or unsigned byte (n=1).
875M[i] gives the i'th word in the scratch memory store, which is only
876addressed in word units.
877The memory store is indexed from 0 to
878.Dv BPF_MEMWORDS
879- 1.
880.Li k ,
881.Li jt ,
882and
883.Li jf
884are the corresponding fields in the
885instruction definition.
886.Dq len
887refers to the length of the packet.
888.Bl -tag -width BPF_STXx
889.It Dv BPF_LD
890These instructions copy a value into the accumulator.
891The type of the source operand is specified by an
892.Dq addressing mode
893and can be a constant
894.Pq Dv BPF_IMM ,
895packet data at a fixed offset
896.Pq Dv BPF_ABS ,
897packet data at a variable offset
898.Pq Dv BPF_IND ,
899the packet length
900.Pq Dv BPF_LEN ,
901or a word in the scratch memory store
902.Pq Dv BPF_MEM .
903For
904.Dv BPF_IND
905and
906.Dv BPF_ABS ,
907the data size must be specified as a word
908.Pq Dv BPF_W ,
909halfword
910.Pq Dv BPF_H ,
911or byte
912.Pq Dv BPF_B .
913The semantics of all the recognized
914.Dv BPF_LD
915instructions follow.
.Bd -literal
BPF_LD+BPF_W+BPF_ABS	A <- P[k:4]
BPF_LD+BPF_H+BPF_ABS	A <- P[k:2]
BPF_LD+BPF_B+BPF_ABS	A <- P[k:1]
BPF_LD+BPF_W+BPF_IND	A <- P[X+k:4]
BPF_LD+BPF_H+BPF_IND	A <- P[X+k:2]
BPF_LD+BPF_B+BPF_IND	A <- P[X+k:1]
BPF_LD+BPF_W+BPF_LEN	A <- len
BPF_LD+BPF_IMM		A <- k
BPF_LD+BPF_MEM		A <- M[k]
.Ed
927.It Dv BPF_LDX
928These instructions load a value into the index register.
929Note that
930the addressing modes are more restrictive than those of the accumulator loads,
931but they include
932.Dv BPF_MSH ,
933a hack for efficiently loading the IP header length.
.Bd -literal
BPF_LDX+BPF_W+BPF_IMM	X <- k
BPF_LDX+BPF_W+BPF_MEM	X <- M[k]
BPF_LDX+BPF_W+BPF_LEN	X <- len
BPF_LDX+BPF_B+BPF_MSH	X <- 4*(P[k:1]&0xf)
.Ed
940.It Dv BPF_ST
941This instruction stores the accumulator into the scratch memory.
942We do not need an addressing mode since there is only one possibility
943for the destination.
.Bd -literal
BPF_ST			M[k] <- A
.Ed
947.It Dv BPF_STX
948This instruction stores the index register in the scratch memory store.
.Bd -literal
BPF_STX			M[k] <- X
.Ed
952.It Dv BPF_ALU
953The alu instructions perform operations between the accumulator and
954index register or constant, and store the result back in the accumulator.
955For binary operations, a source mode is required
956.Dv ( BPF_K
957or
958.Dv BPF_X ) .
.Bd -literal
BPF_ALU+BPF_ADD+BPF_K	A <- A + k
BPF_ALU+BPF_SUB+BPF_K	A <- A - k
BPF_ALU+BPF_MUL+BPF_K	A <- A * k
BPF_ALU+BPF_DIV+BPF_K	A <- A / k
BPF_ALU+BPF_MOD+BPF_K	A <- A % k
BPF_ALU+BPF_AND+BPF_K	A <- A & k
BPF_ALU+BPF_OR+BPF_K	A <- A | k
BPF_ALU+BPF_XOR+BPF_K	A <- A ^ k
BPF_ALU+BPF_LSH+BPF_K	A <- A << k
BPF_ALU+BPF_RSH+BPF_K	A <- A >> k
BPF_ALU+BPF_ADD+BPF_X	A <- A + X
BPF_ALU+BPF_SUB+BPF_X	A <- A - X
BPF_ALU+BPF_MUL+BPF_X	A <- A * X
BPF_ALU+BPF_DIV+BPF_X	A <- A / X
BPF_ALU+BPF_MOD+BPF_X	A <- A % X
BPF_ALU+BPF_AND+BPF_X	A <- A & X
BPF_ALU+BPF_OR+BPF_X	A <- A | X
BPF_ALU+BPF_XOR+BPF_X	A <- A ^ X
BPF_ALU+BPF_LSH+BPF_X	A <- A << X
BPF_ALU+BPF_RSH+BPF_X	A <- A >> X
BPF_ALU+BPF_NEG		A <- -A
.Ed
.It Dv BPF_JMP
The jump instructions alter flow of control.
Conditional jumps
compare the accumulator against a constant
.Pq Dv BPF_K
or the index register
.Pq Dv BPF_X .
If the result is true (or non-zero),
the true branch is taken, otherwise the false branch is taken.
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
However, the jump always
.Pq Dv BPF_JA
opcode uses the 32-bit
.Li k
field as the offset, allowing arbitrarily distant destinations.
All conditionals use unsigned comparison conventions.
.Bd -literal
BPF_JMP+BPF_JA		pc += k
BPF_JMP+BPF_JGT+BPF_K	pc += (A > k) ? jt : jf
BPF_JMP+BPF_JGE+BPF_K	pc += (A >= k) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_K	pc += (A == k) ? jt : jf
BPF_JMP+BPF_JSET+BPF_K	pc += (A & k) ? jt : jf
BPF_JMP+BPF_JGT+BPF_X	pc += (A > X) ? jt : jf
BPF_JMP+BPF_JGE+BPF_X	pc += (A >= X) ? jt : jf
BPF_JMP+BPF_JEQ+BPF_X	pc += (A == X) ? jt : jf
BPF_JMP+BPF_JSET+BPF_X	pc += (A & X) ? jt : jf
.Ed
.It Dv BPF_RET
The return instructions terminate the filter program and specify the amount
of packet to accept (i.e., they return the truncation amount).
A return value of zero indicates that the packet should be ignored.
The return value is either a constant
.Pq Dv BPF_K
or the accumulator
.Pq Dv BPF_A .
.Bd -literal
BPF_RET+BPF_A		accept A bytes
BPF_RET+BPF_K		accept k bytes
.Ed
.It Dv BPF_MISC
The miscellaneous category was created for anything that does not
fit into the above classes, and for any new instructions that might need to
be added.
Currently, these are the register transfer instructions
that copy the index register to the accumulator or vice versa.
.Bd -literal
BPF_MISC+BPF_TAX	X <- A
BPF_MISC+BPF_TXA	A <- X
.Ed
.El
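.Pp
The instruction semantics listed above can be made concrete with a small
user-space interpreter.
The following is only a sketch covering a subset of the instruction set
(absolute and indirect loads, the
.Dv BPF_MSH
index load, conditional and unconditional jumps, and returns).
The structure, the function names, and the numeric encodings are local
stand-ins chosen for illustration; the encodings are assumed to match
those in
.In net/bpf.h
and should be checked against that header.
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bpf_insn. */
struct vm_insn {
	uint16_t code;
	uint8_t  jt, jf;
	uint32_t k;
};

/* Assumed encodings: class, size, mode, jump op and source bits. */
enum {
	C_LD = 0x00, C_LDX = 0x01, C_JMP = 0x05, C_RET = 0x06,
	S_W = 0x00, S_H = 0x08, S_B = 0x10,
	M_ABS = 0x20, M_IND = 0x40, M_MSH = 0xa0,
	J_JA = 0x00, J_JEQ = 0x10, J_JGT = 0x20, J_JGE = 0x30, J_JSET = 0x40,
	SRC_X = 0x08, RET_A = 0x10
};

/* Fetch 1, 2 or 4 bytes in network byte order; returns 0 on failure. */
static int
fetch(const uint8_t *p, size_t len, uint64_t off, int size, uint32_t *out)
{
	size_t n = (size == S_W) ? 4 : (size == S_H) ? 2 :
	    (size == S_B) ? 1 : 0;
	uint32_t v = 0;

	if (n == 0 || off > len || n > len - off)
		return (0);
	for (size_t i = 0; i < n; i++)
		v = (v << 8) | p[off + i];
	*out = v;
	return (1);
}

/* Run prog over the packet; returns the accepted length, or 0 to reject. */
uint32_t
vm_run(const struct vm_insn *prog, size_t n, const uint8_t *p, size_t len)
{
	uint32_t A = 0, X = 0, v, rhs;
	uint64_t off;
	int mode, taken;

	for (size_t pc = 0; pc < n; ) {
		const struct vm_insn *i = &prog[pc++];	/* pc is now past i */

		switch (i->code & 0x07) {
		case C_LD:	/* only ABS and IND modes are handled */
			mode = i->code & 0xe0;
			if (mode != M_ABS && mode != M_IND)
				return (0);
			off = (uint64_t)i->k + (mode == M_IND ? X : 0);
			if (!fetch(p, len, off, i->code & 0x18, &A))
				return (0);
			break;
		case C_LDX:	/* only the MSH mode is handled */
			if ((i->code & 0xe0) != M_MSH ||
			    !fetch(p, len, i->k, S_B, &v))
				return (0);
			X = 4 * (v & 0xf);
			break;
		case C_JMP:
			rhs = (i->code & SRC_X) ? X : i->k;
			switch (i->code & 0xf0) {
			case J_JA:
				pc += i->k;
				continue;
			case J_JEQ:  taken = (A == rhs); break;
			case J_JGT:  taken = (A > rhs); break;
			case J_JGE:  taken = (A >= rhs); break;
			case J_JSET: taken = (A & rhs) != 0; break;
			default:
				return (0);
			}
			pc += taken ? i->jt : i->jf;
			break;
		case C_RET:
			return ((i->code & 0x18) == RET_A ? A : i->k);
		default:	/* ST, STX, ALU and MISC are omitted */
			return (0);
		}
	}
	return (0);
}
```
Note that jump offsets are applied after the program counter has already
advanced past the jump, so an offset of 0 falls through to the next
instruction.
In this sketch any out-of-range load rejects the packet.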
.Pp
The
.Nm
interface provides the following macros to facilitate
array initializers:
.Fn BPF_STMT opcode operand
and
.Fn BPF_JUMP opcode operand true_offset false_offset .
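.Pp
As a rough illustration of what these initializer macros produce, the
sketch below defines local stand-ins
.Pq Dv MY_STMT , MY_JUMP , and Vt struct my_insn
that are assumed to mirror the layout used by
.In net/bpf.h .
The names and the raw opcode values are illustrative only.
```c
#include <stdint.h>

/* Hypothetical stand-in for struct bpf_insn. */
struct my_insn {
	uint16_t code;	/* opcode */
	uint8_t  jt;	/* offset taken when the test is true */
	uint8_t  jf;	/* offset taken when the test is false */
	uint32_t k;	/* generic operand */
};

/* A statement has no branch offsets; a jump carries both. */
#define MY_STMT(code, k)		{ (uint16_t)(code), 0, 0, (k) }
#define MY_JUMP(code, k, jt, jf)	{ (uint16_t)(code), (jt), (jf), (k) }

/*
 * Load the halfword at offset 12; accept 96 bytes if it is 0x0800,
 * otherwise reject.  The opcode values are assumptions:
 * 0x28 for BPF_LD+BPF_H+BPF_ABS, 0x15 for BPF_JMP+BPF_JEQ+BPF_K,
 * and 0x06 for BPF_RET+BPF_K.
 */
static const struct my_insn demo[] = {
	MY_STMT(0x28, 12),
	MY_JUMP(0x15, 0x0800, 0, 1),
	MY_STMT(0x06, 96),
	MY_STMT(0x06, 0),
};
```
Each macro expands to one brace-enclosed element, so a filter program is
an ordinary array whose length can be taken with
.Fn sizeof .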
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables controls the behaviour of the
.Nm
subsystem:
.Bl -tag -width indent
.It Va net.bpf.optimize_writers : No 0
Various programs use BPF to send (but not receive) raw packets
(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
They do not need incoming packets to be sent to them.
Turning this option on
attaches new BPF users to a write-only interface list until the program
explicitly specifies a read filter via
.Fn pcap_set_filter .
This avoids the performance degradation that write-only users would
otherwise cause on high-speed interfaces.
.It Va net.bpf.stats :
Binary interface for retrieving general statistics.
.It Va net.bpf.zerocopy_enable : No 0
Permits zero-copy to be used with net BPF readers.
Use with caution.
.It Va net.bpf.maxinsns : No 512
Maximum number of instructions that a BPF program can contain.
Use the
.Xr tcpdump 1
.Fl d
option to determine the approximate number of instructions for any filter.
.It Va net.bpf.maxbufsize : No 524288
Maximum buffer size to allocate for the packet buffer.
.It Va net.bpf.bufsize : No 4096
Default buffer size to allocate for the packet buffer.
.El
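.Pp
A program that generates filters may want to read one of these limits at
run time.
The sketch below is one possible way to do so; it assumes that
.Va net.bpf.maxinsns
is exported as an integer readable with
.Xr sysctlbyname 3 ,
and it falls back to the default of 512 listed above when the query
fails or when built on another system.
The function name is hypothetical.
```c
#include <stddef.h>
#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns the instruction limit, or the documented default of 512. */
int
read_bpf_maxinsns(void)
{
	int value = 512;
#ifdef __FreeBSD__
	size_t len = sizeof(value);

	/* Assumption: the variable is a plain int. */
	if (sysctlbyname("net.bpf.maxinsns", &value, &len, NULL, 0) != 0)
		value = 512;
#endif
	return (value);
}
```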
.Sh EXAMPLES
The following filter is taken from the Reverse ARP Daemon.
It accepts only Reverse ARP requests.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARPOP_REVREQUEST, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
		 sizeof(struct ether_header)),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
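.Pp
The branch offsets in the filter above are relative to the instruction
that follows the jump.
A minimal sketch of that arithmetic, using a hypothetical helper name,
shows that both failing comparisons (at indices 1 and 3) land on the
final rejecting return at index 5:
```c
#include <stddef.h>

/*
 * Index of the instruction reached by a conditional jump located at
 * index i whose selected branch offset (jt or jf) is offset.
 */
size_t
jump_target(size_t i, unsigned int offset)
{
	return (i + 1 + offset);
}
```
For example, the first jump sits at index 1 with a false offset of 3,
giving 1 + 1 + 3 = 5, and the second sits at index 3 with a false offset
of 1, giving 3 + 1 + 1 = 5.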
.Pp
This filter accepts only IP packets between host 128.3.112.15 and
128.3.112.35.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
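.Pp
The constants 0x8003700f and 0x80037023 in the filter above are the two
host addresses written as 32-bit values in network byte order, which is
how a word load presents them in the accumulator.
A small sketch, with a hypothetical helper name, of how such a constant
can be derived from a dotted quad:
```c
#include <stdint.h>

/* Pack the four octets of a.b.c.d with a as the most significant byte. */
uint32_t
ipv4_key(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return (((uint32_t)a << 24) | ((uint32_t)b << 16) |
	    ((uint32_t)c << 8) | (uint32_t)d);
}
```
With this helper, 128.3.112.15 yields 0x8003700f and 128.3.112.35 yields
0x80037023.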
.Pp
Finally, this filter returns only TCP finger packets.
We must parse the IP header to reach the TCP header.
The
.Dv BPF_JSET
instruction
checks that the IP fragment offset is 0 so we are sure
that we have a TCP header.
.Bd -literal
struct bpf_insn insns[] = {
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
	BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
	BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
	BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
	BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
	BPF_STMT(BPF_RET+BPF_K, 0),
};
.Ed
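.Pp
To make the offsets in the finger filter easier to follow, the same
decision can be written as ordinary C over a raw Ethernet frame.
This is only an illustrative equivalent, not code taken from any
existing program; the function names are hypothetical, and the values
0x0800 and 6 are written out in place of
.Dv ETHERTYPE_IP
and
.Dv IPPROTO_TCP .
```c
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit big-endian value. */
static unsigned int
rd16(const uint8_t *p)
{
	return (((unsigned int)p[0] << 8) | p[1]);
}

/*
 * Returns 1 if the frame is an IPv4 TCP segment with fragment offset 0
 * whose source or destination port is 79, and 0 otherwise.
 */
int
is_finger(const uint8_t *p, size_t len)
{
	size_t ihl;

	if (len < 34)			/* Ethernet (14) + minimal IP (20) */
		return (0);
	if (rd16(p + 12) != 0x0800)	/* Ethernet type: IP */
		return (0);
	if (p[23] != 6)			/* IP protocol: TCP */
		return (0);
	if (rd16(p + 20) & 0x1fff)	/* non-zero fragment offset */
		return (0);
	ihl = 4 * (size_t)(p[14] & 0xf);	/* what BPF_MSH loads into X */
	if (len < 14 + ihl + 4)		/* both port fields must be present */
		return (0);
	/* Source port at X + 14, destination port at X + 16. */
	return (rd16(p + 14 + ihl) == 79 || rd16(p + 16 + ihl) == 79);
}
```
The explicit length checks stand in for the filter machine's behaviour of
rejecting a packet when a load falls outside it.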
.Sh SEE ALSO
.Xr tcpdump 1 ,
.Xr ioctl 2 ,
.Xr kqueue 2 ,
.Xr poll 2 ,
.Xr select 2 ,
.Xr ng_bpf 4 ,
.Xr bpf 9
.Rs
.%A McCanne, S.
.%A Jacobson, V.
.%T "An efficient, extensible, and portable network monitor"
.Re
.Sh HISTORY
The Enet packet filter was created in 1980 by Mike Accetta and
Rick Rashid at Carnegie-Mellon University.
Jeffrey Mogul, at
Stanford, ported the code to
.Bx
and continued its development from
1983 on.
Since then, it has evolved into the Ultrix Packet Filter at
.Tn DEC ,
a
.Tn STREAMS
.Tn NIT
module under
.Tn SunOS 4.1 ,
and
.Tn BPF .
.Sh AUTHORS
.An -nosplit
.An Steven McCanne ,
of Lawrence Berkeley Laboratory, implemented BPF in
Summer 1990.
Much of the design is due to
.An Van Jacobson .
.Pp
Support for zero-copy buffers was added by
.An Robert N. M. Watson
under contract to Seccuris Inc.
.Sh BUGS
The read buffer must be of a fixed size (returned by the
.Dv BIOCGBLEN
ioctl).
.Pp
A file that does not request promiscuous mode may receive promiscuously
received packets as a side effect of another file requesting this
mode on the same hardware interface.
This could be fixed in the kernel with additional processing overhead.
However, we favor the model where
all files must assume that the interface is promiscuous, and if
so desired, must utilize a filter to reject foreign packets.
.Pp
The
.Dv SEESENT ,
.Dv DIRECTION ,
and
.Dv FEEDBACK
settings have been observed to work incorrectly on some interface
types, including those with hardware loopback rather than software loopback,
and point-to-point interfaces.
They appear to function correctly on a
broad range of Ethernet-style interfaces.
1193