.\"
.\" Copyright (C) 2022 Alexander Chernikov <melifaro@FreeBSD.org>.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd November 30, 2022
.Dt NETLINK 4
.Os
.Sh NAME
.Nm Netlink
.Nd Kernel network configuration protocol
.Sh SYNOPSIS
.In netlink/netlink.h
.In netlink/netlink_route.h
.Ft int
.Fn socket AF_NETLINK SOCK_RAW "int family"
.Sh DESCRIPTION
Netlink is a user-kernel message-based communication protocol primarily used
for network stack configuration.
Netlink is easily extensible and supports large dumps and event
notifications, all via a single socket.
The protocol is fully asynchronous, allowing one to issue and track multiple
requests at once.
Netlink consists of multiple families, each of which commonly groups the
commands belonging to a particular kernel subsystem.
Currently, the supported families are:
.Pp
.Bd -literal -offset indent -compact
NETLINK_ROUTE	network configuration
NETLINK_GENERIC	"container" family
.Ed
.Pp
The
.Dv NETLINK_ROUTE
family handles the configuration of interfaces, addresses, neighbors, routes,
and VNETs.
More details can be found in
.Xr rtnetlink 4 .
The
.Dv NETLINK_GENERIC
family serves as a
.Do container Dc ,
allowing other families to be registered under the
.Dv NETLINK_GENERIC
umbrella.
This approach allows using a single netlink socket to interact with
multiple netlink families at once.
More details can be found in
.Xr genetlink 4 .
.Pp
Netlink has its own sockaddr structure:
.Bd -literal
struct sockaddr_nl {
	uint8_t		nl_len;		/* sizeof(struct sockaddr_nl) */
	sa_family_t	nl_family;	/* netlink family */
	uint16_t	nl_pad;		/* reserved, set to 0 */
	uint32_t	nl_pid;		/* automatically selected, set to 0 */
	uint32_t	nl_groups;	/* multicast groups mask to bind to */
};
.Ed
.Pp
Typically, filling this structure is not required for socket operations.
It is presented here for completeness.
.Sh PROTOCOL DESCRIPTION
The protocol is message-based.
Each message starts with the mandatory
.Va nlmsghdr
header, followed by the family-specific header and the list of
type-length-value pairs (TLVs).
TLVs can be nested.
All headers and TLVs are padded to 4-byte boundaries.
Each
.Xr send 2
or
.Xr recv 2
system call may carry multiple messages.
.Ss BASE HEADER
.Bd -literal
struct nlmsghdr {
	uint32_t nlmsg_len;   /* Length of message including header */
	uint16_t nlmsg_type;  /* Message type identifier */
	uint16_t nlmsg_flags; /* Flags (NLM_F_) */
	uint32_t nlmsg_seq;   /* Sequence number */
	uint32_t nlmsg_pid;   /* Sending process port ID */
};
.Ed
.Pp
The
.Va nlmsg_len
field stores the whole message length, in bytes, including the header.
This length has to be rounded up to the nearest 4-byte boundary when
iterating over messages.
The
.Va nlmsg_type
field represents the command/request type.
This value is family-specific.
The list of supported commands can be found in the relevant family
header file.
The
.Va nlmsg_seq
field is a user-provided request identifier.
An application can track the result of an operation by matching the
.Va nlmsg_seq
value of the
.Dv NLMSG_ERROR
reply against the value sent in the request.
The
.Va nlmsg_pid
field is the message sender id.
This field is optional for userland.
The kernel sender id is zero.
The
.Va nlmsg_flags
field contains the message-specific flags.
The following generic flags are defined:
.Pp
.Bd -literal -offset indent -compact
NLM_F_REQUEST	Indicates that the message is an actual request to the kernel
NLM_F_ACK	Request an explicit ACK message with an operation result
.Ed
.Pp
The following generic flags are defined for the "GET" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_ROOT	Return the whole dataset
NLM_F_MATCH	Return all entries matching the criteria
.Ed
These two flags are typically used together, aliased to
.Dv NLM_F_DUMP .
.Pp
The following generic flags are defined for the "NEW" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_CREATE	Create an object if none exists
NLM_F_EXCL	Don't replace an object if it exists
NLM_F_REPLACE	Replace an existing matching object
NLM_F_APPEND	Append to an existing object
.Ed
.Pp
The following generic flags are defined for the replies:
.Pp
.Bd -literal -offset indent -compact
NLM_F_MULTI	Indicates that the message is part of the message group
NLM_F_DUMP_INTR	Indicates that the state dump was not completed
NLM_F_DUMP_FILTERED	Indicates that the dump was filtered per request
NLM_F_CAPPED	Indicates the original message was capped to its header
NLM_F_ACK_TLVS	Indicates that extended ACK TLVs were included
.Ed
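.Pp
The rule for iterating over the messages returned by a single
.Xr recv 2
call can be illustrated with a minimal sketch.
It is not taken from the netlink sources: the
.Vt "struct nlmsghdr_sketch"
type, the
.Dv SKETCH_ALIGN
macro and the
.Fn sketch_count_messages
function are hypothetical names that mirror the base header, so that the
example builds without the netlink headers.
Real code would normally rely on the definitions from
.In netlink/netlink.h
instead.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the base header described above. */
struct nlmsghdr_sketch {
	uint32_t nlmsg_len;	/* Length of message including header */
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

/* Round a length up to the 4-byte boundary used by the protocol. */
#define SKETCH_ALIGN(len)	(((size_t)(len) + 3) & ~(size_t)3)

/*
 * Count the messages stored in one receive buffer.
 * Returns -1 if a header is truncated or announces an impossible length.
 */
static int
sketch_count_messages(const void *buf, size_t buflen)
{
	const uint8_t *p = buf;
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);
	while (off < buflen) {
		struct nlmsghdr_sketch hdr;

		if (buflen - off < sizeof(hdr))
			return (-1);
		/* Copy the header out to avoid unaligned access. */
		memcpy(&hdr, p + off, sizeof(hdr));
		if (hdr.nlmsg_len < sizeof(hdr) ||
		    hdr.nlmsg_len > buflen - off)
			return (-1);
		count++;
		/* The next header starts at the aligned end of this one. */
		off += SKETCH_ALIGN(hdr.nlmsg_len);
	}
	return (count);
}
```
.Ed
.Pp
The sketch advances by the aligned length rather than by
.Va nlmsg_len
itself, because
.Va nlmsg_len
does not include the trailing padding.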
.Ss TLVs
Most messages encode their attributes as type-length-value pairs (TLVs).
The base TLV header:
.Bd -literal
struct nlattr {
	uint16_t nla_len;	/* Total attribute length */
	uint16_t nla_type;	/* Attribute type */
};
.Ed
The TLV type
.Pq Va nla_type
scope is typically the message type or group within a family.
For example, the
.Dv RTN_MULTICAST
type value is only valid for
.Dv RTM_NEWROUTE ,
.Dv RTM_DELROUTE
and
.Dv RTM_GETROUTE
messages.
TLVs can be nested; in that case internal TLVs may have their own sub-types.
All TLVs are packed with 4-byte padding.
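.Pp
As an illustration of the padding and nesting rules, the following sketch
searches a TLV area for an attribute of a given type.
The
.Vt "struct nlattr_sketch"
type, the
.Dv SKETCH_NLA_ALIGN
macro and the
.Fn sketch_find_attr
function are hypothetical names introduced for this example only; they mirror
the TLV header shown above and are not part of the netlink API.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the TLV header described above. */
struct nlattr_sketch {
	uint16_t nla_len;	/* Total attribute length, header included */
	uint16_t nla_type;	/* Attribute type */
};

/* Round a length up to the 4-byte boundary used for TLVs. */
#define SKETCH_NLA_ALIGN(len)	(((size_t)(len) + 3) & ~(size_t)3)

/*
 * Find the first attribute of the given type in a TLV area.
 * On success, returns a pointer to the payload and stores the payload
 * length in *plen.  Returns NULL if the type is absent or the area is
 * malformed.
 */
static const void *
sketch_find_attr(const void *buf, size_t buflen, uint16_t type,
    size_t *plen)
{
	const uint8_t *p = buf;
	size_t off = 0;

	assert(buf != NULL && plen != NULL);
	while (off + sizeof(struct nlattr_sketch) <= buflen) {
		struct nlattr_sketch nla;

		memcpy(&nla, p + off, sizeof(nla));
		if (nla.nla_len < sizeof(nla) || nla.nla_len > buflen - off)
			return (NULL);
		if (nla.nla_type == type) {
			*plen = nla.nla_len - sizeof(nla);
			return (p + off + sizeof(nla));
		}
		/* Skip the payload and its padding. */
		off += SKETCH_NLA_ALIGN(nla.nla_len);
	}
	return (NULL);
}
```
.Ed
.Pp
Since the payload of a nested TLV is itself a TLV area, the same function can
be called again on the returned pointer and length to look up a sub-type.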
.Ss CONTROL MESSAGES
A number of generic control messages are reserved in each family.
.Pp
.Dv NLMSG_ERROR
reports the operation result if requested, optionally followed by
the metadata TLVs.
The value of
.Va nlmsg_seq
is set to its value in the original message, while
.Va nlmsg_pid
is set to the socket pid of the original socket.
The operation result is reported via
.Vt "struct nlmsgerr" :
.Bd -literal
struct nlmsgerr {
	int	error;		/* Standard errno */
	struct	nlmsghdr msg;	/* Original message header */
};
.Ed
If the
.Dv NETLINK_CAP_ACK
socket option is not set, the remainder of the original message will follow.
If the
.Dv NETLINK_EXT_ACK
socket option is set, the kernel may add a
.Dv NLMSGERR_ATTR_MSG
string TLV with the textual error description, optionally followed by the
.Dv NLMSGERR_ATTR_OFFS
TLV, indicating the offset from the message start that triggered an error.
Some operations may return additional metadata encapsulated in the
.Dv NLMSGERR_ATTR_COOKIE
TLV.
The metadata format is specific to the operation.
If the operation reply is a multipart message, then no
.Dv NLMSG_ERROR
reply is generated, only a
.Dv NLMSG_DONE
message, closing the multipart sequence.
.Pp
.Dv NLMSG_DONE
indicates the end of the message group: typically, the end of the dump.
It contains a single
.Vt int
field, describing the dump result as a standard errno value.
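.Pp
The way an application might use these two control messages to decide that a
request has completed can be sketched as follows.
The sketch is only an illustration under stated assumptions: all
.Dq sketch
names are hypothetical local mirrors of the structures above, and the numeric
values 2 and 3 used for
.Dv NLMSG_ERROR
and
.Dv NLMSG_DONE
are assumptions that should be replaced by the constants from
.In netlink/netlink.h
in real code.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirrors of the structures described above. */
struct nlmsghdr_sketch {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

struct nlmsgerr_sketch {
	int	error;			/* Standard errno */
	struct	nlmsghdr_sketch msg;	/* Original message header */
};

/* Assumed values; real code takes them from the system header. */
#define SKETCH_NLMSG_ERROR	2
#define SKETCH_NLMSG_DONE	3

/*
 * Inspect one reply message for the request identified by seq.
 * Returns 1 and stores the reported errno value in *perr when the
 * message completes the request, 0 when the message is unrelated or
 * more replies are expected, and -1 when the message is malformed.
 */
static int
sketch_request_finished(const void *msg, size_t len, uint32_t seq,
    int *perr)
{
	const uint8_t *p = msg;
	struct nlmsghdr_sketch hdr;

	assert(msg != NULL && perr != NULL);
	if (len < sizeof(hdr))
		return (-1);
	memcpy(&hdr, p, sizeof(hdr));
	if (hdr.nlmsg_len < sizeof(hdr) || hdr.nlmsg_len > len)
		return (-1);
	if (hdr.nlmsg_seq != seq)
		return (0);
	if (hdr.nlmsg_type == SKETCH_NLMSG_ERROR) {
		struct nlmsgerr_sketch e;

		if (hdr.nlmsg_len < sizeof(hdr) + sizeof(e))
			return (-1);
		memcpy(&e, p + sizeof(hdr), sizeof(e));
		*perr = e.error;
		return (1);
	}
	if (hdr.nlmsg_type == SKETCH_NLMSG_DONE) {
		int res;

		if (hdr.nlmsg_len < sizeof(hdr) + sizeof(res))
			return (-1);
		memcpy(&res, p + sizeof(hdr), sizeof(res));
		*perr = res;
		return (1);
	}
	return (0);
}
```
.Ed
.Pp
Matching on
.Va nlmsg_seq
first lets several requests stay in flight on the same socket, each one
completed by its own
.Dv NLMSG_ERROR
or
.Dv NLMSG_DONE
message.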
.Sh SOCKET OPTIONS
Netlink supports a number of custom socket options, which can be set with
.Xr setsockopt 2
with the
.Dv SOL_NETLINK
.Fa level :
.Bl -tag -width indent
.It Dv NETLINK_ADD_MEMBERSHIP
Subscribes to the notifications for the specific group (int).
.It Dv NETLINK_DROP_MEMBERSHIP
Unsubscribes from the notifications for the specific group (int).
.It Dv NETLINK_LIST_MEMBERSHIPS
Lists the memberships as a bitmask.
.It Dv NETLINK_CAP_ACK
Instructs the kernel to send the original message header in the reply
without the message body.
.It Dv NETLINK_EXT_ACK
Acknowledges ability to receive additional TLVs in the ACK message.
.El
.Pp
Additionally, netlink overrides the following socket options from the
.Dv SOL_SOCKET
.Fa level :
.Bl -tag -width indent
.It Dv SO_RCVBUF
Sets the maximum size of the socket receive buffer.
If the caller has
.Dv PRIV_NET_ROUTE
permission, the value can exceed the currently-set
.Va kern.ipc.maxsockbuf
value.
.El
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables is available to tweak run-time parameters:
.Bl -tag -width indent
.It Va net.netlink.sendspace
Default send buffer for the netlink socket.
Note that the socket sendspace has to be at least as long as the longest
message that can be transmitted via this socket.
.It Va net.netlink.recvspace
Default receive buffer for the netlink socket.
Note that the socket recvspace has to be at least as long as the longest
message that can be received from this socket.
.It Va net.netlink.nl_maxsockbuf
Maximum receive buffer for the netlink socket that can be set via the
.Dv SO_RCVBUF
socket option.
.El
.Sh DEBUGGING
Netlink implements per-functional-unit debugging, with different severities
controllable via the
.Va net.netlink.debug
branch.
These messages are logged in the kernel message buffer and can be seen in
.Xr dmesg 8 .
The following severity levels are defined:
.Bl -tag -width indent
.It Dv LOG_DEBUG(7)
Rare events or per-socket errors are reported here.
This is the default level, not impacting production performance.
.It Dv LOG_DEBUG2(8)
Socket events such as group memberships, privilege checks, commands and dumps
are logged.
This level does not incur significant performance overhead.
.It Dv LOG_DEBUG3(9)
All socket events, as well as each dumped or modified entity, are logged.
Turning it on may result in significant performance overhead.
.El
.Sh ERRORS
Netlink reports operation results, including errors and error metadata, by
sending an
.Dv NLMSG_ERROR
message for each request message.
The following errors can be returned:
.Bl -tag -width Er
.It Bq Er EPERM
when the current privileges are insufficient to perform the required operation;
.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
when the system runs out of memory for
an internal data structure;
.It Bq Er ENOTSUP
when the requested command is not supported by the family or
the family is not supported;
.It Bq Er EINVAL
when some necessary TLVs are missing or invalid
(detailed information may be provided in the
.Dv NLMSGERR_ATTR_MSG
and
.Dv NLMSGERR_ATTR_OFFS
TLVs);
.It Bq Er ENOENT
when trying to delete a non-existent object.
.El
.Pp
Additionally, a socket operation itself may fail with one of the errors
specified in
.Xr socket 2 ,
.Xr recv 2
or
.Xr send 2 .
.Sh SEE ALSO
.Xr genetlink 4 ,
.Xr rtnetlink 4
.Rs
.%A "J. Salim"
.%A "H. Khosravi"
.%A "A. Kleen"
.%A "A. Kuznetsov"
.%T "Linux Netlink as an IP Services Protocol"
.%O "RFC 3549"
.Re
.Sh HISTORY
The netlink protocol appeared in
.Fx 13.2 .
.Sh AUTHORS
Netlink was implemented by
.An -nosplit
.An Alexander Chernikov Aq Mt melifaro@FreeBSD.org .
It was derived from the Google Summer of Code 2021 project by
.An Ng Peng Nam Sean .