.\"
.\" Copyright (C) 2022 Alexander Chernikov <melifaro@FreeBSD.org>.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd November 30, 2022
.Dt NETLINK 4
.Os
.Sh NAME
.Nm Netlink
.Nd Kernel network configuration protocol
.Sh SYNOPSIS
.In netlink/netlink.h
.In netlink/netlink_route.h
.Ft int
.Fn socket AF_NETLINK SOCK_RAW "int family"
.Sh DESCRIPTION
Netlink is a user-kernel message-based communication protocol primarily used
for network stack configuration.
Netlink is easily extendable and supports large dumps and event
notifications, all via a single socket.
The protocol is fully asynchronous, allowing one to issue and track multiple
requests at once.
Netlink consists of multiple families, which commonly group the commands
belonging to a particular kernel subsystem.
Currently, the supported families are:
.Pp
.Bd -literal -offset indent -compact
NETLINK_ROUTE	network configuration,
NETLINK_GENERIC	"container" family
.Ed
.Pp
The
.Dv NETLINK_ROUTE
family handles the configuration of interfaces, addresses, neighbors, routes,
and VNETs.
More details can be found in
.Xr rtnetlink 4 .
The
.Dv NETLINK_GENERIC
family serves as a
.Do container Dc ,
allowing registering other families under the
.Dv NETLINK_GENERIC
umbrella.
This approach allows using a single netlink socket to interact with
multiple netlink families at once.
More details can be found in
.Xr genetlink 4 .
.Pp
Netlink has its own sockaddr structure:
.Bd -literal
struct sockaddr_nl {
	uint8_t		nl_len;		/* sizeof(sockaddr_nl) */
	sa_family_t	nl_family;	/* netlink family */
	uint16_t	nl_pad;		/* reserved, set to 0 */
	uint32_t	nl_pid;		/* automatically selected, set to 0 */
	uint32_t	nl_groups;	/* multicast groups mask to bind to */
};
.Ed
.Pp
Typically, filling this structure is not required for socket operations.
It is presented here for completeness.
.Sh PROTOCOL DESCRIPTION
The protocol is message-based.
Each message starts with the mandatory
.Va nlmsghdr
header, followed by the family-specific header and the list of
type-length-value pairs (TLVs).
TLVs can be nested.
All headers and TLVs are padded to 4-byte boundaries.
Each
.Xr send 2
or
.Xr recv 2
system call may contain multiple messages.
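.Pp
As a minimal sketch of the framing rules above, the following C fragment
walks a buffer filled by a single
.Xr recv 2
call and counts the messages in it, stepping by each message length rounded
up to 4 bytes.
It relies on
.Vt "struct nlmsghdr" ,
described in
.Sx BASE HEADER .
The helper names
.Fn nl_align4
and
.Fn count_messages
are hypothetical and not part of the netlink API, and the Linux include is
only an assumption that lets the fragment build outside
.Fx .
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stddef.h>
#include <string.h>
#ifdef __FreeBSD__
#include <netlink/netlink.h>
#else
#include <linux/netlink.h>	/* assumption: build on Linux as well */
#endif

/* Round a length up to the 4-byte boundary used by netlink. */
size_t
nl_align4(size_t len)
{
	return ((len + 3) & ~(size_t)3);
}

/*
 * Count the complete messages in buf[0..len).
 * Returns -1 if a header is truncated or announces a length that
 * does not fit into the remaining buffer.
 */
int
count_messages(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	struct nlmsghdr hdr;
	size_t step;
	int count = 0;

	while (len > 0) {
		if (len < sizeof(hdr))
			return (-1);
		/* Copy the header out to avoid unaligned access. */
		memcpy(&hdr, p, sizeof(hdr));
		if (hdr.nlmsg_len < sizeof(hdr) || hdr.nlmsg_len > len)
			return (-1);
		count++;
		/* Tolerate a final message without trailing padding. */
		step = nl_align4(hdr.nlmsg_len);
		if (step > len)
			step = len;
		p += step;
		len -= step;
	}
	return (count);
}
```
.Ed
.Pp
A caller would typically pass the buffer and the byte count returned by
.Xr recv 2
and treat a negative result as a malformed datagram.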
.Ss BASE HEADER
.Bd -literal
struct nlmsghdr {
	uint32_t nlmsg_len;   /* Length of message including header */
	uint16_t nlmsg_type;  /* Message type identifier */
	uint16_t nlmsg_flags; /* Flags (NLM_F_) */
	uint32_t nlmsg_seq;   /* Sequence number */
	uint32_t nlmsg_pid;   /* Sending process port ID */
};
.Ed
.Pp
The
.Va nlmsg_len
field stores the whole message length, in bytes, including the header.
This length has to be rounded up to the nearest 4-byte boundary when
iterating over messages.
The
.Va nlmsg_type
field represents the command/request type.
This value is family-specific.
The list of supported commands can be found in the relevant family
header file.
The
.Va nlmsg_seq
field is a user-provided request identifier.
An application can track the operation result using the
.Dv NLMSG_ERROR
messages and matching the
.Va nlmsg_seq .
The
.Va nlmsg_pid
field is the message sender id.
This field is optional for userland.
The kernel sender id is zero.
The
.Va nlmsg_flags
field contains the message-specific flags.
The following generic flags are defined:
.Pp
.Bd -literal -offset indent -compact
NLM_F_REQUEST	Indicates that the message is an actual request to the kernel
NLM_F_ACK	Request an explicit ACK message with an operation result
.Ed
.Pp
The following generic flags are defined for the "GET" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_ROOT	Return the whole dataset
NLM_F_MATCH	Return all entries matching the criteria
.Ed
.Pp
These two flags are typically used together, aliased to
.Dv NLM_F_DUMP .
.Pp
The following generic flags are defined for the "NEW" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_CREATE	Create an object if none exists
NLM_F_EXCL	Don't replace an object if it exists
NLM_F_REPLACE	Replace an existing matching object
NLM_F_APPEND	Append to an existing object
.Ed
.Pp
The following generic flags are defined for the replies:
.Pp
.Bd -literal -offset indent -compact
NLM_F_MULTI	Indicates that the message is part of the message group
NLM_F_DUMP_INTR	Indicates that the state dump was not completed
NLM_F_DUMP_FILTERED	Indicates that the dump was filtered per request
NLM_F_CAPPED	Indicates the original message was capped to its header
NLM_F_ACK_TLVS	Indicates that extended ACK TLVs were included
.Ed
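.Pp
To illustrate how the header fields and flags fit together, the following
sketch fills in a request header.
The function name
.Fn init_request
and its parameters are hypothetical; the message type is passed in by the
caller because it is family-specific, and the Linux include is only an
assumption that lets the fragment build outside
.Fx .
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stdint.h>
#include <string.h>
#ifdef __FreeBSD__
#include <netlink/netlink.h>
#else
#include <linux/netlink.h>	/* assumption: build on Linux as well */
#endif

/*
 * Prepare the base header of a request.
 * payload_len is the size of the family-specific header plus TLVs
 * that the caller places right after the base header.
 */
void
init_request(struct nlmsghdr *hdr, uint16_t type, uint16_t flags,
    uint32_t seq, uint32_t payload_len)
{
	memset(hdr, 0, sizeof(*hdr));
	/* nlmsg_len covers the header itself, without padding. */
	hdr->nlmsg_len = sizeof(*hdr) + payload_len;
	hdr->nlmsg_type = type;
	/* Every request to the kernel carries NLM_F_REQUEST. */
	hdr->nlmsg_flags = NLM_F_REQUEST | flags;
	/* The same seq value comes back in the matching reply. */
	hdr->nlmsg_seq = seq;
	/* nlmsg_pid stays 0: the field is optional for userland. */
}
```
.Ed
.Pp
Passing
.Dv NLM_F_ACK
as
.Fa flags
asks for an explicit result message, while
.Dv NLM_F_DUMP
asks a "GET" request type for the whole dataset.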
.Ss TLVs
Most messages encode their attributes as type-length-value pairs (TLVs).
The base TLV header:
.Bd -literal
struct nlattr {
	uint16_t nla_len;	/* Total attribute length */
	uint16_t nla_type;	/* Attribute type */
};
.Ed
.Pp
The TLV type
.Pq Va nla_type
scope is typically the message type or group within a family.
For example, the
.Dv RTN_MULTICAST
type value is only valid for
.Dv RTM_NEWROUTE ,
.Dv RTM_DELROUTE
and
.Dv RTM_GETROUTE
messages.
TLVs can be nested; in that case internal TLVs may have their own sub-types.
All TLVs are packed with 4-byte padding.
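.Pp
The TLV layout can be sketched with a small lookup routine that scans an
attribute area for the first TLV of a given type.
It assumes that
.Va nla_len
counts the 4-byte
.Vt "struct nlattr"
header together with the payload.
The names
.Fn attr_align4
and
.Fn find_attr
are hypothetical, and the Linux include is only an assumption that lets the
fragment build outside
.Fx .
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __FreeBSD__
#include <netlink/netlink.h>
#else
#include <linux/netlink.h>	/* assumption: build on Linux as well */
#endif

/* Round a length up to the 4-byte TLV padding boundary. */
size_t
attr_align4(size_t len)
{
	return ((len + 3) & ~(size_t)3);
}

/*
 * Find the first attribute of the given type in buf[0..len).
 * On success store the payload location and size and return 0.
 * Return -1 if the type is absent or a TLV header is malformed.
 */
int
find_attr(const void *buf, size_t len, uint16_t type,
    const void **payload, size_t *payload_len)
{
	const unsigned char *p = buf;
	struct nlattr nla;
	size_t step;

	while (len >= sizeof(nla)) {
		memcpy(&nla, p, sizeof(nla));
		if (nla.nla_len < sizeof(nla) || nla.nla_len > len)
			return (-1);
		if (nla.nla_type == type) {
			*payload = p + sizeof(nla);
			*payload_len = nla.nla_len - sizeof(nla);
			return (0);
		}
		/* Skip the TLV together with its padding. */
		step = attr_align4(nla.nla_len);
		if (step > len)
			step = len;
		p += step;
		len -= step;
	}
	return (-1);
}
```
.Ed
.Pp
For a nested TLV, the same routine can be applied again to the returned
payload, using the sub-types defined for that container.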
.Ss CONTROL MESSAGES
A number of generic control messages are reserved in each family.
.Pp
.Dv NLMSG_ERROR
reports the operation result if requested, optionally followed by
the metadata TLVs.
The value of
.Va nlmsg_seq
is set to its value in the original message, while
.Va nlmsg_pid
is set to the socket pid of the original socket.
The operation result is reported via
.Vt "struct nlmsgerr" :
.Bd -literal
struct nlmsgerr {
	int	error;		/* Standard errno */
	struct	nlmsghdr msg;	/* Original message header */
};
.Ed
.Pp
If the
.Dv NETLINK_CAP_ACK
socket option is not set, the remainder of the original message will follow.
If the
.Dv NETLINK_EXT_ACK
socket option is set, the kernel may add a
.Dv NLMSGERR_ATTR_MSG
string TLV with the textual error description, optionally followed by the
.Dv NLMSGERR_ATTR_OFFS
TLV, indicating the offset from the message start that triggered an error.
Some operations may return additional metadata encapsulated in the
.Dv NLMSGERR_ATTR_COOKIE
TLV.
The metadata format is specific to the operation.
If the operation reply is a multipart message, then no
.Dv NLMSG_ERROR
reply is generated, only a
.Dv NLMSG_DONE
message, closing the multipart sequence.
.Pp
.Dv NLMSG_DONE
indicates the end of the message group: typically, the end of the dump.
It contains a single
.Vt int
field, describing the dump result as a standard errno value.
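.Pp
A minimal sketch of tracking an operation result, as described above, is to
look for the
.Dv NLMSG_ERROR
message whose
.Va nlmsg_seq
equals the one used in the request and to read the
.Va error
field from
.Vt "struct nlmsgerr" .
The function name
.Fn match_ack
is hypothetical, and the Linux include is only an assumption that lets the
fragment build outside
.Fx .
The error value is passed through unchanged; a value of zero reports success.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __FreeBSD__
#include <netlink/netlink.h>
#else
#include <linux/netlink.h>	/* assumption: build on Linux as well */
#endif

/*
 * Inspect one complete message of len bytes.
 * Returns 1 and stores the reported value in *error if the message
 * is the NLMSG_ERROR reply to the request sent with sequence seq.
 * Returns 0 for any other message and -1 if it is truncated.
 */
int
match_ack(const void *msg, size_t len, uint32_t seq, int *error)
{
	struct nlmsghdr hdr;
	struct nlmsgerr err;

	if (len < sizeof(hdr))
		return (-1);
	memcpy(&hdr, msg, sizeof(hdr));
	if (hdr.nlmsg_type != NLMSG_ERROR || hdr.nlmsg_seq != seq)
		return (0);
	if (hdr.nlmsg_len < sizeof(hdr) + sizeof(err) ||
	    hdr.nlmsg_len > len)
		return (-1);
	/* struct nlmsgerr starts right after the base header. */
	memcpy(&err, (const unsigned char *)msg + sizeof(hdr),
	    sizeof(err));
	*error = err.error;
	return (1);
}
```
.Ed
.Pp
Since multipart replies end with
.Dv NLMSG_DONE
instead, a receive loop built on this sketch would need a separate check for
that message type.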
.Sh SOCKET OPTIONS
Netlink supports a number of custom socket options, which can be set with
.Xr setsockopt 2
with the
.Dv SOL_NETLINK
.Fa level :
.Bl -tag -width indent
.It Dv NETLINK_ADD_MEMBERSHIP
Subscribes to the notifications for the specific group (int).
.It Dv NETLINK_DROP_MEMBERSHIP
Unsubscribes from the notifications for the specific group (int).
.It Dv NETLINK_LIST_MEMBERSHIPS
Lists the memberships as a bitmask.
.It Dv NETLINK_CAP_ACK
Instructs the kernel to send the original message header in the reply
without the message body.
.It Dv NETLINK_EXT_ACK
Acknowledges ability to receive additional TLVs in the ACK message.
.El
.Pp
Additionally, netlink overrides the following socket options from the
.Dv SOL_SOCKET
.Fa level :
.Bl -tag -width indent
.It Dv SO_RCVBUF
Sets the maximum size of the socket receive buffer.
If the caller has
.Dv PRIV_NET_ROUTE
permission, the value can exceed the currently-set
.Va kern.ipc.maxsockbuf
value.
.El
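.Pp
The options above might be applied to an open netlink socket roughly as
follows.
This is a sketch under stated assumptions: the function names
.Fn enable_compact_acks
and
.Fn join_group
are hypothetical, the
.Dv NETLINK_CAP_ACK
and
.Dv NETLINK_EXT_ACK
values are assumed to be
.Vt int
flags, and the Linux include is only there so that the fragment builds
outside
.Fx .
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#ifdef __FreeBSD__
#include <netlink/netlink.h>
#else
#include <linux/netlink.h>	/* assumption: build on Linux as well */
#endif

/*
 * Ask for ACKs that carry only the original header and that may
 * include extended ACK TLVs.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int
enable_compact_acks(int s)
{
	int on = 1;	/* assumption: options take an int flag */

	if (setsockopt(s, SOL_NETLINK, NETLINK_CAP_ACK,
	    &on, sizeof(on)) == -1)
		return (-1);
	return (setsockopt(s, SOL_NETLINK, NETLINK_EXT_ACK,
	    &on, sizeof(on)));
}

/* Subscribe the socket to notifications for one group number. */
int
join_group(int s, int group)
{
	return (setsockopt(s, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
	    &group, sizeof(group)));
}
```
.Ed
.Pp
The valid group numbers are family-specific and are not listed in this page.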
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables is available to tweak run-time parameters:
.Bl -tag -width indent
.It Va net.netlink.sendspace
Default send buffer for the netlink socket.
Note that the socket sendspace has to be at least as long as the longest
message that can be transmitted via this socket.
.It Va net.netlink.recvspace
Default receive buffer for the netlink socket.
Note that the socket recvspace has to be at least as long as the longest
message that can be received from this socket.
.It Va net.netlink.nl_maxsockbuf
Maximum receive buffer for the netlink socket that can be set via the
.Dv SO_RCVBUF
socket option.
.El
.Sh DEBUGGING
Netlink implements per-functional-unit debugging, with different severities
controllable via the
.Va net.netlink.debug
.Xr sysctl 8
branch.
These messages are logged in the kernel message buffer and can be seen in
.Xr dmesg 8 .
The following severity levels are defined:
.Bl -tag -width indent
.It Dv LOG_DEBUG(7)
Rare events or per-socket errors are reported here.
This is the default level, not impacting production performance.
.It Dv LOG_DEBUG2(8)
Socket events such as group memberships, privilege checks, commands and dumps
are logged.
This level does not incur significant performance overhead.
.It Dv LOG_DEBUG3(9)
All socket events and each dumped or modified entity are logged.
Turning it on may result in significant performance overhead.
.El
.Sh ERRORS
Netlink reports operation results, including errors and error metadata, by
sending a
.Dv NLMSG_ERROR
message for each request message.
The following errors can be returned:
.Bl -tag -width Er
.It Bq Er EPERM
when the current privileges are insufficient to perform the required operation;
.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
when the system runs out of memory for
an internal data structure;
.It Bq Er ENOTSUP
when the requested command is not supported by the family or
the family is not supported;
.It Bq Er EINVAL
when some necessary TLVs are missing or invalid (detailed information
may be provided in the
.Dv NLMSGERR_ATTR_MSG
and
.Dv NLMSGERR_ATTR_OFFS
TLVs);
.It Bq Er ENOENT
when trying to delete a non-existent object.
.El
.Pp
Additionally, a socket operation itself may fail with one of the errors
specified in
.Xr socket 2 ,
.Xr recv 2
or
.Xr send 2 .
.Sh SEE ALSO
.Xr genetlink 4 ,
.Xr rtnetlink 4
.Rs
.%A "J. Salim"
.%A "H. Khosravi"
.%A "A. Kleen"
.%A "A. Kuznetsov"
.%T "Linux Netlink as an IP Services Protocol"
.%O "RFC 3549"
.Re
.Sh HISTORY
The netlink protocol appeared in
.Fx 13.2 .
.Sh AUTHORS
Netlink was implemented by
.An -nosplit
.An Alexander Chernikov Aq Mt melifaro@FreeBSD.org .
It was derived from the Google Summer of Code 2021 project by
.An Ng Peng Nam Sean .