.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD: src/share/man/man4/multicast.4,v 1.4 2004/07/09 09:22:36 ru Exp $
.\" $OpenBSD: multicast.4,v 1.8 2013/08/14 08:46:07 jmc Exp $
.\" $NetBSD: multicast.4,v 1.3 2004/09/12 13:12:26 wiz Exp $
.\"
.Dd $Mdocdate: August 14 2013 $
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first multicast routing protocol to be developed.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF),
Core Based Trees (CBT),
Protocol Independent Multicast \- Sparse Mode (PIM-SM),
and Protocol Independent Multicast \- Dense Mode (PIM-DM)
were developed as well.
.Pp
To start multicast routing,
the user must enable multicast forwarding via the
.Xr sysctl 8
variables
.Va net.inet.ip.mforwarding
and/or
.Va net.inet.ip6.mforwarding .
The user must also run a multicast routing capable user-level process,
such as
.Xr mrouted 8 .
Developers of such a process should use the API described in the
.Sx Programming Guide
section to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require root privilege:
.Bd -literal -offset indent
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal -offset indent
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
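.Pp
The snippets above omit error handling.
A minimal sketch of a wrapper that checks the result of
socket(2) is shown here; the function name open_mrouter_socket
is hypothetical and is not part of any system header.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: open the raw socket that serves as the
 * multicast routing socket for the given address family.
 * Returns the descriptor, or -1 with errno set.  Without root
 * privilege the socket(2) call is expected to fail.
 */
int
open_mrouter_socket(int family)
{
	int proto, s, saved_errno;

	switch (family) {
	case AF_INET:
		proto = IPPROTO_IGMP;
		break;
	case AF_INET6:
		proto = IPPROTO_ICMPV6;
		break;
	default:
		errno = EAFNOSUPPORT;
		return (-1);
	}

	s = socket(family, SOCK_RAW, proto);
	if (s == -1) {
		/* perror() may clobber errno, so preserve it. */
		saved_errno = errno;
		perror("socket");
		errno = saved_errno;
	}
	return (s);
}
```

A caller would use open_mrouter_socket(AF_INET) in place of the bare
socket(2) call and stop if the result is -1.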
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(IPv4 or IPv6, respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used
for those IGMP or MLD messages.
In the case of
.Bx -derived
kernels,
it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons, the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal -offset 5n
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal -offset 5n
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
           sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if either PIM-SM or
PIM-DM is being used
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., a physical interface or a virtual tunnel)
that will be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal -offset 3n
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = max_rate_limit;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
if (vc.vifc_flags & VIFF_TUNNEL)
    memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
           sizeof(vc.vifc_rmt_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
           sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.Aq Pa netinet/ip_mroute.h .
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would be 1.
The
.Va max_rate_limit
contains the maximum rate (in bits/s) of the multicast data packets forwarded
on that vif.
A value of 0 means no limit.
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address for DVMRP multicast tunnels.
.Bd -literal -offset indent
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
           sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.Aq Pa netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal -offset indent
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
           sizeof(vifi));
.Ed
.Bd -literal -offset indent
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
           sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled, and the multicast virtual
interfaces have been
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was
initialized earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.Aq Pa netinet/ip_mroute.h )
with the
.Va im_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.Aq Pa netinet6/ip6_mroute.h )
with the
.Va im6_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the
.Va im_msgtype
or
.Va im6_msgtype
field, which holds the type of the upcall:
.Dv IGMPMSG_*
for IPv4 and
.Dv MRT6MSG_*
for IPv6.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
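.Pp
Because IGMP messages and kernel upcalls arrive on the same socket,
the reader has to tell them apart.
The following is a minimal sketch for IPv4 that works on the raw bytes
read from the socket.
It relies on the statement above that the upcall header overlays
struct ip with the protocol field set to zero; the protocol field sits at
byte offset 9 of an IPv4 header.
The offset of the message type (8) and the numeric value of
IGMPMSG_NOCACHE (1) are assumptions that should be checked against the
local netinet/ip_mroute.h, and the enum and function names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDRLEN		20	/* minimum IPv4 header length */
#define IPV4_PROTO_OFFSET	9	/* ip_p, overlaid by im_mbz */
#define UPCALL_MSGTYPE_OFFSET	8	/* assumed offset of im_msgtype */

#ifndef IGMPMSG_NOCACHE
#define IGMPMSG_NOCACHE		1	/* assumed value */
#endif

/* Hypothetical classification of a buffer read from the socket. */
enum pkt_kind {
	PKT_TOO_SHORT,		/* not even a full header */
	PKT_IGMP,		/* ordinary packet, ip_p is nonzero */
	PKT_UPCALL_NOCACHE,	/* kernel asks for an MFC entry */
	PKT_UPCALL_OTHER	/* some other kernel upcall */
};

enum pkt_kind
classify_mrouter_packet(const uint8_t *buf, size_t len)
{
	if (len < IPV4_MIN_HDRLEN)
		return (PKT_TOO_SHORT);
	/* A zero protocol field marks a kernel upcall. */
	if (buf[IPV4_PROTO_OFFSET] != 0)
		return (PKT_IGMP);
	if (buf[UPCALL_MSGTYPE_OFFSET] == IGMPMSG_NOCACHE)
		return (PKT_UPCALL_NOCACHE);
	return (PKT_UPCALL_OTHER);
}
```

Production code would normally cast the buffer to struct igmpmsg and
compare im_mbz and im_msgtype by name instead of using byte offsets.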
.Pp
An MFC entry is added by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i \*(Lt maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i \*(Lt maxvifs; i++)
    if (oifs_ttl[i] \*(Gt 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
fields are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that for IPv6 only the set of outgoing interfaces can
be specified.
.Pp
An MFC entry is deleted by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
Adding new features to the kernel makes it difficult
to preserve backward compatibility (binary and API)
while still allowing user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One mechanism that preserves backward
compatibility is a form of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
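.Pp
The three steps above can be modelled in plain C.
This is a sketch only: kernel_grant is a hypothetical stand-in for the
kernel side of the negotiation and features_usable is a hypothetical
user-level check; neither is a real system interface.
The feature bits repeat the definitions given in this manual page.

```c
#include <stdint.h>

#ifndef MRT_MFC_FLAGS_DISABLE_WRONGVIF
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF	(1 << 0)
#define MRT_MFC_FLAGS_BORDER_VIF	(1 << 1)
#define MRT_MFC_RP			(1 << 8)
#define MRT_MFC_BW_UPCALL		(1 << 9)
#endif

/*
 * Step 2, modelled: the kernel enables only those requested
 * features that it also supports.
 */
uint32_t
kernel_grant(uint32_t supported, uint32_t requested)
{
	return (supported & requested);
}

/*
 * Step 3: the process may rely on a group of features only if
 * every bit in 'required' was granted.
 */
int
features_usable(uint32_t granted, uint32_t required)
{
	return ((granted & required) == required);
}
```

A process that asks for MRT_MFC_RP and MRT_MFC_BW_UPCALL but is granted
only MRT_MFC_BW_UPCALL must keep doing PIM Register encapsulation at
user level while it may still install bandwidth filters.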
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.Aq Pa netinet/ip_mroute.h
and
.Aq Pa netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently a
.Vt uint32_t ,
i.e., a maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
An example:
.Bd -literal -offset 3n
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
This would set
.Va v
to the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal -offset 3n
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);	/* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To obtain the same set of enabled features later, use:
.Bd -literal -offset indent
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
Therefore,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 \*(Lt\*(Lt 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 \*(Lt\*(Lt 1) /* border vif               */
#define MRT_MFC_RP                     (1 \*(Lt\*(Lt 8) /* enable RP address        */
#define MRT_MFC_BW_UPCALL              (1 \*(Lt\*(Lt 9) /* enable bw upcalls        */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC     (1 \*(Lt\*(Lt 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;       /* ip origin of mcasts       */
        struct in_addr  mfcc_mcastgrp;     /* multicast group associated*/
        vifi_t          mfcc_parent;       /* incoming vif              */
        u_char          mfcc_ttls[MAXVIFS];/* forwarding ttls on vifs   */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags*/
        struct in_addr  mfcc_rp;            /* the RP address           */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 \*(Lt\*(Lt 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 \*(Lt\*(Lt 1) /* border vif               */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually this signal is used to
complete the shortest-path switch for PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set and that are not expected to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for an
interface, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
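.Pp
As an illustration of the rule above, the following sketch fills an
array shaped like mfcc_flags so that WRONGVIF signals are suppressed on
every interface that is neither an outgoing interface nor a candidate
incoming interface.
The function name fill_wrongvif_flags and the may_become_iif array are
hypothetical, and the fallback value of MAXVIFS (32) is an assumption
used only when the system header is not included.

```c
#include <stdint.h>

#ifndef MAXVIFS
#define MAXVIFS 32	/* assumed; normally from <netinet/ip_mroute.h> */
#endif
#ifndef MRT_MFC_FLAGS_DISABLE_WRONGVIF
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0)
#endif

/*
 * flags[]          output, same shape as mfcc_flags[]
 * ttls[]           per-vif TTL thresholds as in mfcc_ttls[];
 *                  zero means "not an outgoing interface"
 * may_become_iif[] nonzero if the routing protocol may later
 *                  switch the incoming interface to this vif
 */
void
fill_wrongvif_flags(uint8_t flags[MAXVIFS],
    const unsigned char ttls[MAXVIFS], const int may_become_iif[MAXVIFS])
{
	int i;

	for (i = 0; i < MAXVIFS; i++) {
		flags[i] = 0;
		if (ttls[i] == 0 && !may_become_iif[i])
			flags[i] |= MRT_MFC_FLAGS_DISABLE_WRONGVIF;
	}
}
```

The resulting array would be copied into the mfcc_flags field of
struct mfcctl2 before calling setsockopt(2) with MRT_ADD_MFC.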
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (for PIM-SM multicast routing)
for a multicast
group G if kernel-level PIM Register encapsulation is desired.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet, for example, would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or for PIM-SM it can initiate an (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a data flow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution does not scale well, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., \*(Lt \*(Lt= == != \*(Gt \*(Gt= ), there is support only for the
\*(Lt= and \*(Gt= operations,
because this makes the kernel-level implementation simpler,
and because in practice only those two are needed.
Furthermore, the missing operations can be simulated by secondary
user-level filtering of those \*(Lt= and \*(Gt= filters.
For example, to simulate !=, install the filter
.Dq bw \*(Lt= 0xffffffff ,
and after an
upcall is received, check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument).
.El
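.Pp
The secondary user-level filtering mentioned in the list above can be
sketched as follows.
The enum bw_op and the function secondary_filter are hypothetical
names; the comments state which kernel filter is assumed to have been
installed for each simulated operation.

```c
#include <stdint.h>

/* Operations the kernel does not offer directly (hypothetical enum). */
enum bw_op {
	BW_OP_LT,	/* kernel filter assumed: bw <= wanted      */
	BW_OP_EQ,	/* kernel filter assumed: bw <= wanted      */
	BW_OP_NE,	/* kernel filter assumed: bw <= 0xffffffff  */
	BW_OP_GT	/* kernel filter assumed: bw >= wanted      */
};

/*
 * Called for each received upcall; returns nonzero if the
 * simulated comparison holds for the measured value.
 */
int
secondary_filter(enum bw_op op, uint64_t measured, uint64_t wanted)
{
	switch (op) {
	case BW_OP_LT:
		return (measured < wanted);
	case BW_OP_EQ:
		return (measured == wanted);
	case BW_OP_NE:
		return (measured != wanted);
	case BW_OP_GT:
		return (measured > wanted);
	}
	return (0);
}
```

The kernel filter only narrows down which upcalls are delivered; the
final decision is made by the process after each upcall.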
.Pp
From an application point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For \*(Gt= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For \*(Lt= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address            */
        struct in_addr  bu_dst;         /* destination address       */
        uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 \*(Lt\*(Lt 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 \*(Lt\*(Lt 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 \*(Lt\*(Lt 2) /* upcall if bw \*(Gt= threshold */
#define BW_UPCALL_LEQ          (1 \*(Lt\*(Lt 3) /* upcall if bw \*(Lt= threshold */
#define BW_UPCALL_DELETE_ALL   (1 \*(Lt\*(Lt 4) /* delete all upcalls for s,d*/
        struct bw_data  bu_threshold;   /* the bw threshold          */
        struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX				128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC	3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC	0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
if (bw_upcall_oper IS "\*(Gt=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets \*(Gt= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes \*(Gt= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is \*(Gt= threshold");
}
if (bw_upcall_oper IS "\*(Lt=" && measured_interval \*(Gt= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets \*(Lt= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes \*(Lt= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is \*(Lt= threshold");
}
.Ed
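.Pp
The pseudo-algorithm can be written as an ordinary C predicate.
This is a user-level sketch for reasoning about filters, not the kernel
implementation.
The struct bw_counts type and the function bw_filter_fires are
hypothetical; the flag values repeat the definitions shown earlier and
are only defined here if the system header has not provided them.

```c
#include <stdint.h>

#ifndef BW_UPCALL_UNIT_PACKETS
#define BW_UPCALL_UNIT_PACKETS	(1 << 0)
#define BW_UPCALL_UNIT_BYTES	(1 << 1)
#define BW_UPCALL_GEQ		(1 << 2)
#define BW_UPCALL_LEQ		(1 << 3)
#endif

/* Hypothetical pair of counters (packets and bytes). */
struct bw_counts {
	uint64_t packets;
	uint64_t bytes;
};

/*
 * Returns 1 if a filter with the given flags would trigger an
 * upcall, 0 otherwise.  'interval_expired' is nonzero once the
 * measured interval has reached the threshold interval.
 */
int
bw_filter_fires(uint32_t flags, const struct bw_counts *measured,
    const struct bw_counts *threshold, int interval_expired)
{
	int by_packets = (flags & BW_UPCALL_UNIT_PACKETS) != 0;
	int by_bytes = (flags & BW_UPCALL_UNIT_BYTES) != 0;
	int geq = (flags & BW_UPCALL_GEQ) != 0;
	int leq = (flags & BW_UPCALL_LEQ) != 0;

	/* GEQ and LEQ are mutually exclusive. */
	if (geq == leq)
		return (0);

	if (geq)
		return ((by_packets &&
		    measured->packets >= threshold->packets) ||
		    (by_bytes && measured->bytes >= threshold->bytes));

	/* LEQ: the answer is known only after the interval ends. */
	if (!interval_expired)
		return (0);
	return ((by_packets && measured->packets <= threshold->packets) ||
	    (by_bytes && measured->bytes <= threshold->bytes));
}
```

Note that a filter with both unit flags set fires as soon as either the
packet or the byte condition holds.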
.Pp
In the same
.Vt bw_upcall ,
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is \*(Gt= or
\*(Lt= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, the bandwidth
estimation could be less accurate, or the potentially very high frequency
of the generated upcalls could introduce too much overhead.
For the \*(Gt= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the \*(Lt= operation, however, the kernel must wait
until the threshold interval has expired to know the answer.
.Sh EXAMPLES
The following installs a bandwidth-upcall filter:
.Bd -literal -offset indent
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
          (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
with the fields of bw_upcall set
exactly as they were when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with the
.Dv BW_UPCALL_DELETE_ALL
flag in
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal -offset indent
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is \*(Gt= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for \*(Gt= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx Programming Guide
section).
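.Pp
A process that still wants a rough rate from an upcall has to guard
against a zero measured interval, as cautioned above.
The following sketch shows one way to do that.
The struct bw_sample type mirrors the layout of struct bw_data shown
earlier and, like the function name bw_bytes_per_sec, is hypothetical.

```c
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical mirror of struct bw_data. */
struct bw_sample {
	struct timeval	b_time;		/* measured interval */
	uint64_t	b_packets;
	uint64_t	b_bytes;
};

/*
 * Store an approximate rate in bytes per second in *rate and
 * return 0, or return -1 if the measured interval is zero and
 * no rate can be derived.
 */
int
bw_bytes_per_sec(const struct bw_sample *m, double *rate)
{
	double secs;

	secs = (double)m->b_time.tv_sec +
	    (double)m->b_time.tv_usec / 1000000.0;
	if (secs <= 0.0)
		return (-1);
	*rate = (double)m->b_bytes / secs;
	return (0);
}
```

When the function returns -1 the caller should fall back to the
SIOCGETSGCNT counters instead of treating the flow as idle.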
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw \*(Gt= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr icmp6 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr pim 4 ,
.Xr mrouted 8 ,
.Xr sysctl 8
.\"
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
.Pp
The IPv6 multicast support was implemented by the KAME project
.Pq Lk http://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).
985