.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD: src/share/man/man4/multicast.4,v 1.4 2004/07/09 09:22:36 ru Exp $
.\" $OpenBSD: multicast.4,v 1.7 2012/08/12 17:01:35 schwarze Exp $
.\" $NetBSD: multicast.4,v 1.3 2004/09/12 13:12:26 wiz Exp $
.\"
.Dd $Mdocdate: August 12 2012 $
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first multicast routing protocol to be developed.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF),
Core Based Trees (CBT),
Protocol Independent Multicast \- Sparse Mode (PIM-SM),
and Protocol Independent Multicast \- Dense Mode (PIM-DM)
were developed as well.
.Pp
To start multicast routing,
the user must enable multicast forwarding via the
.Xr sysctl 8
variables
.Va net.inet.ip.mforwarding
and/or
.Va net.inet.ip6.mforwarding .
The user must also run a multicast routing capable user-level process,
such as
.Xr mrouted 8 .
From a developer's point of view,
the API described in the
.Sx Programming Guide
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privileges
(i.e., root privilege):
.Bd -literal -offset indent
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal -offset indent
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(IPv4 or IPv6, respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used for those messages.
In the case of BSD-derived kernels,
it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons, the multicast
routing socket should be reused for IGMP and MLD messages as well.
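.Pp
Because opening a raw socket normally fails without sufficient privilege,
the return value of socket() should be checked before the descriptor is used.
The following is only a minimal sketch of that check; the helper name
open_mrouter_socket is hypothetical and not part of the multicast routing API.
```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: open the multicast routing socket for the given
 * address family.  Returns the descriptor, or -1 with errno set.
 */
int
open_mrouter_socket(int family)
{
	int proto;

	if (family == AF_INET)
		proto = IPPROTO_IGMP;		/* IPv4: IGMP socket */
	else if (family == AF_INET6)
		proto = IPPROTO_ICMPV6;		/* IPv6: ICMPv6 socket */
	else {
		errno = EAFNOSUPPORT;		/* neither IPv4 nor IPv6 */
		return (-1);
	}
	/* socket() itself returns -1 and sets errno on failure. */
	return (socket(family, SOCK_RAW, proto));
}
```
A caller would treat a return value of -1 as fatal and report errno,
which typically indicates missing privilege when the family is valid.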
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal -offset 5n
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal -offset 5n
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
           sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if either PIM-SM or
PIM-DM is being used
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., physical or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal -offset 3n
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = max_rate_limit;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
if (vc.vifc_flags & VIFF_TUNNEL)
    memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
           sizeof(vc.vifc_rmt_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
           sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.Aq Pa netinet/ip_mroute.h .
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would be 1.
The
.Va max_rate_limit
contains the maximum rate (in bits/s) of the multicast data packets forwarded
on that vif.
A value of 0 means no limit.
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address for DVMRP multicast tunnels.
.Bd -literal -offset indent
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
           sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.Aq Pa netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal -offset indent
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
           sizeof(vifi));
.Ed
.Bd -literal -offset indent
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
           sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled, and the multicast virtual
interfaces have been
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier and initialized with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.Aq Pa netinet/ip_mroute.h )
with the
.Va im_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.Aq Pa netinet6/ip6_mroute.h )
with the
.Va im6_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
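.Pp
Since the IPv4 multicast routing socket carries both upcalls and regular
IGMP packets, a reader has to tell the two apart.
The sketch below illustrates one way to do that for IPv4, relying only on the
statement above that the upcall header has a zero where
.Vt "struct ip"
has its protocol field.
The function name and the two constants are hypothetical; the constants restate
the standard IPv4 header layout rather than anything from
.Aq Pa netinet/ip_mroute.h .
```c
#include <stddef.h>

#define SK_IPV4_MIN_HDR_LEN	20	/* IPv4 header without options */
#define SK_IPV4_PROTO_OFFSET	9	/* offset of the protocol field */

/*
 * Classify a buffer read from the IPv4 multicast routing socket.
 * Returns 1 for a kernel upcall (protocol byte is zero), 0 for a
 * regular packet such as IGMP, and -1 if the buffer is too short.
 */
int
is_kernel_upcall(const unsigned char *buf, size_t len)
{
	if (len < SK_IPV4_MIN_HDR_LEN)
		return (-1);
	return (buf[SK_IPV4_PROTO_OFFSET] == 0);
}
```
A real program would, after this check, cast the buffer to
.Vt "struct igmpmsg"
and inspect the message type.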
.Pp
The upcall header contains the
.Va im_msgtype
and
.Va im6_msgtype
fields, with the type of the upcall
.Dv IGMPMSG_*
and
.Dv MRT6MSG_*
for IPv4 and IPv6, respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
.Pp
An MFC entry is added by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i \*(Lt maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i \*(Lt maxvifs; i++)
    if (oifs_ttl[i] \*(Gt 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
fields are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that for IPv6 only the set of outgoing interfaces can
be specified.
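.Pp
The relation between the per-interface TTL array and the plain set of outgoing
interfaces can be illustrated with a small helper.
This is a sketch only: it represents the set as a 32-bit mask instead of the
kernel's own set type, and the name oif_mask_from_ttls and the limit
SK_MAXVIFS are assumptions made for the example.
```c
#include <stdint.h>

#define SK_MAXVIFS	32	/* assumption: at most 32 interfaces */

/*
 * Build a bitmask of outgoing interfaces from a TTL array.
 * Interface i is part of the set only if oifs_ttl[i] is non-zero.
 */
uint32_t
oif_mask_from_ttls(const unsigned char *oifs_ttl, int nvifs)
{
	uint32_t mask = 0;
	int i;

	for (i = 0; i < nvifs && i < SK_MAXVIFS; i++)
		if (oifs_ttl[i] > 0)
			mask |= (uint32_t)1 << i;
	return (mask);
}
```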
.Pp
An MFC entry is deleted by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
           (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
Adding new features to the kernel makes it difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows preserving the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
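.Pp
The three steps above amount to intersecting two bitsets and then testing
individual bits.
The following sketch models that logic in isolation, with the kernel's answer
passed in as a plain argument; the function names are hypothetical and the
real exchange happens through the socket options described in this section.
```c
#include <stdint.h>

/* Step 2: only the requested features that are supported are granted. */
uint32_t
negotiate_features(uint32_t desired, uint32_t supported)
{
	return (desired & supported);
}

/* Step 3: a feature may be used only if it is in the agreed set. */
int
feature_agreed(uint32_t agreed, uint32_t feature)
{
	return ((agreed & feature) == feature);
}
```
A process that asked for more than was granted simply falls back to the basic
behaviour for every feature where feature_agreed() returns zero.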
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.Aq Pa netinet/ip_mroute.h
and
.Aq Pa netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ,
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
An example:
.Bd -literal -offset 3n
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
This would set
.Va v
to the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
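.Pp
The split of the 32-bit value into eight per-entry flag bits and 24 remaining
feature bits can be made explicit with two masks.
This is an illustrative sketch; the mask name and both helper names are
hypothetical and do not appear in the kernel headers.
```c
#include <stdint.h>

/* The eight least significant bits mirror the MRT_MFC_FLAGS_* flags. */
#define SK_MFC_FLAGS_MASK	0x000000ffu

/* Bits that may also be stored in a per-interface mfcc_flags byte. */
uint8_t
mfc_flag_bits(uint32_t v)
{
	return ((uint8_t)(v & SK_MFC_FLAGS_MASK));
}

/* The remaining 24 bits, used for other API-level features. */
uint32_t
api_feature_bits(uint32_t v)
{
	return (v & ~SK_MFC_FLAGS_MASK);
}
```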
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal -offset 3n
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);	/* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To obtain the same set of enabled features later, use:
.Bd -literal -offset indent
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
Therefore,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define	MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 \*(Lt\*(Lt 0)/*disable WRONGVIF signals*/
#define	MRT_MFC_FLAGS_BORDER_VIF   (1 \*(Lt\*(Lt 1)  /* border vif              */
#define MRT_MFC_RP                 (1 \*(Lt\*(Lt 8)  /* enable RP address	*/
#define MRT_MFC_BW_UPCALL          (1 \*(Lt\*(Lt 9)  /* enable bw upcalls	*/
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC     (1 \*(Lt\*(Lt 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;       /* ip origin of mcasts       */
        struct in_addr  mfcc_mcastgrp;     /* multicast group associated*/
        vifi_t          mfcc_parent;       /* incoming vif              */
        u_char          mfcc_ttls[MAXVIFS];/* forwarding ttls on vifs   */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags*/
        struct in_addr  mfcc_rp;            /* the RP address           */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define	MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 \*(Lt\*(Lt 0)/*disable WRONGVIF signals*/
#define	MRT_MFC_FLAGS_BORDER_VIF       (1 \*(Lt\*(Lt 1) /* border vif          */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually this signal is used to
complete the shortest-path switch for PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not set in
the outgoing interface, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
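.Pp
The rule described above, suppressing the signal on interfaces that are
neither outgoing nor candidates to become incoming, could be applied to a
per-interface flag array as sketched below.
The array size SK_MAXVIFS, the flag constant SK_DISABLE_WRONGVIF (which mirrors
the value shown for MRT_MFC_FLAGS_DISABLE_WRONGVIF), the may_become_iif input
and the function name are all assumptions made for this example.
```c
#include <stdint.h>
#include <string.h>

#define SK_MAXVIFS		32		/* assumed array size */
#define SK_DISABLE_WRONGVIF	(1 << 0)	/* mirrors the flag value above */

/*
 * Fill a per-interface flag array for one (S,G) entry.  The WRONGVIF
 * signal is disabled on every interface that is not in the outgoing
 * set (TTL of zero) and is not expected to become the incoming one.
 */
void
fill_wrongvif_flags(uint8_t *flags, const unsigned char *oifs_ttl,
    const unsigned char *may_become_iif, int iif_index)
{
	int i;

	memset(flags, 0, SK_MAXVIFS);
	for (i = 0; i < SK_MAXVIFS; i++) {
		if (i == iif_index)
			continue;	/* the current incoming interface */
		if (oifs_ttl[i] == 0 && !may_become_iif[i])
			flags[i] |= SK_DISABLE_WRONGVIF;
	}
}
```
The resulting array would then be copied into
.Va mfcc_flags
before the entry is installed.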
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (for PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet, for example, would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or for PIM-SM it can initiate an (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a dataflow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
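.Pp
For comparison, the polling approach boils down to differencing two counter
samples, which a daemon has to repeat for every (S,G) it tracks.
The sketch below shows that arithmetic only; the function name is hypothetical
and the counters are assumed to come from two successive statistics queries
taken interval_sec seconds apart.
```c
#include <stdint.h>

/*
 * Estimate the byte rate of one flow from two samples of its
 * forwarded-bytes counter.  Returns -1.0 if the interval is not
 * positive or the counter went backwards (e.g., entry reinstalled).
 */
double
estimate_bytes_per_sec(uint64_t prev_bytes, uint64_t cur_bytes,
    double interval_sec)
{
	if (interval_sec <= 0.0 || cur_bytes < prev_bytes)
		return (-1.0);
	return ((double)(cur_bytes - prev_bytes) / interval_sec);
}
```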
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., \*(Lt \*(Lt= == != \*(Gt \*(Gt= ), there is support only for the
\*(Lt= and \*(Gt= operations,
because this makes the kernel-level implementation simpler,
and because in practice only those two are needed.
Furthermore, the missing operations can be simulated by secondary
user-level filtering of those \*(Lt= and \*(Gt= filters.
For example, to simulate !=, install the filter
.Dq bw \*(Lt= 0xffffffff ,
and after an
upcall is received, check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument).
.El
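.Pp
The user-level half of the != simulation mentioned in the list above is a
single comparison performed when the upcall arrives.
A minimal sketch, assuming the kernel filter was installed with the catch-all
threshold from that example; the constant and function names are hypothetical.
```c
#include <stdint.h>

/* Threshold of the kernel filter "bw <= 0xffffffff" from the example. */
#define SK_CATCH_ALL_THRESHOLD	0xffffffffULL

/*
 * Secondary user-level filter applied to an upcall produced by the
 * catch-all "<=" filter.  Returns 1 if the measured value passed the
 * kernel filter and differs from the expected one, 0 otherwise.
 */
int
bw_not_equal(uint64_t measured_bw, uint64_t expected_bw)
{
	if (measured_bw > SK_CATCH_ALL_THRESHOLD)
		return (0);	/* would not have matched the kernel filter */
	return (measured_bw != expected_bw);
}
```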
.Pp
From an application point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For \*(Gt= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For \*(Lt= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address            */
        struct in_addr  bu_dst;         /* destination address       */
        uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 \*(Lt\*(Lt 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 \*(Lt\*(Lt 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 \*(Lt\*(Lt 2) /* upcall if bw \*(Gt= threshold */
#define BW_UPCALL_LEQ          (1 \*(Lt\*(Lt 3) /* upcall if bw \*(Lt= threshold */
#define BW_UPCALL_DELETE_ALL   (1 \*(Lt\*(Lt 4) /* delete all upcalls for s,d*/
        struct bw_data  bu_threshold;   /* the bw threshold          */
        struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX				128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC	3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC	0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
if (bw_upcall_oper IS "\*(Gt=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets \*(Gt= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes \*(Gt= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is \*(Gt= threshold");
}
if (bw_upcall_oper IS "\*(Lt=" && measured_interval \*(Gt= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets \*(Lt= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes \*(Lt= threshold_bytes)))
        SEND_UPCALL("measured bandwidth is \*(Lt= threshold");
}
.Ed
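.Pp
The pseudo-algorithm above can be transcribed into compilable C for
experimentation.
This is a user-level model only, not the kernel implementation: the flag
constants mirror the BW_UPCALL_* values shown earlier, while the structure,
the function name and the interval_expired argument (standing in for the
measured_interval comparison) are assumptions of the sketch.
```c
#include <stdint.h>

#define SK_UNIT_PACKETS	(1u << 0)	/* mirrors BW_UPCALL_UNIT_PACKETS */
#define SK_UNIT_BYTES	(1u << 1)	/* mirrors BW_UPCALL_UNIT_BYTES */
#define SK_GEQ		(1u << 2)	/* mirrors BW_UPCALL_GEQ */
#define SK_LEQ		(1u << 3)	/* mirrors BW_UPCALL_LEQ */

struct sk_counts {
	uint64_t packets;
	uint64_t bytes;
};

/* Return 1 if the filter described by flags would trigger an upcall. */
int
bw_filter_matches(uint32_t flags, const struct sk_counts *measured,
    const struct sk_counts *threshold, int interval_expired)
{
	if (flags & SK_GEQ) {
		if (((flags & SK_UNIT_PACKETS) &&
		    measured->packets >= threshold->packets) ||
		    ((flags & SK_UNIT_BYTES) &&
		    measured->bytes >= threshold->bytes))
			return (1);
	}
	/* A "<=" filter can only be evaluated once the interval is over. */
	if ((flags & SK_LEQ) && interval_expired) {
		if (((flags & SK_UNIT_PACKETS) &&
		    measured->packets <= threshold->packets) ||
		    ((flags & SK_UNIT_BYTES) &&
		    measured->bytes <= threshold->bytes))
			return (1);
	}
	return (0);
}
```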
.Pp
In the same
.Vt bw_upcall ,
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is \*(Gt= or
\*(Lt= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values are allowed, then the bandwidth
estimation may be less accurate, or the potentially very high frequency
of the generated upcalls may introduce too much overhead.
For the \*(Gt= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the \*(Lt= operation however, we must wait
until the threshold interval has expired to know the answer.
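.Pp
A process can validate its measurement interval against the 3 second minimum
before installing a filter.
The following sketch shows such a check; the two constants mirror the
BW_UPCALL_THRESHOLD_INTERVAL_MIN_* values listed earlier and the function name
is hypothetical.
```c
#include <sys/time.h>

#define SK_INTERVAL_MIN_SEC	3	/* mirrors ..._INTERVAL_MIN_SEC */
#define SK_INTERVAL_MIN_USEC	0	/* mirrors ..._INTERVAL_MIN_USEC */

/* Return 1 if tv is at least the minimum measurement interval. */
int
bw_interval_valid(const struct timeval *tv)
{
	if (tv->tv_sec > SK_INTERVAL_MIN_SEC)
		return (1);
	if (tv->tv_sec == SK_INTERVAL_MIN_SEC &&
	    tv->tv_usec >= SK_INTERVAL_MIN_USEC)
		return (1);
	return (0);
}
```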
.Sh EXAMPLES
The following installs a bandwidth-upcall filter for a given source and group:
.Bd -literal -offset indent
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
          (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
with the fields of
.Va bw_upcall
set to exactly the same values as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with the
.Dv BW_UPCALL_DELETE_ALL
flag in the
.Va bw_upcall.bu_flags
field.
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal -offset indent
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is \*(Gt= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for \*(Gt= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx Programming Guide
section).
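.Pp
The delivery rule stated above (128 pending upcalls, or one second since the
previous delivery, whichever comes first) can be written as a small predicate.
This is a model of the described behaviour for illustration, not kernel code;
the names are hypothetical, and treating an empty queue as "nothing to deliver"
is an assumption of the sketch.
```c
#define SK_BW_UPCALLS_MAX	128	/* mirrors BW_UPCALLS_MAX */
#define SK_FLUSH_INTERVAL_MS	1000	/* one second, in milliseconds */

/*
 * Return 1 if the pending bandwidth upcalls should be delivered now:
 * either the batch is full or one second has passed since the
 * previous delivery.
 */
int
bw_upcalls_should_flush(int pending, long ms_since_last_delivery)
{
	if (pending <= 0)
		return (0);	/* assumption: nothing to deliver */
	return (pending >= SK_BW_UPCALLS_MAX ||
	    ms_since_last_delivery >= SK_FLUSH_INTERVAL_MS);
}
```
A receiver should therefore be prepared for messages holding anywhere from one
to 128 elements.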
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw \*(Gt= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr icmp6 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr pim 4 ,
.Xr mrouted 8 ,
.Xr sysctl 8
.\"
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
.Pp
The IPv6 multicast support was implemented by the KAME project
.Pq Lk http://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).
983