.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD: src/share/man/man4/multicast.4,v 1.4 2004/07/09 09:22:36 ru Exp $
.\" $OpenBSD: multicast.4,v 1.7 2012/08/12 17:01:35 schwarze Exp $
.\" $NetBSD: multicast.4,v 1.3 2004/09/12 13:12:26 wiz Exp $
.\"
.Dd $Mdocdate: August 12 2012 $
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first developed multicast routing protocol.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF),
Core Based Trees (CBT),
Protocol Independent Multicast \- Sparse Mode (PIM-SM),
and Protocol Independent Multicast \- Dense Mode (PIM-DM)
were developed as well.
.Pp
To start multicast routing,
the user must enable multicast forwarding via the
.Xr sysctl 8
variables
.Va net.inet.ip.mforwarding
and/or
.Va net.inet.ip6.mforwarding .
The user must also run a multicast routing capable user-level process,
such as
.Xr mrouted 8 .
From a developer's point of view,
the API described in the
.Sx Programming Guide
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privilege
(i.e., root privilege):
.Bd -literal -offset indent
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal -offset indent
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(IPv4 or IPv6, respectively)
for sending or receiving of IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
sockets should be used
for sending and receiving IGMP or MLD messages, respectively.
In the case of BSD-derived kernels,
it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving of IGMP or MLD
messages.
Therefore, for portability reasons, the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal -offset 5n
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal -offset 5n
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
    sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if either PIM-SM or
PIM-DM is being used
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., a physical interface or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal -offset 3n
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = max_rate_limit;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
if (vc.vifc_flags & VIFF_TUNNEL)
    memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
        sizeof(vc.vifc_rmt_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
    sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.Aq Pa netinet/ip_mroute.h .
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would be 1.
The
.Va max_rate_limit
contains the maximum rate (in bits/s) of the multicast data packets forwarded
on that vif.
A value of 0 means no limit.
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address for DVMRP multicast tunnels.
.Bd -literal -offset indent
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
    sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.Aq Pa netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal -offset indent
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
    sizeof(vifi));
.Ed
.Bd -literal -offset indent
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
    sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled, and the multicast virtual
interfaces have been
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.Aq Pa netinet/ip_mroute.h )
with the
.Va im_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.Aq Pa netinet6/ip6_mroute.h )
with the
.Va im6_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the
.Va im_msgtype
and
.Va im6_msgtype
fields, with the type of the upcall
.Dv IGMPMSG_*
and
.Dv MRT6MSG_*
for IPv4 and IPv6, respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
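.Pp
Because IPv4 upcalls share the multicast routing socket with ordinary
IGMP traffic, a reader of that socket has to tell the two apart.
The following is a minimal sketch of that check, not the kernel's or any
daemon's actual code.
It relies on the zeroed protocol byte described above; the function name
.Fn classify_upcall ,
the
.Dv UPCALL_*
constants, and the assumption that the message type occupies the byte just
before the zeroed one are illustrative and should be verified against
.Aq Pa netinet/ip_mroute.h .

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Local stand-ins for the IGMPMSG_* values. Only the value 4 is stated
 * in this page; 1-3 are assumed from common BSD headers.
 */
#define UPCALL_NOCACHE  1
#define UPCALL_WRONGVIF 2
#define UPCALL_WHOLEPKT 3
#define UPCALL_BW       4

#define OFF_MSGTYPE 8   /* assumed: the byte where struct ip keeps ip_ttl */
#define OFF_MBZ     9   /* the byte where struct ip keeps ip_p */
#define HDR_LEN     20  /* size of an IPv4 header without options */

/*
 * Returns the upcall type (1-4) for a kernel upcall, 0 for an ordinary
 * packet (protocol byte is nonzero), and -1 if the buffer is too short
 * or the type is not one of the four known values.
 */
int
classify_upcall(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < HDR_LEN)
        return -1;
    if (buf[OFF_MBZ] != 0)
        return 0;
    if (buf[OFF_MSGTYPE] < UPCALL_NOCACHE || buf[OFF_MSGTYPE] > UPCALL_BW)
        return -1;
    return buf[OFF_MSGTYPE];
}
```

A real daemon would cast the buffer to
.Vt "struct igmpmsg"
instead of using byte offsets; the offsets only keep the sketch free of
kernel headers.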
.Pp
An MFC entry is added by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i \*(Lt maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i \*(Lt maxvifs; i++)
    if (oifs_ttl[i] \*(Gt 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
fields are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that for IPv6 only the set of outgoing interfaces can
be specified.
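.Pp
The difference between the two address families can be illustrated with a
small helper that reduces
.Va oifs_ttl[]
to a plain set of outgoing interfaces, which is all the IPv6 entry keeps.
This is only a sketch: the function name
.Fn oifs_to_mask
and the 32-bit mask are hypothetical stand-ins for the kernel's interface
set type, chosen so the example builds without kernel headers.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bit i of the result is set when oifs_ttl[i] > 0,
 * i.e. when interface i belongs to the outgoing set. The per-interface
 * TTL thresholds themselves are discarded. Limited to 32 interfaces.
 */
uint32_t
oifs_to_mask(const unsigned char *oifs_ttl, unsigned int nifs)
{
    uint32_t mask = 0;
    unsigned int i;

    for (i = 0; i < nifs && i < 32; i++)
        if (oifs_ttl[i] > 0)
            mask |= (uint32_t)1 << i;
    return mask;
}
```

With TTLs of 0, 1, 0 and 16 on four interfaces, only interfaces 1 and 3
end up in the set; the distinction between thresholds 1 and 16 is lost.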
.Pp
An MFC entry is deleted by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
Adding new features to the kernel makes it difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows preserving the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.Aq Pa netinet/ip_mroute.h
and
.Aq Pa netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ,
i.e., maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
An example:
.Bd -literal -offset 3n
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
This would set
.Va v
to the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal -offset 3n
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);        /* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To later retrieve the set of features that were enabled, use:
.Bd -literal -offset indent
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
In other words,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
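.Pp
The outcome of the negotiation can be modelled as a bitwise intersection
of what the process asks for and what the kernel supports.
The sketch below is only such a model, under the assumption that the
kernel grants every requested bit it supports; the helper names
.Fn api_granted
and
.Fn api_has
are hypothetical, and the feature bits are repeated from this page so that
the example builds without kernel headers.

```c
#include <stdint.h>

/* Feature bits as listed in this page, repeated for a standalone build. */
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0)
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1)
#define MRT_MFC_RP                     (1 << 8)
#define MRT_MFC_BW_UPCALL              (1 << 9)

/* Model of the kernel's answer: the requested bits it also supports. */
uint32_t
api_granted(uint32_t wanted, uint32_t supported)
{
    return wanted & supported;
}

/* Nonzero only when every bit in `needed` is present in `granted`. */
int
api_has(uint32_t granted, uint32_t needed)
{
    return (granted & needed) == needed;
}
```

A process that asked for both
.Dv MRT_MFC_RP
and
.Dv MRT_MFC_BW_UPCALL
but was granted only the former must fall back to the polling approach
for bandwidth measurement.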
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 \*(Lt\*(Lt 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 \*(Lt\*(Lt 1) /* border vif */
#define MRT_MFC_RP                     (1 \*(Lt\*(Lt 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 \*(Lt\*(Lt 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 \*(Lt\*(Lt 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
    /* the mfcctl fields */
    struct in_addr  mfcc_origin;        /* ip origin of mcasts       */
    struct in_addr  mfcc_mcastgrp;      /* multicast group associated*/
    vifi_t          mfcc_parent;        /* incoming vif              */
    u_char          mfcc_ttls[MAXVIFS]; /* forwarding ttls on vifs   */

    /* extension fields */
    uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags */
    struct in_addr  mfcc_rp;            /* the RP address            */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 \*(Lt\*(Lt 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 \*(Lt\*(Lt 1) /* border vif */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually this signal is used to
complete the shortest-path switch for PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (for PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet for example would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or for PIM-SM it can initiate (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a dataflow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., \*(Lt \*(Lt= == != \*(Gt \*(Gt= ), there is support only for the
\*(Lt= and \*(Gt= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Furthermore, the missing operations can be simulated by secondary
user-level filtering of those \*(Lt= and \*(Gt= filters.
For example, to simulate !=, we need to install the filter
.Dq bw \*(Lt= 0xffffffff ,
and after an
upcall is received, we need to check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument of course).
.El
.Pp
From an application point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (3s).
 * The threshold is set in packets and/or bytes per interval.
 *
 * Measurement works as follows:
 *
 * For \*(Gt= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For \*(Lt= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
    struct timeval  b_time;
    uint64_t        b_packets;
    uint64_t        b_bytes;
};

struct bw_upcall {
    struct in_addr  bu_src;         /* source address            */
    struct in_addr  bu_dst;         /* destination address       */
    uint32_t        bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 \*(Lt\*(Lt 0)  /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 \*(Lt\*(Lt 1)  /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 \*(Lt\*(Lt 2)  /* upcall if bw \*(Gt= threshold */
#define BW_UPCALL_LEQ          (1 \*(Lt\*(Lt 3)  /* upcall if bw \*(Lt= threshold */
#define BW_UPCALL_DELETE_ALL   (1 \*(Lt\*(Lt 4)  /* delete all upcalls for s,d*/
    struct bw_data  bu_threshold;   /* the bw threshold          */
    struct bw_data  bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX                          128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC    3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC   0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
 if (bw_upcall_oper IS "\*(Gt=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets \*(Gt= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes \*(Gt= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is \*(Gt= threshold");
  }
  if (bw_upcall_oper IS "\*(Lt=" && measured_interval \*(Gt= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets \*(Lt= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes \*(Lt= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is \*(Lt= threshold");
  }
.Ed
.Pp
In the same
.Vt bw_upcall ,
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is \*(Gt= or
\*(Lt= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, then the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the \*(Gt= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the \*(Lt= operation however, we must wait
until the threshold interval has expired to know the answer.
.Sh EXAMPLES
.Bd -literal -offset indent
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
    (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
and the fields of bw_upcall must be set to
exactly the same values as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with only the
.Dv BW_UPCALL_DELETE_ALL
flag inside field
.Va bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal -offset indent
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is \*(Gt= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for \*(Gt= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx Programming Guide
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw \*(Gt= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr icmp6 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr pim 4 ,
.Xr mrouted 8 ,
.Xr sysctl 8
.\"
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
.Pp
The IPv6 multicast support was implemented by the KAME project
.Pq Lk http://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).