.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD: src/share/man/man4/multicast.4,v 1.4 2004/07/09 09:22:36 ru Exp $
.\" $OpenBSD: multicast.4,v 1.10 2015/02/16 16:38:54 naddy Exp $
.\" $NetBSD: multicast.4,v 1.3 2004/09/12 13:12:26 wiz Exp $
.\"
.Dd $Mdocdate: February 16 2015 $
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In netinet/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first multicast routing protocol to be developed.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF),
Core Based Trees (CBT),
Protocol Independent Multicast \- Sparse Mode (PIM-SM),
and Protocol Independent Multicast \- Dense Mode (PIM-DM)
were developed as well.
.Pp
To start multicast routing,
the user must enable multicast forwarding via the
.Xr sysctl 8
variables
.Va net.inet.ip.mforwarding
and/or
.Va net.inet.ip6.mforwarding .
The user must also run a multicast routing capable user-level process,
such as
.Xr mrouted 8 .
From a developer's point of view,
the API described in the
.Sx Programming Guide
section should be used to control multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is then used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privilege
(i.e., root privilege):
.Bd -literal -offset indent
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal -offset indent
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(IPv4 or IPv6, respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same
.Va mrouter_s4
or
.Va mrouter_s6
socket should be used
for sending and receiving IGMP or MLD messages, respectively.
In the case of
.Bx -derived
kernels,
it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g.,
.Tn Linux )
require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons, the multicast
routing socket should be reused for IGMP and MLD messages as well.
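.Pp
The socket setup described above can be wrapped in a small helper.
The following is a minimal sketch rather than a definitive implementation:
the function name
.Fn open_mrouter_socket
is hypothetical and not part of any system API, and the use of
.Er EAFNOSUPPORT
for unknown address families is an assumption made for the example.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>

/*
 * Hypothetical helper (not a system API): open the raw socket that
 * is later handed to MRT_INIT (AF_INET) or MRT6_INIT (AF_INET6).
 * Returns the descriptor, or -1 with errno set.
 */
int
open_mrouter_socket(int family)
{
	switch (family) {
	case AF_INET:
		return (socket(AF_INET, SOCK_RAW, IPPROTO_IGMP));
	case AF_INET6:
		return (socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6));
	default:
		/* Assumption for this sketch: reject other families. */
		errno = EAFNOSUPPORT;
		return (-1);
	}
}
```

Without the privilege noted above, the underlying
.Xr socket 2
call is expected to fail, so the result should be checked before any
multicast routing option is set on the descriptor.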
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal -offset 5n
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal -offset 5n
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
    sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if either PIM-SM or
PIM-DM is being used
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., a physical interface or a virtual tunnel)
that will be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal -offset 3n
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = max_rate_limit;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
if (vc.vifc_flags & VIFF_TUNNEL)
    memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
        sizeof(vc.vifc_rmt_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
    sizeof(vc));
.Ed
.Pp
The
.Va vif_index
must be unique per vif.
The
.Va vif_flags
contains the
.Dv VIFF_*
flags as defined in
.In netinet/ip_mroute.h .
The
.Va min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would be 1.
The
.Va max_rate_limit
contains the maximum rate (in bits/s) of the multicast data packets forwarded
on that vif.
A value of 0 means no limit.
The
.Va vif_local_address
contains the local IP address of the corresponding local interface.
The
.Va vif_remote_address
contains the remote IP address for DVMRP multicast tunnels.
.Bd -literal -offset indent
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
    sizeof(mc));
.Ed
.Pp
The
.Va mif_index
must be unique per mif.
The
.Va mif_flags
contains the
.Dv MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Va pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal -offset indent
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
    sizeof(vifi));
.Ed
.Bd -literal -offset indent
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
    sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled, and the multicast virtual
interfaces have been
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier and initialized with
.Dv MRT_INIT
or
.Dv MRT6_INIT .
The IPv4 upcalls have a
.Vt "struct igmpmsg"
header (see
.In netinet/ip_mroute.h )
with the
.Va im_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip"
with the protocol field
.Va ip_p
set to zero.
The IPv6 upcalls have a
.Vt "struct mrt6msg"
header (see
.In netinet6/ip6_mroute.h )
with the
.Va im6_mbz
field set to zero.
Note that this header follows the structure of
.Vt "struct ip6_hdr"
with the next header field
.Va ip6_nxt
set to zero.
.Pp
The upcall header contains the
.Va im_msgtype
and
.Va im6_msgtype
fields, with the type of the upcall
.Dv IGMPMSG_*
and
.Dv MRT6MSG_*
for IPv4 and IPv6, respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dv IGMPMSG_NOCACHE
or
.Dv MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
.Pp
An MFC entry is added by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Va source_addr
and
.Va group_addr
fields are the source and group address of the multicast packet (as set
in the upcall message).
The
.Va iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Va oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that for IPv6 only the set of outgoing interfaces can
be specified.
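.Pp
The relationship between the IPv4 TTL array and the IPv6 interface set
can be illustrated with a small, self-contained sketch.
It collapses a per-interface TTL array into a bitmask in which bit
.Va i
is set when interface
.Va i
is an outgoing interface.
The function name
.Fn oifs_ttl_to_mask
and the 32-bit mask are assumptions made for illustration only;
the kernel API itself uses
.Va mfcc_ttls[]
for IPv4 and an interface set filled in with
.Fn IF_SET
for IPv6.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive a set of outgoing interfaces from a
 * TTL threshold array. A zero threshold means "not an outgoing
 * interface"; any non-zero value selects the interface.
 * Only the first 32 entries fit into the illustrative mask.
 */
uint32_t
oifs_ttl_to_mask(const unsigned char *oifs_ttl, size_t n)
{
	uint32_t mask = 0;
	size_t i;

	for (i = 0; i < n && i < 32; i++)
		if (oifs_ttl[i] > 0)
			mask |= (uint32_t)1 << i;
	return (mask);
}
```

The actual TTL values are lost in the conversion, which mirrors the note
above that for IPv6 only the set of outgoing interfaces can be specified.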
.Pp
An MFC entry is deleted by:
.Bd -literal -offset indent
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal -offset indent
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal -offset indent
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = vif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
Adding new features to the kernel makes it difficult
to preserve backward compatibility (binary and API),
and at the same time to allow user-level processes to take advantage of
the new features (if the kernel supports them).
.Pp
One of the mechanisms that allows preserving the backward
compatibility is a sort of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process does not
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In netinet/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new
.Fn getsockopt Ns / Ns Fn setsockopt
options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in
.Vt uint32_t ,
i.e., a maximum of 32 new features).
The new
.Fn getsockopt Ns / Ns Fn setsockopt
options are
.Dv MRT_API_SUPPORT
and
.Dv MRT_API_CONFIG .
An example:
.Bd -literal -offset 3n
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
This would set
.Va v
to the pre-defined bits that the kernel API supports.
The eight least significant bits in
.Vt uint32_t
are the same as the
eight possible flags
.Dv MRT_MFC_FLAGS_*
that can be used in
.Va mfcc_flags
as part of the new definition of
.Vt "struct mfcctl"
(see below about those flags), which leaves 24 flags for other new features.
The value returned by
.Fn getsockopt MRT_API_SUPPORT
is read-only; in other words,
.Fn setsockopt MRT_API_SUPPORT
would fail.
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal -offset 3n
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);        /* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when
.Fn setsockopt MRT_API_CONFIG
is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Va v
is the actual (sub)set of features that were enabled in the kernel.
To obtain the same set of enabled features later, use:
.Bd -literal -offset indent
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
In other words,
.Fn setsockopt MRT_API_CONFIG
should be called right after
.Fn setsockopt MRT_INIT .
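.Pp
The outcome of the negotiation can be checked with plain bit arithmetic.
The following is a minimal sketch, not a definitive implementation:
.Fn mrt_api_missing
is a hypothetical helper, and the
.Dv SK_*
constants are local copies of two flag values defined in this section,
renamed so that they cannot clash with the real definitions in
.In netinet/ip_mroute.h .

```c
#include <stdint.h>

/* Local copies of flag values from this page, renamed for the sketch. */
#define SK_DISABLE_WRONGVIF	(1U << 0)	/* MRT_MFC_FLAGS_DISABLE_WRONGVIF */
#define SK_BW_UPCALL		(1U << 9)	/* MRT_MFC_BW_UPCALL */

/*
 * Hypothetical helper: given the feature set passed to
 * setsockopt(MRT_API_CONFIG) and the set the kernel reported back,
 * return the requested features that were NOT enabled.
 * A result of zero means every requested feature is available.
 */
uint32_t
mrt_api_missing(uint32_t wanted, uint32_t granted)
{
	return (wanted & ~granted);
}
```

A routing process could use a non-zero result to fall back to the basic
API for the affected features instead of treating the whole negotiation
as a failure.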
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Vt "struct mfcctl2"
instead of the traditional
.Vt "struct mfcctl" .
The original
.Vt "struct mfcctl"
is kept as is.
The new
.Vt "struct mfcctl2"
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
        /* the mfcctl fields */
        struct in_addr  mfcc_origin;         /* ip origin of mcasts */
        struct in_addr  mfcc_mcastgrp;       /* multicast group associated */
        vifi_t          mfcc_parent;         /* incoming vif */
        u_char          mfcc_ttls[MAXVIFS];  /* forwarding ttls on vifs */

        /* extension fields */
        uint8_t         mfcc_flags[MAXVIFS]; /* the MRT_MFC_FLAGS_* flags */
        struct in_addr  mfcc_rp;             /* the RP address */
};
.Ed
.Pp
The new fields are
.Va mfcc_flags[MAXVIFS]
and
.Va mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Va mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
.Ed
.Pp
The
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dv IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually this signal is used to
complete the shortest-path switch for PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dv MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for some of the
interfaces, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dv MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (when the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Va mfcc_rp
field is used to specify the RP address (for PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Va mfcc_rp
field is used only if the
.Dv MRT_MFC_RP
advanced API flag/capability has been successfully set by
.Fn setsockopt MRT_API_CONFIG .
.Pp
.\"
.\" 3.
.\" Kernel-level PIM Register encapsulation
.\"
If the
.Dv MRT_MFC_RP
flag was successfully set by
.Fn setsockopt MRT_API_CONFIG ,
then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside
.Dv IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address would be taken from the
.Va mfcc_rp
field
inside the new
.Vt "struct mfcctl2" .
However, even if the
.Dv MRT_MFC_RP
flag was successfully set, if the
.Va mfcc_rp
field was set to
.Dv INADDR_ANY ,
then the
kernel will still deliver an
.Dv IGMPMSG_WHOLEPKT
upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet, for example, would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or for PIM-SM it can initiate the (S,G) shortest-path switch if
the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a data flow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution is far from being scalable, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be
more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Furthermore, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
.Fn setsockopt MRT_API_CONFIG
for the
.Dv MRT_MFC_BW_UPCALL
flag.
.It
The bandwidth-upcall filters are added/deleted by the new
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL
respectively (with the appropriate
.Vt "struct bw_upcall"
argument of course).
.El
.Pp
From an application point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. Flows are
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (3s).
 * The threshold is set in packets and/or bytes per_interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurement:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
        struct timeval  b_time;
        uint64_t        b_packets;
        uint64_t        b_bytes;
};

struct bw_upcall {
        struct in_addr  bu_src;         /* source address */
        struct in_addr  bu_dst;         /* destination address */
        uint32_t        bu_flags;       /* misc flags (see below) */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets) */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes) */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d */
        struct bw_data  bu_threshold;   /* the bw threshold */
        struct bw_data  bu_measured;    /* the measured bw */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX                          128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC    3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC   0
.Ed
.Pp
The
.Vt bw_upcall
structure is used as an argument to
.Fn setsockopt MRT_ADD_BW_UPCALL
and
.Fn setsockopt MRT_DEL_BW_UPCALL .
Each
.Fn setsockopt MRT_ADD_BW_UPCALL
installs a filter in the kernel
for the source and destination address in the
.Vt bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
 if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is >= threshold");
  }
  if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is <= threshold");
  }
.Ed
.Pp
In the same
.Vt bw_upcall ,
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, the bandwidth
estimation could be less accurate, or the potentially very high frequency
of the generated upcalls could introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Va threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
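.Pp
The pseudo-algorithm above can be expressed as a small C predicate.
This is a minimal sketch for illustration, not the kernel implementation:
the
.Dv SK_*
constants copy the
.Dv BW_UPCALL_*
flag values listed in this section under different names, and
.Vt "struct sk_bw"
and
.Fn sk_bw_should_upcall
are hypothetical, with the interval simplified to a microsecond count
instead of a
.Vt "struct timeval" .

```c
#include <stdint.h>

/* Local copies of the BW_UPCALL_* flag values, renamed for the sketch. */
#define SK_UNIT_PACKETS	(1U << 0)
#define SK_UNIT_BYTES	(1U << 1)
#define SK_GEQ		(1U << 2)
#define SK_LEQ		(1U << 3)

/* Hypothetical, simplified counterpart of struct bw_data. */
struct sk_bw {
	uint64_t interval_usec;	/* measurement interval */
	uint64_t packets;
	uint64_t bytes;
};

/* Return 1 if a filter with these flags and threshold would fire. */
int
sk_bw_should_upcall(uint32_t flags, const struct sk_bw *measured,
    const struct sk_bw *threshold)
{
	int by_packets = (flags & SK_UNIT_PACKETS) != 0;
	int by_bytes = (flags & SK_UNIT_BYTES) != 0;

	/* ">=" may fire before the interval has expired. */
	if (flags & SK_GEQ)
		return ((by_packets &&
		    measured->packets >= threshold->packets) ||
		    (by_bytes && measured->bytes >= threshold->bytes));
	/* "<=" is only decidable once the full interval has elapsed. */
	if ((flags & SK_LEQ) &&
	    measured->interval_usec >= threshold->interval_usec)
		return ((by_packets &&
		    measured->packets <= threshold->packets) ||
		    (by_bytes && measured->bytes <= threshold->bytes));
	return (0);
}
```

Because the GEQ and LEQ flags are mutually exclusive, returning from the
first branch is equivalent to the two independent tests in the
pseudo-algorithm.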
.Sh EXAMPLES
.Bd -literal -offset indent
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
    (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use
.Dv MRT_DEL_BW_UPCALL ,
with the fields of
.Va bw_upcall
set to exactly the same values as when
.Dv MRT_ADD_BW_UPCALL
was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Va bu_src
and
.Va bu_dst
fields in
.Vt "struct bw_upcall"
need to be set, together with only the
.Dv BW_UPCALL_DELETE_ALL
flag in the
.Va bw_upcall.bu_flags
field.
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal -offset indent
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Vt "struct bw_upcall"
elements (up to
.Dv BW_UPCALLS_MAX
= 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Vt "struct bw_upcall"
element, the
.Va bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
.Va bu_measured.b_time
is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Va bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded packets statistics with the
.Fn ioctl SIOCGETSGCNT
mechanism
(see the
.Sx Programming Guide
section) .
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Va bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Va bu_threshold.b_time
after the previous upcall.
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr icmp6 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr pim 4 ,
.Xr mrouted 8 ,
.Xr sysctl 8
.\"
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs),
and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
.Pp
The IPv6 multicast support was implemented by the KAME project
.Pq Lk http://www.kame.net ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI)
in collaboration with
.An Chris Brown
(NextHop).
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).