Multicast over TCP/IP HOWTO: The internals.

7. The internals.

This section's aim is to provide some information, not needed to reach a basic understanding on how multicast works nor to be able to write multicast programs, but which is very interesting, gives some insight on the underlying multicast protocols and implementations, and may be useful to avoid common errors and misunderstandings.

7.1 IGMP.

When talking about IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP, we said that the information provided by this "commands" was used by the kernel to choose which multicast datagrams accept or discard. This is true, but it is not all the truth. Such a simplification would imply that multicast datagrams for all multicast groups around the world would be received by our host, and then it would check the memberships issued by processes running on it to decide whether to pass the traffic to them or to throw it out. As you can imagine, this is a complete bandwidth waste.

What actually happens is that hosts instruct their routers telling them which multicast groups they are interested in; then, those routers tell their up-stream routers they want to receive that traffic, and so on. Algorithms employed for making the decision of when to ask for a group's traffic or saying that it is not desired anymore, vary a lot. There's something, however, that never changes: how this information is transmitted. IGMP is used for that. It stands for Internet Group Management Protocol. It is a new protocol, similar in many aspects to ICMP, with a protocol number of 2, whose messages are carried in IP datagrams, and which all level 2-compliant host are required to implement.

As said before, it is used both by hosts giving membership information to its routers, and by routers to communicate between themselves. In the following I'll cover only the hosts-routers relationships, mainly because I was unable to find information describing router to router communication other than the mrouted source code (rfc 1075 describing the Distance Vector Multicast Routing Protocol is now obsoleted, and mrouted implements a modified DVMRP not yet documented).

IGMP version 0 is specified in RFC-988 which is now obsoleted. Almost no one uses version 0 now.

IGMP version 1 is described in RFC-1112 and, although it is updated by RFC-2236 (IGMP version 2) it is in wide use still. The Linux kernel implements the full IGMP version 1 and parts of version 2 requirements, but not all.

Now I'll try to give an informal description of the protocol. You can check RFC-2236 for an in-proof formal description, with lots of state diagrams and time-out boundaries.

All IGMP messages have the following structure:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type     | Max Resp Time |           Checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Group Address                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

IGMP version 1 (hereinafter IGMPv1) labels the "Max Resp Time" as "Unused", zeroes it when sent, and ignores it when received. Also, it brakes the "Type" field in two 4-bits wide fields: "Version" and "Type". As IGMPv1 identifies a "Membership Query" message as 0x11 (version 1, type 1) and IGMPv2 as 0x11 too, the 8 bits have the same effective interpretation.

I think it is more instructive to give first the IGMPv1 description and next point out the IGMPv2 additions, as they are mainly that, additions.

For the following discussions it is important to remember that multicast routers receive all IP multicast datagrams.

IGMP version 1.

Routers periodically send IGMP Host Membership Queries to the all-hosts group (224.0.0.1) with a TTL of 1 (once every minute or two). All multicast-capable hosts hear them, but don't answer immediately to avoid an IGMP Host Membership Report storm. Instead, they start a random delay timer for each group they belong to on the interface they received the query.

Sooner or later, the timer expires in one of the hosts, and it sends an IGMP Host Membership Report (also with TTL 1) to the multicast address of the group being reported. As it is sent to the group, all hosts that joined the group -and which are currently waiting for their own timer to expire- receive it, too. Then, they stop their timers and don't generate any other report. Just one is generated -by the host that chose the smaller timeout-, and that is enough for the router. It only needs to know that there are members for that group in the subnet, not how many nor which.

When no reports are received for a given group after a certain number of queries, the router assumes that no members are left, and thus it doesn't have to forward traffic for that group on that subnet. Note that in IGMPv1 there are no "Leave Group messages".

When a host joins a new group, the kernel sends a report for that group, so that the respective process needs not to wait a minute or two until a new membership query is received. As you can see this IGMP packet is generated by the kernel as a response to the IP_ADD_MEMBERSHIP command, seen in section IP_ADD_MEMBERSHIP. Note the emphasis in the adjective "new": if a process issues an IP_ADD_MEMBERSHIP command for a group the host is already a member of, no IGMP packets are sent as we must already be receiving traffic for that group; instead, a counter for that group's use is incremented. IP_DROP_MEMBERSHIP generates no datagrams in IGMPv1.

Host Membership Queries are identified by Type 0x11, and Host Membership Reports by Type 0x12.

No reports are sent for the all-hosts group. Membership in this group is permanent.

IGMP version 2.

One important addition to the above is the inclusion of a Leave Group message (Type 0x17). The reason is to reduce the bandwidth waste between the time the last host in the subnet drops membership and the time the router times-out for its queries and decides there are no more members present for that group (leave latency). Leave Group messages should be addressed to the all-routers group (224.0.0.2) rather than to the group being left, as that information is of no use for other members (kernel versions up to 2.0.33 send them to the group; although it does no harm to the hosts, it's a waste of time as they have to process them, but don't gain useful information). There are certain subtle details regarding when and when-not to send Leave Messages; if interested, see the RFC.

When an IGMPv2 router receives a Leave Message for a group, it sends Group-Specific Queries to the group being left. This is another addition. IGMPv1 has no group-specific queries. All queries are sent to the all-hosts group. The Type in the IGMP header does not change (0x11, as before), but the "Group Address" is filled with the address of the multicast group being left.

The "Max Resp Time" field, which was set to 0 in transmission and ignored on reception in IGMPv1, is meaningful only in "Membership Query" messages. It gives the maximum time allowed before sending a report in units of 1/10 second. It is used as a tune mechanism.

IGMPv2 adds another message type: 0x16. It is a "Version 2 Membership Report" sent by IGMPv2 hosts if they detect an IGMPv2 router is present (an IGMPv2 host knows an IGMPv1 router is present when it receives a query with the "Max Response" field set to 0).

When more than one router claims to act as querier, IGMPv2 provides a mechanism to avoid "discussions": the router with the lowest IP address is designed to be querier. The other routers keep timeouts. If the router with lower IP address crashes or is shutdown, the decision of who will be the querier is taken again after the timers expire.

7.2 Kernel corner.

This sub-section gives some start-points to study the multicast implementation of the Linux kernel. It does not explain that implementation. It just says where to find things.

The study was carried over version 2.0.32, so it could be a bit outdated by the time you read it (network code seems to have changed A LOT in 2.1.x releases, for instance).

Multicast code in the Linux kernel is always surrounded by #ifdef CONFIG_IP_MULTICAST / #endif pairs, so that you can include/ exclude it from your kernel based on your needs (this inclusion/exclusion is done at compile time, as you probably know if reading that section... #ifdefs are handled by the preprocessor. The decision is made based in what you selected when doing either a make config, make menuconfig or make xconfig).

You might want multicast features, but if your Linux box is not going to act as a multicast router you will probably not want multicast router features included in your new kernel. For this you have the multicast routing code surrounded by #ifdef CONFIG_IP_MROUTE / #endif pairs.

Kernel sources are usually placed in /usr/src/linux. However, the place may change so, both for accuracy and brevity, I will refer to the root directory of the kernel sources as just LINUX. Then, something like LINUX/net/ipv4/udp.c should be the same as /usr/src/linux/net/ipv4/udp.c if you unpacked the kernel sources in the /usr/src/linux directory.

All multicast interfaces with user programs shown in the section devoted to multicast programming were driven across the setsockopt()/ getsockopt() system calls. Both of them are implemented by means of functions that make some tests to verify the parameters passed to them and which, in turn, call another function that makes some additional tests, demultiplexes the call based on the level parameter to either system call, and then calls another function which... (if interested in all this jumps, you can follow them in LINUX/net/socket.c (functions sys_socketcall() and sys_setsockopt(), LINUX/net/ipv4/af_inet.c (function inet_setsockopt()) and LINUX/net/ipv4/ip_sockglue.c (function ip_setsockopt()) ).

The one which interests us is LINUX/net/ipv4/ip_sockglue.c. Here we find ip_setsockopt() and ip_getsockopt() which are mainly a switch (after some error checking) verifying each possible value for optname. Along with unicast options, all multicast ones seen here are handled: IP_MULTICAST_TTL, IP_MULTICAST_LOOP, IP_MULTICAST_IF, IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP. Previously to the switch, a test is made to determine whether the options are multicast router specific, and if so, they are routed to the ip_mroute_setsockopt() and ip_mroute_getsockopt() functions (file LINUX/net/ipv4/ipmr.c).

In LINUX/net/ipv4/af_inet.c we can see the default values we talked about in previous sections (loopback enabled, TTL=1) provided when the socket is created (taken from function inet_create() in this file):

 
#ifdef CONFIG_IP_MULTICAST
        sk->ip_mc_loop=1;
        sk->ip_mc_ttl=1;
        *sk->ip_mc_name=0;
        sk->ip_mc_list=NULL;
#endif

Also, the assertion of "closing a socket makes the kernel drop all memberships this socket had" is corroborated by:

#ifdef CONFIG_IP_MULTICAST
                /* Applications forget to leave groups before exiting */
                ip_mc_drop_socket(sk);
#endif

taken from inet_release(), on the same file as before.

Device independent operations for the Link Layer are kept in LINUX/net/core/dev_mcast.c.

Two important functions are still missing: the processing of input and output multicast datagrams. As any other datagrams, incoming datagrams are passed from the device drivers to the ip_rcv() function (LINUX/net/ipv4/ip_input.c). In this function is where the perfect filtering is applied to multicast packets that crossed the devices layer (recall that lower layers only perform best-effort filtering and is IP who 100% knows whether we are interested in that multicast group or not). If the host is acting as a multicast router, this function decides too whether the datagram should be forwarded and calls ipmr_forward() appropriately. (ipmr_forward() is implemented in LINUX/net/ipv4/ipmr.c).

Code in charge of out-putting packets is kept in LINUX/net/ipv4/ip_output.c. Here is where the IP_MULTICAST_LOOP option takes effect, as it is checked to see whether to loop back the packets or not (function ip_queue_xmit()). Also the TTL of the outgoing packet is selected based on whether it is a multicast or unicast one. In the former case, the argument passed to the IP_MULTICAST_TTL option is used (function (ip_build_xmit()).

While working with mrouted (a program which gives the kernel information about how to route multicast datagrams), we detected that all multicast packets originated on the local network were properly routed..., except the ones from the Linux box that was acting as the multicast router!! ip_input.c was working OK, but it seemed ip_output.c wasn't. Reading the source code for the output functions, we found that outgoing datagrams were not being passed to ipmr_forward(), the function that had to decide whether they should be routed or not. The packets were outputed to the local network but, as network cards are usually unable to read their own transmissions, those datagrams were never routed. We added the necessary code to the ip_build_xmit() function and everything was OK again. (Having the sources for your kernel is not a luxury or pedantry; it's a need!)

ipmr_forward() has been mentioned a couple of times. It is an important function as it solves one important misunderstanding that appears to be widely expanded. When routing multicast traffic, it is not mrouted who makes the copies and sends them to the proper recipients. mrouted receives all multicast traffic and, based on that information, computes the multicast routing tables and tells the kernel how to route: "datagrams for this group coming from that interface should be forwarded to those interfaces". This information is passed to the kernel by calls to setsockopt() on a raw socket opened by the mrouted daemon (the protocol specified when the raw socket was created must be IPPROTO_IGMP). This options are handled in the ip_mroute_setsockopt() function from LINUX/net/ipv4/ipmr.c. The first option (would be better to call them commands rather than options) issued on that socket must be MRT_INIT. All other commands are ignored (returning -EACCES) if MRT_INIT is not issued first. Only one instance of mrouted can be running at the same time in the same host. To keep track of this, when the first MRT_INIT is received, an important variable, struct sock* mroute_socket, is pointed to the socket MRT_INIT was received on. If mroute_socket is not null when attending an MRT_INIT this means another mrouted is already running and -EADDRINUSE is returned. All resting commands (MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC, MRT_DEL_MFC and MRT_ASSERT) return -EACCES if they come from a socket different than mroute_socket.

As routed multicast datagrams can be received/sent across either physical interfaces or tunnels, a common abstraction for both was devised: VIFs, Virtual InterFaces. mrouted passes vif structures to the kernel, indicating physical or tunnel interfaces to add to its routing tables, and multicast forwarding entries saying where to forward datagrams.

VIFs are added with MRT_ADD_VIF and deleted with MRT_DEL_VIF. Both pass a struct vifctl to the kernel (defined in /usr/include/linux/mroute.h) with the following information:

struct vifctl {
        vifi_t  vifc_vifi;              /* Index of VIF */
        unsigned char vifc_flags;       /* VIFF_ flags */
        unsigned char vifc_threshold;   /* ttl limit */
        unsigned int vifc_rate_limit;   /* Rate limiter values (NI) */
        struct in_addr vifc_lcl_addr;   /* Our address */
        struct in_addr vifc_rmt_addr;   /* IPIP tunnel addr */
};

With this information a vif_device structure is built:

struct vif_device
{
        struct device   *dev;                   /* Device we are using */
        struct route    *rt_cache;              /* Tunnel route cache */
        unsigned long   bytes_in,bytes_out;
        unsigned long   pkt_in,pkt_out;         /* Statistics */
        unsigned long   rate_limit;             /* Traffic shaping (NI) */
        unsigned char   threshold;              /* TTL threshold */
        unsigned short  flags;                  /* Control flags */
        unsigned long   local,remote;           /* Addresses(remote for tunnels)*/
};

Note the dev entry in the structure. The device structure is defined in /usr/include/linux/netdevice.h file. It is a big structure, but the field that interests us is:

  struct ip_mc_list*    ip_mc_list;   /* IP multicast filter chain    */

The ip_mc_list structure -defined in /usr/include/linux/igmp.h- is as follows:

struct ip_mc_list
{
        struct device *interface;
        unsigned long multiaddr;
        struct ip_mc_list *next;
        struct timer_list timer;
        short tm_running;
        short reporter;
        int users;
};

So, the ip_mc_list member from the dev structure is a pointer to a linked list of ip_mc_list structures, each containing an entry for each multicast group the network interface is a member of. Here again we see membership is associated to interfaces. LINUX/net/ipv4/ip_input.c traverses this linked list to decide whether the received datagram is destined to any group the interface that received the datagram belongs to:

#ifdef CONFIG_IP_MULTICAST
                if(!(dev->flags&IFF_ALLMULTI) && brd==IS_MULTICAST 
                   && iph->daddr!=IGMP_ALL_HOSTS 
                   && !(dev->flags&IFF_LOOPBACK))
                {
                        /*
                         *      Check it is for one of our groups
                         */
                        struct ip_mc_list *ip_mc=dev->ip_mc_list;
                        do
                        {
                                if(ip_mc==NULL)
                                {
                                        kfree_skb(skb, FREE_WRITE);
                                        return 0;
                                }
                                if(ip_mc->multiaddr==iph->daddr)
                                        break;
                                ip_mc=ip_mc->next;
                        }
                        while(1);
                }
#endif

The users field in the ip_mc_list structure is used to implement what was said in section IGMP version 1: if a process joins a group and the interface is already a member of that group (ie, another process joined that same group in that same interface before) only the count of members (users) is incremented. No IGMP messages are sent, as you can see in the following code (taken from ip_mc_inc_group(), called by ip_mc_join_group(), both in LINUX/net/ipv4/igmp.c):

        for(i=dev->ip_mc_list;i!=NULL;i=i->next)
        {
                if(i->multiaddr==addr)
                {
                        i->users++;
                        return;
                }
        }

When dropping memberships, the counter is decremented and additional operations are performed only when the count reaches 0 (ip_mc_dec_group()).

MRT_ADD_MFC and MRT_DEL_MFC set or delete forwarding entries in the multicast routing tables. Both pass a struct mfcctl to the kernel (also defined in /usr/include/linux/mroute.h) with this information:

struct mfcctl
{
        struct in_addr mfcc_origin;             /* Origin of mcast      */
        struct in_addr mfcc_mcastgrp;           /* Group in question    */
        vifi_t  mfcc_parent;                    /* Where it arrived     */
        unsigned char mfcc_ttls[MAXVIFS];       /* Where it is going    */
};

With all this information in hand, ipmr_forward() "walks" across the VIFs, and if a matching is found it duplicates the datagram and calls ipmr_queue_xmit() which, in turn, uses the output device specified by the routing table and the proper destination address if the packet is to be sent across a tunnel (ie, the unicast destination address of the other end of the tunnel).

Function ip_rt_event() (not directly related to output, but which is in ip_output.c too) receives events related to a network device, like the device going up. This function assures that then the device joins the ALL-HOSTS multicast group.

IGMP functions are implemented in LINUX/net/ipv4/igmp.c. Important information for that functions appears in /usr/include/linux/igmp.h and /usr/include/linux/mroute.h. The IGMP entry in the /proc/net directory is created with ip_init() in LINUX/net/ipv4/ip_output.c.

Next Previous Contents