As part of this project we modified the IP layer of the Linux kernel to support sending packets over multiple IP interfaces. The goal of this portion of the project was to answer the following questions:
1. Can we increase the effective bandwidth between 2 computers by using multiple net devices instead of just one? By how much? Up to the aggregate bandwidth?
2. Can we increase the fault tolerance between 2 computers by sending data over multiple devices? How robustly? Can we account for congestion at the IP layer to dynamically load balance across the interfaces? Can we detect downed lines at the IP layer and load balance around them?

Previous work:
Through research into the subject and contact with Alan Cox (the guy heading up most of Linux networking), I've found 2 other network hacks which touch on this work.
1. EQL. Simon Janes, sjanes@telum.abies.com, implemented serial line load balancing. This was intended for use on multiple PPP or SLIP connections.


|your       |---------Phone  ------------|Rest of the
|computer   |---------Company------------|World

The code is in pretty good condition and is implemented as a module.

2. Channel bonding. The beowulf project, http://cesdis.gsfc.nasa.gov/linux-web/beowulf/beowulf.html, has a goal of constructing a parallel computer using off-the-shelf components. They found themselves bandwidth limited on the devices, so they constructed a small patch to the Linux kernel which allows sending across multiple devices. This project is limited because it requires fully independent networks: if you have 4 computers, each with 4 interfaces, an interface on one computer cannot reach more than one interface on another computer.

Status:
The current implementation is up and running on Ledaig and Convalmore. It is a software device driver derived from the eql implementation by Simon Janes, supporting many striping methods (instead of one) and (hopefully) all IP device driver interfaces. I call it IPEQL. Because the device driver is a network driver, the kernel treats it the same as any other device driver, allowing assignment of routes to the driver in the routing table. The issue of ARPs is avoided because in the Linux kernel ethernet device drivers handle their own ARPing.

A basic use of IPEQL would involve the following steps:
1. Use ifconfig to make ipeql, eth1, and eth2 active. This automatically loads the necessary modules into the kernel.
2. Use the user level program 'ipeql_enslave' to enslave eth1 and eth2.
3. Add ipeql to the routing table using 'route'.

There are several additional subtleties. 'ipeql_enslave' can be used to gather information about the ipeql device, including which striping method is currently in use and the status of slave devices. In addition, slaves can be freed, alternative striping methods can be set, and various parameters of particular striping methods can be set.

A casual comparison with EQL suggests improved flexibility in striping methods and range of devices, though I am not certain that functionality with respect to serial IP devices is preserved.
A comparison to Beowulf's channel bonding shows much improved flexibility at a cost in performance. Though no recopying of the packet is incurred, striping involves 3 function calls of overhead: one to enter the device, one to switch on the striping method, and one to get the target slave. This is considerably larger than the nearly negligible short list traversal which Beowulf's channel bonding uses. There is an additional, hidden cost because the hardware address caching mechanism used in Linux is lost.
A quick analysis suggests that on a Pentium Pro 200, two fast ethernet connections become CPU bound when driven by a UDP source using IPEQL. When sending a UDP source to a single fast ethernet connection, about 30% of the CPU is utilized. The difference in performance is such that it seems unlikely IPEQL would be a replacement for Beowulf's channel bonding when fast ethernet connections are utilized.

Implementation details:

The programming portion of this project was more difficult than I anticipated, due partially to the use of kernel code incurring more serious errors (hard reboot instead of seg fault) and other difficulties in debugging. The Linux 2.0.XX TCP code turned out to make assumptions about the sending of packets on devices which were incorrect for our implementation. Luckily, the 2.1.XX series fixed this poor assumption, allowing us to progress without resorting to hacking on the TCP implementation.

The basic implementation has two threads of execution. A background thread cleans the state of ipeql (detecting slave devices which have died and removing them) and, in the case of one striping method, implements a decay on the priority of sending across the interfaces. The background thread is set to wake up rarely enough (about every second) that it will not have any effect on performance. The code resides in 'ipeql_timer(...)'. The load balancing thread is called when a packet is ready to be sent out. It first finds the best slave to send the packet on using the current striping method. Linux IP actually attempts to build the device layer headers by hand, forcing IPEQL to rip off its own fake header and install the slave device's header. IPEQL then transmits the packet on the device. The code may be viewed in 'ipeql_slave_xmit(...)'.
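
To make the structure concrete, here is a small user-space sketch of the two paths just described: a timer pass that drops dead slaves and a transmit path that asks the current striping method for a slave. The types, helper names, and the trivial "first live slave" method below are illustrative stand-ins, not the actual kernel structures or the IPEQL source.

    #include <stdio.h>

    /* Illustrative stand-ins; the real IPEQL works on kernel net devices. */
    struct slave {
        const char   *name;
        int           alive;          /* cleared when the device goes down */
        struct slave *next;
    };

    struct ipeql {
        struct slave *slaves;
        /* current striping method: returns the slave to transmit on */
        struct slave *(*get_slave)(struct ipeql *eql, int pkt_len);
    };

    /* Background pass (cf. ipeql_timer): unlink slaves that have died. */
    static void ipeql_timer(struct ipeql *eql)
    {
        struct slave **pp = &eql->slaves;

        while (*pp != NULL) {
            if (!(*pp)->alive)
                *pp = (*pp)->next;    /* drop the dead slave */
            else
                pp = &(*pp)->next;
        }
    }

    /* Transmit path (cf. ipeql_slave_xmit): pick a slave, rebuild the
     * link-layer header for that slave, and send. Header handling is
     * only hinted at here. */
    static int ipeql_slave_xmit(struct ipeql *eql, int pkt_len)
    {
        struct slave *s = eql->get_slave(eql, pkt_len);

        if (s == NULL)
            return -1;                /* no usable slave left */
        /* ...strip IPEQL's fake header, build s's header, transmit... */
        printf("%d bytes -> %s\n", pkt_len, s->name);
        return 0;
    }

    /* Trivial stand-in method: always use the first live slave. */
    static struct slave *first_slave(struct ipeql *eql, int pkt_len)
    {
        (void)pkt_len;
        return eql->slaves;
    }

    int main(void)
    {
        struct slave eth2 = { "eth2", 1, NULL };
        struct slave eth1 = { "eth1", 0, &eth2 };   /* eth1 has gone down */
        struct ipeql eql  = { &eth1, first_slave };

        ipeql_timer(&eql);            /* removes eth1 */
        ipeql_slave_xmit(&eql, 1500); /* goes out on eth2 */
        return 0;
    }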

The only devices we were given to test striping over were 2 10/100 Base-T ethernet cards. All striping methods appear to be working across these devices. I presume, though I'm not sure, that striping across serial IP interfaces is also working.

Striping methods:
I implemented several striping methods on the belief that no one striping method will fit every situation. In some situations more information is available, allowing more elaborate striping algorithms to be used than in others. Also, in some situations, speed of execution is what matters most, so the overhead of elaborate algorithms is not desired.

The first striping algorithm I implemented is round robin (round_robin). The round robin striping method is most useful when the slave devices have the same bandwidths and latencies. It is the most efficient striping algorithm implemented, incurring only a small constant cost to decide which device is next, so round robin is useful with high bandwidth devices. The round robin algorithm is also used in Beowulf's channel bonding.
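
A minimal user-space sketch of the round robin selection (illustrative, not the driver source): the method only needs to remember which slave was used last and advance to the next one.

    #include <stdio.h>

    /* Minimal round-robin selection over an array of slave names;
     * the real method walks IPEQL's slave list, but the idea is the same. */
    static const char *round_robin(const char **slaves, int nslaves, int *last)
    {
        *last = (*last + 1) % nslaves;   /* constant-time: just advance */
        return slaves[*last];
    }

    int main(void)
    {
        const char *slaves[] = { "eth1", "eth2" };
        int last = -1;

        for (int i = 0; i < 6; i++)      /* packets alternate eth1, eth2, ... */
            printf("packet %d -> %s\n", i, round_robin(slaves, 2, &last));
        return 0;
    }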

Load balancing with varied device bandwidths (bw_load) is implemented. This striping method is useful when latencies on transmission between the devices are the same but the bandwidths of the devices are different. One example might be a fast ethernet/ethernet pair. The algorithm does some bookkeeping for each slave device, incurring an O(number of slaves) cost in deciding which slave to send the packet out on. The algorithm runs through each device, subtracting the relative bandwidth from a 'priority' for each device. The device with the lowest 'priority' is chosen, and the sum of the relative bandwidths of the slave devices is added to the 'priority' of the selected device. In this way packets will be split across slave devices in proportion to their bandwidths. bw_load requires one parameter per slave device: the bandwidth of that device.
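
The selection step can be sketched in user space as follows (illustrative, not the driver source); the bandwidths 100 and 10 stand in for a fast ethernet/ethernet pair, and over many packets the split comes out roughly proportional to bandwidth.

    #include <stdio.h>

    #define NSLAVES 2

    /* Relative bandwidths, e.g. fast ethernet and ethernet (100:10). */
    static const long bw[NSLAVES] = { 100, 10 };
    static long prio[NSLAVES]     = { 0, 0 };

    /* One bw_load selection: each priority drops by that slave's relative
     * bandwidth, the lowest priority wins, and the winner is then charged
     * the sum of all relative bandwidths. */
    static int bw_load_select(void)
    {
        long sum = 0;
        int  best = 0;

        for (int i = 0; i < NSLAVES; i++) {
            prio[i] -= bw[i];
            sum     += bw[i];
            if (prio[i] < prio[best])
                best = i;
        }
        prio[best] += sum;
        return best;
    }

    int main(void)
    {
        int count[NSLAVES] = { 0, 0 };

        for (int i = 0; i < 1100; i++)
            count[bw_load_select()]++;

        /* Expect roughly a 100:10 split, i.e. about 1000 vs 100 packets. */
        printf("slave 0: %d packets, slave 1: %d packets\n", count[0], count[1]);
        return 0;
    }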

Load balancing with polled decay (bw_poll) is a variant of the above algorithm. This is the algorithm used by EQL for serial line load balancing. Essentially, the above selection mechanism is unchanged. In addition, the background thread periodically awakens and shifts the 'priority' towards 0. This essentially attempts to take account of lag on devices, if you follow the assumption that bandwidth is inversely proportional to lag and the background thread wakes up often enough to decay the 'priority' appropriately. Due to the desire that the background thread consume few resources, bw_poll is only useful over slow serial devices, such as modems. With such devices, the polling mechanism will ensure that, for example, each letter typed over a telnet session goes over the faster device (assuming you type slowly).
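
The selection is the same as in bw_load; only the periodic decay is new. The exact form of the shift towards 0 is not specified above, so the sketch below simply moves each 'priority' a fixed step towards 0 per timer tick, which is one plausible (assumed) form of the decay.

    #include <stdio.h>

    #define NSLAVES 2
    static long prio[NSLAVES] = { 150, -40 };

    /* Hypothetical decay pass run from the background timer (roughly once
     * per second): nudge each priority a fixed step towards 0 so stale
     * history fades out on an idle bundle. */
    static void bw_poll_decay(long step)
    {
        for (int i = 0; i < NSLAVES; i++) {
            if (prio[i] > step)
                prio[i] -= step;
            else if (prio[i] < -step)
                prio[i] += step;
            else
                prio[i] = 0;
        }
    }

    int main(void)
    {
        for (int tick = 0; tick < 4; tick++) {
            bw_poll_decay(50);
            printf("tick %d: prio = { %ld, %ld }\n", tick, prio[0], prio[1]);
        }
        return 0;
    }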

Exact load balancing (bw_lag_val) attempts to exactly predict the arrival time of packets at the destination in order to send a packet out on the interface which will cause the packet to arrive quickest. Two parameters are required per slave device: the bandwidth of the slave device and the lag time for packets sent across the device. The exact calculation involves calculating the number of bytes queued on each slave device using the device bandwidth, a timestamp of the last transmit, the current time, and the prior number of queued bytes. Then, for each device, the send time is calculated based on the number of queued bytes for transmission and the lag time. The bw_lag_val striping method appears to be mostly a curiosity because seldom are the lag time and bandwidth known well enough to justify the overhead of the more elaborate algorithm over simpler algorithms. One possible use might be with serial lines where lag time is dominated by the modem transmission time.
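
The calculation can be sketched in user space as follows (illustrative; units and timestamps are simplified, and the modem bandwidth/lag figures are made up): queued bytes are first drained by bandwidth times elapsed time, then the predicted arrival time, lag + (queued bytes + packet length) / bandwidth, is compared across slaves.

    #include <stdio.h>

    struct slave {
        const char *name;
        double bw;          /* bytes per second (configured) */
        double lag;         /* seconds of fixed lag (configured) */
        double queued;      /* estimated bytes still queued on the device */
        double last_tx;     /* time of last transmit, in seconds */
    };

    /* Pick the slave on which a packet of pkt_len bytes is predicted to
     * arrive soonest, and charge the packet to that slave's queue. */
    static struct slave *bw_lag_val_select(struct slave *s, int n,
                                           int pkt_len, double now)
    {
        struct slave *best = NULL;
        double best_arrival = 0.0;

        for (int i = 0; i < n; i++) {
            /* Drain the queue estimate by what the device could have
             * sent since the last transmit. */
            s[i].queued -= s[i].bw * (now - s[i].last_tx);
            if (s[i].queued < 0.0)
                s[i].queued = 0.0;
            s[i].last_tx = now;

            /* Predicted arrival time of this packet on this slave. */
            double arrival = s[i].lag + (s[i].queued + pkt_len) / s[i].bw;
            if (best == NULL || arrival < best_arrival) {
                best = &s[i];
                best_arrival = arrival;
            }
        }
        best->queued += pkt_len;      /* the packet now sits in its queue */
        return best;
    }

    int main(void)
    {
        struct slave s[2] = {
            { "ppp0", 3600.0, 0.10, 0.0, 0.0 },   /* roughly a 28.8 modem */
            { "ppp1", 1800.0, 0.20, 0.0, 0.0 },   /* roughly a 14.4 modem */
        };

        for (int i = 0; i < 8; i++)
            printf("packet %d -> %s\n", i,
                   bw_lag_val_select(s, 2, 512, i * 0.05)->name);
        return 0;
    }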

bw_lag is a simplified version of bw_lag_val which requires only that the relative ordering of lag times be known. The same algorithm for decaying of queued bytes is used, but the determination of which device will transmit the quickest is simplified to consider only the bandwidth. The list of devices is implicitly stored in the relative order of lag times (smallest = fastest), creating a bias for sending on faster (lag-wise) interfaces. This approach matches fairly well with modems interfacing to the internet at large, because the absolute bandwidths are known to a good approximation but only the relative ordering of transmission lag times is known. A 14.4 modem will almost always have a longer ping time than a 28.8 modem, for example.
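
Under the same kind of illustrative structures, the sketch below shows how the comparison simplifies: only the estimated transmit time, (queued bytes + packet length) / bandwidth, is compared, and a strict less-than comparison over a list ordered by increasing lag breaks ties in favour of the lower-lag device. The exact comparison used by the driver may differ; this is an assumed form.

    #include <stdio.h>

    /* Slaves are listed in order of increasing lag (smallest lag first);
     * only the relative ordering of lags is needed, not their values. */
    struct slave {
        const char *name;
        double bw;        /* bytes per second */
        double queued;    /* estimated bytes queued (decayed as in bw_lag_val) */
    };

    static struct slave *bw_lag_select(struct slave *s, int n, int pkt_len)
    {
        struct slave *best = &s[0];

        for (int i = 1; i < n; i++) {
            /* Strict '<' means ties go to the earlier (lower lag) slave. */
            if ((s[i].queued + pkt_len) / s[i].bw <
                (best->queued + pkt_len) / best->bw)
                best = &s[i];
        }
        best->queued += pkt_len;
        return best;
    }

    int main(void)
    {
        struct slave s[2] = {
            { "ppp_fast", 3600.0, 0.0 },  /* lower lag, so listed first */
            { "ppp_slow", 1800.0, 0.0 },
        };

        for (int i = 0; i < 6; i++)
            printf("packet %d -> %s\n", i, bw_lag_select(s, 2, 512)->name);
        return 0;
    }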

The last striping method, auto_bw, attempts to adapt bw_lag by dynamic detection of congestion on interfaces. Essentially, the number of queued bytes is replaced by the number of queued packets on a particular device, which is dynamically detected by looking into the slave device. Preliminary tests show that the ethernet driver we were working with apparently would not queue packets for later transmission, either dropping them or transmitting immediately. This was unfortunate, as it seems likely that ether devices on crowded LANs could use auto_bw to automatically route traffic towards the less used LAN. auto_bw also attempts to take account of relative lag in the same manner as bw_lag.
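
The change relative to bw_lag is where the queue estimate comes from: the slave device itself is asked how many packets it currently has queued. The accessor in the sketch below is hypothetical; in the driver this information is obtained by looking into the slave device, and the queue length in packets is converted here to an approximate byte count.

    #include <stdio.h>

    struct slave {
        const char *name;
        double bw;        /* bytes per second */
        int    tx_queue;  /* stand-in for the device's own transmit queue */
    };

    /* Hypothetical accessor: in the driver, this would peek at the slave
     * net device's queue of packets awaiting transmission. */
    static int slave_queue_len(const struct slave *s)
    {
        return s->tx_queue;
    }

    /* auto_bw selection: like bw_lag, but the queue length comes from the
     * device rather than from the striping method's own bookkeeping.
     * Slaves are again listed in order of increasing lag. */
    static struct slave *auto_bw_select(struct slave *s, int n, int pkt_len)
    {
        struct slave *best = &s[0];

        for (int i = 1; i < n; i++)
            if ((slave_queue_len(&s[i]) * pkt_len + pkt_len) / s[i].bw <
                (slave_queue_len(best) * pkt_len + pkt_len) / best->bw)
                best = &s[i];
        return best;
    }

    int main(void)
    {
        struct slave s[2] = {
            { "eth1", 12500000.0, 8 },   /* congested: 8 packets queued */
            { "eth2", 12500000.0, 1 },   /* mostly idle */
        };

        printf("packet -> %s\n", auto_bw_select(s, 2, 1500)->name);
        return 0;
    }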

An inefficiency of current network stack implementations is the bundling of congestion detection and bandwidth detection with reliable delivery and streaming of data. Congestion detection algorithms attempt to dynamically detect both the bandwidth and the round trip time, and are naturally associated with a particular source/destination pair rather than with an application. If congestion detection were implemented per source/destination pair, efficiencies could likely be gained over multiple TCP connections, and the dynamically detected bandwidth and round trip times would be more naturally accessible to the IPEQL device driver, which could definitely use such information. It is unfortunate that this is not the case.

Goal fulfillment:
We wanted to build IP striping across multiple IP devices. The major goal of this project was boosting bandwidth by the use of multiple devices. A subsidiary goal was gaining fault tolerance via these same multiple devices.

The process of building a generalized IP stripe driver has been successful.

The additional fault tolerance goal has been achieved for local (in the same computer) failures by the background thread, which removes devices which go down for any of various reasons (modem disconnects, the user brings the device down, etc.). Fault tolerance against remote failure when there are disjoint paths has not been implemented. I believe the general case of this variety of failure could only be detected by a TCP-like congestion detection mechanism.