Fast PC Routers
What’s this all about?
We’re building IP routers out of PCs as tools for research and experimental network development. The goal of this work is to come up with an IP router platform which is completely open to bending, twisting, reprogramming, and the like, and yet has sufficient performance to be useful for experiments in the 1990s.
Open routing platforms are an important tool for researchers developing new protocols and architectures. The DARTnet testbed, a precursor of this work, used routers built from Sun Sparcstations to catalyze the development of IP Multicast, RSVP, and the MBONE conferencing tools. We hope our project will help do the same for Mobile IP, Scalable Reliable Multicast, and the as yet unknown technologies of tomorrow’s Internet.
We’re using these routers to support our own work on IP Integrated Services QoS management, new security models, and Internet service discrimination and pricing. We also expect this or a derivative design to become the base IP router for the nationwide CAIRN testbed, currently being deployed.
If you want decent performance from a PC router, you must plan to use current high-end hardware. After years of stagnation, the demands of multimedia and high-speed peripheral devices are finally driving PC overall performance up, in contrast with the marketing hype which previously put ever-faster processors on the same old memory and I/O subsystems. Even better, the performance emphasis is moving away from large-block (disk) I/O and towards short-transaction (graphics, networks, audio and video) I/O. The downside is that this development is happening today, and yesterday’s (almost literally!) machines are noticeably off the pace. Our blurb on PC Hardware for Network Researchers will give you some information about the equipment we’re using at MIT.
Here’s the basic PC design with a processor, memory, and, in this case, two PCI I/O buses. A router constructed from this hardware exhibits several properties. First, the processor must control all aspects of the router’s operation, both executing the forwarding loop and performing overhead functions such as routing and management protocols. This introduces two performance slowdowns. Not only must the CPU break away from the forwarding loop to execute overhead code, but executing that code will almost certainly evict the loop instructions and routing table from the CPU’s cache. On a typical PC this is a crucial problem, because the main memory subsystem is not terribly fast.
Another point of interest is that today’s PC designs provide enough main memory bandwidth to run two PCI buses and a processor at full rate simultaneously. This is encouraging, because the primary limit on PC router performance appears to be neither CPU speed nor I/O bandwidth but PCI bus arbitration time. Having two buses available cuts this arbitration bottleneck in half.
The use of two buses does introduce one drawback. Since PCI is a multiple-master bus, appropriately designed interface cards and router software can DMA packet data directly from one interface to another on the same bus, without ever touching system memory, halving the data transfer cost. This technique cannot cross between the two PCI buses, but the router can still use it between interfaces that share a bus.
We use this configuration, with one or two PCI buses, for code development and where variable performance is acceptable.
Many performance limitations of the basic design can be avoided by adding another processor. With the advent of Intel’s Multiprocessor PC specification, two-processor machines are becoming common.
Originally designed for symmetric multiprocessing, these machines are easily subverted to our needs.
A simple first step is to separate the IP input and forwarding code from the rest of the system and run it on the second processor. With this design, packet routing performance is not subject to slowdown or variation when routing protocols or other user applications execute. This alone gives a large improvement.
Further improvements are possible. With separate, large Level 2 caches for each processor, and correct memory layout and cache control policy, it is possible to ensure that the forwarding processor’s L2 cache is used solely for the forwarding code and a routing table cache. The design of the hardware interrupt controller in these machines allows the forwarding processor to handle network interface interrupts while the control processor handles all others. In rare cases it may be useful for the control processor to handle the initial interrupt at high priority, and have it present the packet to the forwarding processor in a simple, low-overhead way.
We’re currently implementing these ideas, and hope to have measured performance results shortly.
A key aspect of these designs, particularly the multiprocessor versions, is that CPU instruction count does not appear to be the limiting performance bottleneck. In almost all circumstances, memory bandwidth or PCI bus arbitration overhead restrict performance first. This is important for our experimental goals, because it suggests that some additional complexity, such as different queueing algorithms, can be added to the forwarding loop without seriously affecting forwarding throughput.
Our routers currently support a number of interface types, including 100Mb Ethernet, 155Mb ATM, and T1 long-lines. We wouldn’t mind finding PCI-bus raw (not ATM) DS3 and SONET/OC3 interface cards, if you happen to know of any. Again, you can see our discussion of PC Hardware for Network Researchers for some specifics.
The intent of our work is to build routers which combine a general programming environment (for authors of routing protocols, network management tools, and the like) with a tuned low-overhead IP forwarding path (for folks interested in packet scheduling algorithms and related technology). Our current development work is based in FreeBSD, a publicly available BSD-Unix derivative with non-restrictive copyrights. Ultimately, we hope the community will create CAIRN Networking Snapshots: vendor-neutral code releases which encompass and make available current research results.
We’re changing the standard FreeBSD networking code in several ways to add functionality and performance. Here’s a brief list of work in progress. If you think you’ve heard of these things before, you’re probably right. Much of what we’re doing is not new to the research community, although it is hard to come by a package which brings all of these threads together.
Fast IP forwarding path
We’ve changed the basic BSD IP design from one favoring host functions to one favoring routing functions, and implemented a plausibly fast forwarding path. See Performance and Limitations for more details.
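The details live in the kernel source, but the flavor of the per-packet fast path can be sketched in a few lines. The sketch below shows only the TTL-and-checksum step, using the standard incremental-update rule from RFC 1624 rather than recomputing the checksum over the whole header; the function and buffer layout are illustrative, not our actual code.

```c
#include <stdint.h>
#include <stddef.h>

/* One's-complement checksum over an IPv4 header, as in in_cksum().
 * h points at the header bytes in network (wire) order. */
static uint16_t ip_cksum(const uint8_t *h, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)h[i] << 8 | h[i + 1];
    while (sum >> 16)                       /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Per-hop fixup on the fast path: decrement the TTL (byte 8) and patch
 * the checksum (bytes 10-11) incrementally, RFC 1624 style:
 * HC' = ~(~HC + ~m + m'), where m is the 16-bit word that changed.
 * Returns 0 on success, -1 if the TTL would expire (slow path: ICMP). */
int forward_fixup(uint8_t *h)
{
    if (h[8] <= 1)
        return -1;

    uint16_t old_word = (uint16_t)(h[8] << 8 | h[9]);   /* TTL|proto */
    h[8]--;
    uint16_t new_word = (uint16_t)(h[8] << 8 | h[9]);

    uint32_t sum = (uint16_t)~((h[10] << 8) | h[11]);   /* ~HC */
    sum += (uint16_t)~old_word;                          /* + ~m  */
    sum += new_word;                                     /* + m'  */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    uint16_t ck = (uint16_t)~sum;
    h[10] = ck >> 8;
    h[11] = ck & 0xff;
    return 0;
}
```

The incremental form touches two header bytes instead of twenty, which matters when the budget is a few hundred cycles per packet.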
Interface scheduling and overhead control
We’re implementing a new device driver interface which eliminates most of the interrupt overhead and more explicitly allocates CPU time to different processing requirements. This improves short-packet performance and eliminates the possibility of excessive traffic “livelocking” the router.
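In outline (this is an illustration of the scheduling idea, not the actual driver code), each interface’s receive ring is drained by a polling routine that does a bounded amount of work per pass, so input traffic can never monopolize the CPU:

```c
/* Sketch of bounded-work polling: process at most RX_QUOTA packets per
 * pass over the receive ring, then return so other work can run.
 * Excess traffic waits in the ring (and is eventually dropped by the
 * hardware) instead of livelocking the CPU in interrupt context.
 * All names here are illustrative. */

#define RX_QUOTA 16

struct rx_ring {
    int head, tail, size;       /* indices into a descriptor ring */
};

/* Packets currently pending in the ring. */
static int ring_count(const struct rx_ring *r)
{
    return (r->tail - r->head + r->size) % r->size;
}

/* Trivial stand-in for the forwarding path, for demonstration. */
static int handled;
static void count_pkt(int slot) { (void)slot; handled++; }

/* Drain up to RX_QUOTA packets; return how many were handled.
 * The caller re-schedules the poll if work remains. */
int rx_poll(struct rx_ring *r, void (*handle_pkt)(int slot))
{
    int done = 0;

    while (done < RX_QUOTA && r->head != r->tail) {
        handle_pkt(r->head);
        r->head = (r->head + 1) % r->size;
        done++;
    }
    return done;
}
```

The quota is the knob that allocates CPU time explicitly: raising it favors forwarding throughput, lowering it favors routing protocols and other overhead work.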
Separation of the core routing and support functions
We’re cleanly separating the core routing function from other “kernel” functions such as routing table maintenance and disk I/O. This is required to support the dual-processor model described in Hardware.
Dynamic-compilation packet classifier
We’re experimenting with a packet classifier (the portion of the router which examines incoming packets to determine their route, QoS handling, and other processing requirements) based on run-time generation of machine code which directly implements the hash/search filters.
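By way of contrast, the table-driven lookup that run-time code generation is intended to beat looks roughly like the following (the names, entry layout, and hash constant are all illustrative); the generated version would instead inline the installed filters directly as straight-line machine code:

```c
#include <stdint.h>

/* Baseline sketch of a hash-based classifier. A real classifier matches
 * the full filter language; this one keys on the IPv4 destination only,
 * in a small direct-mapped table. */

#define CLS_BUCKETS 256

struct cls_entry {
    uint32_t dst;           /* destination address (key) */
    int      route;         /* opaque forwarding decision */
    int      valid;
};

static struct cls_entry cls_tab[CLS_BUCKETS];

static unsigned cls_hash(uint32_t dst)
{
    /* cheap multiplicative hash; a generated classifier would fold a
     * constant like this into the emitted instructions */
    return (dst * 2654435761u) >> 24;
}

/* Install a forwarding decision for dst (evicting any collision). */
void cls_install(uint32_t dst, int route)
{
    struct cls_entry *e = &cls_tab[cls_hash(dst)];
    e->dst = dst;
    e->route = route;
    e->valid = 1;
}

/* Return the cached decision, or -1 to punt to the slow path. */
int cls_lookup(uint32_t dst)
{
    struct cls_entry *e = &cls_tab[cls_hash(dst)];
    if (e->valid && e->dst == dst)
        return e->route;
    return -1;
}
```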
Detailed hardware performance metering
We’ve provided a low-level interface to the processor’s performance meters. This allows the sufficiently dedicated individual to instrument all aspects of code performance, including cache and TLB behavior, memory access patterns, and the like. We’re currently looking at ways to interpret and present this information which quickly and effectively identify performance bottlenecks.
KTG (kernel traffic generator)
We’ve implemented a simple in-kernel traffic generator which can source and sink small packets at hardware line rates with accurate timing.
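The timing arithmetic inside such a generator is worth spelling out. This hypothetical sketch computes each packet’s send time as an absolute deadline from the start of the run, so rounding error cannot accumulate over millions of packets the way it would if a relative gap were added after each send:

```c
#include <stdint.h>

/* Illustrative pacing arithmetic for a constant-rate packet source.
 * Times are in nanoseconds; function names are ours, not KTG's. */

/* Nominal gap between consecutive packets at a given rate. */
uint64_t pkt_gap_ns(uint64_t pkts_per_sec)
{
    return 1000000000ull / pkts_per_sec;
}

/* Absolute deadline for packet n of a run that started at t0.
 * Deriving each deadline from (t0, n) keeps the long-run rate exact;
 * repeatedly adding the (truncated) gap would drift. */
uint64_t pkt_deadline(uint64_t t0, uint64_t n, uint64_t pkts_per_sec)
{
    return t0 + (n * 1000000000ull) / pkts_per_sec;
}
```

The send loop then spins (or sleeps) until the current clock reaches the deadline for packet n, transmits, and increments n.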
ATM (Raw, AAL5, Classical IP over ATM)
Support for ATM data processing and signalling functions is being developed jointly with our colleagues at BBN and ISI-EAST.
Sometime soon we’ll have links to performance measurement graphs here. In the meantime, here are some early findings. See Limitations for a discussion of why these routers sometimes don’t work well at all.
The basic IP forwarding loop, including buffer management but no interface overhead, currently runs at a few hundred thousand packets per second on a 150 MHz Pentium Pro. There is still some room for improvement; on a 200 MHz processor a basic rate above 500,000 packets per second appears to be achievable. This number will be somewhat lower for traffic loads with a high percentage of multicast packets.
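A back-of-the-envelope check on those figures (our arithmetic, assuming the forwarding processor’s full clock is available to the loop):

```c
/* Cycle budget implied by the numbers above: a 200 MHz clock divided
 * by 500,000 packets/s leaves 400 cycles of work per forwarded packet.
 * This is arithmetic, not a measurement. */
enum {
    CPU_HZ         = 200 * 1000 * 1000,
    TARGET_PPS     = 500 * 1000,
    CYCLES_PER_PKT = CPU_HZ / TARGET_PPS
};
```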
In practice, again with the 150 MHz Pentium Pro, we’ve measured forwarding rates of approximately 80,000 packets per second between 100Base-TX Fast Ethernet segments. The limiting factor in this experiment was the input interface’s inability to receive packets at a higher rate: once a packet made it through the input interface to the router, it was forwarded successfully virtually 100% of the time. At present, we don’t know whether this limit is a fundamental restriction of the interface chip or is caused by PCI bus arbitration overhead.
There are two significant limitations on the applicability of these routers. They are less of a problem in our laboratory and testbed environments than they might be in other situations.
The first limitation is that PC routers are inherently low-fanout devices. Inexpensive desktop PC motherboards have at most four PCI slots, which means a limit of four high-speed interfaces. More expensive server motherboards can be obtained with two PCI controllers offering six or eight PCI slots, at which point main memory bandwidth becomes a significant concern. Realistically, six high-speed interfaces may be the workable maximum.
The second limitation is that these are route-caching routers, and when performance is an issue the size of the route table cache is effectively bounded by the size of the L2 hardware cache. This is not a problem in a testbed environment, where routes to at most a few thousand destinations might be expected. In circumstances where the count of active destinations exceeds this range, our current code will experience a sharp performance drop.
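To see why the bound bites, here is the sizing arithmetic under stated (and purely illustrative) assumptions about cache and entry sizes; the real numbers depend on the processor and on how the forwarding code competes for the same cache:

```c
/* Illustrative sizing only: all three constants are assumptions,
 * not measurements of our router. */
#define L2_BYTES    (512 * 1024)    /* assumed L2 cache size */
#define CODE_BYTES  (128 * 1024)    /* assumed forwarding-path footprint */
#define ENTRY_BYTES 32              /* assumed size of one cached route */

/* Slots left for the route cache once the code's share is spent.
 * Direct-mapped conflicts make the effective capacity lower still,
 * so "a few thousand hot destinations" is the practical ceiling. */
enum { ROUTE_CACHE_SLOTS = (L2_BYTES - CODE_BYTES) / ENTRY_BYTES };
```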
We’ve done a (very) little bit of thinking about how to support the use of our routers at sites that don’t care to get involved in the grunge of building and installing PC software, and about how to support the use of these routers locked away in closets. Here are some talking points:
We’re using a modified version of the standard FreeBSD network install program. This allows you to boot a machine connected to the network from a 3.5″ floppy disk. The software will then download, install, and partially configure the OS and user programs from the net. The install program offers a choice of several user-level code collections to tailor machines to specific requirements. Currently we offer only two, “router” and “developer”.
Sometimes it will be necessary for a CAIRN infrastructure router to operate on an unattended basis. We’ve identified several little things which can help. Our routers can be remotely power-cycled and rebooted using a separate box which is controlled either over the Internet (with password-protected telnet) or via a dialup phone line. A running machine can use a remote serial console, rather than the standard PC VGA adapter. A serial line can be used for remote source-level kernel debugging, if desired.