Table of Contents

    Overview
    Current Status
    Download
    Frequently Asked Questions

Overview

Thanks to the efforts of a number of people, zero copy sockets and NFS patches are available for FreeBSD-current at the URL listed below. Please note that the patches below are out of date, and are only for people who want to patch against older versions of -current. The code was checked into FreeBSD-current on June 25th, 2002, so the latest zero copy code can now be found in the FreeBSD-current tree.

These patches include the zero copy send and receive code for sockets, along with the Alteon firmware header splitting and debugging code.

At the moment the NFS patches aren't a part of this patchset. The original zero copy NFS patches, which eliminated a copy from the kernel to userland, used the vfs_ioopt code that Matt Dillon says is broken.

Drew Gallatin wrote a different set of NFS speedups that eliminate a copy from a struct uio to a struct mbuf. I have the patches, but right now I'm concentrating on getting the sockets code done before I try to speed up the NFS code.

The Alteon firmware header splitting and debugging code was written for Pluto Technologies (www.plutotech.com), which kindly agreed to let me release the code.

Current Status

The zero copy sockets code was checked into the FreeBSD-current tree on June 25th, 2002.

Many thanks to all the people who tested the code and provided feedback over the years.

So if you want the latest version of the code, CVSup FreeBSD-current.

The following changes went into the tree between the June 23rd, 2002 snapshot and the commit to -current on June 25th, 2002:

There is a link to the final set of patches below. The June 25th, 2002 patchset is what was applied to -current, except for a $FreeBSD$ tag in ti_fw2.h that I hand-edited in at the last minute.

The following fixes went into the June 23rd, 2002 snapshot:

This snapshot hasn't yet gone through the normal set of regression tests, though hopefully it will in the next day or so.

Barring any complaints, I'm planning on checking the zero copy code into the tree on Tuesday evening.

The following fixes went into the June 20th, 2002 snapshot:

The following fixes went into the June 18th, 2002 snapshot:

With those fixes, plus several fixes that have gone into -current over the past week or so, the zero copy sockets code runs without any WITNESS warnings at all.

The following fixes went into the June 9th, 2002 snapshot:

Thanks to Alfred Perlstein for pointing out these problems.

Please note that the code will currently spew out TONS of warnings if you have WITNESS enabled. I am working on fixing these, so if you don't want to help debug that stuff, make sure you disable WITNESS for now.

The following fixes went into the May 17th, 2002 snapshot:

The following fixes went into the May 4th, 2002 snapshot:

The following fixes went into the November 29th, 2000 snapshot:

The following fixes went into the November 20th, 2000 snapshot:

The following fixes went into the November 14th, 2000 snapshot:

The following fixes went into the November 2nd, 2000 snapshot:

The following fixes went into the September 5th, 2000 snapshot:

The following fixes went into the August 4th, 2000 snapshot:

The following fixes went into the July 8th, 2000 snapshot:

A couple of things have also been added to the benchmarks section of this web page, below.

Download

The zero copy sockets code that went into the FreeBSD-current tree on June 25th, 2002, was identical to this patch, except for a missing $FreeBSD$ in ti_fw2.h:

zero_copy.diffs.20020625

The June 23rd, 2002 snapshot, which is based on -current from June 23rd, 2002, is located here:

zero_copy.diffs.20020623

The June 20th, 2002 snapshot, which is based on -current from June 20th, 2002, is located here:

zero_copy.diffs.20020620

The June 18th, 2002 snapshot, which is based on -current from June 18th, 2002, is located here:

zero_copy.diffs.20020618

The June 9th, 2002 snapshot, which is based on -current from June 9th, 2002, is located here:

zero_copy.diffs.20020609

The May 17th, 2002 snapshot, which is based on -current from May 17th, 2002, is located here:

zero_copy.diffs.20020517

The May 4th, 2002 snapshot, which is based on -current from May 3rd, 2002, is located here:

zero_copy.diffs.20020503

The November 29th, 2000 snapshot, which is based on -current from early in the day on November 28th, is located here:

zero_copy.diffs.20001128

The November 20th, 2000 snapshot, which is based on -current from early in the day on November 20th, is located here:

zero_copy.diffs.20001120

The November 14th, 2000 snapshot, which is based on -current from early in the day on November 14th, is located here:

zero_copy.diffs.20001114

The November 2nd, 2000 snapshot, which is based on -current from early in the day on October 30th, is located here:

zero_copy.diffs.20001030

The September 5th, 2000 snapshot, which is based on -current from early in the day on September 5th, is located here:

zero_copy.diffs.20000905

The August 4th, 2000 snapshot, which is based on -current from early in the day on August 3rd, 2000, is located here:

zero_copy.diffs.20000803

The July 8th, 2000 snapshot, which is based on -current from early in the day on July 8th, 2000, is located here:

zero_copy.diffs.20000708

The June 13th, 2000 snapshot is located here:

zero_copy.diffs.20000613

The patches above are based on -current from early in the day on June 13th, 2000, i.e. before Peter's config changes.

Frequently Asked Questions:

  1. Known Problems.
  2. What is "zero copy"?
  3. How does zero copy work?
  4. What hardware does it work with?
  5. Configuration and performance tuning.
  6. Benchmarks.
  7. References.
  8. Possible future directions.
  1. Known Problems:

    There are no known problems, although bug reports and feedback are welcome.

  2. What is "zero copy"?

    Zero copy is a misnomer, or an accurate description, depending on how you look at things.

    In the normal case, with network I/O, buffers are copied from the user process into the kernel on the send side, and from the kernel into the user process on the receiving side.

    That is the copy that is being eliminated in this case. The DMA or copy from the kernel into the NIC, or from the NIC into the kernel, is not being eliminated. In fact, you can't eliminate that copy without taking packet processing out of the kernel altogether (i.e. the kernel has to see the packet headers in order to determine what to do with the payload).

    Memory copies from userland into the kernel are one of the largest bottlenecks in network performance on a BSD system, so eliminating them can greatly increase network throughput, and decrease system load when CPU or memory bandwidth isn't the limiting factor.

  3. How does zero copy work?

    The send side and receive side zero copy code work in different ways:

    The send side code takes the pages that the userland program writes to a socket, puts a COW (Copy On Write) mapping on each page, and stuffs them into mbufs. The data the user program writes must be page sized and must start on a page boundary in order to go through the zero copy send code.

    If the userland program doesn't write to the page before it has been sent out on the wire and the mbuf freed (and therefore the COW mapping revoked), the page will not be copied. For TCP, the mbuf isn't freed until the packet is acknowledged by the receiver.

    So send side zero copy is only better than the standard case, where userland buffers are copied into kernel buffers, if the userland program doesn't immediately reuse the buffer.
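
    To make that concrete, here is a minimal sketch (not taken from the patches) of a sender that cooperates with the zero copy send path: the buffers are page sized and page aligned, courtesy of mmap(), and a small ring of buffers is cycled so that no buffer is reused immediately after it is written to the socket. The buffer count, the function name, and the lack of error checking are all purely illustrative:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NBUFS   8               /* illustrative: a small ring of buffers */

    void
    send_file_pages(int sock, int filefd)
    {
            size_t pagesize = (size_t)getpagesize();
            char *bufs[NBUFS];
            ssize_t n;
            int i;

            /* mmap() returns page aligned memory; error checking omitted. */
            for (i = 0; i < NBUFS; i++)
                    bufs[i] = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                        MAP_ANON | MAP_PRIVATE, -1, 0);

            for (i = 0; ; i = (i + 1) % NBUFS) {
                    /* Read a full page from the file... */
                    n = read(filefd, bufs[i], pagesize);
                    if (n <= 0)
                            break;
                    /*
                     * ...and write it to the socket.  Because the buffer
                     * is page sized, page aligned, and not touched again
                     * for a while, the kernel can map it COW into an mbuf
                     * instead of copying it.
                     */
                    write(sock, bufs[i], (size_t)n);
            }

            for (i = 0; i < NBUFS; i++)
                    munmap(bufs[i], pagesize);
    }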

    Receive side zero copy works in a slightly different manner, and depends in part on the capabilities of the network card in question.

    One requirement for zero copy receive to work is that the chunks of data passed up the network and socket layers have to be at least page sized, and have to be aligned on page boundaries. This pretty much means that the card has to have an MTU of at least 4K (or 8K in the case of the Alpha). Gigabit Ethernet cards using Jumbo Frames (9000 byte MTU) fall into this category. More on that below.

    Another requirement for zero copy receive to work is that the NIC driver needs to allocate receive side pages from a "disposable" pool. This means allocating memory apart from the normal mbuf memory, and attaching it as an external buffer to the mbuf.

    It also helps if the NIC can receive packets into multiple buffers, and if the NIC can separate the ethernet, IP, and TCP or UDP headers from the payload. The idea is to get the packet payload into one or more page-sized, page-aligned buffers.

    The NIC driver receives data into these buffers allocated from the disposable pool. The mbuf with these buffers attached is then passed up the network stack, where the headers are removed. Finally it reaches the socket layer and waits for the user to read it. Once the user reads the data, the kernel page is substituted for the user's page, and the user's original page is recycled. This is otherwise known as "page flipping".

    The page flip can only occur if both the userland buffer and kernel buffer are page aligned, and if there is at least a page worth of data in the source and destination. Otherwise the data will be copied out using copyout() in the normal manner.
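
    Here is a similarly minimal, purely illustrative sketch of a receiver that keeps the page flip possible: the destination buffer is page aligned (again via mmap()) and each read asks for several pages at a time, so that whenever at least a page of data is queued on the socket it can be flipped into the buffer rather than copied out. Error checking is again omitted:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void
    recv_pages(int sock)
    {
            size_t pagesize = (size_t)getpagesize();
            size_t buflen = 16 * pagesize;  /* page aligned, several pages long */
            char *buf;
            ssize_t n;

            buf = mmap(NULL, buflen, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);

            for (;;) {
                    /*
                     * When at least a page of properly aligned data is
                     * queued on the socket, the kernel can substitute
                     * those pages for the pages in buf ("page flipping")
                     * instead of calling copyout().
                     */
                    n = read(sock, buf, buflen);
                    if (n <= 0)
                            break;
                    /* ... process n bytes of data in buf ... */
            }

            munmap(buf, buflen);
    }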

  4. What hardware does it work with?

    The send side zero copy code should work with most any network adapter.

    The receive side code, however, requires an adapter with an MTU of at least a page, due to the alignment restrictions for page substitution (or "page flipping").

    The Alteon firmware debugging code requires an Alteon Tigon II board. If you want the patches to the userland tools and Tigon firmware to debug it and make it compile under FreeBSD, contact ken@FreeBSD.ORG.

  5. Configuration and performance tuning.

    There are a number of options that need to be turned on for various things to work:

    options         ZERO_COPY_SOCKETS        # Turn on zero copy send code
    options         MCLSHIFT=12              # page sized mbuf clusters (i386)
    options         TI_JUMBO_HDRSPLIT        # Turn on Tigon header splitting
    

    I would also recommend turning off WITNESS, as well as SMP, if you want to get a good idea of the performance impact of this code.

    To get the maximum performance out of the code, here are some suggestions on various sysctl and other parameters. These assume you've got an Alteon-based board, so if you're using something else, you may want to experiment and find the optimum values for some of them:

            sysctl -w net.inet.tcp.rfc1323=1
    
            sysctl -w kern.ipc.maxsockbuf=2097152
            sysctl -w net.inet.tcp.sendspace=524288
            sysctl -w net.inet.tcp.recvspace=524288
    

    A send window of 512K seems to work well with 1MB Tigon boards, and a send window of 256K seems to work well with 512K Tigon boards. Again, you may want to experiment to find the best settings for your hardware.

            sysctl -w net.inet.udp.recvspace=65535
            sysctl -w net.inet.udp.maxdgram=57344
    

  6. Benchmarks

    One nice benchmark is netperf (www.netperf.org), which is in the benchmarks subdirectory of the ports tree.

    Netperf isn't exactly a real world benchmark, since it sends page aligned data that is a multiple of the page size. It is good for trying to determine maximum throughput.

    Another benchmark to try is nttcp, which is in ports/net.

    Here are some netperf numbers for TCP and UDP throughput between two Pentium II 350's with 128MB RAM and 1MB Alteon ACEnics:

    # ./netperf -H 10.0.0.1
    TCP STREAM TEST to 10.0.0.1 : histogram
    Recv   Send    Send                          
    Socket Socket  Message  Elapsed              
    Size   Size    Size     Time     Throughput  
    bytes  bytes   bytes    secs.    10^6bits/sec  
    
    524288 524288 524288    10.01     742.46   
    

    # ./netperf -t UDP_STREAM -H 10.0.0.1 -- -m 8192
    UDP UNIDIRECTIONAL SEND TEST to 10.0.0.1 : histogram
    Socket  Message  Elapsed      Messages                
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec
    
     57344    8192   10.01      140396 585086     919.34
     65535           10.01       93525            612.42
    

    As you can see, the TCP performance is 742Mbits/sec, or about 93MBytes/sec.

    Drew Gallatin has achieved much higher performance with faster hardware:

    This is between 2 Dell PowerEdge 4400 servers using prototype 64-bit, 66MHz PCI Myricom Lanai-9 NICs with a 2.56Gb/sec link speed. The MTU was 32828 bytes. They're both uniprocessor 733MHz Xeons running a heavily patched 4.0-RELEASE & my zero-copy code in conjunction with Duke's Trapeze software (driver & firmware) for Myrinet adapters. The receiver is offloading checksums and is 60% idle; the sender is calculating checksums and is pegged at 100% CPU.

    <9:12am>wrath/gallatin:~>netperf -Hsloth-my
    TCP STREAM TEST to sloth-my : histogram
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec
     
    524288 524288 524288    10.00    1764.50
    

    This is between two Dell PowerEdge 2400's, each with a 500MHz Pentium III and an Alteon ACEnic on a 64-bit, 33MHz PCI bus. (The traffic went through an Alteon 180 switch.)

    <3:19pm>eggs/gallatin:common>netperf -Hham-ge
    TCP STREAM TEST to ham-ge : histogram
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec
     
    524288 524288 524288    10.01     986.51
    

  7. References

    There are a number of papers from Duke's Trapeze project that reference Drew Gallatin's zero copy code:

    http://www.cs.duke.edu/ari/publications/publications.html

    In particular, this paper, entitled "End-System Optimizations for High-Speed TCP", by Jeff Chase, Andrew Gallatin and Ken Yocum, includes some performance graphs for Drew's zero copy code (which is available in the diffs referenced above), and a good overview of a number of optimizations that can be used to increase TCP performance:

    http://www.cs.duke.edu/ari/publications/end-system.pdf

    This RDMA over IP problem statement makes a good case for the need for an RDMA framework for TCP.

    http://search.ietf.org/internet-drafts/draft-romanow-rdma-over-ip-problem-statement-00.txt

  8. Possible future directions.

    Send side zero copy:

    One of the obvious problems with the current send side approach is that it only works if the userland application doesn't immediately reuse the buffer.

    In the case of many system applications, though, the application will reuse the buffer immediately, and therefore performance will be no better than the standard case. Many common applications (like ftp) have been written with the current system buffer usage in mind, so they function like this:

        /* filefd is the file on disk, sockfd is the socket, y is the buffer */
        while ((x = read(filefd, y, sizeof(y))) > 0) {
                /* write x bytes from buffer y into the socket */
                write(sockfd, y, x);
        }
    

    That makes sense if the kernel is only going to copy the data, but it doesn't make sense in the zero copy case, where the buffer has to be left untouched until the data has actually gone out on the wire (and, for TCP, been acknowledged).

    Another problem with the current send side approach is that it requires page sized and page aligned data in order to apply the COW mapping. Not all data sets fit this requirement.

    One way to address both of the above problems is to implement an alternate zero copy send scheme that uses async I/O. With async I/O semantics, it will be clear to the userland program that the buffer in question is not to be used until it is returned from the kernel.

    So with that approach, you eliminate the need to map the data copy-on-write, and therefore also eliminate the need for the data to be page sized and page aligned.
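
    The current patches don't implement such a scheme; the sketch below only illustrates the buffer ownership semantics, using the existing POSIX aio interface (aio_write(), aio_error(), aio_return()). The program hands its buffer to the kernel and does not touch it again until the kernel reports completion:

    #include <sys/types.h>
    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    void
    async_send(int sock, char *buf, size_t len)
    {
            struct aiocb cb;

            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = sock;
            cb.aio_buf = buf;
            cb.aio_nbytes = len;

            aio_write(&cb);

            /*
             * The buffer now belongs to the kernel; the program has
             * promised not to touch it until the operation completes,
             * so no COW mapping (and no page alignment) is needed.
             */
            while (aio_error(&cb) == EINPROGRESS)
                    usleep(1000);           /* or go do other useful work */

            (void)aio_return(&cb);          /* buf may be reused from here on */
    }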

    Receive side zero copy:

    The main issue with the current receive side zero copy code is the size and alignment restrictions.

    One way to get around the restriction would be to do something similar to a page flip on buffers that are smaller than a page.

    Another way to get around the restriction is to have the receiving client pass buffers into the kernel (perhaps with an async I/O type interface) and have the NIC DMA the data directly into the buffers the user has supplied.
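
    From userland, that could look much like the aio_write() example above, with aio_read() in its place: the application posts its own buffer and leaves it alone until the kernel reports completion. The current code does not DMA into user-supplied buffers; this purely illustrative sketch only shows the interface semantics:

    #include <sys/types.h>
    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    ssize_t
    async_recv(int sock, char *buf, size_t len)
    {
            struct aiocb cb;

            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = sock;
            cb.aio_buf = buf;
            cb.aio_nbytes = len;

            aio_read(&cb);

            /* The kernel owns buf until the request completes. */
            while (aio_error(&cb) == EINPROGRESS)
                    usleep(1000);

            return (aio_return(&cb));
    }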

    One proposal for doing this is called RDMA. There is a problem statement here that makes a good case for the need for an RDMA framework for TCP:

    http://search.ietf.org/internet-drafts/draft-romanow-rdma-over-ip-problem-statement-00.txt

    There used to be a draft standard for RDMA extensions to TCP floating around, but I haven't been able to locate it lately.

    Essentially RDMA allows for the sender and receiver to negotiate destination buffer locations on the receiver. The sender then includes the buffer locations in a TCP header option, and the NIC can then extract the destination location for the payload and DMA it to the appropriate place.

    One drawback to this approach is that it requires support for RDMA on both ends of the connection.