Journal: threading the FreeBSD network stack.
-Alfred Perlstein

---------------------------------------------

Notes:

When I use a lowercase name for someone, that refers to their freefall
login name.

---------------------------------------------

Preface:

I'm writing the journal for several reasons:
  1) to provide a place for notes. Because the network stack is so
     large there are going to be many parts I'll have to skip over;
     I'll note what I've skipped so either I can get back to it or
     someone else can jump in and do it.
  2) to document how the locking systems I'm putting in work
  3) random thoughts as I progress either towards my goal or insanity

I started working on this a day or two after the SMPng commit which
brought FreeBSD mutex primitives and interrupt threads, which was
sometime in the first week of Sept 2000.

I started this journal a couple of days after starting my work so
I will detail a few things that have happened so far:

Initially I wanted to place mutex locks in both the socket and
socketbuffer structures; that proved to be too painful. Instead I use
a lock on the socket and keep the old sleep/flags locking on the
socketbuffer. There isn't a race because the socketbuffer flags are
protected by my socket lock, and the newly added msleep() function
allows me to manipulate the flags and sleep on them safely with my
socket mutex interlocked.

I've gone through a lot of the code replacing manipulation of
statistical counters with atomic_ operations. Some places have many
manipulations (particularly the tcp code); there it may make more
sense to keep a local statistics counter on the stack and do a batched
update of the global statistics structure under a spinlock. Other
alternatives include per-cpu counters, but I've heard many negative
comments about doing stats like that.

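A minimal userland sketch of the batched statistics update described
above. Everything in it is an assumption made for illustration: the
struct and function names are hypothetical, and a pthread mutex stands
in for the kernel spinlock. The point is only that the input routine
touches a stack-local counter block and takes the global lock once per
call instead of once per counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical statistics block; the field names are illustrative only. */
struct sketch_stats {
	uint64_t pkts;
	uint64_t bytes;
	uint64_t badsum;
};

static struct sketch_stats global_stats;
/* Stand-in for the kernel spinlock mentioned in the text. */
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fold one batch of local counts into the globals under one lock hold. */
void
stats_flush(const struct sketch_stats *delta)
{
	assert(delta != NULL);
	pthread_mutex_lock(&stats_lock);
	global_stats.pkts += delta->pkts;
	global_stats.bytes += delta->bytes;
	global_stats.badsum += delta->badsum;
	pthread_mutex_unlock(&stats_lock);
}

/* Hypothetical input routine: counts on the stack, then flushes once. */
void
sketch_input(size_t len, int checksum_ok)
{
	struct sketch_stats local = { 0, 0, 0 };

	local.pkts++;
	local.bytes += len;
	if (!checksum_ok)
		local.badsum++;
	/* ... real protocol processing would go here ... */
	stats_flush(&local);
}

/* Read a consistent copy of the global counters. */
struct sketch_stats
stats_snapshot(void)
{
	struct sketch_stats copy;

	pthread_mutex_lock(&stats_lock);
	copy = global_stats;
	pthread_mutex_unlock(&stats_lock);
	return (copy);
}
```

Calling sketch_input() for each packet and reading the totals with
stats_snapshot() costs one lock acquisition per packet no matter how
many counters the routine bumps along the way.
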
Bosko Milekic was kind enough to MPsafe the mbuf allocator code; we
need to test this. He used await/asleep rather than msleep, which
ought to be checked for validity as the asleep interface was
implemented before SMPng and may not be safe. I'm hoping that Bosko
sticks around to help out, he's got some great programming skill and
there's a lot of code to work on.

I've already decided that my initial goal is going to be getting udp
and tcp4 working; unfortunately that means I'm most likely not working
on:
  BRIDGE, DUMMYNET, INET6, NETATALK, NS, IPX, IPSEC, NETGRAPH

I suspect that they can easily be made mpsafe, but they aren't a
consideration at this point. I just want to get something working
right now and that means userland<-(tcp/udp)->wire MPsafe code. The
good part is that now more than ever developers are active enough to
jump in and fix these. And before I get flamed off the earth: I most
likely will not be committing until the INET6, IPSEC and NETGRAPH
maintainers are comfortable with it.

Malloc is now MPsafe thanks to jasone and jake, which is obviously an
important and key starting point.

I had an interesting discovery the other night: when replacing an spl
with a mutex over a particular structure we must be very careful.
While the spl is raised we can tsleep, which effectively drops the
mutual exclusion. A mutex is not dropped when we sleep, so we must be
wary of this when switching over to mutexes to avoid deadlocks.

A quick (stupid) example: calling a function to wait for data to
arrive on a socket while holding the socket lock and forgetting to
drop the lock before calling it. Normally spl would be dropped the
instant you slept and the network stack could churn along and dump
some data into your socketbuffer. This is no longer the case: the
interrupt must also block against your mutex, and if you screw up you
block waiting for data while the socket is locked against outside
manipulation, including data arrival.

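The deadlock above goes away when the sleep itself releases the lock.
Here is a minimal userland sketch of that pattern, assuming POSIX
threads as a stand-in for the kernel primitives: pthread_cond_wait()
atomically releases the mutex while sleeping and reacquires it before
returning, which is the same interlock idea as msleep(). The struct
and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical socket: one lock, one wait channel, a byte count. */
struct sketch_socket {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	size_t		avail;	/* bytes queued, protected by mtx */
};

void
sketch_sock_init(struct sketch_socket *so)
{
	pthread_mutex_init(&so->mtx, NULL);
	pthread_cond_init(&so->cv, NULL);
	so->avail = 0;
}

/* Input path: must take the socket lock to queue data. */
void
sketch_sock_deliver(struct sketch_socket *so, size_t n)
{
	pthread_mutex_lock(&so->mtx);
	so->avail += n;
	pthread_cond_signal(&so->cv);
	pthread_mutex_unlock(&so->mtx);
}

/*
 * Reader: waits for data.  The mutex is released for the duration of
 * the sleep, so sketch_sock_deliver() can get in and queue data.
 * Sleeping any other way with mtx still held would block the input
 * path forever.
 */
size_t
sketch_sock_wait(struct sketch_socket *so)
{
	size_t n;

	pthread_mutex_lock(&so->mtx);
	while (so->avail == 0)
		pthread_cond_wait(&so->cv, &so->mtx);
	n = so->avail;
	assert(n > 0);	/* the loop only exits once data is queued */
	so->avail = 0;
	pthread_mutex_unlock(&so->mtx);
	return (n);
}
```

The while loop rechecks the condition after every wakeup because the
lock was dropped during the sleep and the state may have changed
again before the reader got it back.
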
So far I think I have a pretty sound system protecting sockets;
there's also some preliminary stuff with routes and pcbs but I need to
work on those more.

I've switched the ucred system to use atomic ops which should make it
mpsafe.

Journal continued at:
  http://people.freebsd.org/~alfred/mpsafe/stackjournal.txt
Work in progress:
  http://people.freebsd.org/~alfred/mpsafe/mpsafestack.diff

Ok, and here begins a time based journal.

----------------------------------------------

Mon Sep 11 10:16:50 PDT 2000

Realized that attempting to thread the tcp_input code before the ether
code was a bad idea. The tcp code uses global variables from the IP
code, which probably uses globals from the ether code, so I'm working
in the wrong direction (or working in a direction that's going to
have me spread out too thin).

I've decided to take this route:
  ether_input->ip_input->tcp/udp_input->
and
  tcp_output->ip_output->ether_output

Mon Sep 11 10:43:55 PDT 2000

skipping work on bpf for the time being.
skipping work on vlan

Sun Sep 17 09:37:17 PDT 2000

I posted this to arch@freebsd.org. jlemon doesn't agree with me that
this is a good idea, but it does need further discussion:

From bright@wintelcom.net Tue Sep 12 02:58:08 2000
Date: Tue, 12 Sep 2000 04:26:54 -0700
From: Alfred Perlstein
To: arch@freebsd.org
Subject: what to do with softinterrupts?
Message-ID: <20000912042654.L12231@fw.wintelcom.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
Status: RO
Content-Length: 2422
Lines: 57

There's a problem with softinterrupts: we have only one.

I'm trying to figure out what to do with the situation, because
unless I can schedule more than one concurrently I can't really
test much of what I'm working on. There's also the matter of making
the input path concurrent and responsive.

After thinking about it for some time I came up with a solution
for network drivers.

The idea is to keep several spare interrupt threads available for
the network interrupts.

Instead of ether_input() figuring out the type of packet and
scheduling a soft interrupt it will:

  Swap out its responsibility with another spare thread to handle
    the interface's hardware interrupt
  Perform a callback to the driver to re-enable interrupts
  Jump directly into what is now a pseudo-soft-interrupt

This will happen as long as it can reserve a spare thread, and
we're not pre-empting a softinterrupt running(*) in the same protocol
on the same CPU. Otherwise it simply queues and schedules like
before.

(*) running == not blocked on a mutex; if it's blocked on a mutex
we allow another jump from hardware to software to gain concurrency.

When the softinterrupt is complete it will then recheck its input
queue for mbufs. If there are none it will return to its state as
an idle interrupt thread, making itself free for consumption to
service other devices or stacks.

From my point of view this will give:

  better concurrency, by making the stack less likely to stall
    because of blocking while accessing potentially locked objects
  less context switching, because there's no need to schedule a
    soft interrupt unless we are seriously bogged down

Poul-Henning Kamp thought that investigating atomic ops and keeping
most of the current scheme of things would also work. We'd have
to have a softinterrupt bound to each CPU, but as long as they
could proceed without blocking we'd probably be ok.

The problem I see with this is that it sort of tries to get around
the whole interrupt thread idea by going back to non-blocking
interrupts instead of taking the advantage (or performance hit)
of the threads system we have in place.

I'm not dead set on either idea, but I wanted to know what other
solutions people could come up with to address this problem. Any
takers? Any existing solutions?

thanks,
-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Sun Sep 17 11:15:09 PDT 2000

Got mail from cp (Chuck Paterson) with some detailed instructions
about how locking ought to be done, via a journal done by David
Borman. The journal explains a lot of the details of the BSD/OS
locking methodology in the network stack and explains how splitting
the socket locks into locks on the individual socket buffers is a
good thing.

I will be switching back to that. The reason I shied away from it was
that I thought it would be a good idea to have a lock on both the
socket itself as well as the socketbuffers; David's way is to alias
the "entire socket" lock onto the receive sockbuffer lock. I like
this and think switching to it may be beneficial.

He also suggests using macros for the socket locks. I'm doing this
just to make the code easier, but I don't necessarily agree with it
as it adds yet another level of indirection for someone who later
examines the code to figure out. ("is it a spinlock? a sleep lock?")

Sun Sep 17 11:49:08 PDT 2000

BSD/OS doesn't call most of the protocol routines with the socket
passed in locked. This seems to be a bad idea because a lot of the
BSD/OS code looks like this:

  /* socket not locked */
  so->pru()    /* locks and unlocks the socket */
  lock socket
  do something
  unlock socket

I'm investigating locking the socket first and then calling the pru
routine.

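To make the two calling conventions concrete, here is a small
userland sketch. It is not BSD/OS or FreeBSD code: the struct, the
function names and the lock_acquisitions counter are all hypothetical
and exist only to show the difference, with a pthread mutex standing
in for the socket lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct pru_socket;
typedef void (*pru_fn)(struct pru_socket *);

/* Hypothetical socket with instrumentation for the sketch. */
struct pru_socket {
	pthread_mutex_t	lock;
	int		locked;			/* 1 while lock is held */
	int		lock_acquisitions;	/* counts lock calls */
	int		state;			/* data the lock protects */
	pru_fn		pru;			/* protocol routine */
};

void
pru_sock_init(struct pru_socket *so, pru_fn pru)
{
	pthread_mutex_init(&so->lock, NULL);
	so->locked = 0;
	so->lock_acquisitions = 0;
	so->state = 0;
	so->pru = pru;
}

void
pru_sock_lock(struct pru_socket *so)
{
	pthread_mutex_lock(&so->lock);
	so->locked = 1;
	so->lock_acquisitions++;
}

void
pru_sock_unlock(struct pru_socket *so)
{
	so->locked = 0;
	pthread_mutex_unlock(&so->lock);
}

/* Convention A: the protocol routine locks and unlocks by itself. */
void
proto_op_self_locking(struct pru_socket *so)
{
	pru_sock_lock(so);
	so->state++;
	pru_sock_unlock(so);
}

/* Convention B: the caller must already hold the socket lock. */
void
proto_op_caller_locked(struct pru_socket *so)
{
	assert(so->locked);
	so->state++;
}

/* Convention A call path: two lock round trips, with a gap between. */
void
path_callee_locks(struct pru_socket *so)
{
	so->pru(so);		/* locks and unlocks the socket */
	pru_sock_lock(so);	/* state may have changed in the gap */
	so->state++;		/* "do something" */
	pru_sock_unlock(so);
}

/* Convention B call path: lock once, call in, finish, unlock. */
void
path_caller_locks(struct pru_socket *so)
{
	pru_sock_lock(so);
	so->pru(so);
	so->state++;		/* "do something" */
	pru_sock_unlock(so);
}
```

With a socket set up for each convention, path_callee_locks() takes
the lock twice for the same work that path_caller_locks() does under
a single hold, and only the second keeps the socket state stable
between the protocol routine and the caller's own work.
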