Journal: threading the FreeBSD network stack.
                   -Alfred Perlstein
---------------------------------------------
Notes:
  When I use a lowercase name for someone, that refers
to their freefall login name.
---------------------------------------------
Preface:
I'm writing the journal for several reasons:
1) to provide a place for notes: because the network
   stack is so large there are going to be many parts I'm
   going to have to skip over.  I'll note what I've skipped
   over so either I can get back to it or someone else can
   jump in and do it.
2) document how the locking systems I'm putting in work
3) random thoughts as I progress either towards my goal or insanity

I started working on this a day or two after the SMPng commit which
brought FreeBSD mutex primitives and interrupt threads, which was
sometime in the first week of Sept 2000.

I started this journal a couple of days after starting my work so
I will detail a few things that have happened so far:

Initially I wanted to place mutex locks in both the socket and
socketbuffer structures.  That proved to be too painful, so instead
I use a lock on the socket and keep the old sleep/flags locking on
the socketbuffer.  There isn't a race because the socketbuffer flags
are protected by my socket lock, and the newly added msleep() function
allows me to manipulate the flags and sleep on them safely with
my socket mutex interlocked.
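
A rough userland sketch of that arrangement, not the kernel code: one
mutex per socket, with the socketbuffer's flags guarded by it.  Here
pthread_cond_wait() stands in for msleep(), since both release the
mutex while asleep and retake it before returning.  All struct, flag
and function names below are made up for illustration.

```c
#include <pthread.h>

/* Hypothetical flag bits for the old-style sleep/flags lock. */
#define SKB_LOCKED 0x01	/* someone holds the flags lock */
#define SKB_WANTED 0x02	/* someone is asleep waiting for it */

struct sketch_sockbuf {
	int		sb_flags;	/* protected by the socket mutex */
	pthread_cond_t	sb_cv;		/* stands in for the sleep channel */
};

struct sketch_socket {
	pthread_mutex_t		so_mtx;	/* the single per-socket mutex */
	struct sketch_sockbuf	so_rcv;
};

void sketch_socket_init(struct sketch_socket *so)
{
	pthread_mutex_init(&so->so_mtx, NULL);
	pthread_cond_init(&so->so_rcv.sb_cv, NULL);
	so->so_rcv.sb_flags = 0;
}

/* Take the flags lock.  The caller must already hold so->so_mtx. */
void sketch_sblock(struct sketch_socket *so, struct sketch_sockbuf *sb)
{
	while (sb->sb_flags & SKB_LOCKED) {
		sb->sb_flags |= SKB_WANTED;
		/* Drops so_mtx while asleep, retakes it on wakeup. */
		pthread_cond_wait(&sb->sb_cv, &so->so_mtx);
	}
	sb->sb_flags |= SKB_LOCKED;
}

/* Release the flags lock.  The caller must already hold so->so_mtx. */
void sketch_sbunlock(struct sketch_socket *so, struct sketch_sockbuf *sb)
{
	(void)so;	/* only documents which mutex guards sb_flags */
	sb->sb_flags &= ~SKB_LOCKED;
	if (sb->sb_flags & SKB_WANTED) {
		sb->sb_flags &= ~SKB_WANTED;
		pthread_cond_broadcast(&sb->sb_cv);
	}
}
```

Because the flags are only ever read or written with the socket mutex
held, the test of SKB_LOCKED and the decision to sleep cannot race
with the wakeup.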

I've gone through a lot of the code replacing manipulation of
statistical counters with atomic_ operations.  Some places have many
manipulations (particularly the tcp code), so there it may make more
sense to keep a local statistics counter on the stack and do a batched
update of the global statistics structure under a spinlock.  Other
alternatives include per-cpu counters, but I've heard many negative
comments about doing stats like that.
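
The batched variant could look roughly like the sketch below.  It is a
userland illustration only: a pthread mutex stands in for the
spinlock, and the structure, field and function names are invented
for the example.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global statistics structure. */
struct sketch_tcpstat {
	unsigned long	rcv_packets;
	unsigned long	rcv_bytes;
};

static struct sketch_tcpstat sketch_global_stat;
/* Stand-in for the spinlock guarding the global statistics. */
static pthread_mutex_t sketch_stat_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Fold one batch into the global counters with a single acquisition. */
static void sketch_stat_flush(const struct sketch_tcpstat *local)
{
	pthread_mutex_lock(&sketch_stat_mtx);
	sketch_global_stat.rcv_packets += local->rcv_packets;
	sketch_global_stat.rcv_bytes += local->rcv_bytes;
	pthread_mutex_unlock(&sketch_stat_mtx);
}

/* Handle a run of packets: count on the stack, then flush once. */
void sketch_input_batch(const size_t *lengths, size_t n)
{
	struct sketch_tcpstat local = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		local.rcv_packets++;		/* no lock, no atomic op */
		local.rcv_bytes += lengths[i];
	}
	sketch_stat_flush(&local);
}
```

The tradeoff is one lock round trip per batch instead of one atomic
operation per counter bump, at the cost of the global numbers lagging
until the flush.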

Bosko Milekic <bmilekic@dsuper.net> was kind enough to MPsafe the
mbuf allocator code.  We need to test this: he used await/asleep
rather than msleep, and this ought to be checked for validity as the
asleep interface was implemented before SMPng and may not be safe.
I'm hoping that Bosko sticks around to help out, he's got some
great programming skill and there's a lot of code to work on.

I've already decided that my initial goal is going to be getting
udp and tcp4 working; unfortunately that means I'm most likely not
working on:

BRIDGE, DUMMYNET, INET6, NETATALK, NS, IPX, IPSEC, NETGRAPH

I suspect that they can easily be made mpsafe, but they aren't a
consideration at this point.  I just want to get something working
right now and that means userland<-(tcp/udp)->wire MPsafe code.

The good part is that now more than ever developers are active enough
to jump in and fix these.  And before I get flamed off the earth
I most likely will not be committing until INET6, IPSEC and NETGRAPH
maintainers are comfortable with it.

Malloc is now MPsafe thanks to jasone and jake, which is obviously
a key starting point.

I had an interesting discovery the other night: when replacing an
spl with a mutex over a particular structure we must be very careful.
While the spl is raised we can tsleep, which effectively drops
the mutual exclusion.  A mutex is not dropped for us when we sleep,
so we must be wary of that when switching over to mutexes to avoid
deadlocks.

A quick (stupid) example: calling a function to wait for data to
arrive on a socket while holding the socket lock and forgetting to
drop the lock before calling it.  Normally spl would be dropped the
instant you slept and the network stack could churn along and dump
some data into your socketbuffer, but this is no longer the case: the
interrupt must also block against your mutex, and if you screw up you
block waiting for data while the socket is locked against outside
manipulation, including data arrival.
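
The safe shape of that wait can be sketched in userland as follows.
This is an illustration under assumptions, not the stack's code: a
second pthread plays the interrupt thread, pthread_cond_wait() plays
the interlocked sleep, and every name is hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

struct wait_socket {
	pthread_mutex_t	mtx;	/* socket lock */
	pthread_cond_t	cv;	/* "data arrived" wait channel */
	int		sb_cc;	/* bytes sitting in the buffer */
};

/* Plays the interrupt thread: it must take the socket lock to deliver. */
static void *wait_deliver(void *arg)
{
	struct wait_socket *so = arg;

	pthread_mutex_lock(&so->mtx);	/* would block forever if the
					 * sleeper slept holding mtx */
	so->sb_cc += 64;
	pthread_cond_signal(&so->cv);
	pthread_mutex_unlock(&so->mtx);
	return NULL;
}

/* Sleep until data shows up; the lock is released while asleep. */
int wait_for_data(struct wait_socket *so)
{
	int cc;

	pthread_mutex_lock(&so->mtx);
	while (so->sb_cc == 0)
		pthread_cond_wait(&so->cv, &so->mtx);
	cc = so->sb_cc;
	pthread_mutex_unlock(&so->mtx);
	return cc;
}

/* Start a deliverer, wait for its data; returns byte count or -1. */
int wait_demo(void)
{
	struct wait_socket so;
	pthread_t intr;
	int cc;

	pthread_mutex_init(&so.mtx, NULL);
	pthread_cond_init(&so.cv, NULL);
	so.sb_cc = 0;
	if (pthread_create(&intr, NULL, wait_deliver, &so) != 0)
		return -1;
	cc = wait_for_data(&so);
	pthread_join(intr, NULL);
	pthread_cond_destroy(&so.cv);
	pthread_mutex_destroy(&so.mtx);
	return cc;
}
```

Replacing the pthread_cond_wait() loop with a plain sleep or spin
while still holding mtx reproduces the deadlock described above:
wait_deliver() can never get in to add the data.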

So far I think I have a pretty sound system protecting sockets.  There's
also some preliminary stuff with routes and pcbs but I need to work on
those more.

I've switched the ucred system to use atomic ops which should make it
mpsafe.
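
The entry doesn't say which ucred fields were converted.  Assuming it
is the reference count, the idea can be sketched with C11 atomics as
below.  The struct and function names are hypothetical and
deliberately differ from the kernel's own.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical credential with an atomically managed reference count. */
struct sketch_ucred {
	atomic_uint	cr_ref;
	unsigned int	cr_uid;
};

/* Allocate a credential holding one reference; NULL on failure. */
struct sketch_ucred *sketch_cralloc(unsigned int uid)
{
	struct sketch_ucred *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		return NULL;
	atomic_init(&cr->cr_ref, 1);
	cr->cr_uid = uid;
	return cr;
}

/* Take another reference; no lock needed. */
void sketch_crhold(struct sketch_ucred *cr)
{
	atomic_fetch_add(&cr->cr_ref, 1);
}

/* Drop a reference.  Returns 1 if this call freed the credential. */
int sketch_crfree(struct sketch_ucred *cr)
{
	/* fetch_sub returns the old value: 1 means we held the last one. */
	if (atomic_fetch_sub(&cr->cr_ref, 1) == 1) {
		free(cr);
		return 1;
	}
	return 0;
}
```

Only the thread that sees the count go from 1 to 0 frees the
structure, so two CPUs dropping references at once cannot both free
it.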

Journal continued at:
  http://people.freebsd.org/~alfred/mpsafe/stackjournal.txt

Work in progress:
  http://people.freebsd.org/~alfred/mpsafe/mpsafestack.diff
  

Ok, and here begins a time based journal.
----------------------------------------------

Mon Sep 11 10:16:50 PDT 2000

Realized that attempting to thread tcp_input code before ether code was
a bad idea.  The tcp code uses global variables from the IP code
which probably uses globals from the ether code, so I'm working in
the wrong direction (or working in a direction that's going to have
me spread out too thin).

I've decided to take this route:
ether_input->ip_input->tcp/udp_input->
and
tcp_output->ip_output->ether_output

Mon Sep 11 10:43:55 PDT 2000
skipping work on bpf for the time being.
skipping work on vlan

Sun Sep 17 09:37:17 PDT 2000
I posted this to arch@freebsd.org, jlemon doesn't agree with me that
this is a good idea, but it does need further discussion:
    
    From bright@wintelcom.net Tue Sep 12 02:58:08 2000
    Date: Tue, 12 Sep 2000 04:26:54 -0700
    From: Alfred Perlstein <bright@wintelcom.net>
    To: arch@freebsd.org
    Subject: what to do with softinterrupts?
    Message-ID: <20000912042654.L12231@fw.wintelcom.net>
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Disposition: inline
    User-Agent: Mutt/1.2.4i
    Status: RO
    Content-Length: 2422
    Lines: 57
    
    There's a problem with softinterrupts: we have only one.
    
    I'm trying to figure out what to do with the situation because
    unless I can schedule more than one concurrently I can't really
    test much of what I'm working on, there's also making the input
    path concurrent and responsive.
    
    After thinking about it for some time I came up with a solution
    for network drivers.
    
    The idea is to keep several spare interrupt threads available for
    the network interrupts.
    
    Instead of ether_input() figuring out the type of packet and scheduling
    a soft interrupt it will:
    
      Swap out its responsibility with another spare thread to handle
        the interface's hardware interrupt
      Perform a callback to the driver to re-enable interrupts
      Jump directly into what is now a pseudo-soft-interrupt
    
    This will happen as long as it can reserve a spare thread, and
    we're not pre-empting a softinterrupt running(*) in the same protocol
    on the same CPU.  Otherwise it simply queues and schedules like
    before.
    
    (*) running == not blocked on a mutex, if it's blocked on a mutex
    we allow another jump from hardware to software to gain concurrency.
    
    When the softinterrupt is complete it will then recheck its input
    queue for mbufs, if there are none it will return to its state as
    an idle interrupt thread making itself free for consumption to
    service other devices or stacks.
    
    From my point of view this will give:
       better concurrency, by making the stack less likely to stall
         because of blocking while accessing potentially locked objects
       less context switching because there's no need to schedule a soft
         interrupt unless we are seriously bogged down
    
    Poul-Henning Kamp thought that investigating atomic ops and
    keeping most of the current scheme of things would also work.
    We'd have to have a softinterrupt bound to each CPU, but as long
    as they could proceed without blocking we'd probably be ok.
    The problem I see with this is that it sort of tries to get around
    the whole interrupt thread idea by going back to non-blocking
    interrupts instead of taking the advantage (or performance hit)
    of the threads system we have in place.
    
    I'm not dead set on either idea, but I wanted to know what other
    solutions people could come up with to address this problem, any
    takers?  any existing solutions?
    
    thanks,
    -- 
    -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
    "I have the heart of a child; I keep it in a jar on my desk."
    

Sun Sep 17 11:15:09 PDT 2000

Got mail from cp (Chuck Paterson) with some detailed instructions about
how locking ought to be done via a journal done by David Borman.

The journal explains a lot of the details of the BSD/OS locking methodology
in the network stack and explains how splitting the socket locks into locks
on the individual socket buffers is a good thing.  I will be switching
back to that.  The reason I shied away from it was that I thought it would
be a good idea to have a lock on both the socket itself as well as the
socketbuffers; David's way is to alias the "entire socket" lock onto the
receive socketbuffer lock.  I like this and think switching to it may be
beneficial.

He also suggests using macros for the socket locks.  I'm doing this just
to make the code easier, but I don't necessarily agree with it as it
adds yet another level of indirection for someone who later examines the
code to figure out.  ("is it a spinlock? a sleep lock?")
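
A userland sketch of the aliasing idea, with macros hiding the lock
type.  The macro and struct names are invented for this example and
are not taken from BSD/OS or from the work-in-progress diff; pthread
mutexes stand in for the kernel ones.

```c
#include <pthread.h>

struct alias_sockbuf {
	pthread_mutex_t	sb_mtx;	/* one lock per socket buffer */
	int		sb_cc;
};

struct alias_socket {
	struct alias_sockbuf	so_rcv;	/* its lock doubles as the
					 * "entire socket" lock */
	struct alias_sockbuf	so_snd;
	int			so_state;
};

/* Per-buffer locking. */
#define ALIAS_SB_LOCK(sb)	pthread_mutex_lock(&(sb)->sb_mtx)
#define ALIAS_SB_UNLOCK(sb)	pthread_mutex_unlock(&(sb)->sb_mtx)

/* The whole-socket lock is only an alias for the receive buffer's. */
#define ALIAS_SOCK_LOCK(so)	ALIAS_SB_LOCK(&(so)->so_rcv)
#define ALIAS_SOCK_UNLOCK(so)	ALIAS_SB_UNLOCK(&(so)->so_rcv)

void alias_socket_init(struct alias_socket *so)
{
	pthread_mutex_init(&so->so_rcv.sb_mtx, NULL);
	pthread_mutex_init(&so->so_snd.sb_mtx, NULL);
	so->so_rcv.sb_cc = 0;
	so->so_snd.sb_cc = 0;
	so->so_state = 0;
}
```

With this layout a sender working on so_snd does not contend with
code that holds the whole-socket lock, while anything touching the
receive buffer does.  The macros are also where the indirection
complaint comes from: the call site no longer says what kind of lock
is underneath.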

Sun Sep 17 11:49:08 PDT 2000

BSD/OS doesn't call most of the protocol routines with the socket passed
in locked.  This seems to be a bad idea because a lot of the BSD/OS code
looks like this:

	/* socket not locked */
	so->pru()  /* locks and unlocks the socket */
	lock socket
	do something
	unlock socket

I'm investigating locking the socket first and then calling the pru
routine.
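
The difference between the two calling conventions can be sketched in
userland like this.  Everything here is hypothetical: the struct, the
stand-in "pru" routine and the lock_ops counter exist only to make
the number of lock acquisitions visible.

```c
#include <pthread.h>

struct pru_socket {
	pthread_mutex_t	so_mtx;
	int		so_count;	/* illustrative protected state */
	int		lock_ops;	/* number of lock acquisitions */
};

void pru_socket_init(struct pru_socket *so)
{
	pthread_mutex_init(&so->so_mtx, NULL);
	so->so_count = 0;
	so->lock_ops = 0;
}

static void pru_so_lock(struct pru_socket *so)
{
	pthread_mutex_lock(&so->so_mtx);
	so->lock_ops++;
}

static void pru_so_unlock(struct pru_socket *so)
{
	pthread_mutex_unlock(&so->so_mtx);
}

/* Protocol work that expects the socket lock to be held already. */
static void pru_work_locked(struct pru_socket *so)
{
	so->so_count++;
}

/* Style A: the routine locks internally, then the caller locks again. */
int pru_call_unlocked_style(struct pru_socket *so)
{
	pru_so_lock(so);	/* inside the "pru" routine */
	pru_work_locked(so);
	pru_so_unlock(so);

	pru_so_lock(so);	/* caller's own follow-up work */
	so->so_count++;
	pru_so_unlock(so);
	return so->lock_ops;
}

/* Style B: lock first, call the routine, finish, unlock once. */
int pru_call_locked_style(struct pru_socket *so)
{
	pru_so_lock(so);
	pru_work_locked(so);
	so->so_count++;
	pru_so_unlock(so);
	return so->lock_ops;
}
```

Style A takes the lock twice and leaves a window between the two
critical sections where the socket can change; style B takes it once
and keeps the state stable across the call.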