The algorithms described on this page are essentially the BSDI algorithms plus the accommodations we discussed at the Yahoo SMP meeting. However, I did not do a direct port. I did a from-scratch rewrite because, simply put, it was easier to deal with the FreeBSD legacy issues (spl's and such) that way. The variables are named differently but I attempt to follow the same API. All the work is subject to change... the point is to make it work first, then clean it up later.
This document represents the eventual goals that came out of the SMP meeting. It represents the work that several people will be doing (for example, Greg wanted a shot at doing the light weight interrupt implementation). My personal task list is at the end of the document.
Patchset links:
SP Patch #3 (25 June 2000) - This is the first stage Mutex patchset for FreeBSD-current. It is for SP (single-cpu) builds only. MP builds are currently broken and they are going to stay that way until Greg adds heavy-weight interrupt threads to the fray, because that's the only way we can turn GiantMutex into a blocking mutex in an MP system.
To simplify scheduler and mutex operation, each cpu in the system will have an idleproc process associated with it. This process does not exist on any scheduling queue but will be returned by chooseproc() whenever chooseproc() cannot find a real process to switch to. Each cpu has its own private idleproc (a uni-processor machine will have a single idleproc instance). PID numbering for these special processes begins at 10.
With idleproc, cpu_switch() is now able to assume that chooseproc() will always return a process to switch to and, as a consequence, cpu_switch() is now greatly simplified.
The idleproc is never on any scheduler queue. Instead it is returned directly to cpu_switch() by chooseproc() when chooseproc() is not able to find any process to run. Since idleproc must always be available for chooseproc() to return, idleproc is not allowed to block on anything beyond the scheduler spin mutex. It cannot, for example, obtain any other mutex (not even GiantMutex(1)). idleproc's entire goal in life is to provide a stack context for interrupts, make cpu_switch()'s life easier, and spin looking for new processes to switch to. note1: idleproc can obtain other mutexes in a non-blocking manner only if it already holds the scheduler mutex. This may or may not be useful for page-zeroing operations. In general I think we will have another sort of 'idle' process to deal with page-zeroing.
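A minimal sketch of the chooseproc() fallback, assuming a hypothetical per-cpu idleproc array and run-queue helper (the real queue scan is more involved):

    /*
     * Sketch: chooseproc() can now always return something to run.  If
     * the run queues are empty it falls back to this cpu's private
     * idleproc, which never appears on any run queue.  runq_choose(),
     * idleproc_pcpu[] and cpuid() are illustrative names only.
     */
    struct proc *
    chooseproc(void)
    {
            struct proc *p;

            p = runq_choose();                      /* best runnable proc, or NULL */
            if (p == NULL)
                    p = idleproc_pcpu[cpuid()];     /* never NULL */
            return (p);
    }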
The idleproc implementation also guarantees a distinct, non-NULL curproc variable for each cpu, which we use to great effect as the locking id in our mutexes. The BSDI-style mutexes lock on curproc rather than on cpuid, and so do ours. This is necessary in order to implement light-weight interrupt optimizations and preemptive scheduling within the kernel.
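To illustrate what locking on curproc means, a non-recursive spin acquire might look roughly like the following. The field and primitive names (mtx_lock, atomic_cmpset_ptr()) are stand-ins for whatever the final code uses:

    /*
     * Sketch only: the owner field of a mutex records a proc pointer
     * (curproc), not a cpu id.  NULL means the mutex is free.
     * atomic_cmpset_ptr() stands in for a compare-exchange primitive.
     */
    struct mtx {
            struct proc *mtx_lock;          /* owning process, or NULL */
    };

    void
    mtx_enter_spin(struct mtx *m)
    {
            while (atomic_cmpset_ptr(&m->mtx_lock, NULL, curproc) == 0)
                    ;       /* spin until the current owner releases it */
    }

Because the owner is a process rather than a cpu, ownership stays meaningful across preemption, which is what the light-weight interrupt and preemptive scheduling work relies on.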
As a consequence of idleproc's special restrictions, the scheduler can in fact rip the idleproc's context out from under it (switch away to another process without saving idleproc's context) at any time when the idleproc is not holding the scheduler mutex. This provides several potential optimizations that make the idleproc implementation considerably lighter-weight than you might otherwise think. For example, when idleproc locates a new process to switch to, it throws its own context away (giving us double the performance of a normal process switch).
The rip-out-from-under feature should allow us to use the hlt instruction within idleproc at some future date without too much trouble.
The scheduler is now protected ONLY by SchedMutex. The scheduler is no longer protected by cli/sti in any way. In fact, when we are through there will be only two or three places in the kernel code where a cpu's own interrupts are physically disabled!
The scheduler mutex is a spin-lock -- hopefully the only spinlock we will have in the system. It is used not only by mainline contexts to obtain exclusive access to the scheduler queues and process state (such as process priority), but also by interrupts to implement light-weight interrupt threads (Greg's follow-on work to my stuff), as well as heavy-weight interrupt threads. An interrupt which is able to obtain the SchedMutex can perform any scheduling action, including waking its own heavy-weight thread up, switching out of the current process it interrupted, and so forth. In fact, if an interrupt is interrupting the cpu's idle process, it can even throw away the current context and perform a reasonably optimal double-performance heavy-weight switch into its own process.
Interrupts interlock with SchedMutex only on the same cpu. While you hold SchedMutex, your cpu is guaranteed not to take any interrupts (but on an MP system other cpu's may). Of course, this is a deferred interrupt scheme... the interrupt will actually occur, but it will set its bit in ipending and leave immediately if the cpu it is taken on already holds the scheduler mutex. When you release your last recursion reference to SchedMutex, the mutex code will attempt to run any pending interrupts. We have to do things this way in order to eventually allow interrupts to run scheduler operations... they obviously can't if they are interrupting a process which already owns the SchedMutex (and thus the scheduler is in an inconsistent state), but they can if nobody owns SchedMutex or if the process owning it resides on some other cpu (the interrupt code can spin waiting for the mutex).
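A rough sketch of the deferral path. sched_mtx, mtx_recurse, sched_ithd() and run_pending_ints() are illustrative names, and the real vector code is in assembly and considerably more careful than this:

    /*
     * Interrupt entry: if this cpu already owns SchedMutex the scheduler
     * may be in an inconsistent state, so just record the interrupt in
     * ipending and leave; it gets re-run when SchedMutex is released.
     */
    void
    intr_entry(int irq)
    {
            if (sched_mtx.mtx_lock == curproc) {
                    atomic_set_int(&ipending, 1 << irq);    /* defer */
                    return;
            }
            sched_ithd(irq);        /* free to do scheduler work */
    }

    /*
     * Releasing the last recursion reference to SchedMutex runs anything
     * that was deferred above.
     */
    void
    sched_mtx_release(void)
    {
            if (sched_mtx.mtx_recurse > 0) {
                    sched_mtx.mtx_recurse--;        /* still held */
                    return;
            }
            sched_mtx.mtx_lock = NULL;              /* final release */
            if (ipending != 0)
                    run_pending_ints();
    }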
An interrupt can occur on another cpu while you hold SchedMutex on your cpu. A light weight interrupt taken on another cpu which does not need to do any switching can, in fact, run to completion. However, if the interrupt at any time needs to access the scheduler, it will spin until you release SchedMutex on your cpu.
SchedMutex can be used to interlock against interrupts on the same cpu but cannot be used to interlock against interrupts on other cpu's.
At the moment I've hacked all the console code to obtain the scheduler mutex instead of spltty()'ing all over the place, because quite simply it can't spltty() any more. That is illegal because the console code may be called at any time (see the next section). This is a hack because it isn't MP safe against tty interrupts running on another cpu. Since the current tty interrupt is a fast interrupt, I'm not sure that we're any worse off (fast interrupts aren't masked by the cpl anyway).
The SMP cleanup I did earlier in the year actually implemented a necessary first step to the BSDI/FreeBSD SMP plans... it changed the rules for spl*() calls.
The new rule is as follows: a process or interrupt may only mess around with spl*() levels (the cpl kernel variable) while it holds the GiantMutex. The GiantMutex replaces the FreeBSD MPLock. It serves exactly the same function, except that it is a blocking mutex rather than a spin lock.
What this means is that any process or interrupt trying to obtain the GiantMutex may actually wind up being descheduled for a period of time. At the moment there are still many unprotected spl*() calls left in the system... if any of them is hit without GiantMutex being held, a panic will be generated.
The cpl variable implements the spl*() mechanisms. It is a system-wide global, so its use is restricted to the process holding GiantMutex. You cannot make any spl*() calls unless you hold GiantMutex!! This is the only way the legacy support can be made to work.
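The rule is easy to sanity check at run time. A sketch of the sort of assertion the spl*() entry points can make; mtx_owned() and the cpl manipulation shown are illustrative:

    /*
     * Every spl*() routine can verify the new rule before touching the
     * global cpl.  Hitting this without GiantMutex is exactly the panic
     * described above.
     */
    int
    splhigh(void)
    {
            int ocpl;

            if (!mtx_owned(&Giant))
                    panic("splhigh: cpl modified without holding GiantMutex");
            ocpl = cpl;
            cpl |= HWI_MASK | SWI_MASK;     /* mask everything */
            return (ocpl);
    }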
Since legacy interrupt and legacy mainline code spl*()'s all over the place, it requires GiantMutex to operate. The default operating characteristic for any entry into the kernel is to obtain the GiantMutex and release it upon return. The idea is to unwind the requirements in said code so that it no longer has to hold GiantMutex. Here is a quick overview:
Fast Interrupts - no mutex is acquired (the process will eventually be scheduled using SchedMutex, but SchedMutex will be released prior to entry into the thread). By writing a driver that makes use of fast interrupts, your interrupt thread will be entered without any mutexes acquired.
Normal Interrupts - GiantMutex is acquired. The interrupt may obtain/release it at any time but, typically, it is either held throughout or never held. note: an interrupt thread will eventually have the choice of blocking normally when it is finished, or throwing away its context to yield higher switching performance.
System Calls - GiantMutex is normally acquired; a flag in the syscall table can turn this off. A system call may obtain/release the GiantMutex at any time. This was actually part of my SMP cleanup patch a month or two ago.
VM Fault - GiantMutex is acquired. The mutex may be obtained/released at any time by the code.
All Other Traps - GiantMutex is acquired. The mutex may be obtained/released at any time by the code.
Process Switching - The SchedMutex must be acquired prior to entering the scheduler (e.g. cpu_switch(), mi_switch()). The scheduler entry routines have been renamed to mp_cpu_switch() and mp_switch() to ensure that we get a link failure for any code that hasn't been converted.
Kernel Threads - When creating a kernel thread with the (now renamed) mp_kthread_create(), the new kernel thread will be entered with SchedMutex held and GiantMutex not held. Most legacy kernel threads will want to immediately release the SchedMutex, obtain the GiantMutex, and then run based on that (see the sketch after this list).
tsleep() - You may call tsleep(), asleep(), and await() as per normal. These routines do not care whether you are holding the SchedMutex and/or GiantMutex. They will obtain the SchedMutex as appropriate, release the GiantMutex as appropriate, save the SPL state (if they had been entered with the GiantMutex held), do the process switching as appropriate, then restore everything prior to returning. Note: The BSDI tsleep() has an extra argument, allowing a Mutex to be passed in to be released atomically. My patchset currently does not do this but we will probably either make the same change, or we will formalize the use of asleep()/await() so as not to require that sort of atomic interlock.
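Here is a sketch of the legacy kernel-thread startup pattern from the Kernel Threads item above. mtx_enter()/mtx_exit(), the mutex names and the sleep channel are placeholders:

    /*
     * A thread created via mp_kthread_create() is entered holding
     * SchedMutex and not holding GiantMutex.  Legacy thread bodies that
     * still spl*() and tsleep() want the opposite, so they flip the two
     * mutexes immediately.
     */
    static void
    legacy_kthread_main(void *arg)
    {
            mtx_exit(&sched_mtx);   /* we were entered holding it */
            mtx_enter(&Giant);      /* legacy code needs GiantMutex */

            for (;;) {
                    /* ... legacy work that may spl*() and block ... */
                    tsleep(arg, PWAIT, "lkwait", hz);
            }
    }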
Remember that the cpl may only be modified (i.e. by calling spl*() procedures) while holding the GiantMutex. If you mess with the cpl, you must restore it prior to releasing the GiantMutex. You cannot hold the cpl through a release/reacquisition. However, the scheduler (e.g. tsleep()) will automatically release the GiantMutex and save the cpl state (only if the GiantMutex was held) when it puts the process to sleep, and restore both when the process is woken up and gets the cpu again.
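For example, legacy code written like this keeps working, assuming GiantMutex is held on entry (the sleep channel and priority are illustrative):

    /*
     * The spl*() calls are legal because GiantMutex is held.  tsleep()
     * releases GiantMutex and saves the cpl while the process sleeps,
     * then reacquires and restores both before returning, so the splx()
     * below still sees a consistent cpl.
     */
    static void
    legacy_wait(void *chan)
    {
            int s;

            s = splbio();
            tsleep(chan, PRIBIO, "lwait", 0);
            splx(s);        /* restore the cpl before Giant can be released */
    }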
Fast Interrupts do not require GiantMutex to operate... pretty much the same as fast interrupts worked before. One of our goals will be to migrate all legacy 'slow' interrupts to be 'fast' interrupts. Note that since fast interrupts do not obtain the GiantMutex, no fast interrupt is allowed to use spl*() calls. Fast interrupts will use the same light-weight scheduling mechanism that slow interrupts use; the only difference is whether the GiantMutex is acquired automatically or not.
Previously, interrupts set their own bit (and other bits) in the cpl to mask themselves, which allowed the physical interrupt to be re-enabled while still preventing re-entry while the interrupt routine ran.
For obvious reasons (GiantMutex requirement) we cannot use this mechanism for the core of our interrupt handling code. For slow interrupts we will eventually get the GiantMutex and thus be able to use legacy cpl mechanisms (i.e. applying a wider cpl mask when running one particular interrupt), but this occurs much later in the interrupt code.
(XXX irunning is currently not being used; I don't know yet whether we will need an irunning mechanism or whether asleep()/await()/wakeup() will be a sufficient mechanism.) In the new system we implement irunning as a complement to ipending. When an interrupt is taken, assuming no conflicts, the irunning bit is set, the interrupt routine is run, and the irunning bit is then cleared. Running the interrupt routine may or may not entail obtaining GiantMutex; that is irrelevant here. If the same interrupt occurs again while the interrupt routine is running, irunning causes the second interrupt to be deferred. When the first interrupt completes, doreti will test for and deal with the deferred interrupt(s). Unlike the cpl mechanism, the irunning mechanism can only mask the specific interrupt that occurs. It cannot mask a general set of interrupts. This means that the 'fast' version of the interrupt code -- the one that will use private mutexes -- must use those mutexes to implement wider masking if it needs it.
irunning interlocks cpl for spl*() calls in legacy code. In an MP system one cpu may start running an interrupt without GiantMutex just before another cpu owning the GiantMutex raises its spl*() level. Note that GiantMutex is not required to run a new-style interrupt, so this situation is expected to occur quite often. Also, since interrupts are now scheduled, they can also block, so it is possible for an interrupt to be marked as 'running' for a much longer period of time and thus create an interlock with another process even on a single-cpu system.
When the cpu owning the GiantMutex attempts to spl*() to a higher level, the spl*() call will block (schedule-switch type blocking, not spin blocking) until all new bits being added to the cpl mask are found to not be running (by testing them against irunning). The interrupt code will set its bit in irunning before it checks to see if the interrupt has been masked by the cpl. By setting its bit in irunning the interrupt code is now guaranteed that its bit in the cpl will not be set unless it was already set prior to the irunning update. The interrupt code can then test against the cpl without having to worry about races. If the interrupt code sees a conflict it will set ipending, clear its irunning bit, then return.
This interlock also works in reverse. The spl*() code will attempt to set the irunning bit(s) for the interrupt bits being added to the cpl first, interlocking against new interrupts, and only when it is able to successfully accomplish this will it add the bits to its cpl mask. Once it has added the bits to its cpl mask it will clear them in irunning. The compare-exchange instruction is used for most of these operations.
Spl lowering works as it does now... bits are removed from the cpl and ipending is checked to determine if unmasked interrupts are runnable. Actually, the check is somewhat more complex. For an ipending interrupt to be runnable, its bit may not be set in either the cpl or irunning. The interlock code is very carefully written to close all races. By the same token, if a new interrupt comes in and is masked by either cpl or irunning, the interrupt code will defer it (by setting ipending).
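A sketch of the interrupt-side half of this interlock. The three bitmask globals are the real ipending/irunning/cpl words, but run_interrupt() is an illustrative name and the remaining small races are closed with compare-exchange loops in the real code, as described above:

    /* illustrative declarations for the three interlocking bitmasks */
    extern volatile u_int cpl, ipending, irunning;

    void
    intr_dispatch(int irq)
    {
            u_int mask = 1U << irq;

            if ((cpl | irunning) & mask) {
                    /* masked by spl*(), or already running: defer */
                    atomic_set_int(&ipending, mask);
                    return;
            }
            atomic_set_int(&irunning, mask);        /* claim it first... */
            if (cpl & mask) {
                    /* ...then re-test the cpl race-free, and back out */
                    atomic_set_int(&ipending, mask);
                    atomic_clear_int(&irunning, mask);
                    return;
            }
            run_interrupt(irq);     /* may obtain GiantMutex, may block */
            atomic_clear_int(&irunning, mask);
    }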
The whole point of the irunning addition was to provide a masking mechanism independent of the cpl in order to allow a mixed legacy / new-code system to operate.
idleproc | Complete |
Legacy SPL Support | Under test |
Legacy Interrupt Support | Under test |
SchedMutex | Complete |
GiantMutex | Under test (almost complete) |
light-weight ints | TODO (Greg) |
heavy-weight ints | TODO (Greg?) |
Other mechanisms | Send me mail guys! |
_mtx_enter_blocking() currently panics if it has to block. This needs to be removed when Greg puts interrupt threads in place.
The ICU interrupt code needs to be adjusted to wake up interrupt threads rather than jump to the interrupt vector (TODO for Greg).
Add HLT instruction support back in, in idle_proc. When interrupt threads are in and interrupts are allowed to rip out the idle_proc context (completely, not just switch away from it), we can HLT in the idle_proc context without there being any race conditions for the SP case. The MP case will also be trivial.
asleep() puts the task on a wait queue. We have to make sure that this does not interfere with any preemptive scheduling of the task. That is, the preempter needs to detect the situation and remove the task from the wait queue before placing it back on the run queue.
asleep()/await() versus an extra argument to tsleep(). At the moment I am leaning towards asleep()/await() (see the sketch at the end of this list). For example, an interrupt would call asleep() at the top and await() at the bottom. Reentry into the interrupt would be able to schedule the interrupt thread for the next run even while the thread is still executing the first run. This may not work without some effort, since if the interrupt thread sleeps on some condition in the middle of this it will lose the 'I need to run the interrupt again' state. So we will probably wind up using ipending.
I thought we had a problem on MP systems with spl*() calls having to wait for the appropriate irunning bit to go away, in case some other cpu was already running the interrupt being masked. I have come to the conclusion that we do not need irunning at all. Legacy interrupts will require the GiantMutex just like the mainline code using spl*(), so there will be no conflict. Fast mutex interrupts are not allowed to use the spl*() mechanism anyway, and neither should mainline code trying to protect against such interrupts (they should use the mutex mechanism as well). I believe that this means I can rip out irunning.
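For reference, the asleep()-at-the-top / await()-at-the-bottom shape being weighed in the task item above would look roughly like this. run_handlers() is illustrative, and the lost-wakeup problem described above is exactly what happens if the handler body itself sleeps in the middle:

    /*
     * Arm the sleep before doing the work: a wakeup() delivered while
     * run_handlers() is still executing cancels the armed sleep, so the
     * following await() returns immediately and the thread runs again.
     */
    static void
    ithread_main(void *arg)
    {
            for (;;) {
                    asleep(arg, PWAIT, "ithd", 0);
                    run_handlers(arg);
                    await(PWAIT, 0);
            }
    }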