Digital Technical Journal
Volume 6, Number 3

DEC OSF/1 Symmetric Multiprocessing

by Jeffrey M. Denham, Paula Long, and James A. Woodward

ABSTRACT

The primary goal for an operating system in a symmetric
multiprocessing (SMP) implementation is to convert the additional
computing power provided to the system, as processors are added, into
improved system performance without compromising system quality. The
DEC OSF/1 version 3.0 operating system uses a number of techniques to
achieve this goal. The techniques include algorithmic enhancements to
improve parallelism within the kernel and additional lock-based
synchronization to protect global system state. Synchronization
primitives include spin locks and blocking locks. An optional locking
hierarchy was imposed to detect latent symmetric multiprocessor
synchronization issues. Enhancements to the kernel scheduler improve
cache usage by enabling soft affinity of threads to the processor on
which the thread last ran; a load-balancing algorithm keeps the number
of runnable threads spread evenly across the available processors. A
highly scalable and stable SMP implementation resulted from the
project.

INTRODUCTION

The DEC OSF/1 operating system is a Digital product based in part on
the Open Software Foundation's OSF/1 operating system.[1] One major
goal of the DEC OSF/1 version 3.0 project was to provide a leadership
multiprocessing implementation of the UNIX operating system for Alpha
server systems, such as the Digital AlphaServer 2100 product. This
paper describes the goals and development of this operating system
feature for the version 3.0 release.

THE DEC OSF/1 VERSION 3.0 MULTIPROCESSING PROJECT

Multiprocessing platforms like the AlphaServer 2100 product provide a
cost-effective means of increasing the computing power of a server.
Additional computing capacity can be obtained at a potentially
significant cost advantage by simply adding CPU modules to the system
rather than by adding a new system to a more loosely coupled
network-server arrangement. An effective execution of this
server-scaling strategy requires significant cooperation between the
hardware and software components of the system. The hardware must
provide symmetrical (i.e., equal) access to system resources, such as
memory and I/O, for all processors; the operating system software
must provide for enough parallelism in its major subsystems to allow
applications to take advantage of the additional CPUs in the system.
That is, the operating system cost of multiprocessing must be kept
low enough to enable most of an additional CPU's computing power to
be used by applications rather than by the operating system's efforts
to synchronize simultaneous access to shared memory by multiple
processors.

Regarding hardware, the AlphaServer 2100 product and the other Alpha
multiprocessing platforms provide the shared memory and symmetric
access to the system and I/O buses desired by the operating system
designers.[2] The design allows all CPUs to share a single copy of
the operating system in memory. The hardware also provides a
load-locked/store-conditional instruction sequence, which gives a
single processor a mechanism for atomic updates to shared memory, as
well as an interprocessor interrupt mechanism. Given these hardware
features, operating system software developers have a great deal of
freedom in developing a multiprocessing strategy.
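To make the atomic-update mechanism concrete, the following is a
minimal sketch, in modern C, of the kind of update that the
load-locked/store-conditional sequence enables; on an architecture
like Alpha, a compiler realizes the compare-and-swap below as a
load-locked/store-conditional retry loop. The counter and function
names are invented for illustration.

    #include <stdatomic.h>

    static atomic_long shared_counter;

    /* Atomically add n to a counter shared by all processors. */
    void counter_add(long n)
    {
        long old = atomic_load(&shared_counter);

        /* The store succeeds only if no other processor wrote the
         * location between our read and our conditional store;
         * otherwise old is reloaded and the update is retried. */
        while (!atomic_compare_exchange_weak(&shared_counter,
                                             &old, old + n))
            ;
    }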
The approach used in DEC OSF/1 version 3.0 is called symmetric
multiprocessing (SMP), in which all processors can participate fully
in the execution of operating system code. This symmetric design
contrasts with asymmetric multiprocessing (ASMP), in which all
operating system code must be executed on a single designated
"master" processor. Such an approach is undesirable because it
provides inadequate utilization of additional "slave" processors for
most application mixes. By contrast, for the DEC OSF/1
multiprocessing design, the concept of a master processor applies
only to the keeping of the global system time and to other
specialized uses (such as supporting subsystems that are not yet
fully symmetric).

The SMP features in the DEC OSF/1 version 3.0 operating system are
based on the joint work of Carnegie Mellon University, for the Mach
version 2.5 kernel, and the Open Software Foundation and the Encore
Computer Corporation, for the version 1.2 release of the OSF/1
operating system.[3-6] From this substantial technical base, the DEC
OSF/1 multiprocessing project focused on achieving UNIX leadership
performance on targeted commercial server applications, such as data
servers (i.e., DBMS and file servers) and compute servers. These
application domains tend to make heavy use of system services.
Therefore, shortcomings in the multiprocessing implementation become
readily apparent through the failure of these applications to gain
significant performance speedups as processors are added to the
server. The ideal benefit is, of course, to obtain 100 percent of
each additional processor for the applications' use. In reality, a
gain of 70 to 80 percent of the last CPU added is well worth the
incremental cost of the processor.

From the outset of the project, the engineering team was empowered to
enhance and augment the OSF/1 version 1.2 code base to obtain this
level of multiprocessing performance for DEC OSF/1 version 3.0. At
the same time, it was required to maintain the system's stability and
reliability. The team was staffed by engineers with extensive
multiprocessing and real-time operating system experience inside and
outside Digital. Quality assurance (Q/A) and performance teams
provided considerable feedback as the product moved through its
development base levels.

The engineering team faced multiple technical issues in the SMP
implementation of the DEC OSF/1 operating system, including

o  Analyzing concurrency and locking issues

o  Adapting the base operating system for SMP

o  Supporting a comprehensive lock package

o  Adapting thread scheduling for SMP

o  Ensuring a quality implementation

o  Benchmarking progress in SMP performance

The remainder of this paper describes the highlights of the team's
efforts in these areas.

ANALYZING CONCURRENCY AND LOCKING ISSUES

Moving from a uniprocessor to a shared-memory, symmetric
multiprocessing platform places new demands on an operating system.
Multiple processes running independently on separate processors can
access kernel data structures simultaneously. This level of true
concurrency is unobtainable on uniprocessor systems, where
concurrency either derives from the asynchronous execution of
interrupt service routines (ISRs) or is emulated through the
interleaving of processes on a time-share basis. In the first case,
synchronization is required for data structures accessed by both
mainline kernel code and the ISR.
The technique used to achieve synchronization is to raise the
processor interrupt priority level (IPL), i.e., the system priority
level (SPL) in UNIX parlance, in the mainline code to the level used
by the competing ISR, thus blocking the interrupt that invokes the
ISR. In the case of the virtual concurrency provided by process
time-sharing, synchronization is achieved by allowing only one
process to be in kernel context at a time. The kernel protects itself
by preventing context switching (process preemption) until an
executing process has reached a safe point, usually when it is about
to leave kernel context. Other safe points appear when a process must
voluntarily block to await the availability of some resource. These
are the synchronization strategies employed by traditional UNIX-based
operating systems.

One powerful feature of the OSF/1 kernel provides a further level of
concurrency, which complicates the process of synchronizing access to
kernel data; that feature is kernel-based threads. The Mach
task/thread model allows multiple threads of execution to be active
within a single task (process) address space. Therefore, whereas an
unthreaded UNIX system has to protect data shared by multiple
processes, e.g., the scheduling queues, a threaded kernel must also
protect all process-level data, which is shared by all threads in the
process.

Although in many ways a traditional UNIX system from the user's point
of view, the first version of the DEC OSF/1 operating system departed
from typical UNIX practice by providing kernel-mode preemption in its
real-time version of the kernel. This enhancement, targeted to
improve the responsiveness of the system to real-time events, allows
preemptive priority-based context switching between threads to occur
at any point in kernel execution that meets a set of criteria for
preemption safety. These criteria have immediate relevance to the
work of adapting the OSF/1 uniprocessor code to a multiprocessing
environment. In the following discussion, each criterion for safe
preemption is presented as it relates, and leads, to an understanding
of correct multiprocessing synchronization. Real-time thread
preemption can occur only when all three of the following conditions
are met:

1. The processor SPL is zero. This state indicates that all
   interrupts are enabled and implies that no code is executing in an
   ISR or is modifying kernel data shared with an ISR. On a
   nonpreempting uniprocessor kernel, SPL synchronization alone is
   adequate to protect shared data structures. SPL is a processorwide
   rather than a systemwide characteristic. Consequently, raising the
   SPL to interrupt level is inadequate protection on a
   multiprocessing system, in which one processor's SPL has no effect
   on another's. The classic multiprocessing solution to this problem
   is to combine SPL synchronization with mutual-exclusion spin locks
   to block out other processors as well as ISRs.

2. No simple locks (spin locks) are held. This state is represented
   in the Mach and OSF/1 kernel code by a call to the simple_lock()
   routine. This call signifies that the code has entered a critical
   section where shared data will be modified. On a uniprocessor,
   calling the simple_lock() routine actually increments a global
   spin lock count; unlocking decrements that count. If the count is
   zero, then an attempt to preempt the current process can be made.
   In this uniprocessor implementation, no actual spin locks exist in
   memory, and nothing is locked in the physical sense of a lock bit
   being checked for a state of zero or one. By contrast, on a
   multiprocessing system, real locking, not lock counting, is
   required; therefore, spin locks occupy real memory. On a
   multiprocessing system, locking a spin lock involves testing the
   lock location for a value of zero and then atomically setting the
   value to one before continuing into the critical code section,
   assured of exclusive access. If another processor finds the lock
   bit set (i.e., nonzero), it will repeatedly test the lock location
   and thus "spin" until the lock value becomes zero when unlocked by
   its previous holder. Because processors make no progress while
   they attempt to obtain a spin lock, such a lock is meant to be
   held for bounded, ideally brief, periods. Extensive or unbounded
   accesses require the use of complex locks (blocking read/write
   locks), by which a thread sleeps until a locked resource becomes
   available and unlocked. (Sleeping to obtain a complex lock is by
   definition a preemption point.)

3. The code is not funneled to the master processor. This state is
   another way by which OSF/1 kernel code delineates a critical code
   section. Funneling forces code to run on a single processor
   designated as the master processor. Funneling allows device
   drivers and entire kernel subsystems that have not been adapted to
   a concurrent-execution environment with simple_lock() calls to
   modify kernel data safely. On a preempting uniprocessor, funneling
   is represented simply as a per-thread flag that prevents
   preemption when set; no context switching is required to cause
   funneling. By contrast, on a multiprocessing system, funneling to
   the master processor may involve an actual context switch from the
   funneling thread's current processor---an expensive form of
   synchronization.

Prior to DEC OSF/1 version 3.0, all UNIX process subsystem
components, including the fork(), exec(), wait(), and exit() routines
and signal logic, were not safe for preemption and were therefore
funneled. All modifications to process data structures could occur
only on the master processor. This situation eliminated concerns
about access to those structures from another processor but at the
same time virtually eliminated the parallelism of the process
subsystem. For example, for the fork() system call, the list of
active processes in the system (allproc) was traversed in funneled
code. Clearly, funneling this fundamental resource introduces
significant latencies into the system's response to scheduling
events. In multiprocessing terms, no process-level operations can
execute in parallel.

The development of the DEC OSF/1 real-time kernel leveraged the
existing OSF/1 SPL, locking, and funneling constructs to implement
preemption on uniprocessor Alpha systems. This work provided a
valuable product feature and was a preview of the effort that would
be required to adapt the OSF/1 code for the DEC 2000, 4000, and 7000
multiprocessing platforms.
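The spin lock mechanics described above can be summarized in a short,
illustrative C sketch. This is not the kernel's simple_lock()
implementation---which layers SPL handling, uniprocessor lock
counting, and debugging modes on top of this core---but a minimal
test-and-set loop whose atomic step a compiler implements on Alpha
with the load-locked/store-conditional sequence. All names here are
invented.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag locked;   /* initialize with ATOMIC_FLAG_INIT; */
    } slock_t;                /*   clear = unlocked, set = locked  */

    void slock_acquire(slock_t *l)
    {
        /* Atomically test the lock and set it; spin while another
         * processor holds it.  Holders are expected to release
         * quickly, so the wait is bounded and brief. */
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ;
    }

    void slock_release(slock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }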
Supporting separate preemptive kernels for three versions prior to
DEC OSF/1 version 3.0, combined with the developers' experience on
other multiprocessing systems (including ULTRIX version 4 and an
advanced development project using MIPS multiprocessing platforms),
uncovered the following challenges and problems that the team had to
overcome to produce a competitive multiprocessing product:

o  Supporting two complete sets of kernel binary objects---base and
   real-time---was burdensome for the operating system engineers and
   awkward for third-party developers. Therefore, the DEC OSF/1
   multiprocessing product team had to strive to ship a single,
   unified set of kernel binaries. This set should encompass the full
   range of real-time features, including preemption and POSIX
   fixed-priority scheduling. For that to be practical, the resulting
   multiprocessing kernel would have to perform as well on a
   uniprocessor as the non-SMP kernel.

o  Diagnosing locking problems in the preemptive kernel was expensive
   in developer time. The process required painstaking inspection of
   the simple-locking source code, which is often disguised in
   subsystem-specific macros. Locking or unlocking a spin lock
   multiple times or not unlocking it at all (usually in code loops)
   would disable preemption well beyond the end of a critical section
   or enable it before the end. A coherent locking architecture with
   automated debugging facilities was needed to ship a reliable
   product on time. The lock-debugging facility present in the
   original OSF/1 code proved inadequate for the task.

o  Experiments with the real-time kernel revealed unacceptable
   preemption latencies, especially in funneled code paths. This
   deficiency indicated that, when moved to a multiprocessing
   platform, the existing kernel would fail to use additional
   processors effectively. That is, the kernel would not exhibit
   adequate parallelism to scale effectively. Clearly, major work was
   required to significantly increase parallelism in the kernel. This
   task would likely involve removing most uses of funneling,
   eliminating some spin locks, and adding other spin locks to create
   a finer granularity of locking.

ADAPTING THE BASE OPERATING SYSTEM FOR SYMMETRIC MULTIPROCESSING

Making the leap from a preemptive uniprocessor kernel to an effective
SMP implementation, built from a single set of kernel binaries,
required contributions from the OSF/1 version 1.2 and the DEC OSF/1
version 3.0 projects. Fundamental changes were introduced into the
system to support SMP.

The basic approach planned by the SMP project team was first to
bootstrap the DEC OSF/1 version 1.3 kernel on the existing Alpha
multiprocessing platforms. This task was accomplished by funneling
all major subsystems to a single processor while stabilizing the
underpinnings of the multiprocessing system (i.e., the scheduler, the
virtual memory subsystem, the virtual file system, and the hardware
support) in the new environment. This approach allowed the team to
make progress in understanding the scope of the effort while
analyzing the multiprocessing requirements of each kernel subsystem.
The in-depth analysis was necessary to identify those subsystems in
the kernel that required modifications to run safely and efficiently
under SMP. As each subsystem was confirmed to exhibit parallelism or
was made parallel, it was unfunneled and thus freed to run on any
processor. This process was iterative.
If incorrectly parallelized, a subsystem will reveal itself by (1)
leaving data incorrectly unprotected and thus open for corruption and
(2) developing deadlocks, i.e., situations in which each of two
threads holds a spin lock required by the other thread, so that
neither thread can take the lock and proceed.

The efforts at parallelizing the kernel fell into two classes of
modification: lock-based synchronization to ensure multiprocessing
correctness and algorithmic changes to increase the level of
parallelism achieved.

Lock-based Synchronization

The code base on which the DEC OSF/1 product is built, i.e., the Open
Software Foundation's OSF/1 software, provides a strong foundation
for SMP. The OSF further strengthened this foundation in OSF/1
versions 1.1 and 1.2, when it corrected multiple SMP problems in the
code base and parallelized (and thus unfunneled) additional
subsystems. As the multiprocessing bootstrap effort continued, the
team analyzed and incorporated the OSF/1 version 1.2 SMP improvements
into DEC OSF/1 version 3.0. As strong as this starting point was,
however, some structures in the system did not receive the
appropriate level of synchronization. The team corrected these
problems as they were uncovered through testing and code inspection.

The DEC OSF/1 operating system uses a combination of simple locks,
complex locks, elevated SPL, and funneling to guarantee synchronized
access to system resources and data structures. Simple locks, SPL,
and funneling were described briefly in the earlier discussion of
preemption. Complex locks, like elevated SPL, are used in both
uniprocessor and multiprocessor environments. These locks are usually
sleep locks---threads can block while they wait for the lock---which
offer additional features, including multiple-reader/single-writer
access and recursive acquisition. An example of the use of each
synchronization technique follows:

o  A simple lock is used to protect the kernel's callout (timer)
   queue. In an SMP environment, multiple threads can update the
   callout queue at the same time, as each of them adds a timer entry
   to the queue. Each thread must obtain the callout lock before
   adding an entry and release the lock when done. The callout simple
   lock is also a good example of SPL synchronization under
   multiprocessing because the callout queue is scanned by the system
   clock ISR. Therefore, before taking the callout lock, a thread
   must raise the SPL to the clock's IPL. Otherwise, a thread holding
   the callout lock at an SPL of zero can be interrupted by the clock
   ISR, which will in turn attempt to take the callout lock. The
   result is a permanent deadlock.

o  A complex lock protects the file system directory structure. A
   blocking lock is required because the directory lock holder must
   perform I/O to update the directory, an operation that can itself
   block. Whenever blocking can occur while a lock is held, a complex
   lock is required.

o  Funneling is used to synchronize access to the ISO 9660 CD-ROM
   file system.[7] The decision to funnel this file system was
   largely due to limitations in the DEC OSF/1 version 3.0 schedule;
   however, the file system is a good choice for funneling because of
   its generally slow operation and light usage.

To ensure adequate performance and scaling as processors are added to
the system, an SMP implementation must provide for as much
parallelism through the kernel as possible. The granularity of locks
placed in the system has a major impact on the amount of parallelism
obtained.
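Returning to the callout example, the following sketch shows the
combined SPL-plus-spin-lock discipline in C. The splclock()/splx()
names follow traditional UNIX usage, and the queue structure is
invented for illustration; only the ordering---raise the SPL first,
then take the lock---reflects the rule described above.

    extern int  splclock(void);          /* raise SPL to clock IPL  */
    extern void splx(int s);             /* restore the saved SPL   */
    extern void simple_lock(int *lock);
    extern void simple_unlock(int *lock);

    struct callout {
        struct callout *next;
        /* ... expiration time, function, argument ... */
    };

    static int callout_lock;
    static struct callout *callout_head;

    void callout_enqueue(struct callout *c)
    {
        int s = splclock();          /* block the clock ISR locally  */
        simple_lock(&callout_lock);  /* exclude the other processors */

        c->next = callout_head;      /* critical section: update the */
        callout_head = c;            /*   timer queue                */

        simple_unlock(&callout_lock);
        splx(s);                     /* interrupts re-enabled last   */
    }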
During multiprocessing development, locking strategies were designed
to

o  Reduce the total number of locks per subsystem

o  Reduce the number of locks taken per subsystem operation

o  Improve the level of parallelism throughout the kernel

At times, these goals clashed: enhancing parallelism usually involves
adding a lock to some structure or code path. This outcome conflicts
with the goal of reducing lock counts. Consequently, in practice, the
process of successfully parallelizing a subsystem involves striking a
balance between lock reduction and the resulting increase in lock
granularity. Often, benchmarking different approaches is required to
fine-tune this balance.

Several general trends were uncovered during lock analysis and
tuning. In some cases, locks were removed because they were not
needed; they were the products of overzealous synchronization. For
example, a structure that is private to a thread may require no
locking at all. Similarly, a data element that is read atomically
needs no locking. An example of lock removal is the gettimeofday()
system call, which is used frequently by DBMS servers. The system
call simply reads the system time, a 64-bit quantity, and copies it
to a buffer provided by the caller. The original OSF/1 system call,
running on a 32-bit architecture, had to take a simple lock before
reading the time to guarantee a consistent value. On the Alpha
architecture, the system call can read the entire 64-bit time value
atomically. Removing the lock resulted in a 40 percent speedup.

In other cases, analyzing how structures are used revealed that no
locking was needed. For example, an I/O control block called the buf
structure was being locked in several device drivers while the block
was in a state that allowed only the device driver to access it.
Removing these unnecessary locks saved one complex and one simple
locking sequence per I/O operation in these drivers.

Another effective optimization involved postponing locking until a
thread determined that it had actual work to do. This technique was
used successfully in a routine frequently called in a transaction
processing benchmark. The routine, which was locking structures in
anticipation of following a rarely used code path, was modified to
lock only when the uncommon code path was needed. This optimization
significantly reduced lock overhead. To improve parallelism across
the system, the DEC OSF/1 SMP development team modified the lock
strategies in numerous other cases.

Algorithm Changes

In some instances, the effective migration of a subsystem to the
multiprocessing environment required significant reworking of its
fundamental algorithms. This section presents three examples of this
work. The first example involves the rework of the process management
subsystem; the second example is a new technique for a thread to
refer to its own state; and the third example deals with enhancements
in translation buffer coherency, or "shootdown."

Managing Processes and Process State. Early versions of the DEC OSF/1
software maintained a set of systemwide process lists, most notably
proc (the static proc structure array), allproc (the active process
list), and zomproc (the zombie process list). These lists tend to be
fairly long and are normally traversed sequentially. Operations
involving access to these lists include process creation (fork()),
signal posting, and process termination. The original OSF/1 code
protected these process lists and the individual proc structures
themselves by means of funneling.
This meant that virtually every system call that involved process
state, such as exit(), wait(), ptrace(), and sigaction(), was also
forced into a single funnel. Experience with real-time preemption
indicated that this approach would exact excessive multiprocessing
costs. Although it would have been possible to protect these lists
with locks, the development team decided that this basic portion of
the kernel must be optimized for maximum multiprocessing performance.
The OSF also recognized the need for optimization; it addressed the
problem in OSF/1 version 1.2 by adopting a redesign of process
management developed for the Multimax systems by Encore Computer
Corporation. The DEC OSF/1 team adopted and enhanced this design for
handling process lists, process management system calls, and signal
processing.

The redesign replaces the statically sized array of proc structures
with an array of smaller process identification (PID) entry
structures. Each PID entry structure potentially points to a
dynamically allocated proc structure. Under this new scheme, finding
the proc structure associated with a user PID has been reduced to
hashing the PID value to an index into the PID entry array. The
process state associated with that PID (active, zombie, or
nonexistent) is maintained in the PID entry structure. This design
allows process structures to be allocated dynamically, as needed,
rather than statically at boot time, as before. Simple locks were
also added to the process structure to allow multiple threads in the
process to perform process management system calls and signal
handling concurrently. These changes allowed process management
funneling to be removed entirely, which significantly improved the
degree of parallelism in the process management subsystem.

Accessing Current Thread State. One critical design choice in
implementing SMP on the DEC OSF/1 system concerned how to access the
state of the currently running thread. This state includes the
current thread's process, task, and virtual memory structures, and
the so-called uarea, which contains the pageable UNIX state. Access
to this state, which threads require frequently as they run in kernel
context, must have low overhead. Further, because the DEC OSF/1
operating system supports kernel-mode preemption, the method for
accessing the current thread's state must work even if a context
switch to another CPU occurs during the access operation.

The original OSF/1 code used arrays indexed by the CPU number to look
up the state of a running thread. One of these arrays was the
U_ADDRESS array, which was used to access the currently active uarea.
The U_ADDRESS array was loaded at context switch time and accessed
while the thread executed. Before the advent of multiprocessing, the
CPU number was a compile-time constant, so that thread-state lookup
involved simply reading a global variable to form the pointer to the
data. Adding multiprocessing support meant changing the CPU number
from a constant to the result of the WHAMI ("Who am I?") PALcode call
to get the current CPU number. (PALcode is the
operating-system-specific privileged architecture library that
provides control over interrupts, exceptions, context switching,
etc.[8]) Using such global arrays for accessing the current thread's
state presented three shortcomings:

1. The WHAMI PALcode call added a minimum overhead of 21 machine
   cycles on the AlphaServer 2100 server, not including further
   overhead due to cache misses or instruction stream stalls.
   The multiprocessing team felt that this was too large a
   performance price to pay.

2. Allowing multiple CPUs to write sequential pointers caused cache
   thrashing and extra overhead during context switching.

3. Indexing by CPU number was not a safe practice when kernel-mode
   preemption is enabled. A thread could switch processors in the
   middle of an array access, and the wrong pointer would be fetched.
   Providing additional locking to prevent this had unacceptable
   performance implications because the operation is so common.

These problems convinced the team that a new algorithm was required
for accessing the current thread's state. The solution selected was
modeled on the way the OpenVMS VAX system uses the processor
interrupt stack pointer to derive the pointer to per-CPU state.[9] In
the OSF/1 system, each thread has its own kernel stack. By aligning
this stack on a power-of-two boundary, a simple masking of the stack
pointer yields a pointer to the per-thread data, such as the process
control block (PCB) and uthread structure. Any data item in the
per-thread area can be accessed with the following code sequence:

    lda  r16, MASK        # Get mask value
    bic  sp, r16, r0      # Mask stack pointer to point to stack base
    ldq  rx, OFFSET(r0)   # Add offset to base and fetch item

Accessing thread state using the kernel stack pointer solves all
three problems with CPU-number-based indexing. First, this technique
has very low overhead; accessing the current thread's data involves
only a simple masking operation and a read operation. Second, using
the kernel stack pointer incurs no extra overhead during context
switching because the pointer has to be loaded for other uses anyway.
Third, because thread stack areas occupy separate pages, no cache
conflicts exist between threads running on different processors.
Finally, the data access can be preempted at any point, and the
correct pointer is still fetched, because no processor-dependent
state is used to access the current thread's state.

Interprocessor Translation Lookaside Buffer Shootdown. Alpha
processors employ translation lookaside buffers (TLBs) to speed up
the translation of virtual-to-physical mappings. The TLB caches page
table entries (PTEs) that contain virtual-to-physical address
mappings and access control information. Unlike data cache coherency,
which the hardware maintains, TLB cache coherency is a task of the
software. The DEC OSF/1 system uses an enhanced version of the TLB
shootdown algorithm developed for the Mach kernel to maintain TLB
coherency.[10] First, a modification to the original shootdown
algorithm was needed to support the Alpha architecture's address
space numbers (ASNs). Second, a synchronization feature of the
original algorithm was removed entirely to enhance shootdown
performance. This feature provided synchronization for architectures
in which the hardware can modify PTEs, such as the VAX platform; the
added protection is unnecessary for the Alpha architecture.

The final shootdown algorithm is as follows. The physical map (PMAP)
is the software structure that holds the virtual-to-physical mapping
information. Each task within the system has a PMAP; operating system
mappings have a special kernel PMAP. Each PMAP contains a list of
processors currently using the associated address space. To initiate
a virtual-to-physical translation change, a processor (the initiator)
first locks the PMAP to prevent any other threads from modifying it.
Next, the initiator updates the PTE mapping in memory and flushes the
local TLB.
The processor then sends an interprocessor interrupt to all other
processors (the responders) that are currently active in the same
address space. Each responder must acknowledge the initiator and
invalidate its own mapping. The initiator marks the nonactive
processors' ASNs inactive, spins while it waits for the active
processors to check in, and then unlocks the PMAP. Once all
responders are accounted for, the initiator is free to continue with
the knowledge that all TLBs are coherent on the system. Figure 1
shows this final TLB shootdown algorithm as it progresses from the
initiating processor to the potential responding processors.

Figure 1  Translation Lookaside Buffer Shootdown Algorithm

    Initiator:
        Lock the PMAP.
        Update the translation map (PTE).
        Invalidate the processor TLB entry.
        Send an interprocessor interrupt to all processors that are
          using the PMAP.
        Mark the nonactive processors' ASNs inactive.
        Spin while waiting for other processors to check in.
        Unlock the PMAP.

    Responders:
        Acknowledge the shootdown.
        Invalidate the processor TLB entry.
        Return from the interrupt.

DEVELOPING THE LOCK PACKAGE

Key to meeting the performance and reliability goals for the
multiprocessing portion of the DEC OSF/1 version 3.0 release was the
development of a lock package with the following characteristics:

o  Low execution and memory overhead

o  Flexible support for both uniprocessor and multiprocessor
   platforms, with and without real-time preemption

o  Automated debugging facilities to detect incorrect locking
   practices at run time

o  Statistical facilities to track the number of locks used, how many
   times a lock is taken, and how long threads wait to obtain locks

Of course, the overall role of the lock package is to provide a set
of synchronization primitives, that is, the simple and complex locks
described in earlier sections. To support kernel-mode thread
preemption, DEC OSF/1 version 1.0 had extended the lock package
originally delivered with OSF/1 version 1.0. Early in the DEC OSF/1
version 3.0 project, the development team extended the package again
to optimize its performance and to add the desired debugging and
statistical features.

As previously noted, a major goal for DEC OSF/1 version 3.0 was to
ship a single version of its kernel objects, instead of the base and
real-time sets of previous releases. Therefore, simple locks would
have to be compiled into the kernel, even for kernels that would run
only on uniprocessor systems. Achieving this goal required minimizing
the size of the lock structure; it would be unacceptable to have
hundreds of kilobytes (KB) of memory dedicated to lock structures in
systems that did not use such structures. Further, the simple lock
and unlock invocations required by the multiprocessing code would
have to be present for all platforms, which raised serious
performance issues for uniprocessor systems. In fact, in the original
OSF/1 lock package, the CPU overhead cost of compiling in the lock
code was between 1 and 20 percent. Compute-intensive benchmarks
showed the cost to be less than 5 percent, but the cost for multiuser
benchmarks was greater than 10 percent, which represents an
unacceptable performance degradation. To meet the goal of a single
set of binaries, the development team had to enhance the lock package
to be configurable at boot time. That is, the package needed to be
able to tailor itself to fit the configuration and real-time
requirements of the platform on which it would run.
The lock package supplied by the OSF/1 system was further deficient
in that it did not support error checking when locks were asserted.
This deficiency left developers open to the most common tormentor of
concurrent programmers: deadlocks. Without error checking, potential
system hangs caused by locks being asserted in the wrong order could
go undetected for years and be difficult to debug. A formal locking
order, or hierarchy, for all locks in the system had to be
established, and the lock package needed the ability to check the
hierarchy on each lock taken.

These needs were met by introducing the notion of lock mode to the
lock package. Developers defined the following five modes and
associated roles:

o  Mode 0: No lock operations; for production uniprocessor systems

o  Mode 1: Lock counting only to manage kernel preemption; for
   production real-time uniprocessor systems

o  Mode 2: Locking without kernel preemption; for production
   multiprocessing systems

o  Mode 3: Locking with kernel preemption; for production real-time
   multiprocessing systems

o  Mode 4: Full lock debugging with or without preemption; for any
   development system

The default uniprocessor lock mode is 0; the multiprocessing default
is lock mode 2. Both selections favor non-real-time production
systems. The system's lock mode, however, can be selected at boot
time by a number of mechanisms.

Lock modes are implemented through a dynamic lock configuration
scheme that essentially installs the appropriate set of lock
primitives for the selected lock mode. Installation is realized by
patching the compiled-in function calls, such as simple_lock(), to
dispatch to the corresponding lock primitive for the selected lock
mode. This technique avoids the overhead of dispatching indirectly to
different sets of lock primitives on each call, based on the lock
mode. The compiled-in lock function calls to the lock package are all
entry points that branch to a call-patching routine called
simple_lock_patch(). This routine either patches out the calling
machine instruction (for lock mode 0) or changes it to branch to the
corresponding primitive in the appropriate set of actual primitives,
and then branches there (for lock modes 1 through 4). Thus, the
overhead for dynamically switching between the versions of simple
lock primitives occurs only once for each code path. In the case of
lock mode 0, calls to simple lock primitives are "back patched" out.
Under this model, uniprocessor systems pay a one-time cost to invoke
the simple lock primitives, after which the expense of executing a
lock primitive is reduced to executing a few no-op instructions where
the code for the lock call once resided.

To address memory consumption issues and to provide better system
debug capabilities, the developers reorganized the lock data
structures around the concept of the lockinfo structure. This
structure is an encapsulation of the lock's ordering (hierarchical
relationship) with surrounding locks and its minimum SPL requirement.
Lock debugging information and the lock statistics were decoupled
from the lock structures themselves.

To facilitate the expression of a lock hierarchy, the developers
introduced the concepts of classes and instances. A lock class is a
grouping of locks of the same type. For example, the process
structure lock constitutes a lock class. A lock instance is a
particular lock of a given class. For example, one process structure
simple lock is an instance of the class process structure lock.
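The following C fragment sketches the effect of boot-time lock
configuration. All names are invented, and the sketch uses an
indirect call through a function pointer for simplicity; as described
above, the shipped kernel goes one step further and back-patches each
compiled-in call site so that even this single level of indirection
is paid only once per code path.

    static void lock_noop(int *l)  { /* mode 0: patched to no-ops   */ }
    static void lock_count(int *l) { /* mode 1: preemption counting */ }
    static void lock_spin(int *l)  { /* modes 2 and 3: real locking */ }
    static void lock_debug(int *l) { /* mode 4: locking plus checks */ }

    static void (*simple_lock_op)(int *) = lock_noop;

    /* Called once during bootstrap with the selected lock mode. */
    void lock_configure(int lockmode)
    {
        switch (lockmode) {
        case 0:          simple_lock_op = lock_noop;  break;
        case 1:          simple_lock_op = lock_count; break;
        case 2: case 3:  simple_lock_op = lock_spin;  break;
        case 4:          simple_lock_op = lock_debug; break;
        }
    }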
Error checking and statistics gathering are performed on a lock-class
basis and only in lock mode 4. Decoupling the lock debugging
information from the lock itself significantly reduced the sizes of
the simple and complex lock structures to 8 and 32 bytes,
respectively. Embedded in both structures is a 16-bit index into the
lockinfo structure table for that particular lock class. The lockinfo
table is dynamically created at system startup in lock mode 4. All
classes in the system are assigned a relative position in a single
unified lock hierarchy. A lock class's position in the lockinfo table
is also its position in the lock hierarchy; that is, locks must be
taken in the order in which they appear in the table. Lock statistics
are also maintained on a per-class basis, with separate entries for
each processor. Keeping lock statistics per processor and separating
this information by cache blocks eliminates the need to synchronize
lock-primitive access to the statistics. This design, which is
illustrated in Figure 2, prevents the negative cache effects that
could result from sharing this data.

[Figure 2 (Lock Structure) is a diagram showing multiple lock
instances, each carrying the 16-bit index of its lock class entry in
the lockinfo table; the class entry in turn points to an array of
per-processor lock statistics entries (CPU 0 through CPU N), kept in
separate cache blocks.]

Once this powerful lock package was operational, developers analyzed
the lock design of their kernel subsystems and attempted to place the
locks used into classes in the overall system lock hierarchy. The
position of a class depends on the order in which its locks are taken
and released in relation to other locks in the same code path and in
the system. At times, this static lock analysis revealed problems in
existing lock protocols, in which locks were taken in varying orders
at different points in the code. Clearly, such a lock protocol needed
to be reworked to produce a consistent order that could be codified
in the hierarchy. Thus, the exercise of producing an overall lock
hierarchy resulted in a significant cleanup of the original
multiprocessing code base.

To add a new lock to the system, a developer would have to determine
the hierarchical position for the new lock class and the minimum SPL
at which the lock must be taken. Running the system in lock mode 4
and exercising the code paths of interest provided developers with
immediate feedback on their lock protocols. Using the hierarchy and
SPL information stored in the run-time lockinfo table, the lock
primitives aggressively check for a variety of locking errors, which
include the following:

o  Locking a lock out of hierarchical order

o  Locking a simple lock at an SPL below the required minimum

o  Locking a simple lock already held by the caller

o  Unlocking an unlocked simple lock

o  Unlocking a simple lock owned by another CPU

o  Locking a complex lock with a simple lock held

o  Locking a complex lock at interrupt level

o  Sleeping with a simple lock held

o  Locking or unlocking an uninitialized lock

Encountering any of these violations results in a lock fault, i.e., a
system bug check that records the information the developer requires
to quickly track down the lock error.
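A minimal sketch of the mode-4 ordering checks follows. The structure
layouts and function names are invented for illustration; only the
rule they enforce---classes must be locked in increasing
lockinfo-table order, at or above the class's minimum SPL---comes
from the description above.

    extern void panic(const char *msg);

    struct lockinfo {
        unsigned short hier_pos;   /* position in the lock hierarchy */
        unsigned short min_spl;    /* minimum SPL for this class     */
    };
    extern struct lockinfo lockinfo_tab[];

    struct slock {
        unsigned long  state;      /* the lock word itself           */
        unsigned short li_index;   /* 16-bit index into lockinfo_tab */
    };

    /* Called from the debug lock primitive before a lock is taken.
     * highest_held is the hierarchy position of the highest-ordered
     * lock the caller already holds; cur_spl is the processor SPL. */
    void lock_order_check(struct slock *l,
                          unsigned short highest_held,
                          unsigned short cur_spl)
    {
        struct lockinfo *li = &lockinfo_tab[l->li_index];

        if (li->hier_pos <= highest_held)
            panic("simple lock taken out of hierarchical order");
        if (cur_spl < li->min_spl)
            panic("simple lock taken below its minimum SPL");
    }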
The reduction in lock sizes and the major enhancement of the lock
package enabled the team to realize its goal of a single set of
kernel binaries. Benchmarks that compare a pure uniprocessor kernel
and a kernel in lock mode 0, both running on the same hardware, show
a less than 3 percent difference in performance, a cost the team
considered well worth the many advantages of returning to a unified
kernel. Moreover, the debugging capabilities of the lock package,
with its hierarchical scheme, streamlined the process of lock
analysis and provided precise and immediate feedback as developers
adapted their subsystems to the multiprocessing environment.

ADAPTING THE SCHEDULER FOR MULTIPROCESSING

The normal scheduling behavior, i.e., policy, of the OSF/1 system is
traditional UNIX time-sharing. The system time-slices processes based
on a time quantum and adjusts process priorities to favor interactive
jobs over compute-intensive jobs. To support the POSIX real-time
standard, the DEC OSF/1 system incorporates two additional
fixed-priority scheduling policies: first in, first out (POLICY_FIFO)
and round robin (POLICY_RR). A time-share thread's priority degrades
with CPU usage; the more recent the thread's CPU usage, the more its
priority degrades. (Note that OSF/1 scheduling entities are threads
rather than processes.) In contrast, a fixed-priority thread never
suffers priority degradation. Instead, a POLICY_RR thread runs until
it blocks voluntarily, is preempted by a higher-priority thread, or
exhausts a quantum (and even then, the round-robin scheduling applies
only to threads of equal priority). A POLICY_FIFO thread has no
scheduling quantum; it runs until it blocks or is preempted. These
specialized policies are used by real-time applications and by
threads created and managed by the kernel. Examples of these kernel
threads include the swapper and paging threads, device driver
threads, and network protocol handlers.

A feature called thread binding, or hard affinity, was added to DEC
OSF/1 version 3.0. Binding allows a user or the kernel to force a
thread to run only on a specified processor. Binding supports the
funneling feature used by unparallelized code and the bind_to_cpu()
system call.

The goal of a multiprocessing operating system in scheduling threads
is to run the top N priority threads on N processors at any given
time. A simple way to accomplish this would be to schedule threads
that are not bound to a CPU in a single, global run queue and to
schedule bound threads in run queues local to their bound processors;
a sketch of this scheme appears after the following list. When a
processor reschedules, it selects the highest-priority thread
available in the local or the global run queue. Scheduling threads
out of a global run queue is highly effective at keeping the N
highest-priority threads running; however, two problems arise with
this approach:

1. A single run queue leads to contention between processors that are
   attempting to reschedule, as they race to lock the run queue and
   remove the highest-priority thread.

2. Scheduling with a global run queue does not take advantage of the
   cache state that a thread builds on the CPU where it last ran. A
   thread that migrates to a different processor must reload its
   state into the new processor's cache, which can substantially
   degrade performance.
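Here is that straightforward local-plus-global scheme sketched in C,
with all types and names invented for illustration. It shows why the
global queue keeps high-priority threads running---every rescheduling
processor consults it---and also hints at its costs: the global queue
must be locked on every reschedule, and the winner runs wherever it
is dispatched, with no regard for cache state.

    #include <stddef.h>

    struct thread { int pri; /* lower value = higher priority */ };

    struct runq;  /* priority-ordered queue of runnable threads */
    extern struct thread *runq_peek(struct runq *q); /* NULL if empty */
    extern struct thread *runq_take(struct runq *q); /* dequeue head  */

    /* Select the next thread for this processor: the higher-priority
     * head of the processor's local (bound) queue and the global
     * (unbound) queue.  Locking is omitted for brevity. */
    struct thread *select_next(struct runq *local, struct runq *global)
    {
        struct thread *lt = runq_peek(local);
        struct thread *gt = runq_peek(global);

        if (lt == NULL && gt == NULL)
            return NULL;                     /* idle: nothing to run */
        if (gt == NULL || (lt != NULL && lt->pri <= gt->pri))
            return runq_take(local);
        return runq_take(global);
    }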
To help preserve cache state and reduce wasteful global run queue
contention, the developers enhanced the multiprocessing scheduler by
adding two new scheduling models: a soft-affinity scheduling model
for time-share threads and a last-processor-preference model for
fixed-priority threads. Under these models, each processor schedules
time-share and bound threads in its local run queue, and it schedules
unbound fixed-priority threads out of a global run queue.
Fixed-priority threads scheduled from a global run queue are able to
run as soon as possible. This behavior is required for high-priority
tasks like kernel threads and real-time applications. The
last-processor-preference model gives a fixed-priority thread a
preference for running on the processor where it last ran; if that
processor is busy, the thread runs on the next available processor.

Each time-share thread is softly bound to the processor on which it
last ran; that is, the thread shows an affinity for that processor.
Unlike funneling or user binding, which support hard (mandatory)
affinity, soft affinity allows a thread to run elsewhere if it is
advantageous, i.e., if such activity balances the load. Otherwise,
the softly bound thread tries to return to the processor where it
last ran and where its recent cache state may still reside. Under
load, however, a soft-affinity model used alone can degenerate to a
state where one processor builds up a large queue of threads, leaving
the other processors with little to do and thus diminishing the
performance of the multiprocessing system. To mitigate these side
effects of soft affinity, the developers paired the soft-affinity
feature with the ability to load-balance the runnable threads in the
system.

To keep the load of time-share jobs spread evenly across processors,
the scheduler must periodically load-balance the system. In addition
to distributing threads evenly across the local run queues in the
system, this load-balancing activity must

o  Cost no more in processing time than it saves

o  Prevent excessive thread movement among processors

o  Recognize and effectively accommodate changes in the job mix

To implement load balancing, each processor maintains a time-share
load average, i.e., the average local run queue depth over the last
five seconds. Each processor updates its own load average on each
system clock tick. Processors also keep track of the time they spend
handling interrupts and running fixed-priority threads, which are not
accounted for in the local run queue depth. Taking a processor's
total potential execution time for a scheduling period and
subtracting the interrupt-processing and fixed-priority run times
yields the total time available on that processor (processor ticks
available) to run time-share threads. In the worst case, a processor
could be completely consumed by fixed-priority threads and/or
interrupt processing and have no time to run time-share threads. In
this extreme case, the scheduler should give no time-share load to
that processor. Adding the time-share load averages of all processors
determines the aggregate time-share load for the system. Similarly,
summing the processor ticks available yields the total time available
on the system for handling time-share tasks.
Using this data, the scheduler calculates the desired load for each
processor once per second, as follows:

                   Processor ticks available x System time-share load
    Desired load = ---------------------------------------------------
                                System ticks available

Load balancing is called for when at least one processor is above and
one is below its desired load by a minimal amount. If this condition
arises, then those processors under their desired loads are declared
to be "out of balance." The next time an out-of-balance processor
reschedules, it will try to take a thread from the local run queue of
a processor that is above its desired load ("thread stealing"). A
processor declares itself back in balance when its current load is
above its desired load or when there are no eligible threads to
steal. Figure 3 shows a simplified load-balancing scenario, in which
a processor below its desired load steals a thread from a processor
above its desired load.

[Figure 3 (Load Balancing) is a diagram of CPU 1 through CPU N, each
with a local run queue, plus a shared global run queue. CPU 1, whose
current load of five threads exceeds its desired load of four, is
labeled out of balance; CPU 2, with a current load of three threads,
steals one thread from CPU 1's local run queue. On each reschedule,
the highest-priority thread between the local run queues and the
global run queue wins the processor.]

To help preserve the cache benefits of soft affinity, a thread is
eligible for stealing only when it has not run on its current
processor for some configurable number of clock ticks. After this
time has elapsed without the thread running, the chance that it
retains significant cache state has diminished sufficiently that the
thread is more likely to benefit from migrating to another processor
and running immediately than from waiting longer to run on its
current processor.

To demonstrate that soft affinity with load balancing improves
multiprocessing performance through cache benefits and the
elimination of run queue contention, developers ran a simple test
program. The program, which writes 128 KB of data, yields the
processor, and then reads the same data back, was run on a
four-processor DEC 7000 system. Table 1 shows the results of running
multiple copies of this program with and without soft affinity and
load balancing in operation. Performance benefits appear only when
multiple copies of the program begin piling up in the run queues at
the 16-job level. Prior to this point, each job keeps running on the
same processor; the cache it had just filled still holds its data
when the program reads it back---the ideal case. At the 16-job level,
the four processors must be time-shared. The jobs that are running
with soft affinity now benefit significantly because they continue to
run on the same processor and thus find some of their cache state
preserved from when they last ran. The systems that schedule from a
global run queue provide no such benefit. Jobs take longer to
complete, since they are likely to run on a different processor for
each time slice and find no cache state that they can reuse.
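To summarize the load-balancing arithmetic, the once-per-second
calculation described above can be sketched in a few lines of C. All
structure and variable names are invented; the arithmetic is the
desired-load formula given earlier, with a guard for the extreme case
of a processor that has no ticks available for time-share work.

    struct cpu {
        long ts_load;       /* time-share load average             */
        long ticks_avail;   /* ticks left after interrupt and      */
                            /*   fixed-priority processing         */
        long desired_load;  /* output of the calculation           */
    };

    void compute_desired_loads(struct cpu *cpus, int ncpus)
    {
        long sys_load = 0, sys_ticks = 0;
        int  i;

        for (i = 0; i < ncpus; i++) {
            sys_load  += cpus[i].ts_load;
            sys_ticks += cpus[i].ticks_avail;
        }
        for (i = 0; i < ncpus; i++) {
            /* A processor with no available ticks (consumed by
             * interrupts or fixed-priority threads) gets no
             * time-share load. */
            cpus[i].desired_load = (sys_ticks > 0)
                ? (cpus[i].ticks_avail * sys_load) / sys_ticks
                : 0;
        }
    }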
Table 1  Benefits of Soft Affinity with Load Balancing (SA/LB)

    Number     Time with SA/LB    Time without         Benefit from
    of Jobs    (Seconds)          SA/LB (Seconds)      SA/LB (Percent)
     1          25.9               26.0                 0
     4          25.9               26.0                 0
    16         106.5              141.9                25

The soft-affinity and load-balancing features improved many other
multiuser benchmarks. For example, a transaction processing benchmark
showed a 17 percent performance improvement.

FOCUSING ON QUALITY

The error-checking focus of the lock package is just one example of
how the DEC OSF/1 version 3.0 project focused on the quality and
stability of the product. Most members of the multiprocessing team
had been involved in an SMP development effort prior to their DEC
OSF/1 effort. This past experience, coupled with the difficulties
other vendors had experienced with their own multiprocessing
implementations, reinforced the need for a strong quality focus in
the SMP project. Developers took multiple steps to ensure that the
SMP solution delivered in DEC OSF/1 version 3.0 would be production
quality, including

o  Code reviews

o  Lock debugging

o  In-line assertion checking

o  Multithreaded test suite development for SMP qualification

The base kernel code was reviewed for multiprocessing correctness.
During this review phase, checks were made to ensure that the proper
level of synchronization was employed to protect global data
structures. Numerous defects were uncovered during this process and
corrected. Running code with lock checking enabled provided empirical
evidence of the incremental improvements of the multiprocessing
implementation. Beyond code reviews and lock debugging, internal
consistency checks (assertions) were coded into the kernel to verify
the correctness of operations at key points. Assertion checking was
enabled during the development process to ensure that the kernel was
functioning correctly; it was then compiled out for the production
version of the kernel.

In parallel with the operating system development effort, new
component tests were designed to force as much concurrency as
possible through particular code paths. The core of the test suite is
a thread-race library, which consists of a set of routines that can
be used to construct multithreaded system-call exercisers. The
library provides the ability to start multiple test instances
simultaneously. The individual tests are then combined to form
focused subsystem tests and systemwide tests. These tests have been
used to uncover multiple race conditions in the operating system.

The UNIX development organization had a four-processor DEC 7000
system deployed in its development environment for more than 7 months
prior to releasing the SMP product. This system has been extremely
stable, with few complaints from the user community. Extensive
internal and external field testing produced similar results.

MEASURING MULTIPROCESSING PERFORMANCE OUTCOMES

The major functionality delivered with SMP is improved performance
through application concurrency. The goal of the SMP project was to
provide leadership performance in the areas of compute and data
servers. To gauge success in this effort, several industry-standard
benchmarks were utilized, including SPECrate_INT92, SPECrate_FP92,
and AIM Suite III.

SMP performance is measured in terms of the incremental performance
gained as processors are added to the system. Adding processors by no
means guarantees increased system performance. Systems that have I/O
or memory limitations rarely benefit from introducing additional
CPUs.
Systems that are compute bound tend to have the largest potential for
gain from additional processors. Note that large, monolithic
applications tend to see little performance improvement as processors
are added because such applications employ little concurrency in
their designs.

Performance tuning for SMP was an iterative process that can be
characterized as follows:

1. Collect and analyze performance data:

   o  CPU utilization across the processors

   o  Lock statistics

   o  I/O rates

   o  Context switch rates

   o  Kernel profiling

2. Identify areas that require improvement.

3. Prototype changes.

4. Incorporate changes that demonstrate improvement.

5. Repeat steps 1 through 4.

In reality, the process had two stages for each benchmark. The
initial phase was devoted to driving the system to become compute
bound. The second phase improved code efficiencies.

Figures 4 and 5 show that, as expected, the SPECrate_INT92 and
SPECrate_FP92 benchmarks scale almost linearly. Both of these
benchmarks are compute intensive and make only nominal demands on the
operating system.

[Figure 4 (SPECrate Integer Scaling for Four-CPU Systems) is not
available in ASCII format.]

[Figure 5 (SPECrate Floating-point Scaling for Four-CPU Systems) is
not available in ASCII format.]

AIM Suite III is a multiuser benchmark that stresses multiple
components of an operating system, including the virtual memory
system, the scheduler, UNIX pipes, and the I/O subsystem. Figure 6
shows AIM Suite III results for one and four processors, with a
resulting 3.27-to-4 scaling factor. This equates to a greater than 80
percent scaling factor, a figure well within the goals for this
benchmark at the first multiprocessing release. Efforts to produce
still better results are under way.

[Figure 6 (AIM Suite III Scaling) is not available in ASCII format.]

AIM Suite III scaling appears to be gated by a single test in the
benchmark, i.e., directory search. The goal of this test is to create
and remove a set of files across a limited number of
directories.[10] Because these operations require updating directory
information, only one thread of execution can perform them on a given
directory at a time. Some improvements have been applied to mitigate
this contention, but this single test still impacts the overall
scaling results.

CONCLUSION

The focus of the first release of SMP capabilities for the DEC OSF/1
operating system was to provide leadership SMP performance without
compromising overall system quality. The project team accomplished
this goal by carefully modifying the base operating system to take
advantage of the additional processing power provided. The team paid
particular attention to synchronization, parallel algorithms, and
error and inconsistency detection. Work for future releases will
continue to focus on performance and quality improvements. Other
areas of investigation include features such as processor sets,
stopping and starting CPUs, and more flexible handling of interrupts
as processors are added.

ACKNOWLEDGMENTS

Virtually every phase of this project depended on the teamwork and
cooperation of multiple groups within the UNIX Software Group. The
authors wish to acknowledge the tireless efforts and accomplishments
of that entire organization in making DEC OSF/1 version 3.0 and SMP a
reality.
In particular, we would like to acknowledge the following
contributors who were involved in the SMP project from its earliest
stages: Tim Burke, Dan Christians, Scott Cranston, Richard Flower,
Heather Gray, Gerri Harter, Tim Hoskins, Chet Juszczak, Stan Luke,
Shashi Mangalat, Joe Martin, Ron Menner, Brian Nadeau, Ernie
Petrides, Rajul Shah, Dave Stanley, and Tony Verhulst.

NOTE AND REFERENCES

1.  The OSF/1 operating system, based on Carnegie Mellon University's
    Mach version 2.5 kernel, is developed and distributed by the Open
    Software Foundation. The DEC OSF/1 operating system, based in
    part on the OSF/1 system, is developed and distributed by Digital
    Equipment Corporation. To further clarify the relationship
    between the two products, DEC OSF/1 versions 1.0, 1.2, 1.3, 2.0,
    and 2.1 include code mainly from the OSF/1 version 1.0 software.
    DEC OSF/1 version 3.0 includes code from the OSF/1 version 1.1
    and 1.2 software.

2.  F. Hayes, "Design of the AlphaServer Multiprocessor Server
    Systems," Digital Technical Journal, vol. 6, no. 3 (Summer 1994,
    this issue): 8-19.

3.  R. Rashid, "Threads of a New System (Mach: A Multiprocessor
    Operating System)," UNIX Review (August 1986): 37-49.

4.  M. Accetta et al., "Mach: A New Kernel Foundation for UNIX
    Development," USENIX Summer Proceedings (August 1986): 93-112.

5.  Open Software Foundation, Design of the OSF/1 Operating System
    (Englewood Cliffs, NJ: Prentice-Hall, 1993).

6.  S. Mangalat and D. Bolinger, "Parallelizing Signal Handling and
    Process Management in OSF/1," USENIX Symposium Proceedings
    (November 1991): 105-122.

7.  Information Processing---Volume and File Structure of CD-ROM for
    Information Interchange, ISO 9660 (Geneva: International
    Organization for Standardization, 1988).

8.  R. Sites, ed., Alpha Architecture Reference Manual (Burlington,
    MA: Digital Press, 1992).

9.  R. Gamache and K. Morse, "VMS Symmetric Multiprocessing," Digital
    Technical Journal, vol. 1, no. 7 (August 1988): 57-63.

10. D. Black et al., "Translation Lookaside Buffer Consistency: A
    Software Approach," Proceedings of the Third International
    Conference on Architectural Support for Programming Languages and
    Operating Systems (1989).

TRADEMARKS

The following are trademarks of Digital Equipment Corporation:
AlphaServer, DEC, Digital, and ULTRIX.

Multimax is a trademark of Encore Computer Corporation.

Open Software Foundation is a trademark and OSF/1 is a registered
trademark of Open Software Foundation, Inc.

UNIX is a registered trademark licensed exclusively by X/Open Company
Ltd.

MIPS is a trademark of MIPS Computer Systems, Inc.

BIOGRAPHIES

Jeffrey M. Denham  A principal software engineer in the UNIX Software
Group, Jeff Denham is a contributor to the DEC OSF/1 version 3.0
symmetric multiprocessing effort. Prior to this, he helped add
POSIX.1b features to the DEC OSF/1 operating system and worked on the
VAXELN real-time kernel. Jeff came to Digital in 1986 from Raytheon
Corporation. He holds a B.A. (1979) from Hiram College, an M.A.
(1980) from Tufts University, both in English, and an M.S. (1985) in
Technical Communication from Rensselaer Polytechnic Institute.

Paula Long  Since joining Digital in 1986, Paula Long has contributed
to various operating system projects. Presently a principal software
engineer with the UNIX Software Group, she leads the development of
symmetric multiprocessing (SMP) capabilities for the DEC OSF/1
operating system. In previous positions, she led the DEC OSF/1
real-time and DECwindows on VAXELN projects. Paula received a
B.S.C.S.
from Westfield State College in 1983.

James A. Woodward  Principal software engineer James Woodward is a
member of the UNIX Software Group. He is responsible for DEC OSF/1
symmetric multiprocessing (SMP) processor scheduling and base kernel
support. In previous work, Jim led the ULTRIX SMP project and the VAX
8200, VAX 8800, and VAX 6000 ULTRIX operating system ports. He also
wrote microcode for the VAX 8200 systems as a member of the
Semiconductor Engineering Group. Jim joined Digital in 1981 after
receiving a B.S.E.E. from the University of Michigan.

=============================================================================
Copyright 1994 Digital Equipment Corporation. Forwarding and copying
of this article is permitted for personal and educational purposes
without fee provided that Digital Equipment Corporation's copyright
is retained with the article and that the content is not modified.
This article is not to be distributed for commercial advantage.
Abstracting with credit of Digital Equipment Corporation's authorship
is permitted. All rights reserved.
=============================================================================