The Design and Implementation of the FreeBSD SCSI Subsystem

$Header: //depot/cam/doc/cam.sgml#5$


The Small Computer System Interface protocol offers a high performance interconnect to a wide range of peripherals. This paper describes the implementation of a SCSI framework for the FreeBSD Operating System. This framework expands on the Common Access Method specification for SCSI, providing round-robin prioritized transaction queuing, guaranteed transaction ordering even during error recovery, and a straightforward error recovery model that increases system robustness. Services are provided to both controller (SCSI to Host adapter) and peripheral drivers that allow the system to take full advantage of advanced SCSI features while leveraging common code. Strategies used to ensure wide SCSI device compatibility are also discussed.


Introduction

Since the inception of the Small Computer System Interface protocol, system programmers have faced several challenges in implementing software to communicate with SCSI devices. Although presented as a series of ANSI ratified standards, extensive knowledge of "standardized SCSI" isn't sufficient to develop a robust and efficient SCSI subsystem. The interpretations of the SCSI standards are as varied as the vendors manufacturing SCSI devices, increasing the difficulty of designing a system that offers wide compatibility. Even without the specter of inconsistent vendor implementations, the challenges of reliable error recovery, efficient and fair use of bus and device resources, and exporting a clean API to applications remove the possibility of a simple SCSI implementation.

Recognizing the difficulties inherent in supporting SCSI, the ANSI body developed the Common Access Method specification. CAM provides a formal description of the interfaces in a SCSI subsystem. These interfaces are tools that can be used to build a robust SCSI framework, but CAM offers few implementation guidelines and often lacks information on the common SCSI specification violations that must be tolerated to achieve interoperability. What is a good strategy for detecting non-compliant devices? What commands, issued in what order, should be used to probe for devices? What are the common problems encountered in supporting tagged queuing? These questions, and many more, are left for the implementer to answer. This paper seeks to document how one particular CAM implementation, written for the FreeBSD Operating System, fills in the details missing from the formal CAM specification.

A CAM system is composed of three basic components. At the highest level we find the peripheral drivers. Peripheral drivers serve as the interface between Operating System services and CAM actions. A CAM system often provides peripheral drivers for tape, disk, CD-ROM, or other device types. Applications running on the system use generic Operating System services (read, write, ioctl) to access peripheral devices without needing to know the specific SCSI commands required to perform these actions.

Figure 1. Components of a CAM system

While a peripheral driver understands how to convert user actions into SCSI commands, it knows very little about where particular devices reside in the system. Peripheral drivers rely on the Transport Layer (XPT) to provide these routing services. The XPT notifies registered peripheral drivers of device arrival, departure, and other asynchronous events, transports SCSI and administrative commands in the form of CAM Control Blocks (CCBs) to the appropriate hardware, and provides a round-robin prioritized scheduler for CCBs pertaining to I/O operations. Just as peripheral drivers provide an abstraction layer between SCSI and Operating System services, the XPT layer provides an abstraction layer between SCSI protocol consumers (the peripheral drivers) and the hardware support modules that manage SCSI transactions.

The final component in this system is the SCSI Interface Module (SIM). SIMs are tasked with controlling host bus adapter hardware and converting CCBs into the appropriate hardware actions. SIMs are also responsible for detecting error conditions, performing some aspects of error recovery, and exporting accurate error information through the XPT layer to peripheral drivers.

Starting with the Transport Layer, the data structures and strategies used to implement each of these components will be discussed. Special focus will be given to the areas of deterministic error recovery, guaranteed transaction priority and ordering, handling of non-conformant devices, and the tools provided by CAM that are used to implement these features in peripheral and SIM drivers.


The Transport Layer

The "Ring Master" of the CAM subsystem is the Transport Layer (XPT). Charged with the important tasks of tracking peripheral, controller, and device presence, CAM system-wide resource management, and the routing of commands between the different CAM layers, the XPT is the source of most device independent functionality in CAM. The XPT's role begins early in system startup when peripheral and controller drivers are attached, SCSI devices are probed, and the state of the CAM system is recorded in the Existing Device Table (EDT). Once the topology of the system is known, the XPT can assume its role as a router. The XPT uses its routing services to provide guaranteed delivery of transactions between peripheral driver instances and controller drivers. Even more critical to the system's proper functioning and performance is the XPT's task as a resource manager. The XPT schedules the allocation of device, controller, and XPT resources for all I/O transactions using a round-robin per priority level approach that guarantees fairness.


The Existing Device Table

The active components in the CAM system are recorded in the Existing Device Table (EDT). Construction of the table begins during system initialization as SCSI Host Bus Adapters (HBAs) are detected and the SIMs managing these HBAs register individual HBA busses with the XPT. During its lifetime, the EDT will contain temporary device entries used during device probing, and also handle the creation and deletion of more permanent entries due to "hot-plug" events. The design goals of the EDT are to provide a simple, space efficient, and easily modified representation of the topology of the CAM system.

Figure 2. The XPT Existing Device Table

The EDT is composed of four data structures arranged into a tree of linked lists. These structures are the CAM Existing Bus (cam_eb), the CAM Existing Target (cam_et), the CAM Existing Device (cam_ed), and the CAM Peripheral (cam_periph). The tree begins with a list of cam_eb objects representing the busses on HBAs in the system. Busses in CAM are assigned a 32-bit value known as a path_id. The path_id determines the order in which the busses are scanned for sub-devices, and its value is determined by a combination of registration order and "hard wiring" information specified by the user when compiling the kernel. The cam_eb list is sorted in ascending order by path_id. Beyond its role as an enumerator of a bus, the cam_eb also records the XPT clients that have requested notification of particular asynchronous events that occur on a given bus.

Each cam_eb contains a list, sorted in ascending order by target id, of active targets on the bus. Each node is represented by a cam_et structure. Unlike the cam_eb, the cam_et holds no additional system meta-data (e.g. async notification lists).

Descending another level, each cam_et contains a list of all active devices, sorted by logical unit number, that are attached to that target. These nodes, represented by cam_ed objects, are the largest of all entries in the EDT. It is here that the XPT keeps almost all information required to perform its tasks. Among this information is a queue of pending CCBs for the device, an async notification registration list, a snapshot of the SCSI inquiry data recorded when the device was probed, and a list of "quirks" that indicate special features or SCSI compliance issues that require special actions to properly communicate with this device.

To complete the topological picture of the CAM system, the EDT also records peripheral instances that have attached to a given device. This is performed using a list of cam_periph objects, sorted in attach order, on the cam_ed. This information is primarily used by applications that need to know that a peripheral instance is attached to a given device, but that will communicate with the device via a different peripheral instance that is also attached to the same device. For instance, a program that deals with CDROM devices may wish to open the application pass-through device associated with "cd0", the first CDROM driver instance in the system, in order to send SCSI commands directly to the device. This program would perform an EDT lookup based on "cd0" to determine if a pass-through instance (e.g., "pass25") is attached to the underlying device.

Object lifetime in the CAM EDT is managed by a mixture of explicit object removal and reference counting. No entry at any level in the table may be removed unless its reference count has gone to zero and all objects below it in the tree have been removed. In addition to this requirement, bus objects are persistent from the time they are advertised to the XPT via xpt_bus_register until a SIM shuts down and uses xpt_bus_deregister to explicitly remove the bus instance. Device objects share similar semantics. Until a device has been successfully probed, its corresponding cam_ed object is marked as "unconfigured" by setting the CAM_DEV_UNCONFIGURED flag. If the probe is successful, the flag is cleared and the device node will remain even when its reference count goes to zero. This allows the system to place temporary device nodes in the EDT while it queues transactions to determine the device's existence. Device nodes are considered configured until an invalidation event such as a selection timeout occurs.
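The removal rule for device nodes can be illustrated with a minimal sketch. Only the CAM_DEV_UNCONFIGURED flag name comes from the text above; its numeric value, the edt_dev structure, and the edt_dev_* helper names are assumptions made for illustration and are not the FreeBSD definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define CAM_DEV_UNCONFIGURED 0x01u  /* flag name from the text; value assumed */

/* Hypothetical, trimmed-down stand-in for a cam_ed device node. */
struct edt_dev {
    unsigned refcount;  /* outstanding references to this node */
    unsigned nperiphs;  /* objects still attached below this node */
    unsigned flags;
};

/* Probe succeeded: the node now persists even at a zero reference count. */
void edt_dev_configure(struct edt_dev *d)
{
    d->flags &= ~CAM_DEV_UNCONFIGURED;
}

/* Invalidation event (e.g. a selection timeout): node becomes removable again. */
void edt_dev_invalidate(struct edt_dev *d)
{
    d->flags |= CAM_DEV_UNCONFIGURED;
}

/* Drop one reference; return true if the node may now be removed. */
bool edt_dev_release(struct edt_dev *d)
{
    assert(d->refcount > 0);
    d->refcount--;
    return d->refcount == 0 && d->nperiphs == 0 &&
        (d->flags & CAM_DEV_UNCONFIGURED) != 0;
}
```

In this sketch a temporary probe node disappears as soon as its last reference is dropped, while a configured node survives until edt_dev_invalidate is called.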


XPT Routing

The EDT contains all of the information required to determine a route based on the path_id, target, and lun tuple easily generated elsewhere in the system. To do this, we perform a linear search of first all busses in the system, followed by all targets of a bus, and finally all devices attached to a target. Every time a command is issued from a peripheral driver, it must be directed to the proper device. As this can occur hundreds of times per second for each device in the system, the XPT layer must complete this task quickly and efficiently. The linear search characteristics of the EDT, even though the number of elements in the table is usually small, do not provide the routing speed we desire. While using lists as the primary data structure for the EDT provides a compact representation, it makes it impractical to use the EDT for fast routing.
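The three-level linear search can be sketched as follows. The structure names cam_eb, cam_et, and cam_ed come from the text, but the field names, the reduced layouts, and the edt_lookup function are assumptions for illustration only; the real structures carry far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down EDT nodes; only the sort key and list links are kept. */
struct cam_ed { uint32_t lun;       struct cam_ed *next; };
struct cam_et { uint32_t target_id; struct cam_ed *devices; struct cam_et *next; };
struct cam_eb { uint32_t path_id;   struct cam_et *targets; struct cam_eb *next; };

/* Walk bus list, then target list, then device list; NULL if no such device. */
struct cam_ed *
edt_lookup(struct cam_eb *busses, uint32_t path_id, uint32_t target_id,
    uint32_t lun)
{
    struct cam_eb *b;
    struct cam_et *t;
    struct cam_ed *d;

    /* Each list is sorted ascending, so stop at the first key >= the wanted one. */
    for (b = busses; b != NULL && b->path_id < path_id; b = b->next)
        assert(b->next == NULL || b->next->path_id > b->path_id);
    if (b == NULL || b->path_id != path_id)
        return NULL;

    for (t = b->targets; t != NULL && t->target_id < target_id; t = t->next)
        continue;
    if (t == NULL || t->target_id != target_id)
        return NULL;

    for (d = t->devices; d != NULL && d->lun < lun; d = d->next)
        continue;
    if (d == NULL || d->lun != lun)
        return NULL;
    return d;
}
```

The cost is proportional to the number of busses, targets, and devices stepped over, which is the per-command overhead the text identifies as too slow for routing.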

To solve the routing problem, the FreeBSD CAM layer takes advantage of the relatively long lifetime of most device-peripheral associations. A disk, for example, that is discovered at boot time will receive its commands from the same peripheral drivers until the system is halted, or until a hardware failure or other asynchronous event causes this link to be severed. This presents an obvious opportunity to cache routing path information to avoid EDT lookups. FreeBSD CAM stores this cached information in cam_path objects. A cam_path is an opaque object that contains not only the bus, target, lun tuple from which it was created, but also direct pointers into the EDT to provide immediate access to all EDT objects composing the route from peripheral to device. It may seem that, as the EDT objects themselves contain tuple information, keeping tuple information in the cam_path is redundant, but paths may need to be exported to applications running in the user address space or even on a remote system, where access to the EDT information is not possible. This forces the requirement of a portable representation of the tuple.

Figure 3. Anatomy of a cam_path

Not all routing tasks involve a single destination. This is often the case for asynchronous events. When a bus reset occurs, for example, the event applies to all devices attached to the bus where the reset originated. To handle these kinds of situations, cam_path components may be wild carded. Wild card components have NULL pointers into the EDT and use the CAM_BUS_WILDCARD, CAM_TARGET_WILDCARD, and CAM_LUN_WILDCARD constants for tuple values. Although the EDT is not efficient for performing point to point routing operations, its hierarchical list format reduces the "broadcasting" associated with wild cards to a simple list traversal with all list components being of interest.
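A minimal sketch of a cam_path with cached EDT pointers, and of broadcasting through wild carded components, might look like this. The wild card constant names come from the text; their all-ones values, the field layouts, the events_seen counter, and the path_broadcast function are assumptions for illustration. The sketch also assumes that a path with a specific target always has a specific bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constant names are from the text; the all-ones values are an assumption. */
#define CAM_BUS_WILDCARD    ((uint32_t)~0u)
#define CAM_TARGET_WILDCARD ((uint32_t)~0u)
#define CAM_LUN_WILDCARD    ((uint32_t)~0u)

/* Trimmed-down EDT nodes; events_seen stands in for an async callback. */
struct cam_ed { uint32_t lun; int events_seen; struct cam_ed *next; };
struct cam_et { uint32_t target_id; struct cam_ed *devices; struct cam_et *next; };
struct cam_eb { uint32_t path_id; struct cam_et *targets; struct cam_eb *next; };

/* Portable tuple plus cached EDT pointers; a NULL pointer marks a wild card. */
struct cam_path {
    uint32_t path_id, target_id, lun;
    struct cam_eb *bus;
    struct cam_et *target;
    struct cam_ed *device;
};

/* Deliver an event to every device the path covers; return how many got it. */
int
path_broadcast(struct cam_eb *busses, const struct cam_path *p)
{
    struct cam_eb *b;
    struct cam_et *t;
    struct cam_ed *d;
    int delivered = 0;

    /* A wild carded bus component must not carry an EDT pointer. */
    assert(p->path_id != CAM_BUS_WILDCARD || p->bus == NULL);

    if (p->device != NULL) {        /* fully specified: no traversal at all */
        p->device->events_seen++;
        return 1;
    }
    for (b = p->bus != NULL ? p->bus : busses; b != NULL; b = b->next) {
        for (t = p->target != NULL ? p->target : b->targets; t != NULL;
            t = t->next) {
            for (d = t->devices; d != NULL; d = d->next) {
                d->events_seen++;
                delivered++;
            }
            if (p->target != NULL)  /* specific target: visit only this one */
                break;
        }
        if (p->bus != NULL)         /* specific bus: visit only this one */
            break;
    }
    return delivered;
}
```

A bus reset would use a path with a specific bus and wild carded target and lun, so every device on that bus is visited with a plain list walk and no comparisons against the tuple.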


XPT Resource Management

Perhaps the most complicated task performed by the XPT layer is that of resource management. The resources in a CAM system that the XPT is concerned with fall into three areas. The most obvious resource is system RAM, which is used to allocate the CAM Control Blocks (CCBs) that represent transactions as they pass through the CAM system. Often more constraining are the limits imposed by the number of transactions a particular device can handle simultaneously. This paper will often refer to this resource as "device openings". The last resource of concern for the XPT is the number of transactions a particular SIM can queue at a given time, known as "controller openings". Not only must the XPT allocate these resources in a fair manner based on transaction priority, but it must also ensure that resource allocation allows for proper error recovery and that transaction order is consistent. Luckily, a unified approach can be used to deal with all three types of resources.

The basic data structure used by the XPT layer to handle resource allocation and scheduling is the camq. The camq is a heap based priority queue that sorts its entries by a 32-bit priority value and a generation count that is incremented and applied by the client of the queue to entries as they are inserted. The generation count combined with the two-key sort ensures that no two entries in the queue have exactly the same overall priority and that round-robin per priority level scheduling is enforced. Heap based queues offer O(log n) running time for both insertion and deletion. In the very common case in CAM where, due to a fixed priority and an always increasing generation count, almost every new entry sorts just behind the entries already in the queue, insertion becomes O(1).
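The two-key heap can be sketched as a small array-backed binary heap. This is an illustration under stated assumptions, not the FreeBSD camq: the structure layout, the fixed CAMQ_MAX capacity, and the function names are hypothetical, a numerically lower priority value is assumed to be served first, generation wrap-around is ignored, and for brevity the queue stamps the generation itself rather than leaving that to its client.

```c
#include <assert.h>
#include <stdint.h>

#define CAMQ_MAX 64  /* fixed capacity, chosen only for this sketch */

struct camq_entry {
    uint32_t priority;    /* lower value is served first (assumed) */
    uint32_t generation;  /* tie-breaker within one priority level */
    int id;               /* stand-in for the queued object */
};

struct camq {
    struct camq_entry heap[CAMQ_MAX];
    int count;
    uint32_t generation;  /* next generation stamp to hand out */
};

/* Nonzero if a must be served before b: priority first, then generation. */
static int
camq_before(const struct camq_entry *a, const struct camq_entry *b)
{
    if (a->priority != b->priority)
        return a->priority < b->priority;
    return a->generation < b->generation;
}

/* Insert an entry; returns how many heap levels it had to move up. */
int
camq_insert(struct camq *q, uint32_t priority, int id)
{
    struct camq_entry e;
    int i, steps = 0;

    assert(q->count < CAMQ_MAX);
    e.priority = priority;
    e.generation = q->generation++;
    e.id = id;
    i = q->count++;
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (!camq_before(&e, &q->heap[parent]))
            break;                      /* common case: stays at the bottom */
        q->heap[i] = q->heap[parent];
        i = parent;
        steps++;
    }
    q->heap[i] = e;
    return steps;
}

/* Remove and return the entry that must be served next. */
struct camq_entry
camq_remove_head(struct camq *q)
{
    struct camq_entry head, last;
    int i = 0;

    assert(q->count > 0);
    head = q->heap[0];
    last = q->heap[--q->count];
    for (;;) {
        int child = 2 * i + 1;
        if (child >= q->count)
            break;
        if (child + 1 < q->count &&
            camq_before(&q->heap[child + 1], &q->heap[child]))
            child++;                    /* pick the child served earlier */
        if (!camq_before(&q->heap[child], &last))
            break;
        q->heap[i] = q->heap[child];
        i = child;
    }
    q->heap[i] = last;
    return head;
}
```

When every insertion uses the same priority, camq_insert returns zero because the new entry never needs to move up, which is the constant-time case described above. Entries at one priority level come back out in insertion order, giving round-robin service.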

Perhaps the easiest way to see how the camq is employed by the XPT is to track a typical peripheral request through the system. In this scenario, we are dealing with SCSI commands or similar requests which by their nature consume CAM and HBA resources in order to be fulfilled. When a peripheral driver receives a request from the system (read, write, ioctl), it queues the request in a manner specific to the peripheral driver, and calls xpt_schedule with a priority appropriate for the transaction. In response to the xpt_schedule call, the XPT places the peripheral instance into a camq of cam_periph objects waiting for resources on the target device. If the peripheral driver was already queued at this level, the XPT simply checks to see if the priority of the peripheral has been increased and updates the queue accordingly. At this point, we have provided a round-robin queue at the peripheral level to consume device openings. To address controller level resources, we maintain a camq of devices awaiting common controller resources. Each device is assigned the priority of the highest priority peripheral in its queue. When controller resources become available, the highest priority device is dequeued, the highest priority peripheral is dequeued, and the peripheral's registered start routine is called with an allocated CCB as an argument. The peripheral driver dequeues its highest priority transaction, fills out the passed in CCB based on this transaction, and sends the CCB off to be routed to the SIM with a call to xpt_action. Before returning from its start routine, the peripheral driver determines if it has other transactions waiting, and if so, calls xpt_schedule again. As the call stack unwinds, the process of "re-scheduling" continues with the device requesting itself to be queued again if it still has resources available and peripheral instances requesting those resources.
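The two-level selection described above can be sketched as follows. To stay short, this sketch replaces both camqs with linear scans, so it shows only the selection order, not the data structure. All structure and function names (periph, device, sketch_schedule, sketch_run_next) are hypothetical, and a numerically lower priority value is assumed to be served first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PERIPHS 4

struct periph {
    uint32_t priority;    /* lower value is served first (assumed) */
    uint32_t generation;  /* tie-breaker: stamped when first scheduled */
    int scheduled;        /* nonzero while waiting for resources */
};

struct device {
    struct periph *periphs[MAX_PERIPHS];
    int nperiphs;
    int openings;         /* free device openings */
};

static uint32_t sched_generation;

/* Rough analogue of xpt_schedule: queue the periph, or raise its priority. */
void
sketch_schedule(struct periph *p, uint32_t priority)
{
    if (p->scheduled) {
        if (priority < p->priority)
            p->priority = priority;     /* only ever raise the priority */
        return;
    }
    p->priority = priority;
    p->generation = sched_generation++;
    p->scheduled = 1;
}

static int
periph_before(const struct periph *a, const struct periph *b)
{
    if (a->priority != b->priority)
        return a->priority < b->priority;
    return a->generation < b->generation;
}

/* Highest priority scheduled periph on a device, or NULL if none waits. */
static struct periph *
device_head(const struct device *d)
{
    struct periph *best = NULL;
    int i;

    for (i = 0; i < d->nperiphs; i++) {
        struct periph *p = d->periphs[i];
        if (p->scheduled && (best == NULL || periph_before(p, best)))
            best = p;
    }
    return best;
}

/* Pick the next periph to start, consuming one device and one controller opening. */
struct periph *
sketch_run_next(struct device *devs, int ndevs, int *ctrl_openings)
{
    struct device *best_dev = NULL;
    struct periph *best = NULL;
    int i;

    if (*ctrl_openings <= 0)
        return NULL;
    for (i = 0; i < ndevs; i++) {
        struct periph *p;

        if (devs[i].openings <= 0)
            continue;                   /* device cannot accept more work */
        p = device_head(&devs[i]);      /* device inherits this priority */
        if (p != NULL && (best == NULL || periph_before(p, best))) {
            best = p;
            best_dev = &devs[i];
        }
    }
    if (best == NULL)
        return NULL;
    assert(best_dev != NULL);
    best->scheduled = 0;
    best_dev->openings--;
    (*ctrl_openings)--;
    return best;    /* the caller would now invoke its start routine with a CCB */
}
```

A device with no free openings is skipped even if it holds the highest priority peripheral, so one saturated device does not stall the controller for the others.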

The camqs at the device and controller level provide for the initial allocation of resources, but how does the system handle an error? The goal of the XPT is to ensure that transaction ordering is maintained to each device even in the event of an error. To do this, we must introduce one more camq, this time holding CCBs that have been allocated, but are waiting to be processed by the SIM. In the common case, this queue is empty. After all, the XPT has gone through a somewhat complicated allocation scheduler to ensure that the SIM and device have the resources necessary to handle an allocated CCB. When an error occurs, however, this low level CCB queue is placed into the "frozen" state, and any CCBs affected by the error condition are returned to the peripheral driver. For every CCB returned, the "frozen" count of the CCB queue is incremented. Only once the count is reduced to zero by the peripheral driver are CCBs allowed to flow to the SIM. In order to handle the error, the peripheral driver will likely re-queue the affected transactions, send a high priority error recovery CCB to address the condition, and then release the queue. As the CCB queue is priority ordered, the high priority recovery action automatically becomes the next transaction to run. Because the frozen count is incremented for each returned CCB, the order in which the transactions are returned to the peripheral driver and requeued doesn't affect transaction replay: all transactions must be addressed (e.g., requeued into the CCB queue) before the count will go to zero and release the queue.
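The freeze counting can be sketched with a few counters. The ccb_queue structure and the ccbq_* function names are hypothetical, and the queue contents are reduced to simple counts, so priority ordering within the queue is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the low level CCB queue in front of the SIM. */
struct ccb_queue {
    int frozen;   /* freeze count; CCBs flow to the SIM only at zero */
    int pending;  /* CCBs queued but not yet handed to the SIM */
    int sent;     /* CCBs handed to the SIM so far */
};

/* Called once for every CCB returned to the peripheral driver with an error. */
void
ccbq_freeze(struct ccb_queue *q)
{
    q->frozen++;
}

/* Called by the peripheral driver once it has dealt with one returned CCB. */
bool
ccbq_release(struct ccb_queue *q)
{
    assert(q->frozen > 0);
    q->frozen--;
    return q->frozen == 0;  /* true when the queue may run again */
}

void
ccbq_enqueue(struct ccb_queue *q)
{
    q->pending++;
}

/* Hand queued CCBs to the SIM if allowed; returns how many were dispatched. */
int
ccbq_run(struct ccb_queue *q)
{
    int n = 0;

    while (q->frozen == 0 && q->pending > 0) {
        q->pending--;
        q->sent++;
        n++;
    }
    return n;
}
```

With two CCBs returned after an error, the queue stays frozen after the first release and only runs once the second CCB has also been handled, regardless of the order in which the driver processes them.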