The Design and Implementation of the FreeBSD SCSI Subsystem

The Transport Layer

The “Ring Master” of the CAM subsystem is the Transport Layer (XPT). Charged with the important tasks of tracking peripheral, controller, and device presence, CAM system-wide resource management, and the routing of commands between the different CAM layers, the XPT is the source of most device-independent functionality in CAM. The XPT's role begins early in system startup, when peripheral and controller drivers are attached, SCSI devices are probed, and the state of the CAM system is recorded in the Existing Device Table (EDT). Once the topography of the system is known, the XPT can assume its role as a router. The XPT uses its routing services to provide guaranteed delivery of transactions between peripheral driver instances and controller drivers. Even more critical to the system's proper functioning and performance is the XPT's task as a resource manager. The XPT schedules the allocation of device, controller, and XPT resources for all I/O transactions using a round-robin per priority level approach that guarantees fairness.

The Existing Device Table

The active components in the CAM system are recorded in the Existing Device Table (EDT). Construction of the table begins during system initialization as SCSI Host Bus Adapters (HBAs) are detected and the SIMs managing these HBAs register individual HBA busses with the XPT. During its lifetime, the EDT will contain temporary device entries used during device probing, and also handle the creation and deletion of more permanent entries due to “hot-plug” events. The design goals of the EDT are to provide a simple, space-efficient, and easily modified representation of the topology of the CAM system.

Figure 2. The XPT Existing Device Table

The EDT is composed of four data structures arranged into a tree of linked lists. These structures are the CAM Existing Bus (cam_eb), the CAM Existing Target (cam_et), the CAM Existing Device (cam_ed), and the CAM Peripheral (cam_periph). The tree begins with a list of cam_eb objects representing the busses on HBAs in the system. Busses in CAM are assigned a 32-bit value known as a path_id. The path_id determines the order in which the busses are scanned for sub-devices, and its value is determined by a combination of registration order and “hard wiring” information specified by the user when compiling the kernel. The cam_eb list is sorted in ascending order by path_id. Beyond its role as an enumerator of a bus, the cam_eb also records the XPT clients that have requested notification of particular asynchronous events that occur on a given bus.

Each cam_eb contains a list, sorted in ascending order by target id, of active targets on the bus. Each node is represented by a cam_et structure. Unlike the cam_eb, the cam_et holds no additional system meta-data (e.g. async notification lists).

Descending another level, each cam_et contains a list of all active devices, sorted by logical unit number, that are attached to that target. These nodes, represented by cam_ed objects, are the largest of all entries in the EDT. It is here that the XPT keeps almost all information required to perform its tasks. Among this information is a queue of pending CCBs for the device, an async notification registration list, a snapshot of the SCSI inquiry data recorded when the device was probed, and a list of “quirks” that indicate special features or SCSI compliance issues that require special actions to properly communicate with this device.

To complete the topographical picture of the CAM system, the EDT also records peripheral instances that have attached to a given device. This is performed using a list of cam_periph objects, sorted in attach order, on the cam_ed. This information is primarily used by applications that know of one peripheral instance attached to a given device, but need to communicate with that device via a different peripheral instance that has also been attached. For instance, a program that deals with CDROM devices may wish to open the application pass-through device associated with “cd0”, the first CDROM driver instance in the system, in order to send SCSI commands directly to the device. This program would perform an EDT lookup based on “cd0” to determine if a pass-through instance (e.g. pass25) is attached to the underlying device.
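The three list levels described above can be sketched in a few lines of C. This is a minimal illustration, not the kernel's definitions: the `ex_eb`, `ex_et`, and `ex_ed` structures and the `ex_*_get` and `ex_edt_add` helpers are hypothetical names that keep only the sort key and the list links, and the peripheral list is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-ins for cam_ed, cam_et and cam_eb. */
struct ex_ed { uint32_t lun;       struct ex_ed *next; };
struct ex_et { uint32_t target_id; struct ex_ed *devices; struct ex_et *next; };
struct ex_eb { uint32_t path_id;   struct ex_et *targets; struct ex_eb *next; };

/* Find or create a bus node, keeping the list ascending by path_id. */
static struct ex_eb *
ex_bus_get(struct ex_eb **head, uint32_t path_id)
{
    struct ex_eb **pp = head, *n;

    while (*pp != NULL && (*pp)->path_id < path_id)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->path_id == path_id)
        return (*pp);
    if ((n = calloc(1, sizeof(*n))) == NULL)
        return (NULL);
    n->path_id = path_id;
    n->next = *pp;
    *pp = n;
    return (n);
}

/* Find or create a target node, keeping the list ascending by target id. */
static struct ex_et *
ex_target_get(struct ex_eb *bus, uint32_t target_id)
{
    struct ex_et **pp = &bus->targets, *n;

    while (*pp != NULL && (*pp)->target_id < target_id)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->target_id == target_id)
        return (*pp);
    if ((n = calloc(1, sizeof(*n))) == NULL)
        return (NULL);
    n->target_id = target_id;
    n->next = *pp;
    *pp = n;
    return (n);
}

/* Find or create a device node, keeping the list ascending by lun. */
static struct ex_ed *
ex_device_get(struct ex_et *tgt, uint32_t lun)
{
    struct ex_ed **pp = &tgt->devices, *n;

    while (*pp != NULL && (*pp)->lun < lun)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->lun == lun)
        return (*pp);
    if ((n = calloc(1, sizeof(*n))) == NULL)
        return (NULL);
    n->lun = lun;
    n->next = *pp;
    *pp = n;
    return (n);
}

/* Add (or look up) the device named by a path_id, target, lun tuple. */
static struct ex_ed *
ex_edt_add(struct ex_eb **edt, uint32_t path_id, uint32_t target_id,
    uint32_t lun)
{
    struct ex_eb *bus;
    struct ex_et *tgt;

    assert(edt != NULL);
    if ((bus = ex_bus_get(edt, path_id)) == NULL)
        return (NULL);
    if ((tgt = ex_target_get(bus, target_id)) == NULL)
        return (NULL);
    return (ex_device_get(tgt, lun));
}
```

Starting from an empty table (`struct ex_eb *edt = NULL;`), repeated calls to `ex_edt_add` in any order leave every level sorted, which is the property the scan order relies on.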

Object lifetime in the CAM EDT is managed by a mixture of explicit object removal and reference counting. No entry at any level in the table may be removed unless its reference count has gone to zero and all objects below it in the tree have been removed. In addition to this requirement, bus objects are persistent from the time they are advertised to the XPT via xpt_bus_register until a SIM shuts down and uses xpt_bus_deregister to explicitly remove the bus instance. Device objects share similar semantics. Until a device has been successfully probed, its corresponding cam_ed object is marked as “unconfigured” by setting the CAM_DEV_UNCONFIGURED flag. If the probe is successful, the flag is cleared and the device node will remain even when its reference count goes to zero. This allows the system to place temporary device nodes in the EDT while it queues transactions to determine the device's existence. Device nodes are considered configured until an invalidation event such as a selection timeout occurs.

XPT Routing

The EDT contains all of the information required to determine a route based on the path_id, target, and lun tuple easily generated elsewhere in the system. To do this, we perform a linear search of first all busses in the system, followed by all targets of a bus, and finally all devices attached to a target. Every time a command is issued from a peripheral driver, it must be directed to the proper device. As this can occur hundreds of times per second for each device in the system, the XPT layer must complete this task quickly and efficiently. The linear search characteristics of the EDT, even though the number of elements in the table is usually small, do not provide the routing speed we desire. While using lists as the primary data structure for the EDT provides a compact representation, it makes it impractical to use the EDT for fast routing.

To solve the routing problem, the FreeBSD CAM layer takes advantage of the relatively long lifetime of most device-to-peripheral associations. A disk, for example, that is discovered at boot time will receive its commands from the same peripheral drivers until the system is halted, or a hardware failure or other asynchronous event causes this link to be severed. This presents an obvious opportunity to cache routing path information to avoid EDT lookups. FreeBSD CAM stores this cached information in cam_path objects. A cam_path is an opaque object that contains not only the bus, target, lun tuple from which it was created, but also direct pointers into the EDT to provide immediate access to all EDT objects composing the route from peripheral to device. It may seem that, as the EDT objects themselves contain tuple information, keeping tuple information in the cam_path is redundant, but paths may need to be exported to applications running in the user address space or even a remote system, where access to the EDT information is not possible. This forces the requirement of a portable representation of the tuple.
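The idea of paying for the linear search once and caching the result can be sketched as follows. The `ex_path` layout and the `ex_path_compile` helper are hypothetical and only mirror the description above (a tuple plus direct pointers); the real cam_path is opaque and its fields are not shown in this paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down EDT nodes. */
struct ex_ed { uint32_t lun;       struct ex_ed *next; };
struct ex_et { uint32_t target_id; struct ex_ed *devices; struct ex_et *next; };
struct ex_eb { uint32_t path_id;   struct ex_et *targets; struct ex_eb *next; };

/* Cached route: the portable tuple plus direct pointers into the EDT. */
struct ex_path {
    uint32_t path_id, target_id, lun;
    struct ex_eb *bus;
    struct ex_et *target;
    struct ex_ed *device;
};

/*
 * Walk the three list levels once and record the result.
 * Returns 0 on success, -1 if the tuple names no existing device.
 */
static int
ex_path_compile(struct ex_path *path, struct ex_eb *edt, uint32_t path_id,
    uint32_t target_id, uint32_t lun)
{
    struct ex_eb *bus;
    struct ex_et *tgt;
    struct ex_ed *dev;

    assert(path != NULL);
    for (bus = edt; bus != NULL; bus = bus->next)
        if (bus->path_id == path_id)
            break;
    if (bus == NULL)
        return (-1);
    for (tgt = bus->targets; tgt != NULL; tgt = tgt->next)
        if (tgt->target_id == target_id)
            break;
    if (tgt == NULL)
        return (-1);
    for (dev = tgt->devices; dev != NULL; dev = dev->next)
        if (dev->lun == lun)
            break;
    if (dev == NULL)
        return (-1);

    path->path_id = path_id;
    path->target_id = target_id;
    path->lun = lun;
    path->bus = bus;
    path->target = tgt;
    path->device = dev;
    return (0);
}
```

After a successful compile, every later command can reach `path->device` with a single pointer dereference, while the stored tuple remains available for export outside the kernel.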

Figure 3. Anatomy of a cam_path

Not all routing tasks involve a single destination. This is often the case for asynchronous events. When a bus reset occurs, for example, the event applies to all devices attached to the bus where the reset originated. To handle these kinds of situations, cam_path components may be wildcarded. Wildcard components have NULL pointers into the EDT and use the CAM_BUS_WILDCARD, CAM_TARGET_WILDCARD, and CAM_LUN_WILDCARD constants for tuple values. Although the EDT is not efficient for performing point-to-point routing operations, its hierarchical list format reduces the “broadcasting” associated with wildcards to a simple list traversal with all list components being of interest.
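A wildcard broadcast over one bus then amounts to a pair of nested list walks. In this sketch the `EX_TARGET_WILDCARD` and `EX_LUN_WILDCARD` values, the `reset_seen` field, and the callback type are all assumptions made for illustration; only the traversal pattern is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed wildcard encodings for this sketch only. */
#define EX_TARGET_WILDCARD UINT32_MAX
#define EX_LUN_WILDCARD    UINT32_MAX

/* Hypothetical, stripped-down EDT nodes. */
struct ex_ed { uint32_t lun; int reset_seen; struct ex_ed *next; };
struct ex_et { uint32_t target_id; struct ex_ed *devices; struct ex_et *next; };
struct ex_eb { uint32_t path_id;   struct ex_et *targets; struct ex_eb *next; };

typedef void ex_async_cb(struct ex_ed *dev, void *arg);

/* Example callback: note that the device has observed a bus reset. */
static void
ex_mark_reset(struct ex_ed *dev, void *arg)
{
    (void)arg;
    dev->reset_seen++;
}

/*
 * Deliver an event to every device on the bus that matches the tuple,
 * treating wildcard components as "match all". Returns the number of
 * devices notified.
 */
static unsigned
ex_bus_broadcast(struct ex_eb *bus, uint32_t target_id, uint32_t lun,
    ex_async_cb *cb, void *arg)
{
    struct ex_et *tgt;
    struct ex_ed *dev;
    unsigned n = 0;

    assert(bus != NULL && cb != NULL);
    for (tgt = bus->targets; tgt != NULL; tgt = tgt->next) {
        if (target_id != EX_TARGET_WILDCARD && tgt->target_id != target_id)
            continue;
        for (dev = tgt->devices; dev != NULL; dev = dev->next) {
            if (lun != EX_LUN_WILDCARD && dev->lun != lun)
                continue;
            cb(dev, arg);
            n++;
        }
    }
    return (n);
}
```

With both components wildcarded, every node visited is a recipient, so the list layout costs nothing extra for this kind of delivery.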

XPT Resource Management

Perhaps the most complicated task performed by the XPT layer is that of resource management. The resources in a CAM system that the XPT is concerned with fall into three areas. The most obvious resource is system RAM, which is used to allocate the CAM Control Blocks (CCBs) that represent transactions as they pass through the CAM system. Often more constraining are the limits imposed by the number of transactions a particular device can handle simultaneously. This paper will often refer to this resource as “device openings”. The last resource of concern for the XPT is the number of transactions a particular SIM can queue at a given time, known as “controller openings”. Not only must the XPT allocate these resources in a fair manner based on transaction priority, but it must also ensure that resource allocation allows for proper error recovery with transaction order maintained. Luckily, a unified approach can be used to deal with all three types of resources.

The basic data structure used by the XPT layer to handle resource allocation and scheduling is the camq. The camq is a heap-based priority queue that sorts its entries by a 32-bit priority value and a generation count. The generation count stored in the camq is assigned to new entries at the time they are inserted into the queue and incremented after each use. The generation count combined with the two-key sort ensures that no two entries in the queue have the exact same overall priority and that round-robin per priority level scheduling is enforced. Heap-based queues offer O(log₂ n) running time for both insertion and deletion. It is quite common for peripheral drivers to perform most I/O using a single, “default” priority. In this case, insertion complexity drops to O(1), as the increasing generation count in the camq ensures that new entries in the queue have the lowest overall priority and will be inserted at the tail.
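The two-key ordering and its effect on insertion cost can be seen in a small binary heap. This is a sketch under stated assumptions rather than the camq implementation: the `ex_` names and the fixed-size array are hypothetical, a numerically lower priority value is assumed to be more urgent, and generation counter wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_QUEUE_MAX 64

struct ex_entry {
    uint32_t priority;     /* assumption: lower value = more urgent */
    uint32_t generation;   /* stamped by the queue at insertion time */
    int id;                /* payload for the example */
};

struct ex_camq {
    struct ex_entry *heap[EX_QUEUE_MAX + 1];   /* 1-based binary heap */
    size_t count;
    uint32_t generation;
};

/* Two-key comparison: priority first, then the older generation wins. */
static int
ex_before(const struct ex_entry *a, const struct ex_entry *b)
{
    if (a->priority != b->priority)
        return (a->priority < b->priority);
    return (a->generation < b->generation);
}

/* Insert an entry; returns 0 on success, -1 if the queue is full. */
static int
ex_camq_insert(struct ex_camq *q, struct ex_entry *e)
{
    size_t i;

    if (q->count == EX_QUEUE_MAX)
        return (-1);
    e->generation = q->generation++;
    i = ++q->count;
    /*
     * Sift up. When all entries share one priority, the new entry has
     * the newest generation, the loop exits at once, and the insert
     * is O(1).
     */
    while (i > 1 && ex_before(e, q->heap[i / 2])) {
        q->heap[i] = q->heap[i / 2];
        i /= 2;
    }
    q->heap[i] = e;
    return (0);
}

/* Remove and return the most urgent entry, or NULL if empty. */
static struct ex_entry *
ex_camq_remove_head(struct ex_camq *q)
{
    struct ex_entry *top, *last;
    size_t i = 1, child;

    if (q->count == 0)
        return (NULL);
    top = q->heap[1];
    last = q->heap[q->count--];
    /* Sift the former last entry down from the root. */
    while ((child = 2 * i) <= q->count) {
        if (child < q->count &&
            ex_before(q->heap[child + 1], q->heap[child]))
            child++;
        if (!ex_before(q->heap[child], last))
            break;
        q->heap[i] = q->heap[child];
        i = child;
    }
    q->heap[i] = last;
    return (top);
}
```

Because the generation count is unique per insertion, entries of equal priority leave the queue in the order they arrived, which is what yields round-robin service within a priority level.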

The camq sorts cam_pinfo objects. The cam_pinfo contains the basic data necessary for queuing to occur: the priority, the recorded generation count, and an index of where the item is in the heap's array. In general, complex objects that are placed into a camq are defined with a cam_pinfo object as their first member and casts are employed to convert the cam_pinfo items removed from a queue back to the full object.
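The first-member convention relies on a guarantee of the C language: a pointer to a structure, suitably converted, points to its first member, and vice versa. A minimal sketch, using hypothetical `ex_pinfo` and `ex_periph` types whose fields are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue bookkeeping, modeled on the description above. */
struct ex_pinfo {
    uint32_t priority;
    uint32_t generation;
    int index;             /* position in the heap's array */
};

/* A "complex object" that can be queued. */
struct ex_periph {
    struct ex_pinfo pinfo; /* must remain the first member */
    const char *name;
    uint32_t unit;
};

/* Fail the build if someone moves pinfo away from offset zero (C11). */
_Static_assert(offsetof(struct ex_periph, pinfo) == 0,
    "pinfo must be the first member of ex_periph");

/* What the queue sees. */
static struct ex_pinfo *
ex_periph_to_pinfo(struct ex_periph *p)
{
    return (&p->pinfo);
}

/* Recover the full object from an item removed from the queue. */
static struct ex_periph *
ex_pinfo_to_periph(struct ex_pinfo *pi)
{
    return ((struct ex_periph *)pi);
}
```

The queue code therefore needs to know only about the small bookkeeping structure, while each client recovers its own type with a cast.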

Perhaps the easiest way to see how the camq is employed by the XPT is to track a typical peripheral request through the system. In this scenario, we are dealing with SCSI command or similar “blocking” requests which by their nature consume CAM and HBA resources in order to be fulfilled. When a peripheral driver receives a request from the system (read, write, ioctl), it queues the request in a manner specific to the peripheral driver, and calls xpt_schedule with a priority appropriate for the transaction. In response to the xpt_schedule call, the XPT places the peripheral instance into the drvq camq in the cam_ed object representing the device of interest. The drvq consists of peripheral instances represented by cam_periph objects, waiting for resources on the same device. In the common case, there is only one peripheral instance using a device at a time, but with multifunction devices such as WORM drives that exhibit both CDROM and WORM characteristics, or if a userland program is using the pass through driver to access a device that is serviced by another driver, more than one peripheral instance may be active. If the peripheral driver was already queued in the drvq, the XPT simply checks to see if the priority of the peripheral has been increased and updates the queue accordingly. At this point, we have provided a round-robin queue at the peripheral instance level to consume device openings.

Figure 4. Peripheral Instances Queuing up for Device Openings

To address controller-level resources, we maintain a camq of devices awaiting common controller resources. Each device is assigned the priority of the highest priority peripheral in its queue. When controller resources become available, the highest priority device is dequeued, the highest priority peripheral on that device is dequeued, and the peripheral's registered start routine is called with an allocated CCB as an argument. The peripheral driver dequeues its highest priority transaction, fills out the passed-in CCB based on this transaction, and sends the CCB off to be routed to the SIM with a call to xpt_action. Before returning from its start routine, the peripheral driver determines if it has other transactions waiting, and if so, calls xpt_schedule again. As the call stack unwinds, the process of “re-scheduling” continues with the device requesting itself to be queued again if it still has resources available and peripheral instances requesting those resources.
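The schedule, start, and re-schedule cycle for a single device can be modeled with a toy allocator. Everything here is a simplification made for illustration: the `ex_` names are hypothetical, a linear scan replaces the camq (so there is no generation count and no round-robin among equal priorities), a lower priority value is assumed to be more urgent, and the controller-level queue and the CCB itself are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_PERIPHS 4

struct ex_device;

struct ex_periph {
    struct ex_device *dev;
    uint32_t priority;   /* assumption: lower value = more urgent */
    int queued;          /* nonzero while waiting for an opening */
    int pending;         /* transactions in the driver's private queue */
    int started;         /* transactions sent toward the SIM so far */
};

struct ex_device {
    struct ex_periph *drvq[EX_MAX_PERIPHS];   /* stand-in for the drvq */
    size_t nperiphs;
    int openings;        /* free device openings */
};

/* Ask for an opening; a repeat request can only raise the priority. */
static void
ex_schedule(struct ex_periph *p, uint32_t priority)
{
    if (!p->queued || priority < p->priority)
        p->priority = priority;
    p->queued = 1;
}

/* Start routine: consume one opening and re-schedule if work remains. */
static void
ex_start(struct ex_periph *p)
{
    assert(p->pending > 0 && p->dev->openings > 0);
    p->pending--;
    p->started++;
    p->dev->openings--;
    if (p->pending > 0)
        ex_schedule(p, p->priority);
}

/* Hand openings to the most urgent waiting peripheral until none remain. */
static void
ex_run_device(struct ex_device *dev)
{
    while (dev->openings > 0) {
        struct ex_periph *best = NULL;
        size_t i;

        for (i = 0; i < dev->nperiphs; i++) {
            struct ex_periph *p = dev->drvq[i];

            if (p->queued &&
                (best == NULL || p->priority < best->priority))
                best = p;
        }
        if (best == NULL)
            break;
        best->queued = 0;
        ex_start(best);
    }
}
```

With three openings and two peripherals holding two transactions each, the more urgent peripheral drains first, the other gets the remaining opening, and it stays queued for the next opening that is released.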

Figure 5. Devices Queuing up for Control Resource Allocation

The camq structures at the device and controller level provide for the initial allocation of resources, but how does the system handle an error? The goal of the XPT is to ensure that transaction ordering is maintained to each device even during error recovery processing. To meet this goal, we must introduce one more scheduling pass in this system, this time to dole out CCBs that have already been allocated, but are waiting to be processed by the SIM. In the common case, this pass is a no-op. After all, the XPT has gone through a somewhat complicated allocation scheduler to ensure that the SIM and device have the resources necessary to handle an allocated CCB. When an error occurs, however, the system must provide a mechanism for deferring allocated CCBs until the error condition is cleared. The XPT provides a “send CCB” device queue at the controller level as well as a CCB queue at the device level to handle this task. The XPT also tracks the number of openings in these two queues independently of the allocation queues to handle situations where the error recovery action determines that the number of device openings must be reduced.

When a SIM determines that an error has occurred, the low-level CCB queue is placed into the “frozen” state, and any CCBs affected by the error condition are returned to the peripheral driver. For every CCB returned, the “frozen” count of the CCB queue is incremented and the CCB's status byte is flagged to indicate that the error in this CCB caused the queue to be frozen. CCBs are allowed to flow to the SIM again only once the frozen count has been reduced to 0; the peripheral driver reduces the count by 1 for each CCB it handles that has the frozen status flag set. While handling the error, the peripheral driver will likely re-queue the affected transactions, send a high priority error recovery CCB to address the condition, and then release the queue. As the CCB queue is priority ordered, the high priority recovery action automatically becomes the next transaction to run. The fact that the frozen count is incremented for each returned CCB ensures that the order in which the transactions are returned to the peripheral driver and requeued doesn't affect transaction replay. All transactions must be addressed (e.g. requeued into our sorted CCB queue) before the frozen count will go to zero and the queue is released.
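The frozen-count bookkeeping is small enough to sketch directly. The `ex_ccb` and `ex_ccbq` types and the `EX_DEV_QFRZN` flag below are hypothetical stand-ins chosen for this example, and the queue contents are omitted so that only the freeze and release accounting remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status flag: "the error in this CCB froze the queue". */
#define EX_DEV_QFRZN 0x1u

struct ex_ccb  { uint32_t status; };
struct ex_ccbq { uint32_t frozen_count; };

/* SIM side: return a CCB affected by an error to the peripheral driver. */
static void
ex_ccb_return_error(struct ex_ccbq *q, struct ex_ccb *ccb)
{
    q->frozen_count++;
    ccb->status |= EX_DEV_QFRZN;
}

/* Peripheral side: called once the CCB has been handled (e.g. requeued). */
static void
ex_ccb_release(struct ex_ccbq *q, struct ex_ccb *ccb)
{
    if ((ccb->status & EX_DEV_QFRZN) == 0)
        return;
    assert(q->frozen_count > 0);
    ccb->status &= ~EX_DEV_QFRZN;
    q->frozen_count--;
}

/* CCBs may flow to the SIM only while the queue is not frozen. */
static bool
ex_ccbq_can_send(const struct ex_ccbq *q)
{
    return (q->frozen_count == 0);
}
```

Because every returned CCB holds its own freeze, the queue stays closed until the last affected transaction has been dealt with, regardless of the order in which the peripheral driver processes them.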

