The Small Computer System Interface protocol offers a high performance interconnect to a wide range of peripherals. This paper describes the implementation of a SCSI framework for the FreeBSD Operating System. This framework expands on the Common Access Method specification for SCSI, providing round-robin prioritized transaction queuing, guaranteed transaction ordering even during error recovery, and a straightforward error recovery model that increases system robustness. Services are provided to both controller (SCSI to Host adapter) and peripheral drivers that allow the system to take full advantage of advanced SCSI features while leveraging common code. Strategies used to ensure wide SCSI device compatibility are also discussed.
Since the inception of the Small Computer System Interface protocol, system programmers have faced several challenges in implementing software to communicate with SCSI devices. Although presented as a series of ANSI ratified standards, extensive knowledge of "standardized SCSI" isn't sufficient to develop a robust and efficient SCSI subsystem. The interpretations of the SCSI standards are as varied as the vendors manufacturing SCSI devices, increasing the difficulty of designing a system that offers wide compatibility. Even without the specter of inconsistent vendor implementations, the challenges of reliable error recovery, efficient and fair use of bus and device resources, and exporting a clean API to applications remove the possibility of a simple SCSI implementation.
Recognizing the difficulties inherent in supporting SCSI, the ANSI body developed the Common Access Method specification. CAM provides a formal description of the interfaces in a SCSI subsystem. These interfaces are tools that can be used to build a robust SCSI framework, but CAM offers few implementation guidelines and often lacks information on the common SCSI specification violations that must be tolerated to achieve interoperability. What is a good strategy for detecting non-compliant devices? What commands issued in what order should be used to probe for devices? What are the common problems encountered in supporting tagged queuing? These questions, and many more, are left for the implementer to answer. This paper seeks to document how one particular CAM implementation, written for the FreeBSD Operating System, fills in the details missing from the formal CAM specification.
A CAM system is composed of three basic components. At the highest level we find the peripheral drivers. Peripheral drivers serve as the interface between Operating System services and CAM actions. A CAM system often provides peripheral drivers for tape, disk, CD-ROM, or other device types. Applications running on the system use generic Operating System services (read, write, ioctl) to access peripheral devices without needing to know the specific SCSI commands required to perform these actions.
While a peripheral driver understands how to convert user actions into SCSI commands, it knows very little about where particular devices reside in the system. Peripheral drivers rely on the Transport Layer (XPT) to provide these routing services. The XPT notifies registered peripheral drivers of device arrival, departure, and other asynchronous events, transports SCSI and administrative commands in the form of CAM Control Blocks (CCBs) to the appropriate hardware, and provides a round-robin prioritized scheduler for CCBs pertaining to I/O operations. As peripheral drivers provide an abstraction layer between SCSI and Operating System services, the XPT layer provides an abstraction layer between SCSI protocol consumers (the peripheral drivers) and hardware support modules that manage SCSI transactions.
The final component in this system is the System Interface Module (SIM). SIMs are tasked with controlling host bus adapter hardware and converting CCBs into the appropriate hardware actions. SIMs are also responsible for detecting error conditions, performing some aspects of error recovery, and exporting accurate error information through the XPT layer to peripheral drivers.
Starting with the Transport Layer, the data structures and strategies used to implement each of these components will be discussed. Special focus will be given to the areas of deterministic error recovery, guaranteed transaction priority and ordering, handling of non-conformant devices, and the tools provided by CAM that are used to implement these features in peripheral and SIM drivers.
The "Ring Master" of the CAM subsystem is the Transport Layer (XPT). Charged with the important tasks of tracking peripheral, controller, and device presence, CAM system-wide resource management, and the routing of commands between the different CAM layers, the XPT is the source of most device independent functionality in CAM. The XPT's role begins early in system startup when peripheral and controller drivers are attached, SCSI devices are probed, and the state of the CAM system is recorded in the Existing Device Table (EDT). Once the topology of the system is known, the XPT can assume its role as a router. The XPT uses its routing services to provide guaranteed delivery of transactions between peripheral driver instances and controller drivers. Even more critical to the system's proper functioning and performance is the XPT's task as a resource manager. The XPT schedules the allocation of device, controller, and XPT resources for all I/O transactions using a round-robin per priority level approach that guarantees fairness.
The active components in the CAM system are recorded in the Existing Device Table (EDT). Construction of the table begins during system initialization as SCSI Host Bus Adapters (HBAs) are detected and the SIMs managing these HBAs register individual HBA busses with the XPT. During its lifetime, the EDT will contain temporary device entries used during device probing, and also handle the creation and deletion of more permanent entries due to "hot-plug" events. The design goals of the EDT are to provide a simple, space efficient, and easily modified representation of the topology of the CAM system.
The EDT is composed of four data structures arranged into a tree of linked lists. These structures are the CAM Existing Bus (cam_eb), the CAM Existing Target (cam_et), the CAM Existing Device (cam_ed), and the CAM Peripheral (cam_periph). The tree begins with a list of cam_eb objects representing the busses on HBAs in the system. Busses in CAM are assigned a 32 bit value known as a path_id. The path_id determines the order in which the busses are scanned for sub-devices, and its value is determined by a combination of registration order and "hard wiring" information specified by the user when compiling the kernel. The cam_eb list is sorted in ascending order by path_id. Beyond its role as an enumerator of a bus, the cam_eb also records the XPT clients that have requested notification of particular asynchronous events that occur on a given bus.
Each cam_eb contains a list, sorted in ascending order by target id, of active targets on the bus. Each node is represented by a cam_et structure. Unlike the cam_eb, the cam_et holds no additional system meta-data (e.g. async notification lists).
Descending another level, each cam_et contains a list of all active devices, sorted by logical unit number, that are attached to that target. These nodes, represented by cam_ed objects, are the largest of all entries in the EDT. It is here that the XPT keeps almost all information required to perform its tasks. Among this information is a queue of pending CCBs for the device, an async notification registration list, a snapshot of the SCSI inquiry data recorded when the device was probed, and a list of "quirks" that indicate special features or SCSI compliance issues that require special actions to properly communicate with this device.
To complete the topological picture of the CAM system, the EDT also records peripheral instances that have attached to a given device. This is performed using a list of cam_periph objects, sorted in attach order, on the cam_ed. This information is primarily used by applications that need to know that a peripheral instance is attached to a given device, but that will communicate with the device via a different peripheral instance that is also attached to the same device. For instance, a program that deals with CDROM devices may wish to open the application pass-through device associated with "cd0", the first CDROM driver instance in the system, in order to send SCSI commands directly to the device. This program would perform an EDT lookup based on "cd0" to determine if a pass-through instance (e.g. "pass25") is attached to the underlying device.
Object lifetime in the CAM EDT is managed by a mixture of explicit object removal and reference counting. No entry at any level in the table may be removed unless its reference count has gone to zero and all objects below it in the tree have been removed. In addition to this requirement, bus objects are persistent from the time they are advertised to the XPT via xpt_bus_register until a SIM shuts down and uses xpt_bus_deregister to explicitly remove the bus instance. Device objects share similar semantics. Until a device has been successfully probed, its corresponding cam_ed object is marked as "unconfigured" by setting the CAM_DEV_UNCONFIGURED flag. If the probe is successful, the flag is cleared and the device node will remain even when its reference count goes to zero. This allows the system to place temporary device nodes in the EDT while it queues transactions to determine the device's existence. Device nodes are considered configured until an invalidation event such as a selection timeout occurs.
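The device lifetime rule can be illustrated with a minimal C sketch. The structure, the flag value, and the function name below are hypothetical stand-ins chosen for this example (EX_DEV_UNCONFIGURED plays the role of CAM_DEV_UNCONFIGURED); they are not the kernel's definitions, and the additional requirement that all objects below a node be removed first is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_DEV_UNCONFIGURED 0x01    /* stand-in for CAM_DEV_UNCONFIGURED */

/* Hypothetical, heavily reduced device node. */
struct ex_dev_node {
    unsigned refcount;              /* references held on this node */
    unsigned flags;                 /* EX_DEV_UNCONFIGURED until probed */
};

/*
 * Drop one reference. Returns true only when the node may be removed:
 * the last reference is gone AND the device is still (or again)
 * unconfigured. A configured device survives a zero reference count.
 */
static bool
ex_dev_release(struct ex_dev_node *dev)
{
    assert(dev->refcount > 0);
    dev->refcount--;
    return dev->refcount == 0 &&
        (dev->flags & EX_DEV_UNCONFIGURED) != 0;
}
```

Under these assumptions, a temporary node created for probing disappears as soon as the probe drops its reference, while a successfully probed disk stays in the table until an invalidation event sets the flag again.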
The EDT contains all of the information required to determine a route based on the path_id, target, and lun tuple easily generated elsewhere in the system. To do this, we perform a linear search of first all busses in the system, followed by all targets of a bus, and finally all devices attached to a target. Every time a command is issued from a peripheral driver, it must be directed to the proper device. As this can occur hundreds of times per second, for each device in the system, the XPT layer must complete this task quickly and efficiently. The linear search characteristics of the EDT, even though the number of elements in the table is usually small, do not provide the routing speed we desire. While using lists as the primary data structure for the EDT provides a compact representation, it makes it impractical to use the EDT for fast routing.
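To make the cost of this lookup concrete, here is a small C sketch of a three-level walk over sorted lists. The ex_eb, ex_et, and ex_ed structures are simplified illustrations that only mirror the roles of cam_eb, cam_et, and cam_ed; the field names and the function ex_edt_lookup are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the EDT node types. */
struct ex_ed {                      /* device, list sorted by lun */
    uint32_t lun;
    struct ex_ed *next;
};

struct ex_et {                      /* target, list sorted by target id */
    uint32_t target_id;
    struct ex_ed *devices;
    struct ex_et *next;
};

struct ex_eb {                      /* bus, list sorted by path_id */
    uint32_t path_id;
    struct ex_et *targets;
    struct ex_eb *next;
};

/*
 * Linear search: busses, then targets, then devices. Because each list
 * is kept in ascending order, a walk can stop at the first key that is
 * not smaller than the one being sought.
 */
static struct ex_ed *
ex_edt_lookup(struct ex_eb *busses, uint32_t path_id,
    uint32_t target_id, uint32_t lun)
{
    struct ex_eb *b;
    struct ex_et *t;
    struct ex_ed *d;

    for (b = busses; b != NULL && b->path_id < path_id; b = b->next)
        ;
    if (b == NULL || b->path_id != path_id)
        return NULL;
    for (t = b->targets; t != NULL && t->target_id < target_id; t = t->next)
        ;
    if (t == NULL || t->target_id != target_id)
        return NULL;
    for (d = t->devices; d != NULL && d->lun < lun; d = d->next)
        ;
    if (d == NULL || d->lun != lun)
        return NULL;
    return d;
}
```

Each call touches up to one node per bus, target, and device that sorts ahead of the destination, which is the per-command overhead the text argues against paying on every I/O.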
To solve the routing problem, the FreeBSD CAM layer takes advantage of the relatively long lifetime of most device-peripheral associations. A disk, for example, that is discovered at boot time will receive its commands from the same peripheral drivers until the system is halted, or a hardware failure or other asynchronous event causes this link to be severed. This presents an obvious opportunity to cache routing path information to avoid EDT lookups. FreeBSD CAM stores this cached information in cam_path objects. A cam_path is an opaque object that contains not only the bus, target, lun tuple from which it was created, but also direct pointers into the EDT to provide immediate access to all EDT objects composing the route from peripheral to device. It may seem that, as the EDT objects themselves contain tuple information, keeping tuple information in the cam_path is redundant, but paths may need to be exported to applications running in the user address space or even a remote system, where access to the EDT information is not possible. This forces the requirement of a portable representation of the tuple.
Not all routing tasks involve a single destination. This is often the case for asynchronous events. When a bus reset occurs, for example, the event applies to all devices attached to the bus where the reset originated. To handle these kinds of situations, cam_path components may be wild carded. Wild card components have NULL pointers into the EDT and use the CAM_BUS_WILDCARD, CAM_TARGET_WILDCARD, and CAM_LUN_WILDCARD constants for tuple values. Although the EDT is not efficient for performing point to point routing operations, its hierarchical list format reduces the "broadcasting" associated with wild cards to a simple list traversal with all list components being of interest.
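A cached path with optional wild card components might be sketched in C as follows. All names here (ex_path, ex_path_broadcast, the EX_*_WILDCARD constants, the events_seen counter) are hypothetical and exist only for this example. The sketch assumes the bus component is always specified and that a wild carded target implies a wild carded lun; bus wild cards are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_TARGET_WILDCARD UINT32_MAX   /* stand-ins for CAM_*_WILDCARD */
#define EX_LUN_WILDCARD    UINT32_MAX

struct ex_dev { uint32_t lun; int events_seen; struct ex_dev *next; };
struct ex_tgt { uint32_t id; struct ex_dev *devs; struct ex_tgt *next; };
struct ex_bus { uint32_t path_id; struct ex_tgt *tgts; };

struct ex_path {
    uint32_t path_id, target_id, lun;   /* portable tuple */
    struct ex_bus *bus;                 /* cached pointers into the EDT; */
    struct ex_tgt *target;              /* NULL marks a wild carded */
    struct ex_dev *device;              /* component */
};

/* Deliver an event to every device the path covers; return the count. */
static int
ex_path_broadcast(const struct ex_path *p)
{
    struct ex_tgt *t;
    struct ex_dev *d;
    int n = 0;

    if (p->device != NULL) {            /* fully specified: no search */
        p->device->events_seen++;
        return 1;
    }
    /* Visit one target, or every target when the target is wild. */
    for (t = p->target != NULL ? p->target : p->bus->tgts; t != NULL;
        t = p->target != NULL ? NULL : t->next) {
        for (d = t->devs; d != NULL; d = d->next) {
            d->events_seen++;
            n++;
        }
    }
    return n;
}
```

The fully specified case reaches its device through the cached pointer without touching any list, while the wild card cases are plain traversals in which every visited node is a recipient.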
Perhaps the most complicated task performed by the XPT layer is that of resource management. The resources in a CAM system that the XPT is concerned with fall into three areas. The most obvious resource is that of system RAM, which is used to allocate the CAM Control Blocks (CCBs) that represent transactions as they pass through the CAM system. Often more constraining are the limits imposed by the number of transactions a particular device can handle simultaneously. This paper will often refer to this resource as "device openings". The last resource of concern for the XPT is the number of transactions a particular SIM can queue at a given time, known as "controller openings". Not only must the XPT allocate these resources in a fair manner based on transaction priority, but it must also ensure that resource allocation allows for proper error recovery and that transaction order is consistent. Luckily, a unified approach can be used to deal with all three types of resources.
The basic data structure used by the XPT layer to handle resource allocation and scheduling is the camq. The camq is a heap based priority queue that sorts its entries by a 32 bit priority value, and a generation count that is incremented and applied by the client of the queue to entries as they are inserted. The generation count combined with the two key sort ensures that no two entries in the queue have the exact same overall priority and that round-robin per priority level scheduling is enforced. Heap based queues offer O(log n) running time for both insertion and deletion, but in the very common case in CAM where, due to a fixed priority and an always increasing generation count, almost all new entries sort just behind the entries already in the queue, no re-heapification is needed and insertion becomes O(1).
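The two key ordering can be sketched as a small array-backed binary heap in C. This is an illustration, not the kernel's camq: the names are hypothetical, the capacity is fixed, a numerically lower priority is assumed to run first, the generation counter is kept inside the queue rather than by its client, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define EX_QMAX 64                      /* arbitrary capacity for the sketch */

struct ex_entry {
    uint32_t priority;                  /* lower value runs first (assumed) */
    uint32_t generation;                /* stamped at insert time */
    int      id;                        /* payload tag for the example */
};

struct ex_camq {
    struct ex_entry heap[EX_QMAX + 1];  /* 1-based binary min-heap */
    int      count;
    uint32_t generation;                /* next generation to hand out */
};

/* Two key comparison: priority first, then generation. */
static int
ex_before(const struct ex_entry *a, const struct ex_entry *b)
{
    if (a->priority != b->priority)
        return a->priority < b->priority;
    return a->generation < b->generation;
}

static void
ex_swap(struct ex_entry *a, struct ex_entry *b)
{
    struct ex_entry tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Insert at the bottom and sift up; returns the number of swaps made. */
static int
ex_camq_insert(struct ex_camq *q, uint32_t priority, int id)
{
    int i, swaps = 0;

    assert(q->count < EX_QMAX);
    i = ++q->count;
    q->heap[i].priority = priority;
    q->heap[i].generation = q->generation++;
    q->heap[i].id = id;
    while (i > 1 && ex_before(&q->heap[i], &q->heap[i / 2])) {
        ex_swap(&q->heap[i], &q->heap[i / 2]);
        i /= 2;
        swaps++;
    }
    return swaps;
}

/* Remove the head entry and sift the last entry down; returns its id. */
static int
ex_camq_remove_head(struct ex_camq *q)
{
    int i = 1, child, id;

    assert(q->count > 0);
    id = q->heap[1].id;
    q->heap[1] = q->heap[q->count--];
    while ((child = 2 * i) <= q->count) {
        if (child < q->count &&
            ex_before(&q->heap[child + 1], &q->heap[child]))
            child++;
        if (!ex_before(&q->heap[child], &q->heap[i]))
            break;
        ex_swap(&q->heap[child], &q->heap[i]);
        i = child;
    }
    return id;
}
```

When every insert uses the same priority, each new entry carries the largest generation so far, the sift-up loop exits immediately, and ex_camq_insert reports zero swaps. Entries at one priority level then leave the queue in arrival order, which is the round-robin behavior described above.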
Perhaps the easiest way to see how the camq is employed by the XPT is to track a typical peripheral request through the system. In this scenario, we are dealing with SCSI command or similar requests which by their nature consume CAM and HBA resources in order to be fulfilled.
When a peripheral driver receives a request from the system (read, write, ioctl), it queues the request in a manner specific to the peripheral driver, and calls xpt_schedule with a priority appropriate for the transaction. In response to the xpt_schedule call, the XPT places the peripheral instance into a camq of cam_periph objects waiting for resources on the target device. If the peripheral driver was already queued at this level, the XPT simply checks to see if the priority of the peripheral has been increased and updates the queue accordingly.
At this point, we have provided a round-robin queue at the peripheral level to consume device openings. To address controller level resources, we maintain a camq of devices awaiting common controller resources. Each device is assigned the priority of the highest priority peripheral in its queue. When controller resources become available, the highest priority device is dequeued, the highest priority peripheral is dequeued, and the peripheral's registered start routine is called with an allocated CCB as an argument. The peripheral driver dequeues its highest priority transaction, fills out the passed in CCB based on this transaction, and sends the CCB off to be routed to the SIM with a call to xpt_action. Before returning from its start routine, the peripheral driver determines if it has other transactions waiting, and if so, calls xpt_schedule again. As the call stack unwinds, the process of "re-scheduling" continues with the device requesting itself to be queued again if it still has resources available and peripheral instances requesting those resources.
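The shape of a peripheral start routine can be sketched in C. This is a deliberately reduced model: ex_schedule and ex_action only play the roles of xpt_schedule and xpt_action, the driver-private request queue is a fixed, non-wrapping array, and a single flag replaces the device and controller level queues. None of these names or signatures are taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct ex_ccb { int request; };     /* reduced stand-in for a CCB */

struct ex_periph {
    int pending[8];                 /* driver-private request queue */
    int head, tail;                 /* valid entries: [head, tail) */
    int scheduled;                  /* waiting for an opening? */
    int last_sent;                  /* last request handed downward */
};

/* Role of xpt_schedule: ask to be called back when resources exist. */
static void
ex_schedule(struct ex_periph *p)
{
    p->scheduled = 1;
}

/* Role of xpt_action: hand a filled-in CCB toward the SIM. */
static void
ex_action(struct ex_periph *p, struct ex_ccb *ccb)
{
    p->last_sent = ccb->request;
}

/* Start routine: called with a freshly allocated CCB. */
static void
ex_periph_start(struct ex_periph *p, struct ex_ccb *ccb)
{
    if (p->head == p->tail)
        return;                     /* nothing left to send */
    ccb->request = p->pending[p->head++];
    ex_action(p, ccb);
    if (p->head != p->tail)
        ex_schedule(p);             /* more work: queue ourselves again */
}

/* Called when an opening frees up; returns 1 if the periph was run. */
static int
ex_run_one(struct ex_periph *p)
{
    struct ex_ccb ccb;

    if (!p->scheduled)
        return 0;
    p->scheduled = 0;
    ex_periph_start(p, &ccb);
    return 1;
}
```

Because the driver re-schedules itself from inside its start routine, it consumes one opening per callback and then rejoins the queue, which is what lets other peripheral instances at the same priority take their turn.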
The camqs at the device and controller level provide for the initial allocation of resources, but how does the system handle an error? The goal of the XPT is to ensure that transaction ordering is maintained to each device even in the event of an error. To do this, we must introduce one more camq, this time holding CCBs that have been allocated, but are waiting to be processed by the SIM. In the common case, this queue is empty. After all, the XPT has gone through a somewhat complicated allocation scheduler to ensure that the SIM and device have the resources necessary to handle an allocated CCB. When an error occurs, however, this low level CCB queue is placed into the "frozen" state, and any CCBs affected by the error condition are returned to the peripheral driver. For every CCB returned, the "frozen" count of the CCB queue is incremented. Only once the count is reduced to 0 by the peripheral driver are CCBs allowed to flow to the SIM. In order to handle the error, the peripheral driver will likely re-queue the affected transactions, send a high priority error recovery CCB to address the condition, and then release the queue. As the CCB queue is priority ordered, the high priority recovery action automatically becomes the next transaction to run. The fact that the frozen count is incremented for each returned CCB ensures that the order in which the transactions are returned to the peripheral driver and requeued doesn't affect transaction replay, as all transactions must be addressed (e.g. requeued into the CCB queue) before the count will go to zero and release the queue.
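The freeze counting rule can be modeled with a few lines of C. The sketch below tracks only counts, not priority order, and every name in it is hypothetical: ex_ccbq_freeze stands for the increment applied once per CCB returned with an error, and ex_ccbq_release for the peripheral driver's matching release.

```c
#include <assert.h>

struct ex_ccbq {
    int frozen;                     /* outstanding freezes */
    int queued;                     /* CCBs waiting for the SIM */
    int sent;                       /* CCBs handed to the SIM so far */
};

/* Let CCBs flow to the SIM, but only while the queue is not frozen. */
static void
ex_ccbq_run(struct ex_ccbq *q)
{
    while (q->frozen == 0 && q->queued > 0) {
        q->queued--;
        q->sent++;
    }
}

/* Applied once for every CCB returned to the peripheral driver. */
static void
ex_ccbq_freeze(struct ex_ccbq *q)
{
    q->frozen++;
}

/* The driver has dealt with one returned CCB. */
static void
ex_ccbq_release(struct ex_ccbq *q)
{
    assert(q->frozen > 0);
    if (--q->frozen == 0)
        ex_ccbq_run(q);
}

/* Queue a CCB; it is dispatched at once unless the queue is frozen. */
static void
ex_ccbq_enqueue(struct ex_ccbq *q)
{
    q->queued++;
    ex_ccbq_run(q);
}
```

With two CCBs returned after an error, the queue stays closed until the driver has released it twice, so a recovery CCB and both requeued transactions are all in place before anything reaches the SIM.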