The Design and Implementation of the FreeBSD SCSI Subsystem
Justin Gibbs
gibbs@FreeBSD.org
$Header: //depot/cam/doc/cam.sgml#5$
The
Small Computer System Interface
protocol
offers a high performance interconnect to a wide range of
peripherals. This paper describes the implementation of a SCSI
framework for the FreeBSD Operating System. This framework expands
on the
Common Access Method
specification for
SCSI, providing round-robin prioritized transaction queuing,
guaranteed transaction ordering even during error recovery, and a
straightforward error recovery model that increases system
robustness. Services are provided to both controller (SCSI
to Host adapter) and peripheral drivers that allow the system to
take full advantage of advanced SCSI features while building on
common code. Strategies used to ensure wide SCSI device
compatibility are also discussed.
Introduction
Since the inception of the
Small Computer System
Interface
protocol, system programmers have faced several
challenges in implementing software to communicate with SCSI
devices. Although presented as a series of ANSI ratified standards,
extensive knowledge of
standardized SCSI
isn't
sufficient to develop a robust and efficient SCSI subsystem. The
interpretations of the SCSI standards are as varied as the vendors
manufacturing SCSI devices, increasing the difficulty of designing a
system that offers wide compatibility. Even without the specter of
inconsistent vendor implementations, the challenges of reliable error
recovery, efficient and fair use of bus and device resources, and
exporting a clean API to applications remove the possibility of a
simple SCSI implementation.
Recognizing the difficulties inherent in supporting SCSI, the ANSI
body developed the
Common Access Method
specification. CAM provides a formal description of the interfaces
in a SCSI subsystem. These interfaces are tools that can be used to
build a robust SCSI framework, but CAM offers few implementation
guidelines and often lacks information on the common SCSI
specification violations that must be tolerated to achieve
inter-operability. What is a good strategy for detecting
non-compliant devices? What commands issued in what order should
be used to probe for devices? What are the common problems
encountered in supporting tagged queuing? These questions, and many
more, are left for the implementer to answer. This paper seeks to
document how one particular CAM implementation, written for the
FreeBSD Operating System, fills in the details missing from the
formal CAM specification.
A CAM system is composed of three basic
components. At the highest
level we find the peripheral drivers. Peripheral drivers serve as
the interface between Operating System services and CAM actions. A
CAM system often provides peripheral drivers for tape, disk, CD-ROM,
or other device types. Applications running on the system use
generic Operating System services (read, write, ioctl) to access
peripheral devices without needing to know the specific
SCSI commands required to perform these actions.
Components of a CAM system
While a peripheral driver understands how to convert user actions
into SCSI commands, it knows very little about where particular
devices reside in the system. Peripheral drivers rely on the
Transport Layer (XPT) to provide these routing services. The XPT
notifies registered peripheral drivers of device arrival, departure,
and other asynchronous events, transports SCSI and administrative
commands in the form of
CAM Control Blocks
(CCBs)
to the appropriate hardware, and provides a round-robin prioritized
scheduler for CCBs pertaining to I/O operations. As peripheral
drivers provide an abstraction layer between SCSI and Operating
System services, the XPT layer provides an abstraction layer between
SCSI protocol consumers (the peripheral drivers) and hardware support
modules that manage SCSI transactions.
The final component in this system is the
SCSI Interface
Module
(SIM). SIMs are tasked with controlling host bus
adapter hardware and converting CCBs into the appropriate hardware
actions. SIMs are also responsible for detecting error conditions,
performing some aspects of error recovery, and exporting accurate
error information through the XPT layer to the peripheral driver.
Starting with the Transport Layer, the data structures and strategies
used to implement each of these components will be discussed.
Special focus will be given to the areas of deterministic error
recovery, guaranteed transaction priority and ordering, handling of
non-conformant devices, and the tools provided by CAM that are used
to implement these features in peripheral and SIM drivers.
The Transport Layer
The
Ring Master
of the CAM subsystem is the Transport
Layer (XPT). Charged with the important tasks of tracking
peripheral, controller, and device presence, CAM system-wide resource
management, and the routing of commands between the different CAM
layers, the XPT is the source of most device independent
functionality in CAM. The XPT's role begins early in system startup
when peripheral and controller drivers are attached, SCSI devices are
probed, and the state of the CAM system recorded in the
Existing Device Table
(EDT). Once the topology
of the system is known, the XPT can assume its role as a router. The
XPT uses its routing services to provide guaranteed delivery of
transactions between peripheral driver instances and controller
drivers. Even more critical to the system's proper functioning and
performance is the XPT's task as a resource manager. The XPT
schedules the allocation of device, controller, and XPT resources for
all I/O transactions using a round-robin per priority level approach
that guarantees fairness.
The Existing Device Table
The active components in the CAM system are recorded in the
Existing Device Table
(EDT). Construction of the
table begins during system initialization as SCSI
Host Bus
Adapters
(HBA) are detected and the SIMs managing these
HBAs register individual HBA busses with the XPT. During its
lifetime, the EDT will contain temporary device entries used during
device probing, and also handle the creation and deletion of more
permanent entries due to
hot-plug
events. The
design goals of the EDT are to provide a simple, space efficient,
and easily modified representation of the topology of the CAM
system.
The XPT Existing Device Table
The EDT is composed of four data structures arranged into a tree of
linked lists. These structures are the CAM Existing Bus (cam_eb),
the CAM Existing Target (cam_et), the CAM Existing Device (cam_ed),
and the CAM Peripheral (cam_periph). The tree begins with a list
of
cam_eb
objects representing the busses
on HBAs in the system. Busses in CAM are assigned a 32 bit value
known as a
path_id
. The
path_id
determines the order in which the busses are scanned for
sub-devices; its value is determined by a combination of
registration order and
hard wiring
information
specified by the user when compiling the kernel. The
cam_eb
list is sorted in ascending order
by
path_id
. Beyond its role as an enumerator of a
bus, the
cam_eb
also records the
XPT clients that have requested notification of particular
asynchronous events that occur on a given bus.
Each
cam_eb
contains a list, sorted in
ascending order by target id, of active targets on the bus. Each
node is represented by a
cam_et
structure.
Unlike the
cam_eb
, the
cam_et
holds no additional system
meta-data (e.g. async notification lists).
Descending down another level, each
cam_et
contains a list of all active devices, sorted by logical unit
number, that are attached to that target. These nodes, represented
by
cam_ed
objects, are the largest of all
entries in the EDT. It is here that the XPT keeps almost all
information required to perform its tasks. Among this information
is a queue of pending CCBs for the device, an async notification
registration list, a snapshot of the SCSI inquiry data recorded
when the device was probed, and a list of
quirks
that indicate special features or SCSI compliance issues that
require special
actions to properly communicate with this device.
To complete the topological picture of the CAM system, the EDT
also records peripheral instances that have attached to a given
device. This is performed using a list of
cam_periph
objects, sorted in attach
order, on the
cam_ed
. This information is
primarily used by applications that know a peripheral instance
that has attached to a given device, but need to communicate to
that device via a different peripheral instance that has also been
attached. For instance, a program that deals with
CDROM
devices may wish to open the application
pass through device associated with
cd0
, the first
CDROM
driver instance in the system, in order to
send SCSI commands directly to the device. This program would
perform an EDT lookup based on
cd0
to determine if
a pass-through instance (e.g. pass25) is attached to the underlying
device.
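The tree-of-lists layout described above can be sketched in plain C. This is a minimal userland illustration, not the kernel's definitions: the struct names (edt_bus, edt_target, edt_device, edt_periph) stand in for cam_eb, cam_et, cam_ed, and cam_periph, and every field name and helper function is an assumption made for the example. The real cam_ed also carries the CCB queue, async registrations, inquiry data, and quirks, all omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for cam_periph, cam_ed, cam_et and cam_eb.
 * Field names are assumptions, not the kernel's actual layout. */
struct edt_periph {
    const char        *name;    /* driver name, e.g. "cd" or "pass" */
    unsigned           unit;    /* instance number, e.g. 0 for cd0 */
    struct edt_periph *next;    /* next instance, in attach order */
};

struct edt_device {
    uint32_t           lun;
    struct edt_periph *periphs; /* peripheral instances on this device */
    struct edt_device *next;    /* next device, ascending lun */
};

struct edt_target {
    uint32_t           target_id;
    struct edt_device *devices;
    struct edt_target *next;    /* next target, ascending target id */
};

struct edt_bus {
    uint32_t           path_id;
    struct edt_target *targets;
    struct edt_bus    *next;    /* next bus, ascending path_id */
};

/* Link a bus into the list so it stays sorted by ascending path_id. */
static void
edt_bus_insert(struct edt_bus **head, struct edt_bus *bus)
{
    assert(head != NULL && bus != NULL);
    while (*head != NULL && (*head)->path_id < bus->path_id)
        head = &(*head)->next;
    bus->next = *head;
    *head = bus;
}

/* Walk bus -> target -> device lists; NULL when no such device exists. */
static struct edt_device *
edt_find_device(struct edt_bus *buses, uint32_t path_id,
    uint32_t target_id, uint32_t lun)
{
    struct edt_bus *b;
    struct edt_target *t;
    struct edt_device *d;

    for (b = buses; b != NULL; b = b->next) {
        if (b->path_id != path_id)
            continue;
        for (t = b->targets; t != NULL; t = t->next) {
            if (t->target_id != target_id)
                continue;
            for (d = t->devices; d != NULL; d = d->next)
                if (d->lun == lun)
                    return d;
        }
    }
    return NULL;
}

/* Find another peripheral instance (e.g. "pass") on the same device. */
static struct edt_periph *
edt_find_periph(struct edt_device *dev, const char *name)
{
    struct edt_periph *p;

    for (p = dev->periphs; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

In terms of the CD-ROM example, a program holding cd0 would locate the device node and then call something like edt_find_periph(dev, "pass") to discover the pass-through instance attached to the same device.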
Object lifetime in the CAM EDT is managed by a mixture of explicit
object removal and reference counting. No entry at any level in
the table may be removed unless its reference count has gone to
zero and all objects below it in the tree have been removed. In
addition to this requirement, bus objects are persistent from the
time they are advertised to the XPT via
xpt_bus_register
until a SIM shuts down and
uses
xpt_bus_deregister
to explicitly remove
the bus instance. Device objects share similar semantics. Until a
device has been successfully probed, its corresponding
cam_ed
object is marked as
unconfigured
by setting the
CAM_DEV_UNCONFIGURED
flag. If the probe is
successful, the flag is cleared and the device node will remain
even when its reference count goes to zero. This allows the
system to place temporary device nodes in the EDT while it queues
transactions to determine the device's existence. Device nodes are
considered configured until an invalidation event such as a
selection timeout occurs.
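The removal rule for device nodes can be expressed as a small predicate. The sketch below is an assumption-laden model rather than the kernel code: DEV_UNCONFIGURED stands in for CAM_DEV_UNCONFIGURED, the struct and function names are hypothetical, and treating an invalidation event as simply setting the flag again is a simplification made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define DEV_UNCONFIGURED 0x1u   /* stand-in for CAM_DEV_UNCONFIGURED */

/* Hypothetical, reduced view of a device node's lifetime state. */
struct dev_node {
    unsigned refcount;  /* outstanding references to this node */
    unsigned children;  /* objects still below this node in the tree */
    unsigned flags;
};

/* A node may go away only when nothing references it, nothing hangs
 * below it, and it is not a configured (successfully probed) device. */
static bool
dev_node_removable(const struct dev_node *d)
{
    return d->refcount == 0 && d->children == 0 &&
        (d->flags & DEV_UNCONFIGURED) != 0;
}

/* Successful probe: the node becomes persistent. */
static void
dev_probe_succeeded(struct dev_node *d)
{
    d->flags &= ~DEV_UNCONFIGURED;
}

/* Invalidation (e.g. selection timeout): the node is temporary again. */
static void
dev_invalidate(struct dev_node *d)
{
    d->flags |= DEV_UNCONFIGURED;
}

/* Drop one reference; returns true when the caller should free the node. */
static bool
dev_release(struct dev_node *d)
{
    assert(d->refcount > 0);
    d->refcount--;
    return dev_node_removable(d);
}
```

A temporary node created for probing disappears as soon as its last reference is dropped, while a probed device survives a zero reference count until it is invalidated.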
XPT Routing
The EDT contains all of the information required to determine a
route based on the path_id, target, and lun tuple easily generated
elsewhere in the system. To do this, we perform a linear search of
first all busses in the system, followed by all targets of a bus,
and finally all devices attached to a target. Every time a command
is issued from a peripheral driver, it must be directed to the
proper device. As this can occur hundreds of times per second,
for each device in the system, the XPT layer must complete this
task quickly and efficiently. The linear search characteristics of
the EDT, even though the number of elements in the table is usually
small, do not provide the routing speed we desire. While using
lists as the primary data structure for the EDT provides a compact
representation, it makes it impractical to use the EDT for fast
routing.
To solve the routing problem, the FreeBSD CAM layer takes advantage
of the relatively long lifetime of most device-to-peripheral
associations. A disk, for example, that is discovered at boot
time will receive its commands from the
same peripheral drivers
until the system is halted, or a hardware failure or other
asynchronous event causes this link to be severed. This presents
an obvious opportunity to cache routing path information to avoid
EDT lookups. FreeBSD CAM stores this cached information in
cam_path
objects. A
cam_path
is an opaque object that contains
not only the bus, target, lun tuple from which it was created, but
also direct pointers into the EDT to provide immediate access to all EDT
objects composing the route from peripheral to device. It may seem
that, as the EDT objects themselves contain tuple information,
keeping tuple information in the
cam_path
is redundant, but paths may need to be exported to applications
running in the user address space or even a remote system, where
access to the EDT information is not possible. This forces the
requirement of a portable representation of the tuple.
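The caching idea can be illustrated with a reduced model in which a path object pays for one lookup at creation time and routes every later command through a stored pointer. All names here (edt_entry, path_sketch, path_create, path_route_command) are hypothetical, and the EDT is flattened into an array purely to keep the example short; the real cam_path is opaque and points at bus, target, and device objects in the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flattened, hypothetical EDT entry: one row per device. */
struct edt_entry {
    uint32_t path_id, target_id, lun;
    unsigned cmds_routed;       /* commands delivered to this device */
};

/* Sketch of a cached route: the portable tuple plus a direct pointer. */
struct path_sketch {
    uint32_t          path_id, target_id, lun;
    struct edt_entry *device;   /* cached result of the EDT search */
};

static unsigned edt_searches;   /* counts how often we pay for a search */

/* The slow path: a linear search of the table. */
static struct edt_entry *
edt_search(struct edt_entry *tab, size_t n, uint32_t path_id,
    uint32_t target_id, uint32_t lun)
{
    size_t i;

    edt_searches++;
    for (i = 0; i < n; i++)
        if (tab[i].path_id == path_id &&
            tab[i].target_id == target_id && tab[i].lun == lun)
            return &tab[i];
    return NULL;
}

/* Resolve the tuple once and remember where it led; -1 if not found. */
static int
path_create(struct path_sketch *p, struct edt_entry *tab, size_t n,
    uint32_t path_id, uint32_t target_id, uint32_t lun)
{
    struct edt_entry *d = edt_search(tab, n, path_id, target_id, lun);

    if (d == NULL)
        return -1;
    p->path_id = path_id;
    p->target_id = target_id;
    p->lun = lun;
    p->device = d;
    return 0;
}

/* The fast path: no search, just follow the cached pointer. */
static void
path_route_command(struct path_sketch *p)
{
    assert(p->device != NULL);
    p->device->cmds_routed++;
}
```

The tuple fields remain in the path so that it can still be described to a consumer that cannot dereference the cached pointer, such as a user-space application.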
Anatomy of a
cam_path
Not all routing tasks involve a single destination. This is often
the case for asynchronous events. When a bus reset occurs, for
example, the event applies to all devices attached to the bus where
the reset originated. To handle these kinds of situations,
cam_path
components may be wild carded.
Wild card components have NULL pointers into the EDT and use the
CAM_BUS_WILDCARD, CAM_TARGET_WILDCARD, and CAM_LUN_WILDCARD
constants for tuple values. Although the EDT is not efficient for
performing point to point routing operations, its hierarchical
list format reduces the
broadcasting
associated with
wild cards to a simple list traversal with all list components
being of interest.
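A wild carded delivery then amounts to walking the lists below the fixed components and skipping nothing at the wild carded levels. The following sketch assumes hypothetical struct names and uses UINT32_MAX as a placeholder value for the wildcard constants; the actual values of CAM_TARGET_WILDCARD and CAM_LUN_WILDCARD are not taken from the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values standing in for CAM_TARGET_WILDCARD and
 * CAM_LUN_WILDCARD; the real constants may differ. */
#define TARGET_WILDCARD UINT32_MAX
#define LUN_WILDCARD    UINT32_MAX

struct bc_device {
    uint32_t          lun;
    unsigned          events;   /* async events delivered so far */
    struct bc_device *next;
};

struct bc_target {
    uint32_t          target_id;
    struct bc_device *devices;
    struct bc_target *next;
};

struct bc_bus {
    uint32_t          path_id;
    struct bc_target *targets;
};

/* Deliver an event to every device on the bus that matches the
 * (possibly wild carded) target and lun; returns the delivery count. */
static unsigned
bus_broadcast(struct bc_bus *bus, uint32_t target_id, uint32_t lun)
{
    struct bc_target *t;
    struct bc_device *d;
    unsigned delivered = 0;

    assert(bus != NULL);
    for (t = bus->targets; t != NULL; t = t->next) {
        if (target_id != TARGET_WILDCARD && t->target_id != target_id)
            continue;
        for (d = t->devices; d != NULL; d = d->next) {
            if (lun != LUN_WILDCARD && d->lun != lun)
                continue;
            d->events++;
            delivered++;
        }
    }
    return delivered;
}
```

A bus reset corresponds to calling bus_broadcast with both components wild carded, which visits every node exactly once.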
XPT Resource Management
Perhaps the most complicated task performed by the XPT layer is
that of resource management. The resources in a CAM system that
the XPT is concerned with fall into three areas. The most obvious
resource is system RAM, which is used to allocate the
CAM Control Blocks
(CCBs) that represent
transactions as they pass through the CAM system. Often more
constraining are the limits imposed by the number of transactions a
particular device can handle simultaneously. This paper will often
refer to this resource as
device openings
. The last
resource of concern for the XPT is the number of transactions a
particular SIM can queue at a given time, known as
controller openings
. Not only must the XPT allocate
these resources in a fair manner based on transaction priority, but
it must also ensure that resource allocation allows for proper
error recovery and that transaction order is consistent. Luckily,
a unified approach can be used to deal with all three types of resources.
The basic data structure used by the XPT layer to handle resource
allocation and scheduling is the
camq
. The
camq
is a heap based priority queue that
sorts its entries by a 32 bit priority value, and a generation
count that is incremented and applied by the client of the queue to
entries as they are inserted. The generation
count combined with
the two-key sort ensures that no two entries in the queue have the
exact same overall priority and that round-robin per priority level
scheduling is enforced. Heap based queues offer
O(log_2 n) running time for both insertion and
deletion, but in the very common case in CAM where, due to a fixed
priority and an always increasing generation count, almost all new
entries in the queue sort just behind (have a slightly larger key than) the other
entries in the queue, insertion becomes O(1).
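The two-key ordering and the cheap common-case insertion can be seen in a small binary heap. This is a sketch under stated assumptions, not the camq implementation: the names are hypothetical, a numerically lower priority value is assumed to run first, the queue stamps the generation itself rather than leaving that to its client, and generation wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PQ_CAPACITY 64

struct pq_entry {
    uint32_t priority;      /* assumed: lower value runs first */
    uint32_t generation;    /* insertion stamp, breaks priority ties */
};

struct pq {
    struct pq_entry heap[PQ_CAPACITY];
    size_t          count;
    uint32_t        next_generation;
};

/* Two-key comparison: priority first, then older generation first. */
static int
pq_before(const struct pq_entry *a, const struct pq_entry *b)
{
    if (a->priority != b->priority)
        return a->priority < b->priority;
    return a->generation < b->generation;
}

static void
pq_swap(struct pq_entry *a, struct pq_entry *b)
{
    struct pq_entry tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Insert at the bottom and sift up. Returns the number of swaps that
 * were needed, or -1 if the queue is full. */
static int
pq_insert(struct pq *q, uint32_t priority)
{
    size_t i, parent;
    int swaps = 0;

    if (q->count == PQ_CAPACITY)
        return -1;
    i = q->count++;
    q->heap[i].priority = priority;
    q->heap[i].generation = q->next_generation++;
    while (i > 0) {
        parent = (i - 1) / 2;
        if (!pq_before(&q->heap[i], &q->heap[parent]))
            break;          /* already in place: the O(1) case */
        pq_swap(&q->heap[i], &q->heap[parent]);
        i = parent;
        swaps++;
    }
    return swaps;
}

/* Remove the entry that should run next; -1 if the queue is empty. */
static int
pq_pop(struct pq *q, struct pq_entry *out)
{
    size_t i = 0, l, r, best;

    if (q->count == 0)
        return -1;
    *out = q->heap[0];
    q->heap[0] = q->heap[--q->count];
    for (;;) {
        l = 2 * i + 1;
        r = l + 1;
        best = i;
        if (l < q->count && pq_before(&q->heap[l], &q->heap[best]))
            best = l;
        if (r < q->count && pq_before(&q->heap[r], &q->heap[best]))
            best = r;
        if (best == i)
            break;
        pq_swap(&q->heap[i], &q->heap[best]);
        i = best;
    }
    return 0;
}
```

Entries inserted at one fixed priority never need a swap because each new generation sorts after its parent, while an entry with a more urgent priority climbs toward the root. Popping entries of equal priority returns them in insertion order, which is what yields round-robin behavior within a priority level.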
Perhaps the easiest way to see how the
camq
is employed by the XPT is to track a
typical peripheral request through the system. In this scenario,
we are dealing with SCSI command or similar requests which by their
nature consume CAM and HBA resources in order to be fulfilled.
When a peripheral driver receives a request from the system (read,
write, ioctl), it queues the request in a manner specific to the
peripheral driver, and calls
xpt_schedule
with
a priority appropriate for the transaction. In response to the
xpt_schedule
call, the XPT places the
peripheral instance into a
camq
of
cam_periph
objects waiting for resources
on the target device. If the peripheral driver was already queued
at this level, the XPT simply checks to see if the priority of the
peripheral has been increased and updates the queue accordingly.
At this point, we have provided a round-robin queue at the
peripheral level to consume device openings. To address controller
level resources, we maintain a
camq
of
devices awaiting common controller resources. Each device is
assigned the priority of the highest priority peripheral in its
queue. When controller resources become available, the highest
priority device is dequeued, the highest priority peripheral is
dequeued, and the peripheral's registered start routine is called
with an allocated CCB as an argument. The peripheral driver
dequeues its highest priority transaction, fills out the passed in
CCB based on this transaction, and sends the CCB off to be routed
to the SIM with a call to
xpt_action
. Before
returning from its start routine, the peripheral driver determines
if it has other transactions waiting, and if so, calls
xpt_schedule
again. As the call stack
unwinds, the process of
re-scheduling
continues with
the device requesting itself to be queued again if it still has
resources available and peripheral instances requesting those
resources.
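The schedule, start, and re-schedule cycle can be modeled with a toy simulation. The sketch below makes several simplifying assumptions: all peripherals share one priority, so a FIFO stands in for the camq; only device openings are modeled and the controller-level queue is left out; and the function names (sketch_schedule, sketch_start, sketch_run_queue) are hypothetical rather than the XPT's own entry points.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PERIPHS 4
#define LOG_SIZE    16

struct sk_periph {
    int id;
    int pending;    /* transactions queued inside the driver */
    int scheduled;  /* nonzero while waiting in the device queue */
};

struct sk_device {
    int               openings;             /* transactions still accepted */
    struct sk_periph *waitq[MAX_PERIPHS];   /* FIFO standing in for a camq */
    size_t            head, count;
    int               log[LOG_SIZE];        /* periph ids, in start order */
    size_t            nlog;
};

/* Ask for resources: queue the peripheral unless it is already waiting
 * or has nothing to send. */
static void
sketch_schedule(struct sk_device *dev, struct sk_periph *p)
{
    if (p->scheduled || p->pending == 0)
        return;
    assert(dev->count < MAX_PERIPHS);
    dev->waitq[(dev->head + dev->count) % MAX_PERIPHS] = p;
    dev->count++;
    p->scheduled = 1;
}

/* Start routine: consume one opening, send one transaction, and ask to
 * be scheduled again if more work is waiting. */
static void
sketch_start(struct sk_device *dev, struct sk_periph *p)
{
    p->pending--;
    dev->openings--;
    if (dev->nlog < LOG_SIZE)
        dev->log[dev->nlog++] = p->id;
    sketch_schedule(dev, p);
}

/* Hand out openings to waiting peripherals until either runs out. */
static void
sketch_run_queue(struct sk_device *dev)
{
    struct sk_periph *p;

    while (dev->openings > 0 && dev->count > 0) {
        p = dev->waitq[dev->head];
        dev->head = (dev->head + 1) % MAX_PERIPHS;
        dev->count--;
        p->scheduled = 0;
        sketch_start(dev, p);
    }
}
```

With two peripherals that each hold two transactions and a device offering three openings, the start order alternates between them, and the peripheral left with work stays queued until an opening is returned.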
The
camq
s at the device and controller
level provide for the initial allocation of resources, but how does
the system handle an error? The goal of the XPT is to ensure that
transaction ordering is maintained to each device even in the event
of an error. To do this, we must introduce one more
camq
, this time holding CCBs that have
been allocated, but are waiting to be processed by the SIM. In the
common case, this queue is empty. After all, the XPT has gone
through a somewhat complicated allocation scheduler to ensure that
the SIM and device have the resources necessary to handle an
allocated CCB. When an error occurs, however, this low level CCB
queue is placed into the
frozen
state, and any CCBs
affected by the error condition are returned to the peripheral driver.
For every CCB returned, the
frozen
count of the CCB
queue is incremented. Only once the count is reduced to 0 by the
peripheral driver are CCBs allowed to flow to the SIM again. In order
to handle the error, the peripheral driver will likely re-queue the
affected transactions, send a high priority error recovery CCB to
address the condition, and then release the queue. As the CCB
queue is priority ordered, the high priority recovery action
automatically becomes the next transaction to run. The fact that
the frozen count is incremented for each returned CCB ensures that
the order in which the transactions are returned to the peripheral
driver and requeued doesn't affect transaction replay, as all
transactions must be addressed (e.g. requeued into their source CCB
queue) before the count will go to zero and release the queue.
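The freeze-count protocol can be sketched as a queue that refuses to dispatch while its count is nonzero. The model below is illustrative only: the names are hypothetical, CCBs are reduced to bare priority values with lower values assumed to run first, and dispatch to the SIM is represented by appending to a log.

```c
#include <assert.h>
#include <stddef.h>

#define QMAX 8

struct ccbq_sketch {
    unsigned frozen;        /* dispatch is blocked while this is > 0 */
    int      pending[QMAX]; /* priorities of queued CCBs, arrival order */
    size_t   npending;
    int      sent[QMAX];    /* priorities in the order they were sent */
    size_t   nsent;
};

/* Dispatch queued CCBs, most urgent first, unless the queue is frozen.
 * Ties keep arrival order because the first minimum is chosen and the
 * remaining entries are shifted down without reordering. */
static void
ccbq_run(struct ccbq_sketch *q)
{
    size_t best, i;

    while (q->frozen == 0 && q->npending > 0) {
        best = 0;
        for (i = 1; i < q->npending; i++)
            if (q->pending[i] < q->pending[best])
                best = i;
        assert(q->nsent < QMAX);
        q->sent[q->nsent++] = q->pending[best];
        for (i = best + 1; i < q->npending; i++)
            q->pending[i - 1] = q->pending[i];
        q->npending--;
    }
}

/* Queue a CCB; it is sent immediately only if the queue is not frozen. */
static void
ccbq_insert(struct ccbq_sketch *q, int priority)
{
    assert(q->npending < QMAX);
    q->pending[q->npending++] = priority;
    ccbq_run(q);
}

/* Called once for every CCB returned with an error. */
static void
ccbq_freeze(struct ccbq_sketch *q)
{
    q->frozen++;
}

/* Called by the peripheral driver once it has dealt with one CCB. */
static void
ccbq_release(struct ccbq_sketch *q)
{
    assert(q->frozen > 0);
    q->frozen--;
    ccbq_run(q);
}
```

If two CCBs come back with errors, the count is two; the driver can requeue the first and release once without anything reaching the SIM, then queue a recovery CCB, requeue the second, and release again. Only that last release lets the queue run, and the recovery CCB goes out ahead of the replayed transactions because of its priority.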