Writing a GEOM Class

Ivan Voras

<ivoras@yahoo.com>

FreeBSD is a registered trademark of the FreeBSD Foundation.

CVSup is a registered trademark of John D. Polstra.

Intel, Celeron, EtherExpress, i386, i486, Itanium, Pentium, and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

XFree86 is a trademark of The XFree86 Project, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this document, and the FreeBSD Project was aware of the trademark claim, the designations have been followed by the “™” or the “®” symbol.

This text documents the way I created the gjournal facility, starting with learning how to do kernel programming. It's assumed the reader is familiar with C userland programming.

Table of Contents
1 Introduction
2 Preliminaries
3 On FreeBSD kernel programming
4 On GEOM programming

1 Introduction

1.1 Documentation

Documentation on kernel programming is scarce - it's one of the few areas where there's nearly nothing in the way of friendly tutorials, and the phrase “use the source!” really holds true. However, there are some bits and pieces (some of them seriously outdated) floating around that should be studied before beginning to code:

The Developer's Handbook - part of the documentation project, it doesn't contain anything specific to kernel-land programming, but rather some general information.
FreeBSD Architecture Handbook - also from the documentation project, contains descriptions of several low-level facilities and procedures. The most important chapter is #13, Writing FreeBSD device drivers.
The Blueprints section of FreeBSD Diary web site - contains several interesting articles on kernel facilities.
The man pages in section 9 - most important kernel-land calls are documented here.
The geom(4) man page and PHK's GEOM slides - for a general introduction to the GEOM subsystem.
The style(9) man page - if the code is intended to go into the FreeBSD CVS tree.

2 Preliminaries

The best way to do kernel development is to have (at least) two separate computers. One of these would contain the development environment and sources, and the other would be used to test the newly written code by network-booting and network-mounting filesystems from the first one. This way, if the new code contains bugs and crashes the machine, it won't mess up the sources (and other “live” data). The second system doesn't even have to have a proper display - it could be connected to the first one with a serial cable or KVM.

But since not everybody has two or more computers handy, there are a few things that can be done to prepare an otherwise “live” system for developing kernel code.

2.1 Converting a system for development

For any kernel programming, a kernel with INVARIANTS enabled is a must. So enter these in your kernel configuration file:

  options INVARIANT_SUPPORT
  options INVARIANTS

For debugging crash dumps, a kernel with debug symbols is needed:

  makeoptions    DEBUG=-g

With the usual way of installing the kernel (make installkernel) the debug kernel will not be automatically installed. It's called kernel.debug and located in /usr/obj/usr/src/sys/KERNELNAME/. For convenience it should be copied to /boot/kernel/.

Another convenience is enabling the kernel debugger so you can examine a kernel panic when it happens. For this, enter the following lines in your kernel configuration file:

  options     KDB
  options     DDB
  options     KDB_TRACE

For this to work you might need to set a sysctl (if it's not on by default):

  debug.debugger_on_panic=1

Kernel panics will happen, so care should be taken with the filesystem cache. In particular, with softupdates enabled the latest version of a file could be lost if a panic occurs before it's committed to storage. Disabling softupdates causes a large performance hit (and still doesn't guarantee data consistency - mounting the filesystem with the “sync” option is needed for that), so as a compromise the cache delays can be shortened. There are three sysctls that are useful for this (best set in /etc/sysctl.conf):

  kern.filedelay=5
  kern.dirdelay=4
  kern.metadelay=3

The numbers represent seconds.

For debugging kernel panics, kernel core dumps are required. Since a kernel panic might make filesystems unusable, this crash dump is first written to a raw partition. Usually, this is the swap partition (it must be at least as large as the physical RAM in the machine). On the next boot (after filesystems are checked and mounted and before swap is enabled), the dump is copied to a regular file. This is controlled with two /etc/rc.conf variables:

  dumpdev="/dev/ad0s4b"
  dumpdir="/usr/core"

The dumpdev variable specifies the swap partition and dumpdir tells the system where in the filesystem to relocate the core dump on reboot.

Writing kernel core dumps is slow, so if you have lots of memory (>256M) and lots of panics it can be frustrating to sit and wait while it's done (twice - first to write it to swap, then to relocate it to the filesystem). It's convenient then to limit the amount of RAM the system will use via a /boot/loader.conf tunable:

  hw.physmem="256M"

If the panics are frequent and the filesystems large (or you simply don't trust softupdates+background fsck), it's advisable to turn background fsck off via an /etc/rc.conf variable:

  background_fsck="NO"

This way, the filesystems will always get checked when needed (with background fsck, a new panic could happen while it's checking the disks). Again, the safest way is not to have many local filesystems, by using another computer as an NFS server.

2.2 Starting the project

For the purpose of making gjournal, a new empty subdirectory was created under an arbitrary user-accessible directory. You don't have to create the module directory under /usr/src.

2.3 The Makefile

It's good practice to create Makefiles for every nontrivial coding project, which of course includes kernel modules.

Creating the Makefile is simple thanks to the extensive set of helper routines provided by the system. In short, here's how it looks:

  SRCS=g_journal.c
  KMOD=geom_journal

  .include <bsd.kmod.mk>

This Makefile (with changed filenames) will do for any kernel module. If more than one source file is required, list them all in the SRCS variable, separated by whitespace.

3 On FreeBSD kernel programming

3.1 Memory allocation

See malloc(9). Basic memory allocation is only slightly different from its userland equivalent. Most notably, malloc() and free() accept additional parameters, as described in the man page.

A “malloc type” must be declared in the declaration section of a source file, like this:

  static MALLOC_DEFINE(M_GJOURNAL, "gjournal data", "GEOM_JOURNAL Data");

To use the macro, the sys/param.h, sys/kernel.h and sys/malloc.h headers must be included.

There's another mechanism for allocating memory: UMA (Universal Memory Allocator). See uma(9) for details; it's a special type of allocator mainly used for fast allocation of lists composed of same-sized items (for example, dynamic arrays of structs).

3.2 Lists and queues

See queue(3). There are a LOT of cases when a list of things needs to be maintained. Fortunately, this data structure is implemented (in several ways) by C macros included in the system. The most used list type is TAILQ because it's the most flexible. It's also the one with the largest memory requirements (its elements are doubly-linked) and, theoretically, the slowest (though the difference is on the order of a few CPU instructions, so it shouldn't be taken seriously).
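As a rough illustration of the TAILQ macros, here is a small sketch that compiles in userland (the same macros are available in the kernel). The struct job type and the job_push(), job_pop() and job_count() helpers are made up for this example and are not taken from gjournal.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical queue element; the TAILQ_ENTRY field holds the links. */
struct job {
	int id;
	TAILQ_ENTRY(job) link;
};

/* Declares "struct job_queue", the head of a tail queue of jobs. */
TAILQ_HEAD(job_queue, job);

/* Append a new job to the end of the queue. */
void
job_push(struct job_queue *q, int id)
{
	struct job *j = malloc(sizeof(*j));

	assert(j != NULL);	/* a real module would handle this error */
	j->id = id;
	TAILQ_INSERT_TAIL(q, j, link);
}

/* Remove the oldest job and return its id, or -1 if the queue is empty. */
int
job_pop(struct job_queue *q)
{
	struct job *j = TAILQ_FIRST(q);
	int id;

	if (j == NULL)
		return (-1);
	TAILQ_REMOVE(q, j, link);
	id = j->id;
	free(j);
	return (id);
}

/* Count the queued jobs by walking the list from head to tail. */
int
job_count(struct job_queue *q)
{
	struct job *j;
	int n = 0;

	TAILQ_FOREACH(j, q, link)
		n++;
	return (n);
}
```

The queue head must be initialised with TAILQ_INIT(&q) before first use. Since the links live inside the element itself, an element can only be on one such queue per TAILQ_ENTRY field.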

If data retrieval speed is very important, see tree(3).

3.3 BIOs

The bio structure is used for any and all input/output operations concerning GEOM. It basically contains information about which device (“provider”) should satisfy the request, the request type, offset, length, a pointer to a buffer, and a bunch of “user-specific” flags and fields that can help implement various hacks.

The important thing here is that bios are dealt with asynchronously. That means that, in most parts of the code, there's no analogue to userland's read(2) and write(2) calls that don't return until a request is done. Rather, a developer-supplied function is called as a notification when the request gets completed (or results in error).

Unfortunately, the asynchronous programming model (also called "event-driven") imposed this way is somewhat harder than the much more used imperative one (at least it takes a while to get used to it). In some cases helper routines g_write_data() and g_read_data() can be used (NOT ALWAYS!).

4 On GEOM programming

4.1 Ggate

If maximum performance is not needed, a much simpler way of making a data transformation is to implement it in userland via the ggate (GEOM gate) facility. Unfortunately, there's no easy way to convert between, or even share code between the two approaches.

4.2 GEOM class

A GEOM class has several “class methods” that get called when there's no geom instance available (or that simply aren't bound to a single instance):

.init is called when GEOM becomes aware of a GEOM class (e.g. when the kernel module gets loaded)
.fini is called when GEOM abandons the class (e.g. when the module gets unloaded)
.taste is called after .init, once for each provider the system has available. If applicable, this function will usually create and start a geom instance.
.destroy_geom is called when the geom should be disbanded
.ctlreq is called when the user requests reconfiguration of an existing geom

Also defined are the GEOM event functions, which will get copied to the geom instance.

The .geom field in the g_class structure is a LIST of geoms instantiated from the class.

These functions are called from the g_event kernel thread.

4.3 Softc

The name “softc” is a legacy term for “driver private data”. The name most probably comes from the archaic term “software control block”. In GEOM, it's a structure (more precisely: a pointer to a structure) that can be attached to a geom instance to hold whatever data is private to the geom instance. In gjournal (and most of the other GEOM classes), some of its members are:

struct g_provider *provider : The “provider” this geom instantiates
uint16_t n_disks : Number of consumers this geom consumes
struct g_consumer **disks : Array of struct g_consumer*. (It's not possible to use just single indirection because the struct g_consumer objects are created on our behalf by GEOM.)

The softc structure contains all the state of a geom instance. Every geom instance has its own softc.

4.4 Metadata

Format of metadata is more-or-less class-dependent, but MUST start with:

16 byte buffer for null-terminated signature (usually the class name)
uint32 version ID

It's assumed that geom classes know how to handle metadata with version IDs lower than their own.

Metadata is located in the last sector of the provider (and thus must fit in it).

(All this is implementation-dependent but all existing code works like that, and it's supported by libraries.)
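To make the layout concrete, here is a userland sketch of the common metadata prefix and of where it lives on the provider. The names (struct md_header, md_init(), md_encode(), md_decode(), md_offset()) are invented for this example, and storing the version in little-endian byte order is an assumption on my part - check what the class you're working with actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MD_MAGIC_LEN	16	/* size of the signature buffer */

/* Hypothetical in-memory form of the common metadata prefix. */
struct md_header {
	char		magic[MD_MAGIC_LEN];	/* null-terminated signature */
	uint32_t	version;		/* metadata version ID */
};

/* Byte offset of the last sector, where the metadata is stored. */
int64_t
md_offset(int64_t mediasize, uint32_t sectorsize)
{
	assert(sectorsize > 0 && mediasize >= (int64_t)sectorsize);
	return (mediasize - sectorsize);
}

/* Fill in a header; the signature must fit together with its terminator. */
void
md_init(struct md_header *md, const char *magic, uint32_t version)
{
	assert(strlen(magic) < MD_MAGIC_LEN);
	memset(md, 0, sizeof(*md));
	memcpy(md->magic, magic, strlen(magic));
	md->version = version;
}

/* Serialize the header to the start of a sector buffer (little-endian assumed). */
void
md_encode(const struct md_header *md, unsigned char *sector)
{
	int i;

	memcpy(sector, md->magic, MD_MAGIC_LEN);
	for (i = 0; i < 4; i++)
		sector[MD_MAGIC_LEN + i] = (unsigned char)(md->version >> (8 * i));
}

/*
 * Parse a sector buffer. Returns 0 if the signature matches and the stored
 * version is not newer than ours, and -1 otherwise.
 */
int
md_decode(const unsigned char *sector, const char *magic,
    uint32_t our_version, struct md_header *md)
{
	int i;

	memcpy(md->magic, sector, MD_MAGIC_LEN);
	md->version = 0;
	for (i = 0; i < 4; i++)
		md->version |= (uint32_t)sector[MD_MAGIC_LEN + i] << (8 * i);
	if (md->magic[MD_MAGIC_LEN - 1] != '\0' || strcmp(md->magic, magic) != 0)
		return (-1);
	if (md->version > our_version)
		return (-1);
	return (0);
}
```

A .taste function would do the equivalent of md_decode() on the last sector of each provider it is offered, and simply decline providers for which the check fails.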

4.5 Labeling/creating a geom

The sequence of events is:

the user calls the geom(8) utility (or one of its hardlinked friends)
the utility figures out which geom class it's supposed to handle and searches for the geom_CLASSNAME.so library (usually in /lib/geom).
it dlopen(3)-es the library and extracts the definitions of command-line parameters and helper functions.

In the case of creating/labeling a new geom, this is what happens:

geom(8) looks in the command-line definition for the command (usually “label”) and calls a helper function.
the helper function checks parameters and gathers metadata, which it proceeds to write to all concerned providers.
this “spoils” existing geoms (if any) and initiates a new round of “tasting” of the providers. The intended geom class recognizes the metadata and brings the geom up.

(The above sequence of events is implementation-dependent but all existing code works like that, and it's supported by libraries.)

4.6 Geom command structure

The helper geom_CLASSNAME.so library exports the class_commands structure, which is an array of struct g_command elements. Commands have a uniform format and look like:

  verb [-options] geomname [other]

Common verbs are:

label - to write metadata to devices so they can be recognized at tasting and brought up in geoms
destroy - to destroy metadata, so the geoms get destroyed

Common options are:

-v : be verbose
-f : force

Many actions, such as labeling and destroying metadata, can be performed in userland. For this, struct g_command provides the gc_func field, which can be set to a function (in the same .so) that will be called to process a verb. If gc_func is NULL, the command will be passed to the kernel module, to the .ctlreq function of the geom class.
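The dispatch rule can be modelled in a few lines. This is only a sketch: struct cmd below is a cut-down, hypothetical stand-in for struct g_command that keeps just a verb name and a handler pointer, and the “stop” verb, the example_label() handler and the cmd_dispatch() function are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler type: processes one verb for the named geom. */
typedef int cmd_func_t(const char *geomname);

/* Cut-down stand-in for struct g_command (assumption, not the real layout). */
struct cmd {
	const char	*name;	/* the verb */
	cmd_func_t	*func;	/* userland handler, or NULL for "ask the kernel" */
};

enum cmd_result { CMD_USERLAND, CMD_KERNEL, CMD_UNKNOWN };

int label_calls;	/* counts how often the userland handler ran */

/* Example userland handler; a real one would write metadata to providers. */
int
example_label(const char *geomname)
{
	assert(geomname != NULL);
	label_calls++;
	return (0);
}

const struct cmd example_commands[] = {
	{ "label", example_label },	/* handled inside the library */
	{ "stop", NULL },		/* no handler: passed to the kernel */
	{ NULL, NULL }			/* end marker */
};

/* Find the verb and either run its handler or report that the kernel must. */
enum cmd_result
cmd_dispatch(const struct cmd *cmds, const char *verb, const char *geomname)
{
	const struct cmd *c;

	for (c = cmds; c->name != NULL; c++) {
		if (strcmp(c->name, verb) != 0)
			continue;
		if (c->func == NULL)
			return (CMD_KERNEL);
		c->func(geomname);
		return (CMD_USERLAND);
	}
	return (CMD_UNKNOWN);
}
```

In the real utility, the CMD_KERNEL case corresponds to packaging the verb and its arguments into a control request that ends up in the class's .ctlreq function.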

4.7 Geoms

Geoms are instances of geom classes. They have internal data (a softc structure) and some functions with which they respond to external events.

The event functions are:

.access : calculates permissions (read/write/exclusive)
.dumpconf : returns XML-formatted information about the geom
.orphan : called when some underlying provider gets disconnected
.spoiled : called when some underlying provider gets written to
.start : handles IO

These functions are called from the g_down kernel thread, and there can be no sleeping in this context (no blocking on a mutex or any other kind of lock). This limits what can be done quite a bit, but forces the handling to be fast.

Of these, the most important function for doing actual useful work is the .start() function, which is called when a BIO request arrives for a provider managed by an instance of the geom class.

4.8 Geom threads

There are three kernel threads created and run by the GEOM framework:

g_down : Handles requests coming from high-level entities (such as a userland request) on the way to physical devices
g_up : Handles responses from device drivers to requests made by higher-level entities
g_event : Handles all other cases: creation of geom instances, access counting, "spoil" events, etc.

When a user process issues a “read data X at offset Y of a file” request, this is what happens:

The filesystem converts the request into a struct bio instance and passes it to the GEOM subsystem. It knows which geom instance should handle it because filesystems are hosted directly on a geom instance.
The request ends up as a call to the .start() function made on the g_down thread and reaches the top-level geom instance.
This top-level geom instance (for example the partition slicer) determines that the request should be routed to a lower-level instance (for example the disk driver). It makes a copy of the bio request (bio requests ALWAYS need to be copied between instances, with g_clone_bio()!), modifies the data offset and target provider fields and executes the copy with g_io_request().
The disk driver gets the bio request also as a call to .start() on the g_down thread. It talks to the hardware, gets the data back, and calls g_io_deliver() on the bio.
Now the notification of bio completion “bubbles up” in the g_up thread. First the partition slicer gets .done() called in the g_up thread; it uses information stored in the bio to free the cloned bio structure (with g_destroy_bio()) and calls g_io_deliver() on the original request.
The filesystem gets the data and transfers it to userland.

See the g_bio(9) man page for information on how the data is passed back and forth in the bio structure (note in particular the bio_parent and bio_children fields and how they are handled).
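The clone-and-deliver sequence above can be modelled in userland. The sketch below is not kernel code: struct xbio is a much-reduced, hypothetical stand-in for struct bio, xbio_clone() and xbio_deliver() only imitate what g_clone_bio() and g_io_deliver() do, and the fake disk completes requests immediately instead of later on another thread. SLICE_START is an arbitrary partition offset chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, much-reduced stand-in for struct bio. */
struct xbio {
	int64_t		offset;		/* byte offset on the target provider */
	int64_t		length;		/* bytes requested */
	int64_t		completed;	/* bytes actually transferred */
	int		error;		/* 0 on success, errno value otherwise */
	struct xbio	*parent;	/* request this one was cloned from */
	void		(*done)(struct xbio *);	/* completion callback */
};

#define SLICE_START	4096	/* assumed start of the partition, in bytes */

int64_t last_disk_offset = -1;	/* what the fake disk was last asked for */

/* Imitates g_clone_bio(): copy the request and remember its parent. */
struct xbio *
xbio_clone(struct xbio *bp)
{
	struct xbio *cbp = malloc(sizeof(*cbp));

	if (cbp == NULL)
		return (NULL);
	*cbp = *bp;
	cbp->parent = bp;
	cbp->completed = 0;
	cbp->error = 0;
	cbp->done = NULL;
	return (cbp);
}

/* Imitates g_io_deliver(): record the outcome and notify the issuer. */
void
xbio_deliver(struct xbio *bp, int error)
{
	bp->error = error;
	if (bp->done != NULL)
		bp->done(bp);
}

/* The disk's .start(): pretend the hardware transferred everything. */
void
disk_start(struct xbio *bp)
{
	last_disk_offset = bp->offset;
	bp->completed = bp->length;
	xbio_deliver(bp, 0);
}

/* The slicer's .done(): pass the result up and free the clone. */
void
slice_done(struct xbio *cbp)
{
	struct xbio *pbp = cbp->parent;
	int error = cbp->error;

	assert(pbp != NULL);
	pbp->completed = cbp->completed;
	free(cbp);
	xbio_deliver(pbp, error);
}

/* The slicer's .start(): clone, shift the offset, send the clone down. */
void
slice_start(struct xbio *bp)
{
	struct xbio *cbp = xbio_clone(bp);

	if (cbp == NULL) {
		xbio_deliver(bp, ENOMEM);
		return;
	}
	cbp->offset += SLICE_START;
	cbp->done = slice_done;
	disk_start(cbp);	/* stands in for g_io_request() */
}
```

Note that the original request is never modified on the way down; only the clone carries the translated offset, and the original is touched again only when the result is delivered.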

One important feature is: THERE CAN BE NO SLEEPING IN G_UP AND G_DOWN THREADS. This means that none of the following things can be done in those threads (the list is of course not complete, but only informative):

Calls to msleep() and tsleep(), obviously.
Calls to g_write_data() and g_read_data(), because these sleep between passing the data to consumers and returning.
Calls to malloc(9) and uma_zalloc() with the M_WAITOK flag set
Use of sx(9) locks

This restriction is here to stop geom code from clogging the IO request path, because sleeping is usually not time-bound and there can be no guarantees on how long it will take (there are also some other, more technical reasons). It also means that there's not much that can be done in those threads; for example, almost any complex thing requires memory allocation. Fortunately, there is a way out: creating additional kernel threads.

4.9 Kernel threads for use in geom code

Kernel threads are created with the kthread_create(9) function, and they are somewhat similar to userland threads in behaviour, except that they can't return to the caller to signify termination, but must call kthread_exit(9).

In geom code, the usual use of threads is to offload processing of requests from the g_down thread (the .start() function). These threads look like “event handlers”: they have a linked list of events associated with them (on which events can be posted by various functions in various threads, so it must be protected by a mutex), take the events from the list one by one and process them in a big switch() statement.

The main benefit of using a thread to handle IO requests is that it can sleep when needed. This sounds good, but should be carefully thought out. Sleeping is all well and convenient, but it can very effectively destroy the performance of the geom transformation. Extremely performance-sensitive classes should probably do all the work in the .start() function call, taking great care to handle out-of-memory and similar errors.

The other benefit of having an event-handler thread like that is that it serializes all the requests and responses coming from different geom threads into one thread. This is also very convenient but can be slow. In most cases, handling of .done() requests can be left to the g_up thread.
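The shape of such an event-handler thread can be sketched in userland with a pthread mutex standing in for a kernel mutex. Everything here is hypothetical: the event types, struct worker, and the worker_init(), worker_post() and worker_drain() functions are invented for the illustration. A real kernel thread would sleep when the list is empty instead of returning, and would finish by calling kthread_exit(9).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

enum ev_type { EV_START, EV_DONE, EV_STOP };	/* hypothetical event types */

struct event {
	enum ev_type		type;
	TAILQ_ENTRY(event)	link;
};

struct worker {
	pthread_mutex_t		lock;	/* protects the event list */
	TAILQ_HEAD(, event)	queue;
	int			started;	/* EV_START events processed */
	int			done;		/* EV_DONE events processed */
};

void
worker_init(struct worker *w)
{
	pthread_mutex_init(&w->lock, NULL);
	TAILQ_INIT(&w->queue);
	w->started = w->done = 0;
}

/* May be called from any thread: append an event under the mutex. */
void
worker_post(struct worker *w, enum ev_type type)
{
	struct event *ev = malloc(sizeof(*ev));

	assert(ev != NULL);	/* a real module would handle this error */
	ev->type = type;
	pthread_mutex_lock(&w->lock);
	TAILQ_INSERT_TAIL(&w->queue, ev, link);
	pthread_mutex_unlock(&w->lock);
}

/*
 * Body of the worker loop: take events off the list one by one and
 * process them. Returns 1 if EV_STOP was seen, 0 if the list ran empty.
 */
int
worker_drain(struct worker *w)
{
	struct event *ev;
	int stop = 0;

	while (!stop) {
		/* Hold the mutex only while touching the list. */
		pthread_mutex_lock(&w->lock);
		ev = TAILQ_FIRST(&w->queue);
		if (ev != NULL)
			TAILQ_REMOVE(&w->queue, ev, link);
		pthread_mutex_unlock(&w->lock);
		if (ev == NULL)
			break;

		switch (ev->type) {
		case EV_START:
			w->started++;
			break;
		case EV_DONE:
			w->done++;
			break;
		case EV_STOP:
			stop = 1;
			break;
		}
		free(ev);
	}
	return (stop);
}
```

The event is processed after the mutex has been released, so the handler code is free to do slow work without blocking the threads that post events.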

Mutexes in the FreeBSD kernel (see the mutex(9) man page) have one distinction from their more common userland cousins - they disallow sleeping (meaning: the code can't sleep while holding a mutex). If the code needs to sleep a lot, sx(9) locks may be more appropriate. (On the other hand, if you do almost everything in a single thread, you may get away with no mutexes at all.)