"Geometry" - an idea.

Introduction

The handling of disklabels, slices and whatnot has until now been a rather ad-hoc proposition. Some lump of code somewhere did it, partly through magic, and that was that.

This is my proposal to make a structural framework for this area of a UNIX kernel.

The same "basic" kind of setup under "geometry" would look like this:

Now the SCSI and WD drivers just provide access to the hardware; they don't know anything about layout. Separate "methods" handle layout, slicing and partitioning.

The red circles mark "geometry" devices. These are the points in the graph which can be accessed from /dev/something or other.

A geometry device has the following public properties:

A name like "sd0", "sd0s1", "mirror1" or "foo". This is the name which will appear in /dev for this device.
A sector size. I/O must be done in transactions of an integral number of sectors of this size.
A size in number of sectors.
The name of the method providing this device.
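
As a rough illustration, the four public properties could be captured in a small record like the one below. This is only a sketch: the class name `GeometryDevice`, its field names and the `check_io` helper are made up for this example, and the requirement that the starting offset is also sector-aligned is an assumption that goes slightly beyond the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeometryDevice:
    """Hypothetical record of the public properties of a geometry device."""
    name: str         # appears as /dev/<name>, e.g. "sd0s1"
    sector_size: int  # bytes per sector
    sectors: int      # device size in sectors
    method: str       # name of the method providing this device

    def size_bytes(self) -> int:
        return self.sector_size * self.sectors

    def check_io(self, offset: int, length: int) -> None:
        """Raise ValueError unless the transfer is a whole number of sectors
        that lies inside the device (offset alignment is an assumption)."""
        if length <= 0 or length % self.sector_size:
            raise ValueError("length must be a positive multiple of the sector size")
        if offset < 0 or offset % self.sector_size:
            raise ValueError("offset must be sector aligned")
        if offset + length > self.size_bytes():
            raise ValueError("transfer runs past the end of the device")
```

A device declared as `GeometryDevice("sd0s1", 512, 2048, "MBR")` would accept a 1024-byte transfer at offset 0 and reject a 100-byte one.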

Now, let's look at a more advanced setup:

This is basically a machine with two mirrored disks. I will use this to illustrate an important concept of "geometry": on the fly insertion.

When the machine boots, let's say from sd0, we need to find a suitable root filesystem. Since we want to be backwards compatible, the MBR and BSD methods will be self-identifying, i.e. they will examine the available devices and instantiate themselves on those devices on which they find their respective magic sectors.
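
Self-identification could be sketched roughly as follows. The function names and the `devices` mapping are hypothetical. The MBR check uses the usual 0x55 0xAA signature in the last two bytes of sector 0; the BSD check looks for the disklabel magic number 0x82564557, and its position (start of sector 1, little-endian, as on i386) is an assumption of this sketch. A real implementation would also re-run the probes on the new devices each method creates, which is left out here.

```python
import struct

MBR_SIGNATURE = b"\x55\xaa"   # last two bytes of sector 0
BSD_DISKMAGIC = 0x82564557    # magic number at the start of a BSD disklabel


def mbr_probe(read_sector):
    """True if sector 0 ends with the MBR signature."""
    return read_sector(0)[510:512] == MBR_SIGNATURE


def bsd_probe(read_sector):
    """True if sector 1 starts with the disklabel magic (assumed i386 layout)."""
    (magic,) = struct.unpack_from("<I", read_sector(1), 0)
    return magic == BSD_DISKMAGIC


# Hypothetical registry of self-identifying methods.
PROBES = {"MBR": mbr_probe, "BSD": bsd_probe}


def self_identify(devices):
    """devices maps a device name to a read_sector(n) callable.
    Returns the (method, device) pairs whose magic sector was found."""
    found = []
    for name, read_sector in devices.items():
        for method, probe in PROBES.items():
            if probe(read_sector):
                found.append((method, name))
    return found
```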

So at the time when /sbin/init gets executed the picture looks like this:

So before we mount anything read/write, we want to activate the mirroring:

Dismantle the BSD method on sd1 (the top right box)
Dismantle the MBR method on sd1 (the one right below)
Dismantle the MBR method on sd0

Now, how and why can we do that? Well, in this case we use the "dangerously dedicated" mode. The MBR represents a 1:1 mapping in that case, and since it is transparent we can remove it without affecting the mounted filesystem.
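
To make the "transparent" argument concrete, here is a minimal sketch in which every method is modelled as a plain offset-and-size window onto the device below it. The names `Layer`, `resolve` and `dismantle` are invented for this example, and real methods would of course do more than add an offset. Removing a layer is only allowed when it maps every sector to itself, so the sector the filesystem ends up on does not change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Hypothetical method instance: exposes `size` sectors of the device
    below it, starting at sector `offset` of that device."""
    name: str
    offset: int
    size: int


def resolve(stack, sector):
    """Translate a sector at the top of the stack (listed top first)
    into a sector on the physical disk."""
    for layer in stack:
        if not 0 <= sector < layer.size:
            raise ValueError("sector outside " + layer.name)
        sector += layer.offset
    return sector


def dismantle(stack, name, disk_size):
    """Return a new stack without layer `name`.
    Refuses unless that layer is a 1:1 mapping of the device below it."""
    idx = [layer.name for layer in stack].index(name)
    lower_size = stack[idx + 1].size if idx + 1 < len(stack) else disk_size
    layer = stack[idx]
    if layer.offset != 0 or layer.size != lower_size:
        raise ValueError(name + " is not a 1:1 mapping")
    return stack[:idx] + stack[idx + 1:]
```

With a 1000-sector disk and the stack `[Layer("BSD", 64, 500), Layer("MBR", 0, 1000)]`, dismantling "MBR" leaves every resolved sector unchanged, while trying to dismantle "BSD" is refused.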

Now it looks like this:

Next, using the same set of conditions we enable the mirror:

Insert mirror between SCSI/sd0 and BSD
Attach SCSI/sd1 to mirror

The reason we can insert a mirror just like that is that the mirror is also a 1:1 mapping when it has only one child.
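
A toy model of the two steps above, with each disk represented as a Python list of sectors. The `Mirror` class and its method names are hypothetical, and the full copy performed in `attach` is an assumption: the text does not say how a newly attached child is brought up to date.

```python
class Mirror:
    """Hypothetical mirror method over in-memory 'disks' (lists of sectors)."""

    def __init__(self, child):
        # With a single child, reads and writes pass straight through,
        # so the mirror is a 1:1 mapping and can be inserted transparently.
        self.children = [child]

    def read(self, sector):
        return self.children[0][sector]

    def write(self, sector, data):
        for child in self.children:
            child[sector] = data

    def attach(self, child):
        """Add another child. Assumption: it is synchronised by a full copy."""
        if len(child) != len(self.children[0]):
            raise ValueError("children must have the same size")
        child[:] = self.children[0]
        self.children.append(child)
```

Before `attach` is called, writes reach only the first disk; afterwards every write lands on both.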

Now we're back to the setup we started with:

There are really no limits to what a method can do. Here is a bestiary of some of the ones I can imagine:

BSD
Understands BSD style disklabels

MBR
Understands DOS/MBR/FDISK style disklabels

MIRROR
Mirrors data over multiple lower devices.

CONCAT
Concatenates a number of lower devices into one larger device

STRIPE
Like CONCAT, but with interleaved layout.

RAID-5
RAID-5 over a number of lower devices.

INTERLEAVE
This is the opposite of STRIPE. It interleaves a number of upper devices onto one lower device. For two interleaved devices, all the even numbered sectors on the lower device will belong to the first upper device and the odd numbered ones to the other.

COW
"Copy On Write". Imagine the case where you had a nasty crash and fsck barfs badly over one of your filesystems. The temptation to just run "fsck -y" is there, but what will happen? Well, you put a "COW" on your device and tell the "COW" to use your swap partition for temporary storage. Then you run fsck -y on your COWed device. The COW module will look just like a normal device, but all the writes fsck does will be stored in the temporary storage until you tell COW to "commit". So if fsck -y looked OK, you mount the device, peek around, find nothing important missing and tell "COW" to commit. COW will then copy all the blocks written by fsck from temporary storage to the "real" device, and we're all happy. If on the other hand fsck -y removed pretty much everything on the filesystem, you will probably tell "COW" to "abandon" and take the long road home to recovery. Call it the "What if?" method if you like, but it is my favourite method.

APPLE, SUN, MVS, XENIX
These are various methods to read the disklabels as they look on various other machines and OSs.

"YOUR IDEA GOES HERE"
A method should hopefully be something very simple to write, so if you have a good idea...
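
To give a feel for how small a method might be, here are sketches of two entries from the list above: the sector arithmetic of INTERLEAVE and the commit/abandon behaviour of COW. Everything here is hypothetical: the names `interleave_map` and `Cow` are invented, devices are plain Python lists of sectors, and a dictionary stands in for the temporary storage that the text places on a swap partition.

```python
def interleave_map(upper_index, upper_count, sector):
    """INTERLEAVE: sector `sector` of upper device number `upper_index`
    (counting from 0) lands on this sector of the shared lower device.
    With two upper devices, the first gets the even sectors and the
    second gets the odd ones."""
    if not 0 <= upper_index < upper_count:
        raise ValueError("no such upper device")
    return sector * upper_count + upper_index


class Cow:
    """COW: writes are parked in temporary storage until commit or abandon."""

    def __init__(self, device):
        self.device = device   # the "real" device, a list of sectors
        self.scratch = {}      # sector number -> pending data

    def read(self, sector):
        # Pending writes shadow the real device.
        if sector in self.scratch:
            return self.scratch[sector]
        return self.device[sector]

    def write(self, sector, data):
        self.scratch[sector] = data

    def commit(self):
        """Copy every pending write to the real device."""
        for sector, data in self.scratch.items():
            self.device[sector] = data
        self.scratch.clear()

    def abandon(self):
        """Throw the pending writes away; the real device is untouched."""
        self.scratch.clear()
```

In this model, running "fsck -y" against a `Cow` changes nothing on the real device until `commit` is called, and `abandon` returns the view to exactly what was there before.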

Summary

I hope the above gives an idea of what I'm talking about; otherwise, yell.

The basic idea is kind of LEGO-inspired: you have a number of bricks and you put them together as you like. All the various commercial systems I have tried impose a hierarchy on the methods they provide (pdisk, disk, subdisk, plex, volume for instance), and I don't like that straitjacket. If I feel like mirroring before I partition, I should be allowed to do that. I probably have a reason for wanting to.

It's about providing tools, not policies really...

I actually had a prototype of this running, but it suffered badly from "second-system syndrome", so a fresh start should be made. I am unlikely to have the time for it unless I find a sponsor.

Poul-Henning Kamp
phk@FreeBSD.org
19990925