In this document I will outline the work I intend to carry out if my fund-raising succeeds.
I will caution, though, that changes to this plan are not just likely but to be expected. This is partly because nobody works alone in the FreeBSD kernel: there are numerous interactions between subsystems, and my interaction with other developers will result in better ideas and changed concepts along the way. But it is also because there is a lot of code with small but important details and plenty of compatibility issues to deal with along the way, and I have undoubtedly overlooked more than a few things.
This is, by the way, one of the things that is so attractive about developing on FreeBSD: if we make a plan and later find out it was a mistake, we can fix the mistake rather than "deliver appendix 9 as written because that's what the contract tells us to do".
So please do not read this as the Gospel; it is a plan, and a pretty far-reaching one at that.
Today, when a program accesses a device like a disk or a tty, the calls go through the system call layer, through the file descriptor switch, and into the vnode layer. In the vnode layer, the system call gets turned into a sequence of VOP_*() calls and directed to whatever filesystem backs the character device vnode the program operates on (that filesystem is DEVFS for all practical purposes). From DEVFS the VOP_*() calls pass on to SPECFS, where they get mapped onto a device driver, and finally calls are made into this device driver through the cdevsw structure which the device driver published when it created the device node with make_dev(9).
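For reference, here is a minimal sketch of the driver end of that chain, using the cdevsw/make_dev(9) interface referred to above; the "foo" driver and its do-nothing methods are of course only for illustration:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/kernel.h>
    #include <sys/module.h>
    #include <sys/conf.h>
    #include <sys/uio.h>

    static d_open_t foo_open;
    static d_read_t foo_read;

    /* The entry points the driver publishes to the rest of the kernel. */
    static struct cdevsw foo_cdevsw = {
            .d_version =    D_VERSION,
            .d_name =       "foo",
            .d_open =       foo_open,
            .d_read =       foo_read,
    };

    static struct cdev *foo_dev;

    static int
    foo_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
    {

            return (0);
    }

    static int
    foo_read(struct cdev *dev, struct uio *uio, int ioflag)
    {

            /* A real driver would uiomove(9) data to the caller here. */
            return (0);
    }

    static int
    foo_modevent(module_t mod, int type, void *arg)
    {

            switch (type) {
            case MOD_LOAD:
                    /* This is what makes /dev/foo0 appear in DEVFS. */
                    foo_dev = make_dev(&foo_cdevsw, 0,
                        UID_ROOT, GID_WHEEL, 0600, "foo0");
                    return (0);
            case MOD_UNLOAD:
                    destroy_dev(foo_dev);
                    return (0);
            default:
                    return (EOPNOTSUPP);
            }
    }

    DEV_MODULE(foo, foo_modevent, NULL);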
Device vnodes have some properties which are different from all other vnodes, and that leads to a fair amount of special-case handling at various levels of this picture. The most magic property is that it is possible to access the same device through multiple mountpoints.
In other words, it is possible to open the floppy drive both as "/dev/fd0" and "/cdrom/dev/fd0" but in the end it is the same device.
We have some historical access rules which for instance mandate that if a disk device is mounted as a filesystem, a program must not be able to write directly to the disk device. To enforce rules like that, the vnode layer needs to be aware that /dev/fd0 and /cdrom/dev/fd0 are in fact the same device, so that access can be denied through both if one of them is mounted as a filesystem.
There are other reasons as well, but the result is what is known as "vnode aliasing", a concept which adds significant complexity to the vnode code and to its locking.
The solution to the above problem is to not go through the vnode layer more than we absolutely have to. The bit we cannot avoid is that the vnode layer provides the naming service for the devices: this is how we match the path "/dev/fd0" up with the device driver in the kernel.
But once this linkage is established, and once we have successfully open(2)'ed the device, the subsequent read(2), write(2), ioctl(2) etc. calls do not need to pass through the vnode layer.
In other words, we can go directly from the file descriptor switch to the device once the open(2) call is finished.
Doing that would result in the device I/O traffic not getting anywhere near the vnode locking.
Fortunately, we can decide per device class, per device driver and even per individual device whether we want to take the bypass path.
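Just to illustrate the idea (this is not the actual implementation; the function name and the use of f_data to carry the device pointer are assumptions of the sketch), the bypass boils down to giving such devices their own fileops, so that for example read(2) goes from the file descriptor switch straight to the driver's cdevsw:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/conf.h>
    #include <sys/file.h>
    #include <sys/uio.h>

    /*
     * Sketch only: once open(2) has resolved the path to a struct cdev,
     * the file descriptor could carry that pointer directly (assumed
     * here to be stashed in fp->f_data at open time), and read(2) would
     * never enter the vnode layer again.
     */
    static int
    devbypass_read(struct file *fp, struct uio *uio, struct ucred *cred,
        int flags, struct thread *td)
    {
            struct cdev *dev;
            struct cdevsw *dsw;

            dev = fp->f_data;               /* set up by the bypass open path */
            dsw = dev->si_devsw;
            if (dsw == NULL)
                    return (ENXIO);         /* device destroyed or revoked */
            /*
             * Straight into the driver: no VOP_*() calls, no vnode locks.
             * A real version would translate the open flags in fp->f_flag
             * into the proper IO_* flags for the driver.
             */
            return (dsw->d_read(dev, uio, 0));
    }

    static struct fileops devbypass_ops = {
            .fo_read =      devbypass_read,
            /* fo_write, fo_ioctl, fo_poll, ... would follow the same pattern. */
    };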
The first and most obvious candidate is the tty devices, and in particular the pty(4) driver, since that accounts for nearly all terminal traffic these days.
Getting ttys out from under Giant will improve the interactive response of FreeBSD a lot.
There is a little detail to this in the revoke(2) system call, but I think I have a workable solution for that.
When we mount /dev/ad0s1d on /home, the UFS filesystem uses the vnode layer to open the disk device, and subsequently the vnode reference obtained that way is used by the buffer cache and the VM system to do the actual traffic on the device.
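The following sketch shows the kind of traffic meant: the filesystem keeps the device vnode (devvp) around, and the buffer cache does the actual I/O through it with bread(9); the superblock location and size constants are made up for the example:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/ucred.h>
    #include <sys/vnode.h>
    #include <sys/buf.h>

    /* Made-up location/size of an on-disk superblock, for illustration only. */
    #define SB_BLKNO        64
    #define SB_SIZE         8192

    /*
     * Rough sketch of the current arrangement: the filesystem holds a
     * vnode reference to its disk device (devvp) and all actual I/O
     * goes through the buffer cache via that vnode.
     */
    static int
    read_superblock(struct vnode *devvp, void *sb)
    {
            struct buf *bp;
            int error;

            bp = NULL;
            error = bread(devvp, SB_BLKNO, SB_SIZE, NOCRED, &bp);
            if (error == 0)
                    bcopy(bp->b_data, sb, SB_SIZE);
            if (bp != NULL)
                    brelse(bp);
            return (error);
    }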
The quick solution would be to replace the vnode reference with a file descriptor reference and be done with it, but closer scrutiny shows that this is neither a feasible nor even a desirable solution.
The buffer cache used to work on disk devices only, and it was, incidentally, the only way to work on a disk device. Then came virtual memory, and suddenly things got a lot more complicated, but we managed to kludge things so that the buffer cache and the VM system talk together. Then came NFS, which does not have a disk device as backing store but rather a network connection, and once again we managed to kludge it.
It is time to stop kludging and do some real design work instead.
The buffer cache is needed by many different pieces of code with a number of different backing stores. So far I can count disk-based filesystems, network-based filesystems, virtual-memory filesystems, RAID5 software (for parity caching) and GBDE (for key-sector caching), and there are probably more that I have not discovered.
Ideally we would ditch the buffer cache entirely and go directly to the VM system, but that would require a lot of changes to a lot of code, and most of them would be regressions, because the buffer cache model and API are actually pretty much OK; it is the code in the buffer cache for handling the backing store that needs to be modified.
So the plan is to turn the buffer cache into a sort of library, where users create an instance, provide access methods to the backing store, and then use it more or less like they always did.
By making the backing store access a parameter of the instance, it will be possible to use the buffer cache to cache anything to which we can get direct (addressed) access, or even just simulate it.
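Nothing like this API exists today; purely to illustrate the shape of such a library, the instance and its backing-store methods might look roughly like this (all names are invented):

    #include <sys/types.h>

    /*
     * Purely illustrative sketch of a "buffer cache as a library" API;
     * none of these names exist in the kernel.  The essential point is
     * that backing store access is a parameter of the instance, so the
     * same cache code can sit on top of a disk, a network connection,
     * RAID5 parity data or GBDE key sectors.
     */
    struct bufcache;                        /* one cache instance */
    struct bcbuf;                           /* one cached buffer */

    struct bufcache_methods {
            /* Fill a buffer from the backing store. */
            int     (*bcm_read)(void *backing, off_t offset, void *data,
                        size_t length);
            /* Flush a dirty buffer back to the backing store. */
            int     (*bcm_write)(void *backing, off_t offset, const void *data,
                        size_t length);
    };

    /* Create an instance bound to one particular backing store. */
    struct bufcache *bufcache_create(void *backing,
                        const struct bufcache_methods *methods, size_t maxmem);
    void             bufcache_destroy(struct bufcache *bc);

    /* The familiar bread(9)/bwrite(9)-style operations, per instance. */
    int              bufcache_read(struct bufcache *bc, off_t offset,
                        size_t length, struct bcbuf **bpp);
    int              bufcache_write(struct bufcache *bc, struct bcbuf *bp);
    void             bufcache_release(struct bufcache *bc, struct bcbuf *bp);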
Performance wise, we do a very pointless thing today: we map the data area of all disk I/O requests into kernel virtual memory even though most contemporary disk device drivers do not need that.
There are however drivers and GEOM classes which need to access the data through kernel virtual memory (RAID5 for parity calculation, md(4), GBDE for encryption), so we cannot go directly to the other ditch; we have to support both ways.
Adding a scatter/gather and mapped/unmapped ability to struct bio and the related code will give us a tangible performance gain.
There are a lot of tie-ins to this concept (BUSDMA, the VM system, etc.), so this is in the optional part of the project plan.
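To make the idea a little more concrete, here is a purely hypothetical sketch of what an unmapped, scatter/gather-capable request could carry; none of these fields or flags exist in struct bio today:

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <vm/vm.h>

    /*
     * Hypothetical sketch: instead of always carrying a kernel virtual
     * address in bio_data, a request could optionally carry the
     * physical pages behind the data area.  The names below are made up.
     */
    struct bio_sg {
            struct bio       bsg_bio;       /* the ordinary request */
            vm_page_t       *bsg_pages;     /* pages backing the data area */
            int              bsg_npages;
    };

    #define BIOF_UNMAPPED   0x10000000      /* made-up flag: no KVA provided */

    /*
     * A busdma-capable driver would feed bsg_pages straight to the
     * hardware; only the consumers that really have to touch the bytes
     * (RAID5 parity, md(4), GBDE encryption) would pay for a mapping.
     */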
Once the buffer cache has become more agile with respect to its backing store, there is not really any reason why a disk-backed filesystem should not access the disk directly as a GEOM class.
In addition to the shorter code path and increased performance, this also allows us to use the stronger R/W/E access model of GEOM to implement the traditional magic access checks and, on top of that, to fix the issues surrounding ro/rw upgrade and downgrade mounts.
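For illustration, modelled on the way GEOM consumers are normally set up (the class, the geom name and the function below are invented), a filesystem could attach to its disk's provider and express the mount mode through GEOM's access counts roughly like this:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <geom/geom.h>

    /* Invented class for the sketch; a real filesystem would have its own. */
    static struct g_class myfs_class = {
            .name =         "MYFS",
            .version =      G_VERSION,
    };

    /*
     * Attach directly to the disk's GEOM provider and acquire read
     * access, plus write access for rw mounts.  An ro->rw upgrade later
     * is simply g_access(cp, 0, 1, 0); a downgrade is g_access(cp, 0, -1, 0).
     */
    static int
    myfs_attach_disk(struct g_provider *pp, int rw, struct g_consumer **cpp)
    {
            struct g_geom *gp;
            struct g_consumer *cp;
            int error;

            g_topology_lock();
            gp = g_new_geomf(&myfs_class, "myfs.%s", pp->name);
            cp = g_new_consumer(gp);
            error = g_attach(cp, pp);
            if (error == 0) {
                    error = g_access(cp, 1, rw ? 1 : 0, 0);
                    if (error != 0)
                            g_detach(cp);
            }
            if (error != 0) {
                    g_destroy_consumer(cp);
                    g_destroy_geom(gp);
            }
            g_topology_unlock();
            if (error == 0)
                    *cpp = cp;
            return (error);
    }

Actual I/O would then go through g_read_data()/g_write_data() or by scheduling struct bio requests on the consumer, never through a device vnode.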
But most importantly, it means that filesystem I/O no longer takes two tours through the vnode layer. Today we take a tour first to get to the filesystem which implements the vnode for our /tmp/foo file, and then that filesystem takes another tour to get to its disk device.
With device aliases and filesystem disk I/O gone from the vnode layer, removing Giant locking from vnodes will be a lot simpler, and most of the race/deadlock opportunities will be eliminated.