Advantage of Reducing the inode count


It should be noted that I do not consider this the primary goal of the pkg-data project. However, it is a nice consequence of making the proposed change. I have probably talked about this one issue much more than I should have. I focused on it because it is an immediate benefit of the initial (short-term) project, and because it is something that can be measured objectively. As I stated on the initial page, the main goal of the long-term project is to move many of the little bits of information about each specific port into some XML-ish format. IMO, that will be a better strategy than stuffing some of those values into make-variables, and other values into an assortment of disjoint files.


Let me stress that I am not saying that this is some dramatic, earth-shaking advance in computing. I am just saying that it is a nice advantage. If all other things are equal, then a ports collection with "fewer inodes than we have now" is easier to work with than one with "more inodes". I understand that there are limits to this. We obviously wouldn't want the entire ports collection as a single file, but at the moment I think we do still have something to gain by reducing the number of files in the ports collection.

The thing about the ports collection right now is that most of the files in the collection are small: smaller than the block size of the filesystems we load them on. As disks continue to grow larger, there is an increasing performance penalty in having a lot of small files instead of a smaller number of somewhat larger files.
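
Anyone who wants to check the "most files are smaller than a block" observation on their own machine could use something like the following. This is only a rough sketch, not part of the pkg-data tools: the function name count_small_files is made up for illustration, and it uses the filesystem's fragment size (as reported by statvfs) as the default threshold, which is an assumption about what "block size" should mean here.

```python
import os


def count_small_files(root, block_size=None):
    """Return (total_files, files_smaller_than_block) for the tree at root.

    If block_size is None, fall back to the allocation unit that
    os.statvfs() reports for the filesystem holding root (Unix only).
    """
    if block_size is None:
        block_size = os.statvfs(root).f_frsize
    total = 0
    small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat() so that symlinks are measured, not followed
            st = os.lstat(os.path.join(dirpath, name))
            total += 1
            if st.st_size < block_size:
                small += 1
    return total, small
```

For example, count_small_files("/usr/ports") would report the two counts for a ports tree installed at that path (the path is an assumption; substitute wherever the collection actually lives).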

Also, there is per-file overhead on the disk. Each file has its own permission (chmod) bits, its own owner field, its own storage (on the disk) for many of the fields you find in a stat() call. However, in the ports collection there is no reason to have separate values for those fields. It would be a pretty weird port, for instance, where 'patch.00a' NEEDS different permissions, or a different owner, than 'patch.00b' in the same port. I understand that with larger disks we have plenty of disk space to waste, but that by itself does not mean we must waste it.

Also, there is a per-file overhead for any operations which operate on the entire ports collection. Operations like 'cvsup' should go faster if they have fewer files to operate on. Again, I understand that there is obviously a limit to this, and that we would not gain by collapsing the entire ports collection into a single file. However, I do suspect that the proposed change should speed up operations such as 'cvsup', as time goes on and the ports collection continues to add even more ports.

And the more inodes it takes to hold the ports collection, the more likely that the disk partition which holds the collection (on each user's machine) is going to run out of inodes -- long before it runs out of disk space. I know the ports collection has caused me to run out of inodes in the past, although that was before previous projects which reduced the inode count of the ports collection.
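
To see how close a given partition is to running out of inodes compared with running out of space, one can compare the two ratios directly. The sketch below is only an illustration: the names fraction_used and filesystem_usage are hypothetical, and it relies on the standard statvfs fields (f_files, f_ffree, f_blocks, f_bfree). Some filesystems do not report inode totals at all, in which case the inode figure comes back as None.

```python
import os


def fraction_used(total, free):
    """Fraction of a resource in use, or None if the total is not reported."""
    if total <= 0:
        return None
    return (total - free) / total


def filesystem_usage(path):
    """Return the inode and block usage fractions for the filesystem at path."""
    st = os.statvfs(path)
    return {
        "inodes": fraction_used(st.f_files, st.f_ffree),
        "blocks": fraction_used(st.f_blocks, st.f_bfree),
    }
```

If the "inodes" fraction is much higher than the "blocks" fraction on the partition holding the ports collection, that partition will hit the inode limit first.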
 


As of this writing, we have done some initial work to see what the ports collection would look like after it was transformed according to the pkg-data ideas. What we did was put the entire ports collection on one disk partition, and then transform that and put the result on another (initially-empty) disk partition of the same size. An initial test indicates that the filesystem savings would be something like:

1K-blocks used   Inodes used
        235608         77915   - the present ports collection (Apr 14 2004)
        162860         33032   - after pkg-data transform
           31%           58%   - reduction due to transform

This was with an initial ports-collection created by running 'cvsup' with 'tag=.'. This means it does not include the CVS directories. I suspect this is how most users work with the ports collection.
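
The reduction percentages follow directly from the before and after figures. As a quick check of the arithmetic (the helper name reduction_percent is just for illustration):

```python
def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Figures from the Apr 14 2004 test
blocks = reduction_percent(235608, 162860)
inodes = reduction_percent(77915, 33032)

print(f"1K-blocks: {blocks:.1f}%  inodes: {inodes:.1f}%")
```

Rounded to whole percents, these give the 31% and 58% reported for the test.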
 

Comments can be sent to:   drosih@rpi.edu
Web page last updated on:  Apr 14, 2004