Introduction

This page is a collection of thoughts on the design of a new binary package format, intended for implementation in the now-dead OpenPackages project.

Before we start: no, I don't think I'm the supreme authority on the design of the OpenPackages system. These are my views on how to design the system. However, I am willing to stand up for these ideas in a public debate. I've been working with the FreeBSD Ports Collection for seven years, and I'm responsible for a number of features (probably better described as horrible hacks) in the collection. I've also rewritten the package tools, so I have some idea of how this works and what the problems are.

It is my feeling that OpenPackages has an opportunity to fix some of the problems which have become apparent in the design and implementation of the Ports/Packages collections on FreeBSD, NetBSD and OpenBSD. This does not mean that these systems don't work or are not good. FreeBSD's Ports Collection in particular is a major selling point of the OS, and can even hold its own against Debian's apt-get in a /. flame-war. We should not be trying to design an all-singing, all-dancing package manager which can meet every possible need. This is what happened with the launch of OpenPackages.org: after everyone had posted their wish list, nothing happened. This is the same problem which is plaguing FreeBSD 5.x at the moment - a grand design and nowhere to start. The backbone of open source is incremental improvement.

However, there are problems with the design of the existing collections. This is not surprising: the prototype for the Ports Collection was knocked together in a few days, and was never intended to scale to 10000+ pieces of software. With some careful management it has, and that is a testimony to the strength of the underlying principles in the design, in particular being able to combine source, binary and third party software with a fairly loose set of dependencies.

It is hard to put a finger on a lot of the problems in the current system, especially because they are being worked around in various ways by the three collections, and by third party software such as portupgrade. We can also learn a lot from other systems such as apt-get and RPM. In particular, we should be looking at keeping OpenPackages as lean as possible, so that it can be distributed and built with the OS, and used to package the OS. We need to focus on the needs of the BSD community, and not on making a tool which can work on PDP-9s and Palm Pilots.

OK, so let's get down to some concrete design work. I'm basing this mostly on FreeBSD's Ports Collection. Why? Because tools and architecture are great, but applications bring users. If we start too far away from FreeBSD, we throw away the FreeBSD collection, and with it the army of maintainers, because most people aren't going to want to maintain a Port and an OpenPackage. So, we need to walk a very fine line.

Overall Architecture

I think most people know how the system works, but a quick overview of what we want is appropriate. The goal is to make building and installing software as easy as possible. We have a large number of open source applications running wild, and we need to bring them into line. The current system works by:

Specifying what we are building.
Specifying where to find it, and how to download it.
Specifying how to lay it out and patch it.
Building it.
Specifying what it's going to install and where.
Installing it.
Keeping track of it.
Allowing a binary package to be built.
Allowing the use of the binary package instead of the source.
Allowing it to be uninstalled.

In addition, if the software needs any other pieces to extract, patch, build, install or run, it ensures that these are installed and available. We also have tools which allow us to query the uninstalled and installed packages, and perform basic administration. We use make and a few other tools for the first six items, and the pkg_* tools for the remainder.

For OpenPackages, these have been grouped into six tasks:

OpenPackages tools
Fetch and extract
Build infrastructure
Package creation
Package install and uninstall
Package meta tools

What are we building

Of fairly fundamental importance to the entire operation is the need to establish exactly what we're building. The primary key to this is the package name, which tells us what the software is, and what version it is. We also have some higher level data: the general category into which the software falls, the target host and a few other details.

Package Name

The following constitutes a working specification for package names within the OpenPackages project. It is not a formal specification. In time, a tool (part of a portlint replacement) will be developed which will enforce a formal specification. This specification is based on the actual names of packages found in the existing collections.

A package name has two components, the base name, which is normally the name the developers gave the software, and the developer's version. However, these are not sufficient to adequately describe the software. To fully describe the software we require seven components! In the Makefile they have the following names:

PKGROOT The name used by the developer.

PKGBRANCH The development branch.

PKGLANG The language we're building.

PKGFLAVOR The customisation we've applied.

PKGSUBSET The subset we're building.

PKGVERSION The developer's version number.

PKGEPOCH For developers who can't count.

PKGREVISION Our revision of this package.

The PKGROOT and PKGVERSION are required. There are default values for the remainder, which are not rendered in most cases. Astute readers will have noticed there are eight items in the list - flavor and subset are mutually exclusive, so at most seven apply to any one package.

PKGROOT consists of a string of [-_+a-zA-Z0-9], starting with [a-zA-Z]. It should be based on either the developer's name for the program, the name of the distribution archive or the name of the installed program. It doesn't take a rocket scientist to name something. The maximum length of PKGROOT is 35 characters.
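
The PKGROOT rule is simple enough to check mechanically. What follows is a minimal sketch of such a check, not part of any existing tool: the function name pkgroot_is_valid and the macro PKGROOT_MAXLEN are made up for illustration, and the character tests are deliberately ASCII-only so that the result does not depend on the locale.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PKGROOT_MAXLEN 35 /* maximum length stated in the text */

/* ASCII-only helpers, so the result does not depend on the locale. */
static bool is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical helper: true if s matches [a-zA-Z][-_+a-zA-Z0-9]* and is
 * between 1 and PKGROOT_MAXLEN characters long. */
bool pkgroot_is_valid(const char *s)
{
    size_t len = strlen(s);
    size_t i;

    if (len == 0 || len > PKGROOT_MAXLEN)
        return false;
    if (!is_ascii_alpha(s[0]))
        return false;
    for (i = 1; i < len; i++) {
        char c = s[i];

        if (!is_ascii_alpha(c) && !is_ascii_digit(c) &&
            c != '-' && c != '_' && c != '+')
            return false;
    }
    return true;
}
```

A name such as "gtk+" passes, while "2foo" (leading digit) and "foo bar" (space) are rejected. The other fields could be checked the same way with their own character sets and length limits.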

PKGBRANCH details the branch of the software which is being built. The default value is 'stable'. Normally, once a software project of any decent size gets under way, there's a stable (i.e. working) and development (i.e. broken) version. Or people get stuck on old versions of the software. Suitable values for this are the same as for PKGROOT, with the exception that this can also start with a digit or punctuation. Likely values are '2', '3', '44', 'devel'. The maximum length of this field is 10 characters.

PKGLANG is the standard short language code that we are building with. The default is 'en' (implying en_US). This is part of the build time options for the package. The maximum field width is 5 characters.

PKGFLAVOR defines the build time customisations which have been applied to a package. The default value is ''. Typical values would be 'a4', 'no_x11', 'gnome', 'qt'. The maximum field width is 10 characters; the value can only consist of characters in the set [+_a-zA-Z0-9], and may not be a valid standard language code. Flavors are mutually exclusive.

PKGSUBSET indicates that we are only building part of a package. The default value is '', implying we are building the entire package. Typical values are 'docs', 'examples', 'images'. The maximum field width is 10 characters, with the same restrictions as PKGFLAVOR.

PKGFLAVOR and PKGSUBSET are mutually exclusive. The reason for this is that PKGSUBSET is used to build the common components of all flavors, and must therefore not be flavored. This does not mean the flavors require subsets or that subsets require flavors.

PKGVERSION consists of a period separated list of numbers and optionally letters. This should be based on the developer's version number, and can include duplication of the PKGBRANCH (e.g. a branch of '2' and a version of '2.1.13'). The first and last characters may not be a period. Letters behave like fractions and a letter appearing after a period (or at the beginning) is less than zero. So '1.0' < '1.1' < '1.1a' < '1.2.b2' < '1.2' < '1.2.0p1'. Multiletter sequences are allowed, but discouraged. The maximum length of this field is 15 characters.
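
The ordering rule is easier to pin down in code than in words. The comparator below is one possible reading of it, offered as a sketch rather than a specification: the name pkg_version_cmp is hypothetical, and three details are assumptions the text does not settle. A version that runs out of components is padded with zeros (so '1.2' equals '1.2.0'), letter suffixes are compared byte by byte, and digit runs inside a suffix are compared as numbers (so 'b2' sorts before 'b10'). It also relies on the 15 character limit to keep the numbers from overflowing.

```c
#include <ctype.h>

/* True at the end of a period-separated component. */
static int at_end(const char *p)
{
    return *p == '\0' || *p == '.';
}

/* Read a run of digits at *p and advance past it. Returns -1 when the
 * run is empty, which makes a component that starts with a letter sort
 * below every real number, including zero. */
static long long take_number(const char **p)
{
    long long n = 0;

    if (!isdigit((unsigned char)**p))
        return -1;
    while (isdigit((unsigned char)**p))
        n = n * 10 + (*(*p)++ - '0');
    return n;
}

/* Hypothetical comparator: returns <0, 0 or >0, like strcmp(). */
int pkg_version_cmp(const char *a, const char *b)
{
    while (*a != '\0' || *b != '\0') {
        /* Leading number; a version that has run out of components
         * is padded with zeros (an assumption, see the lead-in). */
        long long na = (*a == '\0') ? 0 : take_number(&a);
        long long nb = (*b == '\0') ? 0 : take_number(&b);

        if (na != nb)
            return na < nb ? -1 : 1;

        /* Letter suffix: a missing suffix sorts first, digit runs
         * inside a suffix compare as numbers, the rest by byte. */
        while (!at_end(a) || !at_end(b)) {
            if (at_end(a))
                return -1;
            if (at_end(b))
                return 1;
            if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
                na = take_number(&a);
                nb = take_number(&b);
                if (na != nb)
                    return na < nb ? -1 : 1;
            } else {
                if (*a != *b)
                    return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
                a++;
                b++;
            }
        }
        if (*a == '.')
            a++;
        if (*b == '.')
            b++;
    }
    return 0;
}
```

With this reading, pkg_version_cmp("1.2.b2", "1.2") is negative because the third component of the first version starts with a letter and so counts as less than the implied zero of the second.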

PKGEPOCH is an integer number indicating the number of times that the version number has moved backwards (according to the rules above). The default value is 0. This occurs more often than one would think. It is maintained by the package maintainer and *must* be incremented when needed. It should not be incremented when an upgrade is backed out (e.g. for security or stability reasons).

PKGREVISION is also an integer number, this time indicating the number of times that a maintainer has made significant changes to the way that the package is built. It should not be incremented just because a broken build is fixed, or for spelling or other minor changes. Unnecessary increments can lead to large unneeded downloads/compiles, and missing increments can lead to a proliferation of bug reports.

PKGBASE is a combination of the first five, using the following rule: ${PKGROOT}[${PKGBRANCH:C/^[a-z]/-&/}][(${PKGFLAVOR}|${PKGSUBSET})][-${PKGLANG}]. Or in English: we start with the root, append the branch (prefixed with a hyphen if the branch begins with a letter), then add either the flavor or the subset in parentheses, if either is defined, followed by the language, if it is defined. The first two components form PACKAGE_BASE, which is the display name used for any sort of index (like a web page).

PKGNAME is a combination of PKGBASE and the three versions: ${PKGBASE}-${PKGVERSION}[_${PKGREVISION}][,${PKGEPOCH}]. The PKGNAME is what is used to name package files. ${PACKAGE_BASE}-${PKGVERSION} forms PACKAGE_NAME and is used in verbose index displays.
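
To make the two rules concrete, here is a rough sketch that renders a PKGNAME from its components. The struct pkg_parts and the function pkg_render_name are invented for this illustration. It assumes the caller passes NULL or an empty string for any component that has its default value, so that the default is simply not rendered, and it treats "begins with a letter" as the lowercase [a-z] test used in the make rule.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container; NULL or "" means "default, do not render". */
struct pkg_parts {
    const char *root, *branch, *lang, *flavor, *subset;
    const char *version, *epoch, *revision;
};

static int is_set(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Writes the PKGNAME into buf. Returns 0 on success, -1 if a required
 * part is missing, both flavor and subset are set, or buf is too small. */
int pkg_render_name(const struct pkg_parts *p, char *buf, size_t buflen)
{
    const char *variant = is_set(p->flavor) ? p->flavor : p->subset;
    int has_branch = is_set(p->branch);
    int has_variant = is_set(variant);
    int n;

    if (!is_set(p->root) || !is_set(p->version))
        return -1;
    if (is_set(p->flavor) && is_set(p->subset))
        return -1; /* mutually exclusive */

    n = snprintf(buf, buflen, "%s%s%s%s%s%s%s%s-%s%s%s%s%s",
        p->root,
        /* the ":C/^[a-z]/-&/" modifier: hyphen before a letter */
        has_branch && p->branch[0] >= 'a' && p->branch[0] <= 'z' ? "-" : "",
        has_branch ? p->branch : "",
        has_variant ? "(" : "",
        has_variant ? variant : "",
        has_variant ? ")" : "",
        is_set(p->lang) ? "-" : "",
        is_set(p->lang) ? p->lang : "",
        p->version,
        is_set(p->revision) ? "_" : "",
        is_set(p->revision) ? p->revision : "",
        is_set(p->epoch) ? "," : "",
        is_set(p->epoch) ? p->epoch : "");

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Under these assumptions a root of "python", branch "2", version "2.1.13", revision "1" and epoch "2" render as python2-2.1.13_1,2, while a root of "foo" with branch "devel", flavor "gnome", language "de" and version "1.0" renders as foo-devel(gnome)-de-1.0.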

Within package lists, the eight components are rendered as strings, separated by a comma (','). Default values are rendered as empty strings (not NULL). Within software the version is stored in a structure similar to:

struct _pkg_version {
  char name[PKGNAME_MAX];  /* backing store; every pointer below points into it */
  char *root;
  char *branch;
  char *lang;
  char *flavor;
  char *subset;
  char *version;
  char *epoch;
  char *revision;
};
typedef struct _pkg_version pkg_version;

The pointers all point into the name array, and the code can convert quickly between separate strings and a comma separated string by replacing the [-1] element of the last seven strings with ',' or '\0'. We can also use this to tell if we are currently comma or string separated. Initial parsing (to get the eight pointers) can be done with strsep(&cp, ","). Comparing the temporary pointer used for strsep and the revision pointer to NULL will quickly tell one if we had an under or overflow during parsing.
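
The parsing and the separator trick can be sketched as follows. This is only an illustration under a few assumptions: PKGNAME_MAX is given an arbitrary value of 128 because the text does not define it, the function names pkg_version_parse and pkg_version_set_sep are made up, and a small strchr() loop stands in for strsep(&cp, ",") because strsep() is not part of ISO C. The loop behaves the same way: each field pointer receives the current position, and the working pointer becomes NULL once no further comma is found.

```c
#include <stddef.h>
#include <string.h>

#define PKGNAME_MAX 128 /* arbitrary value chosen for this sketch */
#define PKG_NFIELDS 8

struct _pkg_version {
    char name[PKGNAME_MAX];
    char *root, *branch, *lang, *flavor, *subset;
    char *version, *epoch, *revision;
};
typedef struct _pkg_version pkg_version;

/* Copy a comma separated list into pv->name and split it in place.
 * Returns 0 on success, -1 if the list is too long or does not have
 * exactly eight fields. */
int pkg_version_parse(pkg_version *pv, const char *list)
{
    char **field[PKG_NFIELDS] = {
        &pv->root, &pv->branch, &pv->lang, &pv->flavor,
        &pv->subset, &pv->version, &pv->epoch, &pv->revision
    };
    char *cp = pv->name;
    size_t i;

    if (strlen(list) >= sizeof pv->name)
        return -1;
    strcpy(pv->name, list);

    for (i = 0; i < PKG_NFIELDS; i++) {
        *field[i] = cp; /* NULL once the input has run out */
        if (cp != NULL) {
            char *comma = strchr(cp, ',');

            if (comma != NULL)
                *comma++ = '\0';
            cp = comma;
        }
    }
    /* Underflow leaves revision NULL; overflow leaves cp non-NULL. */
    return (pv->revision == NULL || cp != NULL) ? -1 : 0;
}

/* Switch a successfully parsed pv between separate strings (sep '\0')
 * and one comma separated string in pv->name (sep ','). */
void pkg_version_set_sep(pkg_version *pv, char sep)
{
    char *f[PKG_NFIELDS - 1] = {
        pv->branch, pv->lang, pv->flavor, pv->subset,
        pv->version, pv->epoch, pv->revision
    };
    size_t i;

    for (i = 0; i < PKG_NFIELDS - 1; i++)
        f[i][-1] = sep;
}
```

After parsing "foo,,,,,1.0,," the root is "foo", the version is "1.0" and the remaining fields are empty strings. Calling pkg_version_set_sep(&pv, ',') turns pv.name back into the original list, and looking at pv.branch[-1] tells which of the two states the structure is in.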

This format can also be handled easily by sed(1) or perl(1), and outputting this format from make(1) is very easy. Reading it in make is not difficult, but not really elegant.

OpenPackages Binary Package Format

We need a format which is readable, compact and secure. We also need a format that does not require local storage (i.e. it can be extracted straight from the stream). There's also not much point in drifting too far from the existing format. The existing format describes the package in a CONTENTS file, with '@' directives of various kinds. I would put the entire package into this format, and extend the processing a little. So a package would look something like this (for foo-1.0):

@openpackage 1.0.0
@name foo,,,,,1.0,,
@origin OP|openpackages.org:misc/foo|maintainer@openpackage.org
@target FreeBSD/i386-4.6-RELEASE|root@bento.freebsd.org|2002/05/22 22:31:15
@www http://foo.sourceforge.net/
@index Foo is an example of an OpenPackage.
@descr
Is this the longest description of an example that you could come up
with...
@end
@script pre-install
#!/bin/sh
#
rm -rf /
@end
@libdep baz.1|baz-2.0_1
@plist
@file bin/foo|MD5:f180ec82343dd60e010a26c677a81600|755|bin|bin
@link bin/baz bin/foo
@dir share/doc/foo
@file share/doc/foo/README|MD5:e10f1231702886b1181b5d6dd33bba51|644|root|wheel
@end
@signed root@bento.freebsd.org

insert ASCII OpenSSL signature for everything after the @openpackage line and
before the @signed line.  Only @payload is valid outside of this.

@payload bz|MD5:f180ec82343dd60e010a26c677a81600|12562|3694

The raw bzip2 stream for bin/foo (matched by md5) goes here, with the
numbers above giving the uncompressed and compressed sizes.

@end
@payload bz|MD5:e10f1231702886b1181b5d6dd33bba51|523|192

More bzip2'ed file here...

@end

The first few lines would be in a fixed order, and must be at the top, so we can head(1) the package and get just about everything we want to know, in a human readable format. The stuff after the @plist could be compressed also. Most fields would be '|' separated, with a 'type:' prefix if we could have multiple types. Any second level separation is by ','. The payload is secure because we have a signed MD5, and we verify this against the MD5 of the extracted file. All of the current special files would just make their way into directives.

Like I said, I've not really given this enough thought. It's on the list of things for my design document... But the example above gives some good hints as to how I'm thinking. I've gone for a non-tar-based format, because I don't think tar buys us anything, and we would have to fork and run tar, which might not be usable on some platforms. The format above only needs the very kernel of libbz2, which is only a few files and is pretty cross platform. We could also use libz. I've not gone with a zip file because a proper zip file has a directory at the end, which means you need the entire archive before you can start processing...

A description of the payload is mostly absent... My intention for the format is to have a signed header, which has all of the magic for the package, including a list of files with MD5s. This is then followed by a bunch of @payloads, which are indexed by MD5 (so that we can cheat with hard links/duplicates). Each @payload would be one file. Each @payload would have a compression format (gz, bz, ...) and the uncompressed and compressed sizes. We fread() the compressed size, malloc() a buffer for the uncompressed size, and run BZ2_bzBuffToBuffDecompress(), and MD5Data() the result. Then we write it, and chown it... Simple. And we don't need any external processes, which isn't so much a problem for *BSD, but might be on other platforms...
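
Before any of that can happen, the @payload line itself has to be picked apart. The sketch below shows one way to do that with sscanf(), based only on the example line shown earlier. The struct payload_hdr and the function payload_hdr_parse are hypothetical names, the only digest type handled is MD5, and the field widths (7 characters for the compression name, 18 digits for each size) are arbitrary limits chosen for the sketch. Decompression and digest checking are left out, since they need libbz2 and an MD5 library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one "@payload" line. */
struct payload_hdr {
    char comp[8];             /* compression name: "bz", "gz", ... */
    char md5[33];             /* 32 lowercase hex digits plus NUL */
    unsigned long long usize; /* uncompressed size in bytes */
    unsigned long long csize; /* compressed size in bytes */
};

/* Parse "@payload comp|MD5:hex|usize|csize". Returns 0 on success and
 * -1 if the line does not have exactly that shape. */
int payload_hdr_parse(const char *line, struct payload_hdr *h)
{
    char us[19], cs[19]; /* at most 18 digits, so no overflow below */
    int end = 0;

    if (sscanf(line, "@payload %7[a-z0-9]|MD5:%32[0-9a-f]|%18[0-9]|%18[0-9]%n",
               h->comp, h->md5, us, cs, &end) != 4)
        return -1;
    if (strlen(h->md5) != 32)
        return -1; /* an MD5 digest is always 32 hex digits */
    if (line[end] != '\0' && line[end] != '\n')
        return -1; /* trailing junk after the last field */

    h->usize = strtoull(us, NULL, 10);
    h->csize = strtoull(cs, NULL, 10);
    return 0;
}
```

For the first payload of the example package this yields a compression name of "bz", an uncompressed size of 12562 and a compressed size of 3694. Those two sizes are what the reader would use for the fread() and malloc() calls described above.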

We can also use the uncompressed size to determine quickly if we have sufficient disk space (although I'd probably put the total right up front, so we can check as soon as possible). That beats the current approach of just saying four times the tgz's size if it's a local file and, if it's an FTP transfer, letting it run to see if it bombs... Very professional.
