libarchive, bsdtar, pkg_add
Source Code
About versions: An "a" in the version number indicates a release
with known problems that is unsuitable for production use. A "b" in
the version number indicates a release with experimental features that
work but may not yet be fully tested. For production applications, please
use the most recent non-alpha/non-beta release. Feedback on any version
is always appreciated.
NEWS
- July 2, 2008: libarchive 2.5 is stable and ready for production use.
In particular, I think bsdcpio has finally earned the "1.0.0" label.
- Apr 2, 2008: Benjamin Otte has been using libarchive to implement
a new archive browser for Nautilus.
The feature is still in development but should be widely available this fall.
- Mar 31, 2008: 2.5.1 includes an overhaul to the character-translation
logic used to convert between pax UTF-8 headers and the current locale.
This should make it much more robust when working with non-ASCII
filenames in mixed locales.
- Mar 15, 2008: The Feb 25 fix had a nasty bug that caused a lot of
entries to get rejected. Use 2.4.14 or later for a fix.
- Feb 25, 2008: When writing tar/pax files, names that could not
convert to UTF-8 got truncated at the first non-convertible character.
This was fixed by fully implementing the "hdrcharset" extension that
will be included in SUSv3-2008. (Thanks to Gunnar Ritter for
getting the POSIX/SUS committee to standardize a fix for this.)
- Oct 28, 2007: Jan Psota's benchmarks of several tar implementations
inspired me to take a critical look at some areas of libarchive.
As a result, libarchive 2.3.5 uses significantly less CPU time
reading and writing uncompressed archives.
- Aug 31, 2007: New mtree reader and manpage.
- Aug 5, 2007: New configure options --disable-bsdtar
- July 12, 2007: libarchive 2.2.4 fixes a couple of fairly serious
crash and memory-corruption bugs that have the potential to cause
security issues in some usages. In particular, certain malformed pax or
ustar archives can cause libarchive to crash, hang, or overrun internal
buffers. Anyone using libarchive 2.2.3 or earlier is strongly encouraged
to upgrade. This FreeBSD Security Advisory has more details.
Thanks to Colin Percival and the FreeBSD security team for helping
to identify and correct these problems.
- Apr 15, 2007: New 'ar' support from Kai Wang, new support
for external compression programs from Joerg Sonnenberger.
- Mar 11, 2007: bsdtar 2.0 is over 40% faster than bsdtar 1.x
in some applications.
- Mar 2, 2007: Libarchive 2.0 has many performance improvements,
a test harness for validating new ports, an improved API for restoring
objects to disk, and many bug fixes.
Platform Support
Libarchive and bsdtar run on so many platforms now that I'm
no longer keeping a list. In particular, I have reports of
it running on most Linux distributions, most BSD-based systems,
Mac OS X, AIX, Interix, Solaris, HP-UX, and Cygwin. It is the
standard system tar for FreeBSD, DragonFly BSD, and TinySofa
classic server. If you have problems with libarchive or bsdtar
on any platform, please let me know.
Regarding Windows: Windows support continues to gradually improve
thanks to feedback from a growing number of users. I know of one remaining
architectural issue--the reliance on dev/ino numbers to identify
like files--that needs to be reconsidered. This is related to
the question of how best to handle hard links and symbolic links
on Windows. The next big milestones are to get a Visual C++ solution
file for building libarchive and to build and run some subset of the
libarchive_test harness on Windows. Please let me know if you find
any other issues.
Documentation
What Is It?
Libarchive is a programming library that can create and
read several different streaming archive formats, including most
popular tar variants, several cpio formats, and both BSD and GNU
ar variants. It can also write shar archives and read ISO9660
CDROM images and ZIP archives. The bsdtar program is an
implementation of tar(1) that is built on top of libarchive.
It started as a test harness, but has grown into a feature-competitive
replacement for GNU tar. The bsdcpio program is an implementation
of cpio(1) that is built on top of libarchive.
The libarchive library offers a number of features
that make it both very flexible and very powerful.
- Automatic format detection: libarchive can automatically
determine both the compression and the archive format, regardless of
the data source. (GNU tar and star only do full format detection
when reading from a file, for instance. Gunnar Ritter's heirloom tar
also does full automatic format detection.)
- Reads popular formats: libarchive can read GNU tar,
ustar, pax interchange format, cpio, zip, and ISO9660 formats. The
internal architecture is easily extensible. The only requirement
for read support is that all metadata for a file must precede the
file data itself within the archive.
- Writes popular formats: libarchive can write
ustar, pax interchange format, cpio, and shar formats. The
internal architecture is easily extensible. The only requirement
for write support is that all metadata for a file must follow the
preceding file's data within the archive. (Yes, there are formats
that libarchive can write but not read and vice versa.)
- Reads and writes POSIX formats: libarchive reads
and writes POSIX-standard formats, including "ustar,"
"pax interchange format," and the POSIX "cpio" format.
- Supports pax interchange format: Pax interchange format (which, despite
the name, is really an extended tar format) eliminates almost all
limitations of historic tar formats and provides a standard method
for incorporating vendor-specific extensions. libarchive
exploits this extension mechanism to support ACLs and file flags, for
example. (Joerg Schilling's star archiver
and recent versions of GNU tar also support pax interchange format.)
- High-Level API: the libarchive API makes it fairly
simple to build an archive from a list of filenames or to extract
the entries from an archive. However, the API also provides extreme
flexibility with regard to data sources. For example, there are
generic hooks that allow you to write an archive to a socket or
read data from an archive entry into a memory buffer.
- Modular: The library design carefully minimizes link pollution.
If you only need read support for a single format, for example, you will only
get the required code. This minimizes the size of statically-linked
executables. (In particular, zlib and libbz2 are required only if
you specifically request gzip or bzip2 support.)
- Extensible: The internal design uses generic interfaces
for compression, archive format detection and decoding, and
archive data I/O. It should be very easy to add new formats,
new compression methods, or new ways of reading/writing archives.
- Featureful: Libarchive handles ACLs, file flags, extended
attributes, international characters, large files, long pathnames,
and many other features. Details vary depending on the particular
format, of course.
- Fast: Libarchive minimizes data copying when handling archive
files and contains carefully-tuned code for recreating objects on disk.
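The automatic format detection in the list above works from the bytes of the stream, not from a filename. The following Python sketch illustrates that idea only. It is not libarchive's detection code, the names `sniff` and `sample_tar` are made up for the example, and it recognizes just a handful of signatures (old-style tar archives without a "ustar" magic string would come back as "unknown").

```python
import bz2
import gzip
import io
import tarfile


def sniff(data):
    """Guess (compression, archive format) from the content of a stream.

    Illustrative only: real detectors know many more signatures and look
    at a short prefix instead of decompressing everything.
    """
    compression = "none"
    if data[:2] == b"\x1f\x8b":        # gzip magic number
        compression = "gzip"
        data = gzip.decompress(data)
    elif data[:3] == b"BZh":           # bzip2 magic number
        compression = "bzip2"
        data = bz2.decompress(data)

    if data[257:262] == b"ustar":      # ustar, pax, and GNU tar headers
        fmt = "tar"
    elif data[:6] in (b"070701", b"070702", b"070707"):  # ASCII cpio headers
        fmt = "cpio"
    elif data[:4] == b"PK\x03\x04":    # zip local file header
        fmt = "zip"
    else:
        fmt = "unknown"
    return compression, fmt


def sample_tar():
    """Build a one-entry ustar archive in memory for the demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        info = tarfile.TarInfo("hello.txt")
        info.size = 6
        tf.addfile(info, io.BytesIO(b"hello\n"))
    return buf.getvalue()


if __name__ == "__main__":
    tar_bytes = sample_tar()
    for blob in (tar_bytes, gzip.compress(tar_bytes), bz2.compress(tar_bytes)):
        print(sniff(blob))
```

Because the decision depends only on the data, the same function gives the same answer whether the bytes came from a file, a pipe, or a socket.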
The bsdtar archiving program is built on libarchive,
so it offers a variety of modern features. One unusual feature it
offers is the ability to function as a format-conversion filter,
reading entries from one archive and emitting an archive in a
different format with the same contents. This feature was simple
to implement because libarchive's robust automatic format
detection makes it unnecessary to specify the format of the input
archive. More details are available in the bsdtar documentation
above.
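The conversion-filter idea can be sketched in a few lines. The example below uses Python's standard tarfile module as a stand-in, not bsdtar or libarchive, and it only handles tar input. It reads entries from a stream whose compression is detected automatically and re-emits them as an uncompressed pax archive. The function name `convert_to_pax` is made up for this illustration.

```python
import io
import tarfile


def convert_to_pax(src, dst):
    """Copy every entry from the tar stream `src` to `dst` in pax format."""
    # "r|*" reads a non-seekable stream and detects the compression itself;
    # "w|" writes an uncompressed stream.
    with tarfile.open(fileobj=src, mode="r|*") as reader, \
         tarfile.open(fileobj=dst, mode="w|", format=tarfile.PAX_FORMAT) as writer:
        for member in reader:
            # Only regular files carry data; directories and links are
            # copied as header-only entries.
            data = reader.extractfile(member) if member.isreg() else None
            writer.addfile(member, data)


if __name__ == "__main__":
    # Build a small gzip-compressed GNU-format archive, then convert it.
    source = io.BytesIO()
    with tarfile.open(fileobj=source, mode="w:gz", format=tarfile.GNU_FORMAT) as tf:
        info = tarfile.TarInfo("hello.txt")
        info.size = 6
        tf.addfile(info, io.BytesIO(b"hello\n"))
    source.seek(0)
    target = io.BytesIO()
    convert_to_pax(source, target)
    print(len(target.getvalue()), "bytes of pax output")
```

Note that the caller never states the input format or compression; only the output format is chosen explicitly.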
The bsdtar program has a number of advantages over previous
tar implementations:
- Library. Since the core functionality is in a library, it
can be used by other tools.
- Automatic format detection. Libarchive automatically detects
the compression (none/gzip/bzip2) and format (old tar, ustar, gnutar,
pax, cpio, iso9660, zip) when reading archives. It does this for any data
source.
- Pax Interchange Format Support. This is a POSIX/SUSv3 extension to the
old "ustar" tar format that adds arbitrary extended attributes to
each entry. It does everything that the GNU tar format does, only better.
- Handles file flags, ACLs, arbitrary pathnames, etc. Pax interchange
format supports key/value attributes using an easily-extensible
technique. Arbitrary pathnames, group names, user names, and file sizes
are part of the POSIX standard; libarchive extends this with support
for file flags, ACLs, and arbitrary device numbers.
- GNU tar support. Libarchive reads most GNU tar archives.
If there is demand, this can be improved further.
- BSD license.
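To make the key/value mechanism of pax interchange format concrete, here is a sketch using Python's standard tarfile module (a stand-in, not libarchive). It writes one entry with a non-ASCII name and a vendor-style attribute, then reads both back. The key `EXAMPLE.fflags`, its value, and the helper names are invented for the illustration; they are not the keywords libarchive actually writes.

```python
import io
import tarfile


def write_with_attribute(name, payload, key, value):
    """Return a pax archive holding one file with one extra attribute."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        # Extra key/value pairs travel in the pax extended header that
        # precedes the entry's regular ustar header.
        info.pax_headers = {key: value}
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_attributes(archive_bytes):
    """Map each entry name to the pax key/value pairs stored with it."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tf:
        return {member.name: dict(member.pax_headers) for member in tf}


if __name__ == "__main__":
    blob = write_with_attribute("grüße.txt", b"hi\n", "EXAMPLE.fflags", "nodump")
    print(read_attributes(blob))
```

A reader that does not understand a given key can ignore it and still extract the file, which is what makes the format easy to extend.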
The new pkg_add I've been experimenting with has a
number of advantages over the earlier implementation:
- No temp directory: It analyzes the packing list as
it extracts the archive. This allows it to place files directly
into their final locations without requiring a temp directory.
- Faster: Eliminating the temp directory provides a significant
performance boost.
- Automatic format detection: It automatically detects and correctly
handles tar/gzip or tar/bzip2 packages, regardless of filename or data
source.
- Cleaner code: The old pkg_add had a lot of overhead caused by
the need to build command strings for a separate tar program.
By using a library, I expect to eliminate a lot of that code.
- New features: Libarchive supports an extended tar format
called "pax interchange format" that can handle ACLs, file flags, etc.
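The single-pass approach in the first bullet can be sketched roughly as follows. This is a guess at the general shape, not the actual pkg_add code, and it uses Python's standard tarfile module in place of libarchive. It assumes the packing list is the first entry, is named "+CONTENTS", and uses "@cwd" lines to set the directory for the filenames that follow; the function name `install_stream` is hypothetical.

```python
import io
import os
import tarfile
import tempfile


def install_stream(stream, root):
    """Extract a package stream under `root` in one pass, with no temp dir."""
    destinations = {}
    installed = []
    # "r|*" reads strictly front to back and detects gzip or bzip2 itself.
    with tarfile.open(fileobj=stream, mode="r|*") as tf:
        for member in tf:
            if member.name == "+CONTENTS":
                cwd = ""
                text = tf.extractfile(member).read().decode()
                for line in text.splitlines():
                    if line.startswith("@cwd "):
                        cwd = line[5:].strip().lstrip("/")
                    elif line and not line.startswith("@"):
                        destinations[line] = os.path.join(root, cwd, line)
                continue
            target = destinations.get(member.name)
            if target is None or not member.isreg():
                continue  # unlisted or non-regular entries are skipped here
            # A real tool would also validate the path before writing.
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                out.write(tf.extractfile(member).read())
            installed.append(target)
    return installed


if __name__ == "__main__":
    package = io.BytesIO()
    with tarfile.open(fileobj=package, mode="w:gz") as tf:
        for name, data in (("+CONTENTS", b"@cwd /usr/local\nbin/demo\n"),
                           ("bin/demo", b"#!/bin/sh\n")):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    package.seek(0)
    print(install_stream(package, tempfile.mkdtemp()))
```

Because the packing list arrives before the files it describes, each file can be written straight to its final location as its data streams past.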
What Still Needs To Be Done?
- libarchive: Everything is pretty mature.
Support for writing sparse files and for building on Windows are the two
items I'm most interested in right now.
- bsdtar: Very mature. The most-requested features are
the ability to write sparse files into archives and multi-volume
support.
- bsdcpio: Beta-quality. It is basically feature-complete
and should be usable as a replacement for GNU cpio, but more testing
is certainly warranted.
- rmt support. I'm looking at importing librmt from
NetBSD.
Why?
A few years ago, there was a debate on one of the mailing lists
about the FreeBSD package tools. The debate concerned two
interrelated questions:
- What is the "right" format for a package system?
- Why are FreeBSD's package tools slower than their
counterparts on other systems?
After looking into it a bit, I concluded that tar/gzip and tar/bzip2
were fine formats and that the performance problems were purely
implementation issues.
So, I started a project to rewrite the package tools, beginning
with pkg_add. Key to this project is a library that
understands tar/gzip and tar/bzip2 archives. Once I had built
libarchive, I realized that I nearly had a complete
BSD-licensed replacement for GNU tar, hence bsdtar.
My rewrite of pkg_add to use libarchive has stalled
due to lack of time; I hope to get back to it sometime soon.
An early prototype of the core ideas showed a three-fold performance
increase over the earlier pkg_add implementation without any
changes to the package file format.