The FreeBSD Ports Monitoring System

Introduction

Among the least-publicized strengths of the FreeBSD development
model are users' access to the CVS source tree and the continual
QA work being done via ongoing build processes.  The work described
in this article attempts to leverage these strengths to help ease the
process of porting, and maintaining, applications for FreeBSD.

Overview

The set of programs known as 'portsmon', currently running at
portsmon.FreeBSD.org, is a tool to gather and disseminate information
about FreeBSD ports.  (In FreeBSD terminology, unlike NetBSD terminology,
a 'port' refers to a Makefile and set of related source files to build
an application; a 'package' refers to the binary file(s) created as
its output.)

portsmon is written in Python and carries a BSD license.

It works by data-mining some existing information that was already
available on the web in several different locations.  For instance,
there are several automated processes that exist to provide Quality
Assurance (QA) feedback for the ports tree.  Each of these processes
produces results that are generally posted in HTML format on a regular
basis.  In addition, there are other sources of information (such as
the Problem Report (PR) database and the source files themselves, which
are located on repository mirrors), which are also suitable for mining
information from.

Until the creation of portsmon there was no way to correlate these
sources of information in a way that could be browsed by a human.
portsmon grabs the HTML pages, parses them, puts them into a database,
and allows interactive queries from HTML forms.  In addition, it
periodically outputs email with the status of ports that have some
kind of error.

The first instance of portsmon was installed in May 2003 and one
instance or another of it has been running continuously since that
time.

Why It Is Needed

Let's consider how people might actually need or want to use the
existing information.

 - A maintainer of an individual port is primarily interested
   in the question "do my ports work?"  The majority of
   maintainers may be most interested in finding out whether
   their ports build on the "stable" branch (currently 6.X) on
   e.g. the Intel x86 architecture, which is available on a single
   HTML summary page.  But beyond that, they should also be
   encouraged to, firstly, make sure that their ports build on
   the "current" branch, and secondly, on other architectures
   as well.  (Without this latter, there is not much point in
   having architectures other than x86 supported as "first-tier"
   architectures!)  In addition to the build problems, they will
   likely also want to see pending PRs against their ports.

 - An individual user is primarily interested in "will this port
   I just found out about work on my OS release and architecture?"

 - A FreeBSD committer may be interested in finding out information
   that applies to one maintainer.

 - A member of the port release management team ("portmgr") may
   be interested in finding any problems affecting either large
   numbers of ports, or the integrity of the ports build mechanism
   itself.

 - Any general member of the FreeBSD community may be interested in
   seeing metrics about the overall "health" of the ports collection.

What portsmon does is provide a single place to see correlated
results of the Problem Reports, the build error logs, and data from
the CVS tree.  To the extent that the PR database is lacking in certain
features, portsmon serves as a supplement to it.  This will be discussed further
below.

Sources of Information

Build Logs

There exists a cluster of FreeBSD machines (known as the "bento cluster",
from the former hostname of the coordinating machine) whose sole purpose
is to run makes of the entire set of applications (known in the FreeBSD
parlance as the "ports tree").  The source for each port is automatically
fetched from wherever the source is hosted, and then built into a
complete binary package.  The logfile that results from each port build is
scanned for build errors; if any are found, an error summary line is created.
This operation is repeated over all combinations of the "stable"
OS release code versus the "current" OS release code, and the various
architectures supported by FreeBSD.  For each combination, several
HTML summary pages are created, each sorted by various columns such
as portname, maintainer, and so forth, having as their contents
the error summary lines.

For purposes of this discussion, the individual combinations of OS
release and processor architecture are termed "build environments" or
"buildenvs".  It should be noted that the bento cluster only produces
its own HTML reports for each individual buildenv.
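
The scraping step can be illustrated with a minimal sketch.  It assumes
that a summary page is a plain HTML table with one row per failing port;
the column order, the sample data, and the names SummaryTableParser and
parse_summary are all hypothetical, and the real pages may be laid out
differently.

```python
from html.parser import HTMLParser


class SummaryTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows made only of <th> cells stay empty
                self.rows.append(self._row)
            self._row = None


def parse_summary(html, buildenv):
    """Return one dict per error line, tagged with its buildenv."""
    parser = SummaryTableParser()
    parser.feed(html)
    parser.close()
    columns = ("portname", "maintainer", "error")   # assumed column order
    return [dict(zip(columns, row), buildenv=buildenv)
            for row in parser.rows]


# Hypothetical sample page for one buildenv.
sample = """
<table>
<tr><th>Port</th><th>Maintainer</th><th>Error</th></tr>
<tr><td><a href="x.log">www/example</a></td><td>ports@FreeBSD.org</td><td>fetch</td></tr>
</table>
"""
records = parse_summary(sample, "i386-CURRENT")
```

Tagging each record with its buildenv is what later allows rows from
the separate per-buildenv pages to be compared side by side.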

Problem Report Database (GNATS)

The PR database, known as GNATS, does not have any particular knowledge
of what a "port" or a "maintainer" is.  Anyone can send a PR, either from
a FreeBSD machine that can send email, using send-pr(1), or from
a web form.

PRs vary greatly in quality; there is currently no individual who is
guaranteed to "screen" them for applicability, accuracy, and so forth
(although the present author, as a member of the bugmaster team, attempts
to do so).  Further, GNATS has no concept of "individual component", and
thus no concept of "maintainer of component".  However, each port in
the Ports Collection is assigned to a maintainer (although there is a
fallback email address standing in for "no maintainer").  Therefore, an
algorithm had to be introduced that would parse the raw data in incoming
and modified PRs in the 'ports' category and attempt to assign a category
and portname to each one.  The current version of this algorithm is
approximately 93% accurate.
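
The general idea of such a classifier can be sketched as follows.  This
is not the algorithm portsmon uses; it is a simplified illustration that
assumes only the PR synopsis is examined, and the regular expressions,
the function name classify_pr, and the sample synopses are all
hypothetical.  The four result labels mirror the explanation types used
by the PR reports.

```python
import re

# Matches tokens shaped like "category/portname" anywhere in a synopsis.
PORT_TOKEN = re.compile(
    r"\b([a-z0-9][a-z0-9_-]*)/([A-Za-z0-9][A-Za-z0-9_.+-]*)")

# Matches framework file names such as bsd.port.mk.
FRAMEWORK_FILE = re.compile(r"\bbsd\.[a-z.]+\.mk\b")


def classify_pr(synopsis, known_ports):
    """Guess (explanation, port) for one PR in the 'ports' category."""
    candidates = ["/".join(m.groups()).rstrip(".,")
                  for m in PORT_TOKEN.finditer(synopsis)]
    # A token naming a port that exists in the tree is the best evidence.
    for port in candidates:
        if port in known_ports:
            return ("existing", port)
    lowered = synopsis.lower()
    if candidates and "new port" in lowered:
        return ("new", candidates[0])
    if FRAMEWORK_FILE.search(lowered):
        return ("framework", None)
    # Anything else is left for a human to sort out.
    return ("unknown", None)
```

For example, with known_ports = {"www/example"}, the synopsis
"Update port: www/example to 1.2" is classified as an existing port,
while "it does not build" falls through to "unknown".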

As of mid-2006, FreeBSD is averaging somewhere around 40 ports PRs per
day.  There are usually somewhere between 500 and 1500 ports PRs, depending
on the stage of the release cycle.  By comparison, there are over 14,000
ports.  The percentage of ports with PRs has actually gone down in the
past two years due to some concentrated effort by a number of committers.

CVS repository

The source files in the ports tree (in particular the Makefiles) are
also used to extract 'metadata' about the port, such as its name, its
maintainer, whether it is buildable, and others.  cvsup is used to
fetch the latest updates, and its output is examined to indicate when
a port's metadata might have changed (and thus need to be regenerated).
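
A minimal sketch of examining the update log follows.  The shape of the
log lines (an action word such as Edit, Checkout, or Delete, followed by
a path under ports/) is an assumption made for illustration, and real
cvsup output may differ between versions and modes; the function name
ports_needing_rescan is hypothetical.

```python
import re

# Assumed shape of a log line: an action word, then a path under ports/.
CHANGE_LINE = re.compile(r"^\s*(Edit|Checkout|Delete)\s+ports/(\S+)")


def ports_needing_rescan(cvsup_output):
    """Return the set of 'category/portname' origins touched by an update."""
    touched = set()
    for line in cvsup_output.splitlines():
        match = CHANGE_LINE.match(line)
        if not match:
            continue
        parts = match.group(2).split("/")
        # Only paths at least three levels deep (category/port/file)
        # belong to an individual port; ports/MOVED and the category
        # Makefiles are shallower and are handled separately.  A real
        # implementation would also check parts[0] against the list of
        # categories.
        if len(parts) >= 3:
            touched.add(parts[0] + "/" + parts[1])
    return touched


# Hypothetical log excerpt.
log = """Updating collection ports-all/cvs
 Edit ports/www/example/Makefile
 Checkout ports/www/example/files/patch-aa
 Edit ports/MOVED
 Delete ports/devel/old/Makefile
"""
changed = ports_needing_rescan(log)
```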

Changes to individual category Makefiles are used to detect when a port
has been added or deleted.  In addition, a file called MOVED has entries
for ports that have either been renamed, obsoleted by some other port,
or deleted as being no longer maintained by its upstream author or having
security problems.  The MOVED technology is used by automated tools
such as portupgrade(1) to help the user deal with these changes.  In
particular, when ports from one category are moved to another (possibly
new) category, these tools can update the tree and save the user from
having to deal with the issues.
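
As an illustration of how a tool might consume MOVED, the sketch below
assumes each entry is a single line of four fields separated by '|'
(old origin, new origin, date, reason), with an empty second field for
a deleted port and '#' starting a comment.  The function names and the
sample entries are hypothetical.

```python
def parse_moved(text):
    """Map each old origin to (new_origin_or_None, date, reason)."""
    moved = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        old, new, date, reason = line.split("|", 3)
        moved[old] = (new or None, date, reason)   # later entries win
    return moved


def resolve(origin, moved):
    """Follow renames to the current origin; return None if deleted."""
    seen = set()
    # The 'seen' set stops the loop if two entries rename back and forth.
    while origin in moved and origin not in seen:
        seen.add(origin)
        origin = moved[origin][0]
        if origin is None:
            return None
    return origin


# Hypothetical sample entries.
sample = """# old|new|date|reason
www/old|www/newer|2005-01-01|renamed
www/newer|www/example|2006-02-02|project changed its name
devel/gone||2006-03-03|no longer maintained upstream
"""
moved = parse_moved(sample)
```

Here resolve("www/old", moved) follows two renames to reach
"www/example", and resolve("devel/gone", moved) reports a deletion by
returning None.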

Information Extracted from the CVS Tree

 - for individual ports: 

   - Data from Makevars: 

     - MAINTAINER

     - status: BROKEN/DEPRECATED/IGNORE/FORBIDDEN

       These are special designations which are useful to users and maintainers.

     - EXPIRATION_DATE

       Ports marked for deletion are usually not deleted immediately, unless
       there is a licensing problem.  A mechanism using the Makevars DEPRECATED
       and EXPIRATION_DATE makes this process more transparent to the greater
       community.

       Since this process was instituted, a large number of ports that were
       previously broken and not being paid attention to were marked for
       deletion.  This brought a great deal more attention to their
       state, and in nearly half the cases has led to the ports being
       fixed.  Others have been deleted as being deemed to have outlived
       their usefulness; but in this case, at least potential users have
       been given a heads-up.

     - IS_SLAVE_PORT

     - MASTERPORT

       Since it would be too time-consuming to rebuild the entire set of
       metadata for the tree on every CVSup run, a shortcut is adopted to
       only necessitate updates to "affected ports".  For portsmon's
       purposes, this is defined as the ports having a master-slave
       relationship; this is the most general case of variable inheritance
       in the Makefiles.
       
   - There are some others that only affect the display on portsoverall.py.

   - Data from raw files: 

     - Makefile version

       This is used to decide when a port's metadata needs to be rescanned
       to start with.

 - for the ports tree overall: 

   - Data from raw files: 

     - bsd.port.mk

       This file contains the master list of categories.

     - category Makefiles

       These files contain the list of ports within each category.

     - MOVED

       This file is a mechanism for tracking ports that have either been
       renamed (for instance, the project changed its name, or a new
       version has been released that is incompatible with an old one,
       forcing the renaming of the existing portname); or deleted (either
       being due to no longer being developed, or licensing or security
       problems.)
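
The Makevars listed above can be read with make -V, which prints the
value of a variable.  The following is a minimal sketch rather than
portsmon's own code: it assumes that one make invocation is given
several -V flags and prints one line per variable in order, and the
helper names and sample values are hypothetical.

```python
import subprocess

# Variable names as listed in this article.
MAKEVARS = ("MAINTAINER", "BROKEN", "DEPRECATED", "IGNORE", "FORBIDDEN",
            "EXPIRATION_DATE", "IS_SLAVE_PORT", "MASTERPORT")


def make_command(variables=MAKEVARS):
    """Build one make invocation that prints every variable in order."""
    command = ["make"]
    for name in variables:
        command += ["-V", name]
    return command


def parse_make_output(output, variables=MAKEVARS):
    """Pair each output line with its variable; unset ones are empty."""
    lines = output.split("\n")
    return {name: (lines[i].strip() if i < len(lines) else "")
            for i, name in enumerate(variables)}


def read_metadata(port_dir):
    """Run make in port_dir (needs a ports tree and BSD make)."""
    result = subprocess.run(make_command(), cwd=port_dir, check=True,
                            capture_output=True, text=True)
    return parse_make_output(result.stdout)


# Hypothetical output for a slave port with no special status set.
sample_output = ("ports@FreeBSD.org\n" + "\n" * 5 +
                 "yes\n/usr/ports/www/example\n")
metadata = parse_make_output(sample_output)
```

Asking for all of the variables in a single invocation matters because
each run of make has to read the whole ports framework before it can
answer.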

Schema: Tables

See the slides for a more complete explanation.

Schema: Attributes

See the slides for a more complete explanation.

Database updating

All of the database updating is done on the fly; static updates would take
too long, as an entire dependency tree would have to be built.  Since
portsmon does not model the dependency tree, this time would be wasted.
Also, portsmon relies on some metadata which is not included in the default
'make index' run, such as IS_SLAVE_PORT.

Some optimizations have had to be made to the database algorithms to enable
incremental updates.  For instance, IS_SLAVE_PORT is used to look up any
extra ports that will need to have their metadata re-read because of an
update to a master port.  The existing INDEX file can be used to figure
out if a port has a master port, but cannot be used (as such) to figure
out which ports have slave ports.
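
The reverse lookup described above can be sketched as follows, assuming
the database can supply a mapping from each slave port to its master.
The function names and the sample ports are hypothetical.

```python
from collections import defaultdict


def build_slave_index(master_of):
    """Invert {slave: master} into {master: set of slaves}."""
    slaves_of = defaultdict(set)
    for slave, master in master_of.items():
        slaves_of[master].add(slave)
    return slaves_of


def affected_ports(updated, slaves_of):
    """Ports to re-read after 'updated' changes: the port itself plus
    all of its slaves, including slaves of slaves."""
    affected, pending = set(), [updated]
    while pending:
        port = pending.pop()
        if port in affected:
            continue
        affected.add(port)
        pending.extend(slaves_of.get(port, ()))
    return affected


# Hypothetical master-slave relationships.
master_of = {
    "www/example-devel": "www/example",
    "www/example-nox11": "www/example",
    "www/example-devel-nox11": "www/example-devel",
}
index = build_slave_index(master_of)
```

With this index, an update to www/example marks all four ports for
rescanning, while an update to www/example-nox11 affects only itself.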

Tables driven from CVS data are updated every 60 minutes; this is the CVSup
interval of the servers.

FREEBSD_ERRORLOG_TABLE is updated every 30 minutes; the scan of this file
is very fast, so it could be done more quickly, but given that it takes
several days to build all 14,000+ ports, even on the fast architectures,
not that many entries really change more often than that.

Outputs

The outputs of portsmon are divided into two areas: on-demand HTML
pages, and email notifications.

Reports Available Interactively via HTML: Regular Reports

This list is not all-inclusive, but focuses on the reports that are
most useful to individual committers or maintainers.

Each report tries to focus on one particular area.  It would be infeasible
to try to show all possible data contained in the database; consider it
as an N-dimensional space based on category/portname; maintainer; error
type; PR number; and so forth.

The reports that focus on build errors:

 - by type and buildenv (portserrcounts.py) 

 - by category (portserrcountsbycategory.py) 

 - by portname and buildenv (portscrossref.py) 

Build errors and Problem Reports 

 - by portname (portsconcordance.py) 

 - for one maintainer (portsconcordanceformaintainer.py) 

 - error counts by maintainer (portsbymaintainer.py) 

 - by portname, for one maintainer (portsconcordanceforresponsible.py) 

 - by portname, unmaintained ports (portsconcordancefornoresponsible.py) 

 - by portname, ports for one status (e.g. BROKEN)
   (portsconcordanceforbroken.py et al.)  Note: this is currently only
   evaluated on i386-CURRENT; see below.

 - by portname, for one or more buildenvs (portsconcordanceforbuildenv.py) 

 - overall status (portsoverall.py) 

Problem Reports 

 - by portname, for existing ports (portsprsbyportname.py)

 - by PR number, for existing ports (portsprsbyexplanation.py?explanation=existing)

 - by PR number, for new ports (portsprsbyexplanation.py?explanation=new)

 - by PR number, for framework (portsprsbyexplanation.py?explanation=framework)

 - by PR number, for unknown (portsprsbyexplanation.py?explanation=unknown)

 - Overview of one port (portoverview.py)

   This gives all the PRs and error logs for one port; in addition, links
   to the CVS web page, the URL for the mastersite, the FreshPorts page,
   and other interesting entries are included.
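
The concordance reports amount to correlating two kinds of records by
portname.  The sketch below shows the idea with an in-memory SQLite
database; the table names, column names, and sample rows are
hypothetical and are not portsmon's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE build_errors (portname TEXT, buildenv TEXT, error TEXT);
    CREATE TABLE prs (pr_number INTEGER, portname TEXT, synopsis TEXT);
    INSERT INTO build_errors VALUES ('www/example', 'i386-CURRENT', 'fetch');
    INSERT INTO build_errors VALUES ('devel/sample', 'i386-CURRENT', 'unknown');
    INSERT INTO prs VALUES (1, 'www/example', 'update to 1.2');
    INSERT INTO prs VALUES (2, 'mail/demo', 'fix plist');
""")


def concordance(conn):
    """One row per port having any build error or PR, with both counts."""
    return conn.execute("""
        SELECT portname, SUM(n_errors), SUM(n_prs)
        FROM (
            SELECT portname, 1 AS n_errors, 0 AS n_prs FROM build_errors
            UNION ALL
            SELECT portname, 0 AS n_errors, 1 AS n_prs FROM prs
        ) AS combined
        GROUP BY portname
        ORDER BY portname
    """).fetchall()


rows = concordance(conn)
```

Using a union rather than a join keeps ports that have only build
errors or only PRs in the result, which is what a maintainer scanning
for any kind of trouble would want.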

Reports Available Interactively via HTML: Specialized Reports

This list is not all-inclusive, but focuses on the reports that are
most useful to individual committers or maintainers.

Anomalies (portsanomalies.py) 

This attempts to find both cases where the internal algorithms have
generated inconsistent results, and PRs which may have become
misassigned (for instance, still assigned to a maintainer who has resigned).

Dependency tree for one port (portdependencytree.py) 

The file /usr/ports/INDEX can show dependencies of ports as they are
built by default.  However, it does lag by some number of hours.
portsmon has a complete mirror of the tree since the last CVS update,
however, so it can allow for queries to be run from the web pages.

Unlike the existing tools in the ports tree that will only show
dependencies for ports installed on one particular machine, this HTML
page will show all possible dependencies as though the total tree were
installed (e.g. potential dependencies).
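
Computing all potential dependencies is a matter of following the
direct dependencies transitively.  A minimal sketch, assuming the
direct dependencies of every port are already available as a mapping
(the function name and the sample ports are hypothetical):

```python
def all_dependencies(port, direct_deps):
    """Return every port reachable from 'port' through direct_deps."""
    found, pending = set(), list(direct_deps.get(port, ()))
    while pending:
        dep = pending.pop()
        if dep in found:
            continue          # already visited; also guards against cycles
        found.add(dep)
        pending.extend(direct_deps.get(dep, ()))
    return found


# Hypothetical direct dependencies.
direct_deps = {
    "www/example": ["devel/libfoo", "lang/bar"],
    "devel/libfoo": ["devel/libbase"],
    "lang/bar": ["devel/libbase"],
}
deps = all_dependencies("www/example", direct_deps)
```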

Port has moved (portsprsformoved.py)

This is used to find PRs with stale categorization; e.g., where the
port has been renamed or repocopied, but the PR assignment is now stale.

Port has maintainer update (portsprsmaintainerupdates.py) 

This allows committers, if they wish, to prioritize these PRs.

Ports where maintainer is committer (portsprsmaintaineriscommitter.py)
  or maintainer is not committer (portsprsmaintainerisnotcommitter.py) 

These are mainly of interest for ensuring that PRs are assigned correctly.

Ports where maintainer might not know (portsprsmaintainermightnotknow.py) 

Since GNATS has no concept of 'maintainer', it is possible for PRs to be
entered where the maintainer will never know about them (e.g., the
maintainer was not Cc:ed.)  An automated process on the main FreeBSD
repository machine reminds people with @FreeBSD.org addresses about all
their PR assignments (not just ports), but this report fills in the gap
for non-committer maintainers.
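
One way to approximate this check is sketched below.  It assumes each
PR record carries the submitter's address, the address the PR is
assigned to, and a list of Cc: addresses; these field names and the
function name are hypothetical, and the real report may use other
evidence.

```python
def maintainer_might_not_know(pr, maintainer):
    """True if nothing in the PR shows that the maintainer was told."""
    notified = {pr["submitter"].lower(), pr["responsible"].lower()}
    notified.update(addr.lower() for addr in pr["cc"])
    # Addresses are compared case-insensitively.
    return maintainer.lower() not in notified


# Hypothetical PR record.
pr = {
    "submitter": "user@example.org",
    "responsible": "freebsd-ports-bugs@FreeBSD.org",
    "cc": [],
}
needs_notice = maintainer_might_not_know(pr, "maint@example.org")
```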

Ports with no maintainer (portsprsunmaintained.py) 

This is to allow committers to look for PRs that may not be receiving
attention because there is no maintainer; this may also allow interested users to
see if there are unmaintained ports that they would like to adopt.

Ports with possibly misconveyed PRs (portsmisconveyedprs.py)

This, again, is mainly of interest for ensuring that PRs are assigned correctly.

Reports Generated via Email

Every two weeks a set of email reports is generated to attempt to
alert users and maintainers of the status of individual ports.  This
is an attempt to make parts of the process more visible (e.g. the
process by which ports are scheduled for removal, either due to
technical issues such as fetch failures or due to security-related
issues):

 - Ports which are currently marked broken; this is sent to the
   individual maintainers

 - Ports which are currently scheduled for deletion; this is sent to
   freebsd-ports@FreeBSD.org

 - Ports which are currently marked "forbidden", to maintainers 

 - Ports with PRs where maintainer might not be aware of them, to
   maintainers

This last report is necessary because GNATS, as stated before, does not
have any concept of 'maintainer', and thus cannot notify a maintainer
when a PR comes in.  A tool known as 'gtk-send-pr' takes care of this
automatically, but not everyone uses it.

Charts and Graphs

There is an uncompleted project to provide some charts and graphs.
This includes

 - a bar chart of number of ports marked BROKEN, by maintainer;

 - a bar chart of percent of ports with build errors by build environment;

 - a pie graph of percent of PRs by explanation type;

 - a pie graph of percent of PRs by state; and

 - a bar chart of unique error counts in all buildenvs.

How Well Does It Work?

Despite its hacky origins, the system is quite robust.  It has been online
for public access for over 2 years, in one instance or another, and has
not crashed during this time.

The GNATS PR classifier gets about 93% true positives on PR assignment
and about 3% false positives; the rest are classified as "unknown" and must be
manually fixed.  Often, it is easier to change the PR synopsis than simply
do the override in the database; this also allows for people that use
GNATS to search PRs to get the right answer.

You only have to totally rebuild the database on new installs; the code
keeps up with incremental changes otherwise.  However, there are some
bugs in the MOVED processing, so that MOVED needs to be rescanned
periodically.  There may be some interaction with respect to IS_SLAVE_PORT
here.

What's Missing?

portsmon does not yet model packages that build successfully.  Since
all but a few percent of ports do successfully build into packages,
this is a fair amount of data.  An alpha-quality code implementation
exists that scans the directory listings (e.g. via ftp), but due to its
slowness it has not yet been integrated.

Integrating this code would enable us to answer the questions "how far
behind is a given package for the underlying port", and "how many ports
successfully package on amd64 vs. i386."

portsmon also does not yet model port metadata such as BROKEN/FORBIDDEN/
IGNORE for anything other than the i386 architecture, and for one single
OS revision.  Therefore, only one buildenv is completely represented in
the database.  Using i386 is easiest because the metadata are evaluated
on an i386 machine.  However, it is possible to override both the arch
and the OS version -- in fact, the latest OS version is generally used
because it is thought that that is where most new problems will arise
that need maintainer attention.  Using a single buildenv both simplifies
the database, and minimizes update time, since the make -V invocations
are one of the rate-limiting factors of the whole process.

It is also unknown exactly how much metadata other than the build
status depends on buildenv.  Before portsmon and FreshPorts, it is
entirely possible that no one even considered the possible impact of
allowing other metadata to vary based on these variables.  Therefore,
it is hard to know exactly how many port Makefiles might make such
changes, and thus exactly how much of the database would have to be
generalized to be by-buildenv.  (There is at least one case known where
PORTREVISION varies depending on ARCH.)

Another source of data is the list of ports for which one or more
distfiles fail to fetch, generated by Bill Fenner.  This is discussed
more thoroughly below.

One more source of data is a pair of new projects that attempt to scan the
home pages for various port projects and identify new revisions that
may be available.  Both Edwin Groothuis and one other author have
programs that do this.  portsmon's author had also implemented an
alpha-quality scanner, but it has not yet been finished and integrated.
(It does work surprisingly well -- correctly identifying updates in at
least 50% of the pages it scans -- but it has no caching or mastersite
selection, and so just hits the same mastersites over and over again.
This doesn't seem to be in the spirit of 'fair play'.)

It is unknown how well the 3 implementations' algorithms compare with
each other.

Finally, it is not currently possible to get data about such things
as "maintainer timeouts" to learn when an individual may have lost
interest in participating, and thus the maintainership should be
passed to someone else.  To do this, it will be necessary to mine
the data from the CVS logs.  An alpha implementation of the parser
exists, but no database backend has been created.

Having this data would then allow the asking of questions such as
"show me only ports that have failed to build after their last CVS
update".
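
Once commit times are in the database, such a query reduces to a
timestamp comparison.  A minimal sketch, assuming two mappings from
portname to the time of the last CVS commit and of the most recent
build failure (the function name and the sample data are hypothetical):

```python
from datetime import datetime


def failing_since_last_update(last_commit, last_failure):
    """Ports whose latest build failure is newer than their last commit."""
    return sorted(port for port, failed_at in last_failure.items()
                  if port in last_commit and failed_at > last_commit[port])


# Hypothetical timestamps.
last_commit = {
    "www/example": datetime(2006, 5, 1),
    "devel/sample": datetime(2006, 6, 10),
}
last_failure = {
    "www/example": datetime(2006, 6, 1),    # failed after the last commit
    "devel/sample": datetime(2006, 6, 5),   # failed, then was updated
}
still_failing = failing_since_last_update(last_commit, last_failure)
```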

There are many other reports that could also be generated, including
a more general-purpose query page.  The latter would require some
restructuring of the database.

Bugs

 - Some of the reports still do not have the ability to sort
   on some columns.

 - The design of the database was done to optimize the ability to
   tinker with the reports, not to make the queries efficient.  This
   has the disadvantage of making the pages fairly slow.  The advantage
   is that there have not needed to be any "flag days" where the database
   had to be completely reloaded as new features were added.

 - The database update mechanism could be improved;
   for a larger deployment, moving some of the SQL to transactions
   would be necessary.

 - Some of the configuration needs to be less hard-coded.

Related work

FreshPorts

Dan Langille's FreshPorts (www.freshports.org) is a set of web pages
that correlate a great deal of information about FreeBSD ports.  The
scope of FreshPorts overlaps this work to some degree, but the projects
are complementary.  There are areas where portsmon has information that
FreshPorts does not have, and vice versa.  In each case we are trying
to view various "meta-information" about individual ports.  Both are
necessary but there is still more that can be added.

FreshPorts concentrates on all ports, not just ports that have either
build errors or PRs.  It is more oriented towards individual users
than is portsmon.  It also is designed to automatically notify interested
maintainers (via email) of problems as soon as they occur.  The framework
for the emailing and subscriber control is outside what the scope of this
work would be able to do.  So, while there is some degree of overlap,
the intended audience and application is different.

In addition, FreshPorts models the dependency tree, which is not a focus
for portsmon.  portsmon is more concerned with individual ports.

A final implementation note: FreshPorts parses CVS commit mail to
track updates; portsmon uses the output of cvsup.  This allows FreshPorts
to be more up-to-date than portsmon; however, in the event of unreliable
email, portsmon may prove to be more robust.

Bill Fenner's Distfile Survey

Bill Fenner maintains a report of ports that fail to fetch.  Unlike
the pointyhat errorlogs, which only assert the error "fetch" if the
sourcefile cannot be fetched from _any_ server, Bill's reports
include all sourcefiles that cannot be fetched from each of the
servers on which they are supposed to reside.  Further, his
report shows how long each of the individual fetch failures has
been ongoing.  This data should be included in the database.

Edwin Groothuis' PR auto-assigner

When portsmon was first written, there was no automated way of seeing
which ports PRs ought to be assigned to which maintainers.  For the
initial year or so of its deployment it was the only source of that
information, and its author used the web page to scan through the list
once or twice a day to do any missing assignments.  Since then, Edwin
Groothuis has written a program to scan the PRs and attempt to
auto-assign them using edit-pr(1).  It is unknown which PR classification
algorithm is more effective for identifying existing ports, although
they appear to be comparable.  This process has saved portsmon's
author some considerable time in poring over its output :-)  However,
occasional passes over the reports are still necessary, to catch
PRs that are either not automatically classified, or incorrectly
classified.

Also note that the PR auto-assigner does not model new ports or ports
framework issues, which portsmon does.

Dirk Meyer's Reports

Some work very similar to this work has been done by Dirk Meyer
and is hosted at URL: http://ports.dinoex.net/errorlogs/.
(The present author was also unaware of this work when he began).
Like Fenner's work, this focuses on the build errors
rather than the PRs; however, it's worth noting that unlike
Fenner's reports, which are statically generated, Meyer's
reports are database-driven.  However, the presentation
differs from the current work.  Interested users may find
one or the other presentation more useful.

Future Work

In addition to the data sources that are not yet being data-mined
as mentioned above, various people have requested the ability to
'subscribe' via email to PRs about certain ports (or even proposed
new ports).  Currently, there is no provision for being able to do so.

An even more interesting project would be to automatically identify
ports whose authors have updated them, and then attempt to update each
port's Makefile in a dedicated area of a ports tinderbox and build it.  The
build logs could then be scanned by a local instance of portsmon and
possibly save an interested party some of the "detail-work" involved
in updating ports.

Summary

The goal of this work is to reduce the time and frustration involved
in maintaining source-based applications on FreeBSD.  It is hoped
that with these reports, problems such as "maintainer hasn't noticed
that new changes to system include files broke all his ports" or "maintainer
hasn't made sure that his ports run on Sparc-64" or "maintainer has
gone missing in action" can be spotted and the problems corrected with
much less time and frustration on everyone's part.