Proposed format for the new pkg-data files


What I would like to do is simple enough. The format is meant to be simple, so that it will be easy to write a program to transform the present ports tree into a new ports tree that uses these pkg-data files. Also, I am no expert on XML, so I am suggesting a format which is visibly different from standard XML. While it does look a lot like XML, I already know it is not perfect XML. So please don't send me detailed email telling me how I misunderstand some key aspect of XML. My only goal is to have a format which is easy to parse and is flexible enough for future changes. I think the following gives us that.

I am not tied to this exact format. I would be willing to go with any other format, as long as ports-developers can readily agree on some more-appropriate format. I suspect that two months is about the longest amount of time that Darren will be available, so if we spend six months designing the perfect format, then this project is guaranteed to never deliver any usable product. That is my only reason for suggesting the following format. I believe this is a "good-enough" format, as I think it is flexible and allows for future expansion.

Also, in some sense the exact format of these files will (hopefully) not matter very much. At least initially, I plan that all operations which work on these pkg-data files will do that work via the PdHandlingProgram. My hope is that we can implement most of the changes needed to use these pkg-data files before we have to pick the final official format that they should be in. If we do decide on some other format before committing to this one, we should only need to change the PdHandlingProgram.

This is what I propose as the basic overall structure of each pkg-data file:
 


==<freebsd-ports>
==<data-format>1</data-format>
==<copyright>
   Copyright (c) 2004 - The FreeBSD-Ports Project
   All rights reserved.  Etc.
==</copyright>
==<data-version>
   $FreeBSD$
==</data-version>

==<distinfo>
  ...info from the distinfo file...
==</distinfo>
==<pkg-descr>
  ...info from the pkg-descr file...
==</pkg-descr>
==<pkg-plist>
  ...info from the pkg-plist file...
==</pkg-plist>

     ...data from other standard files...

==</freebsd-ports>

For at least the initial implementation, most of these tags would be required to be written on a separate line. Each "tag-line" would start with two equal-signs, followed by the tag name in angle brackets. The value for that tag would start on the following line, not on the same line. The only exception to that rule would be for items where the value would never have a newline in it, such as "<data-format>". For those items, we would require that the tag, its value, and the closing-tag all appear on a single line.

Everything before the initial "<freebsd-ports>" tag would be ignored by the program(s) which process this file. Everything after the "</freebsd-ports>" tag would also be ignored. Blank lines between tags would be allowed (for readability, as above), and ignored. The values would be allowed to be in any order, with the exception that the "<data-format>" tag is required to be the first tag after the initial "<freebsd-ports>" line.
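
As a rough illustration, the rules above could be implemented along the following lines. This is only a sketch in Python and not the actual PdHandlingProgram; the name parse_pkg_data, the regular expressions, and the choice to raise ValueError on malformed input are all assumptions made for this example. Directory-style sections are returned as raw text here and are not split further.

```python
import re

# Patterns for the kinds of tag-line, inferred from the rules above.
ONE_LINE = re.compile(r"^==<([^/>]+)>(.*)</([^>]+)>\s*$")
OPEN_TAG = re.compile(r"^==<([^/>]+)>\s*$")
CLOSE_TAG = re.compile(r"^==</([^>]+)>\s*$")


def parse_pkg_data(text):
    """Return a dict mapping each top-level tag name to its value.

    Hypothetical helper, written only to illustrate the proposed rules.
    """
    sections = {}
    started = False   # True once the ==<freebsd-ports> line has been seen
    current = None    # name of the multi-line section being read, if any
    body = []
    for line in text.splitlines():
        if not started:
            # Everything before the opening tag-line is ignored.
            started = line.rstrip() == "==<freebsd-ports>"
            continue
        if current is not None:
            m = CLOSE_TAG.match(line)
            if m and m.group(1) == current:
                sections[current] = "\n".join(body)
                current, body = None, []
            else:
                body.append(line)   # keep the value verbatim
            continue
        if line.rstrip() == "==</freebsd-ports>":
            # Everything after the closing tag-line is ignored.
            return sections
        if not line.strip():
            continue                # blank lines between tags are allowed
        one = ONE_LINE.match(line)
        opened = OPEN_TAG.match(line)
        if one and one.group(1) == one.group(3):
            name, value = one.group(1), one.group(2)
        elif opened:
            name, value = opened.group(1), None
        else:
            raise ValueError(f"unexpected line between tags: {line!r}")
        if not sections and name != "data-format":
            raise ValueError("<data-format> must be the first tag")
        if value is None:
            current = name          # the value starts on the following line
        else:
            sections[name] = value
    raise ValueError("missing ==<freebsd-ports> or ==</freebsd-ports> line")
```

A section only ends at a closing tag-line with the same name as the one that opened it, so a value can itself contain other tag-lines.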

The "<data-format>" tag is expected to identify the version of the format used for this individual pkg-data file. As time goes on, we will probably want to extend this format in some manner, and this allows us to have multiple formats of pkg-data files in the same ports tree. If someone has a better name than "data-format", I will be happy to use it. I also wonder if it might be better to collapse the data-format value into the initial "freebsd-ports" tag. I.e., the pkg-data information would start with a tag-line of
==<freebsd-ports:v1>
but still end with a "</freebsd-ports>" tag.
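
If the collapsed form were adopted, recognizing it would be a small change for whatever program reads the file. The following Python sketch shows one way to do it. The function name and the decision to accept both the plain and the versioned opening line are assumptions made for this example; nothing here is decided.

```python
import re

# Matches "==<freebsd-ports>" and "==<freebsd-ports:vN>" (assumed syntax).
OPENING = re.compile(r"^==<freebsd-ports(?::v(\d+))?>\s*$")


def format_version_from_opening(line):
    """Return the format version given in the opening tag-line.

    Returns None when the line is a plain "==<freebsd-ports>", in which
    case the version would still come from a separate <data-format> tag.
    Raises ValueError if the line is not an opening tag-line at all.
    """
    m = OPENING.match(line)
    if m is None:
        raise ValueError(f"not an opening tag-line: {line!r}")
    return int(m.group(1)) if m.group(1) else None
```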

The "<data-version>" tag is meant to hold the version information for a specific port. By this I mean the RCS-id type of information that CVS will update when you commit a new version of the pkg-data file for a given port. Perhaps the section should be called 'scm-version'. I wanted to avoid names like "port-version", because they would have so many other obvious meanings. I believe it would be an advantage to have this field, so users will now have a way to identify the exact version "of a port" (i.e., the version of the entire collection of files which are used to generate a particular port).

By collapsing all the data for a port into a single file, we also gain the ability to put a copyright on that collection of data. I don't know if ports-developers feel that is important, but it seemed like a nice idea to me.

It may be that some of the original files cannot reasonably be collapsed into the pkg-data file. The "distinfo" information might not make sense there, for instance. But as a minor tangent, if "distinfo" can be collapsed into the pkg-data file, I would (personally) prefer to use the 'md5 -r' format. I just think it looks neater. However, right now the transformation program does a straight copy of the data from distinfo into the pkg-data file.
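
For anyone who has not seen the two styles side by side: the default md5 output looks like "MD5 (file) = checksum", while 'md5 -r' prints "checksum file". A conversion could be sketched in Python as below. The function name is made up for this example, and the assumption is that distinfo checksum lines use the default md5 output style; any other line (a SIZE line, say) is passed through unchanged.

```python
import re

# Default md5(1) output style: "MD5 (filename) = <32 hex digits>".
MD5_LINE = re.compile(r"^MD5 \((.+)\) = ([0-9a-f]{32})$")


def to_md5_r(line):
    """Rewrite one checksum line into the 'md5 -r' layout (hypothetical helper)."""
    m = MD5_LINE.match(line)
    if m is None:
        return line   # not an MD5 line, leave it alone
    filename, checksum = m.group(1), m.group(2)
    return f"{checksum} {filename}"
```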

The above shows how all the standard simple-files would be stored in the pkg-data file. This same tactic would be used for "distinfo", "pkg-deinstall", "pkg-descr", "pkg-install", "pkg-message", "pkg-plist", and "pkg-req" files. (I got that list of "standard files" from somewhere; let me know if I am missing any.) Note that "Makefile" and "Makefile.inc" files are obviously NOT going to be collapsed into the new pkg-data file.
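
To make the collapsing step concrete, here is a minimal Python sketch of how the simple files might be written out as sections. It is not the real transformation program. The function name build_pkg_data and the idea of passing the file contents in as a dict are assumptions for this example, and the copyright and data-version sections are left out to keep it short.

```python
# The list of "standard files" named above; Makefile is deliberately absent.
STANDARD_FILES = [
    "distinfo", "pkg-deinstall", "pkg-descr", "pkg-install",
    "pkg-message", "pkg-plist", "pkg-req",
]


def build_pkg_data(files, data_format=1):
    """Build pkg-data text from a dict of {file name: file contents}.

    Hypothetical helper. Names that are not in STANDARD_FILES (such as
    "Makefile") are silently skipped.
    """
    out = ["==<freebsd-ports>", f"==<data-format>{data_format}</data-format>"]
    for name in STANDARD_FILES:
        if name not in files:
            continue
        out.append(f"==<{name}>")
        out.extend(files[name].splitlines())   # straight copy of the contents
        out.append(f"==</{name}>")
    out.append("==</freebsd-ports>")
    return "\n".join(out) + "\n"
```

Note that this sketch does nothing about a file whose contents happen to include a line that looks like its own closing tag-line; a real program would have to decide how to handle that.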

The following is how I plan to handle the standard directory-names that are used in ports:
 


  ...
==<filesdir>
==<file:patch-ac>
  ...info from first patch...
==</file>
==<file:patch-af>
  ...info from second patch...
==</file>
==<file:patch-bashline.c>
  ...info from third patch...
==</file>

     ...etc, for other files in the directory...

==</filesdir>
  ...

So, the pkg-data file would have a tagged directory-section called "filesdir", and then individual "file" entries inside of it. The same tactic would be used for the standard "scripts" and "src" directories (at least, those both seem to be standard directories).
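
Reading such a directory-section back out is straightforward. The Python sketch below takes the text found between the "<filesdir>" and "</filesdir>" tag-lines and returns the individual files. The function name and the error handling are assumptions made for this example, not part of the proposal.

```python
import re

# A file entry opens with "==<file:NAME>" and closes with "==</file>".
FILE_OPEN = re.compile(r"^==<file:([^>]+)>\s*$")
FILE_CLOSE = "==</file>"


def parse_dir_section(body):
    """Split the body of a directory-section into {file name: contents}.

    Hypothetical helper, written only to illustrate the layout above.
    """
    files = {}
    name = None       # name of the file entry being read, if any
    content = []
    for line in body.splitlines():
        if name is None:
            m = FILE_OPEN.match(line)
            if m:
                name, content = m.group(1), []
            elif line.strip():
                raise ValueError(f"unexpected line outside a file entry: {line!r}")
        elif line.rstrip() == FILE_CLOSE:
            files[name] = "\n".join(content)
            name = None
        else:
            content.append(line)   # keep the file's lines verbatim
    if name is not None:
        raise ValueError(f"file entry {name!r} is never closed")
    return files
```

Since the name is simply everything up to the closing greater-than sign, a name containing '/' passes through without any special handling.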

After having done some of the initial work on this, I am thinking we might want to pull all "patch-" files into a new directory-section called "patchesdir". All other files found in the "files" directory would remain in the directory section called "filesdir". Right now there are no ports with a directory named "patch" or "patches", so this should be safe. The closest ones are "patches.4" and "patches.5", which show up in the ports for japanese/msdosfs and korean/msdosfs.

Notice that once ports are using this pkg-data format, it would be possible to use any naming-scheme for naming patches (except that names could not include a greater-than sign). In the pkg-data file itself, patch names could include '/'s, for instance.
 


Comments can be sent to:   drosih@rpi.edu
Web page last updated:     Mar 31/2004