
A summary of items having to do with the build cluster.

pointyhat issues: bugs; hangs; panics; performance problems (static, dynamic); underscheduling; desired improvements.
node issues: bugs; hangs; performance problems; desired improvements.

issues with pointyhat itself

bugs

  • In general, error handling is weak.

  • qmanager can lose its connection. Once this happens, there is nothing to do but restart qmanager and all the builds. linimon has added some debug code to try to catch it in the act.

  • qmanager can fail with "list.remove(x): x not in list". This may be more prevalent while killing builds.

  • qmanager can fail with "Deleted rowcount 0 does not match number of objects deleted 1". This may be more prevalent while killing builds.

  • qmanager can fail with "The transaction is inactive due to a rollback in a subtransaction. Issue rollback() to cancel the transaction."

  • qmanager can sometimes fail to start up. Unknown cause.

  • Sometimes a build startup hangs in zfs clone. This could possibly be the error described here.

  • If a package build succeeds, but the scp of the package back from the node fails, pointyhat records the package as built anyway, and then repeatedly (and incorrectly) tries to build the packages that depend on it, in an infinite loop. That is really two separate bugs.

  • The 'cvsdone' datestamp is not being handled correctly in the case of zfs-cloned portstrees.

  • The cvsdone file needs to be owned by ports-:portmgr 664.

  • Sometimes pollmachine and qmanager do not start up correctly on pointyhat reboot.

  • On rare occasion qmanager will dispatch the same job twice.

  • A node whose kernel is missing nullfs will throw errors that should be detected.

  • Figure out the cause of the "bad exports list line" output on reboot.

  • Occasional "tar: Couldn't list extended attributes: Permission denied" on tarring up a ports tree.

hangs

  • On occasion we see NFS hangs. TBD. These have been less prevalent lately.

  • About once every 2 months pointyhat needs to be restarted. TBD.

panics

  • There have not been any for a while.

performance problems

static performance

It is not clear which of these is the worse problem:

  • disk. Can stay at 100% for minutes at a time.

  • CPU. Can stay at 100% for minutes at a time.

It's also hard to tell because we are often overwhelmed by the following:

dynamic performance

Dynamic performance problems appear when the following tasks are running:

  • zbackup. This is the main culprit. The system becomes nearly unresponsive for tens of minutes at a time. It is possible that kmacy's latest fixes and/or tuning ZFS may have helped us here.

  • zexpire. Impact TBD.

  • errorlog compression. The system becomes sluggish for tens of minutes at a time. This code is naive and can probably be fixed. One possibility would be to compress the logfiles as soon as they are created (if the grep step can be taught to read compressed files).

  • dinoex cronjob. Impact TBD.
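A minimal sketch of the compress-on-creation idea mentioned above, assuming gzip(1)/zgrep(1); the real batch job's compressor and log locations may differ, and the directory layout here is illustrative only:

```shell
#!/bin/sh
# Sketch: compress each finished errorlog right away, spreading the
# compression load out instead of taking one multi-minute batch hit.

# compress_logs LOGDIR: gzip every *.log found in LOGDIR
compress_logs() {
    for log in "$1"/*.log; do
        [ -f "$log" ] || continue
        gzip -9 "$log"
    done
}

# The classification step would then grep the compressed files directly:
# zgrep 'Segmentation fault' "$LOGDIR"/*.log.gz
```

The gain is that the nightly batch job no longer has to compress a whole build's worth of logs at once; the cost is that anything else reading the logs must learn about the .gz suffix.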

underscheduling

Underscheduling is a symptom, not a problem. Any of the above performance problems will cause it.

However, underscheduling also occurs when a node fails (either at setup, or because it stopped responding) and the code does not detect this. The code will repeatedly attempt to schedule jobs on that node. Setting it "offline" in the database is the only cure, but due to a further bug, any jobs already placed on the "can possibly run on this node" queue have to drain before the problem fixes itself.
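Until the automatic detection exists, a hedged sketch of a probe that reports unresponsive nodes (so they can be marked offline in the database by hand) might look like this; the node-list file format and the ssh-based liveness check are assumptions, not the real pollmachine logic:

```shell
#!/bin/sh
# Sketch: report nodes that no longer answer, as candidates for being
# set "offline" in the qmanager database.

# probe_nodes FILE: FILE holds one node hostname per line
probe_nodes() {
    while read -r node; do
        [ -z "$node" ] && continue
        # a node that cannot run a trivial command is a candidate
        if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null
        then
            echo "offline candidate: $node"
        fi
    done < "$1"
}
```

Run periodically from cron, this would at least surface dead nodes quickly, even though the offlining step itself would still be manual.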

desired improvements

  • Allow entries in mlist to be commented out.

  • Refactor the code that repacks src/ and ports/ to make it easier to create patches and push them to all nodes.

  • Create RSS feeds for individual package builds, and get rid of the hardcoding in arch/portbuild.conf.

  • Evaluate the performance of making qmanager select nodes dynamically, rather than at dependency-tree-walk time. This would alleviate two problems: first, failed or offlined nodes are not detected, so underscheduling occurs; second, deleting a node that has jobs queued for it instantly crashes qmanager.

  • Integrate pollmachine and qmanager. Right now machines need to be offlined manually when they fail.

  • pgollucci has coded up a possible replacement for the errorlog classification script. This needs to be tested. linimon has neglected this for several years :-(

  • IWBNI we could create charts of the zfs hierarchy on the fly. linimon has code to do this, but it is on his local machines and involves manual steps. This might involve having a machine where we could run graphviz.

  • IWBNI we could schedule certain large builds only on certain machines. This would avoid "insufficient free space to install gnome-user-docs" and similar bothers.

  • Make it easier to do src and ports regression runs by having scripts to do "grab latest CVS tree, patch with patches from directory/, and start a build".
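For the last item above, the patch-application step could be sketched as below. The checkout and build commands are left as comments, since the real repository path and dopackages invocation live in the cluster's own configuration; only the "apply patches from directory/" piece is shown concretely:

```shell
#!/bin/sh
# Hedged sketch of a regression-run driver.  Only apply_patches is real
# here; the surrounding steps are placeholders.

# apply_patches TREEDIR PATCHDIR: apply every *.patch in PATCHDIR to
# TREEDIR, in shell glob (sorted) order, stopping on the first failure.
apply_patches() {
    treedir=$1
    patchdir=$2
    for p in "$patchdir"/*.patch; do
        [ -f "$p" ] || continue
        echo "==> applying ${p##*/}"
        patch -d "$treedir" -p0 < "$p" || return 1
    done
}

# A full run would look roughly like (placeholders, not real commands):
#   cvs -d "$CVSROOT" checkout -d tree ports
#   apply_patches tree patches/
#   start the build against tree
```

Having this as one script would make "grab, patch, build" a single command instead of a manual sequence.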

issues with the nodes

bugs

  • The most common node failure is "nfs_getpages: error -703333868
    vm_fault: pager read error, pid 30481 (squid)".

  • On occasion, nodes will report that a dependent package is truncated. This happens in tar(1) during pkg_add(1). It may or may not happen when more than one pkg_add(1) is running. linimon has an open item with tkientzle about this. (Note: the larger njobs, the more likely this is to happen: harlow used to fall over constantly before linimon reduced it.)

    The current approach has been to dramatically increase the amount of memory allocated to the swap-backed md(1) mounts used for /var and /tmp, vs. the default supplied by PXE boot, on hobson9. (pkg_add(1) uses /var/tmp; linimon doesn't know what tar(1) uses.) However, this went boom, with "/buildscript: not found" (q.v.), 20091224.

    linimon has code that he copies to some of the sparc64 nodes to try to get more information (bobbi, harlow, and regensburg).

  • symptom: "/buildscript: not found". This is especially prevalent on ia64 machines right after reboot, but can be seen on powerpc as well. The problem is that there is a corrupt chroot:

    dynode# du -s chroot/[0-9]*
    38508   chroot/52939
    222276  chroot/59702
    222272  chroot/60440
    222276  chroot/80913

    As of yet, linimon does not know how this happens. See portbuild.linimon for the debugging code.

  • The amd64 nodes have now started complaining about the setting of TERM.
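Since the healthy chroots in the du(1) output above are all roughly the same size while the corrupt one is far smaller, one possible stopgap is to flag undersized chroots before they are used. This is a sketch only; the threshold is a guess based on those numbers, not a value from the build code:

```shell
#!/bin/sh
# Sketch: flag chroot trees that are suspiciously small, which matches
# the corrupt-chroot symptom shown above.

# find_bad_chroots BASEDIR [MIN_KB]: print any chroot/<pid> tree under
# BASEDIR whose du(1) size is below MIN_KB (default is a guess).
find_bad_chroots() {
    min=${2:-100000}
    for d in "$1"/chroot/[0-9]*; do
        [ -d "$d" ] || continue
        kb=$(du -sk "$d" | awk '{print $1}')
        if [ "$kb" -lt "$min" ]; then
            echo "suspect: $d ($kb KB)"
        fi
    done
}
```

Such a check would not explain how the corruption happens, but it could keep a bad chroot from producing "/buildscript: not found" failures.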

hangs

  • "swap_pager: indefinite wait buffer: bufobj: 0, blkno: xxx, size: 4096" will always lead to a hang. It is often seen on the blades.

  • Sometimes a node will simply stop responding to the console. The hobsons really like to do this. As of 20100410, kib claims that this should be fixed in the most recent -HEAD. The failure mode will show "rpcrecon" in ps under ddb.

  • Sometimes ssh will stop responding (possibly when swap is exhausted?). In -CURRENT, sshd can no longer be killed by resource exhaustion, so perhaps this is fixed.

performance problems

  • Scheduling more than (GB RAM / 2) jobs generally pushes things out to swap. linimon does not yet know how badly swapping affects performance, since the nodes generally fall over in this situation before long.
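The rule of thumb above can be sketched as a small helper; hw.physmem is the FreeBSD sysctl name, and the fallback value below is only so the sketch runs elsewhere:

```shell
#!/bin/sh
# Sketch of the "at most RAM/2 jobs" ceiling described above.

# ram_gb: physical memory in GB, rounded down; falls back to an
# assumed 8 GB when hw.physmem is unavailable.
ram_gb() {
    mem=$(sysctl -n hw.physmem 2>/dev/null || echo '')
    [ -n "$mem" ] || mem=$((8 * 1024 * 1024 * 1024))
    echo $((mem / 1024 / 1024 / 1024))
}

# max_jobs GB: the job count beyond which the node starts swapping
max_jobs() {
    jobs=$(($1 / 2))
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}
```

For example, `max_jobs "$(ram_gb)"` on a 4 GB node would suggest scheduling at most 2 jobs.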

desired improvements

  • Over time, switch all logins/ownerships to ports-${arch}.

  • Block TCP port access from anything other than *.freebsd.org.

  • Switch nodes to zfs.

  • Figure out what gohan10 and gohan18 are being used for.

  • Where possible, upgrade nodes to 4G RAM.

Last modified: Thu Apr 22 03:22:21 UTC 2010