A summary of issues having to do with the build cluster.
pointyhat: bugs; panics; performance (static, dynamic); underscheduling.
nodes: bugs; hangs; performance.
problems on pointyhat itself
bugs
- Error handling is weak. Once something goes wrong, you usually have to restart all the builds.
panics
- There haven't been any for a while.
performance
static performance
Static performance appears to be limited by disk and CPU; I can't tell which one is the worse bottleneck. It's also hard to tell because we are often overwhelmed by the following:
dynamic performance
Dynamic performance problems appear when the following tasks are running:
- zbackup. This is the main culprit.
- zexpire.
- errorlog compression. This code is naive and can probably be fixed.
- dinoex cronjob.
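The notes do not say what makes the errorlog compression naive. As a minimal sketch only, assuming part of the cost is handling whole logs at once, the following compresses a log in fixed-size chunks and renames the result into place. The function name compress_errorlog and its parameters are hypothetical and are not taken from the existing scripts.

```python
import gzip
import os
import shutil


def compress_errorlog(path, chunk_size=1 << 20, level=6):
    """Gzip `path` to `path + '.gz'` in chunks, then remove the original.

    Hypothetical helper: the real errorlog layout and naming are assumptions.
    """
    final = path + ".gz"
    tmp = final + ".tmp"
    # Stream the file so memory use stays bounded by chunk_size.
    with open(path, "rb") as src, gzip.open(tmp, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    # Rename only after the compressed copy is complete.
    os.replace(tmp, final)
    os.remove(path)
    return final
```

Running such a job under nice(1) may also reduce its impact on builds, though whether that helps here is untested.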
underscheduling
Underscheduling is a symptom, not a problem. Any of the above performance problems will cause it.
Underscheduling also occurs when a node fails (either at setup, or because it stopped responding) and the code doesn't detect this. The code will then repeatedly try to schedule jobs on that node. Setting the node "offline" in the database is the only cure, but due to a further bug, any jobs already placed on the "can possibly run on this node" queue have to drain before the problem fixes itself.
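One possible shape for a fix, sketched under assumptions: count consecutive failures per node, take the node offline automatically once a threshold is reached, and purge its queued entries right away instead of letting them drain. The Scheduler class, its method names, and the max_failures threshold are all hypothetical; the real scheduler keeps its state in the database.

```python
from collections import deque


class Scheduler:
    """Toy in-memory model of the job queue (hypothetical, not the real code)."""

    def __init__(self, nodes, max_failures=3):
        self.online = set(nodes)
        self.failures = {n: 0 for n in nodes}
        self.max_failures = max_failures
        self.queue = deque()  # (job, node) pairs: "can possibly run on this node"

    def enqueue(self, job, node):
        # Never queue work for a node that is already offline.
        if node in self.online:
            self.queue.append((job, node))

    def report(self, node, ok):
        """Record the outcome of a setup or dispatch attempt on `node`."""
        if ok:
            self.failures[node] = 0
            return
        self.failures[node] += 1
        if self.failures[node] >= self.max_failures:
            self.take_offline(node)

    def take_offline(self, node):
        self.online.discard(node)
        # Drop queued entries for the dead node immediately.
        self.queue = deque((j, n) for j, n in self.queue if n != node)
```

A success resets the counter, so a node is only taken offline after max_failures failures in a row.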
problems on the nodes
bugs
- XXX
hangs
- Sometimes ssh will stop responding (possibly when swap is exhausted?). In -current, swapd can't be killed due to resource exhaustion.
- Sometimes a node will simply stop responding to the console. The hobsons really like to do this. This problem needs to be investigated.
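A hung ssh daemon could in principle be detected from pointyhat with a cheap probe. The sketch below is an illustration rather than existing code: it opens a TCP connection and waits, with a timeout, for the server's SSH identification line. The function name ssh_responds and the default timeout are assumptions. It also assumes the server sends its "SSH-" line first, as OpenSSH does, although the protocol allows other lines to come before it.

```python
import socket


def ssh_responds(host, port=22, timeout=10.0):
    """Return True if the server sends an SSH identification line in time.

    Hypothetical probe: a node whose sshd accepts the connection but never
    answers is reported as not responding.
    """
    try:
        # The timeout applies to both the connect and the following recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(256)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return banner.startswith(b"SSH-")
```

The result of such a probe could be used as the health signal for deciding when to mark a node offline.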
performance
- XXX
Last modified: Wed Dec 23 08:03:44 UTC 2009