I measured the total time required to fork a new process, have it exit immediately, and have its parent reap it, under three different pid allocation algorithms and with varying numbers of other sleeping processes occupying pids. The goal of the test was to determine which of the pid allocators performed best in a variety of circumstances.
The contending pid allocation algorithms were the existing allocator, Tim's allocator, and the NetBSD allocator. The existing allocator amounts to the following pseudocode:

    retry:
        foreach process p in the system {
            if (nextpid == p.pid) {
                nextpid++;
                goto retry;
            }
        }

Although there is an added heuristic that depends on the order of the allproc list, this operation can be O(mn), where m is the number of processes in the system and n is the number of pids that must be searched.
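To make the cost concrete, here is a minimal user-space sketch of that scan in C. It is an illustration, not the kernel code: the function name find_free_pid(), the array standing in for the allproc list, and the pid bounds are all hypothetical, and the real allocator also handles locking, reserved pid ranges, and the case where every pid is taken.

    #include <sys/types.h>

    #define PID_MAX 99999          /* hypothetical upper bound on pids */

    /*
     * Linear scan in the style of the old allocator: advance nextpid
     * until no existing process holds it.  Every collision restarts
     * the scan, so one allocation can cost O(mn): m processes scanned
     * per pass, with up to n candidate pids tried.
     */
    pid_t
    find_free_pid(const pid_t *pids, int nprocs, pid_t nextpid)
    {
        int i;

    retry:
        if (++nextpid > PID_MAX)
            nextpid = 100;         /* wrap around, skipping low pids */
        for (i = 0; i < nprocs; i++)
            if (pids[i] == nextpid)
                goto retry;        /* pid in use; try the next one */
        return (nextpid);
    }

Collisions, and therefore full rescans, become more frequent as more of the pid space is occupied, which is what packing the system with sleeper processes provokes.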
The test environment was a 450 MHz Pentium III with 384 MB of RAM running FreeBSD-CURRENT-20040206 with the usual debugging options disabled. The test program forks the specified number of "sleeper" processes with random pids, then measures the time required to fork a child that exits immediately and to reap it. (The fork()/wait4() operation is iterated many times to increase the accuracy of the results.)
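For illustration, a stripped-down version of the timing loop might look like the following. This is a sketch under assumptions of mine, not the original test program: the iteration count, the use of clock_gettime(), and the omission of the sleeper setup are all simplifications.

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/wait.h>

    #include <err.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERATIONS 1000        /* later runs used 10,000; see below */

    int
    main(void)
    {
        struct timespec begin, end;
        double secs;
        pid_t pid;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &begin);
        for (i = 0; i < ITERATIONS; i++) {
            pid = fork();
            if (pid == -1)
                err(1, "fork");
            if (pid == 0)
                _exit(0);                   /* child exits immediately */
            if (wait4(pid, NULL, 0, NULL) == -1)
                err(1, "wait4");            /* parent reaps the child */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        secs = (end.tv_sec - begin.tv_sec) +
            (end.tv_nsec - begin.tv_nsec) / 1e9;
        printf("%.2f us per fork/wait4\n", secs * 1e6 / ITERATIONS);
        return (0);
    }

Averaging over many iterations keeps scheduling noise and timer granularity from dominating a single sub-millisecond fork()/wait4() cycle.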
The results of running the test for 1000 forks under each of the three algorithms (repeated three times each) are presented in the figure above. The "better" algorithms performed nearly as badly as the old allocator, so confounding factors were clearly involved. Through profiling, I identified two performance bottlenecks:
To correct for these, I made two changes to the test program:
After these changes, fork time decreased so much that I had to run 10,000 iterations instead of 1,000 to avoid excessive jitter. The results for the new test program are below.
These results demonstrate that both of the new approaches are effective and solve a real performance problem on systems with many processes. Moreover, there is no statistically significant performance impact in the common case of only a few hundred processes. So which is more appropriate, Tim's approach or the NetBSD approach? I'm inclined to opt for Tim's allocator because:
Another interesting observation is that the variance is much higher when there are more sleeping processes. What accounts for these random spikes? The measurements were taken with the ULE scheduler, so the periodic schedcpu() scan over every process in the system is not a factor. One theory is that the additional memory pressure caused the pageout daemon to kick into gear at unpredictable times to look for clean pages to evict. (The test system was not configured with swap.) In any case, it's possible that some other component of the system behaves badly when there are many processes.