Progress on scaling of FreeBSD on 8 CPU systems (Feb 2007)

Note: this document describes progress as of February 2007, and has been partly superseded and exceeded by subsequent work.

With the goals of the SMPng project complete, several of us have spent the past year or more profiling FreeBSD under various multiprocessor workloads and looking for performance bottlenecks to optimize.

We have recently made significant progress on optimizing for MySQL running on an 8-core amd64 system. The following graph shows the number of MySQL transactions/second performed by a multi-threaded client workload against a local MySQL database with varying numbers of client threads, with identically configured FreeBSD and Linux systems on the same machine.

The graph of results may be found here.

The test was run on FreeBSD 7.0, with the latest version of the ULE 2.0 scheduler, the libthr threading library, and an uncommitted patch from Jeff Roberson [1] that addresses poor scalability of file descriptor locking (using a new sleepable mutex primitive); this patch is responsible for almost all of the performance and scaling improvements measured. The test system also includes some other patches that have been shown to reduce contention in MySQL workloads in the past (including a UNIX domain socket locking pushdown patch from Robert Watson), but these were shown to give only small individual contributions, with a cumulative effect on the order of 5-10%.

With this configuration FreeBSD achieves performance consistent with Linux at peak, which occurs with 8 client threads (1 thread per CPU core). The graph shows Linux 2% faster, but this is within the margin of error from variance between runs, so more data is needed to distinguish them. At higher than peak loads FreeBSD significantly outperforms Linux on the same hardware.

Specifically, beyond 8 client threads FreeBSD has only minor performance degradation (an 8% drop from peak throughput at 8 clients to 20 clients), but Linux collapses immediately above 8 threads, and above 14 threads asymptotes to essentially single-threaded levels. At 20 clients FreeBSD outperforms Linux by a factor of 4.

We see this result as part of the payoff from the hard work of many developers over the past 7 years. In particular it is a significant validation of the SMP and locking strategies chosen for the FreeBSD kernel in the post-FreeBSD 4.x world.

Configuration details

The sysbench OLTP benchmark (0.4.8, from ports) was used in the configuration recommended by the people:

  • sysbench --test=oltp --num-threads=${i} --mysql-user=root --max-time=120 --max-requests=0 --oltp-read-only=on run
  • (This benchmark is designed to have more realistic properties than the popular super-smack benchmark, which does some unrealistic and expensive things such as performing lots of 1-byte I/O between client and server.)

    I performed 6 minutes of "warm-up" with 8 threads after booting the machine before taking data.
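
The warm-up and the per-thread-count runs described above could be scripted roughly as follows. This is only a sketch: the original does not say how the runs were driven, so the loop, the single 360-second warm-up run, the 1 to 20 thread range, and the names DRY_RUN, MAX_THREADS and run_sysbench are all assumptions made for illustration. By default it only prints the commands it would run.

```sh
#!/bin/sh
# Sketch of a warm-up plus thread-count sweep around the sysbench command above.
# DRY_RUN, MAX_THREADS and run_sysbench are illustrative names, not from the original.

DRY_RUN=${DRY_RUN:-1}           # 1 = only print the commands; 0 = actually run sysbench
MAX_THREADS=${MAX_THREADS:-20}  # assumed upper bound; the results above go up to 20 clients

# Print (or run) one sysbench OLTP invocation: $1 = client threads, $2 = seconds.
run_sysbench() {
    set -- sysbench --test=oltp --num-threads="$1" --mysql-user=root \
        --max-time="$2" --max-requests=0 --oltp-read-only=on run
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Warm-up: 6 minutes (360 s) with 8 threads; the output is not used as data.
run_sysbench 8 360

# Measurement: one 120 s run for each client thread count.
i=1
while [ "$i" -le "$MAX_THREADS" ]; do
    run_sysbench "$i" 120
    i=$((i + 1))
done
```

Running it with DRY_RUN=0 executes the benchmark for real, which assumes sysbench is installed and the MySQL test database has already been prepared.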

    The recommended MySQL configuration from the above URL was also used on both FreeBSD and Linux. I used MySQL 5.0.33 from ports on the FreeBSD system.

    These tests were run on a 2.0GHz 8-core amd64 system with 16GB of RAM, running FreeBSD and Fedora Core 6 (FC6). They were confirmed on a similar 8-core system with 2.2GHz CPUs and 3GB of RAM [2]. I am not yet able to run Linux on this second machine, but the FreeBSD data is identical when corrected for the CPU speed differences, and the direct comparison of FreeBSD and Linux is performed on the same hardware.
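
The original does not say how the correction for CPU speed was done. One simple possibility, assumed here purely for illustration, is to scale throughput linearly by the ratio of clock speeds (2.0GHz / 2.2GHz). The helper name clock_correct and the input value are hypothetical; the number is a placeholder, not a measured result.

```sh
#!/bin/sh
# Sketch: scale a throughput figure from one clock speed to another, assuming
# throughput is proportional to clock speed (an assumption of this sketch).

# $1 = measured transactions/second, $2 = measured clock (GHz), $3 = target clock (GHz)
clock_correct() {
    awk -v tps="$1" -v from="$2" -v to="$3" \
        'BEGIN { printf "%.1f\n", tps * to / from }'
}

# 1100 is a made-up placeholder value, not data from these tests.
clock_correct 1100 2.2 2.0
```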

    Initially MySQL 5.0.27 was used on Linux; it was upgraded to 5.0.33 after concerns that MySQL may have fixed a scalability bottleneck in that release. This gave Linux a moderate (~10%) performance boost at peak, bringing it to slightly above the FreeBSD peak, but did not affect the poor scaling beyond 8 threads. Similarly, updating from the default 2.6.18 kernel distributed with FC6 (which, according to a Red Hat engineer, accidentally shipped with some debugging left enabled) to the 2.6.19 Fedora kernel did not affect the apparent scalability problem in Linux. Neither did the vanilla kernel, which was the best performing, with slightly less poor scaling but the same asymptotic behaviour (asymptotic performance was reached at 14 threads instead of 12). A comparison of the different Linux kernels tested may be found here.

    It is still possible that something is poorly configured in a default Linux kernel, but so far none of the >1000 people who viewed Jeff's initial blog posting, in which he asks for advice on tuning Linux to try to fix the bottleneck, have been able to suggest anything concrete (apart from the changes tried above, which did not resolve the problem). Concrete suggestions are welcomed.

Future work

    There is evidence that FreeBSD performance can be improved further by scheduler tuning and related architectural work. In particular the global sched_lock spinlock appears to be a significant barrier to performance: setting the kern.sched.pick_pri=0 sysctl instructs ULE to use a scheduling algorithm which heavily favours the current CPU over others, which avoids sched_lock contention since the lock is already held by the current CPU. This gives a significant boost on this workload, indicating the potential scope for improvement with further architectural work. Work is in progress by Jeff Roberson and Attilio Rao to investigate ways of overcoming this limitation by moving to per-CPU scheduler locks.
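
For reference, the tunable named above could be inspected and changed from the shell along these lines. This is a sketch under assumptions: the wrapper function set_pick_pri is hypothetical, changing a sysctl requires root, and kern.sched.pick_pri is only present on kernels whose ULE scheduler exposes it. On any system other than FreeBSD the function only prints what it would do.

```sh
#!/bin/sh
# Sketch: show and set the ULE tunable discussed above. Only acts on FreeBSD.

# $1 = desired value (0 favours the current CPU, per the text above)
set_pick_pri() {
    if [ "$(uname -s)" != "FreeBSD" ]; then
        echo "not FreeBSD; would run: sysctl kern.sched.pick_pri=$1"
        return 0
    fi
    sysctl kern.sched.pick_pri        # show the current value
    sysctl kern.sched.pick_pri="$1"   # change it (requires root)
}

set_pick_pri 0
```

A setting made this way does not survive a reboot, so the original value returns on the next boot unless the change is made persistent.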

    A more linear ramp-up to 8 threads should also be achievable with scheduler tuning: e.g. the sub-linear scaling from 4 to 7 threads, and the big jump from 7 to 8 clients restoring the linear scaling extrapolated from 1-4 threads, suggests that processes are not currently being scheduled efficiently at light workloads.

    Kris Kennaway

    [1] Patch may be found here

    [2] Donated by AMD via David O'Brien, hosted by ISC. Thanks to Doug White for installation and admin work.