Random Musings

O for a muse of fire, that would ascend the brightest heaven of invention!


Your Infrastructure should be Boring

Thursday, 20 Oct 2016 Tags: boring, freebsd, infrastructure, reliability

Imagine if every time you picked up your phone to make a call, they had completely re-designed and re-invented the buttons on it. Or if you went to turn on the stove, instead of having a button or knob to press, it was now something different - maybe you needed to swipe it like a touch device, or give a special hand gesture. Now imagine that it was different every day. If you work in tech, you know what I’m talking about already. This sort of stuff is infuriating.

There are many tools we use daily to get our jobs done, and if their interface or working principles change repeatedly, we can’t get that job done and instead spend our time researching and fixing the underlying problem. Sometimes that’s OK, if you have free time to burn and you enjoy doing this instead of, say, testing craft beers, or skiing, but for the most part it just gets in your face.

You get frustrated and annoyed, and if it happens every day you get angry. Maybe your customers are getting angry too, and maybe your family is upset because you’re fixing some computer thing so that you can get back to work on the thing you were planning to do, and actually deliver the thing you promised somebody, and when that’s done then you can read the kids a story, go for a bike ride, and spend some time with your partner.

But instead you’re debugging some random crap.

This is to a large degree how I feel about the current state of Linux, specifically server-side Linux. I want my infrastructure to be boring, so that I can spend my time & energy on developing and improving and innovating on that thing I promised. I need to rely on it, and to be able to manage change consistently at a time and place of my choosing, not at the whim and mercy of the distribution.

I really don’t want to spend my free time tracking down how the latest kernel pulls in additional functionality from systemd that promptly breaks stuff that hasn’t changed in a decade, or how needing an openssl update ends up in a cascading infinitely expanding vortex of doom that desperately wants to be the first software-defined black hole and demonstrates this by requiring all the packages on my system to be upgraded to versions that haven’t been tested with my application.

You know, the application that is the entire reason this server exists, and the reason why my users are there at all. The thing I’m paid for.

I shouldn’t have to choose between up-to-date security patches and tracking reliable versions of the tools I use to deliver and manage the business. If I need to re-install a system, it should be possible to reproduce the setup exactly as it was at a given point in time, so that development and testing builds match what’s actually deployed.

And yet over the last year or so, that is exactly the choice I have been facing. Managing the infrastructure has become a job in itself, when it should largely be a necessary subtask of shipping great software and services.

It’s time for a change

Since about 2013, I have been using FreeBSD more & more, and since mid 2016 it has become my main computing platform for work on server, workstation, and laptop alike. Each of the points below could be an entire blog post in itself, so I’ll drop a brief list here with the hope that in future I get around to fleshing it out.

Storage

OpenZFS is one of the 3 killer features of FreeBSD. A work of art in the computing world, ZFS is a fully featured filesystem with checksums, snapshots, rollbacks, and inter-server replication built in, all with superb performance characteristics. If you have data of any worth (family photos or company records) and you’re not storing it on ZFS, then, Sir, you are a Fool. You have already experienced bitrot and premature drive death; it’s just that with ZFS you will at least know about it.
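To give a flavour of the day-to-day usage, here’s a minimal sketch; the pool name “tank”, the dataset “tank/data”, and the host “backuphost” are placeholders of my choosing:

    zpool scrub tank                      # re-verify every block against its checksum
    zfs snapshot tank/data@2016-10-20     # instant, point-in-time snapshot
    zfs rollback tank/data@2016-10-20     # wind the dataset back to that snapshot
    zfs send tank/data@2016-10-20 | ssh backuphost zfs recv backup/data
                                          # replicate the snapshot to another server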

Upgrades

ZFS-based boot environments provide a point-in-time, fully bootable snapshot of your OS. Making one is instant, and switching “back” to the preceding OS version is simple. With separate ZFS datasets holding your application data and files, there is no need to re-play transaction logs to bring a restored copy up to date, because it is possible to roll back the OS without touching the application data.
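In practice that’s a couple of commands; the sketch below assumes the sysutils/beadm tool, and the boot environment name is arbitrary:

    beadm create 11.0-pre-upgrade      # instant, bootable snapshot of the running OS
    beadm list                         # show the available boot environments
    beadm activate 11.0-pre-upgrade    # boot into that environment on the next reboot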

“ABI” compatibility is the contract between the kernel and the user libraries and applications. FreeBSD preserves this across minor versions, so software built for FreeBSD 9.0 will run without modification on 9.1, 9.2, and 9.3. It’s not necessary to rebuild our applications just because the OS version has been bumped.

The kernel and userland are developed together. At first I didn’t really appreciate the significance of this, but after a few horrific experiences with Debian apt-hell updates I get it: as the FreeBSD project gets closer to releasing the next version, the whole kernel + userland are being tested together, by people like me and you. Our applications are completely separate from this, and we can upgrade one or the other without impact. In Linux, updating to the next Debian release will not only grab a new kernel but also pull in a random wedge of dependencies that may or may not have anything to do with resolving the security exposure that is presumably the main reason to update in the first place. The result is that our applications can break.

With these 3 features, we can roll forward confident in our ability to roll the OS back if needed, and also be sure that our applications are not impacted when we do.

Ports, Packages, and Poudriere

I have been a huge fan of the OS X project called Homebrew, a user-contributed repository of install and packaging scripts for thousands of open source tools and projects. The idea is not new; it comes from the Ports trees available on the BSD-derived operating systems. The ports tree is a massive repository of open-source software that has specifically been set up, tweaked, and patched for this particular operating system. The entire setup is based on standard BSD makefiles, which, while not as nice as Homebrew’s Ruby-based DSL, are very powerful. In addition, for developers, adding a missing package or library will take an hour or so in most cases. The ports tree is available as both subversion and git repositories, so carrying your own patches while waiting for them to go through the review and commit pipeline is straightforward.
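To show why adding a port is quick, here’s a hypothetical, stripped-down port Makefile; every name in it (the port, the distfile site, the maintainer) is made up:

    PORTNAME=      hello-service
    DISTVERSION=   1.0.2
    CATEGORIES=    sysutils
    MASTER_SITES=  https://example.org/releases/

    MAINTAINER=    you@example.org
    COMMENT=       Tiny example daemon

    LICENSE=       BSD2CLAUSE

    USES=          gmake

    .include <bsd.port.mk>

The framework behind bsd.port.mk supplies the fetch, build, stage, and package targets, so a simple port mostly just declares metadata.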

FreeBSD 10 introduced a new & very flexible binary package system, pkg(8), which supports custom package repos as easily as using the publicly available ones. This lets you use prior versions of software where needed, or use custom private packages containing your own secret sauce.
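Pointing a machine at your own repo is a small config file; in this sketch the repo name, URL, and priority are placeholders:

    # /usr/local/etc/pkg/repos/myrepo.conf
    myrepo: {
        url: "https://pkg.example.org/${ABI}/latest",
        enabled: yes,
        priority: 10
    }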

Poudriere is a build framework that uses ports and jails to build each requested package, from source, in a clean-room jail, producing a new binary package repository. It supports all the options and flags that the ports tree does, so producing custom packages is a piece of cake.
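A typical workflow is only a handful of commands; the jail name, ports tree name, and package list below are placeholders:

    poudriere jail -c -j 110amd64 -v 11.0-RELEASE       # create a clean build jail
    poudriere ports -c -p local                         # check out a ports tree to build from
    poudriere options -j 110amd64 -p local -f pkglist   # pick options for the listed ports
    poudriere bulk -j 110amd64 -p local -f pkglist      # build them all into a package repo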

All together, this means we have the holy trinity of sysadmin software management:

  • repeatable builds using ports & poudriere
  • consistent deploys using binary packages
  • full source control of custom changes

Yes, it is possible to do all of this in Linux, but it gets complicated and messy very quickly, and you’ll find yourself in a veritable peat bog of complexity trying to put it all together from incomplete documentation & blog posts.

Core functionality or Packages?

Is there a clear dividing line between core functionality and 3rd party tools in your Linux distribution? I think there should be. Is your clustering or failover technology partially dependent on 3rd party components? Are they tested and maintained together? How could you tell? If they are, what happens when the kernel is updated but the userland components are not?

While this may seem like an arbitrary philosophical choice, my experience with the spread of systemd has shown how important it is. If I choose to replace the default syslog or ntpd on FreeBSD, I can be sure of exactly what interfaces I need to duplicate or implement.

However, depending on exactly which Linux kernel and distribution you run, specific functionality may be subsumed by systemd. This is surprisingly complicated - for example, newer versions of systemd include a DHCP server, change permissions on hardware devices, allow creating users, and include a new console driver, all of which may or may not have been considered when your distro release was prepared - or even the reverse: functionality was removed but the distro still relies on it.

These issues crop up in test, in production, and take time and energy to manage in an area of the software stack where most of us spend very little time. Do you track what changed between systemd versions? Do you understand the implications?

If you are trying to ship software that is portable across distros, then you need to manage the intricacies of these sorts of variations across all of them. You may find that systemd’s console daemon has come - and gone again - or that log management is done differently. The functionality you rely on to spawn your daemon may exist in one release but not in another.

If you need to upgrade your kernel to close down a security risk - which is one of the main reasons we upgrade these days, whether to avoid a remote code exploit or a privilege escalation - you cannot avoid these changes, and you end up trying to manage them after the fact, causing chaos in production or endless delays in testing basic functionality. Which option sounds less painful?

Linux has never had a clear separation between the kernel, the core userspace functionality and daemons needed to provide a working operating system, and third-party libraries, and this uncoordinated chaos is the sad, inevitable result. From distro to distro, across varying versions of kernel, supplied tools, and basic functionality like logging, service management, and networking, this generates a staggering amount of unnecessary re-work.

Infrastructure should be Boring

In comparison, I’ve found that on FreeBSD the resulting platform is stable and manageable with minimal effort. I am not dependent on leading-edge kernel features that are not available in my currently deployed fleet, nor is the OS a chimaera of partially tested combinations of 3rd party tools, filesystems, and the Linux kernel.

When the next major release is under development, I can easily move my build pipeline over to the alpha, beta, and release candidate builds to test with, and by the time the final release is announced there have been 2+ months of testing, so upgrading is reasonably safe, knowing that a trusted rollback is just a ZFS boot environment away in the unlikely event something unexpected turns up.
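For the upgrade itself, something like the following sketch works for a binary install; the release name is just an example, and the boot environment provides the rollback mentioned above:

    beadm create pre-11.0                    # bootable rollback point
    freebsd-update -r 11.0-RELEASE upgrade   # fetch and merge the new release
    freebsd-update install                   # install the new kernel
    shutdown -r now
    freebsd-update install                   # after the reboot, install the new userland
    pkg upgrade                              # refresh packages against the new release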

I think of FreeBSD as a platform for building robust & repeatable services - the core functionality is built in, and if necessary, it’s very simple to make changes, and to build and roll them out - a very flexible operating environment for running services on, and not just a kernel with a hodge-podge of extra packages that seemed a good choice at the time.

Maybe you should give it a try.