FreeBSD-based high-density filers

Baptiste Daroussin
<bapt@gandi.net>
<bapt@FreeBSD.org>

AsiaBSDCon 2016

Gandi.net

Refreshing the filers

  • Nexenta-based since 2007
  • Difficult to provide unattended setup
  • Kernel patches needed for multipath on new disks
  • Python stuck at an old (and buggy) version
  • Very long boot time
    • OS
    • zpool import
    • iSCSI export

Study

Requirements

  • ZFS
  • Ability to serve 1000 NFS exports and 900 iSCSI targets
  • Possible extension to 1500 NFS exports and 1000 iSCSI targets
  • Support NFSv4 with delegation
  • Powerful debugging tools (in particular dtrace)
  • Support accessing JBODs with multipath
  • OpenSource with an active community
  • Ability to easily upstream patches
  • Ability to run containers

Study

Candidates

  • Illumos family:
    • OpenIndiana
    • OmniOS
    • SmartOS
    • Newer Nexenta
  • FreeBSD
  • Linux (with ZoL)

Linux

Rejected:

  • ZoL cannot be upstreamed due to license incompatibilities
  • Lots of regressions due to not being part of the upstream kernel

Illumos

Nexenta

Rejected:

  • Community version limited to 18TB
  • Upstreaming not easy

Illumos

OpenIndiana

Rejected:

  • Small community
  • Fragile build system
  • Old python (2.6)

Illumos

SmartOS

Rejected:

  • Global zone hard to customize
  • No iSCSI/NFS management delegation
  • Not designed for building filers

Illumos

  • Exporting lots of iSCSI targets is still slow: more than 5 minutes
  • Kernel still has to be patched for new disk manufacturers

FreeBSD

The good

  • Strong reputation in the storage area
  • Supports modern ZFS and dtrace
  • ctld(8)
  • Very fast iSCSI export: a few seconds
  • Good NFSv4 support
  • sysctl/kgdb in place of mdb
  • Fast zpool import (tip: disable TRIM support)

FreeBSD

The bad

  • Poor support for diskless netbooting
  • Slow to boot with a large MFSROOT
  • No multiboot support, so no proper iPXE support

Design: diskless

  • Unattended setup via puppet
  • Upgradability: just reboot
  • Easy backtracking: just reboot
  • Free from admin heroes
  • Easy migration from Nexenta
  • Safe migration from Nexenta

Design: booting sequence

Early boot

  1. DHCP request
  2. tftp get pxeboot
  3. tftp get /boot/ configs
  4. tftp get kernel, modules, miniroot
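Each "tftp get" step above begins with a TFTP read request (RRQ) as defined in RFC 1350. The minimal Python sketch below builds that packet, assuming the usual binary ("octet") transfer mode; the file paths are placeholders, not Gandi's actual TFTP layout.

```python
import struct

TFTP_RRQ = 1  # RFC 1350 opcode for a read request


def tftp_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP RRQ: 2-byte opcode, filename, NUL, transfer mode, NUL."""
    return (
        struct.pack("!H", TFTP_RRQ)
        + filename.encode("ascii")
        + b"\0"
        + mode.encode("ascii")
        + b"\0"
    )


if __name__ == "__main__":
    # Hypothetical paths mirroring the early boot steps.
    for path in ("pxeboot", "boot/loader.conf", "boot/kernel/kernel"):
        print(f"{path}: {tftp_rrq(path)!r}")
```

The server answers each RRQ with numbered DATA blocks that the client acknowledges one by one, which is one reason large images are slow to load over plain TFTP.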

Design: booting sequence

Boot miniroot

  1. run custom rc
  2. create a ramdisk
  3. http get filer.txz config.txz puppet-.txz
  4. extract into the ramdisk
  5. reroot on ramdisk
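The miniroot steps can be sketched as the ordered list of commands a custom rc script might run. This Python sketch rests on assumptions the slides do not state: the ramdisk is a swap-backed md(4) device formatted with UFS and mounted on /mnt, archives are fetched with fetch(1) and unpacked with tar(1), and the reroot is triggered by setting vfs.root.mountfrom with kenv(1) before running reboot -r. The server URL, ramdisk size and unit number are hypothetical, and the puppet archive is left out because its exact name is elided on the slide.

```python
from typing import List


def miniroot_plan(server: str, archives: List[str], size: str = "4g",
                  unit: int = 1, mountpoint: str = "/mnt") -> List[List[str]]:
    """Return the miniroot-stage commands as argv lists.

    All values are illustrative assumptions, not Gandi's configuration.
    """
    dev = f"/dev/md{unit}"
    plan = [
        # 2. create a swap-backed ramdisk and put a UFS file system on it
        ["mdconfig", "-a", "-t", "swap", "-s", size, "-u", str(unit)],
        ["newfs", dev],
        ["mount", dev, mountpoint],
    ]
    for name in archives:
        local = f"{mountpoint}/{name}"
        # 3. fetch the archive over HTTP, then 4. extract it into the ramdisk
        plan.append(["fetch", "-o", local, f"{server}/{name}"])
        plan.append(["tar", "-xpf", local, "-C", mountpoint])
        plan.append(["rm", local])
    # 5. point the kernel at the ramdisk and reroot onto it
    plan.append(["kenv", f"vfs.root.mountfrom=ufs:{dev}"])
    plan.append(["reboot", "-r"])
    return plan


if __name__ == "__main__":
    for argv in miniroot_plan("http://boot.example.net", ["filer.txz", "config.txz"]):
        print(" ".join(argv))
```

Executing the list (for instance with subprocess.run) is deliberately left out: the sketch only illustrates the ordering of the steps.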

Design: booting sequence

Final boot

  1. zpool import
  2. puppet run
  3. start Gandi's middleware
  4. ready to serve

Contributions

py-libzfs (FreeNAS)

  • Implement zfs clone support
  • Implement zfs promote support
  • Implement support for properties (including custom)
  • Implement volume support
  • Bug fixing

Contributions

mpsutil(8)/mprutil(8) (Netflix)

  • Finish integration with the FreeBSD build system
  • Implement flashing of firmware/BIOS

Contributions

Playing the guinea pig

  • reroot (by trasz@)
  • smarter mount root wait (by trasz@)

sesutil(8)

Managing SCSI Enclosure Services

  • Blink locate LED (disks only)
  • Blink fault LED (disks only)
  • Show the detailed map of an enclosure
  • Easy to use:
      $ sesutil locate da3 on
      $ sesutil locate all off

sesutil(8)

Vendor tools

  • Lots of noise in the logs
  • Two different tools for SAS2 and SAS3
  • Unfriendly UI

sesutil(8)

sg_ses (sg3_utils)

  • Unfriendly UI
  • Mapping disks is complex

Rework pxeboot with TFTP support

Add support for the root-path DHCP option, so that it behaves like pxeboot with NFS support

Running HEAD

  • Stable most of the time
  • Needed features only available there
  • Easier to upstream patches
  • Find (and fix) bugs as early as possible
  • Gandi's workload is very well identified

Test lab

  • Driven by Zopkio
  • Simulating broken disks using gnop(8)
  • Simulating bad network access using ipfw(8) + dummynet(4)
  • Simulating crashes and reboots under high load from consumers
  • Profile-based test lab
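A test-lab profile could bundle the fault-injection settings listed above. The sketch below is hypothetical and does not use Zopkio's API: it turns a profile into gnop(8) and ipfw(8)/dummynet(4) command lines. The gnop -r/-w options set read/write failure probabilities in percent; the dummynet pipe takes a bandwidth, a delay in milliseconds and a packet loss rate between 0 and 1. Profile names and values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultProfile:
    """Hypothetical test-lab profile; all values are illustrative."""
    disk: str            # disk to wrap with a gnop(8) provider, e.g. "da3"
    read_fail_pct: int   # gnop -r: read failure probability, in percent
    write_fail_pct: int  # gnop -w: write failure probability, in percent
    bandwidth: str       # dummynet bandwidth, e.g. "100Mbit/s"
    delay_ms: int        # dummynet delay, in milliseconds
    loss_rate: float     # dummynet packet loss rate, between 0 and 1


def fault_commands(p: FaultProfile, pipe: int = 1) -> List[List[str]]:
    """Return the commands injecting the faults a profile describes."""
    cmds: List[List[str]] = []
    if p.read_fail_pct or p.write_fail_pct:
        # creates <disk>.nop, which the pool must then use instead of <disk>
        cmds.append(["gnop", "create", "-r", str(p.read_fail_pct),
                     "-w", str(p.write_fail_pct), p.disk])
    if p.delay_ms or p.loss_rate:
        cmds.append(["ipfw", "pipe", str(pipe), "config", "bw", p.bandwidth,
                     "delay", str(p.delay_ms), "plr", str(p.loss_rate)])
        # send all IP traffic through the degraded pipe
        cmds.append(["ipfw", "add", "pipe", str(pipe),
                     "ip", "from", "any", "to", "any"])
    return cmds


PROFILES = {
    "flaky-disk": FaultProfile("da3", 5, 5, "10Gbit/s", 0, 0.0),
    "bad-network": FaultProfile("da3", 0, 0, "100Mbit/s", 50, 0.02),
}

if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(name)
        for argv in fault_commands(profile):
            print("  " + " ".join(argv))
```

Keeping profiles as plain data makes it easy to combine a disk fault with a network fault and replay the same scenario against each new build.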

Future plans

Improve sesutil(8)

  • libxoify(?)
  • Add microcode update support
  • Extend locate to support other devices

Future plans

Improve ZFS

  • Improve zpool import speed
  • Turn boot-time tunables like arc_max into tunables that can safely be changed at runtime
  • Maybe new features to improve reliability

Future plans

Improve iPXE support

  • Implement a FreeBSD-specific loader
  • or
  • Make the FreeBSD kernel multiboot-compliant

Future plans

Improve ctl(4)

  • Make the number of ports and LUNs per port configurable via sysctl
  • Switch ctl(4) to libucl (too late)

Future plans

Storage-related tooling

  • Port some dtrace scripts from Illumos
  • Improve geom_multipath algorithm to better match ZFS requirements

Thanks!

Questions?
