Expanding a mirrored zpool in-place
Thursday, 5 Dec 2024
My main workstation, a consumer PC desktop, needs more storage. I’m moving from 2TiB NVMe to 4TiB NVMe, because at the end of the tax year, if you don’t spend the money, you get taxed on it. I’m not going to let that tax go to waste.
As this is a consumer PC, niceties like hot-swappable PCIe devices are not an option, and a screwdriver and some disassembly are required. If this were a proper server, this whole escapade could have been done without downtime.
I cheated a little and added an iSCSI volume on a nearby server as an additional layer of safety during the migration. The setup of that will be covered in a separate post, but it wasn’t very difficult.
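Once the iSCSI target is logged in on this machine it appears as an ordinary disk (da0 later in this post), and bolting it onto the mirror as a third side takes only a couple of commands. A minimal sketch, assuming a single whole-disk partition labelled zfs2:
# gpart create -s gpt da0
# gpart add -t freebsd-zfs -l zfs2 da0
# zpool attach zroot gpt/zfs0 gpt/zfs2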
The plan is to swap one drive in the mirror at a time, with a reboot in between, and once both drives have been replaced, I will then grow the underlying GPT partitions, and then finally grow the vdevs to allow access to the new space.
Don’t forget what I forgot, which is to copy your EFI partition as well, before you start. As I don’t have a spare PC to plug the old NVMe drive into, I needed to boot from a USB stick to recover it later. Now that the upgrade is finished, I can restore the EFI partition from previous backups. I use refind as an additional boot manager, so there is more to the ESP than just the usual /EFI/BOOT/BOOTX64.EFI file.
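Had I remembered, a raw copy of the ESP into a file before shutting down, and back onto the new drive after partitioning it, would have saved the USB stick dance. A minimal sketch, assuming the GPT labels used below and a hypothetical scratch file:
# dd if=/dev/gpt/efiboot0 of=/var/backups/efiboot0.img bs=1m
# dd if=/var/backups/efiboot0.img of=/dev/gpt/efiboot1 bs=1m
The scratch path is arbitrary, and the second command only makes sense once the new drive has been partitioned and labelled.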
I shut down, swapped out one of the existing mirrored NVMes, and rebooted.
You can use gpart backup | gpart restore to ensure the partition layout is identical on both disks; just make sure the label names don’t conflict.
# dmesg |grep nda
FreeBSD is a registered trademark of The FreeBSD Foundation.
nda0 at nvme0 bus 0 scbus8 target 0 lun 1
nda0: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12665P>
nda0: Serial Number S7DNNJ0WC12665P
nda0: nvme version 2.0
nda0: 1907729MB (3907029168 512 byte sectors)
nda1 at nvme1 bus 0 scbus9 target 0 lun 1
nda1: <Samsung SSD 990 PRO 4TB 4B2QJXD7 S7DPNF0XA36669E>
nda1: Serial Number S7DPNF0XA36669E
nda1: nvme version 2.0
nda1: 3815447MB (7814037168 512 byte sectors)
# gpart show -l nda0
=> 40 3907029088 nda0 GPT (1.8T)
40 532480 1 efiboot0 (260M)
532520 2008 - free - (1.0M)
534528 41943040 2 swap0 (20G)
42477568 3864551424 3 zfs0 (1.8T)
3907028992 136 - free - (68K)
# gpart show nda1
gpart: No such geom: nda1.
# gpart backup nda0 | sed -E -e 's/0 *$/1/' | gpart restore nda1
# gpart show -l nda1
=> 34 7814037101 nda1 GPT (3.6T)
34 6 - free - (3.0K)
40 532480 1 (null) (260M)
532520 2008 - free - (1.0M)
534528 41943040 2 (null) (20G)
42477568 3864551424 3 (null) (1.8T)
3907028992 3907008143 - free - (1.8T)
# gpart modify -i 1 -l efiboot1 nda1
nda1p1 modified
# gpart modify -i 2 -l swap1 nda1
nda1p2 modified
# gpart modify -i 3 -l zfs1 nda1
nda1p3 modified
# gpart show -l nda1
=> 34 7814037101 nda1 GPT (3.6T)
34 6 - free - (3.0K)
40 532480 1 efiboot1 (260M)
532520 2008 - free - (1.0M)
534528 41943040 2 swap1 (20G)
42477568 3864551424 3 zfs1 (1.8T)
3907028992 3907008143 - free - (1.8T)
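As an aside, the manual gpart modify steps are only needed because gpart restore discards labels by default; it takes a -l flag to keep them, so something like this should produce correctly named partitions in one go (untested here):
# gpart backup nda0 | sed -E -e 's/0 *$/1/' | gpart restore -l nda1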
Next up, let’s start the mirroring process by replacing the old devices. ZFS is smart enough to notice that the old GPT label and the new GPT label, while matching in name, are not actually the same device. Everything is cake. Note that gpt/zfs2 is the remote iSCSI volume; in the event of catastrophic hardware issues, the whole zpool can always be recovered from it.
# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 1.80T 1.53T 270G - - 51% 85% 1.00x DEGRADED -
mirror-0 1.80T 1.53T 270G - - 51% 85.3% - DEGRADED
gpt/zfs0 1.80T - - - - - - - ONLINE
gpt/zfs1 - - - - - - - - UNAVAIL
gpt/zfs2 1.82T - - - - - - - ONLINE
# zpool offline zroot gpt/zfs1
# zpool replace zroot gpt/zfs1 /dev/gpt/zfs1
# zpool status -v
pool: zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Dec 4 08:23:12 2024
118G / 1.53T scanned at 5.38G/s, 0B / 1.53T issued
0B resilvered, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
gpt/zfs0 ONLINE 0 0 0
replacing-1 DEGRADED 0 0 0
gpt/zfs1/old OFFLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
gpt/zfs2 ONLINE 0 0 0
errors: No known data errors
The resilvering speed wasn’t as good as I would like, so I wondered if the remote iSCSI volume was dragging performance down. It’s attached over a 1Gb network, so that seemed plausible. I offlined it, and the speed did indeed go up a lot.
# zpool offline zroot gpt/zfs2
# zpool status -v
pool: zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Dec 4 08:23:12 2024
1.31T / 1.53T scanned at 19.2G/s, 0B / 1.53T issued
0B resilvered, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
gpt/zfs0 ONLINE 0 0 0
replacing-1 DEGRADED 0 0 0
gpt/zfs1/old OFFLINE 0 0 0
gpt/zfs1 ONLINE 0 0 0
gpt/zfs2 OFFLINE 0 0 0
errors: No known data errors
In fact, it’s so fast that I was unable to finish this blog post before the resilvering of the mirror completed.
After a few days of monitoring, I repeated this process with the other half of the ZFS mirror, then removed the iSCSI volume and grew the pool.
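The second round is just a mirror image of the first; a rough sketch, assuming the replacement drive again shows up as nda0 and using the -l trick from the aside above:
# gpart backup nda1 | sed -E -e 's/1 *$/0/' | gpart restore -l nda0
# zpool offline zroot gpt/zfs0
# zpool replace zroot gpt/zfs0 /dev/gpt/zfs0
That is followed by another resilver, and by copying the boot bits back onto gpt/efiboot0.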
Growing the Pool
Now that the resilvering is complete, this is what the pool looks like:
# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 1.80T 1.54T 266G - - 51% 85% 1.00x DEGRADED -
mirror-0 1.80T 1.54T 266G - - 51% 85.5% - DEGRADED
gpt/zfs0 1.80T - - - - - - - ONLINE
gpt/zfs1 1.80T - - - - - - - ONLINE
gpt/zfs2 1.82T - - - - - - - OFFLINE
That gpt/zfs2 is the iSCSI volume, and it’s been offline during resilvering: the local NVMe devices will sync at 60GiB/s if left alone, but including the iSCSI device drags it all down. We can remove it now.
# zpool detach zroot gpt/zfs2
# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 1.80T 1.54T 266G - - 51% 85% 1.00x ONLINE -
mirror-0 1.80T 1.54T 266G - - 51% 85.5% - ONLINE
gpt/zfs0 1.80T - - - - - - - ONLINE
gpt/zfs1 1.80T - - - - - - - ONLINE
Great, the pool is healthy. Let’s resize the underlying GPT partitions first:
# gpart resize -i 3 nda0
nda0p3 resized
# gpart resize -i 3 nda1
nda1p3 resized
# gpart show -l
=> 34 7814037101 nda1 GPT (3.6T)
34 6 - free - (3.0K)
40 532480 1 efiboot1 (260M)
532520 2008 - free - (1.0M)
534528 41943040 2 swap1 (20G)
42477568 3864551424 3 zfs1 (1.8T)
3907028992 3907008143 - free - (1.8T)
=> 40 3906994096 da0 GPT (1.8T)
40 3906994096 1 zfs2 (1.8T)
=> 34 7814037101 nda0 GPT (3.6T)
34 6 - free - (3.0K)
40 532480 1 efiboot0 (260M)
532520 2008 - free - (1.0M)
534528 41943040 2 swap0 (20G)
42477568 3864551424 3 zfs0 (1.8T)
3907028992 3907008143 - free - (1.8T)
# doas zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 1.80T 1.54T 266G - - 51% 85% 1.00x ONLINE -
mirror-0 1.80T 1.54T 266G - - 51% 85.5% - ONLINE
gpt/zfs0 1.80T - - - - - - - ONLINE
gpt/zfs1 1.80T - - - - - - - ONLINE
Notice that even though the partitions have been resized, the zpool still doesn’t show any of the new space, neither at the top-level mirror vdev nor at its leaf vdevs. Let’s fix that.
# zpool online -e zroot gpt/zfs0
# zpool online -e zroot gpt/zfs1
# zpool online -e zroot mirror-0
# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 3.61T 1.54T 2.07T - - 25% 42% 1.00x ONLINE -
mirror-0 3.61T 1.54T 2.07T - - 25% 42.6% - ONLINE
gpt/zfs0 3.62T - - - - - - - ONLINE
gpt/zfs1 3.62T - - - - - - - ONLINE
Job done, thanks OpenZFS!
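One last note: OpenZFS also has an autoexpand pool property for exactly this situation. Had it been set beforehand, the pool should have picked up the new space by itself after the partition resize, without the explicit online -e step:
# zpool set autoexpand=on zroot
I went with the explicit zpool online -e route here.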