Expanding a mirrored zpool in-place

Thursday, 5 Dec 2024 Tags: expansion, freebsd, zfs

My main workstation, a consumer PC desktop, needs more storage. I’m moving from 2TiB NVMe to 4TiB NVMe, because at the end of the tax year, if you don’t spend the money, you get taxed on it. I’m not going to let that tax go to waste.

As this is a consumer PC, niceties like hot-swap PCI devices are not an option, so a screwdriver and some disassembly are required. If this were a proper server, this whole escapade could have been done without downtime.

I cheated a little and added an iSCSI volume on a nearby server as an additional layer of safety during the migration. The setup of that will be covered in a separate post, but it wasn’t very difficult.
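
For the curious, the rough shape of it looks like the sketch below. These are not my exact commands: the portal address, target name, and the resulting da0 device name are placeholders, and it assumes the server already exports a suitably sized iSCSI volume.

# iscsictl -A -p 192.0.2.10 -t iqn.2024-12.lan.example:zroot-spare
# gpart create -s gpt da0
# gpart add -t freebsd-zfs -l zfs2 da0
# zpool attach zroot gpt/zfs0 gpt/zfs2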

The plan is to swap one drive in the mirror at a time, with a reboot in between. Once both drives have been replaced, I will grow the underlying GPT partitions, and finally grow the vdevs to allow access to the new space.

Don’t forget what I forgot, which is to copy your EFI partition as well, before you start. As I don’t have a spare PC to plug the old NVMe drive into, I needed to boot from a USB stick to recover it later. Now that the upgrade is finished, I can restore the EFI partition from previous backups. I use rEFInd as an additional boot manager, so there is more than just the usual /EFI/BOOT/BOOTX64.EFI file.
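
Something along these lines, run before the first shutdown, would have saved me the USB-stick detour. A minimal sketch: the backup path is just an example, and the only real requirement is that it lives somewhere that survives pulling the drive:

# dd if=/dev/gpt/efiboot0 of=/var/backups/efiboot0.img bs=1m

and later, once the new drive has been partitioned and labelled:

# dd if=/var/backups/efiboot0.img of=/dev/gpt/efiboot1 bs=1m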

I shut down, swapped out one of the existing mirrored NVMes, and rebooted. You can use gpart backup | gpart restore to ensure the partition layout is identical on both drives; just make sure the label names don’t conflict.

# dmesg |grep nda
FreeBSD is a registered trademark of The FreeBSD Foundation.
nda0 at nvme0 bus 0 scbus8 target 0 lun 1
nda0: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12665P>
nda0: Serial Number S7DNNJ0WC12665P
nda0: nvme version 2.0
nda0: 1907729MB (3907029168 512 byte sectors)
nda1 at nvme1 bus 0 scbus9 target 0 lun 1
nda1: <Samsung SSD 990 PRO 4TB 4B2QJXD7 S7DPNF0XA36669E>
nda1: Serial Number S7DPNF0XA36669E
nda1: nvme version 2.0
nda1: 3815447MB (7814037168 512 byte sectors)

# gpart show -l nda0
=>        40  3907029088  nda0  GPT  (1.8T)
          40      532480     1  efiboot0  (260M)
      532520        2008        - free -  (1.0M)
      534528    41943040     2  swap0  (20G)
    42477568  3864551424     3  zfs0  (1.8T)
  3907028992         136        - free -  (68K)

# gpart show nda1
gpart: No such geom: nda1.

# gpart backup nda0 | sed -E -e 's/0 *$/1/' | gpart restore nda1

# gpart show -l nda1
=>        34  7814037101  nda1  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40      532480     1  (null)  (260M)
      532520        2008        - free -  (1.0M)
      534528    41943040     2  (null)  (20G)
    42477568  3864551424     3  (null)  (1.8T)
  3907028992  3907008143        - free -  (1.8T)

# gpart modify -i 1 -l efiboot1 nda1
nda1p1 modified
# gpart modify -i 2 -l swap1 nda1
nda1p2 modified
# gpart modify -i 3 -l zfs1 nda1
nda1p3 modified
# gpart show -l nda1
=>        34  7814037101  nda1  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40      532480     1  efiboot1  (260M)
      532520        2008        - free -  (1.0M)
      534528    41943040     2  swap1  (20G)
    42477568  3864551424     3  zfs1  (1.8T)
  3907028992  3907008143        - free -  (1.8T)

Next up, let’s start the mirroring process by replacing the old device. ZFS is smart enough to notice that the old GPT label and the new GPT label, while matching in name, are not actually the same device. Everything is cake. Note that gpt/zfs2 is the remote iSCSI volume; in the event of catastrophic hardware issues, this volume is always available to recover the whole zpool from.

# zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         1.80T  1.53T   270G        -         -    51%    85%  1.00x  DEGRADED  -
  mirror-0    1.80T  1.53T   270G        -         -    51%  85.3%      -  DEGRADED
    gpt/zfs0  1.80T      -      -        -         -      -      -      -    ONLINE
    gpt/zfs1      -      -      -        -         -      -      -      -   UNAVAIL
    gpt/zfs2  1.82T      -      -        -         -      -      -      -    ONLINE

# zpool offline zroot gpt/zfs1

# zpool replace zroot gpt/zfs1 /dev/gpt/zfs1

# zpool status -v
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec  4 08:23:12 2024
        118G / 1.53T scanned at 5.38G/s, 0B / 1.53T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                STATE     READ WRITE CKSUM
        zroot               DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            gpt/zfs0        ONLINE       0     0     0
            replacing-1     DEGRADED     0     0     0
              gpt/zfs1/old  OFFLINE      0     0     0
              gpt/zfs1      ONLINE       0     0     0
            gpt/zfs2        ONLINE       0     0     0

errors: No known data errors

The resilvering speed wasn’t as good as I would like, so I wondered whether the remote iSCSI volume was somehow impacting performance. It’s attached over a 1Gb network, so this seemed plausible. I offlined it, and the speed did indeed go up a lot.

# zpool offline zroot gpt/zfs2

# zpool status -v
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec  4 08:23:12 2024
        1.31T / 1.53T scanned at 19.2G/s, 0B / 1.53T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                STATE     READ WRITE CKSUM
        zroot               DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            gpt/zfs0        ONLINE       0     0     0
            replacing-1     DEGRADED     0     0     0
              gpt/zfs1/old  OFFLINE      0     0     0
              gpt/zfs1      ONLINE       0     0     0
            gpt/zfs2        OFFLINE      0     0     0

errors: No known data errors

In fact, it’s so fast that I was unable to finish the blog post before the resilvering of the mirror completed.

After a few days of monitoring, I repeated this process with the other half of the ZFS mirror, then removed the iSCSI volume and grew the pool.
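
I didn’t keep a transcript of the second round, but it’s essentially a mirror image of the first. A rough sketch, assuming the freshly installed blank drive again shows up as nda0 and the already-migrated 4TB drive as nda1:

# gpart backup nda1 | gpart restore nda0
# gpart modify -i 1 -l efiboot0 nda0
# gpart modify -i 2 -l swap0 nda0
# gpart modify -i 3 -l zfs0 nda0
# zpool replace zroot gpt/zfs0 /dev/gpt/zfs0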

Growing the Pool

Now that the resilvering is complete, this is what the pool looks like:

# zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         1.80T  1.54T   266G        -         -    51%    85%  1.00x  DEGRADED  -
  mirror-0    1.80T  1.54T   266G        -         -    51%  85.5%      -  DEGRADED
    gpt/zfs0  1.80T      -      -        -         -      -      -      -    ONLINE
    gpt/zfs1  1.80T      -      -        -         -      -      -      -    ONLINE
    gpt/zfs2  1.82T      -      -        -         -      -      -      -   OFFLINE

That gpt/zfs2 is the iSCSI volume, and it’s been offline during resilvering, as the local NVMe devices will sync at 60GiB/s if left alone, but including the iSCSI device drags it all down.

We can remove it now.

# zpool detach zroot gpt/zfs2
# zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         1.80T  1.54T   266G        -         -    51%    85%  1.00x    ONLINE  -
  mirror-0    1.80T  1.54T   266G        -         -    51%  85.5%      -    ONLINE
    gpt/zfs0  1.80T      -      -        -         -      -      -      -    ONLINE
    gpt/zfs1  1.80T      -      -        -         -      -      -      -    ONLINE

Great, the pool is healthy. Let’s resize the vdev partitions first:

# gpart resize -i 3 nda0
nda0p3 resized
# gpart resize -i 3 nda1
nda1p3 resized

# gpart show -l
=>        34  7814037101  nda1  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40      532480     1  efiboot1  (260M)
      532520        2008        - free -  (1.0M)
      534528    41943040     2  swap1  (20G)
    42477568  3864551424     3  zfs1  (1.8T)
  3907028992  3907008143        - free -  (1.8T)

=>        40  3906994096  da0  GPT  (1.8T)
          40  3906994096    1  zfs2  (1.8T)

=>        34  7814037101  nda0  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40      532480     1  efiboot0  (260M)
      532520        2008        - free -  (1.0M)
      534528    41943040     2  swap0  (20G)
    42477568  3864551424     3  zfs0  (1.8T)
  3907028992  3907008143        - free -  (1.8T)

# doas zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         1.80T  1.54T   266G        -         -    51%    85%  1.00x    ONLINE  -
  mirror-0    1.80T  1.54T   266G        -         -    51%  85.5%      -    ONLINE
    gpt/zfs0  1.80T      -      -        -         -      -      -      -    ONLINE
    gpt/zfs1  1.80T      -      -        -         -      -      -      -    ONLINE

Notice that while the partitions now have free space, the zpool still does not, neither at the top-level mirror vdev nor at its child vdevs. Let’s fix that.

# zpool online -e zroot gpt/zfs0
# zpool online -e zroot gpt/zfs1
# zpool online -e zroot mirror-0
# zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         3.61T  1.54T  2.07T        -         -    25%    42%  1.00x    ONLINE  -
  mirror-0    3.61T  1.54T  2.07T        -         -    25%  42.6%      -    ONLINE
    gpt/zfs0  3.62T      -      -        -         -      -      -      -    ONLINE
    gpt/zfs1  3.62T      -      -        -         -      -      -      -    ONLINE
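
For completeness: OpenZFS also has an autoexpand pool property, which should let the pool pick up the extra space on its own once all the devices in a vdev have grown, without the explicit zpool online -e step. I didn’t use it here, but it would look something like:

# zpool set autoexpand=on zroot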

Job done, thanks OpenZFS!