Change 663442 by justing@justing_ns1_spectrabsd_integration on 2013/04/03 12:52:39

	Enhance the ZFS vdev layer to maintain both a logical and a physical
	minimum allocation size for devices. Use this information to
	automatically increase ZFS's minimum allocation size for new top-level
	vdevs to a value that more closely matches the optimum device
	allocation size.

	Many modern devices use physical allocation units that are much
	larger than the minimum logical allocation size accessible by
	external commands. Two prevalent examples of this are 512e disk
	drives (512b logical sector, 4K physical sector) and flash devices
	(512b logical sector, 4K or larger allocation block size, and 128k
	or larger erase block size). Operations that modify less than the
	physical sector size result in a costly read-modify-write or garbage
	collection sequence on these devices.

	Simply exporting the true physical sector size of the device to ZFS
	would yield optimal performance, but has two serious drawbacks:

	1) Existing pools created with devices that have different logical
	   and physical sector sizes will suddenly find that the vdev
	   allocation size has increased. This can be easily tolerated for
	   active members of the array, but ZFS would prevent replacement of
	   a vdev with another identical device because it appears that the
	   smaller allocation size required by the pool is not supported by
	   the new device.

	2) Space wastage that cannot be tolerated by the user. The optimal
	   allocation size for the vdev may be quite large. For example, a
	   RAID controller may export a vdev for which read-modify-write
	   cycles can only be avoided by using an allocation size of 64k.
	   The user should be able to decide how to balance the trade-off
	   between performance and wasted space.

	The solution provided here is to report both a logical and physical
	allocation size for vdevs. When creating a new top-level vdev, the
	logical size is grown toward the physical size, but that growth is
	capped by the "max_auto_ashift" tunable. "max_auto_ashift" defaults
	to SPA_MAXAUTOASHIFT which is 13 (8k).

	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
		o Add the SPA_MAXAUTOASHIFT constant.

	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
		o Update the vdev_open() API so that both logical (what was
		  just ashift before) and physical ashift are reported.
		o Add a new field, vdev_physical_ashift, to vdev_t.

	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
		o Add tunable logic for max_auto_ashift.
		o Propagate vdev_physical_ashift between child and parent
		  vdevs as is done for vdev_ashift.
		o In open processing code, rename ashift to logical_ashift
		  to make its semantics clear.
		o Perform ashift growth in vdev_metaslab_set_size(), which
		  is called anytime a new top-level vdev is allocated.

	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
		Export physical_ashift via the vdev_open() API.

Affected files ...

... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h#5 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h#6 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c#10 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c#4 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c#17 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c#6 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c#4 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c#6 edit
... //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c#4 edit

Differences ...

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h#5 (text) ====

@@ -86,6 +86,13 @@
 #define SPA_MINBLOCKSIZE (1ULL << SPA_MINBLOCKSHIFT)
 #define SPA_MAXBLOCKSIZE (1ULL << SPA_MAXBLOCKSHIFT)
 
+/*
+ * Limit any device's preferred ashift to minimize wastage.
+ * If the logical asize of the device is larger than this limit,
+ * ZFS will still honor it.
+ */
+#define SPA_MAXAUTOASHIFT 13
+
 /**
  * We currently support nine block sizes, from 512 bytes to 128K.
  * We could go higher, but the benefits are near-zero and the cost

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h#6 (text) ====

@@ -60,7 +60,7 @@
  * \{
  */
 typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t *max_size,
-    uint64_t *ashift);
+    uint64_t *logical_ashift, uint64_t *physical_ashift);
 typedef void vdev_close_func_t(vdev_t *vd);
 typedef uint64_t vdev_asize_func_t(vdev_t *vd, uint64_t psize);
 typedef int vdev_io_start_func_t(zio_t *zio);
@@ -128,12 +128,22 @@
 	uint64_t vdev_min_asize; /**< min acceptable asize */
 	uint64_t vdev_max_asize; /**< max acceptable asize */
 	/**
-	 * block alignment shift
+	 * Logical block alignment shift
 	 *
-	 * Forces all IO to this VDev to be aligned to multiples of (1 << vdev_ashift)
-	 * bytes.
+	 * Forces all IO to this VDev to be aligned to multiples of
+	 * (1 << vdev_ashift) bytes.
 	 */
 	uint64_t vdev_ashift;
+	/**
+	 * Physical block alignment shift
+	 *
+	 * The device supports logical I/Os with vdev_ashift size/alignment,
+	 * but optimum performance will be achieved by aligning to
+	 * vdev_physical_size.
+	 *
+	 * May be 0 to indicate no preference (i.e. use vdev_ashift).
+	 */
+	uint64_t vdev_physical_ashift;
 	uint64_t vdev_state; /**< see VDEV_STATE_* #defines */
 	uint64_t vdev_prevstate; /**< used when reopening a vdev */
 	vdev_ops_t *vdev_ops; /**< vdev operations */

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c#10 (text) ====

@@ -51,6 +51,33 @@
 SYSCTL_DECL(_vfs_zfs);
 SYSCTL_NODE(_vfs_zfs, OID_AUTO, vdev, CTLFLAG_RW, 0, "ZFS VDEV");
 
+/**
+ * The limit for ZFS to automatically increase a top-level vdev's ashift
+ * from virtual ashift to physical ashift.
+ *
+ * Example: one or more 512b emulation child vdevs
+ *     child->vdev_ashift = 9 (512 bytes)
+ *     child->vdev_physical_ashift = 12 (4096 bytes)
+ *     zfs_max_auto_ashift = 11 (2048 bytes)
+ *
+ * On pool creation or the addition of a new top-level vdev, ZFS will
+ * bump the ashift of the top-level vdev to 2048.
+ *
+ * Example: one or more 512b emulation child vdevs
+ *     child->vdev_ashift = 9 (512 bytes)
+ *     child->vdev_physical_ashift = 12 (4096 bytes)
+ *     zfs_max_auto_ashift = 13 (8192 bytes)
+ *
+ * On pool creation or the addition of a new top-level vdev, ZFS will
+ * bump the ashift of the top-level vdev to 4096.
+ */
+static uint64_t zfs_max_auto_ashift = SPA_MAXAUTOASHIFT;
+
+TUNABLE_QUAD("vfs.zfs.max_auto_ashift", &zfs_max_auto_ashift);
+SYSCTL_UQUAD(_vfs_zfs_vdev, OID_AUTO, max_auto_ashift, CTLFLAG_RW,
+    &zfs_max_auto_ashift, 0,
+    "Cap on logical -> physical ashift adjustment on new top-level vdevs");
+
 static vdev_ops_t *vdev_ops_table[] = {
 	&vdev_root_ops,
 	&vdev_raidz_ops,
@@ -746,6 +773,7 @@
 	mvd->vdev_min_asize = cvd->vdev_min_asize;
 	mvd->vdev_max_asize = cvd->vdev_max_asize;
 	mvd->vdev_ashift = cvd->vdev_ashift;
+	mvd->vdev_physical_ashift = cvd->vdev_physical_ashift;
 	mvd->vdev_state = cvd->vdev_state;
 	mvd->vdev_crtxg = cvd->vdev_crtxg;
 
@@ -777,6 +805,7 @@
 	    mvd->vdev_ops == &vdev_replacing_ops ||
 	    mvd->vdev_ops == &vdev_spare_ops);
 	cvd->vdev_ashift = mvd->vdev_ashift;
+	cvd->vdev_physical_ashift = mvd->vdev_physical_ashift;
 
 	vdev_remove_child(mvd, cvd);
 	vdev_remove_child(pvd, mvd);
@@ -1120,7 +1149,8 @@
 	uint64_t osize = 0;
 	uint64_t max_osize = 0;
 	uint64_t asize, max_asize, psize;
-	uint64_t ashift = 0;
+	uint64_t logical_ashift = 0;
+	uint64_t physical_ashift = 0;
 
 	ASSERT(vd->vdev_open_thread == curthread ||
 	    spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
@@ -1150,7 +1180,8 @@
 		return (ENXIO);
 	}
 
-	error = vd->vdev_ops->vdev_op_open(vd, &osize, &max_osize, &ashift);
+	error = vd->vdev_ops->vdev_op_open(vd, &osize, &max_osize,
+	    &logical_ashift, &physical_ashift);
 
 	/*
 	 * Reset the vdev_reopening flag so that we actually close
@@ -1250,12 +1281,14 @@
 		 */
 		vd->vdev_asize = asize;
 		vd->vdev_max_asize = max_asize;
-		vd->vdev_ashift = MAX(ashift, vd->vdev_ashift);
+		vd->vdev_ashift = MAX(logical_ashift, vd->vdev_ashift);
+		vd->vdev_physical_ashift =
+		    MAX(physical_ashift, vd->vdev_physical_ashift);
 	} else {
 		/*
 		 * Make sure the alignment requirement hasn't increased.
 		 */
-		if (ashift > vd->vdev_top->vdev_ashift) {
+		if (logical_ashift > vd->vdev_top->vdev_ashift) {
 			vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
 			    VDEV_AUX_BAD_LABEL);
 			return (EINVAL);
@@ -1560,6 +1593,17 @@
 vdev_metaslab_set_size(vdev_t *vd)
 {
 	/*
+	 * Choose a logical asize that is supported by
+	 * the device but is as close to the physical asize
+	 * as possible without incurring too much wastage.
+	 */
+	if ((vd->vdev_ashift < vd->vdev_physical_ashift) &&
+	    (vd->vdev_ashift < zfs_max_auto_ashift)) {
+		vd->vdev_ashift = MIN(zfs_max_auto_ashift,
+		    vd->vdev_physical_ashift);
+	}
+
+	/*
 	 * Aim for roughly 200 metaslabs per vdev.
 	 */
 	vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c#4 (text) ====

@@ -50,7 +50,7 @@
 
 static int
 vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
-    uint64_t *ashift)
+    uint64_t *logical_ashift, uint64_t *physical_ashift)
 {
 	vdev_file_t *vf;
 	vnode_t *vp;
@@ -129,7 +129,8 @@
 	}
 
 	*max_psize = *psize = vattr.va_size;
-	*ashift = SPA_MINBLOCKSHIFT;
+	*logical_ashift = SPA_MINBLOCKSHIFT;
+	*physical_ashift = SPA_MINBLOCKSHIFT;
 
 	return (0);
 }

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c#17 (text) ====

@@ -721,10 +721,11 @@
 
 static int
 vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
-    uint64_t *ashift)
+    uint64_t *logical_ashift, uint64_t *physical_ashift)
 {
 	struct g_provider *pp;
 	struct g_consumer *cp;
+	uint64_t asize;
 	size_t bufsize;
 	int error;
 
@@ -827,9 +828,11 @@
 	*max_psize = *psize = pp->mediasize;
 
 	/*
-	 * Determine the device's minimum transfer size.
+	 * Determine the device's minimum transfer size and preferred
+	 * transfer size.
 	 */
-	*ashift = highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1;
+	*logical_ashift = highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1;
+	*physical_ashift = highbit(pp->stripesize) - 1;
 
 	/*
 	 * Clear the nowritecache settings, so that on a vdev_reopen()

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c#6 (text) ====

@@ -175,7 +175,7 @@
 
 static int
 vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *max_asize,
-    uint64_t *ashift)
+    uint64_t *logical_ashift, uint64_t *physical_ashift)
 {
 	int numerrors = 0;
 	int lasterror = 0;
@@ -198,7 +198,9 @@
 
 		*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
 		*max_asize = MIN(*max_asize - 1, cvd->vdev_max_asize - 1) + 1;
-		*ashift = MAX(*ashift, cvd->vdev_ashift);
+		*logical_ashift = MAX(*logical_ashift, cvd->vdev_ashift);
+		*physical_ashift = MAX(*physical_ashift,
+		    cvd->vdev_physical_ashift);
 	}
 
 	if (numerrors == vd->vdev_children) {

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c#4 (text) ====

@@ -48,7 +48,7 @@
 /* ARGSUSED */
 static int
 vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
-    uint64_t *ashift)
+    uint64_t *logical_ashift, uint64_t *physical_ashift)
 {
 	/*
 	 * Really this should just fail. But then the root vdev will be in the
@@ -58,7 +58,8 @@
 	 */
 	*psize = 0;
 	*max_psize = 0;
-	*ashift = 0;
+	*logical_ashift = 0;
+	*physical_ashift = 0;
 	return (0);
 }
 

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c#6 (text) ====

@@ -1491,7 +1491,7 @@
  */
 static int
 vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *max_asize,
-    uint64_t *ashift)
+    uint64_t *logical_ashift, uint64_t *physical_ashift)
 {
 	vdev_t *cvd;
 	uint64_t nparity = vd->vdev_nparity;
@@ -1520,7 +1520,9 @@
 
 		*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
 		*max_asize = MIN(*max_asize - 1, cvd->vdev_max_asize - 1) + 1;
-		*ashift = MAX(*ashift, cvd->vdev_ashift);
+		*logical_ashift = MAX(*logical_ashift, cvd->vdev_ashift);
+		*physical_ashift = MAX(*physical_ashift,
+		    cvd->vdev_physical_ashift);
 	}
 
 	*asize *= vd->vdev_children;

==== //SpectraBSD/stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c#4 (text) ====

@@ -56,7 +56,7 @@
 
 static int
 vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *max_asize,
-    uint64_t *ashift)
+    uint64_t *logical_ashift, uint64_t *physical_ashift)
 {
 	int lasterror = 0;
 	int numerrors = 0;
@@ -84,7 +84,8 @@
 
 	*asize = 0;
 	*max_asize = 0;
-	*ashift = 0;
+	*logical_ashift = 0;
+	*physical_ashift = 0;
 
 	return (0);
 }
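
Illustration of the ashift growth rule:

The selection performed in the vdev_metaslab_set_size() hunk can be looked at in isolation. The following is only a minimal userland sketch and is not part of this change: choose_ashift() and ashift_to_bytes() are hypothetical helper names, and the cap is passed in as a parameter instead of being read from zfs_max_auto_ashift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the change): apply the same rule as the
 * vdev_metaslab_set_size() hunk.  The logical ashift is only ever grown,
 * and only toward the physical ashift, never past the cap.
 */
static uint64_t
choose_ashift(uint64_t logical, uint64_t physical, uint64_t cap)
{
	if (logical < physical && logical < cap)
		return (physical < cap ? physical : cap);

	/* Physical ashift of 0 (no preference) or logical >= cap. */
	return (logical);
}

/* Hypothetical helper: convert an ashift into an allocation size in bytes. */
static uint64_t
ashift_to_bytes(uint64_t ashift)
{
	assert(ashift < 64);	/* a shift of 64 or more is undefined in C */
	return (UINT64_C(1) << ashift);
}
```

With the values from the comment in vdev.c (logical ashift 9, physical ashift 12), a cap of 11 gives 11 (2048 bytes) and a cap of 13 gives 12 (4096 bytes). A physical ashift of 0 leaves the logical ashift unchanged, and a logical ashift that already exceeds the cap is still honored as is.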