author:    Serapheim Dimitropoulos <[email protected]>  2022-10-11 12:27:41 -0700
committer: Brian Behlendorf <[email protected]>  2022-11-01 12:36:25 -0700
commit:    37d5a3e04b7baa88e41a230c1a243874c82fcb80
tree:      25d7396012dea76cb48076d8d5dc791fc4b3f8ac /module
parent:    25096e11800a545ff80764e51bd86dcc2a03a4bd
Stop ganging due to past vdev write errors
= Problem
While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Upon further analysis
we discovered that the pool was creating gang blocks because all of
its disks had non-zero write error counts, so they were skipped for
normal metaslab allocations by the following if-clause in
`metaslab_alloc_dva()`:
```
/*
 * Avoid writing single-copy data to a failing,
 * non-redundant vdev, unless we've already tried all
 * other vdevs.
 */
if ((vd->vdev_stat.vs_write_errors > 0 ||
    vd->vdev_state < VDEV_STATE_HEALTHY) &&
    d == 0 && !try_hard && vd->vdev_children == 0) {
        metaslab_trace_add(zal, mg, NULL, psize, d,
            TRACE_VDEV_ERROR, allocator);
        goto next;
}
```
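To make the failure mode concrete, here is a small stand-alone sketch
(hypothetical toy code, not OpenZFS itself; `toy_vdev_t`,
`skipped_pre_patch`, and `TOY_STATE_HEALTHY` are made-up stand-ins for
the real structures) that applies the pre-patch condition to a pool in
which every leaf top-level vdev is currently HEALTHY but has seen at
least one write error at some point. Because `vs_write_errors` is a
cumulative counter, every device is rejected on the first-pass
(`d == 0`, `!try_hard`) attempt:
```
/* Toy model of the pre-patch skip condition; not OpenZFS code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_STATE_HEALTHY 7     /* stand-in for VDEV_STATE_HEALTHY */

typedef struct {
        const char *name;
        uint64_t vs_write_errors;   /* cumulative over the vdev's lifetime */
        int vdev_state;             /* larger is healthier */
        int vdev_children;          /* 0 for a leaf top-level vdev */
} toy_vdev_t;

/* Mirrors the pre-patch condition quoted above. */
static bool
skipped_pre_patch(const toy_vdev_t *vd, int d, bool try_hard)
{
        return ((vd->vs_write_errors > 0 ||
            vd->vdev_state < TOY_STATE_HEALTHY) &&
            d == 0 && !try_hard && vd->vdev_children == 0);
}

int
main(void)
{
        /* Every vdev is HEALTHY but has a non-zero historical error count. */
        toy_vdev_t pool[] = {
                { "xvdb", 1, TOY_STATE_HEALTHY, 0 },
                { "xvdc", 1, TOY_STATE_HEALTHY, 0 },
                { "xvdd", 1, TOY_STATE_HEALTHY, 0 },
        };

        for (size_t i = 0; i < sizeof (pool) / sizeof (pool[0]); i++)
                printf("%s: %s on the first pass\n", pool[i].name,
                    skipped_pre_patch(&pool[i], 0, false) ?
                    "skipped" : "eligible");
        return (0);
}
```
With no vdev eligible for a normal allocation, the pool falls back to
gang blocks, which is the space blow-up described above.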
= Proposed Solution
Remove the part of the if-clause that checks the past write errors
of the selected vdev. We already steer allocations toward HEALTHY
vdevs by checking `vdev_state`, so the write-error check doesn't seem
to help us. Quite the opposite: it can cause issues in long-lived
pools like our customer's, where every vdev eventually accumulates a
non-zero write error count.
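For comparison, a minimal sketch of the condition after the patch
(again hypothetical toy code; `skipped_post_patch` and the
`TOY_STATE_*` constants are made-up names, with the constants ordered
like `vdev_state_t`): with the write-error term gone, a currently
HEALTHY leaf vdev is eligible on the first pass no matter how many
errors it has seen in the past, while a vdev below HEALTHY is still
avoided until the allocator retries with `try_hard`:
```
/* Toy model of the post-patch skip condition; not OpenZFS code. */
#include <assert.h>
#include <stdbool.h>

#define TOY_STATE_DEGRADED 6    /* stand-ins ordered like vdev_state_t */
#define TOY_STATE_HEALTHY  7

/* Post-patch predicate: only the vdev's current state matters. */
static bool
skipped_post_patch(int vdev_state, int vdev_children, int d, bool try_hard)
{
        return (vdev_state < TOY_STATE_HEALTHY &&
            d == 0 && !try_hard && vdev_children == 0);
}

int
main(void)
{
        /* A HEALTHY leaf vdev is eligible on the first pass, regardless
         * of how many write errors it accumulated in the past. */
        assert(!skipped_post_patch(TOY_STATE_HEALTHY, 0, 0, false));

        /* A vdev below HEALTHY (e.g. DEGRADED) is still avoided on the
         * first, non-try_hard pass, but not once we try hard. */
        assert(skipped_post_patch(TOY_STATE_DEGRADED, 0, 0, false));
        assert(!skipped_post_patch(TOY_STATE_DEGRADED, 0, 0, true));
        return (0);
}
```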
= Testing
I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME SIZE ALLOC FREE
volpool 22.5G 117M 22.4G
xvdb 7.99G 40.2M 7.46G
xvdc 7.99G 39.1M 7.46G
xvdd 7.99G 37.8M 7.46G
```
And used `zinject` on each one of them like so:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```
And got the vdevs to the following state:
```
$ zpool status volpool
pool: volpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:
NAME STATE READ WRITE CKSUM
volpool ONLINE 0 0 0
xvdb ONLINE 0 1 0
xvdc ONLINE 0 1 0
xvdd ONLINE 0 4 0
```
I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0 # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```
Then I verified that the problem reproduced in my VM: the gang count
reported by zdb kept growing as I wrote more data:
```
$ sudo zdb volpool | grep gang
ganged count: 1384
$ sudo zdb volpool | grep gang
ganged count: 1393
$ sudo zdb volpool | grep gang
ganged count: 1402
$ sudo zdb volpool | grep gang
ganged count: 1414
```
Then I updated my bits with this patch and the gang count stayed the
same.
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #14003
Diffstat (limited to 'module')
```
-rw-r--r--  module/zfs/metaslab.c | 5
1 file changed, 2 insertions(+), 3 deletions(-)
```
```
diff --git a/module/zfs/metaslab.c b/module/zfs/metaslab.c
index ecc70298d..74796096d 100644
--- a/module/zfs/metaslab.c
+++ b/module/zfs/metaslab.c
@@ -5207,12 +5207,11 @@ top:
         ASSERT(mg->mg_initialized);

         /*
-         * Avoid writing single-copy data to a failing,
+         * Avoid writing single-copy data to an unhealthy,
          * non-redundant vdev, unless we've already tried all
          * other vdevs.
          */
-        if ((vd->vdev_stat.vs_write_errors > 0 ||
-            vd->vdev_state < VDEV_STATE_HEALTHY) &&
+        if (vd->vdev_state < VDEV_STATE_HEALTHY &&
             d == 0 && !try_hard && vd->vdev_children == 0) {
                 metaslab_trace_add(zal, mg, NULL, psize, d,
                     TRACE_VDEV_ERROR, allocator);
```