author:    Serapheim Dimitropoulos <[email protected]>  2022-10-11 12:27:41 -0700
committer: Brian Behlendorf <[email protected]>  2022-11-01 12:36:25 -0700
commit:    37d5a3e04b7baa88e41a230c1a243874c82fcb80
tree:      25d7396012dea76cb48076d8d5dc791fc4b3f8ac /module
parent:    25096e11800a545ff80764e51bd86dcc2a03a4bd
Stop ganging due to past vdev write errors
= Problem
While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Upon further analysis
we discovered that the pool was creating gang blocks because all of
its disks had non-zero write error counts, so they were skipped for
normal metaslab allocations by the following if-clause in
`metaslab_alloc_dva()`:
```
/*
 * Avoid writing single-copy data to a failing,
 * non-redundant vdev, unless we've already tried all
 * other vdevs.
 */
if ((vd->vdev_stat.vs_write_errors > 0 ||
    vd->vdev_state < VDEV_STATE_HEALTHY) &&
    d == 0 && !try_hard && vd->vdev_children == 0) {
        metaslab_trace_add(zal, mg, NULL, psize, d,
            TRACE_VDEV_ERROR, allocator);
        goto next;
}
```
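To make the failure mode concrete, here is a small stand-alone sketch
(hypothetical toy code, not OpenZFS itself; `toy_vdev_t`,
`skipped_pre_patch`, and `TOY_STATE_HEALTHY` are made-up stand-ins for
the real structures) that applies the pre-patch condition to a pool in
which every leaf top-level vdev is currently HEALTHY but has seen at
least one write error at some point. Because `vs_write_errors` is a
cumulative counter, every device is rejected on the first-pass
(`d == 0`, `!try_hard`) attempt:
```
/* Toy model of the pre-patch skip condition; not OpenZFS code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_STATE_HEALTHY 7     /* stand-in for VDEV_STATE_HEALTHY */

typedef struct {
        const char *name;
        uint64_t vs_write_errors;   /* cumulative over the vdev's lifetime */
        int vdev_state;             /* larger is healthier */
        int vdev_children;          /* 0 for a leaf top-level vdev */
} toy_vdev_t;

/* Mirrors the pre-patch condition quoted above. */
static bool
skipped_pre_patch(const toy_vdev_t *vd, int d, bool try_hard)
{
        return ((vd->vs_write_errors > 0 ||
            vd->vdev_state < TOY_STATE_HEALTHY) &&
            d == 0 && !try_hard && vd->vdev_children == 0);
}

int
main(void)
{
        /* Every vdev is HEALTHY but has a non-zero historical error count. */
        toy_vdev_t pool[] = {
                { "xvdb", 1, TOY_STATE_HEALTHY, 0 },
                { "xvdc", 1, TOY_STATE_HEALTHY, 0 },
                { "xvdd", 1, TOY_STATE_HEALTHY, 0 },
        };

        for (size_t i = 0; i < sizeof (pool) / sizeof (pool[0]); i++)
                printf("%s: %s on the first pass\n", pool[i].name,
                    skipped_pre_patch(&pool[i], 0, false) ?
                    "skipped" : "eligible");
        return (0);
}
```
With no vdev eligible for a normal allocation, the pool falls back to
gang blocks, which is the space blow-up described above.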
= Proposed Solution
Remove the part of the if-clause that checks the past write errors
of the selected vdev. We already steer allocations toward HEALTHY
vdevs by checking `vdev_state`, so the write-error check doesn't seem
to help us. Quite the opposite: it can cause issues in long-lived
pools like our customer's, where every vdev eventually accumulates a
non-zero write error count.
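For comparison, a minimal sketch of the condition after the patch
(again hypothetical toy code; `skipped_post_patch` and the
`TOY_STATE_*` constants are made-up names, with the constants ordered
like `vdev_state_t`): with the write-error term gone, a currently
HEALTHY leaf vdev is eligible on the first pass no matter how many
errors it has seen in the past, while a vdev below HEALTHY is still
avoided until the allocator retries with `try_hard`:
```
/* Toy model of the post-patch skip condition; not OpenZFS code. */
#include <assert.h>
#include <stdbool.h>

#define TOY_STATE_DEGRADED 6    /* stand-ins ordered like vdev_state_t */
#define TOY_STATE_HEALTHY  7

/* Post-patch predicate: only the vdev's current state matters. */
static bool
skipped_post_patch(int vdev_state, int vdev_children, int d, bool try_hard)
{
        return (vdev_state < TOY_STATE_HEALTHY &&
            d == 0 && !try_hard && vdev_children == 0);
}

int
main(void)
{
        /* A HEALTHY leaf vdev is eligible on the first pass, regardless
         * of how many write errors it accumulated in the past. */
        assert(!skipped_post_patch(TOY_STATE_HEALTHY, 0, 0, false));

        /* A vdev below HEALTHY (e.g. DEGRADED) is still avoided on the
         * first, non-try_hard pass, but not once we try hard. */
        assert(skipped_post_patch(TOY_STATE_DEGRADED, 0, 0, false));
        assert(!skipped_post_patch(TOY_STATE_DEGRADED, 0, 0, true));
        return (0);
}
```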
= Testing
I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME SIZE ALLOC FREE
volpool 22.5G 117M 22.4G
xvdb 7.99G 40.2M 7.46G
xvdc 7.99G 39.1M 7.46G
xvdd 7.99G 37.8M 7.46G
```
And used `zinject` on each one of them like so:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```
And got the vdevs to the following state:
```
$ zpool status volpool
pool: volpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:
NAME STATE READ WRITE CKSUM
volpool ONLINE 0 0 0
xvdb ONLINE 0 1 0
xvdc ONLINE 0 1 0
xvdd ONLINE 0 4 0
```
I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0 # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```
Then I verified that the problem reproduced in my VM: the gang count
reported by zdb kept growing as I wrote more data:
```
$ sudo zdb volpool | grep gang
ganged count: 1384
$ sudo zdb volpool | grep gang
ganged count: 1393
$ sudo zdb volpool | grep gang
ganged count: 1402
$ sudo zdb volpool | grep gang
ganged count: 1414
```
Then I updated my bits with this patch and the gang count stayed the
same.
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #14003
Diffstat (limited to 'module')
```
-rw-r--r--  module/zfs/metaslab.c | 5
1 file changed, 2 insertions(+), 3 deletions(-)
```
```
diff --git a/module/zfs/metaslab.c b/module/zfs/metaslab.c
index ecc70298d..74796096d 100644
--- a/module/zfs/metaslab.c
+++ b/module/zfs/metaslab.c
@@ -5207,12 +5207,11 @@ top:
         ASSERT(mg->mg_initialized);

         /*
-         * Avoid writing single-copy data to a failing,
+         * Avoid writing single-copy data to an unhealthy,
          * non-redundant vdev, unless we've already tried all
          * other vdevs.
          */
-        if ((vd->vdev_stat.vs_write_errors > 0 ||
-            vd->vdev_state < VDEV_STATE_HEALTHY) &&
+        if (vd->vdev_state < VDEV_STATE_HEALTHY &&
             d == 0 && !try_hard && vd->vdev_children == 0) {
                 metaslab_trace_add(zal, mg, NULL, psize, d,
                     TRACE_VDEV_ERROR, allocator);
```