author     Matthew Ahrens <[email protected]>  2021-01-26 16:05:05 -0800
committer  GitHub <[email protected]>  2021-01-26 16:05:05 -0800
commit     62d4287f279a0d184f8f332475f27af58b7aa87e
tree       22fae51be20ef0036624e2664e9b6938d6127385 /module/zfs
parent     d7265b330954fe844958642c5f61e790c58442d5
RAIDZ2/3 fails to heal silently corrupted parity w/2+ bad disks
When scrubbing, (non-sequential) resilvering, or correcting a checksum error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity by overwriting it. For example, if P disks are silently corrupted (P being the number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool scrub` should detect and heal all the bad state on these disks, including parity. This way, if there is a subsequent failure, we are fully protected.

With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity sector, and also damage (silent or known) to a data sector. In this case the parity should be healed, but it is not.

The problem can be noticed by scrubbing the pool twice. Assuming there was no damage concurrent with the scrubs, the first scrub should fix all silent damage, and the second scrub should be "clean" (`zpool status` should not report checksum errors on any disks). If the bug is encountered, then the second scrub will repair the silently damaged parity that the first scrub failed to repair, and these checksum errors will be reported after the second scrub. Since the first scrub repaired all the damaged data, the bug cannot be encountered during the second scrub, so subsequent scrubs (more than two) are not necessary.

The root cause of the problem is some code that was inadvertently added to `raidz_parity_verify()` by the DRAID changes. The incorrect code causes the parity healing to be aborted if there is damaged data (`rc_error != 0`) or the data disk is not present (`!rc_tried`). These checks are not necessary, because we only call `raidz_parity_verify()` if we have the correct data (which may have been reconstructed using parity, and which was verified by the checksum).

This commit fixes the problem by removing the incorrect checks in `raidz_parity_verify()`.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11489
Closes #11510
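The message notes that the data handed to `raidz_parity_verify()` may already have been reconstructed using parity. As a rough illustration of why such data is still a valid input, here is a minimal sketch of rebuilding one data column from the P parity column, which in RAIDZ is the XOR of the data columns. The function name `reconstruct_from_p` and the one-byte-per-column layout are hypothetical simplifications; the real reconstruction code in `vdev_raidz.c` works on whole sectors held in ABD buffers and also uses Q/R parity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: rebuild data column `missing` of a single row
 * from the P (XOR) parity byte and the other data bytes. Columns are
 * modeled as one byte each; real RAIDZ columns are whole sectors.
 */
static uint8_t
reconstruct_from_p(const uint8_t *data, size_t ndata, size_t missing,
    uint8_t p)
{
	uint8_t v = p;

	assert(missing < ndata);
	/* P = D0 ^ D1 ^ ... ^ Dn-1, so XOR-ing out the good columns leaves Dmissing. */
	for (size_t i = 0; i < ndata; i++) {
		if (i != missing)
			v ^= data[i];
	}
	return (v);
}
```

For example, with data bytes 0x11, 0x22, 0x33 the P byte is 0x00; if the middle column reads back as garbage, XOR-ing P with 0x11 and 0x33 recovers 0x22. ZFS only trusts such a result once the block checksum matches, and at that point the data is correct regardless of which columns had `rc_error` set, so it can be used to regenerate and compare parity.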
Diffstat (limited to 'module/zfs')
-rw-r--r-- module/zfs/vdev_raidz.c 10
1 files changed, 0 insertions, 10 deletions
diff --git a/module/zfs/vdev_raidz.c b/module/zfs/vdev_raidz.c
index 07934cdff..f4812e612 100644
--- a/module/zfs/vdev_raidz.c
+++ b/module/zfs/vdev_raidz.c
@@ -1903,16 +1903,6 @@ raidz_parity_verify(zio_t *zio, raidz_row_t *rr)
 	if (checksum == ZIO_CHECKSUM_NOPARITY)
 		return (ret);
 
-	/*
-	 * All data columns must have been successfully read in order
-	 * to use them to generate parity columns for comparison.
-	 */
-	for (c = rr->rr_firstdatacol; c < rr->rr_cols; c++) {
-		rc = &rr->rr_col[c];
-		if (!rc->rc_tried || rc->rc_error != 0)
-			return (ret);
-	}
-
 	for (c = 0; c < rr->rr_firstdatacol; c++) {
 		rc = &rr->rr_col[c];
 		if (!rc->rc_tried || rc->rc_error != 0)
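To make the effect of the removed loop concrete, the following is a toy model of the parity check for one RAIDZ2-style row, run with and without the deleted early return. Everything here is hypothetical and simplified: `struct toy_col`, `toy_parity_verify()`, `gf_mul2()`, `TOY_NPARITY`, and the one-byte columns are illustrative stand-ins, not the ZFS types or functions. It assumes P in column 0 is the XOR of the data columns and Q in column 1 is computed by Horner's rule over GF(2^8) with generator 2 and reducing polynomial 0x11d, which is our understanding of the scalar RAIDZ parity math.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_NPARITY	2	/* column 0 is P, column 1 is Q (RAIDZ2-like) */

/* Hypothetical stand-in for raidz_col_t: one byte per column. */
struct toy_col {
	uint8_t data;
	bool tried;	/* a read was issued to this column */
	int error;	/* nonzero if the read reported an error */
};

/* Multiply by 2 in GF(2^8) with reducing polynomial 0x11d. */
static uint8_t
gf_mul2(uint8_t x)
{
	return ((uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0)));
}

/*
 * Toy model of the parity check: data columns are assumed to already
 * hold checksum-verified bytes (possibly reconstructed). Regenerate P
 * and Q, overwrite any cleanly read parity column that disagrees, and
 * return how many parity columns were healed. When old_behavior is
 * true, the check deleted by this commit is applied first.
 */
static int
toy_parity_verify(struct toy_col *cols, int ncols, bool old_behavior)
{
	uint8_t expect[TOY_NPARITY] = { 0, 0 };
	int healed = 0;

	assert(ncols > TOY_NPARITY);

	if (old_behavior) {
		/* Deleted check: give up if any data column was bad. */
		for (int c = TOY_NPARITY; c < ncols; c++) {
			if (!cols[c].tried || cols[c].error != 0)
				return (0);
		}
	}

	/* Regenerate P (XOR) and Q (Horner evaluation) from the data. */
	for (int c = TOY_NPARITY; c < ncols; c++) {
		expect[0] ^= cols[c].data;
		expect[1] = gf_mul2(expect[1]) ^ cols[c].data;
	}

	/* Retained loop: only compare parity columns that read cleanly. */
	for (int c = 0; c < TOY_NPARITY; c++) {
		if (!cols[c].tried || cols[c].error != 0)
			continue;
		if (cols[c].data != expect[c]) {
			cols[c].data = expect[c];	/* heal by overwriting */
			healed++;
		}
	}
	return (healed);
}
```

With data bytes 0x11, 0x22, 0x33, the expected P is 0x00 and the expected Q is 0x33. If Q was silently read back as 0xcc and the middle data column had a read error but was reconstructed to 0x22, the `old_behavior` call returns 0 and leaves Q corrupt, which is the situation that the second scrub later reports. The call without the deleted check returns 1 and rewrites Q to 0x33.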