aboutsummaryrefslogtreecommitdiffstats
path: root/tests
diff options
context:
space:
mode:
authorPrakash Surya <[email protected]>2017-12-07 11:26:32 -0800
committerBrian Behlendorf <[email protected]>2017-12-28 10:18:04 -0800
commit2fe61a7ecc507d031451c21b3077fae549b58ec3 (patch)
tree520a5ddf2397d4ae945503b753819d2af0982237 /tests
parent823d48bfb182137c53b9432498f1f0564eaa8bfc (diff)
OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Prakash Surya <[email protected]> PROBLEM ======= There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race *is* present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnecessary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`. SOLUTION ======== The solution to this, is to ensure all outstanding lwb's complete before calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch accomplishes this goal by forcing the normal `zil_commit` logic when called from `zil_sync`. Now, `zil_suspend` will call `zil_commit_impl` which will always use the normal logic of waiting/issuing lwb's to disk before it returns. As a result, any lwb's outstanding when `zil_commit_impl` is called will be guaranteed to reach the `LWB_STATE_DONE` state by the time it returns. Further, no new lwb's will be created via `zil_commit` since the zilog's `zl_suspend` flag will be set. This will force all new callers of `zil_commit` to use `txg_wait_synced` instead of creating and issuing new lwb's. Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is called will be in the `LWB_STATE_DONE` state, and we'll avoid this race condition. OpenZFS-issue: https://www.illumos.org/issues/8909 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d Closes #6940
Diffstat (limited to 'tests')
-rw-r--r--tests/runfiles/linux.run3
-rw-r--r--tests/zfs-tests/tests/functional/slog/Makefile.am1
-rwxr-xr-xtests/zfs-tests/tests/functional/slog/slog_015_neg.ksh73
3 files changed, 76 insertions, 1 deletions
diff --git a/tests/runfiles/linux.run b/tests/runfiles/linux.run
index c4650ab41..ad35e92be 100644
--- a/tests/runfiles/linux.run
+++ b/tests/runfiles/linux.run
@@ -656,7 +656,8 @@ tags = ['functional', 'scrub_mirror']
tests = ['slog_001_pos', 'slog_002_pos', 'slog_003_pos', 'slog_004_pos',
'slog_005_pos', 'slog_006_pos', 'slog_007_pos', 'slog_008_neg',
'slog_009_neg', 'slog_010_neg', 'slog_011_neg', 'slog_012_neg',
- 'slog_013_pos', 'slog_014_pos', 'slog_replay_fs', 'slog_replay_volume']
+ 'slog_013_pos', 'slog_014_pos', 'slog_015_neg', 'slog_replay_fs',
+ 'slog_replay_volume']
tags = ['functional', 'slog']
[tests/functional/snapshot]
diff --git a/tests/zfs-tests/tests/functional/slog/Makefile.am b/tests/zfs-tests/tests/functional/slog/Makefile.am
index f446facfe..157e66707 100644
--- a/tests/zfs-tests/tests/functional/slog/Makefile.am
+++ b/tests/zfs-tests/tests/functional/slog/Makefile.am
@@ -18,5 +18,6 @@ dist_pkgdata_SCRIPTS = \
slog_012_neg.ksh \
slog_013_pos.ksh \
slog_014_pos.ksh \
+ slog_015_neg.ksh \
slog_replay_fs.ksh \
slog_replay_volume.ksh
diff --git a/tests/zfs-tests/tests/functional/slog/slog_015_neg.ksh b/tests/zfs-tests/tests/functional/slog/slog_015_neg.ksh
new file mode 100755
index 000000000..37821888e
--- /dev/null
+++ b/tests/zfs-tests/tests/functional/slog/slog_015_neg.ksh
@@ -0,0 +1,73 @@
+#!/bin/ksh -p
+#
+# This file and its contents are supplied under the terms of the
+# Common Development and Distribution License ("CDDL"), version 1.0.
+# You may only use this file in accordance with the terms of version
+# 1.0 of the CDDL.
+#
+# A full copy of the text of the CDDL should have accompanied this
+# source. A copy of the CDDL is also available via the Internet at
+# http://www.illumos.org/license/CDDL.
+#
+
+#
+# Copyright (c) 2017 by Delphix. All rights reserved.
+#
+
+. $STF_SUITE/tests/functional/slog/slog.kshlib
+
+#
+# DESCRIPTION:
+# Concurrent sync writes with log offline/online works.
+#
+# STRATEGY:
+# 1. Configure "zfs_commit_timeout_pct"
+# 2. Create pool with a log device.
+# 3. Concurrently do the following:
+# 3.1. Perform 8K sync writes
+# 3.2. Perform log offline/online commands
+# 4. Loop to test with growing "zfs_commit_timout_pct" values.
+#
+
+verify_runnable "global"
+
+function cleanup
+{
+ #
+ # Wait for any of the writes and/or zpool commands that were
+ # kicked off in the background to complete. On failure, we may
+ # enter this function without previously waiting for them.
+ #
+ wait
+
+ set_tunable64 zfs_commit_timeout_pct $ORIG_TIMEOUT
+
+ poolexists $TESTPOOL && zpool destroy -f $TESTPOOL
+}
+
+ORIG_TIMEOUT=$(get_tunable zfs_commit_timeout_pct | tail -1 | awk '{print $NF}')
+log_onexit cleanup
+
+for PCT in 0 1 2 4 8 16 32 64 128 256 512 1024; do
+ log_must set_tunable64 zfs_commit_timeout_pct $PCT
+
+ log_must zpool create $TESTPOOL $VDEV log $SDEV
+
+ for i in {1..10}; do
+ log_must fio --rw write --sync 1 --directory "/$TESTPOOL" \
+ --bs 8K --size 8K --name slog-test
+ done &
+
+ for i in {1..10}; do
+ log_must zpool offline $TESTPOOL $SDEV
+ log_must verify_slog_device $TESTPOOL $SDEV 'OFFLINE'
+ log_must zpool online $TESTPOOL $SDEV
+ log_must verify_slog_device $TESTPOOL $SDEV 'ONLINE'
+ done &
+
+ wait
+
+ log_must zpool destroy -f $TESTPOOL
+done
+
+log_pass "Concurrent writes with slog offline/online works."