diff options
author | Prakash Surya <[email protected]> | 2017-12-07 11:26:32 -0800 |
---|---|---|
committer | Brian Behlendorf <[email protected]> | 2017-12-28 10:18:04 -0800 |
commit | 2fe61a7ecc507d031451c21b3077fae549b58ec3 (patch) | |
tree | 520a5ddf2397d4ae945503b753819d2af0982237 /include | |
parent | 823d48bfb182137c53b9432498f1f0564eaa8bfc (diff) |
OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Prakash Surya <[email protected]>
PROBLEM
=======
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race *is* present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
SOLUTION
========
The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.
Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.
Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.
Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.
OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
Diffstat (limited to 'include')
-rw-r--r-- | include/sys/zil.h | 1 | ||||
-rw-r--r-- | include/sys/zil_impl.h | 39 | ||||
-rw-r--r-- | include/sys/zio.h | 1 |
3 files changed, 33 insertions, 8 deletions
diff --git a/include/sys/zil.h b/include/sys/zil.h index 5513c7ce8..fc0cc5065 100644 --- a/include/sys/zil.h +++ b/include/sys/zil.h @@ -494,6 +494,7 @@ extern void zil_itx_destroy(itx_t *itx); extern void zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *tx); extern void zil_commit(zilog_t *zilog, uint64_t oid); +extern void zil_commit_impl(zilog_t *zilog, uint64_t oid); extern int zil_vdev_offline(const char *osname, void *txarg); extern int zil_claim(struct dsl_pool *dp, diff --git a/include/sys/zil_impl.h b/include/sys/zil_impl.h index 7e27215ca..6901d51f0 100644 --- a/include/sys/zil_impl.h +++ b/include/sys/zil_impl.h @@ -36,13 +36,38 @@ extern "C" { #endif /* - * Possbile states for a given lwb structure. An lwb will start out in - * the "closed" state, and then transition to the "opened" state via a - * call to zil_lwb_write_open(). After the lwb is "open", it can - * transition into the "issued" state via zil_lwb_write_issue(). After - * the lwb's zio completes, and the vdev's are flushed, the lwb will - * transition into the "done" state via zil_lwb_write_done(), and the - * structure eventually freed. + * Possible states for a given lwb structure. + * + * An lwb will start out in the "closed" state, and then transition to + * the "opened" state via a call to zil_lwb_write_open(). When + * transitioning from "closed" to "opened" the zilog's "zl_issuer_lock" + * must be held. + * + * After the lwb is "opened", it can transition into the "issued" state + * via zil_lwb_write_issue(). Again, the zilog's "zl_issuer_lock" must + * be held when making this transition. + * + * After the lwb's zio completes, and the vdev's are flushed, the lwb + * will transition into the "done" state via zil_lwb_write_done(). When + * transitioning from "issued" to "done", the zilog's "zl_lock" must be + * held, *not* the "zl_issuer_lock". + * + * The zilog's "zl_issuer_lock" can become heavily contended in certain + * workloads, so we specifically avoid acquiring that lock when + * transitioning an lwb from "issued" to "done". This allows us to avoid + * having to acquire the "zl_issuer_lock" for each lwb ZIO completion, + * which would have added more lock contention on an already heavily + * contended lock. + * + * Additionally, correctness when reading an lwb's state is often + * achieved by exploiting the fact that these state transitions occur in + * this specific order; i.e. "closed" to "opened" to "issued" to "done". + * + * Thus, if an lwb is in the "closed" or "opened" state, holding the + * "zl_issuer_lock" will prevent a concurrent thread from transitioning + * that lwb to the "issued" state. Likewise, if an lwb is already in the + * "issued" state, holding the "zl_lock" will prevent a concurrent + * thread from transitioning that lwb to the "done" state. */ typedef enum { LWB_STATE_CLOSED, diff --git a/include/sys/zio.h b/include/sys/zio.h index 8beea7adf..bf4818e16 100644 --- a/include/sys/zio.h +++ b/include/sys/zio.h @@ -598,7 +598,6 @@ extern enum zio_checksum zio_checksum_dedup_select(spa_t *spa, extern enum zio_compress zio_compress_select(spa_t *spa, enum zio_compress child, enum zio_compress parent); -extern void zio_cancel(zio_t *zio); extern void zio_suspend(spa_t *spa, zio_t *zio); extern int zio_resume(spa_t *spa); extern void zio_resume_wait(spa_t *spa); |