| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).
This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.
The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.
As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8328
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.
Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: loli10K <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to an off-by-one condition in spa_preferred_class() we are picking
the "normal" allocation class instead of the "special" one for file
blocks with size equal to the special_small_blocks property value.
This change fix the small code issue, update the ZFS Test Suite and the
zfs(8) man page.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8351
Closes #8361
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Re-factor arc_read() to better account for embedded data blkptrs.
Previously, reading the payload from an embedded blkptr would cause
arcstats such as demand_metadata_misses to be bumped when there was
actually no cache "miss" because the data are already available in
the blkptr.
The following test procedure was used to demonstrate the problem:
zpool create tank ...
zfs create -o compression=lz4 tank/fs
echo blah > /tank/fs/blah
stat /tank/fs/blah
grep 'meta.*mis' /proc/spl/kstat/zfs/arcstats
and repeating the last two steps to watch the metadata miss counter
increment. This can also be demonstrated via the zfs_arc_miss DTRACE4
probe in arc_read().
Reviewed-by: loli10K <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #8319
|
|
|
|
|
|
|
|
|
|
| |
Get rid of the majority metaslab metadata when removing log vdevs
in spa_vdev_remove_log() with a call to metaslab_fini() instead
of duplicating a lot of that in vdev_remove_empty_log().
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8347
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current L2 ARC device code consistently uses psize to
increment vs_alloc but varies between psize and lsize when
decrementing it. The result of this behavior is that
vs_alloc can be decremented more that it is incremented
and underflow. This patch changes the code so asize is
used anywhere.
In addition, it ensures that vs_alloc gets incremented by
the L2 ARC device code as buffers are written and not at
the end of the l2arc_write_buffers() routine. The latter
(and old) way would temporarily underflow vs_alloc as
buffers that were just written, would be destroyed while
l2arc_write_buffers() was still looping.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8298
|
|
|
|
|
|
|
|
|
|
|
|
| |
Address a deadlock caused by simultaneous wakeup and cancel on a zthr
by remove the hold of zthr_request_lock from zthr_wakeup. This
allows thr_wakeup to not block a thread that is in the process of
being cancelled.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Sara Hartse <[email protected]>
Closes #8333
|
|
|
|
|
|
|
|
|
|
|
| |
The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the
GPL-only bio_associate_blkg() symbol thus inadvertently converting
the entire macro. Provide a minimal version which always assigns the
request queue's root_blkg to the bio.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8287
|
|
|
|
|
|
|
|
|
| |
SUBDIRs has been deprecated for a long time, and was finally removed in
the 5.0 kernel. Use "M=" instead.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8257
|
|
|
|
|
|
|
|
|
|
|
| |
In the 5.0 kernel, only the mount namespace code should use the MS_*
macos. Filesystems should use the SB_* ones.
https://patchwork.kernel.org/patch/10552493/
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8264
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
totalram_pages() was converted to an atomic variable in 5.0:
https://patchwork.kernel.org/patch/10652795/
Its value should now be read though the totalram_pages() helper
function.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8263
|
|
|
|
|
|
|
|
| |
access_ok no longer needs a 'type' parameter in the 5.0 kernel.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #8261
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
= Old behavior
For vdev sizes 100GB to 50TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 256GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.
= New Behavior
For vdev sizes 100GB to 3TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 16GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.
= Reasoning
The old behavior makes metaslabs grow in size when
the vdev range is between 3TB (ms_size 16GB) and
32PB (ms_size 256GB). Even though keeping the number
of metaslabs is good in terms of potential number of
I/Os per TXG, these bigger metaslabs take longer
to be loaded and after they are loaded they can
take up a lot of memory because of their range trees.
This change tries to put a boundary in memory and
loading time for the specific range of vdev sizes.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8324
|
|
|
|
|
|
|
|
|
|
|
|
| |
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8327
|
|
|
|
|
|
|
|
|
| |
This allows the spa config refcounts to use tracking in debug builds
without triggering the "No such hold %p on refcount" panic.
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #8326
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, zvol_rename_minors_impl() calls kmem_asprintf()
to allocate and initialize a string. This function is a thin
wrapper around the kernel's kvasprintf() and does not call
into the SPL's kmem tracking code when it is enabled. However,
this function frees the string with the tracked kmem_free()
instead of the untracked strfree(), which causes the SPL
kmem tracking code to believe that the function is attempting
to free memory it never allocated, triggering an ASSERT. This
patch simply corrects this issue.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8307
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since d8fdfc2 was integrated dsl_pool_create() does not call
dmu_objset_create_impl() for the root dataset when running in
userland (ztest): this creates a pool with a partially initialized
root dataset. Trying to import and use this pool results in both
zpool and zfs executables dumping core.
Fix this by adopting an alternative change suggested in OpenZFS 8607
code review.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Original-patch-by: Robert Mustacchi <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8277
|
|
|
|
|
|
|
|
|
|
|
|
| |
This check provides no real additional protection and unnecessarily
introduces a dependency on the "oops_in_progress" kernel symbol.
Remove the check, it there are special circumstances on other
platforms which make this a requirement it can be reintroduced
for all relevant call paths in a more portable comprehensive manor.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8297
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.
Factoring metaslab_load_wait() within metaslab_load() makes the
later more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8290
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, when a DRR_OBJECT record is read into memory in
receive_read_record(), memory is allocated for the bonus buffer.
However, if the object doesn't have a bonus buffer the code will
still "allocate" the zero bytes, but the memory will not be passed
to the processing thread for cleanup later. This causes the spl
kmem tracking code to report a leak. This patch simply changes the
code so that it only allocates this memory if it has a non-zero
length.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8266
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Trying to set user properties with their length 1 byte shorter than the
maximum size triggers an assertion failure in zap_leaf_array_create():
panic[cpu0]/thread=ffffff000a092c40:
assertion failed: num_integers * integer_size < (8<<10) (0x2000 < 0x2000), file: ../../common/fs/zfs/zap_leaf.c, line: 233
ffffff000a092500 genunix:process_type+167c35 ()
ffffff000a0925a0 zfs:zap_leaf_array_create+1d2 ()
ffffff000a092650 zfs:zap_entry_create+1be ()
ffffff000a092720 zfs:fzap_update+ed ()
ffffff000a0927d0 zfs:zap_update+1a5 ()
ffffff000a0928d0 zfs:dsl_prop_set_sync_impl+5c6 ()
ffffff000a092970 zfs:dsl_props_set_sync_impl+fc ()
ffffff000a0929b0 zfs:dsl_props_set_sync+79 ()
ffffff000a0929f0 zfs:dsl_sync_task_sync+10a ()
ffffff000a092a80 zfs:dsl_pool_sync+3a3 ()
ffffff000a092b50 zfs:spa_sync+4e6 ()
ffffff000a092c20 zfs:txg_sync_thread+297 ()
ffffff000a092c30 unix:thread_start+8 ()
This patch simply corrects the assertion.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8278
|
|
|
|
|
|
|
|
|
|
|
|
| |
The point of this refactoring is to break the high-level conceptual
steps of spa_sync() to their own helper functions. In general large
functions can enhance readability if structured well, but in this
case the amount of conceptual steps taken could use the help of
helper functions.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8293
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the later will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8291
|
|
|
|
|
|
|
|
|
| |
The following fields from the vdev_t struct are not used anywhere.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8285
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ztest_ddt_repair() test is designed inflict damage to the
ddt which can be repairable by a scrub. Unfortunately, this
repair logic was broken at some point and it went undetected.
This issue is not specific to ztest, but thankfully this extra
redundancy is rarely enabled and even more rarely needed.
The root cause was identified to be the ddt_bp_create()
function called by dsl_scan_ddt_entry() which did not set the
dedup bit of the generated block pointer.
The consequence of this was that the ZIO_DDT_READ_PIPELINE was
never enabled for the block pointer during the scrub, and the
dedup ditto repair logic was never run. Note that for demand
reads which don't rely on ddt_bp_create() the required pipeline
stages would be enabled and the repair performed.
This was resolved by unconditionally setting the dedup bit in
ddt_bp_create(). This way all codes paths which may need to
perform a repair from a block pointer generated from the dtt
entry will be able too. The only exception is that the dedup
bit is cleared in ddt_phys_free() which is required to avoid
leaking space.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8270
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since the new spacemap encoding was ported to ZoL that's no longer
a limitation. This patch updates vdev_is_spacemap_addressable()
that was performing that check.
It also updates the appropriate test to ensure that the same
functionality is tested. The test does so by creating pools that
don't have the new spacemap encoding enabled - just the checkpoint
feature. This patch also reorganizes that same tests in order to
cut in half its memory consumption.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8286
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Increase the default allowed number of reconstruction attempts.
There's not an exact right number for this setting. It needs
to be set large enough to cover any realistic failure scenarios
and small enough to avoid stalling the IO pipeline and invoking
the dead man detection.
The current value of 256 was empirically determined to be too
low based on multi-day runs of ztest. The fault injection code
would inject more damage than could be reconstructed given the
relatively small number of attempts. However, in all observed
cases the block could be reconstructed using a slightly higher
limit.
Based on local testing increasing the default value to 4096 was
determined to strike the best balance. Checking all combinations
takes less than 10s in the worst case, and has so far eliminated
the vast majority of false positives detected by ztest. This
delay is roughly on par with how long retries may be performed
to a misbehaving HDD and was deemed to be reasonable. Better to
err on the side of a brief delay rather than fail to reconstruct
the data.
Lastly, the -Y flag has been added to zdb to make it easy to try all
possible combinations when performing split block reconstruction.
For badly damaged blocks with 18 splits, they can be fully enumerated
within a few minutes. This has been done to ensure permanent errors
are never incorrectly reported when ztest verifies the pool with zdb.
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8271
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, dbuf_read() may decide to create a zio_root which is
used as a parent for any child zios created in dbuf_read_impl().
However, if there is an error in dbuf_read_impl(), this zio is
never executed and ends up leaked. This patch simply ensures
that we always execute the root zio, even i it has no real work
to do.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8267
|
|
|
|
|
|
|
|
|
|
| |
Some minor spelling mistakes and typos. No functional changes.
Reviewed-by: Neal Gompa <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: bunder2015 <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8272
|
|
|
|
|
|
|
|
|
|
|
| |
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8229
|
|
|
|
|
|
|
|
|
|
| |
On full pool when pool root filesystem references very few bytes,
the f_blocks returned to statvfs is 0 but should be at least 1.
Reviewed by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Zuchowski <[email protected]>
Closes #8253
Closes #8254
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.
The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.
The zap_create_* functions have been updated to take advantage of
this new functionality. The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
It's sole caller was converted to use dmu_bonus_hold_by_dnode().
The new symbols have been exported for use by Lustre.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8015
|
|
|
|
|
|
|
|
|
|
| |
This patch simply fixes a small bug where dnode_hold_impl() could
attempt to allocate a dnode that was in the process of being freed,
but which still had active references. This patch simply adds the
required check.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8249
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes a small issue which causes both zfs receive and
rollback operations to incorrectly increase the "filesystem_count"
property value.
This change also adds a new test group "limits" to the ZFS Test Suite
to exercise both filesystem_count/limit and snapshot_count/limit
functionality.
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8232
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Scrubbing is supposed to detect and repair all errors in the pool.
However, it wrongly ignores active spare devices. The problem can
easily be reproduced in OpenZFS at git rev 0ef125d with these
commands:
truncate -s 64m /tmp/a /tmp/b /tmp/c
sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
sudo zpool replace testpool /tmp/a /tmp/c
/bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
sync
sudo zpool scrub testpool
zpool status testpool # Will show 0 errors, which is wrong
sudo zpool offline testpool /tmp/a
sudo zpool scrub testpool
zpool status testpool # Will show errors on /tmp/c,
# which should've already been fixed
FreeBSD head is partially affected: the first scrub will detect
some errors, but the second scrub will detect more. This same
test was run on Linux before applying the fix and the FreeBSD
head behavior was observed.
Authored by: asomers <[email protected]>
Reviewed by: Andy Stormont <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Sponsored by: Spectra Logic Corp
OpenZFS-issue: https://www.illumos.org/issues/8473
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/e20ec8879
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/554675ee
Closes #8251
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM
========
When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.
SOLUTION
=========
This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.
When stopping or cancelling a vdev initialization perform it
in two phases. First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.
On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second. The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.
This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist. The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s. This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.
Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: loli10K <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Wilson <[email protected]>
Closes #8230
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <[email protected]>
Reviewed by: John Wren Kennedy <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: loli10K <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Richard Lowe <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Ported-by: Tim Chase <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
|
|
|
|
|
|
|
|
|
| |
This change is required to ease the transition when upgrading
from 0.7.x to 0.8.x. It allows 0.8.x user space utilities to
remain compatible with 0.7.x and older kernel modules.
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8231
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The dmu_objset_remap_indirects_impl() logic depends on dnode_hold()
returning ENOENT for dnodes which will be freed and should be skipped.
This behavior can only be relied upon when taking a new hold and
while the caller has an open transaction. This ensures that the
open txg cannot advance and that a concurrent free will end up
in the same txg (which is critical). Relying on an existing hold
will not prevent dnode_free() from succeeding.
The solution is to take an additional dnode_hold() after assigning
the transaction. This ensures the remap will never dirty the dnode
if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT).
Randomly set zfs_object_remap_one_indirect_delay_ms in ztest. This
increases the likelihood of an operation racing with the remap.
Converted from ticks to milliseconds.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8215
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping. However, the code is still confusing and error-prone,
because this thread has two responsibilities. We should instead
separate this into two threads each with their own responsibility:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.
Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.
Authored by: Brad Lewis <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Reviewed by: Tim Kordas <[email protected]>
Reviewed by: Tim Chase <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Ported-by: Brad Lewis <[email protected]>
Signed-off-by: Brad Lewis <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9
Closes #8165
|
|
|
|
|
|
|
|
|
|
|
| |
In dfbe2675 zfs_dirty_data_sync was changed to a new tunable named
zfs_dirty_data_sync_percent. Unfortunately, the module parameter
documentation is the code was not updated accordingly. This patch
simply corrects that.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8212
|
|
|
|
|
|
|
|
|
|
|
| |
This patch simply removes an invalid assert from the zap_update()
function. The ASSERT is invalid because it does not hold the zap
lock from the time it fetches the old value to the time it confirms
that it is what it should be.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8209
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Porting Notes:
* Additional changes to recv_rename_impl() were required due to
encryption code not being merged in OpenZFS yet.
* libzfs_core python bindings (pyzfs) were updated to fully support
both lzc_rename() and lzc_destroy()
Authored by: Andriy Gapon <[email protected]>
Reviewed by: Andy Stormont <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: loli10K <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9630
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63
Closes #8207
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8193
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux kstat IO and TIMER printed values as signed. However the counters
only increment. Thus humans looking at the data can be confused when
the counters roll over.
Note: The recommended use of these values is to monitor the derivative,
which don't really care about the sign. See explanations related to
non-negative derivatives in the various time-series databases.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Richard Elling <[email protected]>
Closes #8131
Closes #8198
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Macro ZFS_MINOR, introduced in commit a6cc9756 to record the chosen
static minor number for /dev/zfs, conflicts with an existing macro
in Lustre. The lustre macro (along with _MAJOR, _PATCH, _FIX) is
used to record the zfsonlinux version Lustre is being built against.
Since the Lustre macro came first, and is used in past versions of
lustre at least going back to 2.10, it makes sense to rename the
macro in ZFS instead of doing so in Lustre which would require
backporting the patch.
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #8195
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.
Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).
Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.
The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).
For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.
Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.
This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).
Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).
Porting Notes:
* Reconciled conflicts with the fastwrite feature.
Authored by: Prakash Surya <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Patrick Mooney <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Approved by: Joshua M. Clulow <[email protected]>
Ported-by: Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Porting Notes:
* Add options to zfs-module-parameters(5) man page.
* zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since
the latter doesn't get built for user space.
Authored by: Prakash Surya <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Patrick Mooney <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: George Melikov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9963
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125
Closes #8186
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: George Melikov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9993
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2258ad0b
Closes #8185
|
|
|
|
|
|
|
|
|
| |
This patch simply ensures that scn->scn_prefetch_queue is emptied
before the kernel module is unloaded and when scanning completes.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8178
|