| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
if pool root is not mounted, then zpool umount in next test will leave
dataset mountpoint directory around and next zfs mount -a will fail
with error: cannot mount '/testpool': directory is not empty
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Toomas Soome <[email protected]>
Closes #11417
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Virtuozzo 7 kernels starting 3.10.0-1127.18.2.vz7.163.46
have the following configuration:
* no HAVE_VFS_RW_ITERATE
* HAVE_VFS_DIRECT_IO_ITER_RW_OFFSET
=> let's add implementation of zpl_direct_IO() via
zpl_aio_{read,write}() in this case.
https://bugs.openvz.org/browse/OVZ-7243
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Konstantin Khorenko <[email protected]>
Closes #11410
Closes #11411
|
|
|
|
|
|
|
|
| |
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11396
|
|
|
|
|
|
|
|
| |
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11396
|
|
|
|
|
|
|
|
| |
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11396
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In `zpool_find_config()`, the `pools` nvlist is leaked. Part of it (a
sub-nvlist) is returned in `*configp`, but the callers also leak that.
Additionally, in `zdb.c:main()`, the `searchdirs` is leaked.
The leaks were detected by ASAN (`configure --enable-asan`).
This commit resolves the leaks.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11396
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Build error on illumos with gcc 10 did reveal:
In function 'dmu_objset_refresh_ownership':
../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion
from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'}
[-Werror=enum-conversion]
857 | dsl_dataset_disown(ds, decrypt, tag);
| ^~~~~~~
cc1: all warnings being treated as errors
libzfs_input_check.c: In function 'zfs_ioc_input_tests':
libzfs_input_check.c:754:28: error: implicit conversion from
'enum dmu_objset_type' to 'enum lzc_dataset_type'
[-Werror=enum-conversion]
754 | err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
| ^~~~~~~~~~~
cc1: all warnings being treated as errors
The same issue is present in openzfs, and also the same issue about
ds_hold_flags_t, which currently defines exactly one valid value.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Toomas Soome <[email protected]>
Closes #11406
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of 5.11 the blk_register_region() and blk_unregister_region()
functions have been retired. This isn't a problem since add_disk()
has implicitly allocated minor numbers for a very long time.
Reviewed-by: Rafael Kitover <[email protected]>
Reviewed-by: Coleman Kane <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11387
Closes #11390
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Both revalidate_disk_size() and revalidate_disk() have been removed.
Functionally this isn't a problem because we only relied on these
functions to call zvol_revalidate_disk() for us and to perform any
additional handling which might be needed for that kernel version.
When neither are available we know there's no additional handling
needed and we can directly call zvol_revalidate_disk().
Reviewed-by: Rafael Kitover <[email protected]>
Reviewed-by: Coleman Kane <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11387
Closes #11390
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The bd_contains member was removed from the block_device structure.
Callers needing to determine if a vdev is a whole block device should
use the new bdev_whole() wrapper. For older kernels we provide our
own bdev_whole() wrapper which relies on bd_contains for compatibility.
Reviewed-by: Rafael Kitover <[email protected]>
Reviewed-by: Coleman Kane <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11387
Closes #11390
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The generic IO accounting functions have been removed in favor of the
bio_start_io_acct() and bio_end_io_acct() functions which provide a
better interface. These new functions were introduced in the 5.8
kernels but it wasn't until the 5.11 kernel that the previous generic
IO accounting interfaces were removed.
This commit updates the blk_generic_*_io_acct() wrappers to provide
and interface similar to the updated kernel interface. It's slightly
different because for older kernels we need to pass the request queue
as well as the bio.
Reviewed-by: Rafael Kitover <[email protected]>
Reviewed-by: Coleman Kane <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11387
Closes #11390
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The lookup_bdev() function has been updated to require a dev_t
be passed as the second argument. This is actually pretty nice
since the major number stored in the dev_t was the only part we
were interested in. This allows to us avoid handling the bdev
entirely. The vdev_lookup_bdev() wrapper was updated to emulate
the behavior of the new lookup_bdev() for all supported kernels.
Reviewed-by: Rafael Kitover <[email protected]>
Reviewed-by: Coleman Kane <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11387
Closes #11390
|
|
|
|
|
|
|
|
|
|
|
|
| |
Update the ZFS_LINUX_TEST_PROGRAM macro to always set the module
license. As of the 5.11 kernel not setting a license has been
converted from a warning to an error.
Reviewed-by: Rafael Kitover <[email protected]>
Reviewed-by: Coleman Kane <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11387
Closes #11390
|
|
|
|
|
|
|
| |
Replace "is" with "==" and "is not" with "!=".
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11394
|
|
|
|
|
|
|
|
| |
Increase the Linux-Maximum version in the META file to 5.10.
All of the required compatibility patches have been merged.
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11391
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Individual transactions may not be larger than DMU_MAX_ACCESS.
This is enforced by the assertions in dmu_tx_hold_write() and
dmu_tx_hold_write_by_dnode(). There's an additional check in
dmu_tx_count_write() however it has no effect and only sets a
local err variable. We could enable this check, however since
it's already enforced by ASSERTs elsewhere I opted to remove it
instead.
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3731
Closes #11384
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before this patch, dracut wouldn't find zfs.ko for inclusion in
initramfs. This was caused by the packages installing in to
/lib/modules instead of /usr/lib/modules. Correcting this allows
dracut to do the right thing, even without
# /etc/dracut.conf
add_drivers+=" zfs "
Notably, rpm/redhat/zfs-kmod.spec.in does not contain the definition of
the `prefix` macro that this commit removes in the generic kmod spec.
And https://rpmfusion.org/Packaging/KernelModules/Kmods2 does not
mention `prefix` at all.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Christian Schwarz <[email protected]>
Closes #11381
|
|
|
|
|
|
|
|
| |
Instead of creating issues with type "question"
Forward to the GitHub Discussion system.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Kjeld Schouten-Lebbing <[email protected]>
Closes #11383
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After porting the fix for https://github.com/openzfs/zfs/issues/5295
over to illumos, we started hitting an assertion failure when running
the testsuite:
assertion failed: rc->rc_count == number, file: .../refcount.c
and the unexpected hold has this stack:
dsl_dataset_long_hold+0x59 dmu_objset_upgrade+0x73
dmu_objset_id_quota_upgrade+0x15 dmu_objset_own+0x14f
The simplest reproducer for this in illumos is
zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool
which is run as part of the zpool_create_tempname test, but I can't get
this to trigger on FreeBSD. This appears to be because of the call to
txg_wait_synced() in dmu_objset_upgrade_stop() (which was missing in
illumos), slows down dmu_objset_disown() enough to avoid the condition.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Andy Fiddaman <[email protected]>
Closes #11368
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The CentOS stream 4.18.0-257 kernel appears to have backported
the Linux 5.9 change to make_request_fn and the associated API.
To maintain weak modules compatibility the original symbol was
retained and the new interface blk_alloc_queue_rh() was added.
Unfortunately, blk_alloc_queue() was replaced in the blkdev.h
header by blk_alloc_queue_bh() so there doesn't seem to be a way
to build new kmods against the old interfces. Even though they
appear to still be available for weak module binding.
To accommodate this a configure check is added for the new _rh()
variant of the function and used if available. If compatibility
code gets added to the kernel for the original blk_alloc_queue()
interface this should be fine. OpenZFS will simply continue to
prefer the new interface and only fallback to blk_alloc_queue()
when blk_alloc_queue_rh() isn't available.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11374
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 1c2358c12 restructured this code and introduced a warning
about the variable maybe not being initialized. This cannot happen
with the updated code but we should initialize the variable anyway
to silence the warning.
zpl_file.c: In function ‘zpl_iter_write’:
zpl_file.c:324:9: warning: ‘count’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11373
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's no need to call iov_iter_advance() in zpl_iter_read().
This was preserved from the previous code where it wasn't needed
but also didn't cause any problems. Now that the iter functions
also handle pipes that's no longer the case. When fully reading a
pipe buffer iov_iter_advance() may results in the pipe buf release
function being called which will not be registered resulting in
a NULL dereference.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11375
Closes #11378
|
|
|
|
|
|
|
|
| |
Based on a conversation with Matt on the OpenZFS Slack.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Christian Schwarz <[email protected]>
Closes #11370
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 59b68723 added a configure check for 5.10, which removed
revalidate_disk(), and conditionally replaced it's usage with a call to
the new revalidate_disk_size() function. However, the old function also
invoked the device's registered callback, in our case
zvol_revalidate_disk(). This commit adds a call to zvol_revalidate_disk()
in zvol_update_volsize() to make sure the code path stays the same.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Michael D Labriola <[email protected]>
Closes #11358
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Check for the history_event type instead.
The zfs-list-cacher.sh script currently respects the event types
excluded from syslog(!) in ZED_SYSLOG_SUBCLASS_EXCLUDE.
This makes little sense in this single-purpose script and
silently breaks when history_events are excluded from syslog,
which is the default since 13d65987a9d9958de77422f5d9d25b47e486537d.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: InsanePrawn <[email protected]>
Closes #11164
Closes #11347
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As of the 5.10 kernel the generic splice compatibility code has been
removed. All filesystems are now responsible for registering a
->splice_read and ->splice_write callback to support this operation.
The good news is the VFS provided generic_file_splice_read() and
iter_file_splice_write() callbacks can be used provided the ->iter_read
and ->iter_write callback support pipes. However, this is currently
not the case and only iovecs and bvecs (not pipes) are ever attached
to the uio structure.
This commit changes that by allowing full iov_iter structures to be
attached to uios. Ever since the 4.9 kernel the iov_iter structure
has supported iovecs, kvecs, bvevs, and pipes so it's desirable to
pass the entire thing when possible. In conjunction with this the
uio helper functions (i.e uiomove(), uiocopy(), etc) have been
updated to understand the new UIO_ITER type.
Note that using the kernel provided uio_iter interfaces allowed the
existing Linux specific uio handling code to be simplified. When
there's no longer a need to support kernel's older than 4.9, then
it will be possible to remove the iovec and bvec members from the
uio structure and always use a uio_iter. Until then we need to
maintain all of the existing types for older kernels.
Some additional refactoring and cleanup was included in this change:
- Added checks to configure to detect available iov_iter interfaces.
Some are available all the way back to the 3.10 kernel and are used
when available. In particular, uio_prefaultpages() now always uses
iov_iter_fault_in_readable() which is available for all supported
kernels.
- The unused UIO_USERISPACE type has been removed. It is no longer
needed now that the uio_seg enum is platform specific.
- Moved zfs_uio.c from the zcommon.ko module to the Linux specific
platform code for the zfs.ko module. This gets it out of libzfs
where it was never needed and keeps this Linux specific code out
of the common sources.
- Removed unnecessary O_APPEND handling from zfs_iter_write(), this
is redundant and O_APPEND is already handled in zfs_write();
Reviewed-by: Colin Ian King <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11351
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consider the test to be a success as long as the initializing pattern
is found at least once per metaslab. This indicates that at least
part of the free space was initialized. Ideally we'd check that the
pattern was written to all free space but that's much trickier so this
check is a reasonable compromise.
Using a here-string to feed the loop in this test causes an empty
string to still trigger the loop so we miss the `spacemaps=0` case.
Pipe into the loop instead.
While here, we can use `zpool wait -t initialize $TESTPOOL` to wait for
the pool to initialize.
Co-authored-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11365
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The space in special devices is not included in spa_dspace (or
dsl_pool_adjustedsize(), or the zfs `available` property). Therefore
there is always at least as much free space in the normal class, as
there is allocated in the special class(es). And therefore, there is
always enough free space to remove a special device.
However, the checks for free space when removing special devices did not
take this into account. This commit corrects that.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11329
|
|
|
|
|
|
|
|
|
|
|
| |
After e357046 it should not be necessary to periodically update ARC
kstats and tunables. Tunable updates are applied when modified, and
kstats are updated on demand.
Update kstats in `arc_evict_cb_check()` for `ZFS_DEBUG` builds only.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11237
|
|
|
|
|
|
|
|
| |
Use the correct return type for getopt otherwise clang complains
about tautological-constant-out-of-range-compare.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Sterling Jensen <[email protected]>
Closes #11359
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab. This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be very
impactful. This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.
To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations. These will now only examine the 100
best metaslabs (as determined by their ms_weight). If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.
To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation. Non-try_hard allocations will only examine the
100 best metaslabs of each vdev. In summary, we will first try normal
allocation. If that fails then we will do a gang allocation. If that
fails then we will do a "try hard" gang allocation. If that fails then
we will have a multi-layer gang block.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11327
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Metaslab rotor and aliquot are used to distribute workload between
vdevs while keeping some locality for logically adjacent blocks. Once
multiple allocators were introduced to separate allocation of different
objects it does not make much sense for different allocators to write
into different metaslabs of the same metaslab group (vdev) same time,
competing for its resources. This change makes each allocator choose
metaslab group independently, colliding with others only sporadically.
Test including simultaneous write into 4 files with recordsize of 4KB
on a striped pool of 30 disks on a system with 40 logical cores show
reduction of vdev queue lock contention from 54 to 27% due to better
load distribution. Unfortunately it won't help much ZVOLs yet since
only one dataset/ZVOL is synced at a time, and so for the most part
only one allocator is used, but it may improve later.
While there, to reduce the number of pointer dereferences change
per-allocator storage for metaslab classes and groups from several
separate malloc()'s to variable length arrays at the ends of the
original class and group structures.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #11288
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fedora does not guarantee a stable kABI, so weak modules should be dis-
abled. See the dkms man page for a more detailed explanation of the weak
module feature.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gregory Bartholomew <[email protected]>
Closes #9891
Closes #11128
Closes #11242
Closes #11335
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Avoid a bug with gcc's -Wreturn-local-addr warning with some
obfuscation. In buggy versions of gcc, if a return value is an
expression that involves the address of a local variable, and even if
that address is legally converted to a non-pointer type, a warning may
be emitted and the value of the address may be replaced with zero.
Howerver, buggy versions don't emit the warning or replace the value
when simply returning a local variable of non-pointer type.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90737
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Libby <[email protected]>
Closes #11337
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Building the spa module for i386 caused gcc to emit
-Wint-to-pointer-cast "cast to pointer from integer of different size"
because spa.spa_did was uint64_t but pthread_join (via thread_join in
spa_deactivate) takes a pointer (32-bit on i386). Define spa_did to be
pointer-size instead. For now spa_did is in fact never non-zero and the
thread_join could instead be ifdef'd out, but changing the size of
spa_did may be more useful for the future.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Libby <[email protected]>
Closes #11336
|
|
|
|
|
|
|
|
|
|
|
|
| |
Building libzfs with gcc on FreeBSD failed because gcc is picky about
the order of keywords in declarations with __thread, whereas clang is
more relaxed.
https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Ryan Libby <[email protected]>
Closes #11331
|
|
|
|
|
|
|
|
|
|
| |
The last change caused the read completion callback to not be called
if the IO was still in progress. This change restores allocation
of the arc buf callback, but in the callback path checks the new
acb_nobuf field to know to skip buffer allocation.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #11324
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When removing and subsequently reattaching a vdev, CKSUM errors may
occur as vdev_indirect_read_all() reads from all children of a mirror
in case of a resilver.
Fix this by checking whether a child is missing the data and setting a
flag (ic_error) which is then checked in vdev_indirect_repair() and
suppresses incrementing the checksum counter.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #11277
|
|
|
|
|
|
|
|
|
|
| |
In an earlier revision of dRAID there existed an /etc/zfs/draid.d
directory. This was removed before the final version was integrated
but a little bit was accidentally overlooked in the zfs_helpers.sh
script. Remove this remnant.
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11326
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some tunables shown by arc_summary3 have string values that may exceed
the normal line length, leaving a negative offset between the name and
value fields. The negative space is of course not valid and Python
rightly barfs up an exception traceback.
Handle an overflowing value field width by ignoring the line length
and separating the name from the value by a single space instead.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11270
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a tunable to select the fletcher 4 checksum implementation on
Linux but it was not present in FreeBSD.
Implement the sysctl handler for FreeBSD and use ZFS_MODULE_PARAM_CALL
to provide the tunable on both platforms.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11270
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11105
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #11297
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
vfs.zfs.arc_no_grow_shift has an invalid type (15) and this causes
py-sysctl to format it as a bytearray when it should be an integer.
"U" is not a valid format, it should be "I" and the type should match
the variable type, int. We can return EINVAL if the value is set below
zero.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11318
|
|
|
|
|
|
|
|
|
|
|
|
| |
py-sysctl now includes the CTLTYPE_NODE type nodes in the list returned
by sysctl.filter() on FreeBSD head. It also provides descriptions now.
Eliminate the subprocess call to get descriptions, and filter out the
nodes so we only deal with values.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11318
|
|
|
|
|
|
|
|
|
|
|
|
| |
Resolve an uninitialized variable warning when compiling.
In function ‘zfs_domount’:
warning: ‘root_inode’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
sb->s_root = d_make_root(root_inode);
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11306
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZFS currently doesn't react to hotplugging cpu or memory into the
system in any way. This patch changes that by adding logic to the ARC
that allows the system to take advantage of new memory that is added
for caching purposes. It also adds logic to the taskq infrastructure
to support dynamically expanding the number of threads allocated to a
taskq.
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Matthew Ahrens <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #11212
|
|\
| |
| |
| |
| |
| |
| |
| | |
Run ztest via zloop for 20 minutes, total run time is ~30 minutes.
Reviewed-by: Kjeld Schouten <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #11319
|
| |
| |
| |
| |
| |
| | |
Run ztest via zloop for 20 minutes, total run time is ~30 minutes.
Signed-off-by: George Melikov <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There has been a panic affecting some system configurations where the
thread FPU context is disturbed during the fletcher 4 benchmarks,
leading to a panic at boot.
module_init() registers zcommon_init to run in the last subsystem
(SI_SUB_LAST). Running it as soon as interrupts have been configured
(SI_SUB_INT_CONFIG_HOOKS) makes sure we have finished the benchmarks
before we start doing other things.
While it's not clear *how* the FPU context was being disturbed, this
does seem to avoid it.
Add a module_init_early() macro to run zcommon_init() at this earlier
point on FreeBSD. On Linux this is defined as module_init().
Authored by: Konstantin Belousov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #11302
|