aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* zfs_mount_all_mountpoints: cleanup_all should leave pool root mountedToomas Soome2021-01-021-0/+2
| | | | | | | | | if pool root is not mounted, then zpool umount in next test will leave dataset mountpoint directory around and next zfs mount -a will fail with error: cannot mount '/testpool': directory is not empty Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Toomas Soome <[email protected]> Closes #11417
* VZ 7 kernel compat: introduce ITER-enabled .direct_IO() via IOVECsKonstantin Khorenko2020-12-301-1/+14
| | | | | | | | | | | | | | | | | Virtuozzo 7 kernels starting 3.10.0-1127.18.2.vz7.163.46 have the following configuration: * no HAVE_VFS_RW_ITERATE * HAVE_VFS_DIRECT_IO_ITER_RW_OFFSET => let's add implementation of zpl_direct_IO() via zpl_aio_{read,write}() in this case. https://bugs.openvz.org/browse/OVZ-7243 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Konstantin Khorenko <[email protected]> Closes #11410 Closes #11411
* Memory leak in zdb:import_checkpointed_state()Matthew Ahrens2020-12-281-2/+5
| | | | | | | | Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11396
* Memory leak in ztest_dmu_objset_own()Matthew Ahrens2020-12-281-1/+5
| | | | | | | | Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11396
* Memory leak in ztest_vdev_attach_detach()Matthew Ahrens2020-12-281-2/+1
| | | | | | | | Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11396
* nvlist leaked in zpool_find_config()Matthew Ahrens2020-12-284-3/+16
| | | | | | | | | | | | | | | | | In `zpool_find_config()`, the `pools` nvlist is leaked. Part of it (a sub-nvlist) is returned in `*configp`, but the callers also leak that. Additionally, in `zdb.c:main()`, the `searchdirs` is leaked. The leaks were detected by ASAN (`configure --enable-asan`). This commit resolves the leaks. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11396
* implicit conversion from 'boolean_t' to 'ds_hold_flags_t'Toomas Soome2020-12-275-18/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | Build error on illumos with gcc 10 did reveal: In function 'dmu_objset_refresh_ownership': ../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'} [-Werror=enum-conversion] 857 | dsl_dataset_disown(ds, decrypt, tag); | ^~~~~~~ cc1: all warnings being treated as errors libzfs_input_check.c: In function 'zfs_ioc_input_tests': libzfs_input_check.c:754:28: error: implicit conversion from 'enum dmu_objset_type' to 'enum lzc_dataset_type' [-Werror=enum-conversion] 754 | err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0); | ^~~~~~~~~~~ cc1: all warnings being treated as errors The same issue is present in openzfs, and also the same issue about ds_hold_flags_t, which currently defines exactly one valid value. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Toomas Soome <[email protected]> Closes #11406
* Linux 5.11 compat: blk_{un}register_region()Brian Behlendorf2020-12-274-78/+0
| | | | | | | | | | | | As of 5.11 the blk_register_region() and blk_unregister_region() functions have been retired. This isn't a problem since add_disk() has implicitly allocated minor numbers for a very long time. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: revalidate_disk_size()Brian Behlendorf2020-12-273-7/+31
| | | | | | | | | | | | | | | Both revalidate_disk_size() and revalidate_disk() have been removed. Functionally this isn't a problem because we only relied on these functions to call zvol_revalidate_disk() for us and to perform any additional handling which might be needed for that kernel version. When neither are available we know there's no additional handling needed and we can directly call zvol_revalidate_disk(). Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: bdev_whole()Brian Behlendorf2020-12-272-4/+37
| | | | | | | | | | | | | The bd_contains member was removed from the block_device structure. Callers needing to determine if a vdev is a whole block device should use the new bdev_whole() wrapper. For older kernels we provide our own bdev_whole() wrapper which relies on bd_contains for compatibility. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: bio_start_io_acct() / bio_end_io_acct()Brian Behlendorf2020-12-273-39/+92
| | | | | | | | | | | | | | | | | | | The generic IO accounting functions have been removed in favor of the bio_start_io_acct() and bio_end_io_acct() functions which provide a better interface. These new functions were introduced in the 5.8 kernels but it wasn't until the 5.11 kernel that the previous generic IO accounting interfaces were removed. This commit updates the blk_generic_*_io_acct() wrappers to provide and interface similar to the updated kernel interface. It's slightly different because for older kernels we need to pass the request queue as well as the bio. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: lookup_bdev()Brian Behlendorf2020-12-273-30/+74
| | | | | | | | | | | | | | | The lookup_bdev() function has been updated to require a dev_t be passed as the second argument. This is actually pretty nice since the major number stored in the dev_t was the only part we were interested in. This allows to us avoid handling the bdev entirely. The vdev_lookup_bdev() wrapper was updated to emulate the behavior of the new lookup_bdev() for all supported kernels. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: conftestBrian Behlendorf2020-12-276-17/+28
| | | | | | | | | | | | Update the ZFS_LINUX_TEST_PROGRAM macro to always set the module license. As of the 5.11 kernel not setting a license has been converted from a warning to an error. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* dbufstat: Fix warnings with Python 3.8Ryan Moeller2020-12-231-2/+2
| | | | | | | Replace "is" with "==" and "is not" with "!=". Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11394
* Linux 5.10 compat: METABrian Behlendorf2020-12-231-1/+1
| | | | | | | | Increase the Linux-Maximum version in the META file to 5.10. All of the required compatibility patches have been merged. Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11391
* Remove unused check from dmu_tx_count_write()Brian Behlendorf2020-12-211-3/+0
| | | | | | | | | | | | | | Individual transactions may not be larger than DMU_MAX_ACCESS. This is enforced by the assertions in dmu_tx_hold_write() and dmu_tx_hold_write_by_dnode(). There's an additional check in dmu_tx_count_write() however it has no effect and only sets a local err variable. We could enable this check, however since it's already enforced by ASSERTs elsewhere I opted to remove it instead. Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3731 Closes #11384
* zfs-kmods: install to /lib/modules instead of /usr/lib/modulesChristian Schwarz2020-12-211-4/+0
| | | | | | | | | | | | | | | | | | Before this patch, dracut wouldn't find zfs.ko for inclusion in initramfs. This was caused by the packages installing in to /lib/modules instead of /usr/lib/modules. Correcting this allows dracut to do the right thing, even without # /etc/dracut.conf add_drivers+=" zfs " Notably, rpm/redhat/zfs-kmod.spec.in does not contain the definition of the `prefix` macro that this commit removes in the generic kmod spec. And https://rpmfusion.org/Packaging/KernelModules/Kmods2 does not mention `prefix` at all. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #11381
* Forward questions to github discussionsKjeld Schouten-Lebbing2020-12-212-37/+3
| | | | | | | | Instead of creating issues with type "question" Forward to the GitHub Discussion system. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Kjeld Schouten-Lebbing <[email protected]> Closes #11383
* Dangling reference from dmu_objset_upgradeAndy Fiddaman2020-12-211-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | After porting the fix for https://github.com/openzfs/zfs/issues/5295 over to illumos, we started hitting an assertion failure when running the testsuite: assertion failed: rc->rc_count == number, file: .../refcount.c and the unexpected hold has this stack: dsl_dataset_long_hold+0x59 dmu_objset_upgrade+0x73 dmu_objset_id_quota_upgrade+0x15 dmu_objset_own+0x14f The simplest reproducer for this in illumos is zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool which is run as part of the zpool_create_tempname test, but I can't get this to trigger on FreeBSD. This appears to be because of the call to txg_wait_synced() in dmu_objset_upgrade_stop() (which was missing in illumos), slows down dmu_objset_disown() enough to avoid the condition. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andy Fiddaman <[email protected]> Closes #11368
* Linux 4.18.0-257.el8 compat: blk_alloc_queue()Brian Behlendorf2020-12-212-17/+58
| | | | | | | | | | | | | | | | | | | | | The CentOS stream 4.18.0-257 kernel appears to have backported the Linux 5.9 change to make_request_fn and the associated API. To maintain weak modules compatibility the original symbol was retained and the new interface blk_alloc_queue_rh() was added. Unfortunately, blk_alloc_queue() was replaced in the blkdev.h header by blk_alloc_queue_bh() so there doesn't seem to be a way to build new kmods against the old interfces. Even though they appear to still be available for weak module binding. To accommodate this a configure check is added for the new _rh() variant of the function and used if available. If compatibility code gets added to the kernel for the original blk_alloc_queue() interface this should be fine. OpenZFS will simply continue to prefer the new interface and only fallback to blk_alloc_queue() when blk_alloc_queue_rh() isn't available. Signed-off-by: Brian Behlendorf <[email protected]> Closes #11374
* Fix maybe uninitialized variable warningBrian Behlendorf2020-12-201-1/+1
| | | | | | | | | | | | | Commit 1c2358c12 restructured this code and introduced a warning about the variable maybe not being initialized. This cannot happen with the updated code but we should initialize the variable anyway to silence the warning. zpl_file.c: In function ‘zpl_iter_write’: zpl_file.c:324:9: warning: ‘count’ may be used uninitialized in this function [-Wmaybe-uninitialized] Signed-off-by: Brian Behlendorf <[email protected]> Closes #11373
* Remove iov_iter_advance() from iter_readBrian Behlendorf2020-12-201-3/+0
| | | | | | | | | | | | | There's no need to call iov_iter_advance() in zpl_iter_read(). This was preserved from the previous code where it wasn't needed but also didn't cause any problems. Now that the iter functions also handle pipes that's no longer the case. When fully reading a pipe buffer iov_iter_advance() may results in the pipe buf release function being called which will not be registered resulting in a NULL dereference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #11375 Closes #11378
* dsl_pool: extend comment on DSL Pool Configuration LockChristian Schwarz2020-12-191-2/+29
| | | | | | | | Based on a conversation with Matt on the OpenZFS Slack. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #11370
* Linux 5.10 compat: also zvol_revalidate_disk()Michael D Labriola2020-12-181-2/+3
| | | | | | | | | | | | Commit 59b68723 added a configure check for 5.10, which removed revalidate_disk(), and conditionally replaced it's usage with a call to the new revalidate_disk_size() function. However, the old function also invoked the device's registered callback, in our case zvol_revalidate_disk(). This commit adds a call to zvol_revalidate_disk() in zvol_update_volsize() to make sure the code path stays the same. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Michael D Labriola <[email protected]> Closes #11358
* ZED/zfs-list-cacher.sh: don't exit on ignored event typePrawn2020-12-181-1/+1
| | | | | | | | | | | | | | Check for the history_event type instead. The zfs-list-cacher.sh script currently respects the event types excluded from syslog(!) in ZED_SYSLOG_SUBCLASS_EXCLUDE. This makes little sense in this single-purpose script and silently breaks when history_events are excluded from syslog, which is the default since 13d65987a9d9958de77422f5d9d25b47e486537d. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: InsanePrawn <[email protected]> Closes #11164 Closes #11347
* Linux 5.10 compat: use iov_iter in uio structureBrian Behlendorf2020-12-1816-321/+576
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of the 5.10 kernel the generic splice compatibility code has been removed. All filesystems are now responsible for registering a ->splice_read and ->splice_write callback to support this operation. The good news is the VFS provided generic_file_splice_read() and iter_file_splice_write() callbacks can be used provided the ->iter_read and ->iter_write callback support pipes. However, this is currently not the case and only iovecs and bvecs (not pipes) are ever attached to the uio structure. This commit changes that by allowing full iov_iter structures to be attached to uios. Ever since the 4.9 kernel the iov_iter structure has supported iovecs, kvecs, bvevs, and pipes so it's desirable to pass the entire thing when possible. In conjunction with this the uio helper functions (i.e uiomove(), uiocopy(), etc) have been updated to understand the new UIO_ITER type. Note that using the kernel provided uio_iter interfaces allowed the existing Linux specific uio handling code to be simplified. When there's no longer a need to support kernel's older than 4.9, then it will be possible to remove the iovec and bvec members from the uio structure and always use a uio_iter. Until then we need to maintain all of the existing types for older kernels. Some additional refactoring and cleanup was included in this change: - Added checks to configure to detect available iov_iter interfaces. Some are available all the way back to the 3.10 kernel and are used when available. In particular, uio_prefaultpages() now always uses iov_iter_fault_in_readable() which is available for all supported kernels. - The unused UIO_USERISPACE type has been removed. It is no longer needed now that the uio_seg enum is platform specific. - Moved zfs_uio.c from the zcommon.ko module to the Linux specific platform code for the zfs.ko module. This gets it out of libzfs where it was never needed and keeps this Linux specific code out of the common sources. - Removed unnecessary O_APPEND handling from zfs_iter_write(), this is redundant and O_APPEND is already handled in zfs_write(); Reviewed-by: Colin Ian King <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11351
* ZTS: Simplify zpool_initialize_verify_initializedBrian Behlendorf2020-12-181-24/+25
| | | | | | | | | | | | | | | | | | Consider the test to be a success as long as the initializing pattern is found at least once per metaslab. This indicates that at least part of the free space was initialized. Ideally we'd check that the pattern was written to all free space but that's much trickier so this check is a reasonable compromise. Using a here-string to feed the loop in this test causes an empty string to still trigger the loop so we miss the `spacemaps=0` case. Pipe into the loop instead. While here, we can use `zpool wait -t initialize $TESTPOOL` to wait for the pool to initialize. Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11365
* special device removal space accounting fixesMatthew Ahrens2020-12-174-35/+43
| | | | | | | | | | | | | | | | The space in special devices is not included in spa_dspace (or dsl_pool_adjustedsize(), or the zfs `available` property). Therefore there is always at least as much free space in the normal class, as there is allocated in the special class(es). And therefore, there is always enough free space to remove a special device. However, the checks for free space when removing special devices did not take this into account. This commit corrects that. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11329
* Avoid extra work updating ARC kstats and tunablesRyan Moeller2020-12-171-16/+9
| | | | | | | | | | | After e357046 it should not be necessary to periodically update ARC kstats and tunables. Tunable updates are applied when modified, and kstats are updated on demand. Update kstats in `arc_evict_cb_check()` for `ZFS_DEBUG` builds only. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11237
* Use the correct return type for getoptsterlingjensen2020-12-175-5/+5
| | | | | | | | Use the correct return type for getopt otherwise clang complains about tautological-constant-out-of-range-compare. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Sterling Jensen <[email protected]> Closes #11359
* Only examine best metaslabs on each vdev Matthew Ahrens2020-12-164-55/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a system with very high fragmentation, we may need to do lots of gang allocations (e.g. most indirect block allocations (~50KB) may need to gang). Before failing a "normal" allocation and resorting to ganging, we try every metaslab. This has the impact of loading every metaslab (not a huge deal since we now typically keep all metaslabs loaded), and also iterating over every metaslab for every failing allocation. If there are many metaslabs (more than the typical ~200, e.g. due to vdev expansion or very large vdevs), the CPU cost of this iteration can be very impactful. This iteration is done with the mg_lock held, creating long hold times and high lock contention for concurrent allocations, ultimately causing long txg sync times and poor application performance. To address this, this commit changes the behavior of "normal" (not try_hard, not ZIL) allocations. These will now only examine the 100 best metaslabs (as determined by their ms_weight). If none of these have a large enough free segment, then the allocation will fail and we'll fall back on ganging. To accomplish this, we will now (normally) gang before doing a `try_hard` allocation. Non-try_hard allocations will only examine the 100 best metaslabs of each vdev. In summary, we will first try normal allocation. If that fails then we will do a gang allocation. If that fails then we will do a "try hard" gang allocation. If that fails then we will have a multi-layer gang block. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11327
* Make metaslab class rotor and aliquot per-allocator.Alexander Motin2020-12-157-100/+116
| | | | | | | | | | | | | | | | | | | | | | | | | | Metaslab rotor and aliquot are used to distribute workload between vdevs while keeping some locality for logically adjacent blocks. Once multiple allocators were introduced to separate allocation of different objects it does not make much sense for different allocators to write into different metaslabs of the same metaslab group (vdev) same time, competing for its resources. This change makes each allocator choose metaslab group independently, colliding with others only sporadically. Test including simultaneous write into 4 files with recordsize of 4KB on a striped pool of 30 disks on a system with 40 logical cores show reduction of vdev queue lock contention from 54 to 27% due to better load distribution. Unfortunately it won't help much ZVOLs yet since only one dataset/ZVOL is synced at a time, and so for the most part only one allocator is used, but it may improve later. While there, to reduce the number of pointer dereferences change per-allocator storage for metaslab classes and groups from several separate malloc()'s to variable length arrays at the ends of the original class and group structures. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #11288
* DKMS: Disable weak modulesgregory-lee-bartholomew2020-12-151-0/+1
| | | | | | | | | | | | Fedora does not guarantee a stable kABI, so weak modules should be dis- abled. See the dkms man page for a more detailed explanation of the weak module feature. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gregory Bartholomew <[email protected]> Closes #9891 Closes #11128 Closes #11242 Closes #11335
* lua: avoid gcc -Wreturn-local-addr bugRyan Libby2020-12-151-4/+6
| | | | | | | | | | | | | | | Avoid a bug with gcc's -Wreturn-local-addr warning with some obfuscation. In buggy versions of gcc, if a return value is an expression that involves the address of a local variable, and even if that address is legally converted to a non-pointer type, a warning may be emitted and the value of the address may be replaced with zero. Howerver, buggy versions don't emit the warning or replace the value when simply returning a local variable of non-pointer type. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90737 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Libby <[email protected]> Closes #11337
* spa: avoid type narrowing warningRyan Libby2020-12-151-1/+1
| | | | | | | | | | | | | Building the spa module for i386 caused gcc to emit -Wint-to-pointer-cast "cast to pointer from integer of different size" because spa.spa_did was uint64_t but pthread_join (via thread_join in spa_deactivate) takes a pointer (32-bit on i386). Define spa_did to be pointer-size instead. For now spa_did is in fact never non-zero and the thread_join could instead be ifdef'd out, but changing the size of spa_did may be more useful for the future. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Libby <[email protected]> Closes #11336
* FreeBSD libzfs: gcc requires __thread after staticRyan Libby2020-12-141-1/+1
| | | | | | | | | | | | Building libzfs with gcc on FreeBSD failed because gcc is picky about the order of keywords in declarations with __thread, whereas clang is more relaxed. https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ryan Libby <[email protected]> Closes #11331
* dmu_zfetch: fix memory leakMatthew Macy2020-12-122-4/+4
| | | | | | | | | | The last change caused the read completion callback to not be called if the IO was still in progress. This change restores allocation of the arc buf callback, but in the callback path checks the new acb_nobuf field to know to skip buffer allocation. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #11324
* Fix reporting of CKSUM errors in indirect vdevsGeorge Amanakis2020-12-114-11/+97
| | | | | | | | | | | | | When removing and subsequently reattaching a vdev, CKSUM errors may occur as vdev_indirect_read_all() reads from all children of a mirror in case of a resilver. Fix this by checking whether a child is missing the data and setting a flag (ic_error) which is then checked in vdev_indirect_repair() and suppresses incrementing the checksum counter. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #11277
* Remove draid.d symlink from zfs_helpers.shBrian Behlendorf2020-12-111-3/+0
| | | | | | | | | | In an earlier revision of dRAID there existed an /etc/zfs/draid.d directory. This was removed before the final version was integrated but a little bit was accidentally overlooked in the zfs_helpers.sh script. Remove this remnant. Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11326
* arc_summary3: Handle overflowing value widthRyan Moeller2020-12-111-2/+6
| | | | | | | | | | | | | | Some tunables shown by arc_summary3 have string values that may exceed the normal line length, leaving a negative offset between the name and value fields. The negative space is of course not valid and Python rightly barfs up an exception traceback. Handle an overflowing value field width by ignoring the line length and separating the name from the value by a single space instead. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11270
* FreeBSD: Implement sysctl for fletcher4 implRyan Moeller2020-12-113-8/+75
| | | | | | | | | | | | There is a tunable to select the fletcher 4 checksum implementation on Linux but it was not present in FreeBSD. Implement the sysctl handler for FreeBSD and use ZFS_MODULE_PARAM_CALL to provide the tunable on both platforms. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11270
* Improve zfs receive performance with lightweight writeMatthew Ahrens2020-12-1110-267/+540
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The performance of `zfs receive` can be bottlenecked on the CPU consumed by the `receive_writer` thread, especially when receiving streams with small compressed block sizes. Much of the CPU is spent creating and destroying dbuf's and arc buf's, one for each `WRITE` record in the send stream. This commit introduces the concept of "lightweight writes", which allows `zfs receive` to write to the DMU by providing an ABD, and instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this "dirty leaf block" are not instantiated. Because there is no dbuf with the dirty data, this mechanism doesn't support reading from "lightweight-dirty" blocks (they would see the on-disk state rather than the dirty data). Since the dedup-receive code has been removed, `zfs receive` is write-only, so this works fine. Because there are no arc bufs for the received data, the received data is no longer cached in the ARC. Testing a receive of a stream with average compressed block size of 4KB, this commit improves performance by 50%, while also reducing CPU usage by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was. Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35% New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0% The code is also restructured in a few ways: Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies some existing code that no longer needs `DB_DNODE_ENTER()` and related routines. The new field is needed by the lightweight-type dirty record. To ensure that the `dr_dnode` field remains valid until the dirty record is freed, we have to ensure that the `dnode_move()` doesn't relocate the dnode_t. To do this we keep a hold on the dnode until it's zio's have completed. This is already done by the user-accounting code (`userquota_updates_task()`), this commit extends that so that it always keeps the dnode hold until zio completion (see `dnode_rele_task()`). `dn_dirty_txg` was previously zeroed when the dnode was synced. This was not necessary, since its meaning can be "when was this dnode last dirtied". This change simplifies the new `dnode_rele_task()` code. Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11105
* Fix kernel panic induced by redacted sendPaul Dagnelie2020-12-111-61/+53
| | | | | | | | | | | | | | | | | In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #11297
* FreeBSD: Fix format of vfs.zfs.arc_no_grow_shiftRyan Moeller2020-12-101-6/+5
| | | | | | | | | | | | | vfs.zfs.arc_no_grow_shift has an invalid type (15) and this causes py-sysctl to format it as a bytearray when it should be an integer. "U" is not a valid format, it should be "I" and the type should match the variable type, int. We can return EINVAL if the value is set below zero. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11318
* FreeBSD: Update usage of py-sysctlRyan Moeller2020-12-103-28/+26
| | | | | | | | | | | | py-sysctl now includes the CTLTYPE_NODE type nodes in the list returned by sysctl.filter() on FreeBSD head. It also provides descriptions now. Eliminate the subprocess call to get descriptions, and filter out the nodes so we only deal with values. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11318
* Fix possibly uninitialized 'root_inode' variable warningBrian Behlendorf2020-12-101-1/+1
| | | | | | | | | | | | Resolve an uninitialized variable warning when compiling. In function ‘zfs_domount’: warning: ‘root_inode’ may be used uninitialized in this function [-Wmaybe-uninitialized] sb->s_root = d_make_root(root_inode); Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11306
* Implement memory and CPU hotplugPaul Dagnelie2020-12-1014-36/+290
| | | | | | | | | | | | | | ZFS currently doesn't react to hotplugging cpu or memory into the system in any way. This patch changes that by adding logic to the ARC that allows the system to take advantage of new memory that is added for caching purposes. It also adds logic to the taskq infrastructure to support dynamically expanding the number of threads allocated to a taskq. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Matthew Ahrens <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #11212
* CI: add zloop workflowBrian Behlendorf2020-12-101-0/+67
|\ | | | | | | | | | | | | | | Run ztest via zloop for 20 minutes, total run time is ~30 minutes. Reviewed-by: Kjeld Schouten <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #11319
| * CI: add zloop workflowGeorge Melikov2020-12-101-0/+67
| | | | | | | | | | | | Run ztest via zloop for 20 minutes, total run time is ~30 minutes. Signed-off-by: George Melikov <[email protected]>
* | FreeBSD: Do zcommon_init sooner to avoid FPU panicRyan Moeller2020-12-093-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There has been a panic affecting some system configurations where the thread FPU context is disturbed during the fletcher 4 benchmarks, leading to a panic at boot. module_init() registers zcommon_init to run in the last subsystem (SI_SUB_LAST). Running it as soon as interrupts have been configured (SI_SUB_INT_CONFIG_HOOKS) makes sure we have finished the benchmarks before we start doing other things. While it's not clear *how* the FPU context was being disturbed, this does seem to avoid it. Add a module_init_early() macro to run zcommon_init() at this earlier point on FreeBSD. On Linux this is defined as module_init(). Authored by: Konstantin Belousov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11302