path: root/module

* The abd child/parent relationship does not need to be tracked
  Matthew Ahrens | 2021-01-30 | 1 file | -1/+17

  ABD's currently track their parent/child relationship. This applies to
  `abd_get_offset()` and `abd_borrow_buf()`. However, nothing depends on
  knowing this relationship; it's only used for consistency checks to
  verify that we are not destroying an ABD that's still in use. When we
  are creating/destroying ABD's frequently, the performance impact of
  maintaining these data structures (in particular the atomic
  increment/decrement operations) can be measurable. This commit removes
  this verification code on production builds, but keeps it when
  ZFS_DEBUG is set.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11535

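  A minimal sketch of the debug-only pattern this describes; the macro
  names and the abd_children wiring are illustrative, not the actual ZFS
  source:

      #ifdef ZFS_DEBUG
      #define ABD_TRACK_CHILD(parent, child) \
              zfs_refcount_add(&(parent)->abd_children, (child))
      #define ABD_UNTRACK_CHILD(parent, child) \
              zfs_refcount_remove(&(parent)->abd_children, (child))
      #else
      /* production builds: no atomics, no verification state */
      #define ABD_TRACK_CHILD(parent, child)   ((void)0)
      #define ABD_UNTRACK_CHILD(parent, child) ((void)0)
      #endif
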
* Fixing gang ABD when adding another gang
  Brian Atkinson | 2021-01-28 | 1 file | -8/+6

  I originally applied a fix in #11539 to fix a parent's child references
  when a gang ABD is free'd. However, I did not take into account
  abd_gang_add_gang(). We still need to make sure to update the child
  references in this function as well. To resolve this, I removed the
  decrementing of the gang ABD's size in abd_free_gang() and moved
  zfs_refcount_remove_many() back to its original placement in
  abd_free().

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11542

* fix abd_nr_pages_off for gang abd
  Matthew Ahrens | 2021-01-28 | 2 files | -33/+40

  `__vdev_disk_physio()` uses `abd_nr_pages_off()` to allocate a bio
  with a sufficient number of iovec's to process this zio (i.e.
  `nr_iovecs`/`bi_max_vecs`). If there are not enough iovec's in the
  bio, then additional bio's will be allocated. However, this is a
  sub-optimal code path. In particular, it requires several abd calls
  (to `abd_nr_pages_off()` and `abd_bio_map_off()`) which will have to
  walk the constituents of the ABD (the pages or the gang children)
  because they are looking for offsets > 0.

  For gang ABD's, `abd_nr_pages_off()` returns the number of iovec's
  needed for the first constituent, rather than the sum of all
  constituents (within the requested range). This always under-estimates
  the required number of iovec's, which causes us to always need several
  bio's. The end result is that `__vdev_disk_physio()` is usually O(n^2)
  for gang ABD's (and occasionally O(n^3), when more than 16 bio's are
  needed).

  This commit fixes `abd_nr_pages_off()`'s handling of gang ABD's, to
  correctly determine how many iovec's are needed, by adding up the
  number of iovec's for each of the gang children in the requested
  range.

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11536

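  A sketch of the corrected counting; first_child() and next_child() are
  hypothetical helpers standing in for the gang-child list walk, and the
  recursion bottoms out in the real per-child page count:

      static unsigned long
      gang_nr_pages_off(abd_t *abd, size_t size, size_t off)
      {
              unsigned long pages = 0;

              for (abd_t *c = first_child(abd); c != NULL && size > 0;
                  c = next_child(abd, c)) {
                      if (off >= c->abd_size) {
                              /* requested range starts past this child */
                              off -= c->abd_size;
                              continue;
                      }
                      size_t len = MIN(size, c->abd_size - off);
                      /* sum iovecs over every overlapping child */
                      pages += abd_nr_pages_off(c, len, off);
                      size -= len;
                      off = 0;
              }
              return (pages);
      }
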
* Avoid updating the L2ARC device header unnecessarily
  George Amanakis | 2021-01-28 | 1 file | -1/+3

  If we do not write any buffers to the cache device and the evict hand
  has not advanced, do not update the cache device header.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #11522
  Closes #11537

* Removing ABD Parent Child Reference Before Freeing ABD
  Brian Atkinson | 2021-01-28 | 1 file | -12/+10

  Move the call to zfs_refcount_remove_many() in abd_free() so it
  happens before any of the ABD free variants are called. This is
  necessary because abd_free_gang() adjusts the abd_size for the gang
  ABD. If the parent's child references are removed after freeing the
  gang ABD, the refcount is not adjusted correctly for the parent's
  children.

  I also removed some stray abd_put() in comments and changed
  abd_free_gang_abd() -> abd_free_gang().

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11539

* Revert special case code from pre-hashtable nvlist era
  Mark Maybee | 2021-01-27 | 2 files | -22/+6

  Before a hash table was added on top of the nvlist code, there were
  cases where the nvlist allocation was changed from fnvlist_alloc() to
  nvlist_alloc() to avoid expensive NV_UNIQUE_NAME checks. Now this is
  no longer necessary. These changes should be reverted to be consistent
  with other code. There are some cases where this change will also
  reduce the number of iterations.

  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Mark Maybee <[email protected]>
  Closes #11464

* Fix zrele race in zrele_async that can cause hang
  Paul Dagnelie | 2021-01-27 | 1 file | -12/+22

  There is a race condition in zfs_zrele_async when we are checking if
  we would be the one to evict an inode. This can lead to a txg sync
  deadlock. Instead of calling into iput directly, we attempt to perform
  the atomic decrement ourselves, unless that would set the i_count
  value to zero. In that case, we dispatch a call to iput to run later,
  to prevent a deadlock from occurring.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #11527
  Closes #11530

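  A sketch of the pattern, relying on Linux's atomic_add_unless(), which
  performs the add and returns non-zero unless the counter currently
  equals the forbidden value; the choice of system_taskq for the
  deferred iput() is an illustrative assumption:

      static void
      zrele_async_sketch(struct inode *ip)
      {
              /* decrement i_count unless it is 1, i.e. we'd be the evictor */
              if (atomic_add_unless(&ip->i_count, -1, 1))
                      return;         /* fast path: reference dropped inline */

              /* possibly the last holder: run iput() later, off this thread */
              VERIFY(taskq_dispatch(system_taskq, (task_func_t *)iput,
                  ip, TQ_SLEEP) != TASKQID_INVALID);
      }
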
* Parallelize vdev_validate
  Alan Somers | 2021-01-26 | 2 files | -3/+39

  The runtime of vdev_validate is dominated by the disk accesses in
  vdev_label_read_config. Speed it up by validating all vdevs in
  parallel using a taskq.

  Sponsored by: Axcient
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Closes #11470

* Read all disk labels concurrently in vdev_label_read_config
  Alan Somers | 2021-01-26 | 1 file | -14/+21

  This is similar to what we already do in vdev_geom_read_config.

  Sponsored by: Axcient
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Closes #11470

* Parallelize vdev_load
  Alan Somers | 2021-01-26 | 2 files | -3/+45

  metaslab_init is the slowest part of importing a mature pool, and it
  must be repeated hundreds of times for each top-level vdev. But its
  speed is dominated by a few serialized disk accesses. That can lead to
  import times of > 1 hour for pools with many top-level vdevs on
  spinning disks.

  Speed up the import by using a taskqueue to parallelize vdev_load
  across all top-level vdevs.

  This also requires adding mutex protection to
  metaslab_class_t.mc_histogram. The mc_histogram fields were
  unprotected when that code was first written in "Illumos 4976-4984 -
  metaslab improvements" (OpenZFS
  f3a7f6610f2df0217ba3b99099019417a954b673). The lock wasn't added until
  3dfb57a35e8cbaa7c424611235d669f3c575ada1, though it's unclear exactly
  which fields it's supposed to protect. In any case, it wasn't until
  vdev_load was parallelized that any code attempted concurrent access
  to those fields.

  Sponsored by: Axcient
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Closes #11470

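  A sketch of the dispatch pattern described above; the helper names and
  taskq parameters are assumptions, but the shape (one task per
  top-level vdev, then wait for all of them) is the point:

      static void
      vdev_load_task(void *arg)
      {
              (void) vdev_load((vdev_t *)arg); /* per-vdev serialized I/O */
      }

      static void
      vdev_load_children(vdev_t *rvd)
      {
              taskq_t *tq = taskq_create("vdev_load", max_ncpus,
                  defclsyspri, max_ncpus, INT_MAX, TASKQ_PREPOPULATE);

              for (uint64_t c = 0; c < rvd->vdev_children; c++)
                      VERIFY(taskq_dispatch(tq, vdev_load_task,
                          rvd->vdev_child[c], TQ_SLEEP) != TASKQID_INVALID);

              taskq_wait(tq);         /* all children loaded */
              taskq_destroy(tq);
      }
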
* cppcheck: integrate cppcheck
  Brian Behlendorf | 2021-01-26 | 4 files | -5/+21

  In order for cppcheck to perform a proper analysis it needs to be
  aware of how the sources are compiled (source files, include
  paths/files, extra defines, etc). All the needed information is
  available from the Makefiles and can be leveraged with a generic
  cppcheck Makefile target. So let's add one.

  Additional minor changes:

  * Removed the cppcheck-suppressions.txt file. With cppcheck 2.3 and
    these changes it appears to no longer be needed. Some inline
    suppressions were also removed since they appear not to be needed.
    We can add them back if it turns out they're needed for older
    versions of cppcheck.

  * Added the ax_count_cpus m4 macro to detect at configure time how
    many processors are available in order to run multiple cppcheck
    jobs. This value is also now used as a replacement for nproc when
    executing the kernel interface checks.

  * The "PHONY =" line moved into the Rules.am file which is included at
    the top of all Makefile.am's. This is just convenient because it
    allows us to use the += syntax to add phony targets.

  * One upside of this integration worth mentioning is it now allows
    `make cppcheck` to be run in any directory to check that subtree.

  * For the moment, cppcheck is not run against the FreeBSD specific
    kernel sources. The cppcheck-FreeBSD target will need to be
    implemented and tested on FreeBSD to support this.

  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11508

* cppcheck: return value always 0
  Brian Behlendorf | 2021-01-26 | 1 file | -1/+1

  Identical condition and return expression 'rc', return value is
  always 0.

  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11508

* cppcheck: remove redundant ASSERTs
  Brian Behlendorf | 2021-01-26 | 2 files | -5/+0

  The ASSERT that the passed pointer isn't NULL appears after the
  pointer has already been dereferenced. Remove the redundant check.

  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11508

* RAIDZ2/3 fails to heal silently corrupted parity w/2+ bad disks
  Matthew Ahrens | 2021-01-26 | 1 file | -10/+0

  When scrubbing, (non-sequential) resilvering, or correcting a checksum
  error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity
  by overwriting it. For example, if P disks are silently corrupted (P
  being the number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool
  scrub` should detect and heal all the bad state on these disks,
  including parity. This way if there is a subsequent failure we are
  fully protected.

  With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity
  sector, and also damage (silent or known) to a data sector. In this
  case the parity should be healed but it is not.

  The problem can be noticed by scrubbing the pool twice. Assuming there
  was no damage concurrent with the scrubs, the first scrub should fix
  all silent damage, and the second scrub should be "clean" (`zpool
  status` should not report checksum errors on any disks). If the bug is
  encountered, then the second scrub will repair the silently-damaged
  parity that the first scrub failed to repair, and these checksum
  errors will be reported after the second scrub. Since the first scrub
  repaired all the damaged data, the bug can not be encountered during
  the second scrub, so subsequent scrubs (more than two) are not
  necessary.

  The root cause of the problem is some code that was inadvertently
  added to `raidz_parity_verify()` by the DRAID changes. The incorrect
  code causes the parity healing to be aborted if there is damaged data
  (`rc_error != 0`) or the data disk is not present (`!rc_tried`). These
  checks are not necessary, because we only call `raidz_parity_verify()`
  if we have the correct data (which may have been reconstructed using
  parity, and which was verified by the checksum).

  This commit fixes the problem by removing the incorrect checks in
  `raidz_parity_verify()`.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11489
  Closes #11510

* spa_export_common: refactor common exit points
  Will Andrews | 2021-01-25 | 1 file | -11/+12

  Create a common exit point for spa_export_common (a very long
  function), which avoids missing steps on failure. This work is helpful
  for the planned forced pool export changes.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Will Andrews <[email protected]>
  Closes #11514

* spl-taskq: Make sure thread tsd hash entry is cleared
  Matthew Macy | 2021-01-25 | 1 file | -0/+1

  Like any other thread created by thread_create() we need to call
  thread_exit() to properly clean it up. In particular, this ensures the
  tsd hash for the thread is cleared.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11512

* Fix two minor lint errors (cppcheck)
  Colm | 2021-01-23 | 2 files | -4/+5

  Fix two minor errors reported by cppcheck:

  In module/zfs/abd.c (abd_get_offset_impl), add a non-NULL assertion to
  prevent a NULL dereference warning.

  In module/zfs/arc.c (l2arc_write_buffers), change the 'try' variable
  to 'pass' to avoid a C++ reserved word.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Colm Buckley <[email protected]>
  Closes #11507

* Relax special_small_blocks assertion.
  Alexander Motin | 2021-01-23 | 1 file | -1/+1

  Follow-up for commit 624222a: the value was asserted to be <=
  SPA_OLD_MAXBLOCKSIZE instead of <= SPA_MAXBLOCKSIZE, as it should be
  after the previous change.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Closes #11501

* FreeBSD: upstream changes to VFS interface
  Ryan Moeller | 2021-01-23 | 1 file | -2/+4

  Set VIRF_MOUNTPOINT flag on snapshot mountpoint.

  Authored-by: Mateusz Guzik <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11458

* Set aside a metaslab for ZIL blocks
  Matthew Ahrens | 2021-01-21 | 6 files | -99/+314

  Mixing ZIL and normal allocations has several problems:

  1. The ZIL allocations are allocated, written to disk, and then a few
     seconds later freed. This leaves behind holes (free segments) where
     the ZIL blocks used to be, which increases fragmentation, which
     negatively impacts performance.

  2. When under moderate load, ZIL allocations are of 128KB. If the pool
     is fairly fragmented, there may not be many free chunks of that
     size. This causes ZFS to load more metaslabs to locate free
     segments of 128KB or more. The loading happens synchronously (from
     zil_commit()), and can take around a second even if the metaslab's
     spacemap is cached in the ARC. All concurrent synchronous
     operations on this filesystem must wait while the metaslab is
     loading. This can cause a significant performance impact.

  3. If the pool is very fragmented, there may be zero free chunks of
     128KB or more. In this case, the ZIL falls back to
     txg_wait_synced(), which has an enormous performance impact.

  These problems can be eliminated by using a dedicated log device
  ("slog"), even one with the same performance characteristics as the
  normal devices.

  This change sets aside one metaslab from each top-level vdev that is
  preferentially used for ZIL allocations (vdev_log_mg,
  spa_embedded_log_class). From an allocation perspective, this is
  similar to having a dedicated log device, and it eliminates the
  above-mentioned performance problems.

  Log (ZIL) blocks can be allocated from the following locations. Each
  one is tried in order until the allocation succeeds (see the sketch
  after this entry):

  1. dedicated log vdevs, aka "slog" (spa_log_class)
  2. embedded slog metaslabs (spa_embedded_log_class)
  3. other metaslabs in normal vdevs (spa_normal_class)

  The space required for the embedded slog metaslabs is usually between
  0.5% and 1.0% of the pool, and comes out of the existing 3.2% of
  "slop" space that is not available for user data.

  On an all-ssd system with 4TB storage, 87% fragmentation, 60%
  capacity, and recordsize=8k, testing shows a ~50% performance increase
  on random 8k sync writes. On even more fragmented systems (which hit
  problem #3 above and call txg_wait_synced()), the performance
  improvement can be arbitrarily large (>100x).

  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11389

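  A sketch of that fallback order as control flow; try_alloc() is a
  hypothetical stand-in for a metaslab_alloc() call against the named
  class:

      static int
      zil_alloc_sketch(spa_t *spa, uint64_t size, blkptr_t *bp)
      {
              /* 1. dedicated log vdevs, aka "slog" */
              if (try_alloc(spa, spa_log_class(spa), size, bp) == 0)
                      return (0);
              /* 2. the embedded slog metaslab on each normal vdev */
              if (try_alloc(spa, spa_embedded_log_class(spa), size, bp) == 0)
                      return (0);
              /* 3. any other metaslab in the normal class */
              return (try_alloc(spa, spa_normal_class(spa), size, bp));
      }
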
* Linux 5.10 compat: restore custom uio_prefaultpages()
  Brian Behlendorf | 2021-01-21 | 1 file | -11/+44

  As part of commit 1c2358c1 the custom uio_prefaultpages() code was
  removed in favor of using the generic kernel provided
  iov_iter_fault_in_readable() interface. Unfortunately, it turns out
  that up until the Linux 4.7 kernel the function would only ever fault
  in the first iovec of the iov_iter. As a result, uiomove_iov() may
  hang waiting for the page.

  This commit effectively restores the custom uio_prefaultpages() code
  for Linux 4.9 and earlier kernels, which contain the troublesome
  version of iov_iter_fault_in_readable().

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11463
  Closes #11484

* Extending FreeBSD UIO Struct
  Brian Atkinson | 2021-01-20 | 23 files | -355/+390

  In FreeBSD, struct uio was just typedef'd to uio_t. In order to extend
  this struct outside of the definition of struct uio, the struct uio
  has been embedded inside of a uio_t struct.

  Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
  this is a ZFS interface.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11438

* allow callers to allocate and provide the abd_t struct
  Matthew Ahrens | 2021-01-20 | 12 files | -242/+187

  The `abd_get_offset_*()` routines create an abd_t that references
  another abd_t, and don't allocate any pages/buffers of their own. In
  some workloads, these routines may be called frequently, to create
  many abd_t's representing small pieces of a single large abd_t. In
  particular, the upcoming RAIDZ Expansion project makes heavy use of
  these routines.

  This commit adds the ability for the caller to allocate and provide
  the abd_t struct to a variant of `abd_get_offset_*()`. This eliminates
  the cost of allocating the abd_t and performing the accounting
  associated with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses
  this for the `rc_abd`, which references the zio's abd.

  The upcoming RAIDZ Expansion project will leverage this infrastructure
  to increase performance of reads post-expansion by around 50%.

  Additionally, some of the interfaces around creating and destroying
  abd_t's are cleaned up. Most significantly, the distinction between
  `abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
  now disposed of with `abd_free()`.

  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Issue #8853
  Closes #11439

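  A sketch of the caller-provided usage, assuming a variant named
  abd_get_offset_struct() and a caller-owned abd_t embedded in the RAIDZ
  column (the field name rc_abdstruct is an assumption):

      static abd_t *
      raidz_col_abd_init(raidz_col_t *rc, zio_t *zio)
      {
              /* no kmem alloc, no abdstat accounting on this hot path */
              return (abd_get_offset_struct(&rc->rc_abdstruct,
                  zio->io_abd, rc->rc_offset, rc->rc_size));
      }

  Disposal is uniform: the view above, like every other abd_t flavor, is
  released with abd_free().
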
* record ioctl elapsed time in zpool history
  Matthew Ahrens | 2021-01-11 | 2 files | -1/+10

  Each zfs ioctl that changes on-disk state (e.g. set property, create
  snapshot, destroy filesystem) is recorded in the zpool history, and is
  printed by `zpool history -i`. For performance diagnostic purposes, it
  would be useful to know how long each of these ioctls took to run.
  This commit adds that functionality, with a new `ZPOOL_HIST_ELAPSED_NS`
  member of the history nvlist.

  Additionally, the time recorded in this history log is currently the
  time that the history record is written to disk. But in many cases
  (CLI args logging and ioctl logging), this happens asynchronously,
  potentially many seconds after the operation completed. This commit
  changes the timestamp to reflect when the history event was created,
  rather than when it was written to disk.

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11440

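  A sketch of the measurement; the handler plumbing is hypothetical, but
  ZPOOL_HIST_ELAPSED_NS is the new nvlist member named above:

      static int
      log_ioctl_elapsed(nvlist_t *hist_nvl, int (*handler)(zfs_cmd_t *),
          zfs_cmd_t *zc)
      {
              hrtime_t start = gethrtime();
              int error = handler(zc);

              /* how long the state-changing ioctl actually took */
              fnvlist_add_uint64(hist_nvl, ZPOOL_HIST_ELAPSED_NS,
                  (uint64_t)(gethrtime() - start));
              return (error);
      }
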
* assertion failed in arc_wait_for_eviction()
  Matthew Ahrens | 2021-01-07 | 1 file | -8/+14

  If the system is very low on memory (specifically, `arc_free_memory()
  < arc_sys_free/2`, i.e. less than 1/16th of RAM free),
  `arc_evict_state_impl()` will defer wakeups. In this case, the
  arc_evict_waiter_t's remain on the list, even though `arc_evict_count`
  has been incremented past their `aew_count`.

  The problem is that `arc_wait_for_eviction()` assumes that if there
  are waiters on the list, the count they are waiting for has not yet
  been reached. However, the deferred wakeups may violate this, causing
  `ASSERT(last->aew_count > arc_evict_count)` to fail.

  This commit resolves the issue by having new waiters use the greater
  of `arc_evict_count` and the last `aew_count`.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: George Amanakis <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11285
  Closes #11397

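  A sketch of the fix as described (locking and the surrounding wait
  loop elided; `amount` stands for the number of bytes this caller is
  waiting to have evicted):

      arc_evict_waiter_t aw;
      list_link_init(&aw.aew_node);
      cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL);

      /* never target a count below the last queued waiter's target */
      arc_evict_waiter_t *last = list_tail(&arc_evict_waiters);
      uint64_t floor = (last != NULL) ? last->aew_count : arc_evict_count;
      aw.aew_count = MAX(arc_evict_count, floor) + amount;

      list_insert_tail(&arc_evict_waiters, &aw);
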
* VZ 7 kernel compat: introduce ITER-enabled .direct_IO() via IOVECs
  Konstantin Khorenko | 2020-12-30 | 1 file | -1/+14

  Virtuozzo 7 kernels starting with 3.10.0-1127.18.2.vz7.163.46 have the
  following configuration:

  * no HAVE_VFS_RW_ITERATE
  * HAVE_VFS_DIRECT_IO_ITER_RW_OFFSET

  So let's add an implementation of zpl_direct_IO() via
  zpl_aio_{read,write}() in this case.

  https://bugs.openvz.org/browse/OVZ-7243

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Konstantin Khorenko <[email protected]>
  Closes #11410
  Closes #11411

* implicit conversion from 'boolean_t' to 'ds_hold_flags_t'
  Toomas Soome | 2020-12-27 | 3 files | -17/+27

  A build error on illumos with gcc 10 revealed:

  In function 'dmu_objset_refresh_ownership':
  ../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion
  from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'}
  [-Werror=enum-conversion]
    857 |   dsl_dataset_disown(ds, decrypt, tag);
        |                          ^~~~~~~
  cc1: all warnings being treated as errors

  libzfs_input_check.c: In function 'zfs_ioc_input_tests':
  libzfs_input_check.c:754:28: error: implicit conversion from
  'enum dmu_objset_type' to 'enum lzc_dataset_type'
  [-Werror=enum-conversion]
    754 |   err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
        |                             ^~~~~~~~~~~
  cc1: all warnings being treated as errors

  The same issue is present in openzfs, and also the same issue about
  ds_hold_flags_t, which currently defines exactly one valid value.

  Reviewed-by: Igor Kozhukhov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Toomas Soome <[email protected]>
  Closes #11406

* Linux 5.11 compat: blk_{un}register_region()
  Brian Behlendorf | 2020-12-27 | 1 file | -44/+0

  As of 5.11 the blk_register_region() and blk_unregister_region()
  functions have been retired. This isn't a problem since add_disk() has
  implicitly allocated minor numbers for a very long time.

  Reviewed-by: Rafael Kitover <[email protected]>
  Reviewed-by: Coleman Kane <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11387
  Closes #11390

* Linux 5.11 compat: revalidate_disk_size()
  Brian Behlendorf | 2020-12-27 | 1 file | -2/+4

  Both revalidate_disk_size() and revalidate_disk() have been removed.
  Functionally this isn't a problem because we only relied on these
  functions to call zvol_revalidate_disk() for us and to perform any
  additional handling which might be needed for that kernel version.
  When neither are available we know there's no additional handling
  needed and we can directly call zvol_revalidate_disk().

  Reviewed-by: Rafael Kitover <[email protected]>
  Reviewed-by: Coleman Kane <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11387
  Closes #11390

* Linux 5.11 compat: bdev_whole()
  Brian Behlendorf | 2020-12-27 | 1 file | -4/+12

  The bd_contains member was removed from the block_device structure.
  Callers needing to determine if a vdev is a whole block device should
  use the new bdev_whole() wrapper. For older kernels we provide our own
  bdev_whole() wrapper which relies on bd_contains for compatibility.

  Reviewed-by: Rafael Kitover <[email protected]>
  Reviewed-by: Coleman Kane <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11387
  Closes #11390

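  A sketch of the compat shim this describes; the configure guard name
  HAVE_BDEV_WHOLE is an assumption:

      #if !defined(HAVE_BDEV_WHOLE)
      static inline struct block_device *
      bdev_whole(struct block_device *bdev)
      {
              return (bdev->bd_contains);     /* pre-5.11 kernels */
      }
      #endif
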
* Linux 5.11 compat: bio_start_io_acct() / bio_end_io_acct()
  Brian Behlendorf | 2020-12-27 | 1 file | -16/+33

  The generic IO accounting functions have been removed in favor of the
  bio_start_io_acct() and bio_end_io_acct() functions which provide a
  better interface. These new functions were introduced in the 5.8
  kernels but it wasn't until the 5.11 kernel that the previous generic
  IO accounting interfaces were removed.

  This commit updates the blk_generic_*_io_acct() wrappers to provide an
  interface similar to the updated kernel interface. It's slightly
  different because for older kernels we need to pass the request queue
  as well as the bio.

  Reviewed-by: Rafael Kitover <[email protected]>
  Reviewed-by: Coleman Kane <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11387
  Closes #11390

* Linux 5.11 compat: lookup_bdev()
  Brian Behlendorf | 2020-12-27 | 1 file | -9/+4

  The lookup_bdev() function has been updated to require a dev_t be
  passed as the second argument. This is actually pretty nice since the
  major number stored in the dev_t was the only part we were interested
  in. This allows us to avoid handling the bdev entirely. The
  vdev_lookup_bdev() wrapper was updated to emulate the behavior of the
  new lookup_bdev() for all supported kernels.

  Reviewed-by: Rafael Kitover <[email protected]>
  Reviewed-by: Coleman Kane <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11387
  Closes #11390

* Remove unused check from dmu_tx_count_write()
  Brian Behlendorf | 2020-12-21 | 1 file | -3/+0

  Individual transactions may not be larger than DMU_MAX_ACCESS. This is
  enforced by the assertions in dmu_tx_hold_write() and
  dmu_tx_hold_write_by_dnode(). There's an additional check in
  dmu_tx_count_write(); however, it has no effect and only sets a local
  err variable. We could enable this check, but since it's already
  enforced by ASSERTs elsewhere I opted to remove it instead.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3731
  Closes #11384

* Dangling reference from dmu_objset_upgrade
  Andy Fiddaman | 2020-12-21 | 1 file | -1/+8

  After porting the fix for https://github.com/openzfs/zfs/issues/5295
  over to illumos, we started hitting an assertion failure when running
  the testsuite:

  assertion failed: rc->rc_count == number, file: .../refcount.c

  and the unexpected hold has this stack:

  dsl_dataset_long_hold+0x59
  dmu_objset_upgrade+0x73
  dmu_objset_id_quota_upgrade+0x15
  dmu_objset_own+0x14f

  The simplest reproducer for this in illumos is

  zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool

  which is run as part of the zpool_create_tempname test, but I can't
  get this to trigger on FreeBSD. This appears to be because the call to
  txg_wait_synced() in dmu_objset_upgrade_stop() (which was missing in
  illumos) slows down dmu_objset_disown() enough to avoid the condition.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Andy Fiddaman <[email protected]>
  Closes #11368

* Fix maybe uninitialized variable warning
  Brian Behlendorf | 2020-12-20 | 1 file | -1/+1

  Commit 1c2358c12 restructured this code and introduced a warning about
  the variable maybe not being initialized. This cannot happen with the
  updated code but we should initialize the variable anyway to silence
  the warning.

  zpl_file.c: In function ‘zpl_iter_write’:
  zpl_file.c:324:9: warning: ‘count’ may be used uninitialized in this
  function [-Wmaybe-uninitialized]

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11373

* Remove iov_iter_advance() from iter_read
  Brian Behlendorf | 2020-12-20 | 1 file | -3/+0

  There's no need to call iov_iter_advance() in zpl_iter_read(). This
  was preserved from the previous code where it wasn't needed but also
  didn't cause any problems. Now that the iter functions also handle
  pipes that's no longer the case. When fully reading a pipe buffer,
  iov_iter_advance() may result in the pipe buf release function being
  called, which will not be registered, resulting in a NULL dereference.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11375
  Closes #11378

* dsl_pool: extend comment on DSL Pool Configuration Lock
  Christian Schwarz | 2020-12-19 | 1 file | -2/+29

  Based on a conversation with Matt on the OpenZFS Slack.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #11370

* Linux 5.10 compat: also zvol_revalidate_disk()
  Michael D Labriola | 2020-12-18 | 1 file | -2/+3

  Commit 59b68723 added a configure check for 5.10, which removed
  revalidate_disk(), and conditionally replaced its usage with a call to
  the new revalidate_disk_size() function. However, the old function
  also invoked the device's registered callback, in our case
  zvol_revalidate_disk(). This commit adds a call to
  zvol_revalidate_disk() in zvol_update_volsize() to make sure the code
  path stays the same.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Michael D Labriola <[email protected]>
  Closes #11358

* Linux 5.10 compat: use iov_iter in uio structure
  Brian Behlendorf | 2020-12-18 | 9 files | -307/+300

  As of the 5.10 kernel the generic splice compatibility code has been
  removed. All filesystems are now responsible for registering a
  ->splice_read and ->splice_write callback to support this operation.

  The good news is the VFS provided generic_file_splice_read() and
  iter_file_splice_write() callbacks can be used provided the
  ->iter_read and ->iter_write callbacks support pipes. However, this is
  currently not the case and only iovecs and bvecs (not pipes) are ever
  attached to the uio structure.

  This commit changes that by allowing full iov_iter structures to be
  attached to uios. Ever since the 4.9 kernel the iov_iter structure has
  supported iovecs, kvecs, bvecs, and pipes, so it's desirable to pass
  the entire thing when possible. In conjunction with this, the uio
  helper functions (i.e. uiomove(), uiocopy(), etc) have been updated to
  understand the new UIO_ITER type.

  Note that using the kernel provided uio_iter interfaces allowed the
  existing Linux specific uio handling code to be simplified. When
  there's no longer a need to support kernels older than 4.9, then it
  will be possible to remove the iovec and bvec members from the uio
  structure and always use a uio_iter. Until then we need to maintain
  all of the existing types for older kernels.

  Some additional refactoring and cleanup was included in this change:

  - Added checks to configure to detect available iov_iter interfaces.
    Some are available all the way back to the 3.10 kernel and are used
    when available. In particular, uio_prefaultpages() now always uses
    iov_iter_fault_in_readable() which is available for all supported
    kernels.

  - The unused UIO_USERISPACE type has been removed. It is no longer
    needed now that the uio_seg enum is platform specific.

  - Moved zfs_uio.c from the zcommon.ko module to the Linux specific
    platform code for the zfs.ko module. This gets it out of libzfs
    where it was never needed and keeps this Linux specific code out of
    the common sources.

  - Removed unnecessary O_APPEND handling from zfs_iter_write(); this is
    redundant and O_APPEND is already handled in zfs_write().

  Reviewed-by: Colin Ian King <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11351

* special device removal space accounting fixes
  Matthew Ahrens | 2020-12-17 | 2 files | -31/+40

  The space in special devices is not included in spa_dspace (or
  dsl_pool_adjustedsize(), or the zfs `available` property). Therefore
  there is always at least as much free space in the normal class, as
  there is allocated in the special class(es). And therefore, there is
  always enough free space to remove a special device.

  However, the checks for free space when removing special devices did
  not take this into account. This commit corrects that.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11329

* Avoid extra work updating ARC kstats and tunables
  Ryan Moeller | 2020-12-17 | 1 file | -16/+9

  After e357046 it should not be necessary to periodically update ARC
  kstats and tunables. Tunable updates are applied when modified, and
  kstats are updated on demand.

  Update kstats in `arc_evict_cb_check()` for `ZFS_DEBUG` builds only.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11237

* Only examine best metaslabs on each vdev
  Matthew Ahrens | 2020-12-16 | 2 files | -55/+60

  On a system with very high fragmentation, we may need to do lots of
  gang allocations (e.g. most indirect block allocations (~50KB) may
  need to gang). Before failing a "normal" allocation and resorting to
  ganging, we try every metaslab. This has the impact of loading every
  metaslab (not a huge deal since we now typically keep all metaslabs
  loaded), and also iterating over every metaslab for every failing
  allocation. If there are many metaslabs (more than the typical ~200,
  e.g. due to vdev expansion or very large vdevs), the CPU cost of this
  iteration can be very impactful. This iteration is done with the
  mg_lock held, creating long hold times and high lock contention for
  concurrent allocations, ultimately causing long txg sync times and
  poor application performance.

  To address this, this commit changes the behavior of "normal" (not
  try_hard, not ZIL) allocations. These will now only examine the 100
  best metaslabs (as determined by their ms_weight). If none of these
  have a large enough free segment, then the allocation will fail and
  we'll fall back on ganging.

  To accomplish this, we will now (normally) gang before doing a
  `try_hard` allocation. Non-try_hard allocations will only examine the
  100 best metaslabs of each vdev. In summary, we will first try normal
  allocation. If that fails then we will do a gang allocation. If that
  fails then we will do a "try hard" gang allocation. If that fails then
  we will have a multi-layer gang block.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11327

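  A sketch of the bounded search under stated assumptions: the constant
  and the try_alloc_from() helper are illustrative, and the group's
  metaslab tree is walked in ms_weight order (best first):

      #define MAX_METASLABS_TO_TRY    100

      static int
      alloc_from_best_metaslabs(metaslab_group_t *mg, uint64_t size)
      {
              int tried = 0;

              for (metaslab_t *msp = avl_first(&mg->mg_metaslab_tree);
                  msp != NULL && tried < MAX_METASLABS_TO_TRY;
                  msp = AVL_NEXT(&mg->mg_metaslab_tree, msp), tried++) {
                      if (try_alloc_from(msp, size) == 0)
                              return (0);
              }
              return (SET_ERROR(ENOSPC));     /* caller gangs instead */
      }
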
* Make metaslab class rotor and aliquot per-allocator.
  Alexander Motin | 2020-12-15 | 4 files | -76/+82

  Metaslab rotor and aliquot are used to distribute workload between
  vdevs while keeping some locality for logically adjacent blocks. Once
  multiple allocators were introduced to separate allocation of
  different objects, it does not make much sense for different
  allocators to write into different metaslabs of the same metaslab
  group (vdev) at the same time, competing for its resources. This
  change makes each allocator choose a metaslab group independently,
  colliding with others only sporadically.

  A test of simultaneous writes into 4 files with a recordsize of 4KB on
  a striped pool of 30 disks on a system with 40 logical cores shows a
  reduction of vdev queue lock contention from 54% to 27% due to better
  load distribution. Unfortunately it won't help ZVOLs much yet, since
  only one dataset/ZVOL is synced at a time, and so for the most part
  only one allocator is used, but it may improve later.

  While there, to reduce the number of pointer dereferences, change
  per-allocator storage for metaslab classes and groups from several
  separate malloc()'s to variable length arrays at the ends of the
  original class and group structures.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Closes #11288

* lua: avoid gcc -Wreturn-local-addr bug
  Ryan Libby | 2020-12-15 | 1 file | -4/+6

  Avoid a bug with gcc's -Wreturn-local-addr warning with some
  obfuscation. In buggy versions of gcc, if a return value is an
  expression that involves the address of a local variable, and even if
  that address is legally converted to a non-pointer type, a warning may
  be emitted and the value of the address may be replaced with zero.
  However, buggy versions don't emit the warning or replace the value
  when simply returning a local variable of non-pointer type.

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90737

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Libby <[email protected]>
  Closes #11337

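  A sketch of the pattern (not the lua source itself): the cast result
  is stored in a plain integer local first, which buggy gcc does not
  flag or fold to zero:

      static uintptr_t
      stack_mark_bad(void)
      {
              char local;
              return ((uintptr_t)&local); /* may warn; may be folded to 0 */
      }

      static uintptr_t
      stack_mark_ok(void)
      {
              char local;
              uintptr_t mark = (uintptr_t)&local;  /* cast happens here */
              return (mark);  /* plain non-pointer local: no warning */
      }
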
* dmu_zfetch: fix memory leak
  Matthew Macy | 2020-12-12 | 1 file | -4/+3

  The last change caused the read completion callback to not be called
  if the IO was still in progress. This change restores allocation of
  the arc buf callback, but in the callback path checks the new
  acb_nobuf field to know to skip buffer allocation.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11324

* Fix reporting of CKSUM errors in indirect vdevs
  George Amanakis | 2020-12-11 | 1 file | -9/+21

  When removing and subsequently reattaching a vdev, CKSUM errors may
  occur as vdev_indirect_read_all() reads from all children of a mirror
  in case of a resilver. Fix this by checking whether a child is missing
  the data and setting a flag (ic_error) which is then checked in
  vdev_indirect_repair() and suppresses incrementing the checksum
  counter.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #11277

* FreeBSD: Implement sysctl for fletcher4 impl
  Ryan Moeller | 2020-12-11 | 1 file | -8/+59

  There is a tunable to select the fletcher 4 checksum implementation on
  Linux but it was not present in FreeBSD. Implement the sysctl handler
  for FreeBSD and use ZFS_MODULE_PARAM_CALL to provide the tunable on
  both platforms.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11270

* Improve zfs receive performance with lightweight write
  Matthew Ahrens | 2020-12-11 | 8 files | -265/+512

  The performance of `zfs receive` can be bottlenecked on the CPU
  consumed by the `receive_writer` thread, especially when receiving
  streams with small compressed block sizes. Much of the CPU is spent
  creating and destroying dbuf's and arc buf's, one for each `WRITE`
  record in the send stream.

  This commit introduces the concept of "lightweight writes", which
  allows `zfs receive` to write to the DMU by providing an ABD, and
  instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and
  arc buf for this "dirty leaf block" are not instantiated.

  Because there is no dbuf with the dirty data, this mechanism doesn't
  support reading from "lightweight-dirty" blocks (they would see the
  on-disk state rather than the dirty data). Since the dedup-receive
  code has been removed, `zfs receive` is write-only, so this works
  fine.

  Because there are no arc bufs for the received data, the received data
  is no longer cached in the ARC.

  Testing a receive of a stream with average compressed block size of
  4KB, this commit improves performance by 50%, while also reducing CPU
  usage by 50% of a CPU. On a per-block basis, CPU consumed by
  receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was.

  Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
  New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%

  The code is also restructured in a few ways:

  Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
  some existing code that no longer needs `DB_DNODE_ENTER()` and related
  routines. The new field is needed by the lightweight-type dirty
  record.

  To ensure that the `dr_dnode` field remains valid until the dirty
  record is freed, we have to ensure that the `dnode_move()` doesn't
  relocate the dnode_t. To do this we keep a hold on the dnode until its
  zios have completed. This is already done by the user-accounting code
  (`userquota_updates_task()`); this commit extends that so that it
  always keeps the dnode hold until zio completion (see
  `dnode_rele_task()`).

  `dn_dirty_txg` was previously zeroed when the dnode was synced. This
  was not necessary, since its meaning can be "when was this dnode last
  dirtied". This change simplifies the new `dnode_rele_task()` code.

  Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11105

* Fix kernel panic induced by redacted send
  Paul Dagnelie | 2020-12-11 | 1 file | -61/+53

  In the redaction list traversal code, there is a bug in the binary
  search logic when looking for the resume point. Maxbufid can be
  decremented to -1, causing us to read the last possible block of the
  object instead of the one we wanted. This can cause incorrect resume
  behavior, or possibly even a hang in some cases. In addition, when
  examining non-last blocks, we can treat the block as being the same
  size as the last block, causing us to miss entries in the redaction
  list when determining where to resume. Finally, we were ignoring the
  case where the resume point was found in the buffer being searched,
  and resuming from minbufid. All these issues have been corrected, and
  the code has been significantly simplified to make future issues less
  likely.

  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #11297

* FreeBSD: Fix format of vfs.zfs.arc_no_grow_shift
  Ryan Moeller | 2020-12-10 | 1 file | -6/+5

  vfs.zfs.arc_no_grow_shift has an invalid type (15) and this causes
  py-sysctl to format it as a bytearray when it should be an integer.

  "U" is not a valid format; it should be "I", and the type should match
  the variable type, int. We can return EINVAL if the value is set below
  zero.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11318