aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Linux 6.1 compat: open inside tmpfile()Antonio Russo2023-01-061-0/+15
| | | | | | | | | | | | | | | | Linux 863f144 modified the .tmpfile interface to pass a struct file, rather than a struct dentry, and expect the tmpfile implementation to open inside of tmpfile(). This patch implements a configuration test that checks for this new API and appropriately sets a HAVE_TMPFILE_DENTRY flag that tracks this old API. Contingent on this flag, the appropriate API is implemented. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #14301 Closes #14343
* Illumos #15286: do_composition() needs sign awarenessRichard Yao2023-01-051-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Dan McDonald <[email protected]> Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Ported-by: Richard Yao <[email protected]> Illumos-issue: https://www.illumos.org/issues/15286 Illumos-commit: https://github.com/illumos/illumos-gate/commit/f137b22e734e85642da3e56e8b94da3f5f027c73 Porting Notes: The patch in illumos did not have much of a commit message, and did not provide attribution to the reporter, while original patch proposed to OpenZFS did, so I am listing the reporter (myself) and original patch author (also myself) below while including the original commit message with some minor corrections as part of the porting notes: In do_composition(), we have: size = u8_number_of_bytes[*p]; if (size <= 1 || (p + size) > oslast) break; There, we have type promotion from int8_t to size_t, which is unsigned. C will sign extend the value as part of the widening before treating the value as unsigned and the negative values we can counter are error values from U8_ILLEGAL_CHAR and U8_OUT_OF_RANGE_CHAR, which are -1 and -2 respectively. The unsigned versions of these under two's complement are SIZE_MAX and SIZE_MAX-1 respectively. The bounds check is written under the assumption that `size <= 1` does a signed comparison. This is followed by a pointer comparison to see if the string has the correct length, which is fine. A little further down we have: for (i = 0; i < size; i++) tc[i] = *p++; When an error condition is encountered, this will attempt to iterate at least SIZE_MAX-1 times, which will massively overflow the buffer, which is not fine. The kernel will kill the loop as soon as it hits the kernel stack guard on Linux systems built with CONFIG_VMAP_STACK=y, which should be just about all of them. That prevents arbitrary code execution and just about any other bad thing that a black hat attacker might attempt with knowledge of this buffer overflow. Other systems' kernels have mitigations for unbounded in-kernel buffer overflows that will catch this too. Also, the patch in illumos-gate made an effort to fix C style issues that had been fixed in the OpenZFS/ZFSOnLinux repository. Those issues had been mentioned in the email that I originally sent them about this issue. One of the fixes had not been already done, so it is included. Another to collect_a_seq()'s arguments was handled differently in OpenZFS. For the sake of avoiding unnecessary differences, it has been adopted. This has the interesting effect that if you correct the paths in the illumos-gate patch to match the current OpenZFS repository, you can reverse apply it cleanly. Original-patch-by: Richard Yao <[email protected]> Reported-by: Richard Yao <[email protected]> Co-authored-by: Dan McDonald <[email protected]> Closes #14318 Closes #14342
* FreeBSD: catch up to 1400077Mateusz Guzik2023-01-051-0/+3
| | | | | | | Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #14328
* Hide b_freeze_* under ZFS_DEBUGAlexander Motin2023-01-051-6/+27
| | | | | | | | | | | This saves 40 bytes per full ARC header, reducing it on FreeBSD from 240 to 200 bytes on production bits. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #14315
* Implement uncached prefetchAlexander Motin2023-01-046-131/+211
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the primarycache property was handled only in the dbuf layer. Since the speculative prefetcher is implemented in the ARC, it had to be disabled for uncacheable buffers. This change gives the ARC knowledge about uncacheable buffers via arc_read() and arc_write(). So when remove_reference() drops the last reference on the ARC header, it can either immediately destroy it, or if it is marked as prefetch, put it into a new arc_uncached state. That state is scanned every second, evicting stale buffers that were not demand read. This change also tracks dbufs that were read from the beginning, but not to the end. It is assumed that such buffers may receive further reads, and so they are stored in dbuf cache. If a following reads reaches the end of the buffer, it is immediately evicted. Otherwise it will follow regular dbuf cache eviction. Since the dbuf layer does not know actual file sizes, this logic is not applied to the final buffer of a dnode. Since uncacheable buffers should no longer stay in the ARC for long, this patch also tries to optimize I/O by allocating ARC physical buffers as linear to allow buffer sharing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14243
* arc_read()/arc_access() refactoring and cleanupAlexander Motin2022-12-223-288/+267
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14123
* FreeBSD: Fix potential boot panic with bad labelRyan Moeller2022-12-221-0/+2
| | | | | | | | vdev_geom_read_pool_label() can leave NULL in configs. Check for it and skip consistently when generating rootconf. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #14291
* deadlock between spa_errlog_lock and dp_config_rwlockMatthew Ahrens2022-12-224-175/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a lock order inversion deadlock between `spa_errlog_lock` and `dp_config_rwlock`: A thread in `spa_delete_dataset_errlog()` is running from a sync task. It is holding the `dp_config_rwlock` for writer (see `dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`. A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock` (see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as reader). Note that this was introduced by #12812. This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held. Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many code paths didn't check if `*count` got to zero, instead continuing to overwrite past the beginning of the userspace buffer at `uaddr`. Tested by having some errors in the pool (via `zinject -t data /path/to/file`), one thread running `zpool iostat 0.001`, and another thread runs `zfs destroy` (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix. Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: George Amanakis <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #14239 Closes #14289
* FreeBSD: Remove stray debug printfDoug Rabson2022-12-131-5/+2
| | | | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Doug Rabson <[email protected]> Closes #14286 Closes #14287
* Zero end of embedded block buffer in dump_write_embedded()Richard Yao2022-12-131-1/+7
| | | | | | | | | | This fixes a kernel stack leak. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Tested-by: Nicholas Sherlock <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13778 Closes #14255
* Cache dbuf_hash() calculationRichard Yao2022-12-132-17/+24
| | | | | | | | | | | We currently compute a 64-bit hash three times, which consumes 0.8% CPU time on ARC eviction heavy workloads. Caching the 64-bit value in the dbuf allows us to avoid that overhead. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14251
* zfs list: Allow more fields in ZFS_ITER_SIMPLE modeAllan Jude2022-12-131-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the fields to be listed and sorted by are constrained to those populated by dsl_dataset_fast_stat(), then zfs list is much faster, as it does not need to open each objset and reads its properties. A previous optimization by Pawel Dawidek (0cee24064a79f9c01fc4521543c37acea538405f) took advantage of this to make listing snapshot names sorted only by name much faster. However, it was limited to `-o name -s name`, this work extends this optimization to work with: - name - guid - createtxg - numclones - inconsistent - redacted - origin and could be further extended to any other properties supported by dsl_dataset_fast_stat() or similar, that do not require extra locking or reading from disk. This was committed before (9a9e2e343dfa2af28bf7910de77ae73aa006de62), but was reverted due to a regression when used with an older kernel. If the kernel does not populate zc->zc_objset_stats, we now fallback to getting the properties via the slower interface, to avoid problems with newer userland and older kernels. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #14110
* Skip permission checks for extended attributesAmeer Hamza2022-12-123-6/+3
| | | | | | | | | | | | | | | zfs_zaccess_trivial() calls the generic_permission() to read xattr attributes. This causes deadlock if called from zpl_xattr_set_dir() context as xattr and the dent locks are already held in this scenario. This commit skips the permissions checks for extended attributes since the Linux VFS stack already checks it before passing us the control. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14220
* Restrict visibility of per-dataset kstats inside FreeBSD jailsAllan Jude2022-12-091-10/+79
| | | | | | | | | | | | | When inside a jail, visibility on datasets not "jailed" to the jail is restricted. However, it was possible to enumerate all datasets in the pool by looking at the kstats sysctl MIB. Only the kstats corresponding to datasets that the user has visibility on are accessible now. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #14254
* Bypass metaslab throttle for removal allocationsSerapheim Dimitropoulos2022-12-092-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Context: We recently had a scenario where a customer with 2x10TB disks at 95+% fragmentation and capacity, wanted to migrate their disks to a 2x20TB setup. So they added the 2 new disks and submitted the removal of the first 10TB disk. The removal took a lot more than expected (order of more than a week to 2 weeks vs a couple of days) and once it was done it generated a huge indirect mappign table in RAM (~16GB vs expected ~1GB). Root-Cause: The removal code calls `metaslab_alloc_dva()` to allocate a new block for each evacuating block in the removing device and it tries to batch them into 16MB segments. If it can't find such a segment it tries for 8MBs, 4MBs, all the way down to 512 bytes. In our scenario what would happen is that `metaslab_alloc_dva()` from the removal thread pick the new devices initially but wouldn't allocate from them because of throttling in their metaslab allocation queue's depth (see `metaslab_group_allocatable()`) as these devices are new and favored for most types of allocations because of their free space. So then the removal thread would look at the old fragmented disk for allocations and wouldn't find any contiguous space and finally retry with a smaller allocation size until it would to the low KB range. This caused a lot of small mappings to be generated blowing up the size of the indirect table. It also wasted a lot of CPU while the removal was active making everything slow. This patch: Make all allocations coming from the device removal thread bypass the throttle checks. These allocations are not even counted in the metaslab allocation queues anyway so why check them? Side-Fix: Allocations with METASLAB_DONT_THROTTLE in their flags would not be accounted at the throttle queues but they'd still abide by the throttling rules which seems wrong. This patch fixes this by checking for that flag in `metaslab_group_allocatable()`. I did a quick check to see where else this flag is used and it doesn't seem like this change would cause issues. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #14159
* Fix dereference after null check in enqueue_rangeRichard Yao2022-12-081-1/+3
| | | | | | | | | | | | | If the bp is NULL, we have a hole. However, when we build with assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When this happens on a hole, we will have a NULL pointer dereference. Reported-by: Coverity (CID-1524670) Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14264
* Linux: Cleanup unnecessary NULL check in __vdev_disk_physio()Richard Yao2022-12-081-1/+1
| | | | | | | | | | | | | | | zio is never NULL when given to the vdev. Coverity complained saying: "Either the check against null is unnecessary, or there may be a null pointer dereference." Reported-by: Coverity (CID-1466174) Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14263
* Remove duplicate statically allocated variableRichard Yao2022-12-081-1/+0
| | | | | | | | | | | | | | | dsl_dataset_snapshot_sync_impl() declares `static zil_header_t zero_zil __maybe_unused;`, but this is also declared globally. This wastes memory. CodeQL's cpp/local-variable-hides-global-variable check caught this. Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14263
* Micro-optimize fletcher4 calculationsRichard Yao2022-12-056-68/+36
| | | | | | | | | | | | | | | | | | | | | When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions. Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14247
* FreeBSD: zfs_register_callbacks() must implement error check correctlyRichard Yao2022-12-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | I read the following article and noticed a couple of ZFS bugs mentioned: https://pvs-studio.com/en/blog/posts/cpp/0377/ I decided to search for them in the modern OpenZFS codebase and then found one that matched the description of the first one: V593 Consider reviewing the expression of the 'A = B != C' kind. The expression is calculated as following: 'A = (B != C)'. zfs_vfsops.c 498 The consequence of this is that the error value is replaced with `1` when there is an error. When there is no error, 0 is correctly passed. This is a very minor issue that is unlikely to cause any real problems. The incorrect error code would either be returned to the mount command on a failure or any of `zfs receive`, `zfs recv`, `zfs rollback` or `zfs upgrade`. The second one has already been fixed. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Damian Szuberski <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14261
* zio can deadlock during device removalGeorge Wilson2022-12-021-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing a device removal on a pool with gang blocks, the zio pipeline can deadlock when trying to free blocks from a device which is being removed with a stack similar to this: 0xffff8ab9a13a1740 UNINTERRUPTIBLE 4 __schedule+0x2e5 __schedule+0x2e5 schedule+0x33 schedule_preempt_disabled+0xe __mutex_lock.isra.12+0x2a7 __mutex_lock.isra.12+0x2a7 __mutex_lock_slowpath+0x13 mutex_lock+0x2c free_from_removing_vdev+0x61 metaslab_free_impl+0xd6 metaslab_free_dva+0x5e metaslab_free+0x196 zio_free_sync+0xe4 zio_free_gang+0x38 zio_gang_tree_issue+0x42 zio_gang_tree_issue+0xa2 zio_gang_issue+0x6d zio_execute+0x94 zio_execute+0x94 taskq_thread+0x23b kthread+0x120 ret_from_fork+0x1f Since there are gang blocks we have to read the gang members as part of the free. This can be seen with a zio dependency tree that looks like this: sdb> echo 0xffff900c24f8a700 | zio -rc | zio ADDRESS TYPE STAGE WAITER 0xffff900c24f8a700 NULL CHECKSUM_VERIFY 0xffff900ddfd31740 0xffff900c24f8c920 FREE GANG_ASSEMBLE - 0xffff900d93d435a0 READ DONE In the illustration above we are processing frees but because of gang block we have to read the constituents blocks. Once we finish the READ in the zio pipeline we will execute the parent. In this case the parent is a FREE but the zio taskq is a READ and we continue to process the pipeline leading to the stack above. In the stack above, we are blocked waiting for the svr_lock so as a result a READ interrupt taskq thread is now consumed. Eventually, all of the READ taskq threads end up blocked and we're unable to complete any read requests. In zio_notify_parent there is an optimization to continue to use the taskq thread to exectue the parent's pipeline. To resolve the deadlock above, we only allow this optimization if the parent's zio type matches the child which just completed. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: George Wilson <[email protected]> External-issue: DLPX-80130 Closes #14236
* nopwrites on dmu_sync-ed blocks can result in a panicGeorge Wilson2022-12-021-8/+10
| | | | | | | | | | | | | | | | After a device has been removed, any nopwrites for blocks on that indirect vdev should be ignored and a new block should be allocated. The original code attempted to handle this but used the wrong block pointer when checking for indirect vdevs and failed to check all DVAs. This change corrects both of these issues and modifies the test case to ensure that it properly tests nopwrites with device removal. Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #14235
* Bump checksum error counter before reporting to ZEDRob Wing2022-12-023-12/+12
| | | | | | | | | | | | | The checksum error counter is incremented after reporting to ZED. This leads ZED to receiving a checksum error report with 0 checksum errors. To avoid this, bump the checksum error counter before reporting to ZED. Sponsored-by: Seagate Technology LLC Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Wing <[email protected]> Closes #14190
* Fix Clang 15 compilation errorsszubersk2022-11-306-2/+21
| | | | | | | | | | | | | | | | | - Clang 15 doesn't support `-fno-ipa-sra` anymore. Do a separate check for `-fno-ipa-sra` support by $KERNEL_CC. - Don't enable `-mgeneral-regs-only` for certain module files. Fix #13260 - Scope `GCC diagnostic ignored` statements to GCC only. Clang doesn't need them to compile the code. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #13260 Closes #14150
* Fix GCC 12 compilation errorsszubersk2022-11-303-2/+20
| | | | | | | | | Squelch false positives reported by GCC 12 with UBSan. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #14150
* zstd: Refactor prefetching for the decoding loopYann Collet2022-11-291-22/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | Following facebook/zstd#2545, I noticed that one field in `seq_t` is optional, and only used in combination with prefetching. (This may have contributed to static analyzer failure to detect correct initialization). I then wondered if it would be possible to rewrite the code so that this optional part is handled directly by the prefetching code rather than delegated as an option into `ZSTD_decodeSequence()`. This resulted into this refactoring exercise where the prefetching responsibility is better isolated into its own function and `ZSTD_decodeSequence()` is streamlined to contain strictly Sequence decoding operations. Incidently, due to better code locality, it reduces the need to send information around, leading to simplified interface, and smaller state structures. Port of facebook/zstd@f5434663ea2a79b287ab4cd299179342f64a23a7 Reported-by: Coverity (CID 1462271) Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Ported-by: Richard Yao <[email protected]> Closes #14212
* zstd: [superblock] Add defensive assert and bounds checkNick Terrell2022-11-291-6/+10
| | | | | | | | | | | | | | | | The bound check condition should always be met because we selected `set_basic` as our encoding type. But that code is very far away, so assert it is true so if it is ever false we can catch it, and add a bounds check. Port of facebook/zstd@1047097dadd5f8dfd47e25af0738eeb4bd82f6ec Reported-by: Coverity (CID 1524446) Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Ported-by: Richard Yao <[email protected]> Closes #14212
* Fix NULL pointer dereference in dbuf_prefetch_indirect_done()Richard Yao2022-11-291-2/+2
| | | | | | | | | | | | | | When ZFS is built with assertions, a prefetch is done on a redacted blkptr and `dpa->dpa_dnode` is NULL, we will have a NULL pointer dereference in `dbuf_prefetch_indirect_done()`. Both Coverity and Clang's Static Analyzer caught this. Reported-by: Coverity (CID 1524671) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14210
* Cleanup: Delete dead code from send_merge_thread()Richard Yao2022-11-291-2/+0
| | | | | | | | | | | | | | | | | | range is always deferenced before it reaches this check, such that the kmem_zalloc() call is never executed. A previously version of this had erronously also pruned the `range->eos_marker = B_TRUE` line, but it must be set whenever we encounter an error or are cancelled early. Coverity incorrectly complained about a potential NULL pointer dereference because of this. Reported-by: Coverity (CID 1524550) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14210
* Fix the last two CFI callback prototype mismatchesAlexander2022-11-294-5/+5
| | | | | | | | | | | | | | | | | There was the series from me a year ago which fixed most of the callback vs implementation prototype mismatches. It was based on running the CFI-enabled kernel (in permissive mode -- warning instead of panic) and performing a full ZTS cycle, and then fixing all of the problems caught by CFI. Now, Clang 16-dev has new warning flag, -Wcast-function-type-strict, which detect such mismatches at compile-time. It allows to find the remaining issues missed by the first series. There are only two of them left: one for the secpolicy_vnode_setattr() callback and one for taskq_dispatch(). The fix is easy, since they are not used anywhere else. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Closes #14207
* Lua: Fix bad bitshift in lua_strx2number()Richard Yao2022-11-291-1/+1
| | | | | | | | | | | | | | | | The port of lua to OpenZFS modified lua to use int64_t for numbers instead of double. As part of this, a function for calculating exponentiation was replaced with a bit shift. Unfortunately, it did not handle negative values. Also, it only supported exponents numbers with 7 digits before before overflow. This supports exponents up to 15 digits before overflow. Clang's static analyzer reported this as "Result of operation is garbage or undefined" because the exponent was negative. Reviewed-by: Damian Szuberski <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14204
* Remove few pointer dereferences in dbuf_read()Alexander Motin2022-11-292-13/+4
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #14199
* FreeBSD: stop using buffer cache-only routines on syncMateusz Guzik2022-11-292-5/+0
| | | | | | | | | Both vop_fsync and vfs_stdsync are effectively just costly no-ops as they only act on ->v_bufobj.bo_dirty et al, which are unused by zfs. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #14157
* Switch dnode stats to wmsumsAlexander Motin2022-11-291-0/+127
| | | | | | | | | | I've noticed that some of those counters are used in hot paths like dnode_hold_impl(), and results of this change is visible in profiler. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #14198
* Micro-optimize zrl_remove()Alexander Motin2022-11-291-4/+4
| | | | | | | | | atomic_dec_32() should be a bit lighter than atomic_dec_32_nv(). Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #14200
* zed: unclean disk attachment faults the vdevAmeer Hamza2022-11-291-2/+2
| | | | | | | | | | | | | | | If the attached disk already contains a vdev GUID, it means the disk is not clean. In such a scenario, the physical path would be a match that makes the disk faulted when trying to online it. So, we would only want to proceed if either GUID matches with the last attached disk or the disk is in a clean state. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14181
* Convert some sprintf() calls to kmem_scnprintf()Richard Yao2022-11-285-11/+17
| | | | | | | | | | | | | | | | | | | These `sprintf()` calls are used repeatedly to write to a buffer. There is no protection against overflow other than reviewers explicitly checking to see if the buffers are big enough. However, such issues are easily missed during review and when they are missed, we would rather stop printing rather than have a buffer overflow, so we convert these functions to use `kmem_scnprintf()`. The Linux kernel provides an entire page for module parameters, so we are safe to write up to PAGE_SIZE. Removing `sprintf()` from these functions removes the last instances of `sprintf()` usage in our platform-independent kernel code. This improves XNU kernel compatibility because the XNU kernel does not support (removed support for?) `sprintf()`. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14209
* Avoid a null pointer dereference in zfs_mount() on FreeBSDAllan Jude2022-11-281-1/+2
| | | | | | | | | | | | | | When mounting the root filesystem, vfs_t->mnt_vnodecovered is null This will cause zfsctl_is_node() to dereference a null pointer when mounting, or updating the mount flags, on the root filesystem, both of which happen during the boot process. Reported-by: Martin Matuska <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #14218
* Remove atomics from zh_refcountAlexander Motin2022-11-281-8/+6
| | | | | | | | | | It is protected by z_hold_locks, so we do not need more serialization, simple integer math should be fine. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #14196
* zed: post a udev change event from spa_vdev_attach()Ameer Hamza2022-11-181-1/+1
| | | | | | | | | | | | | | | In order for zed to process the removal event correctly, udev change event needs to be posted to sync the blkid information. spa_create() and spa_config_update() posts the event already through spa_write_cachefile(). Doing the same for spa_vdev_attach() that handles the case for vdev attachment and replacement. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14172
* Fix setting the large_block feature after receiving a snapshotGeorge Amanakis2022-11-181-0/+15
| | | | | | | | | | | | | We are not allowed to dirty a filesystem when done receiving a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will not be set on that filesystem since the filesystem is not on dp_dirty_datasets, and a subsequent encrypted raw send will fail. Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature needs to be activated and do so if appropriate. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #13699 Closes #13782
* Handle and detect #13709's unlock regression (#14161)Rich Ercolani2022-11-151-2/+16
| | | | | | | | | | | | | | | | In #13709, as in #11294 before it, it turns out that 63a26454 still had the same failure mode as when it was first landed as d1d47691, and fails to unlock certain datasets that formerly worked. Rather than reverting it again, let's add handling to just throw out the accounting metadata that failed to unlock when that happens, as well as a test with a pre-broken pool image to ensure that we never get bitten by this again. Fixes: #13709 Signed-off-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Fix arc_p aggressive increaseshodanshok2022-11-111-2/+3
| | | | | | | | | | | | | | | | | | | The original ARC paper called for an initial 50/50 MRU/MFU split and this is accounted in various places where arc_p = arc_c >> 1, with further adjustment based on ghost lists size/hit. However, in current code both arc_adapt() and arc_get_data_impl() aggressively grow arc_p until arc_c is reached, causing unneeded pressure on MFU and greatly reducing its scan-resistance until ghost list adjustments kick in. This patch restores the original behavior of initially having arc_p as 1/2 of total ARC, without preventing MRU to use up to 100% total ARC when MFU is empty. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gionatan Danti <[email protected]> Closes #14137 Closes #14120
* Cleanup: Suppress Coverity dereference before/after NULL check reportsRichard Yao2022-11-101-1/+1
| | | | | | | | | | | | | | | | f224eddf922a33ca4b86d83148e9e6fa155fc290 began dereferencing a NULL checked pointer in zpl_vap_init(), which made Coverity complain because either the dereference is unsafe or the NULL check is unnecessary. Upon inspection, this pointer is guaranteed to never be NULL because it is from the Linux kernel VFS. The calls into ZFS simply would not make sense if this pointer were NULL, so the NULL check is unnecessary. Reported-by: Coverity (CID 1527260) Reported-by: Coverity (CID 1527262) Reviewed-by: Mariusz Zaborski <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14170
* Fix potential NULL pointer dereference regressionRichard Yao2022-11-101-1/+1
| | | | | | | | | | | | | | | | 945b407486a0072ec7dd117a0bde2f72d52eb445 neglected to `NULL` check `tx->tx_objset`, which is already done in the function. This upset Coverity, which complained about a "dereference after null check". Upon inspection, it was found that whenever `dmu_tx_create_dd()` is called followed by `dmu_tx_assign()`, such as in `dsl_sync_task_common()`, `tx->tx_objset` will be `NULL`. Reported-by: Coverity (CID 1527261) Reviewed-by: Mariusz Zaborski <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14170
* Allow to control failfastMariusz Zaborski2022-11-103-2/+63
| | | | | | | | | | | | | | | | | | | | | Linux defaults to setting "failfast" on BIOs, so that the OS will not retry IOs that fail, and instead report the error to ZFS. In some cases, such as errors reported by the HBA driver, not the device itself, we would wish to retry rather than generating vdev errors in ZFS. This new property allows that. This introduces a per vdev option to disable the failfast option. This also introduces a global module parameter to define the failfast mask value. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Sponsored-by: Seagate Technology LLC Submitted-by: Klara, Inc. Closes #14056
* quota: disable quota check for ZVOLMariusz Zaborski2022-11-081-4/+18
| | | | | | | | | | | | | | | | | | | The quota for ZVOLs is set to the size of the volume. When the quota reaches the maximum, there isn't an excellent way to check if the new writers are overwriting the data or if they are inserting a new one. Because of that, when we reach the maximum quota, we wait till txg is flushed. This is causing a significant fluctuation in bandwidth. In the case of ZVOL, the quota is enforced by the volsize, so we can omit it. This commit adds a sysctl thats allow to control if the quota mechanism should be enforced or not. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Sponsored-by: Zededa Inc. Sponsored-by: Klara Inc. Closes #13838
* Optionally skip zil_close during zvol_create_minor_implAlan Somers2022-11-083-11/+20
| | | | | | | | | | | | | | | If there were no zil entries to replay, skip zil_close. zil_close waits for a transaction to sync. That can take several seconds, for example during pool import of a resilvering pool. Skipping zil_close can cut the time for "zpool import" from 2 hours to 45 seconds on a resilvering pool with a thousand zvols. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Sponsored-by: Axcient Closes #13999 Closes #14015
* Support idmapped mount in user namespaceyouzhongyang2022-11-0811-56/+146
| | | | | | | | | | | | | | | | | | Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping infrastructure to support idmapped mounts of filesystems mounted with an idmapping". Update the OpenZFS accordingly to improve the idmapped mount support. This pull request contains the following changes: - xattr setter functions are fixed to take mnt_ns argument. Without this, cp -p would fail for an idmapped mount in a user namespace. - idmap_util is enhanced/fixed for its use in a user ns context. - One test case added to test idmapped mount in a user ns. Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14097
* dsl_prop_known_index(): check for invalid propDamian Szuberski2022-11-082-1/+28
| | | | | | | | Resolve UBSAN array-index-out-of-bounds error in zprop_desc_t. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #14142 Closes #14147