aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Add SHA2 SIMD feature tests for libsplTino Reichardt2023-03-022-24/+99
| | | | | | | | | | | | | | | | These are added via HWCAP interface: - zfs_neon_available() for arm and aarch64 - zfs_sha256_available() for arm and aarch64 - zfs_sha512_available() for aarch64 This one via cpuid() call: - zfs_shani_available() for x86_64 Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
* Add SHA2 SIMD feature tests for LinuxTino Reichardt2023-03-026-13/+169
| | | | | | | | | | | | | | | These are added: - zfs_neon_available() for arm and aarch64 - zfs_sha256_available() for arm and aarch64 - zfs_sha512_available() for aarch64 - zfs_shani_available() for x86_64 Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Co-Authored-By: Sebastian Gottschall <[email protected]> Closes #13741
* Add SHA2 SIMD feature tests for FreeBSDTino Reichardt2023-03-027-28/+205
| | | | | | | | | | | | | | | | | These are added: - zfs_neon_available() for arm and aarch64 - zfs_sha256_available() for arm and aarch64 - zfs_sha512_available() for aarch64 - zfs_shani_available() for x86_64 Changes: - simd_powerpc.h: change license from CDDL to BSD Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
* Add ARM architecture to OpenZFS buildsystemTino Reichardt2023-03-021-1/+5
| | | | | | | | Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
* Remove old or redundant SHA2 filesTino Reichardt2023-03-0218-7312/+3
| | | | | | | | | | | | | | | | | We had three sha2.h headers in different places. The FreeBSD version, the Linux version and the generic solaris version. The only assembly used for acceleration was some old x86-64 openssl implementation for sha256 within the icp module. For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS, these files got removed also. Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
* zdb: add decryption supportRob N2023-03-027-15/+262
| | | | | | | | | | | | | | The approach is straightforward: for dataset ops, if a key was offered, find the encryption root and the various encryption parameters, derive a wrapping key if necessary, and then unlock the encryption root. After that all the regular dataset ops will return unencrypted data, and that's kinda the whole thing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #11551 Closes #12707 Closes #14503
* System-wide speculative prefetch limit.Alexander Motin2023-03-012-5/+25
| | | | | | | | | | | | | | | With some pathological access patterns it is possible to make ZFS accumulate almost unlimited amount of speculative prefetch ZIOs. Combined with linear ABD allocations in RAIDZ code, it appears to be possible to exhaust system KVA, triggering kernel panic. Address this by introducing a system-wide counter of active prefetch requests and blocking prefetch distance doubling per stream hits if the number of active requests is higher that ~6% of ARC size. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14516
* Accommodate debug-kernel stack frame sizeBrian Behlendorf2023-03-011-2/+2
| | | | | | | | | | | | | | The blk_queue_discard() and blk_queue_sector_erase() functions slightly exceed the allowed 4096 maximum stack frame size when building with the RedHat debug kernel which causes their configure checks to fail. Add an exception for these two tests so the interfaces are correctly detected. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14540
* Fix data race between zil_commit() and zil_suspend()Richard Yao2023-03-012-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | openzfsonwindows/openzfs#206 found that it is possible to trip `VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed by the scheduler long enough for a parallel `zil_suspend()` operation to exit `zil_commit_impl()`. This is a data race. To prevent this, we introduce a `zilog->zl_suspend_lock` rwlock to ensure that all outstanding `zil_commit()` operations finish before `zil_suspend()` begins and that subsequent operations fallback to `txg_wait_synced()` after `zil_suspend()` has begun. On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers from writer starvation. This means that a ZIL intensive system can delay `zil_suspend()` indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against `PREEMPT_RT` Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now. Reported-by: Arun KV <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14514
* Remove bad kmem_free() oversight from previous zfsdev_state_list patchRichard Yao2023-03-011-1/+0
| | | | | | | | | | | | I forgot to remove the corresponding kmem_free() from zfs_kmod_fini() in 9a14ce43c3d6a9939804215bbbe66de5115ace42. Clang's static analyzer did not complain, but the Coverity scan that was run after the patch was merged did. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-1535275) Closes #14556
* Linux: zfs_fillpage() should handle partial pages from end of fileRichard Yao2023-03-011-1/+5
| | | | | | | | | | | | | | | | | | | | | | | After 89cd2197b94986d315b9b1be707b645baf59af4f was merged, Clang's static analyzer began complaining about a dead assignment in `zfs_fillpage()`. Upon inspection, I noticed that the dead assignment was because we are not using the calculated io_len that we should use to avoid asking the DMU to read past the end of a file. This should result in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`. This issue predates 89cd2197b94986d315b9b1be707b645baf59af4f, but its simplification of zfs_fillpage() eliminated the only use of the assignment to io_len, which made Clang's static analyzer complain about the issue. Also, as a precaution, we add an assertion that io_offset < i_size. If this ever fails, bad things will happen. Otherwise, we are blindly trusting the kernel not to give us invalid offsets. We continue to blindly trust it on non-debug kernels. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14534
* Update reclaim disk space scriptTino Reichardt2023-03-011-1/+4
| | | | | | | | | | | | | | The script uses systemd-run, which does the job in background. We should take the the time and wait for the job to finish. Maybe some functional tests suffer from not really freed disk space and fail because of this. We also add some trimming in the end of the script. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #14554
* Handle unexpected errors in zil_lwb_commit() without ASSERT()Richard Yao2023-03-011-6/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We tripped `ASSERT(error == ENOENT || error == EEXIST || error == EALREADY)` in `zil_lwb_commit()` at Klara when doing robustness testing of ZIL against drive power cycles. That assertion presumably exists because when this code was written, the only errors expected from here were EIO, ENOENT, EEXIST and EALREADY, with EIO having its own handling before the assertion. However, upon doing a manual depth first search traversal of the source tree, it turns out that a large number of unexpected errors are possible here. In theory, EINVAL and ENOSPC can come from dnode_hold_impl(). However, most unexpected errors originate in the block layer and come to us from zio_wait() in various ways. One way is ->zl_get_data() -> dmu_buf_hold() -> dbuf_read() -> zio_wait(). From vdev_disk.c on Linux alone, zio_wait() can return the unexpected errors ENXIO, ENOTSUP, EOPNOTSUPP, ETIMEDOUT, ENOSPC, ENOLINK, EREMOTEIO, EBADE, ENODATA, EILSEQ and ENOMEM This was only observed after what have been likely over 1000 test iterations, so we do not expect to reproduce this again to find out what the error code was. However, circumstantial evidence suggests that the error was ENXIO. When ENXIO or any other unexpected error occurs, the `fsync()` or equivalent operation that called zil_commit() will return success, when in fact, dirty data has not been committed to stable storage. This is a violation of the Single UNIX Specification. The code should be able to handle this and any other unknown error by calling `txg_wait_synced()`. In addition to changing the code to call txg_wait_synced() on unexpected errors instead of returning, we modify it to print information about unexpected errors to dmesg. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Sponsored-By: Wasabi Technology, Inc. Closes #14532
* mancheck: exclude dotfiles in man dirRob N2023-03-011-1/+1
| | | | | | | | | | | | | | | Its not uncommon for an editor to drop a hidden swap file in the dir while editing a file there. mancheck would find it and run mandoc on it, which would complain about its distinctly not-manpage format. A more correct solution might be to reconfigure the editor to not put swap files in the same dir, but its the default a lot of the time, and this is a very small change that gives a very nice quality-of-life improvement. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14549
* man: note that zdb operates directly on pool storageRob N2023-02-281-0/+6
| | | | | | | | | | | | | | | | | | | | A frequent misunderstanding is that zdb accesses the pool through the kernel or filesystem, leading to confusion particularly when it can't access something that it seems like it should be able to. I've seen this confusion recently when zdb couldn't access a pool because the user didn't have permission to read directly from the block devices, and when it couldn't show attributes of encrypted files even though the dataset was unlocked. The manpage already speaks to another symptom of this, namely that zdb may "behave erratically" on an active pool; here I'm trying to make that a little more explicit. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14539
* Makefile.bsd: cleanup and sync with FreeBSDAllan Jude2023-02-281-64/+112
| | | | | | | | | | | | | | | xxhash.c was not being compiled, so when FreeBSD's kernel switched to a newer version of ZSTD a few weeks ago, out-of-tree ZFS failed to build Sync module/Makefile.bsd with FreeBSD's sys/modules/zfs/Makefile And restore the alphabetical sort in a number of places Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Allan Jude <[email protected]> Sponsored-by: Klara, Inc. Closes #14508
* Suppress static analyzer warning in dmu_objset_create_impl_dnstats()Richard Yao2023-02-281-0/+1
| | | | | | | | | | | | | | Clang's static analyzer claims that dereferencing ds in dmu_objset_create_impl_dnstats() could cause a NULL pointer dereference when a previous NULL check confirms that it is NULL. It is only NULL on the MOS, for which dmu_objset_userused_enabled(os) should always return false, so ds will never be dereferenced when it is NULL. We add an assertion to suppress this warning. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14470
* Suppress static analyzer warning in dbuf_hold_copy()Richard Yao2023-02-281-1/+3
| | | | | | | | | | | | | | Clang's static analyzer claims that dbuf_hold_copy() will have a NULL pointer dereference in data->b_data when called by dbuf_hold_impl(). This is impossible because data is dr->dt.dl.dr_data, which is non-NULL whenever db->db_level == 0, which is always the case whenever dbuf_hold_impl() calls dbuf_hold_copy(). We add an assertion to suppress the complaint. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14470
* Statically allocate first node of zfsdev_state_listRichard Yao2023-02-281-7/+6
| | | | | | | | | | | | | This avoids a call to kmem_alloc() during module load. It also suppresses a defect report from Clang's static analyzer that claims that we will have a NULL pointer dereference in zfsdev_state_init() because it does not understand that this has already been allocated in zfs_kmod_init(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14470
* Suppress static analyzer warning in sa_attr_iter()Richard Yao2023-02-281-0/+1
| | | | | | | | | | | | | | Clang's static analyzer points out that when IS_SA_BONUSTYPE(type) is true and .sa_length is 0 for an attribute, we have a NULL pointer dereference. We suppress this with an IMPLY() statement. This was also identified by Coverity. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-1017954) Closes #14470
* Suppress static analyzer warnings in zio_checksum_error_impl()Richard Yao2023-02-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clang's static analyzer informs us of multiple NULL pointer dereferences involving zio_checksum_error_impl(). The first is a NULL pointer dereference if bp is NULL and ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is false, but bp is NULL implies that ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true, so we add an IMPLY() statement to suppress the report. The second and third are identical, and are duplicated because while the NULL pointer dereference occurs in zio_checksum_gang_verifier(), it is called by zio_checksum_error_impl() and there is a report for each of the two functions. The reports state that when bp is NULL, ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true and checksum is not ZIO_CHECKSUM_LABEL, we also have a NULL pointer dereference. bp is NULL should imply that checksum == ZIO_CHECKSUM_LABEL, so we add an IMPLY() statement to suppress the second report. The two reports are functionally identical. A fourth variation of this was also reported by Coverity. It occurs when checksum == ZIO_CHECKSUM_ZILOG2. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-1524672) Closes #14470
* icp: Prevent compilers from optimizing away memset() in gcm_clear_ctx()Richard Yao2023-02-282-31/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recently merged f58e513f7408f353bf0151fdaf235d4e062e8950 was intended to zero sensitive data before exit from encryption functions to harden the code against theoretical information leaks. Unfortunately, the method by which it did that is optimized away by the compiler, so some information still leaks. This was confirmed by counting function calls in disassembly. After studying how the OpenBSD, FreeBSD and Linux kernels handle this, and looking at our disassembly, I decided on a two-factor approach to protect us from compiler dead store elimination passes. The first factor is to stop trying to inline gcm_clear_ctx(). GCC does not actually inline it in the first place, and testing suggests that dead store elimination passes appear to become more powerful in a bad way when inlining is forced, so we recognize that and move gcm_clear_ctx() to a C file. The second factor is to implement an explicit_memset() function based on the technique used by `secure_zero_memory()` in FreeBSD's blake2 implementation, which coincidentally is functionally identical to the one used by Linux. The source for this appears to be a LLVM bug: https://llvm.org/bugs/show_bug.cgi?id=15495 Unlike both FreeBSD and Linux, we explicitly avoid the inline keyword, based on my observations that GCC's dead store elimination pass becomes more powerful when inlining is forced, under the assumption that it will be equally powerful when the compiler does decide to inline function calls. Disassembly of GCC's output confirms that all 6 memset() calls are executed with this patch applied. Reviewed-by: Attila Fülöp <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14544
* Linux: Assert mutex is held in mutex_exit()Richard Yao2023-02-281-0/+1
| | | | | | | | | | | | | | A spurious mutex_exit() in a development branch caused weird issues until I identified it. An assertion prior to mutex_exit() would have caught it. Rather than adding assertions before invocations of mutex_exit() in the code, let us simply add an assertion to mutex_exit(). It is cheap and will likely improve developer productivity. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Sponsored-By: Wasabi Technology, Inc. Closes #14541
* Revert zfeature_active() to staticGeorge Amanakis2023-02-282-2/+1
| | | | | | | | | Commit 34ce4c4 made zfeature_active() non-static. This is not required. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14546
* Fix github build failures because of vdevprops.7Tino Reichardt2023-02-281-0/+1
| | | | | | | | | | | This small fix adds the manpage vdevprops.7 to the file contrib/debian/openzfs-zfsutils.install and the github actions will work again. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #14553
* Prevent incorrect datasets being mountedJohn Poduska2023-02-271-2/+20
| | | | | | | | | | | | | | | | | | | | | During a mount, zpl_mount_impl(), uses sget() with the callback zpl_test_super() to find a super_block with a matching objset, stored in z_os. It does so without taking the teardown lock on the zfsvfs. The problem is that operations like rollback will replace the z_os. And, there is a window where the objset in the rollback is freed, but z_os still points to it. Then, a mount like operation, for instance a clone, can reallocate that exact same pointer and zpl_test_super() will then match the super_block associated with the rollback as opposed to the clone. This fix tests for a match and if so, takes the teardown lock before doing the final match test. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Poduska <[email protected]> Closes #14518
* Add vdevprops.7 to the MakefileD. Ebdrup2023-02-271-0/+1
| | | | | | | | | | | Adding vdevprops.7 to the Makefile ensures that it gets installed properly on FreeBSD. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Daniel Ebdrup Jensen <[email protected]> Closes #14527
* Skip memory allocation when compressing holesRichard Yao2023-02-275-16/+23
| | | | | | | | | | | Hole detection in the zio compression code allows us to opportunistically skip compression on holes. We can go a step further by not doing memory allocations on holes either. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Sponsored-by: Wasabi Technology, Inc. Closes #14500
* ICP: AES-GCM: Refactor gcm_clear_ctx()Attila Fülöp2023-02-274-55/+36
| | | | | | | | | | | | | | | | | Currently the temporary buffer in which decryption takes place isn't cleared on context destruction. Further in some routines we fail to call gcm_clear_ctx() on error exit. Both flaws may result in leaking sensitive data. We follow best practices and zero out the plaintext buffer before freeing the memory holding it. Also move all cleanup into gcm_clear_ctx() and call it on any context destruction. The performance impact should be negligible. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes #14528
* Move zap_attribute_t to the heap in dsl_deadlist_mergeMariusz Zaborski2023-02-271-10/+16
| | | | | | | | | | | | | | | In the case of a regular compilation, the compiler raises a warning for a dsl_deadlist_merge function, that the stack size is to large. In debug build this can generate an error. Move large structures to heap. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #14524
* Workaround GitHub Action failureBrian Behlendorf2023-02-274-8/+16
| | | | | | | | | | | | | | | Ubuntu 20.04 and 22.04 workflows are failing due to an error which is hit when running `apt-get update`. Until the problematic package is fixed apply the suggested workaround described here: https://github.com/orgs/community/discussions/47863 Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14530
* Use .section .rodata instead of .rodata on FreeBSDDimitry Andric2023-02-242-2/+2
| | | | | | | | | | | | | | | In commit 0a5b942d4 the FreeBSD SECTION_STATIC macro was set to ".rodata". This assembler directive is supported by LLVM (as a convenience alias for ".section .rodata") by not by GNU as. This caused the FreeBSD builds that are done with gcc to fail. Therefore, use ".section .rodata" instead, similar to the other asm_linkage.h headers. Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Attila Fülöp <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Dimitry Andric <[email protected]> Closes #14526
* Move dmu_buf_rele() after dsl_dataset_sync_done()George Amanakis2023-02-234-10/+4
| | | | | | | | | | Otherwise the dataset may be freed after the last dmu_buf_rele() leading to a panic. Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14522 Closes #14523
* ZTS: Minor fixesBrian Behlendorf2023-02-232-3/+3
| | | | | | | | | | | | | | - The migration_012_pos.ksh test case was failing because of a missing space after `log_must`. - None of the tests listed in the runfiles should include the .ksh suffix. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14515
* Fix buffered/direct/mmap I/O raceBrian Behlendorf2023-02-237-105/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13608 Closes #14498
* Fix NULL pointer dereference in zio_ready()Richard Yao2023-02-231-1/+1
| | | | | | | | | | | | Clang's static analyzer correctly identified a NULL pointer dereference in zio_ready() when ZIO_FLAG_NODATA has been set on a zio that is missing a block pointer. The NULL pointer dereference occurs because we have logic intended to disable ZIO_FLAG_NODATA when it has been set on a gang block. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14469
* Use rw_tryupgrade() in dmu_bonus_hold_by_dnode()Richard Yao2023-02-221-2/+4
| | | | | | | | | | | | | | When dn->dn_bonus == NULL, dmu_bonus_hold_by_dnode() will unlock its read lock on dn->dn_struct_rwlock and grab a write lock. This can be micro-optimized by calling rw_tryupgrade(). Linux will not benefit from this since it does not support rwlock upgrades, but FreeBSD will. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14517
* Improve error message of zfs redactPaul Dagnelie2023-02-211-3/+18
| | | | | | | | | | | We improve the error message of zfs redact by checking if the target snapshot exists, and if all the redaction snapshots exist. As a future improvement we could iterate over every snapshot provided and use that to determine which one specifically doesn't exist. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #11426 Closes #14496
* FreeBSD: don't verify recycled vnode for zfs control directoryrob-wing2023-02-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 #7 0xffffffff80669841 at vgone+0x31 #8 0xffffffff8065806d at vfs_hash_insert+0x26d #9 0xffffffff81a39069 at sfs_vgetx+0x149 #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #11 0xffffffff8065a28c at lookup+0x45c #12 0xffffffff806594b9 at namei+0x259 #13 0xffffffff80676a33 at kern_statat+0xf3 #14 0xffffffff8067712f at sys_fstatat+0x2f #15 0xffffffff808ae50c at amd64_syscall+0x10c #16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d #8 0xffffffff81a39069 at sfs_vgetx+0x149 #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #10 0xffffffff80661c2c at lookup+0x45c #11 0xffffffff80660e59 at namei+0x259 #12 0xffffffff8067e3d3 at kern_statat+0xf3 #13 0xffffffff8067eacf at sys_fstatat+0x2f #14 0xffffffff808b5ecc at amd64_syscall+0x10c #15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Rob Wing <[email protected]> Co-authored-by: Rob Wing <[email protected]> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes #14501
* Fix per-jail zfs.mount_snapshot settingAllan Jude2023-02-211-2/+14
| | | | | | | | | | | When jail.conf set the nopersist flag during startup, it was incorrectly destroying the per-jail ZFS settings. Reported-by: Martin Matuska <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Allan Jude <[email protected]> Sponsored-by: Modirum MDPay Sponsored-by: Klara, Inc. Closes #14509
* Partially revert eee9362a7George Amanakis2023-02-213-24/+7
| | | | | | | | | | With commit 34ce4c42f applied, there is no need for eee9362a7. Revert that aside from the test. All tests introduced in those commits pass. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14502
* Sync thread should avoid holding the spa config write lock when possibleRichard Yao2023-02-161-1/+7
| | | | | | | | | | | | | | | | | | | spa_sync() currently grabs the write lock due to an old hack that is documented by a comment: We need the write lock here because, for aux vdevs, calling vdev_config_dirty() modifies sav_config. This is ugly and will become unnecessary when we eliminate the aux vdev wart by integrating all vdevs into the root vdev tree. This has lead to deadlocks in rare edge cases from holding the write lock. We can reduce incidence of these deadlocks by not grabbing the write lock on pools without auxillary vdevs. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14282
* zfs redact fails when dnodesize=autoPaul Dagnelie2023-02-161-0/+2
| | | | | | | Add handling to dmu_object_next for the case where *objectp == 0. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #14479
* zdb: zero-pad checksum output follow upBrian Behlendorf2023-02-151-1/+1
| | | | | | | | | | | Apply zero padding for checksums consistently. The SNPRINTF_BLKPTR macro was not updated in commit ac7648179c8 which results in the `cli_root/zdb/zdb_checksum.ksh` test case reliably failing. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Akash B <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14497
* Suppress Clang static analyzer complaint in zfs_replay_create()Richard Yao2023-02-141-3/+3
| | | | | | | | | | | | | | Clang's static analyzer incorrectly complains about an undefined value here when lr->lr_common.lrc_txtype == TX_SYMLINK and txtype == TX_CREATE. This is impossible, because of this line: txtype = (lr->lr_common.lrc_txtype & ~TX_CI((uint64_t)0x1 << 63)); Changing the code to compare against txtype suppresses the report. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14472
* Linux: use filemap_range_has_page()Brian Behlendorf2023-02-1411-17/+65
| | | | | | | | | | As of the 4.13 kernel filemap_range_has_page() can be used to check if there is a page mapped in a given file range. When available this interface should be used which eliminates the need for the zp->z_is_mapped boolean. Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14493
* Give strlcat() full buffer lengths rather than smaller buffer lengthsRichard Yao2023-02-143-5/+4
| | | | | | | | | | | | | | | | | | strlcat() is supposed to be given the length of the destination buffer, including the existing contents. Unfortunately, I had been overzealous when I wrote a51288aabbbc176a8a73a8b3cd56f79607db32cf, since I gave it the length of the destination buffer, minus the existing contents. This likely caused a regression on large strings. On the topic of being overzealous, the use of strlcat() in dmu_send_estimate_fast() was unnecessary because recv_clone_name is a fixed length string. We continue using strlcat() mostly as defensive programming, in case the string length is ever changed, even though it is unnecessary. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14476
* quick fix for lingering snapdir unmount problemsRich Ercolani2023-02-131-1/+14
| | | | | | | | | | | | Unfortunately, even after e79b6807, I still, much more rarely, tripped asserts when playing with many ctldir mounts at once. Since this appears to happen if we dispatched twice too fast, just ignore it. We don't actually need to do anything if someone already started doing it for us. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14462
* Fix a race condition in dsl_dataset_sync() when activating featuresGeorge Amanakis2023-02-134-11/+34
| | | | | | | | | | | | | | | | | | | | The zio returned from arc_write() in dmu_objset_sync() uses zio_nowait(). However we may reach the end of dsl_dataset_sync() which checks if we need to activate features in the filesystem without knowing if that zio has even run through the ZIO pipeline yet. In that case we will flag features to be activated in dsl_dataset_block_born() but dsl_dataset_sync() has already completed its run and those features will not actually be activated. Mitigate this by moving the feature activation code in dsl_dataset_sync_done(). Also add new ASSERTs in dsl_scan_visitbp() checking if a block contradicts any filesystem flags. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #13816
* Reduce need for contiguous memory for ioctlsPrakash Surya2023-02-132-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free: kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22071296kB The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region: kernel: SafeExecutors-4 invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0 The offending allocation came from this call trace: kernel: Call Trace: kernel: dump_stack+0x57/0x7a kernel: dump_header+0x4f/0x1e1 kernel: oom_kill_process.cold.33+0xb/0x10 kernel: out_of_memory+0x1ad/0x490 kernel: __alloc_pages_slowpath+0xd55/0xe40 kernel: __alloc_pages_nodemask+0x2df/0x330 kernel: kmalloc_large_node+0x42/0x90 kernel: __kmalloc_node+0x25a/0x320 kernel: ? spl_kmem_free_impl+0x21/0x30 [spl] kernel: spl_kmem_alloc_impl+0xa5/0x100 [spl] kernel: spl_kmem_zalloc+0x19/0x20 [spl] kernel: zfsdev_ioctl+0x2b/0xe0 [zfs] kernel: do_vfs_ioctl+0xa9/0x640 kernel: ? __audit_syscall_entry+0xdd/0x130 kernel: ksys_ioctl+0x67/0x90 kernel: __x64_sys_ioctl+0x1a/0x20 kernel: do_syscall_64+0x5e/0x200 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 kernel: RIP: 0033:0x7fdca3674317 The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system): sdb> sizeof zfs_cmd (size_t)13744 This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request. The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation. Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #14474