path: root/module
Commit message | Author | Age | Files | Lines
* Use <fcntl.h> instead of <sys/fcntl.h> | Sam James | 2024-11-07 | 1 | -1/+3
When building on musl, we get:
```
In file included from tests/zfs-tests/cmd/getversion.c:22:
/usr/include/sys/fcntl.h:1:2: error: #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h> [-Werror=cpp]
    1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>
In file included from module/os/linux/zfs/vdev_file.c:36:
/usr/include/sys/fcntl.h:1:2: error: #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h> [-Werror=cpp]
    1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>
```
Bug: https://bugs.gentoo.org/925235
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Sam James <[email protected]>
Closes #15925
* Update ABD stats for linear page Linux | Brian Atkinson | 2024-11-07 | 1 | -0/+2
a10e552 updated abd_free_linear_page() to no longer call abd_update_scatter_stats(). This meant that linear pages that were not attached to Direct I/O requests were not doing waste accounting for the ARC. This led to performance issues due to incorrect ARC accounting that resulted in 100% of CPU time being spent in arc_evict() during prolonged I/O workloads with the ARC.

abd_update_scatter_stats() is now called in abd_free_linear_page() when the ABD is not from a Direct I/O request.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Closes #16729
* ZFS send should use spill block prefetched from send_reader_thread | Chunwei Chen | 2024-11-06 | 1 | -62/+64
Currently, even though send_reader_thread prefetches the spill block, do_dump() will not use it and issues its own blocking arc_read. This causes significant performance degradation when sending datasets with lots of spill blocks.

For unmodified spill blocks, we also create a send_range struct for them in send_reader_thread and issue prefetches for them. We piggyback them on the dnode send_range instead of enqueueing them so we don't break the send_range_after check.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Co-authored-by: david.chen <[email protected]>
Closes #16701
* Use simple folio migration function | tstabrawa | 2024-11-06 | 1 | -0/+6
Avoids using fallback_migrate_folio, which starts unnecessary writeback (leading to BUG in migrate_folio_extra).

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes #16568
Closes #16723
* Revert "Avoid BUG in migrate_folio_extra" | tstabrawa | 2024-11-06 | 3 | -71/+0
This reverts commit b052035990594408899fa32fd4ad6603b75b6c6d.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes #16568
Closes #16723
* module: unicode: remove unused tolower transformations | наб | 2024-11-04 | 1 | -5/+32
With the previous patch this yields

    $ size -G ./module/zfs.ko ./module/zfs.new.ko
       text    data    bss   total filename
    2865126 1597982 755768 5218876 ./module/zfs.ko
    2864038 1429784 755768 5049590 ./module/zfs.new.ko
    -1088 -168198 -1k -164k

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #16704
* Reduce dirty records memory usage | Alexander Motin | 2024-11-04 | 4 | -11/+22
Small block workloads may use a very large number of dirty records. During a simple block cloning test, because BRT still uses 4KB blocks, I can easily see up to 2.5M of those in use. Before this change, the dbuf_dirty_record_t structures representing them were allocated via kmem_zalloc(), which rounded their size up to 512 bytes.

Introducing a specialized kmem cache reduces the size from 512 to 408 bytes. Additionally, since the override and raw params in dirty records are mutually exclusive, putting them into a union reduces the structure size to 368 bytes, increasing the saving to 28%, which can be 0.5GB or more of RAM.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16694
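The union trick described in this commit can be illustrated with a minimal sketch. The struct and field names below are hypothetical stand-ins, not the real dbuf_dirty_record_t layout, and the sizes are made up; the only point is that two mutually exclusive parameter sets can share storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter sets; names and sizes are invented for the demo. */
struct demo_override_params {
	uint64_t op_blkptr[16];
	int op_state;
};

struct demo_raw_params {
	uint8_t rp_salt[8];
	uint8_t rp_iv[12];
	uint8_t rp_mac[16];
	int rp_byteorder;
};

/* Layout 1: both parameter sets always occupy space. */
struct demo_record_separate {
	uint64_t dr_txg;
	struct demo_override_params dr_override;
	struct demo_raw_params dr_raw;
};

/* Layout 2: only one set is live at a time, so they share storage. */
struct demo_record_union {
	uint64_t dr_txg;
	int dr_is_raw;	/* records which union member is currently valid */
	union {
		struct demo_override_params dr_override;
		struct demo_raw_params dr_raw;
	} dr_u;
};

/* The union layout must come out smaller for the trick to pay off. */
static_assert(sizeof (struct demo_record_union) <
    sizeof (struct demo_record_separate), "union should shrink the record");

/* Bytes saved per record by sharing storage. */
static size_t
demo_bytes_saved(void)
{
	return (sizeof (struct demo_record_separate) -
	    sizeof (struct demo_record_union));
}
```

With millions of records outstanding, even a few dozen bytes per record add up, which is the effect the commit message quantifies for the real structure.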
* zfs(4): remove "experimental" from zfs_bclone_enabled | Rob Norris | 2024-11-01 | 1 | -3/+3
I think we've done enough experiments.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16189
Closes #16712
* module: unicode: remove unused uconv.c | наб | 2024-11-01 | 3 | -863/+2
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #16702
* Revert "Workaround issue of Linux vdev_disk.c, (#16678)" | Rob Norris | 2024-10-31 | 1 | -14/+0
Now that we can handle these different alignments, we don't need this workaround.

This reverts commit aefc2da8a594d7a8059c862eab464d5f798393b3.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687
* vdev_disk: move abd return and free off the interrupt handler | Rob Norris | 2024-10-31 | 1 | -13/+27
Freeing an ABD can take sleeping locks to update various stats. We aren't allowed to sleep on an interrupt handler. So, move the free off to the io_done callback.

We should never have been freeing things in the interrupt handler, but we got away with it because we were usually freeing a linear ABD, which at most is returning two objects to a cache and never sleeping. Scatter ABDs can be used now, and those have more complex locking.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687
* vdev_disk: try harder to ensure IO alignment rules | Rob Norris | 2024-10-31 | 1 | -53/+67
It turns out our notion of "properly" aligned IO was incomplete. In particular, dm-crypt does its own splitting, and assumes that a logical block will never cross an order-0 page boundary (ie, the physical page size, not compound size). This effectively means that it needs to be possible to split a BIO at any page or block size boundary and have it work correctly.

This updates the alignment check function to enforce these rules (to the extent possible).

Our response to misaligned data is to make some new allocation that is properly aligned, and copy the data into it. It turns out that linearising (via abd_borrow_buf()) is not enough, because we allocate eg 4K blocks from a general purpose slab, and so may receive (or already have) a 4K block that crosses pages. So instead, we allocate a new ABD, which is guaranteed to be aligned properly to block sizes, and then copy everything into it, and back out on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687 #16631 #15646 #15533 #14533
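The rule that a logical block must never cross an order-0 page boundary can be sketched as a small check. This is an illustration only, not the alignment function in vdev_disk.c; the function names are hypothetical, and addresses are modelled as plain integers rather than ABD chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the block starting at addr straddles a
 * page boundary. Assumes the block is no larger than one page.
 */
static bool
demo_block_crosses_page(uintptr_t addr, size_t blksize, size_t pagesize)
{
	assert(blksize > 0 && blksize <= pagesize);
	return ((addr / pagesize) != ((addr + blksize - 1) / pagesize));
}

/*
 * Hypothetical helper: true if every logical block in the buffer
 * [addr, addr + len) stays inside a single page.
 */
static bool
demo_buf_is_aligned(uintptr_t addr, size_t len, size_t blksize,
    size_t pagesize)
{
	assert(len % blksize == 0);
	for (size_t off = 0; off < len; off += blksize) {
		if (demo_block_crosses_page(addr + off, blksize, pagesize))
			return (false);
	}
	return (true);
}
```

For example, a 4K block that begins 2K into a 4K page fails the check, which is the kind of buffer the commit says can come out of a general purpose slab.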
* Add warning for external consumers of dmu_tx_callback_register | Serapheim Dimitropoulos | 2024-10-30 | 1 | -0/+7
While reading some code @grwilson came across the above function that seemingly had no consumers besides a ztest callback that ensures that the tx_callback infrastructure works correctly.

It turns out that Lustre is the main (and potentially the only) consumer of this. Refer to `osd_trans_commit_cb` of `lustre/osd-zfs/osd_handler.c` in the Lustre repo for more info.

Let's add a comment highlighting this before someone removes it by mistake.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #16698
* On the first vdev open ignore impossible ashift hints | Alexander Motin | 2024-10-29 | 1 | -2/+3
If, on the first open, the device's logical ashift is bigger than the one set by the pool's ashift property, ignore the latter as unusable instead of creating a vdev that will fail most I/Os due to misalignment.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16690
* Fix gcc uninitialized warning in FreeBSD zio_crypt.c | Dimitry Andric | 2024-10-29 | 1 | -3/+2
In FreeBSD's `zio_do_crypt_data()`, ensure that two `struct uio` variables are cleared before copying data out of them. This avoids accessing garbage data, and fixes gcc `-Wuninitialized` warnings.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Toomas Soome <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Dimitry Andric <[email protected]>
Closes #16688
* Workaround issue of Linux vdev_disk.c, (#16678) | Alexander Motin | 2024-10-23 | 1 | -0/+14
in some cases not linearizing buffers with disk sector crossing a page boundary. It is fine for hardware, but somehow required by LUKS. It is not typical for ZFS to produce such buffers, but it may happen if a 6KB block is compressed to 4KB, while still having 2KB alignment. Banning the 6KB buffers helps vdevs with ashift=12.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
* config: fix dequeue_signal check for kernels <4.20 | Rob Norris | 2024-10-20 | 1 | -3/+3
Before 4.20, kernel_siginfo_t was just called siginfo_t. This was causing the kthread_dequeue_signal_3arg_task check, which uses kernel_siginfo_t, to fail on older kernels.

In d6b8c17f1, we started checking for the "new" three-arg dequeue_signal() by testing for the "old" version. Because that test is explicitly using kernel_siginfo_t, it would fail, leading to the build trying to use the new three-arg version, which would then not compile.

This commit fixes that by avoiding checking for the old 3-arg dequeue_signal entirely. Instead, we check for the new one, as well as the 4-arg form, and we use the old form as a fallback. This way, we never have to test for it explicitly, and once we're building, HAVE_SIGINFO will make sure we get the right kernel_siginfo_t for it, so everything works out nicely.

Original-patch-by: Finix <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16666
* Fix inconsistent mount options for ZFS root | Umer Saleem | 2024-10-17 | 2 | -7/+108
While mounting ZFS root during boot on Linux distributions from initrd, mount from busybox is effectively used which executes mount system call directly. This skips the ZFS helper mount.zfs, which checks and enables the mount options as specified in dataset properties. As a result, datasets mounted during boot from initrd do not have correct mount options as specified in ZFS dataset properties.

There has been an attempt to use mount.zfs in zfs initrd script, responsible for mounting the ZFS root filesystem (PR#13305). This was later reverted (PR#14908) after discovering that using mount.zfs breaks mounting of snapshots on root (/) and other child datasets of root have the same issue (Issue#9461).

This happens because switching from busybox mount to mount.zfs correctly parses the mount options but also adds 'mntpoint=/root' to the mount options, which is then prepended to the snapshot mountpoint in '.zfs/snapshot'. '/root' is the directory on Debian with initramfs-tools where root filesystem is mounted before pivot_root. When Linux runtime is reached, trying to access the snapshots on root results in automounting the snapshot on '/root/.zfs/*', which fails.

This commit attempts to fix the automounting of snapshots on root, while using mount.zfs in initrd script. Since the mountpoint of dataset is stored in vfs_mntpoint field, we can check if current mountpoint of dataset and vfs_mntpoint are same or not. If they are not same, reset the vfs_mntpoint field with current mountpoint. This fixes the mountpoints of root dataset and children in respective vfs_mntpoint fields when we try to access the snapshots of root dataset or its children. With correct mountpoint for root dataset and children stored in vfs_mntpoint, all snapshots of root dataset are mounted correctly and become accessible.

This fix will come into play only if current process, that is trying to access the snapshots is not in chroot context. The Linux kernel API that is used to convert struct path into char format (d_path), returns the complete path for given struct path. It works in chroot environment as well and returns the correct path from original filesystem root. However d_path fails to return the complete path if any directory from original root filesystem is mounted using --bind flag or --rbind flag in chroot environment. In this case, if we try to access the snapshot from outside the chroot environment, d_path returns the path correctly, i.e. it returns the correct path to the directory that is mounted with --bind flag. However inside the chroot environment, it only returns the path inside chroot.

For now, there is not a better way in my understanding that gives the complete path in char format and handles the case where directories from root filesystem are mounted with --bind or --rbind on another path which user will later chroot into. So this fix gets enabled if current process trying to access the snapshot is not in chroot context.

With the snapshots issue fixed for root filesystem, using mount.zfs in ZFS initrd script, mounts the datasets with correct mount options.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes #16646
* Revert "Temporarily disable Direct IO by default" | Brian Behlendorf | 2024-10-12 | 1 | -0/+7
This partially reverts commit 41210597. Now that b4e4cbeb2 has been merged, Direct IO can be enabled by default for Linux, but for FreeBSD there still remains potentially insufficient range locking in zfs_getpages() which needs to be resolved.

Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #16629
* Always validate checksums for Direct I/O reads | Brian Atkinson | 2024-10-09 | 10 | -49/+226
This fixes an oversight in the Direct I/O PR. There is nothing that stops a process from manipulating the contents of a buffer for a Direct I/O read while the I/O is in flight. This can lead to checksum verify failures. However, the disk contents are still correct, and this would lead to false reporting of checksum validation failures.

To remedy this, all Direct I/O reads that have a checksum verification failure are treated as suspicious. In the event a checksum validation failure occurs for a Direct I/O read, then the I/O request will be reissued through the ARC. This allows for actual validation to happen and removes any possibility of the buffer being manipulated after the I/O has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O read checksum validation failures are reported through zpool status -d in the DIO column. Also the zevent has been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes
2. dio_verify_rd -> Checksum verification failure for reads.
This allows for determining what I/O operation was the culprit for the checksum verification failure. All DIO errors are reported only on the top-level VDEV.

Even though FreeBSD can write protect pages (stable pages) it still has the same issue as Linux with Direct I/O reads.

This commit updates the following:
1. Propagates checksum failures for reads all the way up to the top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and handle ABD buffer contents validation the same as Linux.
5. Updated manipulate_user_buffer.c to also manipulate a buffer while a Direct I/O read is taking place.
6. Adds a new ZTS test case dio_read_verify that stress tests the new code.
7. Updated man pages.
8. Added an IMPLY statement to zio_checksum_verify() to make sure that Direct I/O reads are not issued as speculative.
9. Removed self healing through mirror, raidz, and dRAID VDEVs for Direct I/O reads.

This issue was first observed when installing a Windows 11 VM on a ZFS dataset with the dataset property direct set to always. The zpool devices would report checksum failures, but running a subsequent zpool scrub would not repair any data and report no errors.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Closes #16598
* Fix generation of kernel uevents for snapshot rename on linux | JKDingwall | 2024-10-06 | 1 | -2/+9
`zvol_rename_minors()` needs to be given the full path not just the snapshot name. Use code removed in a0bd735ad as a guide to providing the necessary values.

Add ZTS check for /dev changes after snapshot rename. After renaming a snapshot with 'snapdev=visible' ensure that the /dev entries are updated to reflect the rename.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: James Dingwall <[email protected]>
Closes #14223
Closes #16600
* ARC: Cache arc_c value during arc_evict() | Alexander Motin | 2024-10-04 | 1 | -7/+8
Since an arc_evict() run can take some time, a change of arc_c during it may result in an undesired shift in the balance of ARC states. Primarily, in the case of an arc_c reduction, it may cause eviction from the MFU data state despite it already being below the target. Instead we should evict as originally planned and, if needed, do another round after.

Reviewed-by: Theera K. <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16576
Closes #16605
* Defer resilver only when progress is above a threshold | Pavel Snajdr | 2024-10-04 | 1 | -14/+39
Restart a resilver from scratch, if the current one in progress is below a new tunable, zfs_resilver_defer_percent (defaulting to 10%).

The original rationale for deferring additional resilvers, when there is already one in progress, was to help achieve data redundancy sooner for the data that gets scanned at the end of the resilver.

But in case the admin wants to attach multiple disks to a single vdev, it wasn't immediately obvious that the admin is supposed to run `zpool resilver` afterwards to reset the deferred resilvers and start a new one from scratch.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pavel Snajdr <[email protected]>
Closes #15810
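The threshold decision described in this commit can be sketched as follows. This is not the scan code from the patch: the function and enum names are hypothetical, progress is modelled as a simple done/total ratio, and the behaviour at exactly the threshold (defer) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of a new resilver request. */
enum demo_resilver_action {
	DEMO_RESILVER_RESTART,	/* throw away progress, start from scratch */
	DEMO_RESILVER_DEFER	/* keep going, queue the new work for later */
};

/*
 * Hypothetical decision helper. defer_percent plays the role of the
 * zfs_resilver_defer_percent tunable (default 10 per the commit message).
 */
static enum demo_resilver_action
demo_resilver_choose(uint64_t done, uint64_t total, unsigned defer_percent)
{
	assert(done <= total);

	/* Nothing to scan yet: restarting costs nothing. */
	if (total == 0)
		return (DEMO_RESILVER_RESTART);

	/* Floating point avoids overflow in done * 100 for huge pools. */
	double pct = (double)done * 100.0 / (double)total;

	return (pct < (double)defer_percent ?
	    DEMO_RESILVER_RESTART : DEMO_RESILVER_DEFER);
}
```

With the default of 10, a resilver that is 5% complete would be restarted, while one that is 50% complete would be allowed to finish with the new work deferred.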
* feature: large_microzap | Rob Norris | 2024-10-02 | 5 | -7/+103
In a4b21eadec we added the zap_micro_max_size tuneable to raise the size at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block) ZAPs. Before this, a microZAP was limited to 128KiB, which was the old largest block size.

The side effect of raising the max size past 128KiB is that it will be stored in a large block, requiring the large_blocks feature. Unfortunately, this means that a backup stream created without the --large-block (-L) flag to zfs send would split the microZAP block into smaller blocks and send those, as is normal behaviour for large blocks. This would be received correctly, but since microZAPs are limited to the first block in the object by definition, the entries in the later blocks would be inaccessible. For directory ZAPs, this gives the appearance of files being lost.

This commit adds a feature flag, large_microzap, that must be enabled for microZAPs to grow beyond 128KiB, and which will be activated the first time that occurs. This feature is later checked when generating the stream and if active, the send operation will abort unless --large-block has also been requested.

Changing the limit still requires zap_micro_max_size to be changed. The state of this flag effectively sets the upper value for this tuneable, that is, if the feature is disabled, the tuneable will be clamped to 128KiB.

A stream flag is also added to ensure that the receiver also activates its own feature flag upon receiving the stream. This is not strictly necessary to _use_ the received microZAP, since it doesn't care how large its block is, but it is required to send the microZAP object on, otherwise the original problem occurs again.

Because it's difficult to reliably distinguish a microZAP from a fatZAP from outside the ZAP code, and because it seems unlikely that most users are affected (a fairly niche tuneable combined with what should be an uncommon use of send), and for the sake of expediency, this change activates the feature the first time a microZAP grows to use a large block, and is never deactivated after that. This can be improved in the future.

This commit changes nothing for existing pools that already have large microZAPs. The feature will not be retroactively applied, but will be activated the next time a microZAP grows past the limit.

Don't use large_blocks feature for enable/disable tests. The large_microzap depends on large_blocks, so it gets enabled as a dependency, breaking the test. Instead use feature "longname", which has the exact same feature characteristics.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16593
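The two rules in this commit message (clamp the tuneable to 128KiB while the feature is disabled, and refuse to send without --large-block once the feature is active) can be sketched as below. The helper names are hypothetical and this is not the ZAP or send code; only the 128KiB limit and the two conditions come from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old microZAP limit from the commit message: 128KiB. */
#define	DEMO_MICROZAP_LEGACY_MAX	(128ULL * 1024)

static_assert(DEMO_MICROZAP_LEGACY_MAX == 131072, "128KiB in bytes");

/*
 * Hypothetical helper: the limit actually applied, given the value of a
 * zap_micro_max_size-style tuneable and whether the feature is enabled.
 */
static uint64_t
demo_micro_max_effective(uint64_t tunable, bool feature_enabled)
{
	if (!feature_enabled && tunable > DEMO_MICROZAP_LEGACY_MAX)
		return (DEMO_MICROZAP_LEGACY_MAX);
	return (tunable);
}

/*
 * Hypothetical helper: a send may proceed unless the feature is active
 * and the large-block flag was not requested.
 */
static bool
demo_send_allowed(bool feature_active, bool large_block_flag)
{
	return (!feature_active || large_block_flag);
}
```

Under these assumptions, raising the tuneable to 1MiB has no effect until the feature is enabled, and an active feature turns a plain send into an error instead of a silently truncated microZAP.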
* Temporarily disable Direct IO by default | Brian Behlendorf | 2024-10-02 | 1 | -1/+1
While some remaining issues are resolved with the recently merged Direct IO functionality, disable it by default.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #16597
* snapdir: add 'disabled' value to make .zfs inaccessible | Brian Behlendorf | 2024-10-02 | 7 | -6/+37
In some environments, just making the .zfs control dir hidden from sight might not be enough. In particular, the following scenarios might warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents

Introducing a new 'disabled' value that not only hides the control dir, but prevents access to its contents by returning ENOENT solves all of the above.

The new property value takes advantage of 'iuv' semantics ("ignore unknown value") to automatically fall back to the old default value when a pool is accessed by an older version of ZFS that doesn't yet know about 'disabled' semantics.

I think that technically the zfs_dirlook change is enough to prevent access, but preventing lookups and dir entries in an already opened .zfs handle might also be a good idea to prevent races when modifying the property at runtime.

Add zfs_snapshot_no_setuid parameter to control whether automatically mounted snapshots have the setuid mount option set or not. This could be considered a partial fix for one of the scenarios mentioned above.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Co-authored-by: Fabian Grünbichler <[email protected]>
Closes #3963
Closes #16587
* Avoid computing strlen() inside loops | rilysh | 2024-10-02 | 2 | -4/+6
When compiling with -O0 (no proper optimizations), a strlen() call in a loop condition is not hoisted out of the loop, so strlen() ends up being called n times (as many times as the string is long). Computing the length before entering the loop is a good idea.

In some places, even with -O2, neither GCC nor Clang can recognize this pattern, which seems to happen with an array of char pointers.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: rilysh <[email protected]>
Closes #16584
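A minimal sketch of the pattern this commit describes, using hypothetical helper names rather than the code touched by the patch: the first function re-evaluates strlen() in the loop condition, the second computes the length once up front.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Before: strlen(s) is part of the loop condition. */
static size_t
demo_count_slow(const char *s, char c)
{
	size_t n = 0;

	for (size_t i = 0; i < strlen(s); i++) {
		if (s[i] == c)
			n++;
	}
	return (n);
}

/* After: the length is computed once, before the loop starts. */
static size_t
demo_count_fast(const char *s, char c)
{
	assert(s != NULL);

	const size_t len = strlen(s);
	size_t n = 0;

	for (size_t i = 0; i < len; i++) {
		if (s[i] == c)
			n++;
	}
	return (n);
}
```

Both functions return the same count; the difference is only in how often the string is walked when the compiler does not hoist the call itself.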
* Linux 6.12: PG_error flag was removed | Rob Norris | 2024-10-01 | 1 | -0/+1
torvalds/linux@09022bc196d2 removes the flag, and the corresponding SetPageError() and ClearPageError() macros, with no replacement offered.

Going back through the upstream history, use of this flag has been gradually removed over the last year as part of the long tail of converting everything to folios. Interesting tidbit comments from torvalds/linux@29e9412b250e and torvalds/linux@420e05d0de18 suggest that this flag has not been used meaningfully since page writeback failures started being recorded in errseq_t instead (the whole "fsyncgate" thing, ~2017, around torvalds/linux@8ed1e46aaf1b).

Given that, it's possible that since perhaps Linux 4.13 we haven't been getting anything by setting the flag. I don't know if that's true and/or if there's something we should be doing instead, but my gut feel is that it's probably fine, since we only use the page cache as a proxy to allow mmap() to work, rather than backing IO with it.

As such, I'm expecting that removing this will do no harm, but I'm leaving it in for older kernels to maintain status quo, and if there is an overall better way, that is left for a future change.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16582
* Linux 6.12: support 3arg dequeue_signal() without task param | Rob Norris | 2024-10-01 | 1 | -8/+10
See torvalds/linux@a2b80ce87a87. It claims the task arg is always `current`, and so it is with us, so this is a safe change to make. The only spanner is that we also support the older pre-5.17 3-arg dequeue_signal() which had different meaning, so we have to check the types to get the right one.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16582
* Support for longnames for files/directories (Linux part) | Sanjeev Bagewadi | 2024-10-01 | 19 | -43/+372
This patch adds the ability for zfs to support file/dir names up to 1023 bytes. This number is chosen so we can support up to 255 4-byte characters. This new feature is represented by the new feature flag feature@longname.

A new dataset property "longname" is also introduced to toggle longname support for each dataset individually. This property can be disabled, even if the dataset contains longname files. In such a case, new files cannot be created with a longname but existing longname files can still be looked up.

Note that, to my knowledge, native Linux filesystems don't support names longer than 255 bytes. So there might be programs not able to work with longname.

Note that the NFS server may need to use exportfs_get_name to reconnect dentries, and the buffer being passed is limited to NAME_MAX+1 (256). So NFS may not work when longname is enabled.

Note, the FreeBSD vfs layer imposes a limit of 255 on name length, so even though we add code to support it here, it won't actually work.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #15921
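The limits and the property behaviour described above can be sketched as two small predicates. The names are hypothetical and this is not the ZPL code; the sketch assumes the feature is active on the pool and only models the per-dataset property toggle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	DEMO_NAME_MAX		255	/* classic limit in bytes */
#define	DEMO_LONGNAME_MAX	1023	/* limit from the commit message */

/* 255 characters of up to 4 bytes each fit under the new limit. */
static_assert(255 * 4 <= DEMO_LONGNAME_MAX, "255 4-byte chars must fit");

/*
 * Hypothetical check for creating a new name: long names are accepted
 * only while the dataset's longname property is on.
 */
static bool
demo_can_create(size_t namelen, bool longname_on)
{
	size_t limit = longname_on ? DEMO_LONGNAME_MAX : DEMO_NAME_MAX;

	return (namelen > 0 && namelen <= limit);
}

/*
 * Hypothetical check for looking up an existing name: existing long
 * names stay reachable even after the property is turned off.
 */
static bool
demo_can_lookup(size_t namelen)
{
	return (namelen > 0 && namelen <= DEMO_LONGNAME_MAX);
}
```

So a 600-byte name created while the property was on can still be looked up after it is turned off, but a new 600-byte name cannot be created.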
* Allocate zap_attribute_t from kmem instead of stack | Sanjeev Bagewadi | 2024-10-01 | 31 | -296/+420
This patch is preparatory work for the long name feature. It changes all users of zap_attribute_t to allocate it from kmem instead of the stack. It also makes the zap_attribute_t and zap_name_t structures variable length.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #15921
* Restrict raidz faulted vdev count | Don Brady | 2024-10-01 | 1 | -10/+33
Specifically, a child in a replacing vdev won't count when assessing the dtl during a vdev_fault().

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #16569
* lua: add flex array field to TString type | Rob Norris | 2024-09-30 | 4 | -12/+15
Linux 6.10+ with CONFIG_FORTIFY_SOURCE notices memcpy() accessing past the end of TString, because it has no indication that there may be an additional allocation there.

There's no appropriate upstream change for this (ancient) version of Lua, so this is the narrowest change I could come up with to add a flex array field to the end of TString to satisfy the check. It's loosely based on changes from lua/lua@ca41b43f and lua/lua@9514abc2.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16541
Closes #16583
* zfs_log: add flex array fields to log record structs | Rob Norris | 2024-09-27 | 2 | -113/+120
ZIL log record structs (lr_XX_t) are frequently allocated with extra space after the struct to carry variable-sized "payload" items.

Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime bounds checking on memcpy() calls. Because these types had no indicator that they might use more space than their simple definition, __fortify_memcpy_chk will frequently complain about overruns eg:

    memcpy: detected field-spanning write (size 7) of single field "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)

To fix this, this commit adds flex array fields to all lr_XX_t structs that require them, and then uses those fields to access that end-of-struct area rather than more complicated casts and pointer addition.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16501
Closes #16539
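The flex array approach can be shown with a minimal userspace sketch. The record type below is hypothetical, not one of the lr_XX_t structs, and malloc() stands in for the kernel allocator; the point is that the payload is reached through a named flexible array member instead of `lr + 1` pointer arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical log record with a variable-sized name payload. */
typedef struct demo_rec {
	uint64_t rec_txg;
	uint64_t rec_namelen;	/* payload size, including the NUL */
	char rec_name[];	/* flexible array member: the payload */
} demo_rec_t;

/* Allocate a record with room for the name and copy the name in. */
static demo_rec_t *
demo_rec_alloc(const char *name)
{
	assert(name != NULL);

	size_t len = strlen(name) + 1;
	demo_rec_t *r = malloc(sizeof (*r) + len);

	if (r == NULL)
		return (NULL);

	r->rec_txg = 0;
	r->rec_namelen = len;
	/* Destination is the named field, not (char *)(r + 1). */
	memcpy(r->rec_name, name, len);
	return (r);
}
```

Because the destination is a declared array at the end of the struct, bounds-checking tools have a field to reason about, whereas a write to `r + 1` looks like it targets a zero-sized object, as in the warnings quoted above.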
* Avoid BUG in migrate_folio_extratstabrawa2024-09-263-0/+71
| | | | | | | | | | | Linux page migration code won't wait for writeback to complete unless it needs to call release_folio. Call SetPagePrivate wherever PageUptodate is set and define .release_folio, to cause fallback_migrate_folio to wait for us. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: tstabrawa <[email protected]> Closes #15140 Closes #16568
* Properly release key in spa_keystore_dsl_key_hold_dd()Alexander Motin2024-09-251-1/+1
| | | | | | | | | | | Since dsl_crypto_key_open() references the key, 0d23f5e2e4 should have called dsl_crypto_key_rele() to drop it first instead of calling dsl_crypto_key_free() directly. The final result should actually be the same, but without triggering dck_holds assertion. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16567
* FreeBSD: Sync taskq_cancel_id() returns with LinuxAlexander Motin2024-09-241-2/+2
| | | | | | | | | | | A couple of places in the code depend on 0 being returned only if the task was actually cancelled. Doing otherwise could lead to extra references being dropped. The race could be small, but I believe CI hit it from time to time. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16565
* Add missing guard defines for simd_statw0xel2024-09-241-0/+6
| | | | | | | | | This adds the HAVE_KERNEL_NEON and HAVE_KERNEL_FPU_INTERNAL guards to simd_stat.c defaulted to 0 to make it build again. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Shengqi Chen <[email protected]> Signed-off-by: Sebastian Wuerl <[email protected]> Closes #16558
* Evicting too many bytes from MFU metadataTheera K.2024-09-231-1/+1
| | | | | | | | | | | | Without updating 'm' we evict from MFU metadata all that we wanted to evict from all metadata, including already evicted MRU metadata ('m' is the total amount of metadata we had at the beginning, and 'w' is the total amount of metadata we want to have). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Theera K. <[email protected]> Closes #16521 Closes #16546
* linux: log a scary warning when used with an experimental kernelRob Norris2024-09-231-0/+6
| | | | | | | | | | | | | Since the person using the kernel may not be the person who built it, show a warning at module load too, in case they aren't aware that it might be weird. Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15986
* xattr dataset prop: change defaults to saGeorge Melikov2024-09-232-3/+3
| | | | | | | | | | | | | | | It's the main recommendation to set xattr=sa even in man pages, so let's set it by default. xattr=sa doesn't use a feature flag, so in the worst case the xattrs will be unreadable on other non-OpenZFS platforms. A non-overridden default `xattr` prop on existing pools will automatically use `sa` after this commit too. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #15147
* Fix /proc/spl/kstat/simd on x86Rich Ercolani2024-09-221-1/+9
| | | | | | | | Evidently while reworking it on aarch64, I broke it on x86 and didn't notice. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #16556
* FreeBSD: restore zfs_znode_update_vfs()Rob Norris2024-09-211-0/+12
| | | | | | | | | | | I accidentally removed this in c22d56e3e, and didn't notice because it doesn't fail the build, but does fail to load into the kernel because it can't link it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16554
* Add SIMD metadata in /proc on Linux follow upBrian Behlendorf2024-09-201-3/+0
| | | | | | | | | This change accidentally broke the FreeBSD build due to a conflict between the simd_stat_init()/simd_stat_fini() macros on FreeBSD and the extern function prototype. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #16552
* Add SIMD metadata in /proc on LinuxRich Ercolani2024-09-203-0/+196
| | | | | | | | | | | | | | Too many times, people's performance problems have amounted to "somehow your SIMD support isn't working", and determining that at runtime is difficult to describe to people. This adds a /proc/spl/kstat/zfs/simd node, which exposes metadata about which instructions ZFS thinks it can use, on AArch64 and x86_64 Linux, to make investigating things like this much easier. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #16530
* arc_hdr_authenticate: make explicit errorGeorge Melikov2024-09-191-2/+6
| | | | | | | | | | | On compression we could be more explicit here for cases where we can not recompress the data. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9416
* ZLE compression: don't use BPE_PAYLOAD_SIZEGeorge Melikov2024-09-192-5/+11
| | | | | | | | | | | | | ZLE compressor needs additional bytes to process d_len argument efficiently. Don't use BPE_PAYLOAD_SIZE as d_len with it before we rework zle compressor somehow. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9416
* zio_compress: introduce max size thresholdGeorge Melikov2024-09-194-18/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | Now default compression is lz4, which can stop compression process by itself on incompressible data. If there are additional size checks - we will only make our compressratio worse. New usable compression thresholds are: - less than BPE_PAYLOAD_SIZE (embedded_data feature); - at least one saved sector. Old 12.5% threshold is left to minimize the effect on existing user expectations of CPU utilization. If data wasn't compressed - it will be saved as ZIO_COMPRESS_OFF, so if we really need to recompress data without ashift info and check anything - we can just compress it with zero threshold. So, we don't need a new feature flag here! Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9416
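One way to read the two thresholds listed above is sketched below. This is an illustration under stated assumptions, not the code from the commit: `example_compress_worthwhile()` is a hypothetical helper, `EXAMPLE_BPE_PAYLOAD_SIZE` is an assumed stand-in value for BPE_PAYLOAD_SIZE, and the retained 12.5% threshold is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed stand-in for BPE_PAYLOAD_SIZE; check the real headers. */
#define	EXAMPLE_BPE_PAYLOAD_SIZE	112

/*
 * Hypothetical helper: is a compressed result of c_len bytes worth
 * keeping for a source buffer of s_len bytes, given the device's
 * sector size (1 << ashift)?
 */
static bool
example_compress_worthwhile(size_t s_len, size_t c_len, unsigned ashift)
{
	assert(s_len > 0);

	size_t sector = (size_t)1 << ashift;
	/* Round the compressed size up to a whole number of sectors. */
	size_t rounded = (c_len + sector - 1) & ~(sector - 1);

	/*
	 * Threshold 1: small enough to embed in the block pointer.
	 * Treating "fits exactly" as acceptable is an assumption here.
	 */
	if (c_len <= EXAMPLE_BPE_PAYLOAD_SIZE)
		return (true);

	/* Threshold 2: must save at least one whole sector on disk. */
	return (rounded + sector <= s_len);
}
```

For example, with 4 KiB sectors a 4 KiB block compressed to 2000 bytes still occupies one sector, so nothing is saved, whereas an 8 KiB block compressed to 4000 bytes saves a full sector.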
* zfs_debug: specific variant for userspaceRob Norris2024-09-192-77/+1
| | | | | | | | | | Just nice and simple, with room to grow. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16492
* zfs_znode: lift common code to a single shared fileRob Norris2024-09-195-747/+401
| | | | | | | | | | | | For now, userspace has no znode implementation. Some of the property and path handling code is used there though and is the same on all platforms, so we only need a single copy of it. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16492