aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* abd_iter_page: rework to handle multipage scatterlistsRob N2024-07-171-46/+74
| | | | | | | | | | | | | | | | | | | Previously, abd_iter_page() would assume that every scatterlist would contain a single page (compound or no), because that's all we ever create in abd_alloc_chunks(). However, scatterlists can contain multiple pages of arbitrary provenance, and if we get one of those, we'd get all the math wrong. This reworks things to handle multiple pages in a scatterlist, by properly finding the right page within it for the given offset, and understanding better where the end of the page is and not crossing it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16108
* Linux 5.16: use bdev_nr_bytes() to get device capacityRob Norris2024-07-161-5/+9
| | | | | | | | | | | This helper was introduced long ago, in 5.16. Since 6.10, bd_inode no longer exists, but the helper has been updated, so detect it and use it in all versions where it is available. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]>
* Linux 6.10: work harder to avoid kmem_cache_alloc reuseRob Norris2024-07-161-10/+2
| | | | | | | | | | | | | | | | | | | | | Linux 6.10 change kmem_cache_alloc to be a macro, rather than a function, such that the old #undef for it in spl-kmem-cache.c would remove its definition completely, breaking the build. This inverts the model used before. Rather than always defining the kmem_cache_* macro, then undefining then inside spl-kmem-cache.c, instead we make a special tag to indicate we're currently inside spl-kmem-cache.c, and not defining those in macros in the first place, so we can use the kernel-supplied kmem_cache_* functions to implement spl_kmem_cache_*, as we expect. For all other callers, we create the macros as normal and remove access to the kernel's own conflicting names. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]>
* Linux 6.10: rework queue limits setupRob Norris2024-07-161-70/+116
| | | | | | | | | | | | | | | | | | | | | | | Linux has started moving to a model where instead of applying block queue limits through individual modification functions, a complete limits structure is built up and applied atomically, either when the block device or open, or some time afterwards. As of 6.10 this transition appears only partly completed. This commit matches that model within OpenZFS in a way that should work for past and future kernels. We set up a queue limits structure with any limits that have had their modification functions removed. For newer kernels that can have limits applied at block device open (HAVE_BLK_ALLOC_DISK_2ARG), we have a conversion function to turn the OpenZFS queue limits structure into Linux's queue_limits structure, which can then be passed in. For older kernels, we provide an application function that just calls the old functions for each limit in the structure. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]>
* Linux 6.9: Fix UBSAN errors in zap_micro.cTony Hutter2024-07-161-0/+1
| | | | | | | | | | | | | | | | | You can use the UBSAN_SANITIZE_* Kbuild options to exclude certain kernel objects from the UBSAN checks. We previously excluded zap_micro.o with: UBSAN_SANITIZE_zap_micro.o := n For some reason that didn't work for the 6.9 kernel, which wants us to use: UBSAN_SANITIZE_zfs/zap_micro.o := n Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16278 Closes #16330
* Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282)Tony Hutter2024-07-161-5/+97
| | | | | | | | | | | | | | | The 6.9 kernel behaves differently in how it releases block devices. In the common case it will async release the device only after the return to userspace. This is different from the 6.8 and older kernels which release the block devices synchronously. To get around this, call add_disk() from a workqueue so that the kernel uses a different codepath to release our zvols in the way we expect. This stops zfs_allow_010_pos from hanging. Fixes: #16089 Signed-off-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Rob Norris <[email protected]>
* head_errlog: fix use-after-freeGeorge Amanakis2024-07-151-2/+5
| | | | | | | | | | | | | In the commit of the head_errlog feature we introduced a bug in dsl_dataset_promote_sync(): we may dereference origin_head and hds, both dereferencing ddpa after calling promote_sync() on ddpa. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #16272 Closes #16273
* Fix assertion in Persistent L2ARCGeorge Amanakis2024-05-291-1/+1
| | | | | | | | | At the end of l2arc_evict() fix an assertion in the case that l2ad_hand + distance == l2ad_end. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #16202 Closes #16207
* FreeBSD: Add zfs_link_create() error handlingAlexander Motin2024-05-293-13/+43
| | | | | | | | | | | Originally Solaris didn't expect errors there, but they may happen if we fail to add entry into ZAP. Linux fixed it in #7421, but it was never fully ported to FreeBSD. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13215 Closes #16138
* ZAP: Fix leaf references on zap_expand_leaf() errorsAlexander Motin2024-05-291-13/+14
| | | | | | | | | | | | Depending on kind of error zap_expand_leaf() may return with or without valid leaf reference held. Make sure it returns NULL if due to error it has no leaf to return. Make its callers to check the returned leaf pointer, and release the leaf if it is not NULL. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #12366 Closes #16159
* Fix ZIL clone records for legacy holesAlexander Motin2024-05-291-5/+3
| | | | | | | | | | | | Previous code overengineered cloned range calculation by using BP_GET_LSIZE(). The problem is that legacy holes don't have the logical size, so result will be wrong. But we also don't need to look on every block size, since they all must be identical. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16165
* Fix scn_queue races on very old poolsAlexander Motin2024-05-291-0/+6
| | | | | | | | | | | | | | | Code for pools before version 11 uses dmu_objset_find_dp() to scan for children datasets/clones. It calls enqueue_clones_cb() and enqueue_cb() callbacks in parallel from multiple taskq threads. It ends up bad for scan_ds_queue_insert(), corrupting scn_queue AVL-tree. Fix it by introducing a mutex to protect those two scan_ds_queue_insert() calls. All other calls are done from the sync thread and so serialized. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16162
* Slightly improve dnode hashAlexander Motin2024-05-291-3/+3
| | | | | | | | | | | | As I understand just for being less predictable dnode hash includes 8 bits of objset pointer, starting at 6. But since objset_t is more than 1KB in size, its allocations are likely aligned to 2KB, that means 11 lower bits provide no entropy. Just take the 8 bits starting from 11. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16131
* Make more taskq parameters writableAlexander Motin2024-05-291-4/+4
| | | | | | | | | | | | There is no reason for these module parameters to be read-only. Being modified they just apply on next pool import/creation, that is useful for testing different values. Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16118
* L2ARC: Cleanup buffer re-compressionAlexander Motin2024-05-291-39/+20
| | | | | | | | | | | | | | When compressed ARC is disabled, we may have to re-compress when writing into L2ARC. If doing so we can't fit it into the original physical size, we should just fail immediately, since even if it may still fit into allocation size, its checksum will never match. While there, refactor the code similar to other compression places without using abd_return_buf_copy(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16038
* Refactor dbuf_read() for safer decryptionAlexander Motin2024-05-291-110/+104
| | | | | | | | | | | | | | | | | | | | In dbuf_read_verify_dnode_crypt(): - We don't need original dbuf locked there. Instead take a lock on a dnode dbuf, that is actually manipulated. - Block decryption for a dnode dbuf if it is currently being written. ARC hash lock does not protect anonymous buffers, so arc_untransform() is unsafe when used on buffers being written, that may happen in case of encrypted dnode buffers, since they are not copied by dbuf_dirty()/dbuf_hold_copy(). In dbuf_read(): - If the buffer is in flight, recheck its compression/encryption status after it is cached, since it may need arc_untransform(). Tested-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16104
* Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN.chenqiuhao19972024-05-138-16/+20
| | | | | | | | | | | | | In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Qiuhao Chen <[email protected]> Closes #15940
* Fix updating the zvol_htable when renaming a zvolAlan Somers2024-04-302-2/+2
| | | | | | | | | | | | When renaming a zvol, insert it into zvol_htable using the new name, not the old name. Otherwise some operations won't work. For example, "zfs set volsize" while the zvol is open. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Alan Somers <[email protected]> Closes #16127 Closes #16128
* Add prefetch property Brian Behlendorf2024-04-303-1/+36
| | | | | | | | | | | | | | | | | | | | | | ZFS prefetch is currently governed by the zfs_prefetch_disable tunable. However, this is a module-wide settings - if a specific dataset benefits from prefetch, while others have issue with it, an optimal solution does not exists. This commit introduce the "prefetch" tri-state property, which enable granular control (at dataset/volume level) for prefetching. This patch does not remove the zfs_prefetch_disable, which remains a system-wide switch for enable/disable prefetch. However, to avoid duplication, it would be preferable to deprecate and then remove the module tunable. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Gionatan Danti <[email protected]> Co-authored-by: Gionatan Danti <[email protected]> Closes #15237 Closes #15436
* vdev probe to slow disk can stall mmp write checkerDon Brady2024-04-308-37/+126
| | | | | | | | | | | | | | Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15839
* Extend import_progress kstat with a notes fieldDon Brady2024-04-293-9/+118
| | | | | | | | | | | | | Detail the import progress of log spacemaps as they can take a very long time. Also grab the spa_note() messages to, as they provide insight into what is happening Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Co-authored-by: Allan Jude <[email protected]> Closes #15539
* Add ashift validation when adding devices to a poolGeorge Wilson2024-04-292-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, zpool add allows users to add top-level vdevs that have different ashifts but doing so prevents users from being able to perform a top-level vdev removal. Often times consumers may not realize that they have mismatched ashifts until the top-level removal fails. This feature adds ashift validation to the zpool add command and will fail the operation if the sector size of the specified vdev does not match the existing pool. This behavior can be disabled by using the -f flag. In addition, new flags have been added to provide fine-grained control to disable specific checks. These flags are: --allow-in-use --allow-ashift-mismatch --allow-replicaton-mismatch The force flag will disable all of these checks. Reviewed by: Brian Behlendorf <[email protected]> Reviewed by: Alexander Motin <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #15509
* Use ASSERT0P() to check that a pointer is NULL.Dag-Erling Smørgrav2024-04-291-1/+1
| | | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Dag-Erling Smørgrav <[email protected]> Closes #15225
* GCC: Fixes for gcc 14 on Fedora 40Tony Hutter2024-04-291-2/+3
| | | | | | | | | | | - Workaround dangling pointer in uu_list.c (#16124) - Fix calloc() transposed arguments in zpool_vdev_os.c - Make some temp variables unsigned to prevent triggering a '-Werror=alloc-size-larger-than' error. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16124 Closes #16125
* Do no use .cfi_negate_ra_state within the assembly on Arm64Tino Reichardt2024-04-292-5/+21
| | | | | | | | | | | | | | | Compiling openzfs on aarch64 with gcc-8 and gcc-9 is failing currently. See issue #14965 for deeper context. On platforms without pointer authentication, .cfi_negate_ra_state can be defined to a no-op: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/aarch64-tdep.c#l1413 I have tested this on Arm64 FreeBSD 13.2 and AlmaLinux-8. Reviewed-by: Andrew Turner <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #14965 Closes #15784
* Add the BTI elf note to the AArch64 SHA2 assemblyAndrew Turner2024-04-292-0/+20
| | | | | | | | | | | | | | | | | | | On ELF platforms there is a note to specify when an application or library supports BTI. When linking one of these the linker needs all input object files to have the note. If not it will not include it in the output file. Normally the compiler would generate it, but for assembly files we need to do it our selves. Add the note to the aarch64 sha256 and sha512 assembly files. Tested by building with BTI enabled and using the -zbti-report=error flag to lld that makes it an error if the note is missing. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrew Turner <[email protected]> Closes #16086
* Fix panics when truncating/deleting filesPavel Snajdr2024-04-291-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's an union in dbuf_dirty_record_t; dr_brtwrite could evaluate to B_TRUE if the dirty record is of another type than dl. Adding more explicit dr type check before trying to access dr_brtwrite. Fixes two similar panics: [ 1373.806119] VERIFY0(db->db_level) failed (0 == 1) [ 1373.807232] PANIC at dbuf.c:2549:dbuf_undirty() [ 1373.814979] dump_stack_lvl+0x71/0x90 [ 1373.815799] spl_panic+0xd3/0x100 [spl] [ 1373.827709] dbuf_undirty+0x62a/0x970 [zfs] [ 1373.829204] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs] [ 1373.831010] dnode_free_range+0x532/0x1220 [zfs] [ 1373.833922] dmu_free_long_range+0x4e0/0x930 [zfs] [ 1373.835277] zfs_trunc+0x75/0x1e0 [zfs] [ 1373.837958] zfs_freesp+0x9b/0x470 [zfs] [ 1373.847236] zfs_setattr+0x161a/0x3500 [zfs] [ 1373.855267] zpl_setattr+0x125/0x320 [zfs] [ 1373.856725] notify_change+0x1ee/0x4a0 [ 1373.859207] do_truncate+0x7f/0xd0 [ 1373.859968] do_sys_ftruncate+0x28e/0x2e0 [ 1373.860962] do_syscall_64+0x38/0x90 [ 1373.861751] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1822.381337] VERIFY0(db->db_level) failed (0 == 1) [ 1822.382376] PANIC at dbuf.c:2549:dbuf_undirty() [ 1822.389232] dump_stack_lvl+0x71/0x90 [ 1822.389920] spl_panic+0xd3/0x100 [spl] [ 1822.399567] dbuf_undirty+0x62a/0x970 [zfs] [ 1822.400583] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs] [ 1822.401752] dnode_free_range+0x532/0x1220 [zfs] [ 1822.402841] dmu_object_free+0x74/0x120 [zfs] [ 1822.403869] zfs_znode_delete+0x75/0x120 [zfs] [ 1822.404906] zfs_rmnode+0x3f6/0x7f0 [zfs] [ 1822.405870] zfs_inactive+0xa3/0x610 [zfs] [ 1822.407803] zpl_evict_inode+0x3e/0x90 [zfs] [ 1822.408831] evict+0xc1/0x1c0 [ 1822.409387] do_unlinkat+0x147/0x300 [ 1822.410060] __x64_sys_unlinkat+0x33/0x60 [ 1822.410802] do_syscall_64+0x38/0x90 [ 1822.411458] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Pavel Snajdr <[email protected]> Closes #15983
* Add slow disk diagnosis to ZEDDon Brady2024-04-294-0/+66
| | | | | | | | | | | | | | | | | | | | | Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, like is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties: VDEV_PROP_SLOW_IO_N VDEV_PROP_SLOW_IO_T If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade(). Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Rob Wing <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15469
* xdr: header cleanupRob N2024-04-292-0/+2
| | | | | | | | | | | | | | | | | | | | | #16047 notes that include/os/freebsd/spl/rpc/xdr.h carried an (apparently) incompatible license. While looking into it, it seems that this file is actually unnecessary these days - FreeBSD's kernel XDR has XDR_CONTROL, xdrmem_control and XDR_GET_BYTES_AVAIL, while userspace has XDR_CONTROL and xdrmem_control, and our implementation of XDR_GET_BYTES_AVAIL for libspl works nicely with it. So this removes that file outright. To keep the includes in nvpair.c tidy, I've made a few small adjustments to the Linux headers. By definition, rpc/types.h provides bool_t and is included before rpc/xdr.h, so I've created rpc/types.h for Linux. This isn't necessary for userspace; both FreeBSD native and tirpc on Linux already have these headers set up correctly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16047 Closes #16051
* Linux 6.9 compat: blk_alloc_disk() now takes two argsRob Norris2024-04-221-1/+22
| | | | | | | | | | | | | | | | There's an extra nullable arg for queue limits. Detect it, and set it to NULL. Similar change for blk_mq_alloc_disk(), now three args, same treatment. Error return now has error encoded in the return, so detect with IS_ERR() and explicitly NULL our own return. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033
* Linux 6.9 compat: bdev handles are now struct fileRob Norris2024-04-221-5/+19
| | | | | | | | | | | | | | bdev_open_by_path() is replaced by bdev_file_open_by_path(), which returns a plain old struct file*. Release function is gone entirely; the regular file release function fput() will take care of the bdev specifics. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033
* vdev_disk: clean up spa/bdev mode conversionRob N2024-04-221-42/+39
| | | | | | | | | | | | | | | | | | | | 43e8f6e37 introduced a subtle API misuse, in that it passed the output from vdev_bdev_mode() back into itself. Fortunately, the SPA_MODE_(READ|WRITE) bit values exactly map to the FMODE_(READ|WRITE) & BLK_OPEN_(READ|WRITE) bit values, so it didn't result in a bug, but it was hard to read and understand, so I cleaned it up. In doing so, I noticed that the only call to vdev_bdev_mode() without the "exclusive" flag set was in that misuse, and actually, we never do a non-exclusive blkdev_get_by_path(). So I've just made exclusive be always-on. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15995
* zvols: prevent overflow of minor device numbersFabian-Gruenbichler2024-04-221-0/+7
| | | | | | | | | | | | | | | | | | | | | | currently, the linux kernel allows 2^20 minor devices per major device number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol itself, the other 15 for the first partitions of that zvol. as a result, only 2^16 such blocks are available for use. there are no checks in place to avoid overflowing into the major device number when more than 2^16 zvols are allocated (with volmode=dev or default). instead of ignoring this limit, which comes with all sorts of weird knock-on effects, detect this situation and simply fail allocating the zvol block device early on. without this safeguard, the kernel will reject the attempt to create an already existing block device, but ZFS doesn't handle this error and gets confused about which zvol occupies which minor slot, potentially resulting in kernel NULL derefs and other issues later on. Reviewed-by: Tony Hutter <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Closes #16006
* L2ARC: Relax locking during writeAlexander Motin2024-04-195-98/+127
| | | | | | | | | | | | | | | | | | | | | | | Previous code held ARC state sublist lock throughout all L2ARC write process, which included number of allocations and even ZIO issues. Being blocked in any of those places the code could also block ARC eviction, that could cause OOM activation or even dead- lock if system is low on memory or one is too fragmented. Fix it by dropping the lock as soon as we see a block eligible for L2ARC writing and pick it up later using earlier inserted marker. While there, also reduce scope of hash lock, moving ZIO allocation and other operations not requiring header access out of it. All operations requiring header access move under hash lock, since L2_WRITING flag does not prevent header eviction only transition to arc_l2c_only state with L1 header. To be able to manipulate sublist lock and marker as needed add few more multilist functions and modify one. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16040
* Small fix to prefetch ranges aggregationAlexander Motin2024-04-191-2/+2
| | | | | | | | | | When after #16022 adding new range we aggregate more than two existing ranges, that should be very rare, only if several streams overlap, we may need to zero not the last range, but some earlier. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16072
* Remove db_state DB_NOFILL checks from syncing contextAlexander Motin2024-04-191-25/+19
| | | | | | | | | | | | | | Syncing context should not depend on current state of dbuf, which could already change several times in later transaction groups, but rely solely on dirty record for the transaction group being synced. Some of the checks seem already impossible, while instead of others I think we should better check for absence of data in the specific dirty record rather than DB_NOFILL. Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16057
* Speculative prefetch for reordered requestsAlexander Motin2024-04-192-57/+240
| | | | | | | | | | | | | | | | | | | Before this change speculative prefetcher was able to detect a stream only if all of its accesses are perfectly sequential. It was easy to implement and is perfectly fine for single-threaded applications. Unfortunately multi-threaded network servers, such as iSCSI, SMB or NFS usually have plenty of threads and may often reorder requests, preventing successful speculation and prefetch. This change allows speculative prefetcher to detect streams even if requests are reordered by introducing a list of 9 non-contiguous ranges up to 16MB ahead of current stream position and filling the gaps as more requests arrive. It also allows stream to proceed even with holes up to a certain configurable threshold (25%). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16022
* Fix read errors race after block cloningAlexander Motin2024-04-191-21/+20
| | | | | | | | | | | | | | | | | | | | Investigating read errors triggering panic fixed in #16042 I've found that we have a race in a sync process between the moment dirty record for cloned block is removed and the moment dbuf is destroyed. If dmu_buf_hold_array_by_dnode() take a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such case is not an error, but equivalent to DB_UNCACHED, since the dbuf block pointer is already updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since there may already be another cloning in progress, that dropped dbuf lock before creating a new dirty record, protected only by the range lock. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Robert Evans <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16052
* Improve dbuf_read() error reportingAlexander Motin2024-04-191-18/+20
| | | | | | | | | | | | | Previous code reported non-ZIO errors only via return value, but not via parent ZIO. It could cause NULL-dereference panics due to dmu_buf_hold_array_by_dnode() ignoring the return value, relying solely on parent ZIO status. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reported by: Ameer Hamza <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16042
* BRT: Fix holes cloning.Alexander Motin2024-04-191-13/+13
| | | | | | | | | | | | | | | | | | | - When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15994 Closes #16007
* BRT: Skip getting length in brt_entry_lookup()Alexander Motin2024-04-191-16/+2
| | | | | | | | | | | | | | | Unlike DDT, where ZAP values may have different lengths due to compression, all BRT entries are identical 8-byte counters. It does not make sense to first fetch the length only to assert it. zap_lookup_uint64() is specifically designed to work with counters of different size and should return error if something odd found. Calling it straight allows to save some measurable CPU time. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15950
* BRT: Make BRT block sizes configurableAlexander Motin2024-04-191-11/+11
| | | | | | | | | | | | | Similar to DDT make BRT data and indirect block sizes configurable via module parameters. I am not sure what would be the best yet, but similar to DDT 4KB blocks kill all chances of compression on vdev with ashift=12 or more, that on my tests reaches 3x. While here, fix documentation for respective DDT parameters. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15967
* BRT: Relax brt_pending_apply() lockingAlexander Motin2024-04-191-11/+5
| | | | | | | | | | | | | Since brt_pending_apply() is running in syncing context, no other brt_pending_tree accesses are possible for the TXG. We don't need to acquire brt_pending_lock here. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15955
* ZAP: Massively switch to _by_dnode() interfacesAlexander Motin2024-04-196-170/+191
| | | | | | | | | | | | | | | | | Before this change ZAP called dnode_hold() for almost every block access, that was clearly visible in profiler under heavy load, such as BRT. This patch makes it always hold the dnode reference between zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode operations between those. It also adds several new _by_dnode() APIs to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode() variant and uses it in the ZAP code. After this there remains only one call to dmu_buf_dnode_enter(), which seems to be unneeded. So remove the call and the functions. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15951
* BRT: Skip duplicate BRT prefetchesAlexander Motin2024-04-191-3/+3
| | | | | | | | | | | | If there is a pending entry for this block, then we've already issued BRT prefetch for it within this TXG, so don't do it again. BRT vdev lookup and following zap_prefetch_uint64() call can be pretty expensive and should be avoided when not necessary. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15941
* ZAP: Some cleanups/micro-optimizationsAlexander Motin2024-04-191-43/+34
| | | | | | | | | | | | | | | - Remove custom zap_memset(), use regular memset(). - Use PANIC() instead of opaque cmn_err(CE_PANIC). - Provide entry parameter to zap_leaf_rehash_entry(). - Reduce branching in zap_leaf_array_create() inner loop. - Remove signedness where it should not be. Should be no function changes. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15976
* BRT: Change brt_pending_tree sorting orderAlexander Motin2024-04-191-6/+7
| | | | | | | | | | | | | | | | | | It does not look important how exactly brt_pending_tree is sorted. When cloning large file, it is quite likely that all of its blocks have identical physical birth times, so comparing them first does not provide useful entropy, while accesses additional cache line. In most cases combination of vdev and offset provides unique result and physical birth time comparison is not even needed. Meanwhile, when traversing the tree inside brt_pending_apply(), it can be beneficial for dbuf cache and CPU cache hits to group processing by vdev and so by the per-VDEV BRT ZAPs. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15954
* Update resume token at object receive.Alexander Motin2024-04-191-0/+10
| | | | | | | | | | | | | | | | | | | | | Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15927
* Linux: Cleanup taskq threads spawn/exitAlexander Motin2024-04-191-56/+29
| | | | | | | | | | | | | | | | This changes taskq_thread_should_stop() to limit maximum exit rate for idle threads to one per 5 seconds. I believe the previous one was broken, not allowing any thread exits for tasks arriving more than one at a time and so completing while others are running. Also while there: - Remove taskq_thread_spawn() calls on task allocation errors. - Remove extra taskq_thread_should_stop() call. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15873
* Refactor dmu_prefetch().Alexander Motin2024-04-196-57/+71
| | | | | | | | | | | | | | | | | | - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15076