aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Fix panics when truncating/deleting filesPavel Snajdr2024-04-291-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's an union in dbuf_dirty_record_t; dr_brtwrite could evaluate to B_TRUE if the dirty record is of another type than dl. Adding more explicit dr type check before trying to access dr_brtwrite. Fixes two similar panics: [ 1373.806119] VERIFY0(db->db_level) failed (0 == 1) [ 1373.807232] PANIC at dbuf.c:2549:dbuf_undirty() [ 1373.814979] dump_stack_lvl+0x71/0x90 [ 1373.815799] spl_panic+0xd3/0x100 [spl] [ 1373.827709] dbuf_undirty+0x62a/0x970 [zfs] [ 1373.829204] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs] [ 1373.831010] dnode_free_range+0x532/0x1220 [zfs] [ 1373.833922] dmu_free_long_range+0x4e0/0x930 [zfs] [ 1373.835277] zfs_trunc+0x75/0x1e0 [zfs] [ 1373.837958] zfs_freesp+0x9b/0x470 [zfs] [ 1373.847236] zfs_setattr+0x161a/0x3500 [zfs] [ 1373.855267] zpl_setattr+0x125/0x320 [zfs] [ 1373.856725] notify_change+0x1ee/0x4a0 [ 1373.859207] do_truncate+0x7f/0xd0 [ 1373.859968] do_sys_ftruncate+0x28e/0x2e0 [ 1373.860962] do_syscall_64+0x38/0x90 [ 1373.861751] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1822.381337] VERIFY0(db->db_level) failed (0 == 1) [ 1822.382376] PANIC at dbuf.c:2549:dbuf_undirty() [ 1822.389232] dump_stack_lvl+0x71/0x90 [ 1822.389920] spl_panic+0xd3/0x100 [spl] [ 1822.399567] dbuf_undirty+0x62a/0x970 [zfs] [ 1822.400583] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs] [ 1822.401752] dnode_free_range+0x532/0x1220 [zfs] [ 1822.402841] dmu_object_free+0x74/0x120 [zfs] [ 1822.403869] zfs_znode_delete+0x75/0x120 [zfs] [ 1822.404906] zfs_rmnode+0x3f6/0x7f0 [zfs] [ 1822.405870] zfs_inactive+0xa3/0x610 [zfs] [ 1822.407803] zpl_evict_inode+0x3e/0x90 [zfs] [ 1822.408831] evict+0xc1/0x1c0 [ 1822.409387] do_unlinkat+0x147/0x300 [ 1822.410060] __x64_sys_unlinkat+0x33/0x60 [ 1822.410802] do_syscall_64+0x38/0x90 [ 1822.411458] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Pavel Snajdr <[email protected]> Closes #15983
* vdev props comment and manpage should include zfsd and FreeBSD mentionsAlek P2024-04-292-2/+8
| | | | | | | | | Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #15968
* Add slow disk diagnosis to ZEDDon Brady2024-04-2929-70/+654
| | | | | | | | | | | | | | | | | | | | | Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, like is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties: VDEV_PROP_SLOW_IO_N VDEV_PROP_SLOW_IO_T If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade(). Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Rob Wing <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15469
* [2.2.4-only] Stub RAIDZ enums to prevent conflictsTony Hutter2024-04-293-1/+5
| | | | | | | Stub in the RAIDZ expansions enums for now so that the slow IO commit merges cleanly. Signed-off-by: Tony Hutter <[email protected]>
* zap_leaf: make l_hash[] variable length to silence UBSANRob N2024-04-291-1/+1
| | | | | | | | | | | | When UBSAN is active and OpenZFS is a debug build, the l_hash assert at the bottom of zap_open_leaf() causes UBSAN to complain. This follows the example in 786641dcf to shut it up. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15964
* Give a better message from 'zpool get' with invalid pool namePaul Dagnelie2024-04-291-4/+3
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #15942
* xdr: header cleanupRob N2024-04-297-75/+33
| | | | | | | | | | | | | | | | | | | | | #16047 notes that include/os/freebsd/spl/rpc/xdr.h carried an (apparently) incompatible license. While looking into it, it seems that this file is actually unnecessary these days - FreeBSD's kernel XDR has XDR_CONTROL, xdrmem_control and XDR_GET_BYTES_AVAIL, while userspace has XDR_CONTROL and xdrmem_control, and our implementation of XDR_GET_BYTES_AVAIL for libspl works nicely with it. So this removes that file outright. To keep the includes in nvpair.c tidy, I've made a few small adjustments to the Linux headers. By definition, rpc/types.h provides bool_t and is included before rpc/xdr.h, so I've created rpc/types.h for Linux. This isn't necessary for userspace; both FreeBSD native and tirpc on Linux already have these headers set up correctly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16047 Closes #16051
* Fix buffer underflow if sysfs file is emptyRobert Evans2024-04-291-1/+1
| | | | | | | | Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Jason Lee <[email protected]> Signed-off-by: Robert Evans <[email protected]> Closes #16028 Closes #16035
* ZTS: fix flakiness in cp_files_002_posRobert Evans2024-04-291-3/+3
| | | | | | | | | | | | Fix RANDOM to not return zero. Overwriting with `dd ... count=0` does not test anything. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Robert Evans <[email protected]> Closes #16029
* Fix option string, adding -e and fixing orderCameron Harr2024-04-292-29/+28
| | | | | | | | | | | The recently added '-e' option (PR #15769) missed adding the new option in the online `zpool status` help command. This adds the options and reorders a couple of the other options that were not listed alphabetically. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Cameron Harr <[email protected]> Closes #16008
* freebsd: fix missing headers in distribution tarballRob N2024-04-291-0/+2
| | | | | | | | | | | arc_os.h and freebsd_event.h aren't included in release tarballs, so the build fails on FreeBSD. This fixes it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15963
* Check for minimum partition sizeBrian Behlendorf2024-04-291-0/+10
| | | | | | | | | | | | | | | On Linux block devices used for vdevs will by partitioned. The block device must be large enough for an 64M partition starting at offset of 2048 sectors (part1), and a second 64M reserved partition at the end of the device (part9). This commit adds a capacity check when creating the GPT label to immediately detect a device which is too small. With the existing code this would be caught slightly latter when attempting to use the partition. Catching it sooner let's us print a more useful error. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #15898
* Add VERIFY0P() and ASSERT0P() macros.Dag-Erling Smørgrav2024-04-293-0/+37
| | | | | | | | | | | These macros are similar to VERIFY0() and ASSERT0() but are intended for pointers, and therefore use uintptr_t instead of int64_t. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Dag-Erling Smørgrav <[email protected]> Closes #15225
* Clean up existing VERIFY*() macros.Dag-Erling Smørgrav2024-04-223-29/+27
| | | | | | | | | | | | | | | Chiefly: - Remove unnecessary parentheses around variable names. - Remove spaces between the type and variable in casts. - Make the panic message for VERIFY0() reflect how the macro is used. - Use %p to format pointers, except in Linux kernel code. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Dag-Erling Smørgrav <[email protected]> Closes #15225
* etc/init.d: decide which variant to use at build time.Benda Xu2024-04-228-13/+14
| | | | | | | | | | | | | | | | | | | | | | Let Debian use the sysv-rc variant of the script, even when OpenRC is installed. Unlike on Gentoo, OpenRC on Debian consumes both the sysv-rc scripts and OpenRC ones. ZFS initscripts on Debian should be the sysv-rc version to provide most compatibility and to integrate with the rest of initscripts for dependency tracking. Restrict the substitution in the Makefile to the dedicated list. This construct is inspired by Mo Zhou's detection of the execution shell and follows the strategy of Peter in 6ef28c526ba7. As of 2024, the initscripts are mostly relevant on Debian, Gentoo and their derivatives. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Benda Xu <[email protected]> Issue #8063 Issue #8204 Issue #8359 Closes #15977
* config/Substfiles.am: restrict to the dedicated list.Benda Xu2024-04-221-1/+1
| | | | | | | | We recover the scope of $(SUBSTFILES) to explicitly control what files are being generated from the corresponding .in. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Benda Xu <[email protected]> Closes #15980
* man: move zfs_prepare_disk.8 to nodist_man_MANSShengqi Chen2024-04-221-2/+2
| | | | | | | | | | | The commit b53077a added zfs_prepare_disk.8 to the wrong list dist_man_MANS, in which @zfsexecdir@ will not be properly substituted. This leads to wrong path in the manpage in generated release tarballs. Reported-by: Benda Xu <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #15979
* Add support for zfs mount -R <filesystem>Umer Saleem2024-04-227-18/+216
| | | | | | | | | | | | | | | | This commit adds support for mounting a dataset along with all of it's children with '-R' flag for zfs mount. There can be scenarios where we want to mount all datasets under one hierarchy instead of mounting all datasets present on system with '-a' flag. '-R' flag should work on all root and non-root datasets. Usage information and man page has been updated for zfs mount. A test for verifying the behavior for '-R' flag is also added. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16015
* Linux 6.9 compat: blk_alloc_disk() now takes two argsRob Norris2024-04-222-1/+55
| | | | | | | | | | | | | | | | There's an extra nullable arg for queue limits. Detect it, and set it to NULL. Similar change for blk_mq_alloc_disk(), now three args, same treatment. Error return now has error encoded in the return, so detect with IS_ERR() and explicitly NULL our own return. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033
* Linux 6.9 compat: bdev handles are now struct fileRob Norris2024-04-222-7/+60
| | | | | | | | | | | | | | bdev_open_by_path() is replaced by bdev_file_open_by_path(), which returns a plain old struct file*. Release function is gone entirely; the regular file release function fput() will take care of the bdev specifics. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033
* vdev_disk: clean up spa/bdev mode conversionRob N2024-04-221-42/+39
| | | | | | | | | | | | | | | | | | | | 43e8f6e37 introduced a subtle API misuse, in that it passed the output from vdev_bdev_mode() back into itself. Fortunately, the SPA_MODE_(READ|WRITE) bit values exactly map to the FMODE_(READ|WRITE) & BLK_OPEN_(READ|WRITE) bit values, so it didn't result in a bug, but it was hard to read and understand, so I cleaned it up. In doing so, I noticed that the only call to vdev_bdev_mode() without the "exclusive" flag set was in that misuse, and actually, we never do a non-exclusive blkdev_get_by_path(). So I've just made exclusive be always-on. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15995
* Linux 5.18+ compat: Detect filemap_range_has_pageRobert Evans2024-04-221-0/+1
| | | | | | | | | | | In v5.18 `filemap_range_has_page` moved to `pagemap.h` `pagemap.h` has been around since 3.10 so just include both Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Robert Evans <[email protected]> Closes #16034
* udev: correctly handle partition #16 and laterFabian-Gruenbichler2024-04-221-4/+5
| | | | | | | | | | | | | | | | | If a zvol has more than 15 partitions, the minor device number exhausts the slot count reserved for partitions next to the zvol itself. As a result, the minor number cannot be used to determine the partition number for the higher partition, and doing so results in wrong named symlinks being generated by udev. Since the partition number is encoded in the block device name anyway, let's just extract it from there instead. Reviewed-by: Tony Hutter <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Closes #15904 Closes #15970
* zvols: prevent overflow of minor device numbersFabian-Gruenbichler2024-04-221-0/+7
| | | | | | | | | | | | | | | | | | | | | | currently, the linux kernel allows 2^20 minor devices per major device number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol itself, the other 15 for the first partitions of that zvol. as a result, only 2^16 such blocks are available for use. there are no checks in place to avoid overflowing into the major device number when more than 2^16 zvols are allocated (with volmode=dev or default). instead of ignoring this limit, which comes with all sorts of weird knock-on effects, detect this situation and simply fail allocating the zvol block device early on. without this safeguard, the kernel will reject the attempt to create an already existing block device, but ZFS doesn't handle this error and gets confused about which zvol occupies which minor slot, potentially resulting in kernel NULL derefs and other issues later on. Reviewed-by: Tony Hutter <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Closes #16006
* Linux 6.8 compat: META (#16099)Tony Hutter2024-04-221-1/+1
| | | | | | | Update the META file to reflect compatibility with the 6.8 kernel. Signed-off-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]>
* bdev_discard_supported: understand discard_granularity=0Rob N2024-04-191-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel documentation for the discard_granularity property says: A discard_granularity of 0 means that the device does not support discard functionality. Some older kernels had drivers (notably loop, but also some USB-SATA adapters) that would set the QUEUE_FLAG_DISCARD capability flag, but have discard_granularity=0. Since 5.10 (torvalds/linux@b35fd7422c2f) the discard entry point blkdev_issue_discard() has had a check for this, which would immediately reject the call with EOPNOTSUPP, and throw a scary diagnostic message into the log. See #16068. Since 6.8, the block layer sets a non-zero default for discard_granularity (torvalds/linux@3c407dc723bb), and a future kernel will remove the check entirely[1]. As such, there's no good reason for us to enable discard when discard_granularity=0. The kernel will never let the request go in anyway; better that we just disable it so we can report it properly to the user. 1. https://patchwork.kernel.org/project/linux-block/patch/[email protected]/ Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> (cherry picked from commit b181b2e604de3f36feab1092c702cdec5e78c693)
* L2ARC: Relax locking during writeAlexander Motin2024-04-196-99/+131
| | | | | | | | | | | | | | | | | | | | | | | Previous code held ARC state sublist lock throughout all L2ARC write process, which included number of allocations and even ZIO issues. Being blocked in any of those places the code could also block ARC eviction, that could cause OOM activation or even dead- lock if system is low on memory or one is too fragmented. Fix it by dropping the lock as soon as we see a block eligible for L2ARC writing and pick it up later using earlier inserted marker. While there, also reduce scope of hash lock, moving ZIO allocation and other operations not requiring header access out of it. All operations requiring header access move under hash lock, since L2_WRITING flag does not prevent header eviction only transition to arc_l2c_only state with L1 header. To be able to manipulate sublist lock and marker as needed add few more multilist functions and modify one. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16040
* Small fix to prefetch ranges aggregationAlexander Motin2024-04-191-2/+2
| | | | | | | | | | When after #16022 adding new range we aggregate more than two existing ranges, that should be very rare, only if several streams overlap, we may need to zero not the last range, but some earlier. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16072
* Remove db_state DB_NOFILL checks from syncing contextAlexander Motin2024-04-191-25/+19
| | | | | | | | | | | | | | Syncing context should not depend on current state of dbuf, which could already change several times in later transaction groups, but rely solely on dirty record for the transaction group being synced. Some of the checks seem already impossible, while instead of others I think we should better check for absence of data in the specific dirty record rather than DB_NOFILL. Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16057
* Speculative prefetch for reordered requestsAlexander Motin2024-04-195-63/+272
| | | | | | | | | | | | | | | | | | | Before this change speculative prefetcher was able to detect a stream only if all of its accesses are perfectly sequential. It was easy to implement and is perfectly fine for single-threaded applications. Unfortunately multi-threaded network servers, such as iSCSI, SMB or NFS usually have plenty of threads and may often reorder requests, preventing successful speculation and prefetch. This change allows speculative prefetcher to detect streams even if requests are reordered by introducing a list of 9 non-contiguous ranges up to 16MB ahead of current stream position and filling the gaps as more requests arrive. It also allows stream to proceed even with holes up to a certain configurable threshold (25%). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16022
* Fix read errors race after block cloningAlexander Motin2024-04-191-21/+20
| | | | | | | | | | | | | | | | | | | | Investigating read errors triggering panic fixed in #16042 I've found that we have a race in a sync process between the moment dirty record for cloned block is removed and the moment dbuf is destroyed. If dmu_buf_hold_array_by_dnode() take a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such case is not an error, but equivalent to DB_UNCACHED, since the dbuf block pointer is already updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since there may already be another cloning in progress, that dropped dbuf lock before creating a new dirty record, protected only by the range lock. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Robert Evans <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16052
* Improve dbuf_read() error reportingAlexander Motin2024-04-191-18/+20
| | | | | | | | | | | | | Previous code reported non-ZIO errors only via return value, but not via parent ZIO. It could cause NULL-dereference panics due to dmu_buf_hold_array_by_dnode() ignoring the return value, relying solely on parent ZIO status. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reported by: Ameer Hamza <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16042
* BRT: Check pool clone stats in more testsAlexander Motin2024-04-191-5/+15
| | | | | | | | | | | This should allow to catch some leaks, if those happen. While there fix some cosmetic issues. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16007
* BRT: Fix tests to work on non-empty poolsAlexander Motin2024-04-191-21/+26
| | | | | | | | | | It should not normally happen, but if it does, better to not fail everything for no good reason, or it may be hard to debug. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16007
* BRT: Fix holes cloning.Alexander Motin2024-04-191-13/+13
| | | | | | | | | | | | | | | | | | | - When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15994 Closes #16007
* BRT: Skip getting length in brt_entry_lookup()Alexander Motin2024-04-191-16/+2
| | | | | | | | | | | | | | | Unlike DDT, where ZAP values may have different lengths due to compression, all BRT entries are identical 8-byte counters. It does not make sense to first fetch the length only to assert it. zap_lookup_uint64() is specifically designed to work with counters of different size and should return error if something odd found. Calling it straight allows to save some measurable CPU time. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15950
* BRT: Make BRT block sizes configurableAlexander Motin2024-04-192-13/+26
| | | | | | | | | | | | | Similar to DDT make BRT data and indirect block sizes configurable via module parameters. I am not sure what would be the best yet, but similar to DDT 4KB blocks kill all chances of compression on vdev with ashift=12 or more, that on my tests reaches 3x. While here, fix documentation for respective DDT parameters. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15967
* BRT: Relax brt_pending_apply() lockingAlexander Motin2024-04-191-11/+5
| | | | | | | | | | | | | Since brt_pending_apply() is running in syncing context, no other brt_pending_tree accesses are possible for the TXG. We don't need to acquire brt_pending_lock here. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15955
* ZAP: Massively switch to _by_dnode() interfacesAlexander Motin2024-04-199-172/+202
| | | | | | | | | | | | | | | | | Before this change ZAP called dnode_hold() for almost every block access, that was clearly visible in profiler under heavy load, such as BRT. This patch makes it always hold the dnode reference between zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode operations between those. It also adds several new _by_dnode() APIs to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode() variant and uses it in the ZAP code. After this there remains only one call to dmu_buf_dnode_enter(), which seems to be unneeded. So remove the call and the functions. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15951
* BRT: Skip duplicate BRT prefetchesAlexander Motin2024-04-191-3/+3
| | | | | | | | | | | | If there is a pending entry for this block, then we've already issued BRT prefetch for it within this TXG, so don't do it again. BRT vdev lookup and following zap_prefetch_uint64() call can be pretty expensive and should be avoided when not necessary. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15941
* ZAP: Some cleanups/micro-optimizationsAlexander Motin2024-04-192-47/+38
| | | | | | | | | | | | | | | - Remove custom zap_memset(), use regular memset(). - Use PANIC() instead of opaque cmn_err(CE_PANIC). - Provide entry parameter to zap_leaf_rehash_entry(). - Reduce branching in zap_leaf_array_create() inner loop. - Remove signedness where it should not be. Should be no function changes. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15976
* BRT: Change brt_pending_tree sorting orderAlexander Motin2024-04-191-6/+7
| | | | | | | | | | | | | | | | | | It does not look important how exactly brt_pending_tree is sorted. When cloning large file, it is quite likely that all of its blocks have identical physical birth times, so comparing them first does not provide useful entropy, while accesses additional cache line. In most cases combination of vdev and offset provides unique result and physical birth time comparison is not even needed. Meanwhile, when traversing the tree inside brt_pending_apply(), it can be beneficial for dbuf cache and CPU cache hits to group processing by vdev and so by the per-VDEV BRT ZAPs. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15954
* Update resume token at object receive.Alexander Motin2024-04-191-0/+10
| | | | | | | | | | | | | | | | | | | | | Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15927
* Linux: Cleanup taskq threads spawn/exitAlexander Motin2024-04-193-71/+34
| | | | | | | | | | | | | | | | This changes taskq_thread_should_stop() to limit maximum exit rate for idle threads to one per 5 seconds. I believe the previous one was broken, not allowing any thread exits for tasks arriving more than one at a time and so completing while others are running. Also while there: - Remove taskq_thread_spawn() calls on task allocation errors. - Remove extra taskq_thread_should_stop() call. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15873
* Refactor dmu_prefetch().Alexander Motin2024-04-197-57/+72
| | | | | | | | | | | | | | | | | | - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15076
* ZIL: Update Linux tracing after #15635Alexander Motin2024-04-191-4/+10
| | | | | | | | | | | | While picking parts from #14909 I've missed Linux tracing specific ones, that went unnoticed in default configurations, but breaks the build in some. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15730
* ZIL: Improve next log block size predictionAlexander Motin2024-04-192-74/+201
| | | | | | | | | | | | | | | | | | | Track history in context of bursts, not individual log blocks. It allows to not blow away all the history by single large burst of many block, and same time allows optimizations covering multiple blocks in a burst and even predicted following burst. For each burst account its optimal block size and minimal first block size. Use that statistics from the last 8 bursts to predict first block size of the next burst. Remove predefined set of block sizes. Allocate any size we see fit, multiple of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise space allocator any more. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15635
* ZIO: Optimize zio_flush()Alexander Motin2024-04-192-22/+16
| | | | | | | | | | | | | | | | | - Generalize vdev_nowritecache handling by traversing through the VDEV tree and skipping children ZIOs where not supported. - Remove intermediate zio_null() in case of several VDEV children. - Remove children handling from zio_ioctl(). There are no other use cases for this code beside DKIOCFLUSHWRITECACHED, and would there be, I doubt they would so straightforward apply to all VDEV children. Comparing to removed previous optimization this should improve cases of redundant ZILs/SLOGs. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15515
* ZIL: Detect single-threaded workloadsAlexander Motin2024-04-193-60/+44
| | | | | | | | | | | | | | | | | | | ... by checking that previous block is fully written and flushed. It allows to skip commit delays since we can give up on aggregation in that case. This removes zil_min_commit_timeout parameter, since for single-threaded workloads it is not needed at all, while on very fast devices even some multi-threaded workloads may get detected as single-threaded and still bypass the wait. To give multi-threaded workloads more aggregation chances increase zfs_commit_timeout_pct from 5 to 10%, as they should suffer less from additional latency. Also single-threaded workloads detection allows in perspective better prediction of the next block size. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15381
* zvol_os: fix compile with blk-mq on Linux 4.xRob N2024-04-172-0/+20
| | | | | | | | | | | | | | | 99741bde5 accesses a cached blk-mq hardware context through the mq_hctx field of struct request. However, this field did not exist until 5.0. Before that, the private function blk_mq_map_queue() was used to dig it out of broader queue context. This commit detects this situation, and handles it with a poor-man's simulation of that function. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16069