aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Remove set but not used variable in ddt.c (#16522)Tino Reichardt2024-09-101-2/+1
| | | | | | | module/zfs/ddt.c:2612:6: error: variable 'total' set but not used Signed-off-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Fix an uninitialized data access (#16511)Alan Somers2024-09-102-2/+2
| | | | | | | | | | | | | | zfs_acl_node_alloc allocates an uninitialized data buffer, but upstack zfs_acl_chmod only partially initializes it. KMSAN reported that this memory remained uninitialized at the point when it was read by lzjb_compress, which suggests a possible kernel memory disclosure bug. The full KMSAN warning may be found in the PR. https://github.com/openzfs/zfs/pull/16511 Signed-off-by: Alan Somers <[email protected]> Sponsored by: Axcient Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* zts-report: don't crash on non-UTF-8 chars in the log (#16497)Rob Norris2024-09-091-1/+1
| | | | | | | | | | | | | | | | | The report generator expects the log to be clean and tidy UTF-8. That can be a problem if you use some of the verbose/debug test runner options, which sends all sorts of weird output from arbitrary programs to the log. This just makes Python a little more relaxed about such things. It shouldn't matter in practice, as those lines didn't match the test result regex anyway, and are discarded immediately. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* sys/types32.h: Remove struct timeval32 from libspl's header (#16491)Jessica Clarke2024-09-091-5/+0
| | | | | | | | | | | | | | macOS Sequoia's sys/sockio.h, as included by various bootstrap tools whilst building FreeBSD, has started to include net/if.h, which then includes sys/_types/_timeval32.h and provide a conflicting definition for struct timeval32. Since this type is entirely unused within OpenZFS, simply delete the type rather than adding in some kind of OS detection. This fixes building FreeBSD on macOS Sequoia (Beta). Signed-off-by: Jessica Clarke <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* zio_resume: log when unsuspending the pool (#16485)Rob Norris2024-09-091-1/+5
| | | | | | | | | | | | When reviewing logs after a failure, its useful to see where unsuspend/resume was requested. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* libzstd: also build with LIBZPOOL_CPPFLAGSRob Norris2024-09-091-0/+2
| | | | | | | | | | | | | | | | | | libzstd now also allocates its own abd_t, and so has the same issue as zstream did, so this applies the same workaround: compile it with ZFS_DEBUG. See 92fca1c2d. This looks weird, because libzstd doesn't appear to look related to the ZFS kernel, but there is already a cross-dependency there: zstd needs zfs_lz4_compress, and zfs needs zfs_zstd_compress (and others), so the two can never really be separated without more work. Another job for another time. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16489
* spa_prop_get: require caller to supply output nvlistRob Norris2024-09-064-66/+53
| | | | | | | | | | | | | | | | | All callers to spa_prop_get() and spa_prop_get_nvlist() supplied their own preallocated nvlist (except ztest), so we can remove the option to have them allocate one if none is supplied. This sidesteps a bug in spa_prop_get(), where the error var wasn't initialised, which could lead to the provided nvlist being freed at the end. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16505
* zpool events: expand value strings for ZIO error valuesRob Norris2024-09-051-1/+23
| | | | | | Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* value strings: pretty printers for flags and enumsRob Norris2024-09-0511-0/+427
| | | | | | | | | | | This adds zfs_valstr, a collection of pretty printers for bitfields and enums. These are useful in debugging, logging and other display contexts where raw values are difficult for the untrained (or even trained!) eye to decipher. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* Add DDT prune commandDon Brady2024-09-0421-85/+905
| | | | | | | | | | | | | | | | Requires the new 'flat' physical data which has the start time for a class entry. The amount to prune can be based on a target percentage of the unique entries or based on the age (i.e., every entry older than N days). Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #16277
* zdb: rework dedup accounting for log, quota and pruneRob Norris2024-09-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | The simplest thing first: add the FDT and log objects to the list of objects to be considered when checking for leaks. The rest is based on a conceptual change in all of this patch stack: a block on disk with a 'D' bit is not necessarily in the DDT at all (pruned), or in the DDT ZAPs (still on the log). As such, walking the DDT up front is difficult (for all the reasons that walking an unflushed log is difficult) and not really useful, since it's not a reflection of what's on disk anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Co-authored-by: Don Brady <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #16277
* Remove unused sysctl nodeSeth Hoffert2024-09-031-1/+0
| | | | | | | | | | PR #14953 removed vdev-level read cache but accidentally left this sysctl node behind. Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Seth Hoffert <[email protected]> Closes #16493
* build: rename FORCEDEBUG_CPPFLAGS to LIBZPOOL_CPPFLAGSRob Norris2024-08-277-8/+11
| | | | | | | | | | | | | | | This is just a very small attempt to make it more obvious that these flags aren't optional for libzpool-using programs, by not making it seem like there's an option to say "well, I don't _want_ to force debugging". Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Issue #16476 Closes #16477
* zstream: build with debug to fix stack overrunsRob Norris2024-08-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | abd_t differs in size depending on whether or not ZFS_DEBUG is set. It turns out that libzpool is built with FORCEDEBUG_CPPFLAGS, which sets -DZFS_DEBUG, and so it always has a larger abd_t with extra debug fields, regardless of whether or not --enable-debug is set. zdb, ztest and zhack are also all built with FORCEDEBUG_CPPFLAGS, so had the same idea of the size of abd_t, but zstream was not, and used the "smaller" abd_t. In practice this didn't matter because it never used abd_t directly. This changed in b4d81b1a6, zstream was switched to use stack ABDs for compression. When built with --enable-debug, zstream implicitly gets ZFS_DEBUG, and everything was fine. Productions builds without that flag ends up with the smaller abd_t, which is now mismatched with libzpool, and causes stack overruns in zstream recompress. The simplest fix for now is to compile zstream with FORCEDEBUG_CPPFLAGS like the other binaries. This commit does that. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Issue #16476 Closes #16477
* fm: pass io_flags through events & zed as uint64_tRob Norris2024-08-262-4/+13
| | | | | | | | | | | | | | | | | | | In 4938d01db (#14086) zio_flag_t was converted from an enum (generally signed 32-bit) to a uint64_t. The corresponding change wasn't made to the error reporting subsystem, limiting the error flags being delivered to zed to 32 bits. This bumps the whole pipeline to use uint64s. A tiny bit of compatibility is added for newer zed working agsinst an older kernel module, because its easy to do and misdetecting scrub/resilver errors and taking action is potentially dangerous. Making it work for new kernel modules against older zed seems to be far more invasive for far less benefit, so I have not. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16469
* Fix issig() to check signal_pending after dequeue SIGSTOP/SIGTSTPJitendra Patidar2024-08-261-0/+7
| | | | | | | | | When process got SIGSTOP/SIGTSTP, issig() dequeue them and return 0. But process could still have another signal pending after dequeue. So, after dequeue, check and return 1, if signal_pending. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jitendra Patidar <[email protected]> Closes #16464
* zpool: Provide GUID to zpool-reguid(8) with -g (#16239)Mateusz Piotrowski2024-08-2616-16/+342
| | | | | | | | | | | | | | This commit extends the zpool-reguid(8) command with a -g flag, which allows the user to specify the GUID to set. This change also adds some general tests for zpool-reguid(8). Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Mateusz Piotrowski <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* spl-taskq: fix task counts for delayed and cancelled tasksRob Norris2024-08-231-0/+2
| | | | | | | | | | | | Dispatched delayed tasks were not added to tasks_total, and cancelled tasks were not removed. This notably could make tasks_total go to UNIT64_MAX, but just generally meant the count could be wrong. So lets not! Sponsored-by: Klara, Inc. Sponsored-by: Syneto Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16473
* Make mount.zfs(8) calling zfs_mount_at for legacy mounts as wellLow-power2024-08-232-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 329e2ffa4bca456e65c3db7f5c5c04931c551b61 has made mount.zfs(8) to call libzfs function 'zfs_mount_at', in order to propagate dataset properties into mount options. This fix however, is limited to a special use case where mount.zfs(8) is used in initrd with option '-o zfsutil'. If either initrd or the user need to use mount.zfs(8) to mount a file system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and the original issue #7947 will still happen. Since the existing code already excluded the possibility of calling 'zfs_mount_at' when it was invoked as a helper program from zfs(8), by checking 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to avoid calling 'zfs_mount_at' without '-o zfsutil'. An exception however, is when mount.zfs(8) was invoked with '-o remount' to update the mount options for an existing mount point. In this case call mount(2) directly without modifying the mount options passed from command line. Furthermore, don't run mount.zfs(8) helper for automounting snapshot. The above change to make mount.zfs(8) to call 'zfs_mount_at' apparently caused it to trigger an automount for the snapshot directory. When the helper was invoked as a result of a snapshot automount, an infinite recursion will occur. Since the need of invoking user mode mount(8) for automounting was to overcome that the 'vfs_kern_mount' being GPL-only, just run mount(8) without the mount.zfs(8) helper by adding option '-i'. Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: WHR <[email protected]> Closes #16393
* zstream recompress: fix zero recompressed buffer and outputRob Norris2024-08-221-10/+12
| | | | | | | | | | | | | | | If compression happend, any garbage past the compress size was not zeroed out. If compression didn't happen, then the payload size was still set to the rounded-up return from zio_compress_data(), which is dependent on the input, which is not necessarily the logical size. So that's all fixed too, mostly from stealing the math from zio.c. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* zstream decompress: fix decompress size and outputRob Norris2024-08-221-12/+19
| | | | | | | | | | | | | | | This was incorrectly using the compressed length for the size of the decompress buffer, and quietly doing nothing if the decompressor refused to decompress the block because there wasn't enough space. After that, it wasn't correctly rewriting the record to indicate "not compressed". So that's fixed now. Sigh. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* zio_compress_data: limit dest length to ABD sizeRob Norris2024-08-221-4/+3
| | | | | | | | | | | | | | | | | | Some callers (eg `do_corrective_recv()`) pass in a dest buffer much smaller than the wanted 87.5% of the source buffer, because the incoming abd is larger than the source data and they "know" what the decompressed size with be. However, `abd_borrow_buf()` rightly asserts if we try to borrow more than is available, so these callers fail. Previously when all we had was a dest buffer, we didn't know how big it was, so we couldn't do anything. Now we have a dest abd, with a size, so we can clamp dest size to the abd size. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: change zio_compress API to use ABDsRob Norris2024-08-2210-112/+116
| | | | | | | | | | | | | | This commit changes the frontend zio_compress_data and zio_decompress_data APIs to take ABD points instead of buffer pointers. All callers are updated to match. Any that already have an appropriate ABD nearby now use it directly, while at the rest we create an one. Internally, the ABDs are passed through to the provider directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: change compression providers API to use ABDsRob Norris2024-08-2211-54/+120
| | | | | | | | | | | | | | | This commit changes the provider compress and decompress API to take ABD pointers instead of buffer pointers for both data source and destination. It then updates all providers to match. This doesn't actually change the providers to do chunked compression, just changes the API to allow such an update in the future. Helper macros are added to easily adapt the ABD functions to their buffer-based implementations. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: standardise names of compression functionsRob Norris2024-08-229-106/+122
| | | | | | | | This is mostly to make searching easier. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: remove unused abd compress prototypeRob Norris2024-08-221-10/+0
| | | | | | Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: remove zio_decompress_data_bufRob Norris2024-08-222-16/+10
| | | | | | | | Nothing uses it anymore! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* zstream: use zio_compress calls for compressionRob Norris2024-08-222-118/+98
| | | | | | | | | | This is updating zstream to use the zio_compress calls rather than using its own dispatch. Since that was fairly entangled, some refactoring included. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: rework callers to always use the zio_compress callsRob Norris2024-08-223-6/+15
| | | | | | | | | | | | This will make future refactoring easier. There are two we can't change for the moment, because zio_compress_data does hole detection & collapsing which zio_decompress_data does not actually know how to handle. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* abd_get_from_buf_struct: wrap existing buf with ABD stored on stackRob Norris2024-08-222-5/+18
| | | | | | | | | This allows a simple "wrapping" ABD for an existing linear buffer to be allocated on the stack, avoiding an allocation. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* Linux 6.10 compat: METATony Hutter2024-08-211-1/+1
| | | | | | | | Update the META file to reflect compatibility with the 6.10 kernel. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16466
* libzpool/abd_os: iovec-based scatter abdRob Norris2024-08-212-300/+173
| | | | | | | | | | | | | This is intended to be a simple userspace scatter abd based on struct iovec. It's not very sophisticated as-is, but sets a base for something much more interesting. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* abd_os: break out platform-specific header partsRob Norris2024-08-2114-48/+294
| | | | | | | | | | | | Removing the platform #ifdefs from shared headers in favour of per-platform headers. Makes abd_t much leaner, among other things. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* abd_os: split userspace and Linux kernel codeRob Norris2024-08-214-149/+498
| | | | | | | | | | | | | | The Linux abd_os.c serves double-duty as the userspace scatter abd implementation, by carrying an emulation of kernel scatterlists. This commit lifts common and userspace-specific parts out into a separate abd_os.c for libzpool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* zio: no alloc canary in userspaceRob Norris2024-08-211-6/+14
| | | | | | | | | | | | Makes it harder to use memory debuggers like valgrind directly, because they can't see canary overruns. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* abd: remove ABD_FLAG_ZEROSRob Norris2024-08-214-5/+4
| | | | | | | | | | | Nothing ever checks it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* Ignore zfs_arc_shrinker_limit in direct reclaim modeshodanshok2024-08-212-3/+4
| | | | | | | | | | | | | | | | | zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse due to excessive memory reclaim. However, when the kernel is in direct reclaim mode (ie: low on memory), limiting ARC reclaim increases OOM risk. This is especially true on system without (or with inadequate) swap. This patch ignores zfs_arc_shrinker_limit when the kernel is in direct reclaim mode, avoiding most OOM. It also restores "echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop (almost) all ARC. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Adam Moss <[email protected]> Signed-off-by: Gionatan Danti <[email protected]> Closes #16313
* linux/zvol_os.c: cleanup limits for non-blk mq caseAmeer Hamza2024-08-201-5/+0
| | | | | | | | | | | Rob Noris suggested that we could clean up redundant limits for the case of non-blk mq scenario. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16462
* linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernelAmeer Hamza2024-08-201-0/+1
| | | | | | | | | | | | | | | | In kernels 6.8 and later, the zvol block device is allocated with qlimits passed during initialization. However, the zvol driver does not set `max_hw_discard_sectors`, which is necessary to properly initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting `max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve the issue. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16462
* spl-proc: remove old taskq statsRob Norris2024-08-192-279/+0
| | | | | | | | | | | | | | | These had minimal useful information for the admin, didn't work properly in some places, and knew far too much about taskq internals. With the new stats available, these should never be needed anymore. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171
* spl-taskq: summary stats for all taskqsRob Norris2024-08-191-0/+98
| | | | | | | | | | | | | This adds /proc/spl/kstats/taskq/summary, which attempts to show a useful subset of stats for all taskqs in the system. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171
* spl-taskq: per-taskq kstatsRob Norris2024-08-192-14/+342
| | | | | | | | | | | | | | | | | | | | This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq, one file per taskq, named for the taskq name.instance. These include a small amount of info about the taskq config, the current state of the threads and queues, and various counters for thread and queue activity since the taskq was created. To assist with decrementing queue size counters, the list an entry is on is encoded in spare bits in the entry flags. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171
* spl-generic: bring up kstats subsystem before taskqRob Norris2024-08-191-10/+10
| | | | | | | | | | | | | For spl-taskq to use the kstats infrastructure, it has to be available first. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171
* Skip ro check for snaps when multi-mountChunwei Chen2024-08-191-1/+7
| | | | | | | | | | Skip ro check for snapshots since they are always ro regardless if ro flag is passed by mount or not. This allows multi-mounting snapshots without requiring to specify ro flag. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #16299
* Enable L2 cache of all (MRU+MFU) metadata but MFU data onlyshodanshok2024-08-162-7/+18
| | | | | | | | | | | | | | | | | | `l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU data and metadata. However it can be useful to cache as much metadata as possible while, at the same time, restricting data cache to MFU buffers only. This patch allow for such behavior by setting `l2arc_mfuonly` to 2 (or higher). The list of possible values is the following: 0: cache both MRU and MFU for both data and metadata; 1: cache only MFU for both data and metadata; 2: cache both MRU and MFU for metadata, but only MFU for data. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gionatan Danti <[email protected]> Closes #16343 Closes #16402
* Man page updates for dmu_ddt_copiesAllan Jude2024-08-161-0/+11
| | | | | | | Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #15895
* ddt: lookup and log statsRob Norris2024-08-162-6/+159
| | | | | | | | | | | | | Adds per-DDT stats counting lookups and where they were serviced from (either log or backing zap), number of log entries in memory, and flow rates. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
* ddt: block scan until log is flushed, and flush aggressivelyRob Norris2024-08-165-6/+125
| | | | | | | | | | | | | | | | | | | | | | | The dedup log does not have a stable cursor, so its not possible to persist our current scan location within it across pool reloads. Beccause of this, when walking (scanning), we can't treat it like just another source of dedup entries. Instead, when a scan is wanted, we switch to an aggressive flushing mode, pushing out entries older than the scan start txg as fast as we can, before starting the scan proper. Entries after the scan start txg will be handled via other methods; the DDT ZAPs and logs will be written as normal, and blocks not seen yet will be offered to the scan machinery as normal. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
* ddt: dedup logRob Norris2024-08-1617-110/+1600
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds a log/journal to dedup. At the end of txg, instead of writing the entry directly to the ZAP, instead its adding to an in-memory tree and appended to an on-disk object. The on-disk object is only read at import, to reload the in-memory tree. Lookups first go the the log tree before going to the ZAP, so recently-used entries will remain close by in memory. This vastly reduces overhead from dedup IO, as it will not have to do so many read/update/write cycles on ZAP leaf nodes. A flushing facility is added at end of txg, to push logged entries out to the ZAP. There's actually two separate "logs" (in-memory tree and on-disk object), one active (recieving updated entries) and one flushing (writing out to disk). These are swapped (ie flushing begins) based on memory used by the in-memory log trees and time since we last flushed something. The flushing facility monitors the amount of entries coming in and being flushed out, and calibrates itself to try to flush enough each txg to keep up with the ingest rate without competing too much with other IO. Multiple tuneables are provided to control the flushing facility. All the histograms and stats are update to accomodate the log as a separate entry store. zdb gains knowledge of how to count them and dump them. Documentation included! Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
* ddt: tuneable to override copies= on dedup metadata objectsRob Norris2024-08-161-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | All objects stored in the MOS get copies=3. For a large dedup table, this requires significant extra IO and disk space, when its not really necessary - the dedup table itself isn't needed to read or write data, only to keep data usage down. Losing the dedup table does not render the pool unusable, it just messes up the accounting somewhat. This adds a dmu_ddt_copies tuneable. When set to 0, the existing behaviour is used. When set higher, dedup table blocks (ZAP and log) will have this many copies rather than the usual 3, while indirect blocks will have one more again. This is a tuneable for now mostly for testing. Losing a dedup table can cause blocks to be leaked, and we currently have no facilities to repair that. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895