path: root/module/zfs
Commit message (Author, Age; Files changed, Lines -/+)
* sa_lookup() ignores buffer size. (Jason King, 2023-11-07; 1 file, -1/+2)

  When retrieving a system attribute, the size of the supplied buffer
  is ignored. If the buffer is too small to hold the attribute,
  sa_attr_op() will write past the end of the buffer.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Jason King <[email protected]>
  Closes #15476
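  A small sketch of the guard the fix implies (illustrative C only;
  the names are hypothetical, and the real change lives in
  sa_attr_op()):

      #include <errno.h>
      #include <string.h>

      /* Copy an attribute out only if the caller's buffer can hold it. */
      static int
      attr_copy_out(const void *attr, size_t attr_len, void *buf,
          size_t buflen)
      {
              if (attr_len > buflen)
                      return (EOVERFLOW); /* refuse to write past the end */
              memcpy(buf, attr, attr_len);
              return (0);
      }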
* Update the kstat dataset_name when renaming a zvol (Alan Somers, 2023-11-07; 1 file, -0/+12)

  Add a dataset_kstats_rename function, and call it when renaming a
  zvol on FreeBSD and Linux.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Sponsored-by: Axcient
  Closes #15482
  Closes #15486
* Make abd_raidz_gen_iterate() pass an initialized pointer to the callback (Mark Johnston, 2023-11-07; 1 file, -3/+6)

  Otherwise callbacks may trigger KMSAN violations in the dlen == 0
  case. For example, raidz_syn_pq_abd() will compare an uninitialized
  pointer with itself before returning. This seems harmless, but let's
  maintain good hygiene and avoid passing uninitialized variables, if
  only to placate KMSAN.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Mark Johnston <[email protected]>
  Closes #15491
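  The hygiene rule being applied, as a generic sketch (not the actual
  abd code): even when the chunk length is zero, hand the callback a
  defined pointer value.

      #include <stddef.h>

      typedef void (*iter_cb_t)(const void *data, size_t len);

      static void
      iterate_chunk(const void *base, size_t len, iter_cb_t cb)
      {
              /* Initialize unconditionally so tools like KMSAN never
               * observe an uninitialized pointer, even for len == 0. */
              const void *p = (len != 0) ? base : NULL;
              cb(p, len);
      }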
* Fix accounting error for pending sync IO ops in zpool iostat (MigeljanImeri, 2023-11-07; 1 file, -2/+5)

  Currently vdev_queue_class_length is responsible for reporting the
  queue length; however, when a list is used it does not actually
  count the entries, it only reports whether the list is empty or not.
  To fix this, add a counter variable to vdev_queue_class to keep
  track of the sync IO ops, and change vdev_queue_class_length to
  reference this variable instead.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: MigeljanImeri <[email protected]>
  Closes #15478
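  The general pattern, sketched with hypothetical names: keep an
  explicit counter next to the list so the queue length is an O(1)
  read rather than only "empty or not".

      #include <stddef.h>

      struct io_node { struct io_node *next; };

      struct io_queue {
              struct io_node *head;
              size_t count;   /* updated on every enqueue/dequeue */
      };

      static void
      ioq_push(struct io_queue *q, struct io_node *n)
      {
              n->next = q->head;
              q->head = n;
              q->count++;
      }

      static size_t
      ioq_length(const struct io_queue *q)
      {
              return (q->count); /* exact length, not just emptiness */
      }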
* Improve ZFS objset sync parallelism (ednadolski-ix, 2023-11-06; 8 files, -110/+321)

  As part of transaction group commit, dsl_pool_sync() sequentially
  calls dsl_dataset_sync() for each dirty dataset, which subsequently
  calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of
  CPU cores to run sync_dnodes_task() in taskq threads to sync the
  dirty dnodes (files).

  There are two problems:

  1. Each ZVOL in a pool is a separate dataset/objset having a single
     dnode. This means the objsets are synchronized serially, which
     leads to a bottleneck of ~330K blocks written per second per
     pool.

  2. In the case of multiple dirty dnodes/files on a dataset/objset on
     a big system they will be sync'd in parallel taskq threads.
     However, it is inefficient to use 75% of the CPU cores of a big
     system to do that, because of (a) bottlenecks on a single write
     issue taskq, and (b) allocation throttling. In addition, if not
     for the allocation throttling sorting write requests by bookmarks
     (logical address), writes for different files may reach space
     allocators interleaved, leading to unwanted fragmentation.

  The solution to both problems is to always sync no more and (if
  possible) no fewer dnodes at the same time than there are allocators
  in the pool.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Edmund Nadolski <[email protected]>
  Closes #15197
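  A sketch of the batching arithmetic only (an assumed simplification,
  not the patch itself): split the dirty dnodes into exactly as many
  batches as the pool has allocators.

      #include <stddef.h>

      /* Ceiling division: ndnodes dirty dnodes into nalloc batches,
       * so the number of concurrent sync tasks matches the pool's
       * allocator count. */
      static size_t
      sync_batch_size(size_t ndnodes, size_t nalloc)
      {
              if (nalloc == 0)
                      nalloc = 1;
              return ((ndnodes + nalloc - 1) / nalloc);
      }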
* Fix block cloning between unencrypted and encrypted datasets (Martin Matuška, 2023-10-31; 1 file, -0/+9)

  Block cloning from an encrypted dataset into an unencrypted dataset
  and vice versa is not possible. The current code did allow cloning
  unencrypted files into an encrypted dataset, causing a panic when
  these were accessed. Block cloning between two encrypted datasets is
  currently supported only within the same filesystem.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Kay Pedersen <[email protected]>
  Reviewed-by: Rob N <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Martin Matuska <[email protected]>
  Closes #15464
  Closes #15465
* zvol: fix delayed update to block device ro entry (Ameer Hamza, 2023-10-31; 2 files, -0/+29)

  A change to the zvol readonly property does not update the block
  device readonly entry until the first IO to the ZVOL. This patch
  addresses the issue by updating the block device readonly property
  from the set-property IOCTL call.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #15409
* zvol: Implement zvol threading as a Property (Ameer Hamza, 2023-10-31; 2 files, -0/+23)

  Currently, zvol threading can be switched through the
  zvol_request_sync module parameter system-wide. By making it a zvol
  property, zvol threading can be switched per zvol.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #15409
* zvol: Cleanup set property (Ameer Hamza, 2023-10-31; 2 files, -101/+32)

  zvol_set_volmode() and zvol_set_snapdev() share a common code path.
  Merge this shared code path into zvol_set_common().

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #15409
* Unify arc_prune_async() code (Alexander Motin, 2023-10-30; 1 file, -0/+52)

  There is no sense in having separate implementations for FreeBSD and
  Linux. Make the Linux code shared, as it is more functional, and
  just register a FreeBSD-specific prune callback with the
  arc_add_prune_callback() API.

  Aside from code cleanup this should fix excessive pruning on
  FreeBSD:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Johnston <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15456
* Tune zio buffer caches and their alignments (Alexander Motin, 2023-10-30; 1 file, -50/+39)

  We should not always use PAGESIZE alignment for caches bigger than
  it, and SPA_MINBLOCKSIZE otherwise. Doing that, caches for 5, 6, 7,
  10 and 14KB are rounded up to 8, 12 and 16KB respectively, which
  makes no sense. Instead, specify as alignment the biggest power-of-2
  divisor. This way 2KB and 6KB caches are both aligned to 2KB, while
  4KB and 8KB are aligned to 4KB.

  Reduce the number of caches to half-power-of-2 steps instead of
  quarter-power-of-2. This removes caches that are difficult for the
  underlying allocators to fit into page-granular slabs, such as: 2.5,
  3.5, 5, 7, 10KB, etc. Since these caches are mostly used for
  transient allocations like ZIOs and the small DBUF cache, it is not
  worth being too aggressive. Due to the above alignment issue some of
  those caches were not working properly anyway. The 6KB cache now
  finally has a chance to work right, placing 2 buffers into 3 pages,
  which makes sense.

  Remove explicit alignment in the Linux user-space case. I don't
  think it should be needed any more with the above fixes.

  As a result, on FreeBSD, instead of these numbers of pages per slab:

      vm.uma.zio_buf_comb_16384.keg.ppera: 4
      vm.uma.zio_buf_comb_14336.keg.ppera: 4
      vm.uma.zio_buf_comb_12288.keg.ppera: 3
      vm.uma.zio_buf_comb_10240.keg.ppera: 3
      vm.uma.zio_buf_comb_8192.keg.ppera: 2
      vm.uma.zio_buf_comb_7168.keg.ppera: 2
      vm.uma.zio_buf_comb_6144.keg.ppera: 2   <= Broken
      vm.uma.zio_buf_comb_5120.keg.ppera: 2
      vm.uma.zio_buf_comb_4096.keg.ppera: 1
      vm.uma.zio_buf_comb_3584.keg.ppera: 7   <= Hard to free
      vm.uma.zio_buf_comb_3072.keg.ppera: 3
      vm.uma.zio_buf_comb_2560.keg.ppera: 2
      vm.uma.zio_buf_comb_2048.keg.ppera: 1
      vm.uma.zio_buf_comb_1536.keg.ppera: 2
      vm.uma.zio_buf_comb_1024.keg.ppera: 1
      vm.uma.zio_buf_comb_512.keg.ppera: 1

  I am now getting these:

      vm.uma.zio_buf_comb_16384.keg.ppera: 4
      vm.uma.zio_buf_comb_12288.keg.ppera: 3
      vm.uma.zio_buf_comb_8192.keg.ppera: 2
      vm.uma.zio_buf_comb_6144.keg.ppera: 3   <= Fixed, 2 in 3 pages
      vm.uma.zio_buf_comb_4096.keg.ppera: 1
      vm.uma.zio_buf_comb_3072.keg.ppera: 3
      vm.uma.zio_buf_comb_2048.keg.ppera: 1
      vm.uma.zio_buf_comb_1536.keg.ppera: 2
      vm.uma.zio_buf_comb_1024.keg.ppera: 1
      vm.uma.zio_buf_comb_512.keg.ppera: 1

  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15452
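  The "biggest power-of-2 divisor" is a single bit trick; a tiny
  sketch (illustrative only, not the patch itself):

      #include <stddef.h>

      /* Isolate the lowest set bit: the largest power-of-2 divisor.
       * E.g. 6144 (6KB) -> 2048, 8192 (8KB) -> 8192. */
      static size_t
      p2_divisor(size_t size)
      {
              return (size & (~size + 1));
      }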
* RAIDZ: Use cache blocking during parity math (Alexander Motin, 2023-10-30; 2 files, -79/+101)

  RAIDZ parity is calculated by adding data one column at a time. That
  works OK for small blocks, but for large blocks the results of a
  previous addition may already be evicted from CPU caches to main
  memory, and in addition to the extra memory write this requires an
  extra read to get it back. This patch splits large parity operations
  into 64KB chunks, which should in most cases fit into the CPU L2
  caches from the last decade. I haven't touched the more complicated
  cases of data reconstruction so as not to overcomplicate the code.
  Those should be relatively rare.

  My tests on a Xeon Gold 6242R CPU with 1MB of L2 cache per core show
  up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/
  RAIDZ2 blocks of ~4MB and up. Older CPUs with 256KB of L2 cache
  should see the effect even on smaller blocks. Wider vdevs may need
  bigger blocks to be affected.

  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15448
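  The technique as a standalone sketch (plain XOR parity only,
  assuming at least one data column; the real code also covers the
  Reed-Solomon math): process the columns in 64KB chunks so the
  partial parity of a chunk stays cache-resident while every column is
  added into it.

      #include <stddef.h>
      #include <stdint.h>

      #define CHUNK   (64 * 1024)     /* target: fits in L2 cache */

      static void
      xor_parity_blocked(uint8_t *p, uint8_t *const *cols,
          size_t ncols, size_t len)
      {
              for (size_t off = 0; off < len; off += CHUNK) {
                      size_t n = (len - off < CHUNK) ? len - off : CHUNK;
                      /* First column initializes the chunk's parity. */
                      for (size_t i = 0; i < n; i++)
                              p[off + i] = cols[0][off + i];
                      /* Remaining columns are XORed in while the chunk
                       * is still hot in cache. */
                      for (size_t c = 1; c < ncols; c++)
                              for (size_t i = 0; i < n; i++)
                                      p[off + i] ^= cols[c][off + i];
              }
      }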
* ZIL: Cleanup sync and commit handling (Alexander Motin, 2023-10-30; 4 files, -31/+11)

  ZVOL:
  - Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only one
    object, it makes no sense to maintain an async queue and merge it
    into sync on each commit. A single sync queue is just cheaper,
    while it changes nothing until an actual commit request arrives.
  - Remove zsd_sync_cnt and the zil_async_to_sync() calls since we are
    no longer switching between sync and async queues.

  ZFS:
  - Mark write transactions as sync based only on the number of sync
    opens (z_sync_cnt). We can not randomly jump between sync and
    async unless we want data corruption due to write reordering.
  - When a file is first opened with O_SYNC (z_sync_cnt incremented to
    1), call zil_async_to_sync() for it to preserve correct ordering
    between past and future writes.
  - Drop the zfs_fsyncer_key logic. It looks like it was an
    optimization for workloads heavily intermixing async writes with
    tons of fsyncs. But first, it was broken 8 years ago due to the
    Linux tsd implementation not allowing data storage between
    syscalls, and second, I doubt it is safe to switch from async to
    sync so often and without calling zil_async_to_sync().
  - Rename the sync argument of *_log_write() to commit, now only
    signalling the caller's intent to call zil_commit() soon after.
    It allows WR_COPIED optimizations without overloading the argument
    with other meanings.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15366
* Read prefetched buffers from L2ARC (shodanshok, 2023-10-26; 1 file, -4/+1)

  Prefetched buffers are currently read from L2ARC if, and only if,
  l2arc_noprefetch is set to the non-default value of 0. This means
  that a streaming read which could be served from L2ARC will instead
  engage the main pool.

  For example, consider what happens when a file is sequentially read:
  - the application requests contiguous data, engaging the prefetcher;
  - ARC buffers are initially marked as prefetched but, as the calling
    application consumes data, the prefetch tag is cleared;
  - these "normal" buffers become eligible for L2ARC and are copied to
    it;
  - re-reading the same file will *not* engage L2ARC even if it
    contains the required buffers;
  - the main pool has to suffer another sequential read load, which
    (due to most NCQ-enabled HDDs preferring sequential loads) can
    dramatically increase latency for uncached random reads.

  In other words, the current behavior is to write data to L2ARC
  (wearing it) without using the very same cache when reading back the
  same data. This was probably useful many years ago to preserve L2ARC
  read bandwidth but, with current SSD speed/size/price, it is vastly
  sub-optimal.

  Setting l2arc_noprefetch=1, while enabling L2ARC to serve these
  reads, means that even prefetched but unused buffers will be copied
  into L2ARC, further increasing wear and load for potentially
  not-useful data.

  This patch enables prefetched buffers to be read from L2ARC even
  when l2arc_noprefetch=1 (the default), increasing sequential read
  speed and reducing load on the main pool without polluting L2ARC
  with not-useful (i.e. unused) prefetched data. Moreover, it clears
  up user confusion about the L2ARC size increasing while not serving
  any IO during sequential reads.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Gionatan Danti <[email protected]>
  Closes #15451
* Add mutex_enter_interruptible() for interruptible sleeping IOCTLs (Thomas Bertschinger, 2023-10-26; 2 files, -10/+12)

  Many long-running ZFS ioctls lock the spa_namespace_lock, forcing
  concurrent ioctls to sleep for the mutex. Previously, the only
  option was to call mutex_enter(), which sleeps uninterruptibly. This
  is a usability issue for sysadmins; for example, if the admin runs
  `zpool status` while a slow `zpool import` is ongoing, the admin's
  shell will be locked in uninterruptible sleep for a long time.

  This patch resolves this admin usability issue by introducing
  mutex_enter_interruptible(), which sleeps interruptibly while
  waiting to acquire a lock. It is implemented for both Linux and
  FreeBSD.

  The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed
  to use this new macro so that the command can be interrupted if it
  is issued during a concurrent `zpool import` (or other long-running
  operation).

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Thomas Bertschinger <[email protected]>
  Closes #15360
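  A sketch of the intended call-site pattern (a kernel-context
  fragment, not standalone; the return convention of the new macro is
  assumed here to mirror Linux's mutex_lock_interruptible(), i.e.
  nonzero when a signal interrupts the wait):

      static int
      pool_configs_ioctl(void)
      {
              if (mutex_enter_interruptible(&spa_namespace_lock) != 0)
                      return (EINTR); /* interrupted: caller can bail out */
              /* ... copy out the cached pool configs ... */
              mutex_exit(&spa_namespace_lock);
              return (0);
      }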
* ZIO: Remove READY pipeline stage from root ZIOs (Alexander Motin, 2023-10-25; 2 files, -11/+44)

  zio_root() has no arguments for a ready callback or a parent ZIO.
  Except for one recent case in the ZIL code, if root ZIOs ever have a
  parent it is also a root ZIO. This means we do not need the READY
  pipeline stage for them, which takes some time to process, but even
  more time to wait for the children and be woken by them, all for no
  good reason. The most visible effect of this change is that it
  avoids one taskq wakeup per ZIL block written, previously used to
  run zio_ready() for lwb_root_zio and skipped now.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15398
* ZIL: Detect single-threaded workloads (Alexander Motin, 2023-10-24; 1 file, -51/+40)

  ... by checking that the previous block is fully written and
  flushed. This allows skipping commit delays, since we can give up on
  aggregation in that case. It removes the zil_min_commit_timeout
  parameter, since for single-threaded workloads it is not needed at
  all, while on very fast devices even some multi-threaded workloads
  may get detected as single-threaded and still bypass the wait. To
  give multi-threaded workloads more aggregation chances, increase
  zfs_commit_timeout_pct from 5 to 10%, as they should suffer less
  from the additional latency. Single-threaded workload detection also
  allows, in perspective, better prediction of the next block size.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Prakash Surya <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15381
* ABD: Be more assertive in iterators (Alexander Motin, 2023-10-24; 1 file, -78/+26)

  Once we have verified the ABDs and asserted the sizes, we should
  never see premature ABD ends. Assert that, and remove the extra
  branches from production builds.

  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15428
* Add prefetch property (Brian Behlendorf, 2023-10-24; 2 files, -1/+25)

  ZFS prefetch is currently governed by the zfs_prefetch_disable
  tunable. However, this is a module-wide setting: if a specific
  dataset benefits from prefetch while others have issues with it, an
  optimal solution does not exist. This commit introduces the
  "prefetch" tri-state property, which enables granular control (at
  the dataset/volume level) of prefetching.

  This patch does not remove zfs_prefetch_disable, which remains a
  system-wide switch to enable/disable prefetch. However, to avoid
  duplication, it would be preferable to deprecate and then remove the
  module tunable.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Signed-off-by: Gionatan Danti <[email protected]>
  Co-authored-by: Gionatan Danti <[email protected]>
  Closes #15237
  Closes #15436
* Revert "Do not persist user/group/project quota zap objects when unneeded"Brian Behlendorf2023-10-231-16/+2
| | | | | | | This reverts commit 797f55ef12d752d2a7fb04fae5d24e019adf2a1d which was causing a VERIFY failure when running the project quota tests. Signed-off-by: Brian Behlendorf <[email protected]> Closes #15438
* spa: document spa_thread() and SDC feature gates (Rob N, 2023-10-23; 1 file, -6/+35)

  spa_thread() and the "System Duty Cycle" scheduling class are from
  Illumos and have not yet been adapted to Linux or FreeBSD.

  HAVE_SPA_THREAD has long been explicitly undefined and used to mark
  spa_thread(), but there's some related taskq code that can never be
  invoked without it, which makes some already-tricky code harder to
  read. HAVE_SYSDC is introduced in this commit to mark the SDC parts.
  SDC requires spa_thread(), but the inverse is not true, so they are
  separate.

  I don't want to make the call to just remove it because I still
  harbour hopes that OpenZFS could become a first-class citizen on
  Illumos someday. But hopefully this will at least make the reason it
  exists a bit clearer for people without long memories and/or an
  interest in history.

  For those that are interested in the history, the original FreeBSD
  port of ZFS (before ZFS-on-Linux was adopted there) did have a
  spa_thread(), but not SDC. The last version of that before it was
  removed can be read here:
  https://github.com/freebsd/freebsd-src/blob/22df1ffd812f0395cdb7c0b1edae1f67b991562a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c

  Meanwhile, more information on the SDC scheduling class is here:
  https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/disp/sysdc.c

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #15406
* Do not persist user/group/project quota zap objects when unneeded (Sam Atkinson, 2023-10-20; 1 file, -2/+16)

  In the zfs_id_over*quota functions, there is a short-circuit to skip
  the zap_lookup when the quota zap does not exist. If quotas are
  never used in a zpool, then the quota zap will never exist. But if
  user/group/project quotas are ever used, the zap objects will be
  created and will persist even if the quotas are deleted. The quota
  zap_lookup in the write path can become a bottleneck for write-heavy
  small I/O workloads. Before this commit, it was not possible to
  remove this lookup without creating a new zpool.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Sam Atkinson <[email protected]>
  Closes #14721
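  The short-circuit in question, as a simplified sketch (names
  hypothetical): when no quota zap object exists, the expensive lookup
  is skipped entirely, which is why stale zap objects left behind cost
  write-path performance.

      #include <stdbool.h>
      #include <stdint.h>

      static bool
      id_over_quota(uint64_t quota_zap_obj)
      {
              if (quota_zap_obj == 0)
                      return (false); /* fast path: quotas never set up */
              /* ... otherwise a zap_lookup() would fetch the quota
               * value and compare it against current usage ... */
              return (false);
      }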
* Trust ARC_BUF_SHARED() more (Alexander Motin, 2023-10-20; 1 file, -17/+24)

  In my understanding ARC_BUF_SHARED() and arc_buf_is_shared() should
  return identical results, except the second also asserts it more
  deeply. The first is much cheaper though, saving a few pointer
  dereferences. Replace production arc_buf_is_shared() calls with
  ARC_BUF_SHARED(), and call arc_buf_is_shared() in random assertions,
  while making it even more strict. In my tests this halves
  arc_buf_destroy_impl() time, which noticeably reduces hash_lock
  congestion under heavy dbuf eviction.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15397
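  The underlying pattern, sketched with hypothetical names: a cheap
  macro for production paths, and a heavier function that re-derives
  and cross-checks the same answer, reserved for assertions.

      #include <assert.h>
      #include <stddef.h>

      struct buf { void *b_shared_data; };

      #define BUF_SHARED(b)   ((b)->b_shared_data != NULL) /* cheap */

      static int
      buf_is_shared_heavy(const struct buf *b)
      {
              /* ... verify that related state agrees before answering ... */
              return (BUF_SHARED(b));
      }

      /* Production: if (BUF_SHARED(b)) { ... }
       * Debug only: assert(buf_is_shared_heavy(b)); */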
* Remove lock from dsl_pool_need_dirty_delay() (Alexander Motin, 2023-10-20; 1 file, -7/+7)

  Torn reads/writes of dp_dirty_total are unlikely: on 64-bit systems
  due to register size, while on 32-bit due to memory constraints. And
  even if we hit some race, the code implementing the delay takes the
  lock anyway. Removal of the pool-wide lock acquisition saves ~1% of
  CPU time on an 8-thread 8KB write workload.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15390
* Set spa_ccw_fail_time=0 when expanding a vdev. (Colin Percival, 2023-10-20; 1 file, -0/+1)

  When a vdev is to be expanded -- either via `zpool online -e` or via
  the autoexpand option -- a SPA_ASYNC_CONFIG_UPDATE request is queued
  to be handled via an asynchronous worker thread (spa_async_thread).
  This normally happens almost immediately; but will be delayed up to
  zfs_ccw_retry_interval seconds (default 5 minutes) if an attempt to
  write the zpool configuration cache failed.

  When FreeBSD boots ZFS-root VM images generated using
  `makefs -t zfs`, the zpoolupgrade rc.d script runs `zpool upgrade`,
  which modifies the pool configuration and triggers an attempt to
  write to the cache file. This attempted write fails because the
  filesystem is still mounted read-only at this point in the boot
  process, triggering a 5-minute cooldown before
  SPA_ASYNC_CONFIG_UPDATE requests will be handled by the asynchronous
  worker thread.

  When expanding a vdev, reset the "when did a configuration cache
  write last fail" value so that the SPA_ASYNC_CONFIG_UPDATE request
  will be handled promptly. A cleaner but more intrusive option would
  be to use separate SPA_ASYNC_ flags for "configuration changed" and
  "try writing the configuration cache again", but with FreeBSD 14.0
  coming very soon I'd prefer to leave such refactoring for a later
  date.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Colin Percival <[email protected]>
  Closes #15405
* Large sync writes perform worse with slog (John Wren Kennedy, 2023-10-13; 1 file, -1/+1)

  For synchronous write workloads with large IO sizes, a pool
  configured with a slog performs worse than one with an embedded zil:

      sequential_writes 1m sync ios, 16 threads
        Write IOPS:            1292       438   -66.10%
        Write Bandwidth:    1323570    448910   -66.08%
        Write Latency:     12128400  36330970      3.0x

      sequential_writes 1m sync ios, 32 threads
        Write IOPS:            1293       430   -66.74%
        Write Bandwidth:    1324184    441188   -66.68%
        Write Latency:     24486278  74028536      3.0x

  The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`,
  if a zil block is greater than 768K, the priority of the write is
  downgraded from sync to async. Increasing the value allows greater
  throughput. To select a value for this PR, I ran an fio workload
  with the following values for `zil_slog_bulk`:

      zil_slog_bulk    KiB/s
      1048576         422132
      2097152         478935
      4194304         533645
      8388608         623031
      12582912        827158
      16777216       1038359
      25165824       1142210
      33554432       1211472
      50331648       1292847
      67108864       1308506
      100663296      1306821
      134217728      1304998

  At 64M, the results with a slog are now improved to parity with an
  embedded zil:

      sequential_writes 1m sync ios, 16 threads
        Write IOPS:             438      1288      2.9x
        Write Bandwidth:     448910   1319062      2.9x
        Write Latency:     36330970  12163408   -66.52%

      sequential_writes 1m sync ios, 32 threads
        Write IOPS:             430      1290      3.0x
        Write Bandwidth:     441188   1321693      3.0x
        Write Latency:     74028536  24519698   -66.88%

  None of the other tests in the performance suite (run with a zil or
  slog) had a significant change, including the random_write_zil
  tests, which use multiple datasets.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Tony Nguyen <[email protected]>
  Signed-off-by: John Wren Kennedy <[email protected]>
  Closes #14378
* Zpool can start allocating from metaslab before TRIMs have completed (Jason King, 2023-10-12; 1 file, -9/+19)

  When doing a manual TRIM on a zpool, the metaslab being TRIMmed is
  potentially re-enabled before all queued TRIM zios for that metaslab
  have completed. Since TRIM zios have the lowest priority, it is
  possible to get into a situation where allocations occur from the
  just re-enabled metaslab and cut ahead of queued TRIMs to the same
  metaslab. If the ranges overlap, this will cause corruption.

  We were able to trigger this pretty consistently with a small single
  top-level vdev zpool (i.e. a small number of metaslabs) with heavy
  parallel write activity while performing a manual TRIM against a
  somewhat 'slow' device (so TRIMs took a bit of time to complete).
  With the patch, we've not been able to recreate it since. The issue
  was found on illumos, but inspection of the OpenZFS trim code
  suggests the relevant pieces are largely unchanged, so it appears it
  would be vulnerable to the same issue.

  Reviewed-by: Igor Kozhukhov <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Jason King <[email protected]>
  Illumos-issue: https://www.illumos.org/issues/15939
  Closes #15395
* DMU: Do not pre-read holes during write (Alexander Motin, 2023-10-11; 1 file, -3/+5)

  dmu_tx_check_ioerr() pre-reads blocks that are going to be dirtied
  as part of a transaction, both to prefetch them and to check for
  errors. But it makes no sense to do this for holes, since there are
  no disk reads to prefetch and there can be no errors. On the other
  side, those blocks are anonymous, and they are freed immediately by
  the dbuf_rele() without even being put into the dbuf cache, so we
  just burn CPU time on decompression and overheads and get absolutely
  no result at the end.

  Use of dbuf_hold_impl() with the fail_sparse parameter allows
  skipping the extra work, and on my tests with sequential 8KB writes
  to an empty ZVOL with 32KB blocks it shows a throughput increase
  from 1.7 to 2GB/s.

  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15371
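  The shape of the change, heavily simplified (the commit uses
  dbuf_hold_impl() with fail_sparse set; this is only the control-flow
  idea): treat "this block is a hole" as a reason to skip the pre-read
  rather than materialize an anonymous buffer.

      #include <stdbool.h>

      static void
      maybe_preread(bool is_hole)
      {
              if (is_hole)
                      return; /* no disk blocks: nothing to prefetch,
                               * no I/O errors possible */
              /* ... issue the read to warm caches and surface errors
               * before the transaction dirties the block ... */
      }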
* ZIL: Reduce maximum size of WR_COPIED to 7.5K (Alexander Motin, 2023-10-06; 1 file, -6/+11)

  Benchmarks show that at certain write sizes the range lock/unlock
  takes less time than the extra memory copy. The exact threshold is
  not obvious due to other overheads, but it is definitely lower than
  the ~63KB used before. Make it configurable, defaulting to 7.5KB,
  that is, 8KB of the nearest malloc() size minus the itx and lr
  structs.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15353
* import: require force when cachefile hostid doesn't match on-disk (Rob Norris, 2023-10-06; 1 file, -0/+18)

  Previously, if a cachefile is passed to zpool import, the cached
  config is mostly offered as-is to
  ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and the results are taken
  as the canonical pool config and handed back to
  ZFS_IOC_POOL_IMPORT.

  In the course of its operation, spa_load() will inspect the pool and
  build a new config from what it finds on disk. However, it then
  regenerates a new config ready to import, and so rightly sets the
  hostid and hostname for the local host in the config it returns.

  Because of this, the "require force" checks always decide the pool
  is exported and last touched by the local host, even if this is not
  true, which is possible in a HA environment when MMP is not enabled.
  The pool may be imported on another head, but the import checks
  still pass here, so the pool ends up imported on both.

  (This doesn't happen when a cachefile isn't used, because the pool
  config is discovered in userspace in zpool_find_import(), and that
  does find the on-disk hostid and hostname correctly).

  Since the systemd zfs-import-cache.service unit uses cachefile
  imports, this can lead to a system returning after a crash with a
  "valid" cachefile on disk and automatically, quietly, importing a
  pool that has already been taken up by a secondary head.

  This commit causes the on-disk hostid and hostname to be included in
  the ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then
  changes the "force" checks for zpool import to use them if present.

  This method should give no change in behaviour for old userspace on
  new kernels (they won't know to look for the new config items) and
  for new userspace on old kernels (they won't find the new config
  items).

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Closes #15290
* Reduce number of metaslab preload taskq threads. (Alexander Motin, 2023-10-06; 2 files, -22/+29)

  Before this change ZFS created threads for 50% of CPUs for each
  top-level vdev. Plus it created the same number of threads for
  embedded log groups (which have only one metaslab and don't need any
  preload). As a result, on a system with 80 CPUs and a pool of 60
  vdevs this resulted in 4800 metaslab preload threads, which is
  absolutely insane.

  This patch changes the preload threads to 50% of CPUs in one taskq
  per pool, so on the mentioned system it will be only 40 threads.
  Among other things this fixes zdb on the mentioned system and pool
  on FreeBSD, which failed to create so many threads in one process.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15319
* ARC: Drop different size headers for crypto (Alexander Motin, 2023-10-06; 1 file, -175/+8)

  To reduce memory usage, ZFS crypto allocated ARC headers bigger by
  56 bytes only when the specific block was encrypted on disk. It was
  a nice optimization, except that in some cases the code reallocated
  them on the fly, which invalidated header pointers from the buffers.
  Since the buffers use different locking, it created a number of
  races, which were originally covered (at least partially) by
  b_evict_lock, also used to protect evictions. But that went away as
  part of #14340. As a result, as was found in #15293,
  arc_hdr_realloc_crypt() ended up unprotected and causing
  use-after-free.

  Instead of introducing some even more elaborate locking, this patch
  just drops the difference between normal and protected headers. It
  costs us an additional 56 bytes per header, but with a couple of
  patches saving 24 bytes, the net growth is only 32 bytes, with a
  total header size of 232 bytes on FreeBSD, which IMHO is an
  acceptable price for simplicity. Additional locking would also end
  up consuming space, time or both.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15293
  Closes #15347
* ARC: Remove b_bufcnt/b_ebufcnt from ARC headers (Alexander Motin, 2023-10-06; 1 file, -70/+35)

  In most cases we do not care about the exact number of buffers
  linked to the header; we just need to know if it is zero, non-zero
  or one. That can easily be checked just by looking at the b_buf
  pointer or, in some cases, dereferencing it. b_ebufcnt is read only
  once, and in that case we already traverse the list as part of
  arc_buf_remove(), so a second traversal should not be expensive.

  This reduces the L1 ARC header size by 8 bytes and the full crypto
  header by 16 bytes, down to 176 and 232 bytes on FreeBSD
  respectively.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15350
* ARC: Remove b_cv from struct l1arc_buf_hdr (Alexander Motin, 2023-10-04; 1 file, -12/+22)

  Earlier, as part of #14123, I removed one use of b_cv. This patch
  reuses the same approach to remove the other one from a much rarer
  code path. This saves 16 bytes of the L1 ARC header on FreeBSD
  (reducing it from 200 to 184 bytes) and seemingly even more on
  Linux.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15340
* Report ashift of L2ARC devices in zdb (George Amanakis, 2023-10-02; 1 file, -0/+10)

  Commit 8af1104f does not actually store the ashift of cache devices
  in their label. However, in order to facilitate reporting the ashift
  through zdb, we enable this in the present commit. We also document
  how the retrieval of the ashift is done.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #15331
* Restrict short block cloning requests (Alexander Motin, 2023-09-29; 1 file, -0/+13)

  If we are copying only one block and it is smaller than the
  recordsize property, do not allow the destination to grow beyond one
  block if it is not there yet. Otherwise the destination will get
  stuck with that block size forever, which can be as small as 512
  bytes, no matter how big the destination grows later.

  Reviewed-by: Kay Pedersen <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15321
* Tweak rebuild in-flight hard limit (Brian Behlendorf, 2023-09-29; 1 file, -2/+2)

  Vendor testing shows we should be able to get a little more
  performance if we further relax the hard limit which we're hitting.

  Authored-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Tony Hutter <[email protected]>
  Closes #15324
* Fix ENOSPC for extended quota (Akash B, 2023-09-28; 1 file, -14/+8)

  When unlinking multiple files from a pool at 100% capacity, it was
  possible for ENOSPC to be returned after the first few unlinks. This
  issue was fixed previously by PR #13172 but was reintroduced by PR
  #13839. It is resolved using the existing mechanism of returning
  ERESTART when over quota, as long as we know enough space will
  shortly be available after processing the pending deferred frees.
  Also updated the existing testcase, which reliably reproduced the
  issue without this patch.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Dipak Ghosh <[email protected]>
  Signed-off-by: Akash B <[email protected]>
  Closes #15312
* Don't allocate from new metaslabs (Paul Dagnelie, 2023-09-28; 1 file, -0/+9)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #15307
  Closes #15308
* status: report pool suspension state under failmode=continue (Rob N, 2023-09-20; 1 file, -2/+2)

  When failmode=continue is set and the pool suspends, both
  'zpool status' and the 'zfs/pool/state' kstat ignore it and report
  the normal vdev tree state. There's no clear indicator that the pool
  is suspended. This is unlike suspend in failmode=wait, or suspend
  due to MMP check failure, which both report "SUSPENDED" explicitly.
  This commit changes it so SUSPENDED is reported for
  failmode=continue the same as for other modes.

  Rationale:

  The historical behaviour of failmode=continue is roughly, "press on
  as though all is well". To this end, the fact that the pool had
  suspended was not shown, to maintain the façade that all is well.
  It's unclear why hiding this information was considered appropriate.
  One possibility is that it was expected that a true pool fault would
  always be reported as DEGRADED or FAULTED, and that the pool could
  not suspend without these happening. That is not necessarily true,
  as vdev health and suspend state are only loosely connected, such
  that a pool in (apparent) good health can be suspended for good
  reasons, and of course a degraded pool does not lead to suspension.

  Even if that expectation were true, there's still a difference in
  urgency - a degraded pool may not need to be attended to for hours,
  while a suspended pool is most often unusable until an operator
  intervenes. An operator that has set failmode=continue has
  presumably done so because their workload is one that can continue
  to operate in a useful way when the pool suspends. In this case the
  operator still needs a clear indicator that there is a problem that
  needs attending to.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #15297
* ZIL: Fix potential race on flush deferring. (Alexander Motin, 2023-09-20; 1 file, -0/+9)

  zil_lwb_set_zio_dependency() can not set a write ZIO dependency on
  the previous LWB's write ZIO if that one is already in its done
  handler and has set its state to LWB_STATE_WRITE_DONE. So
  theoretically the done handler of the next LWB's write ZIO may run
  before the done handler of the previous LWB's write ZIO completes.
  In such a case we can not defer flushes, since the flush issue
  process is not locked. This may fix some reported assertions of
  lwb_vdev_tree not being empty inside zil_free_lwb().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15278
* Use ASSERT0P() to check that a pointer is NULL. (Dag-Erling Smørgrav, 2023-09-19; 1 file, -1/+1)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Kay Pedersen <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Dag-Erling Smørgrav <[email protected]>
  Closes #15225
* Fix l2arc_apply_transforms ztest crash (Paul Dagnelie, 2023-09-19; 1 file, -5/+6)

  In #13375 we modified the allocation size of the buffer that we use
  to apply l2arc transforms to be the size of the arc hdr we're using,
  rather than the allocation size that will be in place on the disk,
  because sometimes the hdr size is larger. Unfortunately, sometimes
  the allocation size is larger, which means that we overflow the
  buffer in that case. This change modifies the allocation to be the
  max of the two values.

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #15177
  Closes #15248
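  The fix in miniature (illustrative names): neither size is
  guaranteed to be the larger one, so the transform buffer must cover
  both.

      #include <stddef.h>

      #define MAX(a, b)       ((a) > (b) ? (a) : (b))

      static size_t
      transform_buf_size(size_t hdr_size, size_t ondisk_asize)
      {
              /* Either value can exceed the other, per the commit. */
              return (MAX(hdr_size, ondisk_asize));
      }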
* Update the MOS directory on spa_upgrade_errlog() (George Amanakis, 2023-09-18; 1 file, -0/+9)

  spa_upgrade_errlog() does not update the MOS directory when the
  head_errlog feature is enabled. In this case, if spa_errlog_sync()
  is not called, the MOS dir references the old errlog_last and
  errlog_sync objects. Thus when doing a scrub a panic will occur:

      Call Trace:
       dump_stack+0x6d/0x8b
       panic+0x101/0x2e3
       spl_panic+0xcf/0x102 [spl]
       delete_errlog+0x124/0x130 [zfs]
       spa_errlog_sync+0x256/0x260 [zfs]
       spa_sync_iterate_to_convergence+0xe5/0x250 [zfs]
       spa_sync+0x2f7/0x670 [zfs]
       txg_sync_thread+0x22d/0x2d0 [zfs]
       thread_generic_wrapper+0x83/0xa0 [spl]
       kthread+0x104/0x140
       ret_from_fork+0x1f/0x40

  Fix this by updating the related MOS directory objects in
  spa_upgrade_errlog().

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #15279
  Closes #15277
* Add more constraints for block cloning. (Alexander Motin, 2023-09-09; 1 file, -2/+19)

  - We cannot clone into files with a smaller block size if there is
    more than one block, since we can not grow the block size.
  - The block size must be power-of-2 if the destination offset != 0,
    since there can be no multiple blocks of non-power-of-2 size.

  The first handles the case when the destination file has several
  blocks but still is not bigger than one block of the source file.
  The second fixes a panic in dmu_buf_hold_array_by_dnode() on an
  attempt to concatenate files with equal but non-power-of-2 block
  sizes. While there, assert that an error is reported if we made no
  progress.

  Reviewed-by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Kay Pedersen <[email protected]>
  Reviewed-by: Umer Saleem <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15251
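  Both constraints condensed into a hedged sketch (the helper is
  hypothetical; the real checks live in the clone entry path):

      #include <stdbool.h>
      #include <stdint.h>

      #define ISP2(x) (((x) & ((x) - 1)) == 0)

      static bool
      clone_blocksize_ok(uint64_t src_blksz, uint64_t dst_blksz,
          uint64_t dst_nblocks, uint64_t dst_off)
      {
              /* Cannot clone into a multi-block destination with a
               * smaller block size: it can never grow afterwards. */
              if (dst_nblocks > 1 && dst_blksz < src_blksz)
                      return (false);
              /* A nonzero offset implies multiple blocks, and only
               * power-of-2 block sizes can tile that way. */
              if (dst_off != 0 && !ISP2(dst_blksz))
                      return (false);
              return (true);
      }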
* Selectable block allocators (ednadolski-ix, 2023-09-01; 3 files, -31/+87)

  ZFS historically has had several space allocators that were
  dynamically selectable. While these have been retained in OpenZFS,
  only a single allocator has been statically compiled in. This patch
  compiles all allocators for OpenZFS and provides a module parameter
  to allow for manual selection between them.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Edmund Nadolski <[email protected]>
  Closes #15218
* ZIL: Change ZIOs issue order. (Alexander Motin, 2023-09-01; 1 file, -2/+2)

  In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio,
  we have no right to access lwb->lwb_child_zio. If it was not there,
  the first two ZIOs may have already completed and freed the lwb.
  Issuing the ZIOs in the opposite order, from children to parent,
  should keep the lwb valid till the end, since the lwb can be freed
  only after the lwb_root_zio completion callback.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15233
* ZIL: Revert zl_lock scope reduction. (Alexander Motin, 2023-09-01; 1 file, -19/+11)

  While I have no reports of it, I suspect a possible use-after-free
  scenario when zil_commit_waiter() tries to dereference zcw_lwb for
  an lwb already freed by zil_sync(), while zcw_done is not set.
  Extending the zl_lock scope back to what it originally was should
  block zil_sync() from freeing the lwb, closing this race. This
  reverts #14959 and a couple of chunks of #14841.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15228
* ZIL: Tune some assertions. (Alexander Motin, 2023-09-01; 1 file, -6/+9)

  In zil_free_lwb() we should assert lwb_state first, or the rest of
  the assertions can be misleading if it is false. Add lwb_state
  assertions in zil_lwb_add_block() to make sure we are not trying to
  add elements to lwb_vdev_tree after it was processed.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15227
* dmu_buf_will_clone: change assertion to fix 32-bit compiler warning (Dimitry Andric, 2023-08-31; 1 file, -1/+1)

  Building module/zfs/dbuf.c for 32-bit targets can result in a
  warning:

      In file included from /usr/src/sys/contrib/openzfs/include/sys/zfs_context.h:97,
                       from /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:32:
      /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c: In function 'dmu_buf_will_clone':
      /usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:116:33: error: cast from
      pointer to integer of different size [-Werror=pointer-to-int-cast]
        116 |         const uint64_t __left = (uint64_t)(LEFT);              \
            |                                 ^
      /usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:148:25: note: in
      expansion of macro 'VERIFY0'
        148 | #define ASSERT0         VERIFY0
            |                         ^~~~~~~
      /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2704:9: note: in expansion of
      macro 'ASSERT0'
       2704 |         ASSERT0(dbuf_find_dirty_eq(db, tx->tx_txg));
            |         ^~~~~~~

  This is because dbuf_find_dirty_eq() returns a pointer, which if
  pointers are 32-bit results in a warning about the cast to uint64_t.
  Instead, use the ASSERT3P() macro, with == and NULL as second and
  third arguments, which should work regardless of the target's
  bitness.

  Reviewed-by: Kay Pedersen <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Dimitry Andric <[email protected]>
  Closes #15224
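  A plain-C illustration of the two styles (the pointer-comparison
  form is the one the commit adopts via ASSERT3P):

      #include <stddef.h>
      #include <stdint.h>

      /* What ASSERT0 effectively did: compare as a 64-bit integer.
       * On ILP32, (uint64_t)ptr without the uintptr_t hop below is
       * exactly the cast -Werror=pointer-to-int-cast rejects. */
      static int
      check_as_integer(void *p)
      {
              return ((uint64_t)(uintptr_t)p == 0);
      }

      /* What ASSERT3P(p, ==, NULL) does: compare as a pointer. No
       * integer cast, so the pointer width never matters. */
      static int
      check_as_pointer(void *p)
      {
              return (p == NULL);
      }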