path: root/module/zfs/zio.c
* ZIO: Add overflow checks for linear buffers (Alexander Motin, 2023-12-01, 1 file, -2/+55)
  Since we use a limited set of kmem caches, quite often we have unused
  memory after the end of the buffer. Put up to a 512-byte canary there
  when built with debug to detect buffer overflows at free time.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15553
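  The technique is simple enough to sketch in userland C. This is an
  illustrative stand-in, not the zio.c implementation; the constants and
  helper names are made up:
  ```c
  #include <assert.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define CANARY_BYTE 0xA5
  #define CANARY_MAX  512   /* cap, as in the commit message */

  /* paint the slack between the requested size and the cache size */
  static void *canary_alloc(size_t req, size_t cache_size) {
      uint8_t *buf = malloc(cache_size);
      size_t slack = cache_size - req;
      if (slack > CANARY_MAX)
          slack = CANARY_MAX;
      memset(buf + req, CANARY_BYTE, slack);
      return buf;
  }

  /* verify the canary at free time: any overflow trips the assert */
  static void canary_free(void *p, size_t req, size_t cache_size) {
      uint8_t *buf = p;
      size_t slack = cache_size - req;
      if (slack > CANARY_MAX)
          slack = CANARY_MAX;
      for (size_t i = 0; i < slack; i++)
          assert(buf[req + i] == CANARY_BYTE);
      free(buf);
  }

  int main(void) {
      void *b = canary_alloc(5000, 8192); /* 5000-byte I/O, 8K cache */
      canary_free(b, 5000, 8192);         /* passes: tail untouched */
      return 0;
  }
  ```
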
* ZIO: Optimize zio_flush() (Alexander Motin, 2023-11-17, 1 file, -21/+15)
  - Generalize vdev_nowritecache handling by traversing the VDEV tree
    and skipping children ZIOs where not supported.
  - Remove the intermediate zio_null() in the case of several VDEV
    children.
  - Remove children handling from zio_ioctl(). There are no other use
    cases for this code besides DKIOCFLUSHWRITECACHED, and if there
    were, I doubt they would apply so straightforwardly to all VDEV
    children.
  Compared to the previously removed optimization, this should improve
  cases of redundant ZILs/SLOGs.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15515
* Improve ZFS objset sync parallelism (ednadolski-ix, 2023-11-06, 1 file, -14/+29)
  As part of transaction group commit, dsl_pool_sync() sequentially
  calls dsl_dataset_sync() for each dirty dataset, which subsequently
  calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of
  the CPU cores to run sync_dnodes_task() in taskq threads to sync the
  dirty dnodes (files).
  There are two problems:
  1. Each ZVOL in a pool is a separate dataset/objset having a single
     dnode. This means the objsets are synchronized serially, which
     leads to a bottleneck of ~330K blocks written per second per pool.
  2. In the case of multiple dirty dnodes/files on a dataset/objset on
     a big system, they will be sync'd in parallel taskq threads.
     However, it is inefficient to use 75% of the CPU cores of a big
     system to do that, because of (a) bottlenecks on a single write
     issue taskq, and (b) allocation throttling. In addition, if not
     for the allocation throttling sorting write requests by bookmarks
     (logical address), writes for different files may reach the space
     allocators interleaved, leading to unwanted fragmentation.
  The solution to both problems is to always sync no more and (if
  possible) no fewer dnodes at the same time than there are allocators
  in the pool.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Edmund Nadolski <[email protected]>
  Closes #15197
* Tune zio buffer caches and their alignments (Alexander Motin, 2023-10-30, 1 file, -50/+39)
  We should not always use PAGESIZE alignment for caches bigger than
  it and SPA_MINBLOCKSIZE otherwise. With that policy, caches for 5,
  6, 7, 10 and 14KB were rounded up to 8, 12 and 16KB respectively,
  which makes no sense. Instead specify as alignment the biggest
  power-of-2 divisor. This way 2KB and 6KB caches are both aligned to
  2KB, while 4KB and 8KB are aligned to 4KB.
  Reduce the number of caches to half-power of 2 instead of
  quarter-power of 2. This removes caches difficult for underlying
  allocators to fit into page-granular slabs, such as: 2.5, 3.5, 5, 7,
  10KB, etc. Since these caches are mostly used for transient
  allocations like ZIOs and the small DBUF cache, it is not worth
  being too aggressive. Due to the above alignment issue some of those
  caches were not working properly anyway. The 6KB cache now finally
  has a chance to work right, placing 2 buffers into 3 pages, which
  makes sense.
  Remove the explicit alignment in the Linux user-space case. I don't
  think it should be needed any more with the above fixes.
  As a result, on FreeBSD instead of these numbers of pages per slab:
  ```
  vm.uma.zio_buf_comb_16384.keg.ppera: 4
  vm.uma.zio_buf_comb_14336.keg.ppera: 4
  vm.uma.zio_buf_comb_12288.keg.ppera: 3
  vm.uma.zio_buf_comb_10240.keg.ppera: 3
  vm.uma.zio_buf_comb_8192.keg.ppera: 2
  vm.uma.zio_buf_comb_7168.keg.ppera: 2
  vm.uma.zio_buf_comb_6144.keg.ppera: 2 <= Broken
  vm.uma.zio_buf_comb_5120.keg.ppera: 2
  vm.uma.zio_buf_comb_4096.keg.ppera: 1
  vm.uma.zio_buf_comb_3584.keg.ppera: 7 <= Hard to free
  vm.uma.zio_buf_comb_3072.keg.ppera: 3
  vm.uma.zio_buf_comb_2560.keg.ppera: 2
  vm.uma.zio_buf_comb_2048.keg.ppera: 1
  vm.uma.zio_buf_comb_1536.keg.ppera: 2
  vm.uma.zio_buf_comb_1024.keg.ppera: 1
  vm.uma.zio_buf_comb_512.keg.ppera: 1
  ```
  I am now getting these:
  ```
  vm.uma.zio_buf_comb_16384.keg.ppera: 4
  vm.uma.zio_buf_comb_12288.keg.ppera: 3
  vm.uma.zio_buf_comb_8192.keg.ppera: 2
  vm.uma.zio_buf_comb_6144.keg.ppera: 3 <= Fixed, 2 in 3 pages
  vm.uma.zio_buf_comb_4096.keg.ppera: 1
  vm.uma.zio_buf_comb_3072.keg.ppera: 3
  vm.uma.zio_buf_comb_2048.keg.ppera: 1
  vm.uma.zio_buf_comb_1536.keg.ppera: 2
  vm.uma.zio_buf_comb_1024.keg.ppera: 1
  vm.uma.zio_buf_comb_512.keg.ppera: 1
  ```
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15452
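  The "biggest power-of-2 divisor" rule reduces to a one-line bit
  trick. A minimal sketch (not the actual code; capping the result at
  PAGESIZE is my reading of the message above, matching "8KB aligned
  to 4KB"):
  ```c
  #include <stdio.h>
  #include <stddef.h>

  #define PAGESIZE 4096 /* assumed page size for this sketch */

  static size_t buf_align(size_t size) {
      size_t a = size & -size;  /* lowest set bit = biggest power-of-2 divisor */
      return (a < PAGESIZE ? a : PAGESIZE); /* no point aligning past a page */
  }

  int main(void) {
      size_t sizes[] = { 2048, 4096, 6144, 8192, 12288 };
      for (size_t i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
          printf("%6zu-byte cache -> %zu-byte alignment\n",
              sizes[i], buf_align(sizes[i]));
      /* 2048->2048, 4096->4096, 6144->2048, 8192->4096, 12288->4096 */
      return 0;
  }
  ```
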
* ZIO: Remove READY pipeline stage from root ZIOs (Alexander Motin, 2023-10-25, 1 file, -9/+42)
  zio_root() has no arguments for a ready callback or a parent ZIO.
  Except for one recent case in the ZIL code, if root ZIOs ever have a
  parent it is also a root ZIO. This means we do not need the READY
  pipeline stage for them, which takes some time to process, but even
  more time to wait for the children and be woken by them, all for no
  good reason. The most visible effect of this change is that it
  avoids one taskq wakeup per ZIL block written, previously used to
  run zio_ready() for lwb_root_zio and skipped now.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15398
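  Conceptually, dropping a stage means clearing its bit in the zio's
  pipeline mask. A toy sketch with made-up stage names (the real
  ZIO_STAGE_* values live in zio_impl.h):
  ```c
  #include <stdio.h>
  #include <stdint.h>

  enum {
      STAGE_OPEN  = 1u << 0,
      STAGE_READY = 1u << 1,
      STAGE_DONE  = 1u << 2,
  };

  int main(void) {
      uint32_t pipeline = STAGE_OPEN | STAGE_READY | STAGE_DONE;

      /* a root zio has no ready callback and no parent waiting on
       * READY, so the stage (and its taskq wakeup) can be skipped */
      pipeline &= ~STAGE_READY;

      printf("pipeline mask: 0x%x\n", pipeline); /* 0x5 */
      return 0;
  }
  ```
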
* Update outdated assertion from zio_write_compress (Serapheim Dimitropoulos, 2023-08-25, 1 file, -2/+3)
  As part of some internal gang block testing within Delphix we hit
  the assertion removed by this patch. The assertion was triggered by
  a ZIO that had two copies and was a gang block, making the following
  expression equal to 3:
  ```
  MIN(zp->zp_copies + BP_IS_GANG(bp), spa_max_replication(spa))
  ```
  and failing when we expected the above to be equal to
  `BP_GET_NDVAS(bp)`.
  The assertion is no longer valid since the following commit:
  ```
  commit 14872aaa4f909d72c6b5e4105dadcfa13c7d9d66
  Author: Matthew Ahrens <[email protected]>
  Date:   Mon Feb 6 09:37:06 2023 -0800

      EIO caused by encryption + recursive gang
  ```
  The above commit changed gang block headers so they can't have more
  than 2 copies, but the assertion in question from this PR was never
  updated.
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Serapheim Dimitropoulos <[email protected]>
  Closes #15180
* ZIL: Second attempt to reduce scope of zl_issuer_lock. (Alexander Motin, 2023-08-24, 1 file, -2/+2)
  The previous patch #14841 appeared to have a significant flaw,
  causing deadlocks if the zl_get_data callback got blocked waiting
  for TXG sync. I already handled some such cases in the original
  patch, but issue #14982 showed cases that were impossible to solve
  in that design.
  This patch fixes the problem by postponing log block allocation till
  the very end, just before the zios are issued, leaving nothing
  blocking after that point to cause deadlocks. Before that point,
  though, any sleeps are now allowed, without causing sync thread
  blockage. This requires a slightly more complicated lwb state
  machine to allocate blocks and issue zios in the proper order. But
  with the removal of the special early issue workarounds the new code
  is much cleaner now, and should even be more efficient.
  Since this patch uses null zios between writes, I've found that null
  zios do not wait for logical children ready status in zio_ready(),
  which makes the parent write proceed prematurely, producing
  incorrect log blocks. Adding ZIO_CHILD_LOGICAL_BIT to
  zio_wait_for_children() fixes it.
  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15122
* Remove fastwrite mechanism. (Alexander Motin, 2023-07-28, 1 file, -13/+1)
  Fastwrite was introduced many years ago to improve the spread of ZIL
  writes between multiple top-level vdevs by tracking the number of
  allocated but not yet written blocks and choosing the vdev with the
  smaller count. It was supposed to reduce ZIL knowledge about
  allocation, but it actually made the ZIL report allocations to the
  allocation code even more actively, complicating both the ZIL and
  metaslab code. On top of that, it seems the ZIO_FLAG_FASTWRITE
  setting in dmu_sync() was lost many years ago, and that was one of
  the declared benefits. Plus the introduction of the embedded log
  metaslab class solved another problem, with the allocation rotor
  accounting both normal and log allocations, since in most cases
  those are now in different metaslab classes. After all that, I'd
  prefer to simplify the already too complicated ZIL, ZIO and metaslab
  code if the benefit of the complexity is not obvious.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15107
* spa_min_alloc should be GCD, not min (Ameer Hamza, 2023-07-20, 1 file, -5/+17)
  Since spa_min_alloc may not be a power of 2, unlike ashifts, in the
  case of DRAID we should not select the minimal value among several
  vdevs: rounding to a multiple of it is unlikely to work for the
  other vdevs. Instead, using the greatest common divisor produces
  smaller yet more reasonable results.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #15067
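  The difference between MIN() and GCD is easy to demonstrate. A small
  sketch with hypothetical per-vdev minimum allocation sizes (not the
  actual spa code):
  ```c
  #include <stdio.h>
  #include <stdint.h>

  static uint64_t gcd(uint64_t a, uint64_t b) {
      while (b != 0) {
          uint64_t t = b;
          b = a % b;
          a = t;
      }
      return a;
  }

  int main(void) {
      /* e.g. a dRAID vdev needing 12K and a mirror needing 8K */
      uint64_t vdev_min_alloc[] = { 12288, 8192 };
      int n = sizeof (vdev_min_alloc) / sizeof (vdev_min_alloc[0]);
      uint64_t spa_min_alloc = vdev_min_alloc[0];

      for (int i = 1; i < n; i++)
          spa_min_alloc = gcd(spa_min_alloc, vdev_min_alloc[i]);

      /* GCD gives 4096: every vdev's minimum is a multiple of it.
       * MIN() would pick 8192, but 12288 is not a multiple of 8192. */
      printf("spa_min_alloc = %llu\n",
          (unsigned long long)spa_min_alloc);
      return 0;
  }
  ```
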
* Some ZIO micro-optimizations. (Alexander Motin, 2023-06-30, 1 file, -9/+43)
  - Pack struct zio_prop by 4 bytes, from 84 to 80.
  - Skip new child ZIO locking while linking to the parent. The newly
    allocated ZIO is not externally visible yet, so nobody should care.
  - Skip io_bp_copy writes when not used (write && non-debug).
  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #14985
* Remove ARC/ZIO physdone callbacks. (Alexander Motin, 2023-06-15, 1 file, -26/+6)
  Those callbacks were introduced many years ago as part of a bigger
  patch to smooth the write throttling within a txg. They allow
  accounting for the completion of individual physical writes within a
  logical one, improving cases when some physical writes complete much
  sooner than others, gradually opening the write throttle.
  A few years after that, ZFS got allocation throttling, working at
  the level of logical writes and limiting the number of writes queued
  to vdevs at any point, and so limiting the latency distribution
  between the physical writes and especially writes of multiple
  copies. The addition of the scheduling deadline I proposed in #14925
  should further reduce the latency distribution. Grown memory sizes
  over the past 10 years should also reduce the importance of the
  smoothing.
  While the use of the physdone callback may still in theory provide
  some smoother throttling, there are cases where we simply cannot
  afford it. Since dirty data accounting is protected by a pool-wide
  lock, in the case of a 6-wide RAIDZ, for example, it requires us to
  take it 8 times per logical block write, creating huge lock
  contention. My tests of this patch show a radical reduction of lock
  spinning time on workloads where smaller blocks are written to RAIDZ
  pools, when each of the disks receives 8-16KB chunks but the total
  rate reaches 100K+ blocks per second. At the same time, attempts to
  measure any write time fluctuations didn't show anything noticeable.
  While there, also remove the io_child_count/io_parent_count
  counters. They are used only for a couple of assertions that can be
  avoided.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #14948
* Finally drop long disabled vdev cache. (Alexander Motin, 2023-06-09, 1 file, -14/+1)
  It was a vdev-level read cache, designed to aggregate many small
  reads by speculatively issuing bigger reads instead and caching the
  result. But since it has almost no idea about what is going on, with
  the exception of the ZIO_FLAG_DONT_CACHE flag set by higher layers,
  it was found to do more harm than good, for which reason it has been
  disabled for the past 12 years. These days we have much better
  instruments to enlarge the I/Os, such as speculative and prescient
  prefetches, the I/O scheduler, I/O aggregation, etc.
  Besides the dead code removal, this removes one extra mutex
  lock/unlock per write inside vdev_cache_write(), which was not
  otherwise disabled and still tried to do some work.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #14953
* Remove single parent assertion from zio_nowait(). (Alexander Motin, 2023-05-09, 1 file, -1/+1)
  We only need to know whether the ZIO has any parent there. We do not
  care if it has more than one, but the use of zio_unique_parent() ==
  NULL asserts that.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #14823
* Verify block pointers before writing them out (Matthew Ahrens, 2023-05-08, 1 file, -26/+66)
  If a block pointer is corrupted (but the block containing it
  checksums correctly, e.g. due to a bug that overwrites random
  memory), we can often detect it before the block is read, with the
  `zfs_blkptr_verify()` function, which is used in `arc_read()`,
  `zio_free()`, etc.
  However, such corruption is not typically recoverable. To recover
  from it we would need to detect the memory error before the block
  pointer is written to disk.
  This PR verifies BP's that are contained in indirect blocks and
  dnodes before they are written to disk, in `dbuf_write_ready()`.
  This way, we'll get a panic before the on-disk data is corrupted.
  This will help us to diagnose what's causing the corruption, as well
  as being much easier to recover from.
  To minimize performance impact, only checks that can be done without
  holding the spa_config_lock are performed.
  Additionally, when corruption is detected, the raw words of the
  block pointer are logged. (Note that `dprintf_bp()` is a no-op by
  default, but if enabled it is not safe to use with invalid block
  pointers.)
  Reviewed-by: Rich Ercolani <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Paul Zuchowski <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #14817
* Fixes in persistent error log (George Amanakis, 2023-03-28, 1 file, -2/+4)
  Address the following bugs in the persistent error log:
  1) Check nested clones, e.g. "fs->snap->clone->snap2->clone2".
  2) When deleting files containing error blocks in those clones (from
     "clone" in the example above), do not break the check chain.
  3) When deleting files in the originating fs before syncing the
     errlog to disk, do not break the check chain. This happens
     because at the time of introducing the error block in the error
     list, we do not have its birth txg and the head filesystem. If
     the original file is deleted before the error list is synced to
     the error log (which is when we actually look up the birth txg
     and the head filesystem), then we no longer have access to this
     info and break the check chain.
  The most prominent change is related to achieving (3). We expand the
  spa_error_entry_t structure to accommodate the newly introduced
  zbookmark_err_phys_t structure (containing the birth txg of the
  error block). Due to compatibility reasons we cannot remove the
  zbookmark_phys_t structure, and we also need to place the new
  structure after se_avl, so it is not accounted for in avl_find().
  Then we modify spa_log_error() to also provide the birth txg of the
  error block. With these changes in place we simplify the previously
  introduced function get_head_and_birth_txg() (now named
  get_head_ds()).
  We chose not to follow the same approach for the head filesystem
  (thus completely removing get_head_ds()) to avoid introducing new
  lock contentions.
  The stack sizes of nested functions (as measured by checkstack.pl in
  the linux kernel) are:
  ```
  check_filesystem [zfs]: 272 (was 912)
  check_clones [zfs]: 64
  ```
  We also introduced two new tests covering the above changes.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14633
* Implementation of block cloning for ZFS (Pawel Jakub Dawidek, 2023-03-10, 1 file, -7/+48)
  Block Cloning allows one to manually clone a file (or a subset of
  its blocks) into another (or the same) file by just creating
  additional references to the data blocks, without copying the data
  itself. Those references are kept in the Block Reference Tables
  (BRTs). The whole design of block cloning is documented in
  module/zfs/brt.c.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Christian Schwarz <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Rich Ercolani <[email protected]>
  Signed-off-by: Pawel Jakub Dawidek <[email protected]>
  Closes #13392
* Skip memory allocation when compressing holes (Richard Yao, 2023-02-27, 1 file, -4/+7)
  Hole detection in the zio compression code allows us to
  opportunistically skip compression on holes. We can go a step
  further by not doing memory allocations on holes either.
  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Sponsored-by: Wasabi Technology, Inc.
  Closes #14500
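  A userland sketch of the idea: test the source buffer for all-zeros
  before allocating the compression destination. The self-shifted
  memcmp is one common way to do the all-zero test; the real code
  operates on ABDs rather than flat buffers:
  ```c
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* true if every byte of buf is zero */
  static int buf_is_hole(const char *buf, size_t len) {
      return (len > 0 && buf[0] == 0 &&
          memcmp(buf, buf + 1, len - 1) == 0);
  }

  int main(void) {
      char data[4096] = { 0 };

      if (buf_is_hole(data, sizeof (data))) {
          printf("hole: skip compression and the allocation\n");
      } else {
          char *dst = malloc(sizeof (data)); /* allocated only if needed */
          /* ... compress into dst ... */
          free(dst);
      }
      return 0;
  }
  ```
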
* Fix NULL pointer dereference in zio_ready() (Richard Yao, 2023-02-23, 1 file, -1/+1)
  Clang's static analyzer correctly identified a NULL pointer
  dereference in zio_ready() when ZIO_FLAG_NODATA has been set on a
  zio that is missing a block pointer. The NULL pointer dereference
  occurs because we have logic intended to disable ZIO_FLAG_NODATA
  when it has been set on a gang block.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #14469
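  The shape of the fix, reduced to a standalone sketch (the struct and
  flag here are simplified stand-ins, not the real zio_ready()
  source): tolerate a missing block pointer before dereferencing it.
  ```c
  #include <stdbool.h>
  #include <stdio.h>

  struct blkptr { bool is_gang; };
  struct zio { struct blkptr *io_bp; unsigned io_flags; };

  #define FLAG_NODATA 0x1u

  static void zio_ready_check(struct zio *zio) {
      /* before: checking the gang bit crashed when io_bp was NULL */
      if ((zio->io_flags & FLAG_NODATA) &&
          zio->io_bp != NULL && zio->io_bp->is_gang)
          zio->io_flags &= ~FLAG_NODATA; /* NODATA unsupported on gangs */
  }

  int main(void) {
      struct zio z = { .io_bp = NULL, .io_flags = FLAG_NODATA };
      zio_ready_check(&z); /* no longer dereferences a NULL io_bp */
      printf("flags: 0x%x\n", z.io_flags);
      return 0;
  }
  ```
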
* EIO caused by encryption + recursive gang (Matthew Ahrens, 2023-02-06, 1 file, -7/+9)
  Encrypted blocks can not have 3 DVAs, because they use the space of
  the 3rd DVA for the IV+salt. zio_write_gang_block() takes this into
  account, setting `gbh_copies` to no more than 2 in this case. Gang
  members' BPs do not have the X (encrypted) bit set (nor do they have
  the DMU level and type fields set), because encryption is not
  handled at this level. The gang block is reassembled, and then
  encryption (and compression) are handled.
  To check if this gang block is encrypted, the code in
  zio_write_gang_block() checks `pio->io_bp`. This is normally fine,
  because the block that's being ganged is typically the encrypted BP.
  The problem is that if there is "recursive ganging", where a gang
  member is itself a gang block, then when zio_write_gang_block() is
  called to create a gang block for a gang member, `pio->io_bp` is the
  gang member's BP, which doesn't have the X bit set, so the number of
  DVAs is not restricted to 2. It should instead be looking at the
  "gang leader", i.e. the top-level gang block, to determine how many
  DVAs can be used, to avoid an "NDVAs inversion" (where a child has
  more DVAs than its parent).
  Gang leader BP: X (encrypted) bit set, 2 DVAs, IV+salt in the 3rd
  DVA's space:
  ```
  DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=...
  [L0 ZFS plain file] fletcher4 uncompressed encrypted LE gang
  unique double size=100000L/100000P birth=... fill=1 cksum=...
  ```
  The leader's GBH contains a BP with the gang bit set and 3 DVAs:
  ```
  DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated]
  fletcher4 uncompressed unencrypted LE contiguous unique double
  size=55600L/55600P birth=... fill=0 cksum=...
  DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated]
  fletcher4 uncompressed unencrypted LE contiguous unique double
  size=55600L/55600P birth=... fill=0 cksum=...
  DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200>
  [L0 unallocated] fletcher4 uncompressed unencrypted LE gang
  unique double size=55400L/55400P birth=... fill=0 cksum=...
  ```
  On nondebug bits, having the 3rd DVA in the gang block works for the
  most part, because it's true that all 3 DVAs are available in the
  gang member BP (in the GBH). However, for accounting purposes, gang
  block DVAs' ASIZE includes all the space allocated below them, i.e.
  the 512-byte gang block header (GBH) as well as the gang members
  below that. We see that above, where the gang leader BP is 1MB
  logical (and after compression: 0x`100000P`), but the ASIZE of each
  DVA is 2 sectors (1KB) more than 1MB (0x`100400`).
  Since there are 3 copies of a block below it, we increment the ASIZE
  of the 3rd DVA of the gang leader by the space used by the 3rd DVA
  of the child (1 sector, in this case). But there isn't really a 3rd
  DVA of the parent; the salt is stored in place of the 3rd DVA's
  ASIZE. So when zio_write_gang_member_ready() increments the parent's
  BP's `DVA[2]`'s ASIZE, it's actually incrementing the parent's salt.
  When we later try to read the encrypted recursively-ganged block,
  the salt doesn't match what we used to write it, so MAC verification
  fails and we get an EIO.
  ```
  zio_encrypt():  encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89
  zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89
  ```
  This commit addresses the problem by not increasing the number of
  copies of the GBH beyond 2 (even for non-encrypted blocks). This
  simplifies the logic while maintaining the ability to traverse all
  metadata (including gang blocks) even if one copy is lost. (Note
  that 3 copies of the GBH will still be created if requested, e.g.
  for `copies=3` or MOS blocks.) Additionally, the code that
  increments the parent's DVA's ASIZE is made to check the parent
  DVA's NDVAS even on nondebug bits. So if there's a similar bug in
  the future, it will cause a panic when trying to write, rather than
  corrupting the parent BP and causing an error when reading.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Caused-by: #14356
  Closes #14440
  Closes #14413
* ztest fails assertion in zio_write_gang_member_ready() (Matthew Ahrens, 2023-01-09, 1 file, -1/+1)
  Encrypted blocks can have up to 2 DVAs, as the third DVA is reserved
  for the salt+IV. However, dmu_write_policy() allows non-encrypted
  blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and
  allocate 3 DVAs, since they don't need a salt+IV (they are merely
  authenticated). However, if such a block becomes a gang block, the
  gang code incorrectly limits the gang block header to 2 DVAs. This
  leads to an "NDVAs inversion", where a parent block (the gang block
  header) has fewer DVAs than its children (the gang members), causing
  an assertion failure in zio_write_gang_member_ready().
  This commit addresses the problem by only restricting the gang block
  header to 2 DVAs if the block is actually encrypted (and thus its
  gang block members can have at most 2 DVAs).
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #14250
  Closes #14356
* zio can deadlock during device removal (George Wilson, 2022-12-02, 1 file, -2/+5)
  When doing a device removal on a pool with gang blocks, the zio
  pipeline can deadlock when trying to free blocks from a device which
  is being removed, with a stack similar to this:
  ```
  0xffff8ab9a13a1740 UNINTERRUPTIBLE 4
      __schedule+0x2e5
      __schedule+0x2e5
      schedule+0x33
      schedule_preempt_disabled+0xe
      __mutex_lock.isra.12+0x2a7
      __mutex_lock.isra.12+0x2a7
      __mutex_lock_slowpath+0x13
      mutex_lock+0x2c
      free_from_removing_vdev+0x61
      metaslab_free_impl+0xd6
      metaslab_free_dva+0x5e
      metaslab_free+0x196
      zio_free_sync+0xe4
      zio_free_gang+0x38
      zio_gang_tree_issue+0x42
      zio_gang_tree_issue+0xa2
      zio_gang_issue+0x6d
      zio_execute+0x94
      zio_execute+0x94
      taskq_thread+0x23b
      kthread+0x120
      ret_from_fork+0x1f
  ```
  Since there are gang blocks we have to read the gang members as part
  of the free. This can be seen with a zio dependency tree that looks
  like this:
  ```
  sdb> echo 0xffff900c24f8a700 | zio -rc | zio
  ADDRESS              TYPE  STAGE            WAITER
  0xffff900c24f8a700   NULL  CHECKSUM_VERIFY  0xffff900ddfd31740
  0xffff900c24f8c920   FREE  GANG_ASSEMBLE    -
  0xffff900d93d435a0   READ  DONE
  ```
  In the illustration above we are processing frees, but because of
  the gang block we have to read the constituent blocks. Once we
  finish the READ in the zio pipeline we will execute the parent. In
  this case the parent is a FREE, but the zio taskq is a READ and we
  continue to process the pipeline, leading to the stack above. In the
  stack above, we are blocked waiting for the svr_lock, so as a result
  a READ interrupt taskq thread is now consumed. Eventually, all of
  the READ taskq threads end up blocked and we're unable to complete
  any read requests.
  In zio_notify_parent() there is an optimization to continue to use
  the taskq thread to execute the parent's pipeline. To resolve the
  deadlock above, we only allow this optimization if the parent's zio
  type matches the child which just completed.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  External-issue: DLPX-80130
  Closes #14236
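  The resolution reduces to one comparison in the notify path. A toy
  model of the dispatch decision (names simplified, not the real
  zio_notify_parent() code):
  ```c
  #include <stdio.h>

  enum zio_type { ZIO_TYPE_READ, ZIO_TYPE_FREE };

  static const char *names[] = { "READ", "FREE" };

  static void notify_parent(enum zio_type parent, enum zio_type child) {
      if (parent == child)
          /* safe: the parent runs on the same kind of taskq thread */
          printf("run %s parent inline on this %s thread\n",
              names[parent], names[child]);
      else
          /* otherwise hand the parent to its own taskq */
          printf("dispatch %s parent to its taskq (child was %s)\n",
              names[parent], names[child]);
  }

  int main(void) {
      /* the deadlock case: a FREE parent completing on a READ thread */
      notify_parent(ZIO_TYPE_FREE, ZIO_TYPE_READ);
      notify_parent(ZIO_TYPE_READ, ZIO_TYPE_READ);
      return 0;
  }
  ```
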
* nopwrites on dmu_sync-ed blocks can result in a panic (George Wilson, 2022-12-02, 1 file, -8/+10)
  After a device has been removed, any nopwrites for blocks on that
  indirect vdev should be ignored and a new block should be allocated.
  The original code attempted to handle this but used the wrong block
  pointer when checking for indirect vdevs and failed to check all
  DVAs. This change corrects both of these issues and modifies the
  test case to ensure that it properly tests nopwrites with device
  removal.
  Reviewed-by: Prakash Surya <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  Closes #14235
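  The corrected check has to walk every DVA of the old block pointer.
  A simplified sketch of that loop (the structures and the vdev lookup
  are faked; not the actual zio_nop_write() code):
  ```c
  #include <stdbool.h>
  #include <stdio.h>

  #define SPA_DVAS 3

  struct dva { int vdev; };
  struct blkptr { struct dva dvas[SPA_DVAS]; int ndvas; };

  static bool vdev_is_indirect(int vdev) {
      return (vdev == 7); /* pretend vdev 7 was removed */
  }

  /* nopwrite must be abandoned if any DVA lands on an indirect vdev */
  static bool nopwrite_ok(const struct blkptr *old_bp) {
      for (int d = 0; d < old_bp->ndvas; d++) {
          if (vdev_is_indirect(old_bp->dvas[d].vdev))
              return (false);
      }
      return (true);
  }

  int main(void) {
      struct blkptr old = { .dvas = { {1}, {7} }, .ndvas = 2 };
      printf("nopwrite allowed: %d\n", nopwrite_ok(&old)); /* 0 */
      return 0;
  }
  ```
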
* Bump checksum error counter before reporting to ZED (Rob Wing, 2022-12-02, 1 file, -3/+3)
  The checksum error counter is incremented after reporting to ZED.
  This leads to ZED receiving a checksum error report with 0 checksum
  errors. To avoid this, bump the checksum error counter before
  reporting to ZED.
  Sponsored-by: Seagate Technology LLC
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Wing <[email protected]>
  Closes #14190
* Convert enum zio_flag to uint64_t (Richard Yao, 2022-10-27, 1 file, -19/+20)
  We ran out of space in enum zio_flag for additional flags. Rather
  than introduce an enum zio_flag2 and then modify a bunch of
  functions to take a second flags variable, we expand the type to 64
  bits via `typedef uint64_t zio_flag_t`.
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Co-authored-by: Richard Yao <[email protected]>
  Closes #14086
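  The gist of the change: a C enum is backed by int, so bit 31 is the
  last usable flag bit; widening to a 64-bit typedef makes room. A
  minimal sketch (the flag names here are placeholders, not the real
  values from zio.h):
  ```c
  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t zio_flag_t;

  #define ZIO_FLAG_EXAMPLE_OLD ((zio_flag_t)1 << 30) /* fits in an enum */
  #define ZIO_FLAG_EXAMPLE_NEW ((zio_flag_t)1 << 40) /* would not have */

  int main(void) {
      zio_flag_t flags = ZIO_FLAG_EXAMPLE_OLD | ZIO_FLAG_EXAMPLE_NEW;
      printf("flags = 0x%llx\n", (unsigned long long)flags);
      return 0;
  }
  ```
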
* Fix declarations of non-global variables (Tino Reichardt, 2022-10-18, 1 file, -1/+1)
  This patch adds the `static` keyword to non-global variables, which
  were found by the analysis tool smatch.
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #13970
* Avoid unnecessary metaslab_check_free calling (Finix1979, 2022-10-04, 1 file, -1/+1)
  The metaslab_check_free() function only needs to be called in the
  GANG|DEDUP|etc case, because zio_free_sync() will internally call
  metaslab_check_free().
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Signed-off-by: Finix1979 <[email protected]>
  Closes #13977
* zed: mark disks as REMOVED when they are removed (Ameer Hamza, 2022-09-28, 1 file, -1/+1)
  ZED does not take any action for disk removal events if there is no
  spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and
  vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on a
  removal event. This means that if you are running zed and remove a
  disk, it will be properly marked as REMOVED.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #13797
* Cleanup: Specify unsignedness on things that should not be signed (Richard Yao, 2022-09-27, 1 file, -8/+14)
  In #13871, zfs_vdev_aggregation_limit_non_rotating and
  zfs_vdev_aggregation_limit being signed was pointed out as a
  possible reason not to eliminate an unnecessary MAX(unsigned, 0),
  since the unsigned value was assigned from them.
  There is no reason for these module parameters to be signed, and
  upon inspection it was found that there are a number of other module
  parameters that are signed but should not be, so we make them
  unsigned. Making them unsigned made it clear that some other
  variables in the code should also be unsigned, so we also make those
  unsigned. This prevents users from setting negative values that
  could potentially cause bad behaviors. It also makes the code
  slightly easier to understand. Mostly module parameters that deal
  with timeouts, limits, bitshifts and percentages are made unsigned
  by this. Any that are boolean are left signed, since whether
  booleans should be considered signed or unsigned does not matter.
  Making zfs_arc_lotsfree_percent unsigned caused a
  `zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
  removed. Removing the check was also necessary to prevent a compiler
  error from -Werror=type-limits.
  Several end-of-line comments had to be moved to their own lines
  because replacing int with uint_t caused us to exceed the 80
  character limit enforced by cstyle.pl.
  The following were kept signed because they are passed to
  taskq_create(), which expects signed values, and modifying the
  OpenSolaris/Illumos DDI is out of scope for this patch:
  * metaslab_load_pct
  * zfs_sync_taskq_batch_pct
  * zfs_zil_clean_taskq_nthr_pct
  * zfs_zil_clean_taskq_minalloc
  * zfs_zil_clean_taskq_maxalloc
  * zfs_arc_prune_task_threads
  Also, negative values in those parameters were found to be harmless.
  The following were left signed because either negative values make
  sense, or more analysis was needed to determine whether negative
  values should be disallowed:
  * zfs_metaslab_switch_threshold
  * zfs_pd_bytes_max
  * zfs_livelist_min_percent_shared
  zfs_multihost_history was made static to be consistent with other
  parameters.
  A number of module parameters were marked as signed but in reality
  referenced unsigned variables; upgrade_errlog_limit is one of the
  numerous examples. In the case of zfs_vdev_async_read_max_active, it
  was already uint32_t, but zdb had an extern int declaration for it.
  Interestingly, the documentation in zfs.4 was right for
  upgrade_errlog_limit despite the module parameter being wrongly
  marked, while the documentation for zfs_vdev_async_read_max_active
  (and friends) was wrong. It was also wrong for zstd_abort_size,
  which was unsigned but was documented as signed. Also, the
  documentation in zfs.4 incorrectly described the following
  parameters as ulong when they were int:
  * zfs_arc_meta_adjust_restarts
  * zfs_override_estimate_recordsize
  They are now uint_t as of this patch and thus the man page has been
  updated to describe them as uint.
  dbuf_state_index was left alone since it does nothing and perhaps
  should be removed in another patch.
  If any module parameters were missed, they were not found by `grep
  -r 'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep
  missed, but only because they were in files that had hits.
  This patch intentionally did not attempt to address whether some of
  these module parameters should be elevated to 64-bit parameters,
  because the length of a long on 32-bit platforms is 32 bits.
  Lastly, it was pointed out during review that uint_t is a better
  match for these variables than uint32_t, because FreeBSD kernel
  parameter definitions are designed for uint_t, whose bit width can
  change in future memory models. As a result, we change the existing
  parameters that are uint32_t to use uint_t.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Neal Gompa <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #13875
* Implement a new type of zfs receive: corrective receive (-c) (Alek P, 2022-07-28, 1 file, -1/+1)
  This type of recv is used to heal corrupted data when a replica of
  the data already exists (in the form of a send file, for example).
  With the provided send stream, corrective receive will read from
  disk the blocks described by the WRITE records. When any of the
  reads come back with ECKSUM, we use the data from the corresponding
  WRITE record to rewrite the corrupted block.
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Paul Zuchowski <[email protected]>
  Signed-off-by: Alek Pinchuk <[email protected]>
  Closes #9372
* Fix scrub resume from newly created hole (Alexander Motin, 2022-07-20, 1 file, -1/+17)
  It may happen that the scan bookmark points to a block that was
  turned into a part of a big hole. In such a case dsl_scan_visitbp()
  may skip it, and dsl_scan_check_resume() will not be called for it.
  As a result, a new scan suspend won't be possible until the end of
  the object, which may take hours if the object is a multi-terabyte
  ZVOL on a slow HDD pool, stretching the TXG out for all that time
  and creating all sorts of problems. This patch changes the resume
  condition to any greater or equal block, so even if we miss the
  bookmarked block, the next one we find will delete the bookmark,
  allowing a new suspend.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #13643
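  The relaxed resume test in miniature. This sketch uses a simplified
  two-field bookmark; the real zbookmark_phys_t has more fields and
  the real comparison lives in dsl_scan.c:
  ```c
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct bm { uint64_t obj, blkid; };

  static bool bm_ge(const struct bm *a, const struct bm *b) {
      return (a->obj > b->obj ||
          (a->obj == b->obj && a->blkid >= b->blkid));
  }

  int main(void) {
      struct bm resume = { .obj = 5, .blkid = 100 };
      /* blkid 100 was swallowed by a hole; we next visit blkid 128 */
      struct bm visit = { .obj = 5, .blkid = 128 };

      /* old test: visit == resume never matches, suspend impossible;
       * new test: visit >= resume clears the bookmark here */
      if (bm_ge(&visit, &resume))
          printf("bookmark cleared, scan may suspend again\n");
      return 0;
  }
  ```
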
* Replace dead opensolaris.org license link (Tino Reichardt, 2022-07-11, 1 file, -1/+1)
  This commit replaces all occurrences of the link
  http://www.opensolaris.org/os/licensing
  with this one:
  https://opensource.org/licenses/CDDL-1.0
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #13619
* Remaining {=> const} char|void *tag (наб, 2022-06-29, 1 file, -1/+1)
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #13348
* Verify BPs in spa_load_verify_cb() and dsl_scan_visitbp() (Brian Behlendorf, 2022-05-20, 1 file, -4/+2)
  We want `zpool import` to be highly robust and never panic, even
  when encountering corrupt metadata. This is already handled in the
  arc_read() code path, which covers most cases, but
  spa_load_verify_cb() relies on zio_read() and is responsible for
  verifying the block pointer. During import it is also possible to
  encounter block pointers which contain ZIO_COMPRESS_INHERIT and
  ZIO_CHECKSUM_INHERIT values. Relax the verification function
  slightly to allow this. Furthermore, extend dsl_scan_recurse() to
  verify the block pointer contents of level zero blocks which are not
  of type DMU_OT_DNODE or DMU_OT_OBJSET. This is handled by arc_read()
  in the other cases.
  Reviewed-by: Paul Dagnelie <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #13124
  Closes #13360
* Default zfs_max_recordsize to 16M (Rich Ercolani, 2022-04-28, 1 file, -9/+0)
  Increase the default allowed maximum recordsize from 1M to 16M. As
  described in the zfs(4) man page, there are significant costs which
  need to be considered before using very large blocks. However, there
  are scenarios where they make good sense and it should no longer be
  necessary to artificially restrict their use behind a module option.
  Note that for 32-bit platforms we continue to leave this restriction
  in place due to the limited virtual address space available
  (256-512MB). On these systems only a handful of blocks could be
  cached at any one time, severely impacting performance and
  potentially stability.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Rich Ercolani <[email protected]>
  Closes #12830
  Closes #13302
* Remove bcopy(), bzero(), bcmp() (наб, 2022-03-15, 1 file, -7/+7)
  bcopy() has a confusing argument order and is actually a move, not a
  copy; they're all deprecated since POSIX.1-2001 and removed in
  -2008, and we shim them out to mem*() on Linux anyway.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12996
* Enable encrypted raw sending to pools with greater ashift (George Amanakis, 2022-02-16, 1 file, -1/+7)
  Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted
  with ashift=12 results in failure when mounting pool2/encrypted
  (Input/Output error). Notably, the opposite, raw sending from a
  greater ashift to a lower one, does not fail.
  This happens because zio_compress_write() falsely checks only
  ZIO_FLAG_RAW_COMPRESS and not ZIO_FLAG_RAW_ENCRYPT, which is also
  set in encrypted raw send streams. In this case it rounds up the
  psize, and if that is not equal to the zio->io_size it modifies the
  block by zeroing out the extra bytes. Because this happens in an SA
  attr. registration object (type=46), the decryption fails upon
  mounting the filesystem, and zpool status falsely reports an error.
  Fix this by checking both ZIO_FLAG_RAW_COMPRESS and
  ZIO_FLAG_RAW_ENCRYPT before deciding whether to zero-pad a block.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #13067
  Closes #13074
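  One way to read the fix as a predicate, sketched with placeholder
  flag values (the real values live in zio.h, and the real check is
  inline in zio_write_compress()): zero-padding the tail is only safe
  when a block is raw-compressed but not raw-encrypted, since touching
  encrypted bytes breaks the MAC.
  ```c
  #include <stdint.h>
  #include <stdio.h>

  #define ZIO_FLAG_RAW_COMPRESS ((uint64_t)1 << 0) /* placeholder bit */
  #define ZIO_FLAG_RAW_ENCRYPT  ((uint64_t)1 << 1) /* placeholder bit */

  static int may_zero_pad(uint64_t flags) {
      /* before the fix, only RAW_COMPRESS was consulted */
      return ((flags & ZIO_FLAG_RAW_COMPRESS) != 0 &&
          (flags & ZIO_FLAG_RAW_ENCRYPT) == 0);
  }

  int main(void) {
      printf("%d\n", may_zero_pad(ZIO_FLAG_RAW_COMPRESS)); /* 1 */
      printf("%d\n", may_zero_pad(ZIO_FLAG_RAW_COMPRESS |
          ZIO_FLAG_RAW_ENCRYPT));                          /* now 0 */
      return 0;
  }
  ```
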
* Clean up CSTYLEDs (наб, 2022-01-26, 1 file, -2/+0)
  69 CSTYLED BEGINs remain, appx. 30 of which could be removed if
  cstyle(1) had a useful policy regarding
  ```
  CALL(ARG1,
      ARG2,
      ARG3);
  ```
  above 2 lines. As it stands, it spits out *both*
  ```
  sysctl_os.c: 385: continuation line should be indented by 4 spaces
  sysctl_os.c: 385: indent by spaces instead of tabs
  ```
  which is very cool. Another >10 could be fixed by removing "ulong"
  &al. handling. I don't foresee anyone actually using it
  intentionally (does it even exist in modern headers? why did it in
  the first place?).
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12993
* module/*.ko: prune .data, global .rodata (наб, 2022-01-14, 1 file, -12/+12)
  Evaluated every variable that lives in .data (and globals in
  .rodata) in the kernel modules, and constified/eliminated/localised
  them appropriately. This means that all read-only data is now
  actually read-only data, and, if possible, at file scope. A lot of
  previously-global symbols became inlinable (and inlined!) constants.
  Probably not in a big Wowee Performance Moment, but hey.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12899
* module: zfs: fix unused, remove argsused (наб, 2021-12-23, 1 file, -3/+7)
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12844
* Vdev Properties Feature (Allan Jude, 2021-11-30, 1 file, -1/+1)
  Add properties, similar to pool properties, to each vdev. This makes
  use of the existing per-vdev ZAP that was added as part of device
  evacuation/removal. A large number of read-only properties are
  exposed, many of them members of struct vdev_t, that provide useful
  statistics. Adds support for the read-only "removing" vdev property.
  Adds the "allocating" property that defaults to "on" and can be set
  to "off" to prevent future allocations from that top-level vdev.
  Supports user-defined vdev properties. Includes support for
  properties.vdev in SYSFS.
  Co-authored-by: Allan Jude <[email protected]>
  Co-authored-by: Mark Maybee <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Closes #11711
* Verify embedded blkptr's in arc_read() (Brian Behlendorf, 2021-09-09, 1 file, -1/+1)
  The block pointer verification check in arc_read() should also cover
  embedded block pointers. While highly unlikely, accessing a damaged
  block pointer can result in panic. To further harden the code,
  extend the existing check to include embedded block pointers and add
  a comment explaining the rationale for this sanity check. Lastly,
  correct a flaw in zfs_blkptr_verify() so the error count is checked
  even when checking an untrusted config, to verify the
  non-pool-specific portions of a block pointer.
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Tony Nguyen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #12535
* Compressed receive with different ashift can result in incorrect PSIZE on disk (Paul Dagnelie, 2021-09-08, 1 file, -0/+12)
  We round up the psize to the nearest multiple of the asize or to the
  lsize, whichever is smaller. Once that's done, we allocate a new
  buffer of the appropriate size, zero the tail, and copy the data
  into it. This adds a small performance cost to these kinds of writes
  but fixes the bookkeeping problems.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Co-authored-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #12522
  Closes #8462
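  The rounding rule as arithmetic, with made-up sizes (a sketch of the
  described behavior, not the receive-path code itself):
  ```c
  #include <stdio.h>
  #include <stdint.h>

  /* the classic power-of-2 round-up, as used throughout ZFS */
  #define P2ROUNDUP(x, a) ((((x) - 1) | ((a) - 1)) + 1)
  #define MIN(a, b)       ((a) < (b) ? (a) : (b))

  int main(void) {
      uint64_t lsize = 131072;       /* 128K logical size */
      uint64_t psize = 37376;        /* compressed size from the stream */
      uint64_t granule = 4096;       /* 1 << ashift, for ashift=12 */

      uint64_t rounded = MIN(P2ROUNDUP(psize, granule), lsize);
      /* allocate `rounded` bytes, copy `psize` bytes, zero the tail */
      printf("psize %llu -> %llu on an ashift=12 vdev\n",
          (unsigned long long)psize, (unsigned long long)rounded);
      return 0; /* prints 37376 -> 40960 */
  }
  ```
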
* Optimize allocation throttling (Alexander Motin, 2021-07-21, 1 file, -17/+16)
  Remove mc_lock use from metaslab_class_throttle_*(). The math there
  is based on refcounts and so is atomic; the only race possible there
  is between zfs_refcount_count() and zfs_refcount_add(). But in most
  cases metaslab_class_throttle_reserve() is called with the allocator
  lock held, which covers the race. In cases where the lock is not
  held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we
  do not use zfs_refcount_count(). And even if we assume some other
  non-existing scenario, the worst that may happen from this race is
  that a few more I/Os get to allocation earlier, which is not a
  problem.
  Move the locks and data of different allocators into different cache
  lines to avoid false sharing. Group the spa_alloc_* arrays together
  into a single array of aligned struct spa_alloc spa_allocs. Align
  struct metaslab_class_allocator.
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12314
* A few fixes of callback typecasting (for the upcoming ClangCFI) (Alexander, 2021-07-20, 1 file, -9/+9)
  * zio: avoid callback typecasting
  * zil: avoid zil_itxg_clean() callback typecasting
  * zpl: decouple zpl_readpage() into two separate callbacks
  * nvpair: explicitly declare callbacks for xdr_array()
  * linux/zfs_nvops: don't use external iput() as a callback
  * zcp_synctask: don't use fnvlist_free() as a callback
  * zvol: don't use ops->zv_free() as a callback for taskq_dispatch()
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Alexander Lobakin <[email protected]>
  Closes #12260
* Annotated dprintf as printf-like (Rich Ercolani, 2021-06-22, 1 file, -15/+25)
  ZFS loves using %llu for uint64_t, but that requires a cast to not
  be noisy - which is even done in many, though not all, places. Also,
  a couple of places used %u for uint64_t; those were promoted to
  %llu.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rich Ercolani <[email protected]>
  Closes #12233
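  The annotation in question is the standard GCC/Clang format
  attribute. A minimal standalone example of the technique (the
  function name here is illustrative, not ZFS's dprintf):
  ```c
  #include <stdarg.h>
  #include <stdint.h>
  #include <stdio.h>

  /* tell the compiler argument 1 is a printf format consumed by the
   * varargs starting at argument 2, so -Wformat checks apply */
  __attribute__((format(printf, 1, 2)))
  static void my_dprintf(const char *fmt, ...) {
      va_list ap;
      va_start(ap, fmt);
      vprintf(fmt, ap);
      va_end(ap);
  }

  int main(void) {
      uint64_t x = 42;
      /* the cast keeps -Wformat quiet on every platform */
      my_dprintf("x = %llu\n", (unsigned long long)x);
      return 0;
  }
  ```
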
* Avoid deadlock when removing L2ARC devices under I/O (George Amanakis, 2021-06-16, 1 file, -3/+0)
  In case we have I/O and try to remove an L2ARC device, a deadlock
  might occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for
  SCL_VDEV to be dropped while holding the hash_lock. However,
  spa_l2cache_load() holds SCL_ALL and waits for the hash_lock in
  l2arc_evict(). Fix this by moving zfs_blkptr_verify() to the top of
  arc_read(), before the hash_lock is taken. Verify the block pointer
  and return a checksum error if damaged, rather than halting the
  system, by using BLK_VERIFY_LOG instead of BLK_VERIFY_HALT.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #12054
* Combine zio caches if possible (Mateusz Guzik, 2021-04-17, 1 file, -24/+50)
  This deduplicates 2 sets of caches which use the same allocation
  size. Memory savings fluctuate a lot; one sample result is FreeBSD
  running "make buildworld" saving ~180MB RAM in reduced page count
  associated with zio caches.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Mateusz Guzik <[email protected]>
  Closes #11877
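  The deduplication idea in miniature: when two logical cache slots
  resolve to the same allocation size, point both at one backing
  cache. This is an illustrative sketch; the real change merges the
  zio_buf and zio_data_buf kmem caches:
  ```c
  #include <stdio.h>
  #include <stddef.h>

  #define NSLOTS 4

  struct cache { size_t size; };

  int main(void) {
      size_t sizes[NSLOTS] = { 512, 1024, 1024, 4096 };
      struct cache storage[NSLOTS];
      struct cache *slot[NSLOTS];
      int created = 0;

      for (int i = 0; i < NSLOTS; i++) {
          slot[i] = NULL;
          for (int j = 0; j < i; j++) {
              if (sizes[j] == sizes[i]) { /* alias the earlier cache */
                  slot[i] = slot[j];
                  break;
              }
          }
          if (slot[i] == NULL) {          /* first user of this size */
              storage[created].size = sizes[i];
              slot[i] = &storage[created++];
          }
      }
      printf("%d caches back %d slots\n", created, NSLOTS); /* 3, 4 */
      return 0;
  }
  ```
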
* Fix crash in zio_done error reporting (Paul Zuchowski, 2021-04-16, 1 file, -2/+3)
  Fix a NULL pointer dereference when reporting a checksum error for a
  gang block in zio_done.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Zuchowski <[email protected]>
  Closes #11872
  Closes #11896
* Use a helper function to clarify gang block size (Matthew Ahrens, 2021-03-26, 1 file, -7/+11)
  For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for
  the gang DVA including its children BPs. The space allocated at each
  DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.
  This commit makes this relationship more clear by using a helper
  function, `vdev_gang_header_asize()`, for the space allocated at the
  gang block's vdev/offset.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11744
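  A standalone sketch of what such a helper computes: the gang header
  itself is a 512-byte physical block, rounded up to the vdev's
  allocation granularity. The vdev struct and the psize-to-asize
  conversion below are simplified assumptions, not the real vdev code:
  ```c
  #include <stdio.h>
  #include <stdint.h>

  #define SPA_GANGBLOCKSIZE 512
  #define P2ROUNDUP(x, a)   ((((x) - 1) | ((a) - 1)) + 1)

  struct vdev { uint64_t ashift; };

  /* stand-in for vdev_psize_to_asize() on a simple top-level vdev */
  static uint64_t vdev_psize_to_asize(const struct vdev *vd,
      uint64_t psize) {
      return (P2ROUNDUP(psize, (uint64_t)1 << vd->ashift));
  }

  static uint64_t vdev_gang_header_asize(const struct vdev *vd) {
      return (vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE));
  }

  int main(void) {
      struct vdev vd = { .ashift = 12 };
      printf("gang header asize: %llu\n",
          (unsigned long long)vdev_gang_header_asize(&vd)); /* 4096 */
      return 0;
  }
  ```
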
* Clean up RAIDZ/DRAID ereport code (Matthew Ahrens, 2021-03-19, 1 file, -2/+2)
  The RAIDZ and DRAID code is responsible for reporting checksum
  errors on their child vdevs. Checksum errors represent events where
  a disk returned data or parity that should have been correct, but
  was not. In other words, these are instances of silent data
  corruption. The checksum errors show up in the vdev stats (and thus
  `zpool status`'s CKSUM column), and in the event log (`zpool
  events`). Note, this is in contrast with the more common "noisy"
  errors where a disk goes offline, in which case ZFS knows that the
  disk is bad and doesn't try to read it, or the device returns an
  error on the requested read or write operation.
  RAIDZ/DRAID generate checksum errors via three code paths:
  1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors
     are reported on any children whose data was not used during the
     reconstruction. This is handled in `raidz_reconstruct()`. This is
     the most common type of RAIDZ/DRAID checksum error.
  2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
     means that the data has been lost. The zio fails and an error is
     returned to the consumer (e.g. the read(2) system call). This
     would happen if, for example, three different disks in a RAIDZ2
     group are silently damaged. Since the damage is silent, it isn't
     possible to know which three disks are damaged, so a checksum
     error is reported against every child that returned data or
     parity for this read. (For DRAID, typically only one "group" of
     children is involved in each io.) This case is handled in
     `vdev_raidz_cksum_finish()`. This is the next most common type of
     RAIDZ/DRAID checksum error.
  3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like
     in case 2), but there happen to be additional copies of this
     block due to "ditto blocks" (i.e. multiple DVA's in this
     blkptr_t), and one of those copies is good, then RAIDZ/DRAID
     compares each sector of the data or parity that it retrieved with
     the good data from the other DVA, and if they differ then it
     reports a checksum error on this child. This differs from case 2
     in that the checksum error is reported on only the subset of
     children that actually have bad data or parity. This case happens
     very rarely, since normally only metadata has ditto blocks. If
     the silent damage is extensive, there will be many instances of
     case 2, and the pool will likely be unrecoverable.
  The code for handling case 3 is considerably more complicated than
  the other cases, for two reasons:
  1. It needs to run after the main raidz read logic has completed.
     The data that RAIDZ read needs to be preserved until after the
     alternate DVA has been read, which necessitates refcounts and
     callbacks managed by the non-raidz-specific zio layer.
  2. It's nontrivial to map the sections of data read by RAIDZ to the
     correct data. For example, the correct data does not include the
     parity information, so the parity must be recalculated based on
     the correct data, and then compared to the parity that was read
     from the RAIDZ children.
  Due to the complexity of case 3, the rareness of hitting it, and the
  minimal benefit it provides above case 2, this commit removes the
  code for case 3. These types of errors will now be handled the same
  as case 2, i.e. the checksum error will be reported against all
  children that returned data or parity.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11735