Commit message | Author | Age | Files | Lines
* Linux 6.4 compat: META | Brian Behlendorf | 2023-07-24 | 1 | -1/+1

    Update the META file to reflect compatibility with the 6.4 kernel.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Rob Norris <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #15095
* shellcheck: disable "unreachable command" check [SC2317] | Rob N | 2023-07-21 | 1 | -1/+2

    This new check in 0.9.0 appears to have some issues with various forms of "early return", like trap, exit and return. This is tripping up (at least):

        cmd/zed/zed.d/history_event-zfs-list-cacher.sh
        /etc/zfs/zfs-functions

    It's not obvious what it's complaining about or what the remedy is, so it seems sensible to disable this check for now.

    See also:
        https://www.shellcheck.net/wiki/SC2317
        https://github.com/koalaman/shellcheck/issues/2542
        https://github.com/koalaman/shellcheck/issues/2613

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Rob Norris <[email protected]>
    Closes #15089
* metaslab: tuneable to better control force ganging | Rob N | 2023-07-21 | 2 | -3/+18

    metaslab_force_ganging isn't enough to actually force ganging, because it still only forces 3% of the time. This adds metaslab_force_ganging_pct so we can configure how often to force ganging.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    Closes #15088
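The percentage-based forcing described in this commit can be pictured with a minimal sketch. It is not the OpenZFS implementation: the helper name `should_force_gang`, its parameters, and the caller-supplied random `draw` are all assumptions made for illustration; only the idea of a size threshold plus a configurable percentage comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the OpenZFS code: decide whether an allocation
 * of psize bytes should be forced to gang. "draw" is a caller-supplied
 * random value, passed in so the decision is deterministic and testable.
 */
static bool
should_force_gang(uint64_t psize, uint64_t force_ganging,
    uint32_t force_ganging_pct, uint32_t draw)
{
    /* Clamp the percentage so values above 100 mean "always". */
    uint32_t pct = force_ganging_pct > 100 ? 100 : force_ganging_pct;

    /* Allocations below the size threshold are never forced. */
    if (psize < force_ganging)
        return (false);

    /* Force roughly pct out of every 100 eligible allocations. */
    return ((draw % 100) < pct);
}
```

With a percentage of 3 this mimics the old "3% of the time" behaviour; with 100 every eligible allocation is forced, which is what a test that wants deterministic ganging would need.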
* Adjust prefetch parameters. | Alexander Motin | 2023-07-21 | 4 | -12/+12

    - Reduce maximum prefetch distance for 32bit platforms to 8MB as it was previously. Those systems didn't grow much probably, so better stay conservative there.
    - Retire array_rd_sz tunable, blocking prefetch for large requests. We should not penalize applications trying to be more efficient. The speculative prefetcher by itself has reasonable distance limits, and 1MB is not much at all these days.

    Reviewed-by: Allan Jude <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15072
* Add explicit prefetches to bpobj_iterate(). | Alexander Motin | 2023-07-21 | 2 | -13/+38

    To simplify error handling bpobj_iterate_blkptrs() iterates through the list of block pointers backwards. Unfortunately the speculative prefetcher is currently unable to detect such patterns, which makes each block read there synchronous and very slow on HDD pools.

    According to my tests, the added explicit prefetch reduces the time needed to asynchronously delete 8 snapshots of 4 million blocks each from 20 seconds to less than one, which should free the sync thread for other useful work, such as async writes, scrub, etc.

    While there, plug one memory leak in case of bpobj_open() error and harmonize some variable names.

    Reviewed-by: Allan Jude <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15071
* Don't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events | Alan Somers | 2023-07-21 | 7 | -20/+2

    With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer.

    Fixes #14717 for mirrored pools.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Rich Ercolani <[email protected]>
    Signed-off-by: Alan Somers <[email protected]>
    Sponsored-by: Axcient
    Closes #14717
    Closes #15052
* Don't emit checksum histograms in ereport.fs.zfs.checksum events | Alan Somers | 2023-07-21 | 3 | -41/+5

    The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer.

    This fixes issue #14717 for RAIDZ pools, but not for mirror pools.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Rich Ercolani <[email protected]>
    Signed-off-by: Alan Somers <[email protected]>
    Sponsored-by: Axcient
    Closes #15052
* zed: Fix zed ASSERT on slot power cycle | Tony Hutter | 2023-07-21 | 1 | -0/+5

    We would see zed assert on one of our systems if we powered off a slot. Further examination showed zfs_retire_recv() was reporting a GUID of 0, which in turn would return a NULL nvlist. Add in a check for a zero GUID.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tony Hutter <[email protected]>
    Closes #15084
* Fix zpl_test_super race with zfs_umount | Chunwei Chen | 2023-07-20 | 2 | -15/+25

    We cannot call zpl_enter in zpl_test_super, because zpl_test_super is under spinlock so we can't sleep, and also because zpl_test_super is called without sb->s_umount taken, so it's possible we would race with zfs_umount and call zpl_enter on freed zfsvfs.

    Here's a stack trace when this happens:

    [ 2379.114837] VERIFY(cvp->cv_magic == CV_MAGIC) failed
    [ 2379.114845] PANIC at spl-condvar.c:497:__cv_broadcast()
    [ 2379.114854] Kernel panic - not syncing: VERIFY(cvp->cv_magic == CV_MAGIC) failed
    [ 2379.115012] Call Trace:
    [ 2379.115019] dump_stack+0x74/0x96
    [ 2379.115024] panic+0x114/0x2f6
    [ 2379.115035] spl_panic+0xcf/0xfc [spl]
    [ 2379.115477] __cv_broadcast+0x68/0xa0 [spl]
    [ 2379.115585] rrw_exit+0xb8/0x310 [zfs]
    [ 2379.115696] rrm_exit+0x4a/0x80 [zfs]
    [ 2379.115808] zpl_test_super+0xa9/0xd0 [zfs]
    [ 2379.115920] sget+0xd1/0x230
    [ 2379.116033] zpl_mount+0xdc/0x230 [zfs]
    [ 2379.116037] legacy_get_tree+0x28/0x50
    [ 2379.116039] vfs_get_tree+0x27/0xc0
    [ 2379.116045] path_mount+0x2fe/0xa70
    [ 2379.116048] do_mount+0x80/0xa0
    [ 2379.116050] __x64_sys_mount+0x8b/0xe0
    [ 2379.116052] do_syscall_64+0x35/0x50
    [ 2379.116054] entry_SYSCALL_64_after_hwframe+0x61/0xc6
    [ 2379.116057] RIP: 0033:0x7f9912e8b26a

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Chunwei Chen <[email protected]>
    Closes #15077
* spa_min_alloc should be GCD, not min | Ameer Hamza | 2023-07-20 | 4 | -9/+51

    Since spa_min_alloc may not be a power of 2, unlike ashifts, in the case of DRAID, we should not select the minimal value among several vdevs. Rounding to a multiple of it is unlikely to work for other vdevs. Instead, using the greatest common divisor produces smaller yet more reasonable results.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Ameer Hamza <[email protected]>
    Closes #15067
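The difference between taking the minimum and taking the greatest common divisor can be shown with a small sketch. The helper names `gcd_u64` and `combined_min_alloc` are hypothetical and are not taken from the OpenZFS sources; the sizes in the usage note are made-up illustrative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u64(0, x) returns x, which seeds the fold below. */
static uint64_t
gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return (a);
}

/*
 * Hypothetical helper: combine per-vdev minimum allocation sizes into one
 * pool-wide value that evenly divides every input.
 */
static uint64_t
combined_min_alloc(const uint64_t *sizes, size_t n)
{
    uint64_t g = 0;

    assert(sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        g = gcd_u64(g, sizes[i]);
    return (g);
}
```

For illustrative per-vdev sizes of 12288 and 8192 bytes, the minimum is 8192, of which 12288 is not a multiple. The GCD is 4096, which divides both, so rounding to a multiple of it suits either vdev.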
* Don't panic if setting vdev properties is unsupported for this vdev type | Yuri Pankov | 2023-07-20 | 1 | -17/+21

    Check that vdev has valid zap and bail out early. While here, move objid selection out of the loop, it's not going to change.

    Reviewed-by: Allan Jude <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Yuri Pankov <[email protected]>
    Closes #15063
* Ignore pool ashift property during vdev attachment | Ameer Hamza | 2023-07-20 | 5 | -65/+36

    Ashift can be set for a vdev only during its creation, and the top-level vdev does not change when a vdev is attached or replaced. The ashift property should not be used during attachment, as it does not allow attaching/replacing a vdev if the pool's ashift property is increased after the existing vdev was created. Instead, we should be able to attach the vdev if the attached vdev can satisfy the ashift requirement with its parent.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Ameer Hamza <[email protected]>
    Closes #15061
* Rollback before zfs root is mounted | Wojciech Małota-Wójcik | 2023-07-20 | 1 | -1/+1

    On my machines I observe random failures caused by rollback happening after zfs root is mounted. I've observed two types of failures:
    - zfs-rollback-bootfs.service fails saying that rollback must be done just before mounting the dataset
    - boot process fails and rescue console is entered.

    After making this modification and testing it for a couple of days, none of those problems have been observed anymore.

    I don't know if `dracut-mount.service` is still needed in the `After` directive. Maybe someone else is able to address this?

    Reviewed-by: Gregory Bartholomew <[email protected]>
    Signed-off-by: Wojciech Małota-Wójcik <[email protected]>
    Closes #15025
* Do not request data L1 buffers on scan prefetch. | Alexander Motin | 2023-07-20 | 1 | -3/+9

    Set ARC_FLAG_NO_BUF when prefetching data L1 buffers for scan. We do not prefetch data L0 buffers, so we do not need the L1 buffers, only want them to be ready in ARC. This saves some CPU time on buffer decompression.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15029
* Linux 6.5 compat: disk_check_media_change() was added | Coleman Kane | 2023-07-20 | 2 | -0/+31

    The disk_check_media_change() function was added which replaces bdev_check_media_change. This change was introduced in 6.5rc1 444aa2c58cb3b6cfe3b7cc7db6c294d73393a894 and the new function takes a gendisk* as its argument, no longer a block_device*. Thus, bdev->bd_disk is now used to pass the expected data.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Coleman Kane <[email protected]>
    Closes #15060
* set autotrim default to 'off' everywhere | Yuri Pankov | 2023-07-20 | 2 | -7/+1

    As it turns out, having autotrim default to 'on' on FreeBSD never really worked due to a mess with defines where userland and kernel module were getting different default values (userland was defaulting to 'off', module was thinking it's 'on').

    Reviewed-by: Tino Reichardt <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Yuri Pankov <[email protected]>
    Closes #15079
* Linux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT | Coleman Kane | 2023-07-14 | 2 | -0/+33

    This change was introduced in Linux commit 7ba150834b840f6f5cdd07ca69a4ccf39df59a66

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Coleman Kane <[email protected]>
    Closes #15059
* intptr_t definition is canonically signed | Coleman Kane | 2023-07-14 | 1 | -1/+1

    Make the version here match that elsewhere in the kernel and system headers.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Coleman Kane <[email protected]>
    Closes #15058
* Fix raw receive with different indirect block size. | Alexander Motin | 2023-07-14 | 2 | -25/+28

    Unlike regular receive, raw receive requires the destination to have the same block structure as the source. In case of dnode reclaim this triggers two special cases, requiring special handling:
    - If dn_nlevels == 1, we can change the ibs, but dnode_set_blksz() should not dirty the data buffer if block size does not change, or during receive dbuf_dirty_lightweight() will trigger assertion.
    - If dn_nlevels > 1, we just can't change the ibs, dnode_set_blksz() would fail and receive_object would trigger assertion, so we should destroy and recreate the dnode from scratch.

    Reviewed-by: Paul Dagnelie <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15039
* Fix the ZFS checksum error histograms with larger record sizes | Alan Somers | 2023-07-14 | 1 | -1/+1

    My analysis in PR #14716 was incorrect. Each histogram bucket contains the number of incorrect bits, by position in a 64-bit word, over the entire record. 8-bit buckets can overflow for record sizes above 2k. To forestall that, saturate each bucket at 255. That should still get the point across: either all bits are equally wrong, or just a couple are.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alan Somers <[email protected]>
    Sponsored-by: Axcient
    Closes #15049
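The saturating buckets described in this commit might look roughly like the sketch below. It is an illustration only: the function name `update_bit_histogram`, the `good`/`bad` buffer arguments, and the choice of bit 0 as the least significant bit are assumptions, not details taken from the ZFS sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HIST_BUCKETS 64

/*
 * Hypothetical helper: for each bit position in a 64-bit word, count how
 * many words of the record differ at that position. Each 8-bit bucket
 * saturates at UINT8_MAX (255) instead of wrapping around to 0.
 */
static void
update_bit_histogram(uint8_t hist[HIST_BUCKETS], const uint64_t *good,
    const uint64_t *bad, size_t nwords)
{
    assert(hist != NULL);
    for (size_t i = 0; i < nwords; i++) {
        /* XOR leaves a 1 in every bit position that is wrong. */
        uint64_t diff = good[i] ^ bad[i];

        for (unsigned bit = 0; bit < HIST_BUCKETS; bit++) {
            if (((diff >> bit) & 1) != 0 && hist[bit] < UINT8_MAX)
                hist[bit]++;
        }
    }
}
```

Without the `hist[bit] < UINT8_MAX` guard, a record with more than 255 words that are all wrong at one bit position would wrap that bucket, making a badly damaged position look nearly clean.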
* Avoid extra snprintf() in dsl_deadlist_merge(). | Alexander Motin | 2023-07-14 | 1 | -3/+3

    Since we are already iterating the ZAP, we have the exact string key to remove, so we do not need to call zap_remove_int() with the int key we just converted; we can call zap_remove() for the original string. This should make no functional change, only a micro-optimization.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15056
* Add missed DMU_PROJECTUSED_OBJECT prefetch. | Alexander Motin | 2023-07-13 | 1 | -0/+5

    It seems 9c5167d19f "Project Quota on ZFS" missed adding a prefetch for DMU_PROJECTUSED_OBJECT during scan (scrub/resilver). It should not cause visible problems, but may affect scrub/resilver performance.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15024
* FreeBSD: catch up to __FreeBSD_version 1400093 | Mateusz Guzik | 2023-07-13 | 1 | -0/+4

    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Mateusz Guzik <[email protected]>
    Closes #15036
* Update changelog for 2.2 | Umer Saleem | 2023-07-13 | 1 | -0/+6

    Add a new changelog entry for native packages to reflect version 2.2.99.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Ameer Hamza <[email protected]>
    Signed-off-by: Umer Saleem <[email protected]>
    Closes #15054
* FreeBSD: Fix build on stable/13 after 1302506. | Alexander Motin | 2023-07-13 | 1 | -1/+2

    Starting approximately from version 1302506, vn_lock_pair() grew two additional arguments, following head. There is a one-week hole, but that is the closest reference point we have.

    Reviewed-by: Mateusz Guzik <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15047
* Update META [zfs-2.2.99] | Brian Behlendorf | 2023-06-30 | 1 | -2/+2

    Increase the version to 2.2.99 to indicate the master branch is newer than the 2.2.x release. This ensures packages built from master branch are considered to be newer than the last release.

    Signed-off-by: Brian Behlendorf <[email protected]>
* Tag 2.2.0-rc1 [zfs-2.2.0-rc1] | Brian Behlendorf | 2023-06-30 | 1 | -2/+2

    New features:
    - Fully adaptive ARC eviction (#14359)
    - Block cloning (#13392)
    - Scrub error log (#12812, #12355)
    - Linux container support (#14070, #14097, #12263)
    - BLAKE3 Checksums (#12918)
    - Corrective "zfs receive" (#9372)

    Signed-off-by: Brian Behlendorf <[email protected]>
* Enable tuning of ZVOL open timeout value | Prakash Surya | 2023-06-30 | 1 | -1/+6

    The default timeout for ZVOL opens may not be sufficient for all cases, so we should enable the value to be more easily tuned to account for systems where the default value is insufficient.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Reviewed-by: Matthew Ahrens <[email protected]>
    Signed-off-by: Prakash Surya <[email protected]>
    Closes #15023
* Revert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE" | Brian Behlendorf | 2023-06-30 | 1 | -2/+2

    This reverts commit 77a3bb1f47e67c233eb1961b8746748c02bafde1.

    Signed-off-by: Brian Behlendorf <[email protected]>
* Pack our DDT ZAPs a bit denser. | Rich Ercolani | 2023-06-30 | 2 | -3/+20

    The DDT is really inefficient on 4k and up vdevs, because it always allocates 4k blocks, and while compression could save us somewhat at ashift 9, that stops being true.

    So let's change the default to 32 KiB, which seems like a reasonable compromise between improved space savings and inflated write sizes for DDT updates.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Rich Ercolani <[email protected]>
    Closes #14654
* ddt_addref: remove unnecessary phys fill when refcount is 0 | Rob N | 2023-06-30 | 1 | -4/+13

    The previous comment wondered if this case could happen; it turns out that it really can't. This block can only be entered if dde_type and dde_class are "real"; that only happens when a ddt entry has been previously synced to a ddt store, that is, it was created on a previous txg. Since it's gone through that sync, its dde_refcount must be >0.

    ddt_addref() is called from brt_pending_apply(), which is called at the beginning of spa_sync(), before pending DMU writes/frees are issued. Freeing a dedup block is the only thing that can decrement dde_refcount, so there's no way for it to drop to zero before applying the clone bumps it.

    Further, even if it _could_ go to zero, it wouldn't be necessary to fill the entry from the block. The phys content is not cleared until the free is issued, which happens when the refcount goes to zero, when the last real free comes through. The cloned block should be identical to what's in the phys already, so the fill should be a no-op anyway.

    I've replaced this with an assertion because this is all very dependent on the ordering in which BRT and DDT changes are applied, and that might change in the future.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-By: Klara, Inc.
    Closes #15004
* Again fix race between zil_commit() and zil_suspend(). | Alexander Motin | 2023-06-30 | 1 | -8/+28

    With the zl_suspend read in zil_commit() not protected by any locks, it is possible for new ZIL writes to be in progress while zil_destroy(), called by zil_suspend(), is freeing them. This patch closes the race by taking zl_issuer_lock in zil_suspend() and adding a second zl_suspend check to zil_get_commit_list(), protected by the lock. It allows all already queued transactions to be logged normally, while blocking any new ones, calling txg_wait_synced() for the TXGs.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14979
* Some ZIO micro-optimizations. | Alexander Motin | 2023-06-30 | 2 | -10/+45

    - Pack struct zio_prop by 4 bytes from 84 to 80.
    - Skip new child ZIO locking while linking to parent. The newly allocated ZIO is not externally visible yet, so nobody should care.
    - Skip io_bp_copy writes when not used (write && non-debug).

    Reviewed-by: Brian Atkinson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14985
* Do not report bytes skipped by scan as issued. | Alexander Motin | 2023-06-30 | 9 | -52/+84

    The scan process may skip blocks based on their birth time, DVA, etc. Traditionally those blocks were accounted as issued, which caused reporting of hugely over-inflated numbers, having nothing to do with actual disk I/O. This change utilizes a never-used field in struct dsl_scan_phys to account such skipped bytes, allowing us to report how much data were actually scrubbed/resilvered and what the actual I/O speed is. While formally it is an on-disk format change, it should be compatible both ways, so should not need a feature flag.

    This should partially address the same issue as c85ac731a0e, but from a different perspective, complementing it.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Akash B <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15007
* Don't use hard-coded 'size' value in snprintf() | Arshad Hussain | 2023-06-30 | 1 | -6/+8

    This patch changes the "size" passed to snprintf from a hard-coded (open-ended) value to sizeof(errbuf). This brings it in line with the rest of the code wherever 'errbuf' is used.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Arshad Hussain <[email protected]>
    Closes #15003
* Fix remount when setting multiple properties. | Alexander Motin | 2023-06-30 | 1 | -3/+4

    The previous code was checking zfs_is_namespace_prop() only for the last property on the list. If that one was not "namespace", then remount wasn't called. To fix that, move zfs_is_namespace_prop() inside the loop and remount if at least one of the properties was "namespace".

    Reviewed-by: Umer Saleem <[email protected]>
    Reviewed-by: Ameer Hamza <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #15000
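The shape of this fix, checking every property inside the loop rather than only the last one, can be sketched as follows. The code is a stand-in and not the libzfs implementation: `is_namespace_prop_demo`, its three-entry property list, and `needs_remount` are hypothetical names and data chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for zfs_is_namespace_prop(): the property names
 * listed here are an illustrative subset chosen for the example.
 */
static bool
is_namespace_prop_demo(const char *name)
{
    static const char *const ns_props[] = { "atime", "exec", "readonly" };

    assert(name != NULL);
    for (size_t i = 0; i < sizeof (ns_props) / sizeof (ns_props[0]); i++) {
        if (strcmp(name, ns_props[i]) == 0)
            return (true);
    }
    return (false);
}

/*
 * Check every property inside the loop and remember whether any of them
 * requires a remount, instead of testing only the last one afterwards.
 */
static bool
needs_remount(const char *const *props, size_t nprops)
{
    bool remount = false;

    for (size_t i = 0; i < nprops; i++) {
        if (is_namespace_prop_demo(props[i]))
            remount = true;
    }
    return (remount);
}
```

For the list `{"atime", "compression"}` a check on the last entry alone would skip the remount, while the accumulated flag requests it.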
* contrib: dracut: Conditionalize copying of libgcc_s.so.1 to glibc only | vimproved | 2023-06-29 | 1 | -1/+1

    The issue that this is designed to work around is only applicable to glibc, since it's caused by glibc's pthread_cancel() implementation using dlopen on libgcc_s.so.1 (and therefore not triggering dracut to include it in the initramfs). This commit adds an extra condition to the workaround that tests for glibc via "ldconfig -p | grep -qF 'libc.so.6'" (which should only be present on glibc systems).

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Violet Purcell <[email protected]>
    Closes #14992
* spa.h: use IN_BASE instead of IN_FREEBSD_BASE | Yuri Pankov | 2023-06-29 | 1 | -2/+2

    Consistently get the proper default value for autotrim. Currently, only the kernel module is built with IN_FREEBSD_BASE, and libzfs gets the wrong default value, leading to confusion and incorrect output when the autotrim value was not set explicitly.

    Reviewed-by: Warner Losh <[email protected]>
    Signed-off-by: Yuri Pankov <[email protected]>
    Closes #15016
* zdb: Add missing poolname to -C synopsis | Mateusz Piotrowski | 2023-06-29 | 2 | -2/+3

    Reviewed-by: Tino Reichardt <[email protected]>
    Reviewed-by: Rob Norris <[email protected]>
    Signed-off-by: Mateusz Piotrowski <[email protected]>
    Sponsored-by: Klara Inc.
    Closes #15014
* ZIL: Fix another use-after-free. | Alexander Motin | 2023-06-27 | 1 | -1/+1

    lwb->lwb_issued_txg can not be accessed after lwb_state is set to LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be freed by zil_sync(). We must save the txg number before that.

    This is similar to 55b1842f92, but as far as I can see the bug is not new. It existed for quite a while, but was not triggered due to a smaller race window.

    Reviewed-by: Allan Jude <[email protected]>
    Reviewed-by: Brian Atkinson <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14988
    Closes #14999
* Use big transactions for small recordsize writes. | Alexander Motin | 2023-06-27 | 1 | -60/+46

    When ZFS appends files in chunks bigger than recordsize, it borrows a buffer from ARC and fills it before opening a transaction. This is supposed to help in case of page faults to not hold the transaction open indefinitely. The problem appears when recordsize is set lower than the default 128KB. Since each block is committed in a separate transaction, per-transaction overhead becomes significant, and what is even worse, active use of per-dataset and per-pool locks to protect space use accounting for each transaction badly hurts the code's SMP scalability. The same transaction size limitation applies in case of file rewrite, but without even the excuse of buffer borrowing.

    To address the issue, disable the borrowing mechanism if recordsize is smaller than default and the write request is 4x bigger than it. In such case writes up to 32MB are executed in a single transaction, which dramatically reduces overhead and lock contention. Since the borrowing mechanism is not used for file rewrites, and it was never used by zvols, which seem to work fine, I don't think this change should create significant problems, partially because in addition to the borrowing mechanism pre-faults are also used.

    My tests with 4/8 threads writing several files at the same time on datasets with 32KB recordsize in 1MB requests show reduction of CPU usage by the user threads by 25-35%. I would measure it in GB/s, but at that block size we are now limited by the lock contention of the single write issue taskqueue, which is a separate problem we are going to work on.

    Reviewed-by: Brian Atkinson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14964
* Remove unnecessary commas in zpool-create.8 | Laevos | 2023-06-27 | 1 | -2/+2

    Reviewed-by: Brian Atkinson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Laevos <[email protected]>
    Closes #15011
* Another set of vdev queue optimizations. | Alexander Motin | 2023-06-27 | 8 | -172/+205

    Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add io_queue_state field to struct zio.

    To not check number of queued I/Os for each priority add vq_cqueued bitmap to struct vdev_queue. Update it when adding/removing I/Os. Make vq_cactive a separate array instead of struct vdev_queue_class member. Together those allow to avoid lots of cache misses when looking for work in vdev_queue_class_to_issue().

    Introduce deadline of ~0.5s for LBA-sorted queues. Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to make the comparison function slightly more complicated, but since it uses all the same cache lines the difference is minimal. For sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses the simpler avl_first(), falling back to avl_find() and avl_nearest() only when needed.

    Arrange members in struct zio to access only one cache line when searching through vdev queues. While there, remove io_alloc_node, reusing the io_queue_node instead. Those two are never used at the same time.

    Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4 years since implemented, while still wasting time maintaining the offset-sorted tree of TRIM requests. Just remove the tree.

    Remove locking from txg_all_lists_empty(). It is racy by design, while 2 pairs of locks/unlocks take noticeable time under the vdev queue lock.

    With these changes in my tests with volblocksize=4KB I measure vdev queue lock spin time reduction by 50% on read and 75% on write.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14925
* Add a delay to tearing down threads. | Rich Ercolani | 2023-06-26 | 3 | -1/+49

    It's been observed that in certain workloads (zvol-related being a big one), ZFS will end up spending a large amount of time spinning up taskqs only to tear them down again almost immediately, then spin them up again...

    I noticed this when I looked at what my mostly-idle system was doing and wondered how on earth taskq creation/destroy was a bunch of time...

    So I added a configurable delay to avoid it tearing down tasks the first time it notices them idle, and the total number of threads at steady state went up, but the amount of time being burned just tearing down/turning up new ones almost vanished.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Rich Ercolani <[email protected]>
    Closes #14938
* Fix memory leak in zil_parse(). | Alexander Motin | 2023-06-17 | 1 | -2/+6

    482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly leaking up to 128KB of memory per dataset during ZIL replay.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Paul Dagnelie <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14987
* Shorten arcstat_quiescence sleep time | George Amanakis | 2023-06-15 | 1 | -1/+1

    With the latest L2ARC fixes, 2 seconds is too long to wait for quiescence of arcstats like l2_size. Shorten this interval to avoid having the persistent L2ARC tests in ZTS prematurely terminated.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: George Amanakis <[email protected]>
    Closes #14981
* Remove ARC/ZIO physdone callbacks. | Alexander Motin | 2023-06-15 | 8 | -143/+26

    Those callbacks were introduced many years ago as part of a bigger patch to smoothen the write throttling within a txg. They allow to account completion of individual physical writes within a logical one, improving cases when some of the physical writes complete much sooner than others, gradually opening the write throttle.

    A few years after that ZFS got allocation throttling, working on a level of logical writes and limiting number of writes queued to vdevs at any point, and so limiting latency distribution between the physical writes and especially writes of multiple copies. The addition of scheduling deadline I proposed in #14925 should further reduce the latency distribution. Grown memory sizes over the past 10 years should also reduce importance of the smoothing.

    While the use of physdone callback may still in theory provide some smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by pool-wide lock, in case of 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention.

    My tests of this patch show radical reduction of the lock spinning time on workloads when smaller blocks are written to RAIDZ pools, when each of the disks receives 8-16KB chunks, but the total rate reaching 100K+ blocks per second. At the same time, attempts to measure any write time fluctuations didn't show anything noticeable.

    While there, remove also io_child_count/io_parent_count counters. They are used only for a couple of assertions that can be avoided.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14948
* ZTS: Skip send_raw_ashift on FreeBSD | Brian Behlendorf | 2023-06-14 | 2 | -0/+5

    On FreeBSD 14 this test runs slowly in the CI environment and is killed by the 10 minute timeout. Skip the test on FreeBSD until the slow down is resolved.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #14961
* Switch refcount tracking from lists to AVL-trees. | Alexander Motin | 2023-06-14 | 2 | -95/+108

    With a large number of tracked references, list searches under the lock become too expensive, creating enormous lock contention.

    In my tests with ZFS_DEBUG enabled this increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s.

    Reviewed-by: Brian Atkinson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14970
* Store the L2ARC device ashift in the vdev label | George Amanakis | 2023-06-14 | 2 | -11/+11

    If this is not done, and the pool has an ashift other than the default (at the moment 9) then the following happens:
    1) vdev_alloc() assigns the ashift of the pool to the L2ARC device, but upon export it is not stored anywhere
    2) at the first import, vdev_open() sees a vdev_ashift() of 0 and assigns the logical_ashift, which is 9
    3) reading the contents of L2ARC, including the header, fails
    4) L2ARC buffers are not restored in ARC.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: George Amanakis <[email protected]>
    Closes #14313
    Closes #14963