path: root/module/zfs
* Add support for zpool user properties (Allan Jude, 2023-04-21, 1 file, -41/+87)
  Usage: zpool set org.freebsd:comment="this is my pool" poolname
  Tests are based on zfs_set's user property tests. Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Mateusz Piotrowski <[email protected]> Sponsored-by: Beckhoff Automation GmbH & Co. KG. Sponsored-by: Klara Inc. Closes #11680
* Create zap for root vdev (rob-wing, 2023-04-20, 3 files, -4/+44)
  And add it to the AVZ. This is not backwards compatible with older pools due to an assertion in spa_sync() that verifies the number of ZAPs of all vdevs matches the number of ZAPs in the AVZ. Granted, the assertion only applies to DEBUG builds - still, a feature flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2.
  Notably, this allows getting/setting properties on the root vdev:
  % zpool set user:prop=value <pool> root-0
  Before this commit, it was already possible to get/set properties on top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
  % zpool set user:prop=value <pool> mirror-0
  This syntax also applies to the root vdev, as it is of type 'root' with a vdev_id of 0, root-0. The keyword 'root' can be used as an alias for 'root-0'.
  The following tests have been added:
  - zpool get all properties from root vdev
  - zpool set a property on root vdev
  - verify root vdev ZAP is created
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Wing <[email protected]> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc. Closes #14405
* Allow MMP to bypass waiting for other threads (Herb Wartens, 2023-04-19, 2 files, -4/+27)
  At our site we have seen cases when multi-modifier protection is enabled (multihost=on) on our pool and the pool gets suspended due to a single disk that is failing and responding very slowly. Our pools have 90 disks in them and we expect disks to fail. The current version of MMP requires that we wait for other writers before moving on. When a disk is responding very slowly, we observed that waiting here was bad enough to cause the pool to suspend. This change allows the MMP thread to bypass waiting for other threads and reduces the chances the pool gets suspended.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Herb Wartens <[email protected]> Closes #14659
* Fix "Detach spare vdev in case if resilvering does not happen"Ameer Hamza2023-04-192-3/+14
| | | | | | | | | | | | | | | Spare vdev should detach from the pool when a disk is reinserted. However, spare detachment depends on the completion of resilvering, and if resilver does not schedule, the spare vdev keeps attached to the pool until the next resilvering. When a zfs pool contains several disks (25+ mirror), resilvering does not always happen when a disk is reinserted. In this patch, spare vdev is manually detached from the pool when resilvering does not occur and it has been tested on both Linux and FreeBSD. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14722
* Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()"Tony Hutter2023-04-181-16/+5
| | | | | | | | | | | | This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b. Users identified this commit as a possible source of data corruption: https://github.com/openzfs/zfs/issues/14753 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Issue #14753 Closes #14761
* Fix VERIFY(!zil_replaying(zilog, tx)) panic (Pawel Jakub Dawidek, 2023-04-17, 1 file, -3/+1)
  The zfs_log_clone_range() function is never called from the zfs_clone_range_replay() function, so I assumed it is safe to assert that zil_replaying() is never TRUE here. It turns out zil_replaying() also returns TRUE when the sync property is set to disabled. Fix the problem by just returning if zil_replaying() returns TRUE.
  Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reported by: Florian Smeets Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14758
* Fix data corruption when cloning embedded blocks (Pawel Jakub Dawidek, 2023-04-12, 1 file, -2/+4)
  Don't overwrite blk_phys_birth, as for embedded blocks it is part of the payload.
  Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Issue #13392 Closes #14739
* Fix in check_filesystem() (George Amanakis, 2023-04-12, 1 file, -5/+5)
  Fix the code in the case of missing snapshots. Previously the check sat inside a conditional that would only be executed if the filesystem had snapshots.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14735
* Trim needless zeroes from checksum events (Alan Somers, 2023-04-10, 1 file, -5/+5)
  The ereport.fs.zfs.checksum event contains histograms of the bits that were wrongly set or cleared according to their bit position in a 64-bit word. So the maximum value that any histogram bucket could have would be 64. But ZFS currently uses a uint32_t to hold each bucket. As a result, the event report is full of needless zeroes. Change the bucket size to uint8_t, stripping 768 needless zeros from each event.
  Original event format:
```
class=ereport.fs.zfs.checksum ena=639460469834258433 pool=testpool.1933 pool_guid=4979719877084416563 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=4136721804819128578 vdev_type=file vdev_path=/tmp/kyua.1TxP3A/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=609837019678 vdev_delta_ts=33450 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=2751977006639883417 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=702976 zio_size=1024 zio_objset=24 zio_object=0 zio_level=3 zio_blkid=0 bad_ranges=0000000000000400 bad_ranges_min_gap=8 bad_range_sets=0000079e bad_range_clears=00000854 bad_set_histogram=000000210000001a000000150000001d000000240000001b000000220000001b000000210000002100000018000000260000002300000025000000210000001e000000250000001b0000001d0000001e0000001600000025000000180000001b000000240000001b000000240000001b0000001c000000210000001b0000001e000000210000001a0000001e000000220000001d0000001b000000200000001f0000001a000000250000001f0000001d0000001b0000001d000000240000001d0000001b0000001b0000001f00000024000000190000001a0000001f0000001e000000240000001e0000002400000021000000200000001d0000001d00000021 bad_cleared_histogram=000000220000002700000021000000210000001b0000001a000000250000001f0000001c0000001e0000002400000022000000220000002400000022000000240000002200000021000000220000001b0000002100000021000000190000001b000000240000002400000020000000290000002a00000028000000250000002400000020000000270000002500000016000000270000001c000000210000001f000000240000001c0000002100000022000000240000002100000023000000210000002700000022000000240000001b00000022000000210000001c00000023000000150000002600000020000000270000001e0000001d0000002400000026 time=00000016806457270000000323406839 eid=458
```
  New format:
```
class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62
```
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #14716
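  A minimal userspace sketch of why the shrink is safe (illustrative names, not the actual ZFS event code): each bucket counts wrongly-set bits per bit position of a 64-bit word, and per the commit the count is bounded by 64, so a uint8_t bucket is sufficient and three of every four payload bytes per bucket were always zero.
```c
#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 64

/* Hypothetical bucket layouts: one counter per bit position in a word. */
typedef struct { uint32_t buckets[WORD_BITS]; } hist_old_t;	/* 256 bytes */
typedef struct { uint8_t  buckets[WORD_BITS]; } hist_new_t;	/*  64 bytes */

/* Record which bit positions are set in 'bad' but clear in 'good'. */
static void
hist_update(hist_new_t *h, uint64_t good, uint64_t bad)
{
	uint64_t wrongly_set = bad & ~good;

	for (int bit = 0; bit < WORD_BITS; bit++) {
		if (wrongly_set & (1ULL << bit))
			h->buckets[bit]++;
	}
}

int
main(void)
{
	hist_new_t h = { { 0 } };

	hist_update(&h, 0x00ff, 0x0fff);
	printf("old %zu bytes, new %zu bytes, bucket[8] = %u\n",
	    sizeof (hist_old_t), sizeof (hist_new_t), (unsigned)h.buckets[8]);
	return (0);
}
```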
* Linux 6.3 compat: idmapped mount API changes (youzhongyang, 2023-04-10, 2 files, -9/+9)
  Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap type for mounts (struct mnt_idmap), so we need to detect these changes and make zfs work with the new APIs.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14682
* vdev: expose zfs_vdev_max_ms_shift as a module parameter (Rob N, 2023-04-06, 1 file, -2/+5)
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Seagate Technology LLC Closes #14719
* Fix typo in check_clones() (George Amanakis, 2023-04-06, 1 file, -1/+1)
  Run kmem_free() after zap_cursor_fini().
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Adam Moss <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14702
* Miscellaneous FreeBSD compilation bugfixes (Martin Matuška, 2023-04-06, 1 file, -1/+1)
  Add missing machine/md_var.h to spl/sys/simd_aarch64.h and spl/sys/simd_arm.h. In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, so use PCB_NPXNOSAVE on i386. FreeBSD's sys/elf_common.h redefines AT_UID and AT_GID, so we need a hack in vnode.h similar to Linux. sys/simd.h needs to be included early. In zfs_freebsd_copy_file_range() we pass a (size_t *)lenp to zfs_clone_range() that expects a (uint64_t *). Allow compiling the armv6 world by limiting ARM macros in sha256_impl.c and sha512_impl.c to __ARM_ARCH > 6.
  Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: WHR <[email protected]> Signed-off-by: Martin Matuska <[email protected]> Closes #14674
* vdev: expose zfs_vdev_def_queue_depth as a module parameter (Rob N, 2023-04-06, 1 file, -0/+3)
  It was previously available only to FreeBSD.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Seagate Technology LLC Closes #14718
* Fix some signedness issues in arc_evict() (Alexander Motin, 2023-04-05, 1 file, -4/+4)
  It may happen that the "wanted total ARC size" (wt) is negative, which is expected. But the product of it and unsigned fractions is an unsigned value, which was then incorrectly shifted right with a loss of sign.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14692
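  A self-contained sketch of the signedness pitfall described above (illustrative values, not the ZFS source): multiplying a negative signed value by an unsigned fraction promotes the arithmetic to unsigned, so the following right shift loses the sign.
```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t wt = -1024;		/* "wanted total size" may go negative */
	uint64_t frac = 1ULL << 31;	/* 0.5 as a 32-bit fixed-point fraction */

	/* Bug: int64_t * uint64_t is evaluated as uint64_t, the shift is logical. */
	int64_t bad = (wt * frac) >> 32;

	/* Fix: keep the arithmetic signed so the shift preserves the sign */
	/* (right shift of a negative value is arithmetic on common compilers). */
	int64_t good = (wt * (int64_t)frac) >> 32;

	printf("bad = %lld, good = %lld\n", (long long)bad, (long long)good);
	return (0);
}
```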
* Fixes in persistent error log (George Amanakis, 2023-03-28, 6 files, -164/+198)
  Address the following bugs in the persistent error log:
  1) Check nested clones, e.g. "fs->snap->clone->snap2->clone2".
  2) When deleting files containing error blocks in those clones (from "clone" in the example above), do not break the check chain.
  3) When deleting files in the originating fs before syncing the errlog to disk, do not break the check chain. This happens because at the time of introducing the error block in the error list, we do not have its birth txg and the head filesystem. If the original file is deleted before the error list is synced to the error log (which is when we actually look up the birth txg and the head filesystem), then we do not have access to this info anymore and break the check chain.
  The most prominent change is related to achieving (3). We expand the spa_error_entry_t structure to accommodate the newly introduced zbookmark_err_phys_t structure (containing the birth txg of the error block). Due to compatibility reasons we cannot remove the zbookmark_phys_t structure and we also need to place the new structure after se_avl, so it is not accounted for in avl_find(). Then we modify spa_log_error() to also provide the birth txg of the error block. With these changes in place we simplify the previously introduced function get_head_and_birth_txg() (now named get_head_ds()).
  We chose not to follow the same approach for the head filesystem (thus completely removing get_head_ds()) to avoid introducing new lock contentions.
  The stack sizes of nested functions (as measured by checkstack.pl in the linux kernel) are:
  check_filesystem [zfs]: 272 (was 912)
  check_clones [zfs]: 64
  We also introduced two new tests covering the above changes.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14633
* Fix short-lived txg caused by autotrim (Kevin Jin, 2023-03-28, 3 files, -14/+63)
  The current autotrim causes short-lived txgs through:
  1. calling txg_wait_synced() in metaslab_enable()
  2. calling txg_wait_open() with should_quiesce = true
  This patch addresses all the issues mentioned above. A new cv, vdev_autotrim_kick_cv, is added to kick autotrim activity. It will be signaled once a txg is synced so that it does not change the original autotrim pace. Also, because it is a cv, the wait is interruptible, which speeds up the vdev_autotrim_stop_wait() call. Finally, combined with a big zfs_txg_timeout, txg_wait_open() also causes delays when exporting a pool.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: jxdking <[email protected]> Issue #8993 Closes #12194
* Additional limits on hole reporting (Brian Behlendorf, 2023-03-28, 2 files, -15/+19)
  Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external callers which do not take the rangelock, holes are not reported after the first sync. While not optimal, this is always functionally correct.
  This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl().
  Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14512 Closes #14641
* Revert "Do not hold spa_config in ZIL while blocked on IO"George Wilson2023-03-281-2/+6
| | | | | | | | This reverts commit 7d638df09be7482935bcf6ec8e4ea2ac8a8be1a8. Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #14678
* Update vdev state for spare vdev (Ameer Hamza, 2023-03-24, 1 file, -3/+5)
  zfsd fetches the new pool configuration through ZFS_IOC_POOL_STATS, but it does not get the updated nvlist configuration for a spare vdev since the configuration is read from spa_spares->sav_config. This commit updates the vdev state for the spare vdev, which is consumed by zfsd on spare disk hotplug.
  Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14653
* panic loop when removing slog device (George Wilson, 2023-03-24, 1 file, -20/+12)
  There is a window in the slog removal code where a panic loop could ensue if the system crashes during that operation. The original design of slog removal did not persist any state because the removal happened synchronously. This was changed by a later commit which persisted the vdev_removing flag and exposed this bug. If a slog removal is in progress and happens to crash after persisting the vdev_removing flag to the label but before the vdev is removed from the spa config, then the pool will continue to panic on import. Here's a sample of the panic:
  [ 134.387411] VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22)
  [ 134.393865] PANIC at dmu.c:1135:dmu_write()
  [ 134.396035] Kernel panic - not syncing: VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22)
  [ 134.397857] CPU: 2 PID: 5914 Comm: txg_sync Kdump: loaded Tainted: P OE 5.4.0-1100-dx2023020205-b3751f8c2-azure #106
  [ 134.407938] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
  [ 134.407938] Call Trace:
  [ 134.407938] dump_stack+0x57/0x6d
  [ 134.407938] panic+0xfb/0x2d7
  [ 134.407938] spl_panic+0xcf/0x102 [spl]
  [ 134.407938] ? traverse_impl+0x1ca/0x420 [zfs]
  [ 134.407938] ? dmu_object_alloc_impl+0x3b4/0x3c0 [zfs]
  [ 134.407938] ? dnode_hold+0x1b/0x20 [zfs]
  [ 134.407938] dmu_write+0xc3/0xd0 [zfs]
  [ 134.407938] ? space_map_alloc+0x55/0x80 [zfs]
  [ 134.407938] metaslab_sync+0x61a/0x830 [zfs]
  [ 134.407938] ? queued_spin_unlock+0x9/0x10 [zfs]
  [ 134.407938] vdev_sync+0x72/0x190 [zfs]
  [ 134.407938] spa_sync_iterate_to_convergence+0x160/0x250 [zfs]
  [ 134.407938] spa_sync+0x2f7/0x670 [zfs]
  [ 134.407938] txg_sync_thread+0x22d/0x2d0 [zfs]
  [ 134.407938] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
  [ 134.407938] thread_generic_wrapper+0x83/0xa0 [spl]
  [ 134.407938] kthread+0x104/0x140
  [ 134.407938] ? kasan_check_write.constprop.0+0x10/0x10 [spl]
  [ 134.407938] ? kthread_park+0x90/0x90
  [ 134.457802] ret_from_fork+0x1f/0x40
  This change no longer persists the vdev_removing flag when removing slog devices and also cleans up some code that was added which is not used.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #14652
* Fix prefetching of indirect blocks while destroying (Matthew Ahrens, 2023-03-24, 1 file, -15/+9)
  When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or `zfs send`), we prefetch the indirect blocks that will be needed, in `traverse_prefetch_metadata()`. In the case of `zfs destroy <fs>`, we do a little traversing each txg, and resume the traversal the next txg. So the indirect blocks that will be needed, and thus are candidates for prefetching, do not include blocks that are before the resume point.
  The problem is that the logic for determining if the indirect blocks are before the resume point is incorrect, causing the (up to 1024) L1 indirect blocks that are inside the first L2 to not be prefetched. In practice, if we are able to read many more than 1024 blocks per txg, then this will be inconsequential. But if i/o latency is more than a few milliseconds, almost no L1's will be prefetched, so they will be read serially, and thus the destroying will be very slow. This can be observed as `zpool get freeing` decreasing very slowly.
  Specifically: When we first examine the L2 that contains the block we'll be resuming from, we have not yet resumed, so `td_resume` is nonzero. At this point, all calls to `traverse_prefetch_metadata()` will fail, even if the L1 in question is after the resume point. It isn't until the callback is issued for the resume point that we zero out `td_resume`, but by this point we've already attempted and failed to prefetch everything under this L2 indirect block.
  This commit addresses the issue by reusing the existing `resume_skip_check()` to determine if the L1's bookmark is before or after the resume point. To do so, this function is made non-mutating (the caller now zeros `td_resume`).
  Note, this bug likely predates (was not introduced by) #11803.
  Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #14603
* Fix cloning into already dirty dbufs. (Pawel Jakub Dawidek, 2023-03-24, 2 files, -13/+25)
  Undirty the dbuf and destroy its buffer when cloning into it.
  Coverity ID: CID-1535375 Reported-by: Richard Yao Reported-by: Benjamin Coddington Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14655
* Fix possible bad bit shift in dnode_next_offset_level() (Richard Yao, 2023-03-16, 1 file, -2/+3)
  031d7c2fe6afaa78943bd0a563b91fc84ace42d7 did not handle reverse iteration, such that the original issue theoretically could still occur.
  Note that contrary to the claim in the ZFS disk format specification that a maximum of 6 levels are possible, 9 levels are possible with recordsize=512 and an indirect block size of 16KB. In this unusual configuration, span will be 65. The maximum span value of 70 can be reached at recordsize=16K and an indirect block size of 16KB. When we are at this indirection level and are traversing backward, the minimum value is start, but we cannot calculate that with 64-bit arithmetic, so we avoid the calculation and instead rely on the earlier statement that did `*offset = start;`.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-1466214) Closes #14618
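  A hedged sketch of the class of bug (generic C, not the actual dnode_next_offset_level() code): shifting a 64-bit value by 64 or more bits is undefined behavior, so when the computed span can reach 65-70 the code must skip the shift and fall back to the already-known minimum.
```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative helper: round an offset down to the start of its "span"
 * without ever shifting a 64-bit value by 64 or more bits (undefined).
 */
static uint64_t
level_min_offset(uint64_t start, int span)
{
	if (span >= 64) {
		/* Cannot compute (1ULL << span); rely on the caller's start. */
		return (start);
	}
	return (start & ~((1ULL << span) - 1));
}

int
main(void)
{
	printf("%llu\n", (unsigned long long)level_min_offset(12345, 10));
	printf("%llu\n", (unsigned long long)level_min_offset(12345, 65));
	return (0);
}
```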
* Remove unused Edon-R variants (Tino Reichardt, 2023-03-14, 1 file, -5/+3)
  This commit removes the edonr_byteorder.h file and all unused variants of Edon-R.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13618
* nvpair: Constify string functions (Richard Yao, 2023-03-14, 21 files, -95/+96)
  After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This led to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch.
  Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14612
* Fix possible NULL pointer dereference in dbuf_verify() (Richard Yao, 2023-03-14, 1 file, -1/+1)
  Coverity reported a dereference after a NULL check in dbuf_verify(). If `dn` is `NULL`, we can just assume that !dn->dn_free_txg, so we change `!dn->dn_free_txg` to `(dn == NULL || !dn->dn_free_txg)`.
  Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-992298) Closes #14619
* Replace dead opensolaris.org license links (Tino Reichardt, 2023-03-14, 1 file, -1/+1)
  The commit replaces all findings of the link http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0
  Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #14625
* Zero zio_prop_t in flush_write_batch_impl() (Richard Yao, 2023-03-14, 1 file, -1/+1)
  After 67a1b0379159c46bcd60a462a2790248046c8804 was merged, coverity started complaining about an uninitialized scalar variable in flush_write_batch_impl() due to the new field zp.zp_brtwrite. Upon inspection, it appears that uninitialized memory was being copied for non-raw streams, so this is a pre-existing issue. The addition of zp_brtwrite by the block cloning commit caused Coverity to begin to notice it.
  Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-1535378) Closes #14607
* Fix uninitialized scalar value read regression in dmu_recv_begin() (Richard Yao, 2023-03-14, 1 file, -1/+1)
  da19d919a853ad05ef300fe000e6c96c4db84bcf changed this in a way that permits execution to reach `if (err == 0)` without initializing err. This could randomly cause the sync task to not execute. We fix that by initializing err to zero.
  Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Richard Yao <[email protected]> Reported-by: Coverity (CID-1535377) Closes #14607
* ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() (Matthew Ahrens, 2023-03-14, 1 file, -5/+16)
  `lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance.
  Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write.
  The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.)
  This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #13368 Issue #14594 Issue #14512 Issue #14009
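  For context, a minimal userspace example of the caller side of the "stabilizing operations" being discussed (this only demonstrates the syscalls, not the ZFS-internal change; SEEK_DATA/SEEK_HOLE need _GNU_SOURCE on Linux):
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open");
		return (1);
	}
	/* On ZFS these may wait for the open txg if the file has dirty data blocks. */
	off_t data = lseek(fd, 0, SEEK_DATA);	/* first data at or after offset 0 */
	off_t hole = lseek(fd, 0, SEEK_HOLE);	/* first hole at or after offset 0 */
	printf("first data at %lld, first hole at %lld\n",
	    (long long)data, (long long)hole);
	close(fd);
	return (0);
}
```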
* Implementation of block cloning for ZFS (Pawel Jakub Dawidek, 2023-03-10, 16 files, -79/+3028)
  Block Cloning allows manually cloning a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c.
  Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Christian Schwarz <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #13392
* Fix incremental receive silently failing for recursive sends (Paul Dagnelie, 2023-03-10, 1 file, -9/+26)
  The problem occurs because dmu_recv_begin pulls in the payload and next header from the input stream in order to use the contents of the begin record's nvlist. However, the change to do that before the other checks in dmu_recv_begin occur caused a regression where an empty send stream in a recursive send could have its END record consumed by this, which broke the logic of recv_skip. A test is also included to protect against this case in the future.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #12661 Closes #14568
* txg_sync should handle write errors in ZIL (Richard Yao, 2023-03-10, 1 file, -1/+0)
  The txg_sync thread will see certain buffers in a DR_IN_DMU_SYNC state when ZIL is writing them out. Then it waits until the state changes, but has an assertion to check that they were not DR_NOT_OVERRIDDEN. If the data write failed with an error, ZIL will put it into the DR_NOT_OVERRIDDEN state. It looks like the code will handle that state without an issue, so we can just delete the assertion.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Sponsored-By: Wasabi Technology, Inc. Closes #14283
* Suppress clang static analyzer warning in vdev_stat_update() (Richard Yao, 2023-03-08, 1 file, -0/+6)
  63652e154643cfe596fe077c13de0e7be34dd863 added unnecessary branches in `vdev_stat_update()` to suppress an ASAN false positive that breaks ztest. This had the downside of causing false positive reports in both Coverity and Clang's static analyzer. vd is never NULL, so we add a preprocessor check to only apply the workaround when compiling with ASAN support.
  Reported-by: Coverity (CID-1524583) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Suppress static analyzer warning in dmu_objset_create_impl_dnstats() (Richard Yao, 2023-03-08, 1 file, -0/+1)
  ae7e7006500ca37c471dd625cd5cbfc590b49885 added an assertion to suppress a complaint from Clang's static analyzer. Unfortunately, it missed another way for Clang to complain about this function. This adds another assertion to handle that.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Suppress Clang Static Analyzer warning in bpobj_enqueue() (Richard Yao, 2023-03-08, 1 file, -0/+1)
  scan-build does not do cross translation unit analysis to realize that `dmu_buf_hold()` will always set `bpo->bpo_cached_dbuf` to a non-NULL pointer, so we add an assertion to make it realize this.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Suppress Clang Static Analyzer warning in dsl_dir_rename_sync() (Richard Yao, 2023-03-08, 1 file, -0/+2)
  Clang's static analyzer reports that if we try to rename a root dataset in `dsl_dir_rename_sync()`, we will have a NULL pointer passed to strlcpy(). This is impossible because `dsl_dir_rename_check()` will prevent us from doing this. We add an assertion to silence this warning.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Cleanup: Remove constant comparisons reported by CodeQL (Richard Yao, 2023-03-08, 3 files, -6/+4)
  CodeQL's cpp/constant-comparison query from its security-and-extended query set reported 4 instances where we have comparisons that always evaluate the same way.
  In `draid_config_by_type()`, we have an early `if (nparity == 0)` check that returns `EINVAL`, making a later `if (nparity == 0 || nparity > VDEV_DRAID_MAXPARITY)` partially redundant. The later check prints an error message when parity is 0, but the early check does not. This is not useful feedback, so we move the later check to the place where the early check runs to replace the early check.
  In `perform_thread_merge()`, we return when `num_threads == 0`. After that block, we do `if (num_threads > 0) {`, which will always be true. We remove the `if` statement.
  In `sa_modify_attrs()`, we have a loop condition that is `k != 2`, but at the end of the loop, we have `if (k == 0 && hdl->sa_spill)` followed by an else that does a break. The result is that k != 2 will never be evaluated when it is false. We drop the comparison.
  In `zap_leaf_array_read()`, we have a for loop condition that is `i < ZAP_LEAF_ARRAY_BYTES && len > 0`. However, that loop itself is in a loop that is `while (len > 0)` and while the value of len is decremented inside the loop, when `len == 0`, it will return, such that `len > 0` inside the loop condition will always be true. We drop that part of the condition.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
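  A generic before/after sketch of the first pattern above (hypothetical function and constant, not the ZFS code): an early return makes the later equality test constant, so the later check's error message can never fire and the two checks should be merged.
```c
#include <stdio.h>

#define MAXPARITY 3

/* Before: the second nparity == 0 test can never be true. */
static int
config_before(int nparity)
{
	if (nparity == 0)
		return (-1);				/* silent early return */
	if (nparity == 0 || nparity > MAXPARITY) {	/* dead first half */
		fprintf(stderr, "invalid parity %d\n", nparity);
		return (-1);
	}
	return (0);
}

/* After: one check, and the error message now covers the zero case too. */
static int
config_after(int nparity)
{
	if (nparity == 0 || nparity > MAXPARITY) {
		fprintf(stderr, "invalid parity %d\n", nparity);
		return (-1);
	}
	return (0);
}

int
main(void)
{
	printf("%d %d %d\n", config_before(0), config_after(0), config_after(2));
	return (0);
}
```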
* Suppress Clang Static Analyzer warning in dbuf_dnode_findbp() (Richard Yao, 2023-03-08, 1 file, -0/+1)
  Clang's static analyzer reports that if a `blkid == DMU_SPILL_BLKID` is passed, then we can have a NULL pointer dereference when either ->dn_have_spill or `DNODE_FLAG_SPILL_BLKPTR` is not set. This should not happen. We add an `ASSERT()` to suppress reports about NULL pointer dereferences. Originally, I wanted to use one or two IMPLY statements on pre-conditions before the call to `dbuf_findbp()`, but Clang's static analyzer did not understand it.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Suppress Clang Static Analyzer warning in vdev_split() (Richard Yao, 2023-03-08, 1 file, -0/+4)
  Clang's static analyzer pointed out that we can have a NULL pointer dereference if we ever attempt to split a vdev that has only 1 child. If that happens, we are left with zero children, but then try to access a non-existent child. Calling vdev_split() on a vdev with only 1 child should be impossible due to how the code is structured. If this ever happens, it would be best to stop execution immediately even in a production environment to allow for the best possible chance of recovery by an expert, so we use `VERIFY3U()` instead of `ASSERT3U()`. Unfortunately, while that defensive assertion will prevent execution from ever reaching the NULL pointer dereference, Clang's static analyzer does not realize that, so we add an `ASSERT()` to inform it of this.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Suppress Clang Static Analyzer defect report in abd_get_size() (Richard Yao, 2023-03-08, 1 file, -0/+2)
  Clang's static analyzer reports a possible NULL pointer dereference in abd_get_size() when called from vdev_draid_map_alloc_write() called from vdev_draid_map_alloc_row() and vdc->vdc_nparity == 0. This should be impossible, so we add an assertion to silence the defect report.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* More adaptive ARC eviction (Alexander Motin, 2023-03-08, 1 file, -638/+325)
  Traditionally ARC adaptation was limited to the MRU/MFU distribution. But for years people with metadata-centric workloads demanded mechanisms to also manage the data/metadata distribution, which in original ZFS was just a FIFO. As a result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO.
  This change removes most of the existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and metadata, same as the distribution between data and metadata themselves. Since most of the required state separation was already done, it only required making the arcs_size state field specific per data/metadata.
  The adaptation logic is still based on the previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes, instead of arc_p measured in bytes, this code uses 3 variables arc_meta, arc_pd and arc_pm, representing the ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when we need to evict, this moves all the logic from arc_adapt() to arc_evict(), which reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows removing the ugly ARC_HDR_DO_ADAPT flag from many places.
  This change also removes a number of metadata-specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduces a single opaque knob, zfs_arc_meta_balance, tuning ARC's reaction to ghost hits and allowing the administrator to give more or less preference to metadata without setting strict limits.
  Some of the old code parts like arc_evict_meta() are just removed, because since the introduction of the ABD ARC they really make no sense: only headers referenced by a small number of buffers are not evictable, and they are really not evictable no matter what this code does. Instead we just call arc_prune_async() if too much metadata appears not evictable.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14359
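  An illustrative sketch of the 32-bit fixed-point balance idea (an assumed simplification; the names only mirror the description above, this is not the ARC code): a fraction in [0, 2^32] encodes how capacity is split between two states, and a ghost hit nudges it toward the state that was evicted too aggressively.
```c
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 32
#define FRAC_ONE  (1ULL << FRAC_BITS)

/* Move the balance a small step toward state A (0) or state B (FRAC_ONE). */
static uint64_t
balance_adjust(uint64_t bal, int toward_b, uint64_t step)
{
	if (toward_b)
		return (bal + step > FRAC_ONE ? FRAC_ONE : bal + step);
	return (bal < step ? 0 : bal - step);
}

/* Split total capacity 'c' according to the current balance. */
static void
balance_split(uint64_t bal, uint64_t c, uint64_t *a, uint64_t *b)
{
	*b = (c * bal) >> FRAC_BITS;	/* assumes c * bal does not overflow */
	*a = c - *b;
}

int
main(void)
{
	uint64_t bal = FRAC_ONE / 2;	/* start with a 50/50 split */
	uint64_t a, b;

	/* A ghost hit on B's list means B was evicted too hard: favor B. */
	bal = balance_adjust(bal, 1, FRAC_ONE / 64);
	balance_split(bal, 1000000, &a, &b);
	printf("A gets %llu, B gets %llu\n",
	    (unsigned long long)a, (unsigned long long)b);
	return (0);
}
```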
* Do not hold spa_config in ZIL while blocked on IO (Richard Yao, 2023-03-07, 1 file, -6/+2)
  Otherwise, we can get a deadlock that looks like this:
  1. fsync() grabs spa_config_enter(zilog->zl_spa, SCL_STATE, lwb, RW_READER) as part of zil_lwb_write_issue(). It then blocks on the txg_sync when a flush fails from a drive power cycling.
  2. The txg_sync then blocks on the pool suspending due to the loss of too many disks.
  3. zpool clear then blocks on spa_config_enter(spa, SCL_STATE | SCL_L2ARC | SCL_ZIO, spa, RW_WRITER) because it is a writer.
  The disks cannot be brought online due to fsync() holding that lock and the user gets upset since fsync() is uninterruptibly blocked inside the kernel.
  We need to grab the lock for vdev_lookup_top(), but we do not need to hold it while there is outstanding IO. This fixes a regression introduced by 1ce23dcaff6c3d777cb0d9a4a2cf02b43f777d78.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Sponsored-By: Wasabi Technology, Inc. Closes #14519
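  A generic sketch of the locking pattern this change adopts, using pthreads with illustrative names (the real code uses spa_config_enter()/spa_config_exit(); note this change was later reverted, see the 2023-03-28 entry above): take the shared lock only for the lookup and drop it before blocking on I/O, so a writer such as zpool clear cannot be starved by a stalled write.
```c
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t config_lock = PTHREAD_RWLOCK_INITIALIZER;

struct vdev { int id; };
static struct vdev top = { 0 };

static struct vdev *
lookup_top_vdev(void)
{
	return (&top);	/* stand-in for vdev_lookup_top() */
}

static void
issue_and_wait_for_io(struct vdev *vd)
{
	(void) vd;	/* stand-in for a write that may block for a long time */
}

static void
write_issue(void)
{
	/* Hold the config lock only while the vdev tree is consulted ... */
	pthread_rwlock_rdlock(&config_lock);
	struct vdev *vd = lookup_top_vdev();
	pthread_rwlock_unlock(&config_lock);

	/* ... and never while blocked on the I/O itself. */
	issue_and_wait_for_io(vd);
	printf("issued to vdev %d\n", vd->id);
}

int
main(void)
{
	write_issue();
	return (0);
}
```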
* Better handling for future crypto parameters (Rob N, 2023-03-07, 1 file, -2/+14)
  The intent is that this is like ENOTSUP, but specifically for when something can't be done because we have no support for the requested crypto parameters; e.g. unlocking a dataset or receiving a stream encrypted with a suite we don't support. It's not intended to be recoverable without upgrading ZFS itself. If the request could be made to work by enabling a feature or modifying some other configuration item, then some other code should be used.
  load-key: In the future we might have more crypto suites (i.e. new values for the `encryption` property). Right now trying to load a key on such a future crypto suite will look up suite parameters off the end of the crypto table, resulting in misbehaviour and/or crashes (or, with debug enabled, trip the assertion in `zio_crypt_key_unwrap`). Instead, let's check the value we got from the dataset, and if we can't handle it, abort early.
  recv: When receiving a raw stream encrypted with an unknown crypto suite, `zfs recv` would report a generic `invalid backup stream` (EINVAL). While technically correct, it's not super helpful, so let's ship a more specific error code and message.
  Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14577
* Optimize the is_l2cacheable functions (George Amanakis, 2023-03-06, 2 files, -44/+44)
  by placing the most common use case (no special vdevs) first and avoiding the allocation of new variables.
  Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14494 Closes #14563
* Add missing increment to dsl_deadlist_move_bpobj() (Richard Yao, 2023-03-06, 1 file, -1/+1)
  dc5c8006f684b1df3f2d4b6b8c121447d2db0017 was recently merged to prefetch up to 128 deadlists. Unfortunately, a loop was missing an increment, such that it will prefetch all deadlists. The performance properties of that patch probably should be re-evaluated.
  This was caught by CodeQL's cpp/constant-comparison check in an experimental branch where I am testing the security-and-extended queries. It complained about the `i < 128` part of the loop condition always evaluating to the same thing. The standard CodeQL configuration we use missed this because it does not include that check.
  Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14573
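  A tiny sketch of the bug class (illustrative, not the dsl_deadlist code): when the bound lives in the loop condition but the counter is updated separately, forgetting the increment turns "prefetch at most 128 entries" into "prefetch everything".
```c
#include <stdio.h>

#define MAX_PREFETCH 128
#define TOTAL_ENTRIES 1000

int
main(void)
{
	int prefetched = 0;
	int i = 0;

	/* Intended behavior: stop prefetching after MAX_PREFETCH entries. */
	while (i < TOTAL_ENTRIES && prefetched < MAX_PREFETCH) {
		/* ... prefetch entry i here ... */
		prefetched++;	/* without this line the cap never triggers */
		i++;
	}
	printf("prefetched %d of %d entries\n", prefetched, TOTAL_ENTRIES);
	return (0);
}
```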
* Add generic implementation handling and SHA2 impl (Tino Reichardt, 2023-03-02, 4 files, -63/+144)
  The skeleton file module/icp/include/generic_impl.c can be used for iterating over different implementations of algorithms. It is used by SHA256, SHA512 and BLAKE3 currently.
  The Solaris SHA2 implementation got replaced with a version which is based on public domain code of cppcrypto v0.10.
  These assembly files are taken from current openssl master:
  - sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
  - sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
  - sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
  - sha512-armv7.S: ARMv7, NEON (arm)
  - sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
  - sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
  - sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
  - sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
  - sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
  - sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
  Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
* System-wide speculative prefetch limit. (Alexander Motin, 2023-03-01, 1 file, -5/+24)
  With some pathological access patterns it is possible to make ZFS accumulate an almost unlimited amount of speculative prefetch ZIOs. Combined with linear ABD allocations in the RAIDZ code, it appears to be possible to exhaust system KVA, triggering a kernel panic.
  Address this by introducing a system-wide counter of active prefetch requests and blocking the doubling of a stream's prefetch distance on hits if the number of active requests is higher than ~6% of the ARC size.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14516
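  A rough sketch of the throttle described (assumed names, and ~6% is approximated as one sixteenth here; the real implementation may account for the in-flight requests differently): a system-wide counter of outstanding speculative prefetch work gates whether a stream may double its prefetch distance on a hit.
```c
#include <stdint.h>
#include <stdio.h>

static uint64_t active_prefetch_bytes;	/* system-wide; normally atomic */
static uint64_t arc_c = 4ULL << 30;	/* example ARC target size: 4 GiB */

/* Double the stream's prefetch distance only while prefetch load is low. */
static uint64_t
maybe_double_distance(uint64_t distance, uint64_t max_distance)
{
	if (active_prefetch_bytes > arc_c / 16)	/* roughly 6% of ARC */
		return (distance);		/* too much in flight: hold */
	distance *= 2;
	return (distance > max_distance ? max_distance : distance);
}

int
main(void)
{
	uint64_t dist = 1ULL << 17;		/* 128 KiB starting distance */

	active_prefetch_bytes = 1ULL << 20;	/* light load: doubling allowed */
	dist = maybe_double_distance(dist, 64ULL << 20);
	printf("distance now %llu\n", (unsigned long long)dist);

	active_prefetch_bytes = 1ULL << 30;	/* heavy load: doubling blocked */
	dist = maybe_double_distance(dist, 64ULL << 20);
	printf("distance still %llu\n", (unsigned long long)dist);
	return (0);
}
```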
* Fix data race between zil_commit() and zil_suspend() (Richard Yao, 2023-03-01, 1 file, -0/+27)
  openzfsonwindows/openzfs#206 found that it is possible to trip `VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed by the scheduler long enough for a parallel `zil_suspend()` operation to exit `zil_commit_impl()`. This is a data race. To prevent this, we introduce a `zilog->zl_suspend_lock` rwlock to ensure that all outstanding `zil_commit()` operations finish before `zil_suspend()` begins and that subsequent operations fall back to `txg_wait_synced()` after `zil_suspend()` has begun.
  On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers from writer starvation. This means that a ZIL intensive system can delay `zil_suspend()` indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against `PREEMPT_RT` Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now.
  Reported-by: Arun KV <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14514
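  A pthreads sketch of the synchronization being added (illustrative only; the real lock is zilog->zl_suspend_lock and the real fallback is txg_wait_synced()): commits take the lock as readers, suspend takes it as a writer, so suspend waits for in-flight commits to finish and later commits observe the suspended state.
```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t suspend_lock = PTHREAD_RWLOCK_INITIALIZER;
static bool suspended;

static void
commit(void)
{
	pthread_rwlock_rdlock(&suspend_lock);	/* many commits may run at once */
	if (suspended) {
		pthread_rwlock_unlock(&suspend_lock);
		/* fall back to waiting for the txg to sync instead */
		printf("commit: falling back, log is suspended\n");
		return;
	}
	/* ... build and issue log write blocks (lwbs) ... */
	printf("commit: issued to the ZIL\n");
	pthread_rwlock_unlock(&suspend_lock);
}

static void
suspend(void)
{
	pthread_rwlock_wrlock(&suspend_lock);	/* waits for in-flight commits */
	suspended = true;
	pthread_rwlock_unlock(&suspend_lock);
	printf("suspend: all prior commits drained\n");
}

int
main(void)
{
	commit();
	suspend();
	commit();
	return (0);
}
```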