| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Usage:
zpool set org.freebsd:comment="this is my pool" poolname
Tests are based on zfs_set's user property tests.
Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.
Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2
Notably, this allows to get/set properties on the root vdev:
% zpool set user:prop=value <pool> root-0
Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
% zpool set user:prop=value <pool> mirror-0
This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.
The following tests have been added:
- zpool get all properties from root vdev
- zpool set a property on root vdev
- verify root vdev ZAP is created
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Wing <[email protected]>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Herb Wartens <[email protected]>
Closes #14659
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if resilver does not schedule, the spare vdev keeps attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #14722
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b.
Users identified this commit as a possible source of data
corruption:
https://github.com/openzfs/zfs/issues/14753
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Issue #14753
Closes #14761
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The zfs_log_clone_range() function is never called from the
zfs_clone_range_replay() function, so I assumed it is safe to assert
that zil_replaying() is never TRUE here. It turns out zil_replaying()
also returns TRUE when the sync property is set to disabled.
Fix the problem by just returning if zil_replaying() returns TRUE.
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reported by: Florian Smeets
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #14758
|
|
|
|
|
|
|
|
|
|
| |
Don't overwrite blk_phys_birth, as for embedded blocks it is part of
the payload.
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Issue #13392
Closes #14739
|
|
|
|
|
|
|
|
|
| |
Fix the code in case of missing snapshots. Previously the check was in
a conditional that would be executed if the filesystem had snapshots.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14735
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ereport.fs.zfs.checksum event contains histograms of the bits that
were wrongly set or cleared according to their bit position in a 64-bit
word. So the maximum value that any histogram bucket could have would
be 64. But ZFS currently uses a uint32_t to hold each bucket. As a
result, the event report is full of needless zeroes.
Change the bucket size to uint8_t, stripping 768 needless zeros from
each event.
Original event format:
```
class=ereport.fs.zfs.checksum ena=639460469834258433 pool=testpool.1933 pool_guid=4979719877084416563 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=4136721804819128578 vdev_type=file vdev_path=/tmp/kyua.1TxP3A/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=609837019678 vdev_delta_ts=33450 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=2751977006639883417 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=702976 zio_size=1024 zio_objset=24 zio_object=0 zio_level=3 zio_blkid=0 bad_ranges=0000000000000400 bad_ranges_min_gap=8 bad_range_sets=0000079e bad_range_clears=00000854 bad_set_histogram=000000210000001a000000150000001d000000240000001b000000220000001b000000210000002100000018000000260000002300000025000000210000001e000000250000001b0000001d0000001e0000001600000025000000180000001b000000240000001b000000240000001b0000001c000000210000001b0000001e000000210000001a0000001e000000220000001d0000001b000000200000001f0000001a000000250000001f0000001d0000001b0000001d000000240000001d0000001b0000001b0000001f00000024000000190000001a0000001f0000001e000000240000001e0000002400000021000000200000001d0000001d00000021 bad_cleared_histogram=000000220000002700000021000000210000001b0000001a000000250000001f0000001c0000001e0000002400000022000000220000002400000022000000240000002200000021000000220000001b0000002100000021000000190000001b000000240000002400000020000000290000002a00000028000000250000002400000020000000270000002500000016000000270000001c000000210000001f000000240000001c0000002100000022000000240000002100000023000000210000002700000022000000240000001b00000022000000210000001c00000023000000150000002600000020000000270000001e0000001d0000002400000026 time=00000016806457270000000323406839 eid=458
```
New format:
```
class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62
```
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Alan Somers <[email protected]>
Sponsored-by: Axcient
Closes #14716
|
|
|
|
|
|
|
|
|
| |
Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap
type for mounts (struct mnt_idmap), we need to detect these changes
and make zfs work with the new APIs.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes #14682
|
|
|
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719
|
|
|
|
|
|
|
|
|
| |
Run kmem_free() after zap_cursor_fini().
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Adam Moss <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14702
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add missing machine/md_var.h to spl/sys/simd_aarch64.h and
spl/sys/simd_arm.h
In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, use PCB_NPXNOSAVE
on i386
In FreeBSD sys/elf_common.h redefines AT_UID and AT_GID on FreeBSD, we need
a hack in vnode.h similar to Linux. sys/simd.h needs to be included early.
In zfs_freebsd_copy_file_range() we pass a (size_t *)lenp to
zfs_clone_range() that expects a (uint64_t *)
Allow compiling armv6 world by limiting ARM macros in sha256_impl.c and
sha512_impl.c to __ARM_ARCH > 6
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Pawel Jakub Dawidek <[email protected]>
Reviewed-by: Signed-off-by: WHR <[email protected]>
Signed-off-by: Martin Matuska <[email protected]>
Closes #14674
|
|
|
|
|
|
|
|
|
|
|
| |
It was previously available only to FreeBSD.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14718
|
|
|
|
|
|
|
|
|
|
|
|
| |
It may happen that "wanted total ARC size" (wt) is negative, that was
expected. But multiplication product of it and unsigned fractions
result in unsigned value, incorrectly shifted right with a sing loss.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #14692
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Address the following bugs in persistent error log:
1) Check nested clones, eg "fs->snap->clone->snap2->clone2".
2) When deleting files containing error blocks in those clones (from
"clone" the example above), do not break the check chain.
3) When deleting files in the originating fs before syncing the errlog
to disk, do not break the check chain. This happens because at the
time of introducing the error block in the error list, we do not have
its birth txg and the head filesystem. If the original file is
deleted before the error list is synced to the error log (which is
when we actually lookup the birth txg and the head filesystem), then
we do not have access to this info anymore and break the check chain.
The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block).Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).
We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.
The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64
We also introduced two new tests covering the above changes.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14633
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Current autotrim causes short-lived txg through:
1. calling txg_wait_synced() in metaslab_enable()
2. calling txg_wait_open() with should_quiesce = true
This patch addresses all the issues mentioned above.
A new cv, vdev_autotrim_kick_cv is added to kick autotrim activity.
It will be signaled once a txg is synced so that it does not change
the original autotrim pace. Also because it is a cv, the wait is
interruptible which speeds up the vdev_autotrim_stop_wait() call.
Finally, combining big zfs_txg_timeout, txg_wait_open() also causes
delay when exporting a pool.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: jxdking <[email protected]>
Issue #8993
Closes #12194
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers. To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync. While not optimal this is always functionally correct.
This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #14512
Closes #14641
|
|
|
|
|
|
|
|
| |
This reverts commit 7d638df09be7482935bcf6ec8e4ea2ac8a8be1a8.
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Wilson <[email protected]>
Closes #14678
|
|
|
|
|
|
|
|
|
|
|
|
| |
zfsd fetches new pool configuration through ZFS_IOC_POOL_STATS but
it does not get updated nvlist configuration for spare vdev since
the configuration is read by spa_spares->sav_config. In this commit,
updating the vdev state for spare vdev that is consumed by zfsd on
spare disk hotplug.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #14653
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a window in the slog removal code where a panic loop could
ensue if the system crashes during that operation. The original design
of slog removal did not persisted any state because the removal happened
synchronously. This was changed by a later commit which persisted the
vdev_removing flag and exposed this bug. If a slog removal is in
progress and happens to crash after persisting the vdev_removing flag to
the label but before the vdev is removed from the spa config, then the
pool will continue to panic on import. Here's a sample of the panic:
[ 134.387411] VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size,
FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22)
[ 134.393865] PANIC at dmu.c:1135:dmu_write()
[ 134.396035] Kernel panic - not syncing: VERIFY0(0 ==
dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs,
&dbp)) failed (0 == 22)
[ 134.397857] CPU: 2 PID: 5914 Comm: txg_sync Kdump: loaded Tainted:
P OE 5.4.0-1100-dx2023020205-b3751f8c2-azure #106
[ 134.407938] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 134.407938] Call Trace:
[ 134.407938] dump_stack+0x57/0x6d
[ 134.407938] panic+0xfb/0x2d7
[ 134.407938] spl_panic+0xcf/0x102 [spl]
[ 134.407938] ? traverse_impl+0x1ca/0x420 [zfs]
[ 134.407938] ? dmu_object_alloc_impl+0x3b4/0x3c0 [zfs]
[ 134.407938] ? dnode_hold+0x1b/0x20 [zfs]
[ 134.407938] dmu_write+0xc3/0xd0 [zfs]
[ 134.407938] ? space_map_alloc+0x55/0x80 [zfs]
[ 134.407938] metaslab_sync+0x61a/0x830 [zfs]
[ 134.407938] ? queued_spin_unlock+0x9/0x10 [zfs]
[ 134.407938] vdev_sync+0x72/0x190 [zfs]
[ 134.407938] spa_sync_iterate_to_convergence+0x160/0x250 [zfs]
[ 134.407938] spa_sync+0x2f7/0x670 [zfs]
[ 134.407938] txg_sync_thread+0x22d/0x2d0 [zfs]
[ 134.407938] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 134.407938] thread_generic_wrapper+0x83/0xa0 [spl]
[ 134.407938] kthread+0x104/0x140
[ 134.407938] ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[ 134.407938] ? kthread_park+0x90/0x90
[ 134.457802] ret_from_fork+0x1f/0x40
This change no longer persists the vdev_removing flag when removing slog
devices and also cleans up some code that was added which is not used.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: George Wilson <[email protected]>
Closes #14652
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`. In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, does not include blocks that are before the resume point.
The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched. In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential. But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow. This can be
observed as `zpool get freeing` decreasing very slowly.
Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point. It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.
This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point. To do so, this function is made non-mutating
(the caller now zeros `td_resume`).
Note, this bug likely predates (was not introduced by) #11803.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #14603
|
|
|
|
|
|
|
|
|
|
|
|
| |
Undirty the dbuf and destroy its buffer when cloning into it.
Coverity ID: CID-1535375
Reported-by: Richard Yao
Reported-by: Benjamin Coddington
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #14655
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
031d7c2fe6afaa78943bd0a563b91fc84ace42d7 did not handle reverse
iteration, such that the original issue theoretically could still occur.
Note that contrary to the claim in the ZFS disk format specification
that a maximum of 6 levels are possible, 9 levels are possible with
recordsize=512 and and indirect block size of 16KB. In this unusual
configuration, span will be 65. The maximum size of span at 70 can be
reached at recordsize=16K and an indirect blocksize of 16KB.
When we are at this indirection level and are traversing backward, the
minimum value is start, but we cannot calculate that with 64-bit
arithmetic, so we avoid the calculation and instead rely on the earlier
statement that did `*offset = start;`.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Reported-by: Coverity (CID-1466214)
Closes #14618
|
|
|
|
|
|
|
|
| |
This commit removes the edonr_byteorder.h file and all unused
variants of Edon-R.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #13618
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After addressing coverity complaints involving `nvpair_name()`, the
compiler started complaining about dropping const. This lead to a rabbit
hole where not only `nvpair_name()` needed to be constified, but also
`nvpair_value_string()`, `fnvpair_value_string()` and a few other static
functions, plus variable pointers throughout the code. The result became
a fairly big change, so it has been split out into its own patch.
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14612
|
|
|
|
|
|
|
|
|
|
|
| |
Coverity reported a dereference after a NULL check in dbuf_verify(). If
`dn` is `NULL`, we can just assume that !dn->dn_free_txg, so we change
`!dn->dn_free_txg` to `(dn == NULL || !dn->dn_free_txg)`.
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Reported-by: Coverity (CID-992298)
Closes #14619
|
|
|
|
|
|
|
|
|
|
|
| |
The commit replaces all findings of the link:
http://www.opensolaris.org/os/licensing with this one:
https://opensource.org/licenses/CDDL-1.0
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: WHR <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #14625
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After 67a1b0379159c46bcd60a462a2790248046c8804 was merged, coverity
started complaining about an uninitialized scalar variable in
flush_write_batch_impl() due to the new field zp.zp_brtwrite. Upon
inspection, it appears that uninitialized memory was being copied for
non-raw streams, so this is a pre-existing issue. The addition of
zp_brtwrite by the block cloning commit caused Coverity to begin to
notice it.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Reported-by: Coverity (CID-1535378)
Closes #14607
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
da19d919a853ad05ef300fe000e6c96c4db84bcf changed this in a way that
permits execution to reach `if (err == 0)` without initializing err.
This could randomly cause the sync task to not execute. We fix that by
initializing err to zero.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Reported-by: Coverity (CID-1535377)
Closes #14607
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks. To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations". If they cause
txg_wait_synced often, it can be detrimental to performance.
Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them. In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.
However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily. For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write. The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk. They don't care about metadata
like the contents of the bonus or spill blocks. (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)
This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #13368
Issue #14594
Issue #14512
Issue #14009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).
The whole design of block cloning is documented in module/zfs/brt.c.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Christian Schwarz <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rich Ercolani <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #13392
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem occurs because dmu_recv_begin pulls in the payload and
next header from the input stream in order to use the contents of
the begin record's nvlist. However, the change to do that before the
other checks in dmu_recv_begin occur caused a regression where an
empty send stream in a recursive send could have its END record
consumed by this, which broke the logic of recv_skip. A test is
also included to protect against this case in the future.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #12661
Closes #14568
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The txg_sync thread will see certain buffers in a DR_IN_DMU_SYNC state
when ZIL is writing them out. Then it waits until the state changes, but
has an assertion to check that they were not DR_NOT_OVERRIDDEN. If the
data write failed with an error, ZIL will put it into the
DR_NOT_OVERRIDDEN state. It looks like the code will handle that state
without an issue, so we can just delete the assertion.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Sponsored-By: Wasabi Technology, Inc.
Closes #14283
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
63652e154643cfe596fe077c13de0e7be34dd863 added unnecessary branches in
`vdev_stat_update()` to suppress an ASAN false positive the breaks
ztest. This had the downside of causing false positive reports in both
Coverity and Clang's static analyzer. vd is never NULL, so we add a
preprocessor check to only apply the workaround when compiling with ASAN
support.
Reported-by: Coverity (CID-1524583)
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
| |
ae7e7006500ca37c471dd625cd5cbfc590b49885 added an assertion to suppress
a complaint from Clang's static analyzer. Unfortunately, it missed
another way for Clang to complain about this function. This adds another
assertion to handle that.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
| |
scan-build does not do cross translation unit analysis to realize that
`dmu_buf_hold()` will always set `bpo->bpo_cached_dbuf` to a non-NULL
pointer, so we add an assertion to make it realize this.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
| |
Clang's static analyzer reports that if we try to rename a root dataset
in `dsl_dir_rename_sync()`, we will have a NULL pointer passed to
strlcpy(). This is impossible because `dsl_dir_rename_check()` will
prevent us from doing this. We add an assertion to silence this warning.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CodeQL's cpp/constant-comparison query from its security-and-extended
query set reported 4 instances where we have comparions that always
evaluate the same way.
In `draid_config_by_type()`, we have an early `if (nparity == 0)` check
that returns `EINVAL`, making a later `if (nparity == 0 || nparity >
VDEV_DRAID_MAXPARITY)` partially redundant. The later check prints an
error message when parity is 0, but the early check does not. This is
not useful feedback, so we move the later check to the place where the
early check runs to replace the early check.
In `perform_thread_merge()`, we return when `num_threads == 0`. After
that block, we do `if (num_threads > 0) {`, which will always be true.
We remove the `if` statement.
In `sa_modify_attrs()`, we have a loop condition that is `k != 2`, but
at the end of the loop, we have `if (k == 0 && hdl->sa_spill)` followed
by an else that does a break. The result is that k != 2 will never be
evaluated when it is false. We drop the comparison.
In `zap_leaf_array_read()`, we have a for loop condition that is `i <
ZAP_LEAF_ARRAY_BYTES && len > 0`. However, that loop itself is in a loop
that is `while (len > 0)` and while the value of len is decremented
inside the loop, when `len == 0`, it will return, such that `len > 0`
inside the loop condition will always be true. We drop that part of the
condition.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang's static analyzer reports that if a `blkid == DMU_SPILL_BLKID` is
passed, then we can have a NULL pointer dereference when either
->dn_have_spill or `DNODE_FLAG_SPILL_BLKPTR` is not set. This should not
happen. We add an `ASSERT()` to suppress reports about NULL pointer
dereferences.
Originally, I wanted to use one or two IMPLY statements on
pre-conditions before the call to `dbuf_findbp()`, but Clang's static
analyzer did not understand it.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang's static analyzer pointed out that we can have a NULL pointer
dereference if we ever attempt to split a vdev that has only 1 child. If
that happens, we are left with zero children, but then try to access a
non-existent child. Calling vdev_split() on a vdev with only 1 child
should be impossible due to how the code is structured. If this ever
happens, it would be best to stop execution immediately even in a
production environment to allow for the best possible chance of recovery
by an expert, so we use `VERIFY3U()` instead of `ASSERT3U()`.
Unfortunately, while that defensive assertion will prevent execution
from ever reaching the NULL pointer dereference, Clang's static analyzer
does not realize that, so we add an `ASSERT()` to inform it of this.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
| |
Clang's static analyzer reports a possible NULL pointer dereference in
abd_get_size() when called from vdev_draid_map_alloc_write() called from
vdev_draid_map_alloc_row() and vdc->vdc_nparity == 0. This should be
impossible, so we add an assertion to silence the defect report.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14575
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #14359
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise, we can get a deadlock that looks like this:
1. fsync() grabs spa_config_enter(zilog->zl_spa, SCL_STATE, lwb,
RW_READER) as part of zil_lwb_write_issue() . It then blocks on the
txg_sync when a flush fails from a drive power cycling.
2. The txg_sync then blocks on the pool suspending due to the loss of
too many disks.
3. zpool clear then blocks on spa_config_enter(spa, SCL_STATE |
SCL_L2ARC | SCL_ZIO, spa, RW_WRITER) because it is a writer.
The disks cannot be brought online due to fsync() holding that lock and
the user gets upset since fsync() is uninterruptibly blocked inside the
kernel.
We need to grab the lock for vdev_lookup_top(), but we do not need to
hold it while there is outstanding IO.
This fixes a regression introduced by
1ce23dcaff6c3d777cb0d9a4a2cf02b43f777d78.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Sponsored-By: Wasabi Technology, Inc.
Closes #14519
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The intent is that this is like ENOTSUP, but specifically for when
something can't be done because we have no support for the requested
crypto parameters; eg unlocking a dataset or receiving a stream
encrypted with a suite we don't support.
Its not intended to be recoverable without upgrading ZFS itself.
If the request could be made to work by enabling a feature or modifying
some other configuration item, then some other code should be used.
load-key: In the future we might have more crypto suites (ie new values
for the `encryption` property. Right now trying to load a key on such
a future crypto suite will look up suite parameters off the end of the
crypto table, resulting in misbehaviour and/or crashes (or, with debug
enabled, trip the assertion in `zio_crypt_key_unwrap`).
Instead, lets check the value we got from the dataset, and if we can't
handle it, abort early.
recv: When receiving a raw stream encrypted with an unknown crypto
suite, `zfs recv` would report a generic `invalid backup stream`
(EINVAL). While technically correct, its not super helpful, so lets
ship a more specific error code and message.
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #14577
|
|
|
|
|
|
|
|
|
|
| |
by placing the most common use case (no special vdevs) first and avoid
allocating new variables.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14494
Closes #14563
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dc5c8006f684b1df3f2d4b6b8c121447d2db0017 was recently merged to prefetch
up to 128 deadlists. Unfortunately, a loop was missing an increment,
such that it will prefetch all deadlists. The performance properties of
that patch probably should be re-evaluated.
This was caught by CodeQL's cpp/constant-comparison check in an
experimental branch where I am testing the security-and-extended
queries. It complained about the `i < 128` part of the loop condition
always evaluating to the same thing. The standard CodeQL configuration
we use missed this because it does not include that check.
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14573
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The skeleton file module/icp/include/generic_impl.c can be used for
iterating over different implementations of algorithms.
It is used by SHA256, SHA512 and BLAKE3 currently.
The Solaris SHA2 implementation got replaced with a version which is
based on public domain code of cppcrypto v0.10.
These assembly files are taken from current openssl master:
- sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
- sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
- sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
- sha512-armv7.S: ARMv7, NEON (arm)
- sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
- sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
- sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
- sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
Tested-by: Rich Ercolani <[email protected]>
Tested-by: Sebastian Gottschall <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #13741
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With some pathological access patterns it is possible to make ZFS
accumulate almost unlimited amount of speculative prefetch ZIOs.
Combined with linear ABD allocations in RAIDZ code, it appears to
be possible to exhaust system KVA, triggering kernel panic.
Address this by introducing a system-wide counter of active prefetch
requests and blocking prefetch distance doubling per stream hits if
the number of active requests is higher that ~6% of ARC size.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #14516
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
openzfsonwindows/openzfs#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.
On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL. However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (#11097), so we can safely
disregard that issue for now.
Reported-by: Arun KV <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14514
|