| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
With the latest L2ARC fixes, 2 seconds is too long to wait for
quiescence of arcstats like l2_size. Shorten this interval to avoid
having the persistent L2ARC tests in ZTS prematurely terminated.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14981
|
|
|
|
|
|
|
|
| |
On FreeBSD 14 this test runs slowly in the CI environment
and is killed by the 10 minute timeout. Skip the test on
FreeBSD until the slow down is resolved.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #14961
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If this is not done, and the pool has an ashift other than the default
(at the moment 9) then the following happens:
1) vdev_alloc() assigns the ashift of the pool to L2ARC device, but
upon export it is not stored anywhere
2) at the first import, vdev_open() sees an vdev_ashift() of 0 and
assigns the logical_ashift, which is 9
3) reading the contents of L2ARC, including the header fails
4) L2ARC buffers are not restored in ARC.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14313
Closes #14963
|
|
|
|
|
|
|
|
| |
Until the ASSERT which is occasionally hit while running
checkpoint_discard_busy is resolved skip this test case.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #12053
Closes #14952
|
|
|
|
|
|
|
|
|
|
|
| |
This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.
Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: WHR <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #14642
|
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Felix Dörre <[email protected]>
Signed-off-by: Val Packett <[email protected]>
Closes #14834
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's usually no requirement that a user be logged in for changing
their password, so let's not be surprising here.
We need to use the fetch_lazy mechanism for the old password to avoid
a double prompt for it, so that mechanism is now generalized a bit.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Felix Dörre <[email protected]>
Signed-off-by: Val Packett <[email protected]>
Closes #14834
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's not always desirable to have a fixed flat homes directory.
With the 'recursive_homes' flag, 'prop_mountpoint' search would
traverse the whole tree starting at 'homes' (which can now be '*'
to mean all pools) to find a dataset with a mountpoint matching
the home directory.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Felix Dörre <[email protected]>
Signed-off-by: Val Packett <[email protected]>
Closes #14834
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted
kernels. This issue is being actively worked in #14872 and as part of that
fix this commit will be reverted.
VERIFY(zh->zh_claim_txg == 0) failed
PANIC at zil.c:904:zil_create()
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #14872
Closes #14870
|
|
|
|
|
|
|
| |
The zpool_resilver_concurrent test case requires the ZED which is not used
on FreeBSD. Add this test to the known list of skipped tested for FreeBSD.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #14904
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improve with Clang and 8%
improvement with GCC.
Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.
Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14866
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In addition to a number of actual log bytes written, account also a
total written bytes including padding and total allocated bytes (bytes
<= write <= alloc). It should allow to monitor zil traffic and space
efficiency.
Add dtrace probe for zil block size selection.
Make zilstat report more information and fit it into less width.
Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #14863
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For draid vdevs it was possible to initiate both the
sequential and healing resilver at same time.
This fixes the following two scenarios.
1) There's a window where a sequential rebuild can
be started via ZED even if a healing resilver has been
scheduled.
- This is fixed by adding additional check in
spa_vdev_attach() for any scheduled resilver and return
appropriate error code when a resilver is already in
progress.
2) It was possible for zpool clear to start a healing
resilver when it wasn't needed at all. This occurs because
during a vdev_open() the device is presumed to be healthy not
until the device is validated by vdev_validate() and it's set
unavailable. However, by this point an async resilver will
have already been requested if the DTL isn't empty.
- This is fixed by cancelling the SPA_ASYNC_RESILVER
request immediately at the end of vdev_reopen() when a resilver
is unneeded.
Finally, added a testcase in ZTS for verification.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Dipak Ghosh <[email protected]>
Signed-off-by: Akash B <[email protected]>
Closes #14881
Closes #14892
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Co-authored-by: TulsiJain [email protected]
Signed-off-by: George Amanakis <[email protected]>
Closes #8995
Closes #12355
|
|
|
|
|
|
|
|
|
|
|
|
| |
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.
So let's add zpool initialize -u to clear it.
Co-authored-by: Rich Ercolani <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #12451
Closes #14873
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
test-runner.py orchestrates all of the ZTS executions. The `Cmd` object
manages these process, and its `run` method specifically invokes these
possibly long-running processes, possibly retrying in the event of a
timeout. Since its inception, memory leak detection using the kmemleak
infrastructure [1], and kernel logging [2] have been added to this run
mechanism.
However, the callback to cull a process beyond its timeout threshold,
`kill_cmd`, has evaded modernization by both of these changes. As a
result, this function fails to properly invoke `run`, leading to an
untrapped exception and unreported test failure.
This patch extends `kill_cmd` to receive these kernel devices through
the `options` parameter, and regularizes all the `.run` calls from
`Cmd`, and its subclasses, to accept that parameter.
[1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
[2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba
Reviewed-by: John Wren Kennedy <[email protected]>
Signed-off-by: Antonio Russo <[email protected]>
Closes #14849
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the special_small_blocks property is being set during a pool
create it enforces a limit of 128KiB even if the pool's record size
is larger.
If the recordsize property is being set during a pool create, then
use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #13815
Closes #14811
|
|
|
|
|
|
|
|
|
| |
The auto_replace_001_pos test case does not reliably pass on
Fedora 37 and newer. Until the test case can be updated to make
it reliable add it to the list of "maybe" exceptions on Linux.
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #14851
Closes #14852
|
|
|
|
|
|
|
|
|
| |
Reduced the timeout to 60 seconds which should be more than
sufficient and allow the test to be marked as FAILED rather
than KILLED. Also dump the pool status on cleanup.
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #14829
|
|
|
|
|
|
|
|
|
|
| |
l2arc_evict() performs the adjustment of the size of buffers to be
written on L2ARC unnecessarily. l2arc_write_size() is called right
before l2arc_evict() and performs those adjustments.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14828
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.
Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.
To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.
Before this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: Permanent errors have been detected in the following files:
=============================
After this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: No known data errors
=============================
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14813
|
|
|
|
|
|
|
|
|
|
| |
For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
dsl_dataset_hold_obj() in order to enable access to the encryption keys
(if loaded). This enables reporting of errors in encrypted filesystems
which are not mounted but have their keys loaded.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14837
|
|
|
|
|
|
|
|
|
| |
Add snapshot_002_pos to the known list of occasional failures
for FreeBSD until it can be made entirely reliable.
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #14831
Closes #14832
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #14805
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spa_import() relies on a pool config fetched by spa_try_import() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device and missing
retrieving the cache device and spare eventually. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #14794
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit expands on the zhack label repair command in d04b5c9 by
adding the -u option to undetach a device by regenerating uberblocks,
in addition to the existing functionality of fixing checksums, now
represented by -c. Previous behavior is retained in the case of no
options.
The changes are heavily inspired by Jeff Bonwick's labelfix
utility, as archived at:
https://gist.github.com/jjwhitney/baaa63144da89726e482
Additionally, it is now capable of properly determining the size of
block devices and other media, as well as handling sizes which are
not divisible by 2^18. This should make it viable for use on physical
devices and partitions, in addition to files.
These changes should make it possible to import zpools that have had
their uberblocks erased, such as in the case of pools rendered
inaccessible by erroneous detach commands.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: buzzingwires <[email protected]>
Closes #14773
|
|
|
|
|
|
|
|
|
| |
On FreeBSD, `wc` prints some leading spaces, while on Linux it does not.
So we tell ksh to expect an integer, and it does the rest.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #14791
Closes #14797
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Usage:
zpool set org.freebsd:comment="this is my pool" poolname
Tests are based on zfs_set's user property tests.
Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680
|
|
|
|
|
|
|
|
|
|
|
| |
Retry the export if the pool is busy due to an open zvol.
Observed in the CI on Fedora 37.
cannot export 'testpool': pool is busy
ERROR: zpool export testpool exited 1
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #14769
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.
Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2
Notably, this allows to get/set properties on the root vdev:
% zpool set user:prop=value <pool> root-0
Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
% zpool set user:prop=value <pool> mirror-0
This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.
The following tests have been added:
- zpool get all properties from root vdev
- zpool set a property on root vdev
- verify root vdev ZAP is created
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Wing <[email protected]>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405
|
|
|
|
|
|
|
|
|
|
|
| |
We use block_device_wait to wait for the zvol block device to
actually appear, and we log the result of the dd calls by using
an intermediate file.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: John Wren Kennedy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #14767
|
|
|
|
|
|
|
|
|
| |
Some test cases were committed to the repository but never added to
runfiles.
Move `zfs_unshare_008_pos` to the Linux-only runfile.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: szubersk <[email protected]>
Closes #14701
|
|
|
|
|
|
|
|
|
|
| |
Make the test runner try to use the included python monotonic time
function instead of calling librt.
This makes the test runner work on macos where librt wasn't available.
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Andrew Innes <[email protected]>
Closes #14700
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Address the following bugs in persistent error log:
1) Check nested clones, eg "fs->snap->clone->snap2->clone2".
2) When deleting files containing error blocks in those clones (from
"clone" the example above), do not break the check chain.
3) When deleting files in the originating fs before syncing the errlog
to disk, do not break the check chain. This happens because at the
time of introducing the error block in the error list, we do not have
its birth txg and the head filesystem. If the original file is
deleted before the error list is synced to the error log (which is
when we actually lookup the birth txg and the head filesystem), then
we do not have access to this info anymore and break the check chain.
The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block).Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).
We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.
The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64
We also introduced two new tests covering the above changes.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #14633
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit changes the workflow of the github actions.
We split the workflow into different parts:
1) build zfs modules for Ubuntu 20.04 and 22.04 (~25m)
2) 2x zloop test (~10m) + 2x sanity test (~25m)
3) functional testings in parts 1..5 (each ~1h)
- these could be triggered, when sanity tests are ok
- currently I just start them all in the same time
4) cleanup and create summary
When everything is fine, the full run with all testings
should be done in around 2 hours.
The codeql.yml and checkstyle.yml are not part in this circle.
The testings are also modified a bit:
- report info about CPU and checksum benchmarks
- reset the debugging logs for each test
- when some error occurred, we call dmesg with -c to get
only the log output for the last failed test
- we empty also the dbgsys
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #14078
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix the manpage. The "SYNOPSIS" section is incorrectly formatted for
receive -c. I also took this opportunity to reword some parts and
fix a run-on sentence in the manpage.
Add large block testing for corrective recv. This adds a new test
that makes sure blocks generated using zfs send -L/--large-block
large-block send flag are able to be used for healing.
Since with unloaded key and errlog feature enabled corruption is not
shown in zpool status #13675 is fixed the zfs_receive_corrective.ksh
test no longer sets -o feature@head_errlog=disabled on pool creation
so that it can also test for regressions related to head_errlog feature.
Note that the zfs_receive_compressed_corrective.ksh and
zfs_receive_large_block_corrective.ksh tests are still creating pools
with -o feature@head_errlog=disabled.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #14615
|
|
|
|
|
|
|
|
| |
This commit removes the edonr_byteorder.h file and all unused
variants of Edon-R.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #13618
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After addressing coverity complaints involving `nvpair_name()`, the
compiler started complaining about dropping const. This lead to a rabbit
hole where not only `nvpair_name()` needed to be constified, but also
`nvpair_value_string()`, `fnvpair_value_string()` and a few other static
functions, plus variable pointers throughout the code. The result became
a fairly big change, so it has been split out into its own patch.
Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14612
|
|
|
|
|
|
|
|
|
|
|
| |
The commit replaces all findings of the link:
http://www.opensolaris.org/os/licensing with this one:
https://opensource.org/licenses/CDDL-1.0
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: WHR <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #14625
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).
The whole design of block cloning is documented in module/zfs/brt.c.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Christian Schwarz <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rich Ercolani <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #13392
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem occurs because dmu_recv_begin pulls in the payload and
next header from the input stream in order to use the contents of
the begin record's nvlist. However, the change to do that before the
other checks in dmu_recv_begin occur caused a regression where an
empty send stream in a recursive send could have its END record
consumed by this, which broke the logic of recv_skip. A test is
also included to protect against this case in the future.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #12661
Closes #14568
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #14359
|
|
|
|
|
|
|
|
|
|
|
| |
This commit changes the BLAKE3 implementation handling and
also the calls to it from the ztest command.
Tested-by: Rich Ercolani <[email protected]>
Tested-by: Sebastian Gottschall <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #13741
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The skeleton file module/icp/include/generic_impl.c can be used for
iterating over different implementations of algorithms.
It is used by SHA256, SHA512 and BLAKE3 currently.
The Solaris SHA2 implementation got replaced with a version which is
based on public domain code of cppcrypto v0.10.
These assembly files are taken from current openssl master:
- sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
- sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
- sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
- sha512-armv7.S: ARMv7, NEON (arm)
- sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
- sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
- sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
- sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
Tested-by: Rich Ercolani <[email protected]>
Tested-by: Sebastian Gottschall <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #13741
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The approach is straightforward: for dataset ops, if a key was offered,
find the encryption root and the various encryption parameters, derive a
wrapping key if necessary, and then unlock the encryption root. After
that all the regular dataset ops will return unencrypted data, and
that's kinda the whole thing.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #11551
Closes #12707
Closes #14503
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- The migration_012_pos.ksh test case was failing because of a
missing space after `log_must`.
- None of the tests listed in the runfiles should include the .ksh
suffix.
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #14515
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a page is faulted in for memory mapped I/O the page lock
may be dropped before it has been read and marked up to date.
If a buffered read encounters such a page in mappedread() it
must wait until the page has been updated. Failure to do so
will result in a panic on debug builds and incorrect data on
production builds.
The critical part of this change is in mappedread() where pages
which are not up to date are now handled. Additionally, it
includes the following simplifications.
- zfs_getpage() and zfs_fillpage() could be passed an array of
pages. This could be more efficient if it was used but in
practice only a single page was ever provided. These
interfaces were simplified to acknowledge that.
- update_pages() was modified to correctly set the PG_error bit
on a page when it cannot be read by dmu_read().
- Setting PG_error and PG_uptodate was moved to zfs_fillpage()
from zpl_readpage_common(). This is consistent with the
handling in update_pages() and mappedread().
- Minor additional refactoring to comments and variable
declarations to improve readability.
- Add a test case to exercise concurrent buffered, direct,
and mmap IO to the same file.
- Reduce the mmap_sync test case default run time.
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13608
Closes #14498
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Encrypted blocks can not have 3 DVAs, because they use the space of the
3rd DVA for the IV+salt. zio_write_gang_block() takes this into
account, setting `gbh_copies` to no more than 2 in this case. Gang
members BP's do not have the X (encrypted) bit set (nor do they have the
DMU level and type fields set), because encryption is not handled at
this level. The gang block is reassembled, and then encryption (and
compression) are handled.
To check if this gang block is encrypted, the code in
zio_write_gang_block() checks `pio->io_bp`. This is normally fine,
because the block that's being ganged is typically the encrypted BP.
The problem is that if there is "recursive ganging", where a gang member
is itself a gang block, then when zio_write_gang_block() is called to
create a gang block for a gang member, `pio->io_bp` is the gang member's
BP, which doesn't have the X bit set, so the number of DVA's is not
restricted to 2. It should instead be looking at the the "gang leader",
i.e. the top-level gang block, to determine how many DVA's can be used,
to avoid a "NDVA's inversion" (where a child has more DVA's than its
parent).
gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's
space:
```
DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=...
[L0 ZFS plain file] fletcher4 uncompressed encrypted LE
gang unique double size=100000L/100000P birth=... fill=1 cksum=...
```
leader's GBH contains a BP with gang bit set and 3 DVA's:
```
DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
contiguous unique double size=55600L/55600P birth=... fill=0 cksum=...
DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
contiguous unique double size=55600L/55600P birth=... fill=0 cksum=...
DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200>
[L0 unallocated] fletcher4 uncompressed unencrypted LE
gang unique double size=55400L/55400P birth=... fill=0 cksum=...
```
On nondebug bits, having the 3rd DVA in the gang block works for the
most part, because it's true that all 3 DVA's are available in the gang
member BP (in the GBH). However, for accounting purposes, gang block
DVA's ASIZE include all the space allocated below them, i.e. the
512-byte gang block header (GBH) as well as the gang members below that.
We see that above where the gang leader BP is 1MB logical (and after
compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB)
more than 1MB (0x`100400`).
Since thre are 3 copies of a block below it, we increment the ATIME of
the 3rd DVA of the gang leader by the space used by the 3rd DVA of the
child (1 sector, in this case). But there isn't really a 3rd DVA of the
parent; the salt is stored in place of the 3rd DVA's ASIZE.
So when zio_write_gang_member_ready() increments the parent's BP's
`DVA[2]`'s ASIZE, it's actually incrementing the parent's salt. When we
later try to read the encrypted recursively-ganged block, the salt
doesn't match what we used to write it, so MAC verification fails and we
get an EIO.
```
zio_encrypt(): encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89
zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89
```
This commit addresses the problem by not increasing the number of copies
of the GBH beyond 2 (even for non-encrypted blocks). This simplifies
the logic while maintaining the ability to traverse all metadata
(including gang blocks) even if one copy is lost. (Note that 3 copies
of the GBH will still be created if requested, e.g. for `copies=3` or
MOS blocks.) Additionally, the code that increments the parent's DVA's
ASIZE is made to check the parent DVA's NDVAS even on nondebug bits. So
if there's a similar bug in the future, it will cause a panic when
trying to write, rather than corrupting the parent BP and causing an
error when reading.
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Caused-by: #14356
Closes #14440
Closes #14413
|
|
|
|
|
|
| |
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #14450
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we receive a DRR_FREEOBJECTS as the first entry in an object range,
this might end up producing a hole if the freed objects were the
only existing objects in the block.
If the txg starts syncing before we've processed any following
DRR_OBJECT records, this leads to a possible race where the backing
arc_buf_t gets its psize set to 0 in the arc_write_ready() callback
while still being referenced from a dirty record in the open txg.
To prevent this, we insert a txg_wait_synced call if the first
record in the range was a DRR_FREEOBJECTS that actually
resulted in one or more freed objects.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: David Hedberg <[email protected]>
Sponsored by: Findity AB
Closes #11893
Closes #14358
|