path: root/tests
* Shorten arcstat_quiescence sleep time (George Amanakis, 2023-06-15, 1 file, -1/+1)
  With the latest L2ARC fixes, 2 seconds is too long to wait for quiescence of arcstats like l2_size. Shorten this interval to avoid having the persistent L2ARC tests in ZTS prematurely terminated.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14981

* ZTS: Skip send_raw_ashift on FreeBSD (Brian Behlendorf, 2023-06-14, 2 files, -0/+5)
  On FreeBSD 14 this test runs slowly in the CI environment and is killed by the 10 minute timeout. Skip the test on FreeBSD until the slowdown is resolved.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #14961

* Store the L2ARC device ashift in the vdev label (George Amanakis, 2023-06-14, 1 file, -11/+8)
  If this is not done, and the pool has an ashift other than the default (at the moment 9), then the following happens:
  1) vdev_alloc() assigns the ashift of the pool to the L2ARC device, but upon export it is not stored anywhere.
  2) At the first import, vdev_open() sees a vdev_ashift() of 0 and assigns the logical_ashift, which is 9.
  3) Reading the contents of the L2ARC, including the header, fails.
  4) L2ARC buffers are not restored in ARC.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14313
  Closes #14963

* ZTS: Skip checkpoint_discard_busy (Brian Behlendorf, 2023-06-09, 2 files, -0/+3)
  Until the ASSERT which is occasionally hit while running checkpoint_discard_busy is resolved, skip this test case.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #12053
  Closes #14952

* zdb: add -B option to generate backup stream (Rob Norris, 2023-06-05, 3 files, -1/+57)
  This is more-or-less like `zfs send`, but specifying the snapshot by its objset id for situations where it can't be referenced any other way.

  Sponsored-By: Klara, Inc.
  Reviewed-by: Tino Reichardt <[email protected]>
  Reviewed-by: WHR <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #14642

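  As a rough usage sketch (not part of the commit; the pool name and objset id below are hypothetical, and the id would normally be looked up with `zdb -d`), the generated stream can be piped straight into `zfs receive`:

  ```
  # Stream objset 54 out of 'tank' and restore it as a new dataset.
  zdb -B tank/54 | zfs receive tank/recovered
  ```
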
* PAM: enable testing on FreeBSD (Val Packett, 2023-05-31, 1 file, -0/+5)
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Felix Dörre <[email protected]>
  Signed-off-by: Val Packett <[email protected]>
  Closes #14834

* PAM: support password changes even when not mounted (Val Packett, 2023-05-31, 3 files, -2/+58)
  There's usually no requirement that a user be logged in for changing their password, so let's not be surprising here. We need to use the fetch_lazy mechanism for the old password to avoid a double prompt for it, so that mechanism is now generalized a bit.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Felix Dörre <[email protected]>
  Signed-off-by: Val Packett <[email protected]>
  Closes #14834

* PAM: add 'recursive_homes' flag to use with 'prop_mountpoint' (Val Packett, 2023-05-31, 3 files, -1/+74)
  It's not always desirable to have a fixed flat homes directory. With the 'recursive_homes' flag, 'prop_mountpoint' search would traverse the whole tree starting at 'homes' (which can now be '*' to mean all pools) to find a dataset with a mountpoint matching the home directory.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Felix Dörre <[email protected]>
  Signed-off-by: Val Packett <[email protected]>
  Closes #14834

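  For illustration only, a hypothetical configuration line using the new flag might look like the sketch below (the pam_zfs_key module name and the pam.d service stanza are assumptions; homes=*, prop_mountpoint and recursive_homes are the arguments described above):

  ```
  # Hypothetical pam.d entry: search every pool ('homes=*') and walk each
  # dataset tree for a mountpoint matching the user's home directory.
  session optional pam_zfs_key.so homes=* prop_mountpoint recursive_homes
  ```
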
* ZTS: zvol_misc_trim disable blk mq (Brian Behlendorf, 2023-05-29, 3 files, -1/+20)
  Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted kernels. This issue is being actively worked in #14872 and as part of that fix this commit will be reverted.

    VERIFY(zh->zh_claim_txg == 0) failed
    PANIC at zil.c:904:zil_create()

  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #14872
  Closes #14870

* ZTS: Add zpool_resilver_concurrent exception (Brian Behlendorf, 2023-05-26, 1 file, -0/+2)
  The zpool_resilver_concurrent test case requires the ZED, which is not used on FreeBSD. Add this test to the known list of skipped tests for FreeBSD.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #14904

* btree: Implement faster binary search algorithm (Richard Yao, 2023-05-26, 1 file, -1/+1)
  This implements a binary search algorithm for B-Trees that reduces branching to the absolute minimum necessary for a binary search algorithm. It also enables the compiler to inline the comparator to ensure that the only slowdown when doing binary search is from waiting for memory accesses. Additionally, it instructs the compiler to unroll the loop, which gives an additional 40% improvement with Clang and 8% improvement with GCC.

  Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm.

  Micro-benchmarks suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #14866

* zil: Add some more statistics. (Alexander Motin, 2023-05-25, 1 file, -1/+1)
  In addition to the number of actual log bytes written, also account the total bytes written including padding and the total bytes allocated (bytes <= write <= alloc). This allows monitoring ZIL traffic and space efficiency.

  Add a dtrace probe for ZIL block size selection.

  Make zilstat report more information and fit it into less width.

  Reviewed-by: Ameer Hamza <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #14863

* Fix concurrent resilvers initiated at same time (Akash B, 2023-05-24, 3 files, -1/+104)
  For draid vdevs it was possible to initiate both the sequential and healing resilver at the same time. This fixes the following two scenarios.
  1) There's a window where a sequential rebuild can be started via ZED even if a healing resilver has been scheduled.
     - This is fixed by adding an additional check in spa_vdev_attach() for any scheduled resilver and returning the appropriate error code when a resilver is already in progress.
  2) It was possible for zpool clear to start a healing resilver when it wasn't needed at all. This occurs because during a vdev_open() the device is presumed to be healthy; it is not set unavailable until it is validated by vdev_validate(). However, by this point an async resilver will have already been requested if the DTL isn't empty.
     - This is fixed by cancelling the SPA_ASYNC_RESILVER request immediately at the end of vdev_reopen() when a resilver is unneeded.
  Finally, added a testcase in ZTS for verification.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Dipak Ghosh <[email protected]>
  Signed-off-by: Akash B <[email protected]>
  Closes #14881
  Closes #14892

* Teach zpool scrub to scrub only blocks in error log (George Amanakis, 2023-05-18, 8 files, -1/+381)
  Added a flag '-e' to zpool scrub to scrub only blocks in the error log. A user can pause, resume and cancel the error scrub by passing the additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. Error scrubbing is executed across multiple txgs to make sure pool performance is not affected.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Co-authored-by: TulsiJain [email protected]
  Signed-off-by: George Amanakis <[email protected]>
  Closes #8995
  Closes #12355

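  A hedged sketch of the resulting commands (the pool name is hypothetical; pausing, resuming and cancelling are assumed to work just like a regular scrub, as described above):

  ```
  zpool scrub -e tank   # scrub only the blocks recorded in the error log
  zpool scrub -p tank   # pause it, as with a regular scrub
  zpool scrub -e tank   # resume the error scrub
  zpool scrub -s tank   # cancel it
  ```
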
* Add the ability to uninitialize (Brian Behlendorf, 2023-05-18, 3 files, -0/+143)
  zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it.

  Co-authored-by: Rich Ercolani <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rich Ercolani <[email protected]>
  Closes #12451
  Closes #14873

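  A minimal usage sketch (pool name hypothetical):

  ```
  zpool initialize tank      # touch every free byte once
  zpool initialize -u tank   # clear the initialized state so it can be run again
  zpool initialize tank      # do it again
  ```
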
* test-runner: pass kmemleak and kmsg to Cmd.run (Antonio Russo, 2023-05-15, 1 file, -16/+22)
  test-runner.py orchestrates all of the ZTS executions. The `Cmd` object manages these processes, and its `run` method specifically invokes these possibly long-running processes, possibly retrying in the event of a timeout.

  Since its inception, memory leak detection using the kmemleak infrastructure [1], and kernel logging [2], have been added to this run mechanism. However, the callback to cull a process beyond its timeout threshold, `kill_cmd`, has evaded modernization by both of these changes. As a result, this function fails to properly invoke `run`, leading to an untrapped exception and unreported test failure.

  This patch extends `kill_cmd` to receive these kernel devices through the `options` parameter, and regularizes all the `.run` calls from `Cmd`, and its subclasses, to accept that parameter.

  [1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
  [2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba

  Reviewed-by: John Wren Kennedy <[email protected]>
  Signed-off-by: Antonio Russo <[email protected]>
  Closes #14849

* Refine special_small_blocks property validation (Don Brady, 2023-05-12, 4 files, -1/+86)
  When the special_small_blocks property is being set during a pool create it enforces a limit of 128KiB even if the pool's record size is larger. If the recordsize property is being set during a pool create, then use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #13815
  Closes #14811

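  A sketch of the pool-create case this targets (device names hypothetical): with recordsize=1M requested at creation time, special_small_blocks can now be validated against that value rather than the 128KiB default cap.

  ```
  zpool create -O recordsize=1M -O special_small_blocks=1M \
      tank raidz1 sda sdb sdc special mirror nvme0n1 nvme1n1
  ```
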
* ZTS: Add auto_replace_001_pos to exceptions (Brian Behlendorf, 2023-05-12, 1 file, -0/+1)
  The auto_replace_001_pos test case does not reliably pass on Fedora 37 and newer. Until the test case can be updated to make it reliable add it to the list of "maybe" exceptions on Linux.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #14851
  Closes #14852

* Debug auto_replace_001_pos failures (Brian Behlendorf, 2023-05-09, 2 files, -5/+14)
  Reduced the timeout to 60 seconds which should be more than sufficient and allow the test to be marked as FAILED rather than KILLED. Also dump the pool status on cleanup.

  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #14829

* Remove duplicate code in l2arc_evict() (George Amanakis, 2023-05-09, 1 file, -1/+1)
  l2arc_evict() unnecessarily performs the adjustment of the size of buffers to be written to L2ARC; l2arc_write_size() is called right before l2arc_evict() and already performs those adjustments.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14828

* Enable the head_errlog feature to remove errors (George Amanakis, 2023-05-09, 1 file, -0/+13)
  In case check_filesystem() does not error out and does not report an error, remove that error block from error lists and logs without requiring a scrub. This can happen when the original file and all snapshots/clones referencing it have been removed. Otherwise zpool status will still report that "Permanent errors have been detected..." without actually reporting any of them. To implement this change the functions introduced in corrective receive were modified to take into account the head_errlog feature.

  Before this change:
  =============================
    pool: test
   state: ONLINE
  status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
  action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  config:
          NAME               STATE  READ WRITE CKSUM
          test               ONLINE    0     0     0
          /home/user/vdev_a  ONLINE    0     0     2
  errors: Permanent errors have been detected in the following files:
  =============================

  After this change:
  =============================
    pool: test
   state: ONLINE
  status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
  action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  config:
          NAME               STATE  READ WRITE CKSUM
          test               ONLINE    0     0     0
          /home/user/vdev_a  ONLINE    0     0     2
  errors: No known data errors
  =============================

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14813

* Fixes in head_errlog feature with encryption (George Amanakis, 2023-05-08, 2 files, -4/+21)
  For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of dsl_dataset_hold_obj() in order to enable access to the encryption keys (if loaded). This enables reporting of errors in encrypted filesystems which are not mounted but have their keys loaded.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14837

* ZTS: add snapshot/snapshot_002_pos exception (Brian Behlendorf, 2023-05-08, 1 file, -0/+1)
  Add snapshot_002_pos to the known list of occasional failures for FreeBSD until it can be made entirely reliable.

  Reviewed-by: Tino Reichardt <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #14831
  Closes #14832

* Simplify and optimize random_int_between(). (Pawel Jakub Dawidek, 2023-05-05, 1 file, -9/+2)
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Pawel Jakub Dawidek <[email protected]>
  Closes #14805

* zpool import -m also removing spare and cache when log device is missing (Ameer Hamza, 2023-05-03, 3 files, -1/+77)
  spa_import() relies on a pool config fetched by spa_tryimport() for spare/cache devices. Import flags are not passed to spa_tryimport(), which makes it return early due to a missing log device, so the cache and spare devices are never retrieved. Passing ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct configuration regardless of the missing log device.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #14794

* Allow zhack label repair to restore detached devices. (buzzingwires, 2023-05-03, 8 files, -66/+492)
  This commit expands on the zhack label repair command in d04b5c9 by adding the -u option to undetach a device by regenerating uberblocks, in addition to the existing functionality of fixing checksums, now represented by -c. Previous behavior is retained in the case of no options.

  The changes are heavily inspired by Jeff Bonwick's labelfix utility, as archived at: https://gist.github.com/jjwhitney/baaa63144da89726e482

  Additionally, it is now capable of properly determining the size of block devices and other media, as well as handling sizes which are not divisible by 2^18. This should make it viable for use on physical devices and partitions, in addition to files.

  These changes should make it possible to import zpools that have had their uberblocks erased, such as in the case of pools rendered inaccessible by erroneous detach commands.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: buzzingwires <[email protected]>
  Closes #14773

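  A hedged sketch of the expanded command (device path hypothetical):

  ```
  zhack label repair /dev/sdb1      # no options: previous behavior
  zhack label repair -c /dev/sdb1   # fix label checksums
  zhack label repair -u /dev/sdb1   # regenerate uberblocks to undo an erroneous detach
  ```
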
* tests/zdb_encrypted: parse numbers a little more robustly (Rob N, 2023-04-26, 1 file, -2/+2)
  On FreeBSD, `wc` prints some leading spaces, while on Linux it does not. So we tell ksh to expect an integer, and it does the rest.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #14791
  Closes #14797

* Add support for zpool user properties (Allan Jude, 2023-04-21, 5 files, -1/+360)
  Usage:
    zpool set org.freebsd:comment="this is my pool" poolname

  Tests are based on zfs_set's user property tests. Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Signed-off-by: Mateusz Piotrowski <[email protected]>
  Sponsored-by: Beckhoff Automation GmbH & Co. KG.
  Sponsored-by: Klara Inc.
  Closes #11680

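  The commit message gives the set form; reading a user property back would look like this sketch (pool name taken from the usage line above):

  ```
  zpool set org.freebsd:comment="this is my pool" poolname
  zpool get org.freebsd:comment poolname
  ```
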
* ZTS: zvol_misc_trim retry busy export (Brian Behlendorf, 2023-04-20, 1 file, -2/+2)
  Retry the export if the pool is busy due to an open zvol. Observed in the CI on Fedora 37.

    cannot export 'testpool': pool is busy
    ERROR: zpool export testpool exited 1

  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #14769

* Create zap for root vdev (rob-wing, 2023-04-20, 14 files, -3/+214)
  And add it to the AVZ. This is not backwards compatible with older pools due to an assertion in spa_sync() that verifies the number of ZAPs of all vdevs matches the number of ZAPs in the AVZ. Granted, the assertion only applies to #DEBUG builds - still, a feature flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

  Notably, this allows to get/set properties on the root vdev:
    % zpool set user:prop=value <pool> root-0

  Before this commit, it was already possible to get/set properties on top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
    % zpool set user:prop=value <pool> mirror-0

  This syntax also applies to the root vdev as it is of type 'root' with a vdev_id of 0, root-0. The keyword 'root' can be used as an alias for 'root-0'.

  The following tests have been added:
  - zpool get all properties from root vdev
  - zpool set a property on root vdev
  - verify root vdev ZAP is created

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Wing <[email protected]>
  Sponsored-by: Seagate Technology
  Submitted-by: Klara, Inc.
  Closes #14405

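  Following the syntax shown above, reading a property back from the root vdev would look roughly like the sketch below (pool and property names hypothetical):

  ```
  zpool get user:prop tank root-0    # 'root' should also work as an alias for 'root-0'
  ```
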
* ZTS: send-c_volume is flaky (Paul Dagnelie, 2023-04-19, 1 file, -2/+7)
  We use block_device_wait to wait for the zvol block device to actually appear, and we log the result of the dd calls by using an intermediate file.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: John Wren Kennedy <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #14767

* ZTS: add existing tests to runfiles (Damian Szuberski, 2023-04-06, 3 files, -8/+14)
  Some test cases were committed to the repository but never added to runfiles. Move `zfs_unshare_008_pos` to the Linux-only runfile.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: szubersk <[email protected]>
  Closes #14701

* ZTS: Use inbuilt monotonic time (Andrew Innes, 2023-04-06, 1 file, -17/+17)
  Make the test runner try to use the included Python monotonic time function instead of calling librt. This makes the test runner work on macOS, where librt isn't available.

  Reviewed-by: Tino Reichardt <[email protected]>
  Signed-off-by: Andrew Innes <[email protected]>
  Closes #14700

* Fixes in persistent error log (George Amanakis, 2023-03-28, 6 files, -5/+217)
  Address the following bugs in persistent error log:
  1) Check nested clones, e.g. "fs->snap->clone->snap2->clone2".
  2) When deleting files containing error blocks in those clones (from "clone" in the example above), do not break the check chain.
  3) When deleting files in the originating fs before syncing the errlog to disk, do not break the check chain. This happens because at the time of introducing the error block in the error list, we do not have its birth txg and the head filesystem. If the original file is deleted before the error list is synced to the error log (which is when we actually lookup the birth txg and the head filesystem), then we do not have access to this info anymore and break the check chain.

  The most prominent change is related to achieving (3). We expand the spa_error_entry_t structure to accommodate the newly introduced zbookmark_err_phys_t structure (containing the birth txg of the error block). Due to compatibility reasons we cannot remove the zbookmark_phys_t structure and we also need to place the new structure after se_avl, so it is not accounted for in avl_find(). Then we modify spa_log_error() to also provide the birth txg of the error block. With these changes in place we simplify the previously introduced function get_head_and_birth_txg() (now named get_head_ds()).

  We chose not to follow the same approach for the head filesystem (thus completely removing get_head_ds()) to avoid introducing new lock contentions.

  The stack sizes of nested functions (as measured by checkstack.pl in the linux kernel) are:
    check_filesystem [zfs]: 272 (was 912)
    check_clones [zfs]: 64

  We also introduced two new tests covering the above changes.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #14633

* Split functional testings via github action matrix (Tino Reichardt, 2023-03-15, 3 files, -1/+6)
  This commit changes the workflow of the GitHub actions. We split the workflow into different parts:
  1) build zfs modules for Ubuntu 20.04 and 22.04 (~25m)
  2) 2x zloop test (~10m) + 2x sanity test (~25m)
  3) functional testings in parts 1..5 (each ~1h)
     - these could be triggered when the sanity tests are ok
     - currently I just start them all at the same time
  4) cleanup and create summary
  When everything is fine, the full run with all testings should be done in around 2 hours. The codeql.yml and checkstyle.yml are not part of this circle.

  The testings are also modified a bit:
  - report info about CPU and checksum benchmarks
  - reset the debugging logs for each test
  - when some error occurred, we call dmesg with -c to get only the log output for the last failed test
  - we also empty the dbgsys

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #14078

* Improve tests and update man page for healing recv (Alek P, 2023-03-15, 4 files, -2/+197)
  Fix the manpage. The "SYNOPSIS" section is incorrectly formatted for receive -c. I also took this opportunity to reword some parts and fix a run-on sentence in the manpage.

  Add large block testing for corrective recv. This adds a new test that makes sure blocks generated using the zfs send -L/--large-block send flag are able to be used for healing.

  Since #13675 (corruption not shown in zpool status when the key is unloaded and the errlog feature is enabled) is fixed, the zfs_receive_corrective.ksh test no longer sets -o feature@head_errlog=disabled on pool creation, so that it can also test for regressions related to the head_errlog feature. Note that the zfs_receive_compressed_corrective.ksh and zfs_receive_large_block_corrective.ksh tests are still creating pools with -o feature@head_errlog=disabled.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alek Pinchuk <[email protected]>
  Closes #14615

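  A hedged sketch of the healing flow the new large-block test exercises (dataset and file names hypothetical; the exact corrective-receive target syntax is an assumption):

  ```
  zfs send -L tank/fs@snap > /backup/fs.stream     # stream with large blocks preserved
  zfs receive -c tank/fs@snap < /backup/fs.stream  # heal corrupted blocks from the stream
  ```
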
* Remove unused Edon-R variants (Tino Reichardt, 2023-03-14, 1 file, -78/+5)
  This commit removes the edonr_byteorder.h file and all unused variants of Edon-R.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #13618

* nvpair: Constify string functions (Richard Yao, 2023-03-14, 3 files, -4/+4)
  After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This led to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch.

  Reviewed-by: Tino Reichardt <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #14612

* Replace dead opensolaris.org license links (Tino Reichardt, 2023-03-14, 7 files, -7/+7)
  The commit replaces all occurrences of the link http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: WHR <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #14625

* Implementation of block cloning for ZFS (Pawel Jakub Dawidek, 2023-03-10, 1 file, -0/+4)
  Block Cloning allows a file (or a subset of its blocks) to be manually cloned into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs).

  The whole design of block cloning is documented in module/zfs/brt.c.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Christian Schwarz <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Rich Ercolani <[email protected]>
  Signed-off-by: Pawel Jakub Dawidek <[email protected]>
  Closes #13392

* Fix incremental receive silently failing for recursive sends (Paul Dagnelie, 2023-03-10, 4 files, -14/+79)
  The problem occurs because dmu_recv_begin pulls in the payload and next header from the input stream in order to use the contents of the begin record's nvlist. However, the change to do that before the other checks in dmu_recv_begin occur caused a regression where an empty send stream in a recursive send could have its END record consumed by this, which broke the logic of recv_skip. A test is also included to protect against this case in the future.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #12661
  Closes #14568

* More adaptive ARC eviction (Alexander Motin, 2023-03-08, 1 file, -1/+0)
  Traditionally ARC adaptation was limited to MRU/MFU distribution. But for years people with metadata-centric workloads demanded mechanisms to also manage data/metadata distribution, which in original ZFS was just a FIFO. As a result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO.

  This change removes most of the existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and metadata, same as the distribution between data and metadata themselves. Since most of the required states separation was already done, it only required making the arcs_size state field specific per data/metadata.

  The adaptation logic is still based on the previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes, instead of arc_p measured in bytes this code uses 3 variables arc_meta, arc_pd and arc_pm, representing ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when we need to evict, this moves all the logic from arc_adapt() to arc_evict(), which reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows removing the ugly ARC_HDR_DO_ADAPT flag from many places.

  This change also removes a number of metadata-specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduces a single opaque knob, zfs_arc_meta_balance, tuning ARC's reaction to ghost hits and allowing the administrator to give more or less preference to metadata without setting strict limits.

  Some old code parts like arc_evict_meta() are just removed, because since the introduction of the ABD ARC they really make no sense: only headers referenced by a small number of buffers are not evictable, and they are really not evictable no matter what this code does. Instead just call arc_prune_async() if too much metadata appears not evictable.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #14359

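  On Linux the new knob is exposed as a module parameter; a hedged example (the value shown is arbitrary, not a recommendation, and the standard module parameter path is assumed):

  ```
  cat /sys/module/zfs/parameters/zfs_arc_meta_balance           # current setting
  echo 1000 > /sys/module/zfs/parameters/zfs_arc_meta_balance   # weight metadata ghost hits more heavily
  ```
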
* Update BLAKE3 for using the new impl handling (Tino Reichardt, 2023-03-02, 1 file, -6/+12)
  This commit changes the BLAKE3 implementation handling and also the calls to it from the ztest command.

  Tested-by: Rich Ercolani <[email protected]>
  Tested-by: Sebastian Gottschall <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #13741

* Add generic implementation handling and SHA2 impl (Tino Reichardt, 2023-03-02, 1 file, -7/+27)
  The skeleton file module/icp/include/generic_impl.c can be used for iterating over different implementations of algorithms. It is used by SHA256, SHA512 and BLAKE3 currently. The Solaris SHA2 implementation got replaced with a version which is based on public domain code of cppcrypto v0.10.

  These assembly files are taken from current openssl master:
  - sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
  - sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
  - sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
  - sha512-armv7.S: ARMv7, NEON (arm)
  - sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
  - sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
  - sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
  - sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
  - sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
  - sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)

  Tested-by: Rich Ercolani <[email protected]>
  Tested-by: Sebastian Gottschall <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tino Reichardt <[email protected]>
  Closes #13741

* zdb: add decryption support (Rob N, 2023-03-02, 4 files, -4/+74)
  The approach is straightforward: for dataset ops, if a key was offered, find the encryption root and the various encryption parameters, derive a wrapping key if necessary, and then unlock the encryption root. After that all the regular dataset ops will return unencrypted data, and that's kinda the whole thing.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #11551
  Closes #12707
  Closes #14503

* ZTS: Minor fixes (Brian Behlendorf, 2023-02-23, 2 files, -3/+3)
  - The migration_012_pos.ksh test case was failing because of a missing space after `log_must`.
  - None of the tests listed in the runfiles should include the .ksh suffix.

  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #14515

* Fix buffered/direct/mmap I/O race (Brian Behlendorf, 2023-02-23, 4 files, -2/+90)
  When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds.

  The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications.
  - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that.
  - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read().
  - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread().
  - Minor additional refactoring to comments and variable declarations to improve readability.
  - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file.
  - Reduce the mmap_sync test case default run time.

  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #13608
  Closes #14498

* EIO caused by encryption + recursive gang (Matthew Ahrens, 2023-02-06, 3 files, -1/+88)
  Encrypted blocks can not have 3 DVAs, because they use the space of the 3rd DVA for the IV+salt. zio_write_gang_block() takes this into account, setting `gbh_copies` to no more than 2 in this case. Gang member BPs do not have the X (encrypted) bit set (nor do they have the DMU level and type fields set), because encryption is not handled at this level. The gang block is reassembled, and then encryption (and compression) are handled.

  To check if this gang block is encrypted, the code in zio_write_gang_block() checks `pio->io_bp`. This is normally fine, because the block that's being ganged is typically the encrypted BP. The problem is that if there is "recursive ganging", where a gang member is itself a gang block, then when zio_write_gang_block() is called to create a gang block for a gang member, `pio->io_bp` is the gang member's BP, which doesn't have the X bit set, so the number of DVA's is not restricted to 2. It should instead be looking at the "gang leader", i.e. the top-level gang block, to determine how many DVA's can be used, to avoid a "NDVA's inversion" (where a child has more DVA's than its parent).

  gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's space:
  ```
  DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=...
  [L0 ZFS plain file] fletcher4 uncompressed encrypted LE gang
  unique double size=100000L/100000P birth=... fill=1 cksum=...
  ```

  leader's GBH contains a BP with gang bit set and 3 DVA's:
  ```
  DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
  [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous
  unique double size=55600L/55600P birth=... fill=0 cksum=...

  DVA[0]=<1:...:55600> DVA[1]=<0:...:55600>
  [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous
  unique double size=55600L/55600P birth=... fill=0 cksum=...

  DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200>
  [L0 unallocated] fletcher4 uncompressed unencrypted LE gang
  unique double size=55400L/55400P birth=... fill=0 cksum=...
  ```

  On nondebug bits, having the 3rd DVA in the gang block works for the most part, because it's true that all 3 DVA's are available in the gang member BP (in the GBH). However, for accounting purposes, gang block DVA's ASIZE include all the space allocated below them, i.e. the 512-byte gang block header (GBH) as well as the gang members below that. We see that above where the gang leader BP is 1MB logical (and after compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB) more than 1MB (0x`100400`).

  Since there are 3 copies of a block below it, we increment the ASIZE of the 3rd DVA of the gang leader by the space used by the 3rd DVA of the child (1 sector, in this case). But there isn't really a 3rd DVA of the parent; the salt is stored in place of the 3rd DVA's ASIZE. So when zio_write_gang_member_ready() increments the parent's BP's `DVA[2]`'s ASIZE, it's actually incrementing the parent's salt. When we later try to read the encrypted recursively-ganged block, the salt doesn't match what we used to write it, so MAC verification fails and we get an EIO.
  ```
  zio_encrypt(): encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89
  zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89
  ```

  This commit addresses the problem by not increasing the number of copies of the GBH beyond 2 (even for non-encrypted blocks). This simplifies the logic while maintaining the ability to traverse all metadata (including gang blocks) even if one copy is lost. (Note that 3 copies of the GBH will still be created if requested, e.g. for `copies=3` or MOS blocks.) Additionally, the code that increments the parent's DVA's ASIZE is made to check the parent DVA's NDVAS even on nondebug bits. So if there's a similar bug in the future, it will cause a panic when trying to write, rather than corrupting the parent BP and causing an error when reading.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Caused-by: #14356
  Closes #14440
  Closes #14413

* Prevent error messages when running tests with no timeout (Paul Dagnelie, 2023-02-02, 1 file, -1/+1)
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #14450

* Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole (David Hedberg, 2023-01-23, 4 files, -1/+90)
  If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block.

  If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg.

  To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: David Hedberg <[email protected]>
  Sponsored by: Findity AB
  Closes #11893
  Closes #14358