path: root/tests/runfiles
* Added test for being able to read various variants of zstd (Rich Ercolani, 2021-09-20; 1 file, -1/+1)

  As detailed in #12022 and #12008, it turns out the current zstd
  implementation is quite nonportable and results in various on-disk header
  layouts that each platform can only read itself. So I've added a test which
  contains a dataset with a file written by Linux/x86_64 and one written by
  FreeBSD/ppc64.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Signed-off-by: Rich Ercolani <[email protected]>
  Closes #12030
* ZTS: Enable punch-hole tests on FreeBSD (Ka Ho Ng, 2021-08-30; 2 files, -1/+5)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Ka Ho Ng <[email protected]>
  Sponsored-by: The FreeBSD Foundation
  Closes #12458
* ZTS: Add tests for creation time (Ryan Moeller, 2021-08-17; 1 file, -0/+4)

  Reviewed-by: Tony Nguyen <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #12432
* Fixes in persistent L2ARC (George Amanakis, 2021-07-26; 1 file, -2/+1)

  In l2arc_add_vdev() first decide whether the device is eligible for L2ARC
  rebuild or whole device trim and then add it to the list of cache devices.
  Otherwise l2arc_feed_thread() might already start writing on the device,
  invalidating previous content as l2ad_hand = l2ad_start. However,
  l2arc_rebuild_vdev() needs the device present in the cache device list to
  figure out its l2arc_dev_t.

  Fix this by moving most of l2arc_rebuild_vdev() into a new function
  l2arc_rebuild_dev() which does not need to search in the cache device list.

  In contrast to l2arc_add_vdev() we do not have to worry about
  l2arc_feed_thread() invalidating previous content when onlining a cache
  device. The device parameters (l2ad*) are not cleared when offlining the
  device and writing new buffers will not invalidate all previous content. In
  the worst case only buffers that have not had their log block written to the
  device will be lost.

  Retire persist_l2arc_00{4,5,8} tests since they cover code already covered
  by the remaining ones. Test persist_l2arc_006 is renamed to
  persist_l2arc_004 and persist_l2arc_007 is renamed to persist_l2arc_005.

  Fix a typo in persist_l2arc_004, and remove an assertion that is not always
  true from l2arc_arcstats_pos. Also update an assertion in persist_l2arc_005
  and explain why in a comment.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #12365
* ZED: Match added disk by pool/vdev GUID if found (#12217) (Ryan Moeller, 2021-06-30; 1 file, -4/+5)

  This enables ZED to auto-online vdevs that are not wholedisk managed by ZFS.

  Signed-off-by: Ryan Moeller <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
* Update cache file when setting compatibility property (Brian Behlendorf, 2021-06-24; 1 file, -1/+2)

  Unlike most other properties the 'compatibility' property is stored in the
  pool config object and not the DMU_OT_POOL_PROPS object. This had the
  advantage that the compatibility information is available without needing
  to fully import the pool (it can be read with zdb). However, this means we
  need to make sure to update both the copy of the config in the MOS and the
  cache file. This wasn't being done.

  This commit adds a call to spa_async_request() to ensure the copy of the
  config in the cache file gets updated as well as the one stored in the
  pool. This same change is made for the 'comment' property which suffers
  from the same inconsistency.

  Reviewed-by: Sean Eric Fagan <[email protected]>
  Reviewed-by: Colm Buckley <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #12261
  Closes #12276
* Livelist logic should handle dedup blkptrs (Serapheim Dimitropoulos, 2021-06-07; 1 file, -2/+2)

  Update the logic to handle the dedup case of consecutive FREEs in the
  livelist code. The logic still ensures that all the FREE entries are
  matched up with a respective ALLOC by keeping a refcount for each FREE
  blkptr that we encounter and ensuring that this refcount gets to zero by
  the time we are done processing the livelist.

  zdb -y no longer panics when encountering double frees.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Signed-off-by: Serapheim Dimitropoulos <[email protected]>
  Closes #11480
  Closes #12177
* Fix dRAID sequential resilver silent damage handling (Brian Behlendorf, 2021-05-20; 1 file, -4/+4)

  This change addresses two distinct scenarios which are possible when
  performing a sequential resilver to a dRAID pool with vdevs that contain
  silent unknown damage, which in this circumstance took the form of the
  devices being intentionally overwritten with zeros. However, it could also
  result from a device returning incorrect data while a sequential resilver
  was in progress.

  Scenario 1) A sequential resilver is performed while all of the dRAID vdevs
  are ONLINE and there is silent damage present on the vdev being resilvered.
  In this case, nothing will be repaired by
  vdev_raidz_io_done_reconstruct_known_missing() because rc->rc_error isn't
  set on any of the raid columns. To address this,
  vdev_draid_io_start_read() has been updated to always mark the resilvering
  column as ESTALE for sequential resilver IO.

  Scenario 2) Multiple columns contain silent damage for the same block and a
  sequential resilver is performed. In this case it's impossible to generate
  the correct data from parity unless all of the damaged columns are being
  sequentially resilvered (and thus only good data is used to generate
  parity). This is as expected and there's nothing which can be done about
  it. However, we need to be careful not to make the situation worse. Since
  we can't verify the data is actually good without a checksum, we must only
  repair the devices which are being sequentially resilvered. Otherwise, an
  incorrect repair to a device which previously contained good data could
  effectively lock in the damage and make reconstruction impossible. A check
  for this was added to vdev_raidz_io_done_verified() along with a new test
  case.

  Lastly, this change updates the redundancy_draid_spare1 and
  redundancy_draid_spare3 test cases to be more representative of normal
  dRAID replacement operation. Specifically, what we care about is that the
  scrub run after a sequential resilver does not find additional blocks which
  need repair. This would indicate the sequential resilver failed to rebuild
  a section of one of the devices. Note also the tests were switched to using
  the verify_pool() function which still checks for checksum errors.

  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #12061
* ZTS: Increase redundancy test timeout (Brian Behlendorf, 2021-05-14; 1 file, -0/+1)

  The redundancy_draid.ksh and redundancy_raidz.ksh tests were updated by
  commit 93c8e91fe to additionally verify self-healing. This additional check
  increased the run time, which can now occasionally exceed the default
  maximum timeout in the CI environment. To prevent this from causing
  failures, increase the default timeout for the redundancy test cases.

  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #12043
* Revert "Fix raw sends on encrypted datasets when copying back snapshots"Brian Behlendorf2021-05-131-2/+1
| | | | | | | | | | | | | | | Commit d1d4769 takes into account the encryption key version to decide if the local_mac could be zeroed out. However, this could lead to failure mounting encrypted datasets created with intermediate versions of ZFS encryption available in master between major releases. In order to prevent this situation revert d1d4769 pending a more comprehensive fix which addresses the mount failure case. Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #11294 Issue #12025 Issue #12300 Closes #12033
* libzfs: add keylocation=https://, backed by fetch(3) or libcurl (наб, 2021-05-12; 2 files, -2/+4)

  Add support for http and https to the keylocation property to allow
  encryption keys to be fetched from the specified URL.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Issue #9543
  Closes #9947
  Closes #11956
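  For illustration, a minimal sketch of the new keylocation values in use
  (pool, dataset, and URL are hypothetical):

  ```
  # Create an encrypted dataset whose raw key is fetched over HTTPS
  zfs create -o encryption=on -o keyformat=raw \
      -o keylocation=https://keys.example.com/home.key tank/home

  # Later, load the key from the same URL and mount
  zfs load-key tank/home
  zfs mount tank/home
  ```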
* Fix dRAID self-healing short columns (Brian Behlendorf, 2021-05-08; 1 file, -2/+2)

  When dRAID performs a normal read operation only the data columns in the
  raid map are read from disk. This is enough information to calculate the
  checksum, verify it, and return the needed data to the application. It's
  only in the event of a checksum failure that the additional parity and any
  empty columns must be read since they are required for parity
  reconstruction.

  Reading these additional columns is handled by vdev_raidz_read_all() which
  calls vdev_draid_map_alloc_empty() to expand the raid_map_t and submit IOs
  for the missing columns.

  This all works correctly, but it fails to account for any "short" columns.
  These are data columns which are padded with an empty skip sector at the
  end. Since that empty sector is not needed for a normal read, it's not
  read when the column is first read from disk. However, like the parity and
  empty columns, the skip sector is needed to perform reconstruction.

  The fix is to mark any "short" columns as never having been read by
  clearing the rc_tried flag when expanding the raid_map_t. This will cause
  the entire column to be re-read from disk in the event of a checksum
  failure, allowing the self-healing functionality to repair the block.

  Note that this only affects the self-healing feature because when scrubbing
  a pool the parity, data, and empty columns are all read initially to verify
  their contents. Furthermore, only blocks which contain "short" columns
  would be affected, and only when the memory backing the skip sector wasn't
  already zeroed out.

  This change extends the existing redundancy_raidz.ksh test case to verify
  self-healing (as well as resilver and scrub). Then applies the same test
  case to dRAID with a slightly modified version of the test script called
  redundancy_draid.ksh. The unused variable combrec was also removed from
  both test cases.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #12010
* ZTS: add zed_fd_spill to verify the fds ZEDLETs inherit (наб, 2021-04-15; 1 file, -1/+1)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11891
* Allow zfs to send replication streams with missing snapshots (pablofsf, 2021-04-11; 1 file, -1/+1)

  A tentative implementation and discussion was done in #5285. According to
  it, a send --skip-missing|-s flag has been added. In a replication stream,
  when there are snapshots missing in the hierarchy, if -s is provided a
  warning is printed and the affected dataset (and its children) is ignored,
  instead of throwing an error.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Pablo Correa Gómez <[email protected]>
  Closes #11710
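  As a sketch, the new flag in a replication send (dataset names
  hypothetical):

  ```
  # Datasets missing the named snapshot are skipped with a warning
  # instead of aborting the whole stream
  zfs send -R --skip-missing tank/data@backup | zfs receive -d backup
  ```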
* Ratelimit deadman zevents as with delay zevents (Ryan Moeller, 2021-04-07; 2 files, -6/+6)

  Just as delay zevents can flood the zevent pipe when a vdev becomes
  unresponsive, so do the deadman zevents.

  Ratelimit deadman zevents according to the same tunable as for delay
  zevents.

  Enable deadman tests on FreeBSD and add a test for deadman event
  ratelimiting.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11786
* Fix various typos (Andrea Gelmini, 2021-04-02; 1 file, -1/+1)

  Correct an assortment of typos throughout the code base.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11774
* Fix regression in POSIX mode behavior (Andrew, 2021-03-19; 2 files, -0/+8)

  Commit 235a85657 introduced a regression in evaluation of POSIX modes that
  require group DENY entries in the internal ZFS ACL. An example of such a
  POSIX mode is 007. When write_implies_delete_child is set, ACE_WRITE_DATA
  is added to `wanted_dirperms` prior to calling zfs_zaccess_common(); this
  occurs in zfs_zaccess_delete(). Unfortunately, when
  zfs_zaccess_aces_check() hits this particular DENY ACE, zfs_groupmember()
  is checked to determine whether access should be denied. Since
  zfs_groupmember() always returns B_TRUE on Linux, this check fails,
  ultimately resulting in EPERM being returned.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Andrew Walker <[email protected]>
  Closes #11760
* ZTS: New test for kernel panic induced by redacted send (Palash Gandhi, 2021-03-19; 1 file, -2/+2)

  This change adds a new test that covers the fix for a bug in the binary
  search in the redacted send resume logic that caused a kernel panic. The
  bug was fixed in https://github.com/openzfs/zfs/pull/11297.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: John Kennedy <[email protected]>
  Signed-off-by: Palash Gandhi <[email protected]>
  Closes #11764
* ZTS: Add tests for DOS mode attributes (Ryan Moeller, 2021-03-16; 1 file, -0/+4)

  Create a new section of tests to run with acltype=off. For now the only
  test we have is for the DOS mode READONLY attribute on FreeBSD.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11734
* zpool import cachefile improvements (George Wilson, 2021-03-12; 1 file, -0/+1)

  Importing a pool using the cachefile is ideal to reduce the time required
  to import a pool. However, if the devices associated with a pool in the
  cachefile have changed, then the import would fail. This can easily be
  corrected by doing a normal import, which would then read the pool
  configuration from the labels.

  The goal of this change is to make importing using a cachefile more
  resilient and auto-correcting. This is accomplished by having the cachefile
  import logic automatically fall back to reading the labels of the devices,
  similar to a normal import. The main difference between the fallback logic
  and a normal import is that the cachefile import logic will only look at
  the device directories that were originally used when the cachefile was
  populated. Additionally, the fallback logic will always import by guid to
  ensure that only the pools in the cachefile would be imported.

  External-issue: DLPX-71980
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  Closes #11716
* Cancel TRIM / initialize on FAULTED non-writeable vdevs (nssrikanth, 2021-03-02; 1 file, -0/+2)

  When a device which is actively trimming or initializing becomes FAULTED,
  and therefore no longer writable, cancel the active TRIM or initialization.
  When the device is merely taken offline with `zpool offline` then stop the
  operation but do not cancel it. When the device is brought back online the
  operation will be resumed if possible.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Vipin Kumar Verma <[email protected]>
  Signed-off-by: Srikanth N S <[email protected]>
  Closes #11588
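  A sketch of the offline/online behavior described above (pool and device
  names hypothetical):

  ```
  zpool initialize tank sdb   # start initializing the device
  zpool offline tank sdb      # operation is stopped, not canceled
  zpool online tank sdb       # operation resumes where it left off
  ```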
* send_iterate_snap: doall send without fromsnap (Cedric Maunoury, 2021-02-24; 1 file, -1/+1)

  The behavior of a NULL fromsnap was inadvertently changed for a doall send
  when the send/recv logic in libzfs was updated. Restore the previous
  behavior by correcting send_iterate_snap() to include all the snapshots in
  the nvlist for this case.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Cedric Maunoury <[email protected]>
  Closes #11608
* Checksum errors may not be counted (Don Brady, 2021-02-19; 1 file, -1/+2)

  Fix a regression seen in issue #11545 where checksum errors were not being
  counted or showing up in a zpool event.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #11609
* Add "compatibility" property for zpool feature setsColm2021-02-171-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Property to allow sets of features to be specified; for compatibility with specific versions / releases / external systems. Influences the behavior of 'zpool upgrade' and 'zpool create'. Initial man page changes and test cases included. Brief synopsis: zpool create -o compatibility=off|legacy|file[,file...] pool vdev... compatibility = off : disable compatibility mode (enable all features) compatibility = legacy : request that no features be enabled compatibility = file[,file...] : read features from specified files. Only features present in *all* files will be enabled on the resulting pool. Filenames may be absolute, or relative to /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc checked first). Only affects zpool create, zpool upgrade and zpool status. ABI changes in libzfs: * New function "zpool_load_compat" to load and parse compat sets. * Add "zpool_compat_status_t" typedef for compatibility parse status. * Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum * Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum An initial set of base compatibility sets are included in cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is modified to install these in $pkgdatadir/compatibility.d and to create symbolic links to a reasonable set of aliases. Reviewed-by: ericloewe Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes #11468
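  For illustration, a hedged example of creating a pool against an installed
  feature set (the file name "openzfs-2.0-linux" is an assumption; any file
  under the compatibility.d directories named above works):

  ```
  # Enable only the features listed in the chosen compatibility file
  zpool create -o compatibility=openzfs-2.0-linux tank mirror sda sdb
  zpool get compatibility tank
  ```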
* Add zdb -r <dataset> <object-id | file> <output> (Allan Jude, 2021-01-27; 1 file, -1/+1)

  While you can use `zdb -R poolname vdev:offset:[<lsize>/]<psize>[:flags]`
  to extract individual DVAs from a vdev, it would be handy to be able to
  copy an entire file out of the pool. Given a file or object number, add
  support to copy the contents to a file. Useful for debugging and recovery.

  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Closes #11027
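  A sketch of the syntax from the title above (dataset, object number, and
  output path hypothetical):

  ```
  # Copy object 4 out of dataset tank/fs into /tmp/recovered
  zdb -r tank/fs 4 /tmp/recovered
  ```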
* RAIDZ2/3 fails to heal silently corrupted parity w/2+ bad disks (Matthew Ahrens, 2021-01-26; 1 file, -2/+3)

  When scrubbing, (non-sequential) resilvering, or correcting a checksum
  error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity by
  overwriting it. For example, if P disks are silently corrupted (P being the
  number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool scrub` should
  detect and heal all the bad state on these disks, including parity. This
  way if there is a subsequent failure we are fully protected.

  With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity sector,
  and also damage (silent or known) to a data sector. In this case the parity
  should be healed but it is not.

  The problem can be noticed by scrubbing the pool twice. Assuming there was
  no damage concurrent with the scrubs, the first scrub should fix all silent
  damage, and the second scrub should be "clean" (`zpool status` should not
  report checksum errors on any disks). If the bug is encountered, then the
  second scrub will repair the silently-damaged parity that the first scrub
  failed to repair, and these checksum errors will be reported after the
  second scrub. Since the first scrub repaired all the damaged data, the bug
  can not be encountered during the second scrub, so subsequent scrubs (more
  than two) are not necessary.

  The root cause of the problem is some code that was inadvertently added to
  `raidz_parity_verify()` by the DRAID changes. The incorrect code causes the
  parity healing to be aborted if there is damaged data (`rc_error != 0`) or
  the data disk is not present (`!rc_tried`). These checks are not necessary,
  because we only call `raidz_parity_verify()` if we have the correct data
  (which may have been reconstructed using parity, and which was verified by
  the checksum).

  This commit fixes the problem by removing the incorrect checks in
  `raidz_parity_verify()`.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11489
  Closes #11510
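  The scrub-twice check described above, as a sketch (pool name
  hypothetical):

  ```
  zpool scrub tank && zpool wait -t scrub tank   # repairs all silent damage
  zpool scrub tank && zpool wait -t scrub tank   # should now be clean
  zpool status tank                              # expect no new CKSUM errors
  ```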
* Add basic io_uring test (Matthew Macy, 2021-01-23; 1 file, -1/+1)

  Provide basic test coverage for io_uring I/O.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11497
* Fix reporting of CKSUM errors in indirect vdevs (George Amanakis, 2020-12-11; 1 file, -1/+1)

  When removing and subsequently reattaching a vdev, CKSUM errors may occur
  as vdev_indirect_read_all() reads from all children of a mirror in case of
  a resilver. Fix this by checking whether a child is missing the data and
  setting a flag (ic_error) which is then checked in vdev_indirect_repair()
  and suppresses incrementing the checksum counter.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #11277
* CI: add new zfs-tests-sanity workflow (George Melikov, 2020-12-08; 1 file, -1/+1)

  Run zfs-tests with sanity.run for brief results. Timeouts are rare, so
  minimize false positives by increasing the default from 60 to 180 seconds.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Melikov <[email protected]>
  Closes #11304
* ZTS: adjust zpool_import_012_pos timeout (Brian Behlendorf, 2020-12-06; 1 file, -0/+1)

  When running in the CI the zpool_import_012_pos test case occasionally
  takes longer than the maximum 600 seconds. When this happens the test case
  is considered to have failed but always completes a few minutes later.
  Since the logs suggest nothing has actually failed, this commit increases
  the timeout and removes the exception.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11286
* Fix raw sends on encrypted datasets when copying back snapshots (George Amanakis, 2020-12-04; 1 file, -1/+2)

  When sending raw encrypted datasets the user space accounting is present
  when it's not expected to be. This leads to a subsequent mount failure due
  to a checksum error when verifying the local mac. Fix this by clearing the
  OBJSET_FLAG_USERACCOUNTING_COMPLETE flag and resetting the local mac. This
  allows the user accounting to be correctly updated on first mount using
  the normal upgrade process.

  Reviewed-By: Brian Behlendorf <[email protected]>
  Reviewed-By: Tom Caputi <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #10523
  Closes #11221
* zpool: Dryrun fails to list some devices (Attila Fülöp, 2020-12-04; 1 file, -3/+4)

  - `zpool create -n` fails to list cache and spare vdevs.
  - `zpool add -n` fails to list spare devices.
  - `zpool split -n` fails to list `special` and `dedup` labels.
  - `zpool add -n` and `zpool split -n` shouldn't list hole devices.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Attila Fülöp <[email protected]>
  Closes #11122
  Closes #11167
* Add -u option to 'zfs create' (Ryan Moeller, 2020-12-04; 1 file, -1/+2)

  Add a -u option to 'zfs create' that prevents the file system from being
  automatically mounted. This is similar to 'zfs receive -u'.

  Authored by: pjd <[email protected]>
  FreeBSD-commit: freebsd/freebsd@35c58230e292775a694d189ff2b0bea2dcf6947d
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Ported-by: Ryan Moeller <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11254
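  A minimal sketch of the new option (dataset name hypothetical):

  ```
  zfs create -u tank/scratch   # created but not mounted
  zfs mount tank/scratch      # mount later, on demand
  ```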
* Add sanity.run file (Brian Behlendorf, 2020-12-03; 3 files, -0/+625)

  This run file contains a subset of functional tests which exercise as much
  functionality as possible while still executing relatively quickly. The
  included tests should take no more than a few seconds each to run at most.
  This provides a convenient way to sanity test a change before committing
  to a full test run, which takes several hours.

      $ ./scripts/zfs-tests.sh -r sanity
      ...
      Results Summary
      PASS     813

      Running Time:   00:14:42
      Percent passed: 100.0%

  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11271
* Fix 'zfs userspace' for received datasets in encrypted root (loli10K, 2020-11-16; 1 file, -1/+1)

  For encrypted receives, where user accounting is initially disabled on
  creation, both 'zfs userspace' and 'zfs groupspace' fail with EOPNOTSUPP:
  this is because dmu_objset_id_quota_upgrade_cb() forgets to set
  OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a successful
  dmu_objset_space_upgrade().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #9501
  Closes #9596
* Distributed Spare (dRAID) Feature (Brian Behlendorf, 2020-11-13; 1 file, -4/+8)

  This patch adds a new top-level vdev type called dRAID, which stands for
  Distributed parity RAID. This pool configuration allows all dRAID vdevs to
  participate when rebuilding to a distributed hot spare device. This can
  substantially reduce the total time required to restore full parity to a
  pool with a failed device.

  A dRAID pool can be created using the new top-level `draid` type. Like
  `raidz`, the desired redundancy is specified after the type:
  `draid[1,2,3]`. No additional information is required to create the pool
  and reasonable default values will be chosen based on the number of child
  vdevs in the dRAID vdev.

      zpool create <pool> draid[1,2,3] <vdevs...>

  Unlike raidz, additional optional dRAID configuration values can be
  provided as part of the draid type as colon separated values. This allows
  administrators to fully specify a layout for either performance or
  capacity reasons. The supported options include:

      zpool create <pool> \
          draid[<parity>][:<data>d][:<children>c][:<spares>s] \
          <vdevs...>

      - draid[parity]       - Parity level (default 1)
      - draid[:<data>d]     - Data devices per group (default 8)
      - draid[:<children>c] - Expected number of child vdevs
      - draid[:<spares>s]   - Distributed hot spares (default 0)

  Abbreviated example `zpool status` output for a 68 disk dRAID pool with
  two distributed spares using special allocation classes.

  ```
    pool: tank
   state: ONLINE
  config:

      NAME                  STATE     READ WRITE CKSUM
      slag7                 ONLINE       0     0     0
        draid2:8d:68c:2s-0  ONLINE       0     0     0
          L0                ONLINE       0     0     0
          L1                ONLINE       0     0     0
          ...
          U25               ONLINE       0     0     0
          U26               ONLINE       0     0     0
          spare-53          ONLINE       0     0     0
            U27             ONLINE       0     0     0
            draid2-0-0      ONLINE       0     0     0
          U28               ONLINE       0     0     0
          U29               ONLINE       0     0     0
          ...
          U42               ONLINE       0     0     0
          U43               ONLINE       0     0     0
      special
        mirror-1            ONLINE       0     0     0
          L5                ONLINE       0     0     0
          U5                ONLINE       0     0     0
        mirror-2            ONLINE       0     0     0
          L6                ONLINE       0     0     0
          U6                ONLINE       0     0     0
      spares
        draid2-0-0          INUSE     currently in use
        draid2-0-1          AVAIL
  ```

  When adding test coverage for the new dRAID vdev type the following
  options were added to the ztest command. These options are leveraged by
  zloop.sh to test a wide range of dRAID configurations.

      -K draid|raidz|random - kind of RAID to test
      -D <value>            - dRAID data drives per group
      -S <value>            - dRAID distributed hot spares
      -R <value>            - RAID parity (raidz or dRAID)

  The zpool_create, zpool_import, redundancy, replacement and fault test
  groups have all been updated to provide test coverage for the dRAID
  feature.

  Co-authored-by: Isaac Huang <[email protected]>
  Co-authored-by: Mark Maybee <[email protected]>
  Co-authored-by: Don Brady <[email protected]>
  Co-authored-by: Matthew Ahrens <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10102
* Fix memleak in cmd/mount_zfs.c (sterlingjensen, 2020-11-10; 1 file, -1/+2)

  Convert dynamic allocation to static buffer, simplify parse_dataset
  function return path. Add tests specific to the mount helper.

  Reviewed-by: Mateusz Guzik <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Sterling Jensen <[email protected]>
  Closes #11098
* ZTS: Add L1 corruption test (Ryan Moeller, 2020-11-05; 1 file, -1/+2)

  Add a new test case which corrupts all level 1 blocks in a file, then
  verifies that the corruption is detected and repaired.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11141
* Non-l2arc pool reads shouldn't be l2arc misses (Adam D. Moss, 2020-10-20; 1 file, -1/+1)

  The current l2_misses accounting behavior treats all reads to pools
  without a configured l2arc as an l2arc miss, IFF there is at least one
  other pool on the system which does have an l2arc configured.

  This makes it extremely hard to tune for an improved l2arc hit/miss ratio
  because this ratio will be modulated by reads from pools which do not (and
  should not) have l2arc devices; its upper limit will depend on the ratio
  of reads from l2arc'd pools and non-l2arc'd pools.

  This PR prevents ARC reads affecting l2arc stats (n.b. l2_misses is the
  only relevant one) where the target spa doesn't have an l2arc.

  Includes new test - l2arc_l2miss_pos.ksh

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Amanakis <[email protected]>
  Signed-off-by: Adam Moss <[email protected]>
  Closes #10921
* Add zpool_influxdb command (Richard Elling, 2020-10-09; 1 file, -0/+3)

  A zpool_influxdb command is introduced to ease the collection of zpool
  statistics into the InfluxDB time-series database. Examples are given on
  how to integrate with the telegraf statistics aggregator, a companion to
  influxdb. Finally, a grafana dashboard template is included to show how
  pool latency distributions can be visualized in a ZFS + telegraf +
  influxdb + grafana environment.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Richard Elling <[email protected]>
  Closes #10786
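  A hedged sketch of the collection pipeline described above (the InfluxDB
  1.x endpoint and the database name "zfs" are assumptions):

  ```
  # Emit current pool statistics in InfluxDB line protocol
  zpool_influxdb

  # One-shot ingest into a local InfluxDB 1.x database
  zpool_influxdb | curl -sS -XPOST 'http://localhost:8086/write?db=zfs' --data-binary @-
  ```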
* Linux: Initialize zp in zfs_setattr_dir (Ryan Moeller, 2020-10-09; 1 file, -1/+5)

  The value of zp is used without having been initialized under some
  conditions. Initialize the pointer to NULL.

  Add a regression test case using chown in acl/posix. However, this is not
  enough because the setup sets xattr=sa, which means zfs_setattr_dir will
  not be called. Create a second group of acl tests in acl/posix-sa
  duplicating the acl/posix tests with symlinks, and remove xattr=sa from
  the original acl/posix tests. This provides more coverage for the default
  xattr=on code.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10043
  Closes #11025
* Make dbufstat work on FreeBSD (Ryan Moeller, 2020-10-08; 2 files, -5/+5)

  With procfs_list kstats implemented for FreeBSD, dbufs are now exposed as
  kstat.zfs.misc.dbufs. On FreeBSD, dbufstats can use the sysctl instead of
  procfs when no input file has been given. Enable the dbufstats tests on
  FreeBSD.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11008
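  A sketch on FreeBSD, using the kstat named above:

  ```
  # Raw dbuf statistics via the new sysctl
  sysctl -n kstat.zfs.misc.dbufs | head

  # dbufstat reads the sysctl itself when no input file is given
  dbufstat
  ```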
* Drop references when skipping dmu_send due to EXDEV (Ryan Moeller, 2020-09-30; 1 file, -1/+1)

  When an invalid incremental send is requested where the "to" ds is before
  the "from" ds, make sure to drop the reference to the pool and the dataset
  before returning the error.

  Add an assert on FreeBSD to make sure we don't hold any locks after
  returning from an ioctl.

  Add some test coverage.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10919
* Add L2ARC arcstats for MFU/MRU buffers and buffer content type (George Amanakis, 2020-09-14; 1 file, -6/+8)

  Currently the ARC state (MFU/MRU) of cached L2ARC buffers and their content
  type is unknown. Knowing this information may prove beneficial in adjusting
  the L2ARC caching policy.

  This commit adds L2ARC arcstats that display the aligned size (in bytes) of
  L2ARC buffers according to their content type (data/metadata) and according
  to their ARC state (MRU/MFU or prefetch). It also expands the existing
  evict_l2_eligible arcstat to differentiate between MFU and MRU buffers.

  L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
  buffer, its ARC state (MRU/MFU) is stored in the L2 header (b_arcs_state).
  The l2_m{f,r}u_asize arcstats reflect the aligned size (in bytes) of L2ARC
  buffers according to their ARC state (based on b_arcs_state). We also
  account for the case where an L2ARC and ARC cached MRU or MRU_ghost buffer
  transitions to MFU. The l2_prefetch_asize reflects the aligned size (in
  bytes) of L2ARC buffers that were cached while they had the prefetch flag
  set in ARC. This is dynamically updated as the prefetch flag of L2ARC
  buffers changes.

  When buffers are evicted from ARC, if they are determined to be L2ARC
  eligible then their logical size is recorded in
  evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
  eviction.

  Persistent L2ARC: When committing an L2ARC buffer to a log block (L2ARC
  metadata) its b_arcs_state and prefetch flag are also stored. If the
  buffer changes its arcstate or prefetch flag this is reflected in the
  above arcstats. However, the L2ARC metadata cannot currently be updated to
  reflect this change. Example: L2ARC caches an MRU buffer. L2ARC metadata
  and arcstats count this as an MRU buffer. The buffer transitions to MFU.
  The arcstats are updated to reflect this. Upon pool re-import or
  on/offlining the L2ARC device the arcstats are cleared and the buffer will
  now be counted as an MRU buffer, as the L2ARC metadata were not updated.

  Bug fix: If l2arc_noprefetch is set, arc_read_done() clears the L2CACHE
  flag of an ARC buffer. However, prefetches may be issued in a way that
  arc_read_done() is bypassed. Instead, move the related code in
  l2arc_write_eligible() to account for those cases too.

  Also add a test and update manpages for the l2arc_mfuonly module
  parameter, and update the manpages and code comments for l2arc_noprefetch.
  Move persist_l2arc tests to l2arc.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Richard Elling <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #10743
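  For illustration, a minimal sketch of inspecting the new arcstats on Linux
  (stat names taken from the text above):

  ```
  grep -E '^(l2_(mfu|mru|prefetch)_asize|evict_l2_eligible_m[fr]u)' \
      /proc/spl/kstat/zfs/arcstats
  ```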
* Avoid posting duplicate zpool events (Don Brady, 2020-09-04; 1 file, -1/+1)

  Duplicate io and checksum ereport events can misrepresent that things are
  worse than they seem. Ideally the zpool events and the corresponding vdev
  stat error counts in a zpool status should be for unique errors -- not the
  same error being counted over and over. This can be demonstrated in a
  simple example. With a single bad block in a datafile and just 5 reads of
  the file we end up with a degraded vdev, even though there is only one
  unique error in the pool.

  The proposed solution to the above issue is to eliminate duplicates when
  posting events and when updating vdev error stats. We now save recent
  error events of interest when posting events so that we can easily check
  for duplicates when posting an error.

  Reviewed by: Brad Lewis <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #10861
* Add 'zfs rename -u' to rename without remounting (Ryan Moeller, 2020-09-01; 1 file, -1/+1)

  Allow renaming file systems without remounting where possible. It is
  possible for file systems with the 'mountpoint' property set to 'legacy'
  or 'none' - we don't have to change the mount directory for them.
  Currently such file systems are unmounted on rename and not even mounted
  back.

  This introduces a layering violation, as we need to update the
  'f_mntfromname' field in the statfs structure related to the mountpoint
  (for the dataset we are renaming and all its children).

  In my opinion it is worth it, as it allows updating FreeBSD in an even
  cleaner way - in a ZFS-only configuration the root file system is a ZFS
  file system with the 'mountpoint' property set to 'legacy'. If the root
  dataset is named system/rootfs, we can snapshot it (system/rootfs@upgrade),
  clone it (system/oldrootfs), update FreeBSD, and if it doesn't boot we can
  boot back from system/oldrootfs and rename it back to system/rootfs while
  it is mounted as /. Before, this was not possible, because unmounting /
  was not possible.

  Authored by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported by: Matt Macy <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10839
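  The rollback flow from the text above, as a sketch:

  ```
  # After booting back into the old root, rename it into place while it
  # is still mounted as /
  zfs rename -u system/oldrootfs system/rootfs
  ```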
* Always track temporary fses and snapshots for accounting (Paul Dagnelie, 2020-08-26; 1 file, -1/+1)

  The root cause of the issue is that we only occasionally do as the
  comments in the code suggest and actually ignore the %recv dataset when it
  comes to filesystem limit tracking. Specifically, the only time we ignore
  it is when initializing the filesystem and snapshot limit values; when
  creating a new %recv dataset or deleting one, we always update the
  bookkeeping. This causes a problem if you init the fs count on a
  filesystem that already has a %recv dataset, since the bookkeeping will be
  decremented but not incremented. This is resolved in this patch by simply
  always tracking the %recv dataset as a child.

  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed by: Jerry Jelinek <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #10791
* 'zfs share -a' should clean noauto exports (Don Brady, 2020-08-20; 1 file, -1/+2)

  This is a follow on to PR #10688 where `zfs share -a` allows the sharing
  of canmount=noauto datasets if they are mounted. However, when a dataset
  with canmount=noauto is not mounted, the command should also purge any
  existing entries from the exports file. Otherwise, after a reboot, the nfs
  server attempts to export the underlying mountpath, not the dataset. This
  can lead to a hard hang for existing client mounts.

  Instead of just skipping the adding of an export if not mounted and
  canmount=noauto, have it also remove an existing export of the dataset so
  that, after a reboot, we don't export an unmounted dataset.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #10747
* Add zstd support to zfs (Michael Niewöhner, 2020-08-20; 1 file, -4/+6)

  This PR adds two new compression types, based on ZStandard:

  - zstd: A basic ZStandard compression algorithm. Available compression
    levels for zstd are zstd-1 through zstd-19, where the compression
    increases with every level, but speed decreases.

  - zstd-fast: A faster version of the ZStandard compression algorithm.
    zstd-fast is basically a "negative" level of zstd. The compression
    decreases with every level, but speed increases. Available compression
    levels for zstd-fast:
    - zstd-fast-1 through zstd-fast-10
    - zstd-fast-20 through zstd-fast-100 (in increments of 10)
    - zstd-fast-500 and zstd-fast-1000

  For more information check the man page.

  Implementation details:

  Rather than treat each level of zstd as a different algorithm (as was done
  historically with gzip), the block pointer `enum zio_compress` value is
  simply zstd for all levels, including zstd-fast, since they all use the
  same decompression function.

  The compress= property (a 64bit unsigned integer) uses the lower 7 bits to
  store the compression algorithm (matching the number of bits used in a
  block pointer, as the 8th bit was borrowed for embedded block pointers).
  The upper bits are used to store the compression level.

  It is necessary to be able to determine what compression level was used
  when later reading a block back, so the concept used in LZ4, where the
  first 32bits of the on-disk value are the size of the compressed data
  (since the allocation is rounded up to the nearest ashift), was extended,
  and we store the version of ZSTD and the level as well as the compressed
  size. This value is returned when decompressing a block, so that if the
  block needs to be recompressed (L2ARC, nop-write, etc), the same
  parameters will be used, resulting in the matching checksum.

  All of the internal ZFS code (`arc_buf_hdr_t`, `objset_t`, `zio_prop_t`,
  etc.) uses the separated _compress and _complevel variables. Only the
  properties ZAP contains the combined/bit-shifted value. The combined value
  is split when the compression_changed_cb() callback is called, and sets
  both objset members (os_compress and os_complevel). The userspace tools
  all use the combined/bit-shifted value.

  Additional notes:

  zdb can now also decode the ZSTD compression header (flag -Z) and inspect
  the size, version and compression level saved in that header. For each
  record, if it is ZSTD compressed, the parameters of the decoded
  compression header get printed.

  ZSTD is included with all current tests and new tests are added as-needed.

  Per-dataset feature flags now get activated when the property is set. If a
  compression algorithm requires a feature flag, zfs activates the feature
  when the property is set, rather than waiting for the first block to be
  born. This is currently only used by zstd but can be extended as needed.

  Portions-Sponsored-By: The FreeBSD Foundation
  Co-authored-by: Allan Jude <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Sebastian Gottschall <[email protected]>
  Co-authored-by: Kjeld Schouten-Lebbing <[email protected]>
  Co-authored-by: Michael Niewöhner <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Sebastian Gottschall <[email protected]>
  Signed-off-by: Kjeld Schouten-Lebbing <[email protected]>
  Signed-off-by: Michael Niewöhner <[email protected]>
  Closes #6247
  Closes #9024
  Closes #10277
  Closes #10278
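  For illustration, a minimal sketch of the new property values (dataset
  names hypothetical):

  ```
  zfs set compression=zstd tank/data          # basic zstd, default level
  zfs set compression=zstd-19 tank/archive    # slowest, highest compression
  zfs set compression=zstd-fast-10 tank/tmp   # favor speed over ratio
  zfs get compression,compressratio tank/data
  ```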
* Fix L2ARC reads when compressed ARC disabled (Allan Jude, 2020-08-13; 1 file, -1/+2)

  When reading compressed blocks from the L2ARC, with compressed ARC
  disabled, arc_hdr_size() returns LSIZE rather than PSIZE, but the actual
  read is PSIZE. This causes l2arc_read_done() to compare the checksum
  against the wrong size, resulting in checksum failure. This manifests as
  an increase in the kstat l2_cksum_bad and the read being retried from the
  main pool, making the L2ARC ineffective.

  Add new L2ARC tests with Compressed ARC enabled/disabled.

  Blocks are handled differently depending on the state of the
  zfs_compressed_arc_enabled tunable. If a block is compressed on-disk, and
  compressed_arc is enabled:

  - the block is read from disk
  - it is NOT decompressed
  - it is added to the ARC in its compressed form
  - l2arc_write_buffers() may write it to the L2ARC (as is)
  - l2arc_read_done() compares the checksum to the BP (compressed)

  However, if compressed_arc is disabled:

  - the block is read from disk
  - it is decompressed
  - it is added to the ARC (uncompressed)
  - l2arc_write_buffers() will use l2arc_apply_transforms() to recompress
    the block, before writing it to the L2ARC
  - l2arc_read_done() compares the checksum to the BP (compressed)
  - l2arc_read_done() will use l2arc_untransform() to uncompress it

  This test writes out a test file to a pool consisting of one disk and one
  cache device, then randomly reads from it. Since the arc_max in the tests
  is low, this will feed the L2ARC, and result in reads from the L2ARC. We
  compare the value of the kstat l2_cksum_bad before and after to determine
  if any blocks failed to survive the trip through the L2ARC.

  Sponsored-by: The FreeBSD Foundation
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Closes #10693