aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* OpenZFS 9290 - device removal reduces redundancy of mirrorsMatthew Ahrens2018-04-1415-231/+915
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. Note that in no case is incorrect data returned, but we might get a checksum error when we should have been able to find the right data. There are two underlying problems: 1. When we remove a mirror device, we only read one side of the mirror. Since we can't verify the checksum, this side may be silently bad, but the good data is on the other side of the mirror (which we didn't read). This can cause the removal to "bake in" the busted data – all copies of the data in the new location are the same, busted version, while we left the good version behind. The fix for this is to read and copy both sides of the mirror. If the old and new vdevs are mirrors, we will read both sides of the old mirror, and write each copy to the corresponding side of the new mirror. (If the old and new vdevs have a different number of children, we will do this as best as possible.) Even though we aren't verifying checksums, this ensures that as long as there's a good copy of the data, we'll have a good copy after the removal, even if there's silent damage to one side of the mirror. If we're removing a mirror that has some silent damage, we'll have exactly the same damage in the new location (assuming that the new location is also a mirror). 2. When we read from an indirect vdev that points to a mirror vdev, we only consider one copy of the data. This can lead to reduced effective redundancy, because we might read a bad copy of the data from one side of the mirror, and not retry the other, good side of the mirror. Note that the problem is not with the removal process, but rather after the removal has completed (having copied correct data to both sides of the mirror), if one side of the new mirror is silently damaged, we encounter the problem when reading the relocated data via the indirect vdev. Also note that the problem doesn't occur when ZFS knows that one side of the mirror is bad, e.g. when a disk entirely fails or is offlined. The impact is that reads (from indirect vdevs that point to mirrors) may return a checksum error even though the good data exists on one side of the mirror, and scrub doesn't repair all data on the mirror (if some of it is pointed to via an indirect vdev). The fix for this is complicated by "split blocks" - one logical block may be split into two (or more) pieces with each piece moved to a different new location. In this case we need to read all versions of each split (one from each side of the mirror), and figure out which combination of versions results in the correct checksum, and then repair the incorrect versions. This ensures that we supply the same redundancy whether you use device removal or not. For example, if a mirror has small silent errors on all of its children, we can still reconstruct the correct data, as long as those errors are at sufficiently-separated offsets (specifically, separated by the largest block size - default of 128KB, but up to 16MB). Porting notes: * A new indirect vdev check was moved from dsl_scan_needs_resilver_cb() to dsl_scan_needs_resilver(), which was added to ZoL as part of the sequential scrub work. * Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t parameter. The extra parameter is unique to ZoL. * When posting indirect checksum errors the ABD can be passed directly, zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS. Authored by: Matthew Ahrens <[email protected]> Reviewed by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-issue: https://illumos.org/issues/9290 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591 Closes #6900
* OpenZFS 7614, 9064 - zfs device evacuation/removalMatthew Ahrens2018-04-14127-912/+9862
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <[email protected]> Reviewed-by: Alex Reece <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed by: Richard Laager <[email protected]> Reviewed by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900
* Wait for resilver after onlineTim Chase2018-04-131-0/+11
| | | | | | | | | | | | | | This test performs a rapid offline/online cycle of each of several mirror vdevs. It can run so quickly that there isn't sufficient pool redundancy to perform an offline. The solution is to wait until the pool is resilvered following the online operation. Also, add a pool sync before the offline operation to help reduce spurious errors. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Issue #6900
* Eliminate trailing spaces in DISKSTim Chase2018-04-131-2/+10
| | | | | | | | | The zfs-tests.sh driver script could add spaces to the end of $DISKS which defeates shell-based parsing with constructs such as ${DISKS##* }. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Issue #6900
* Allow mounting datasets more than onceSeth Forshee2018-04-138-29/+219
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently mounting an already mounted zfs dataset results in an error, whereas it is typically allowed with other filesystems. This causes some bad interactions with mount namespaces. Take this sequence for example: - Create a dataset - Create a snapshot of the dataset - Create a clone of the snapshot - Create a new mount namespace - Rename the original dataset The rename results in unmounting and remounting the clone in the original mount namespace, however the remount fails because the dataset is still mounted in the new mount namespace. (Note that this means the mount in the new mount namespace is never being unmounted, so perhaps the unmount/remount of the clone isn't actually necessary.) The problem here is a result of the way mounting is implemented in the kernel module. Since it is not mounting block devices it uses mount_nodev() instead of the usual mount_bdev(). However, mount_nodev() is written for filesystems for which each mount is a new instance (i.e. a new super block), and zfs should be able to detect when a mount request can be satisfied using an existing super block. Change zpl_mount() to call sget() directly with it's own test callback. Passing the objset_t object as the fs data allows checking if a superblock already exists for the dataset, and in that case we just need to return a new reference for the sb's root dentry. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Signed-off-by: Seth Forshee <[email protected]> Closes #5796 Closes #7207
* ZTS: clean up leftover ibackup_trunc filesbunder20152018-04-131-2/+3
| | | | | | | | | | | | | zfs_receive_raw_incremental did not clean up ibackup_trunc.* files left over from running the test. Also changing the path of the ibackup files so they can be placed in the correct directories when /var/tmp is not the temporary directory. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7430
* Linux compat 4.16: blk_queue_flag_{set,clear}Brian Behlendorf2018-04-121-8/+6
| | | | | | | | | | | | | | | | | | | | The HAVE_BLK_QUEUE_WRITE_CACHE_GPL_ONLY case was overlooked in the original 10f88c5c commit because blk_queue_write_cache() was available for the in-kernel builds. Update the blk_queue_flag_{set,clear} wrappers to call the locked versions to avoid confusion. This is safe for all existing callers. The blk_queue_set_write_cache() function has been updated to use these wrappers. This means setting/clearing both QUEUE_FLAG_WC and QUEUE_FLAG_FUA is no longer atomic but this only done early in zvol_alloc() prior to any requests so there is no issue. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Kash Pande <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7428 Closes #7431
* Add 'zpool split' coverage to the ZFS Test SuiteLOLi2018-04-1212-1/+529
| | | | | | | | | | | | | This change adds five new tests to the ZTS: * zpool_split_cliargs: verify command line options and arguments * zpool_split_devices: verify zpool split accepts a device list * zpool_split_encryption: verify zpool can split encrypted pools * zpool_split_props: verify zpool split can set property values * zpool_split_vdevs: verify vdev layout when splitting the pool Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7409
* Fix calloc(3) arguments orderTomohiro Kusumi2018-04-126-15/+15
| | | | | | | | | | | | calloc(3) takes `nelem` (or `nmemb` in glibc) first, and then size of elements. No difference expected for having these in reverse order, however should follow the standard. http://pubs.opengroup.org/onlinepubs/009695399/functions/calloc.html Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7405
* Fix zfs_arc_max minimum tuningberen122018-04-121-1/+1
| | | | | | | | | | When setting `zfs_arc_max` its minimum value is allowed to be 64 MiB. There was an off-by-1 error which can matter on tiny systems. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Zubrzycki <[email protected]> Closes #7417
* OpenZFS 9286 - want refreservation=autoMike Gerdts2018-04-119-11/+424
| | | | | | | | | | | | | | | | | | | | Authored by: Mike Gerdts <[email protected]> Reviewed by: Allan Jude <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Andy Stormont <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Don Brady <[email protected]> Porting Notes: * Adopted destroy_dataset in ZTS test cleanup * Use ksh shebang instead of bash for new tests OpenZFS-issue: https://www.illumos.org/issues/9286 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/723d0c85 Closes #7387
* Fix zpool set feature@<feature>=disabledLOLi2018-04-115-9/+95
| | | | | | | | | Commit e4010f2 accidentally allows zpool to set pool features to "disabled"; this should only be allowed at pool creation. This commit adds additional checks and test coverage to 'zpool set'. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7402
* zfs(8): fix dedup omission during mdoc overhaulDeHackEd2018-04-101-0/+30
| | | | | | | | | | | The property description has been updated with new algorithms as well. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: DHE <[email protected]> Closes #7377
* Fix race in dnode_check_slots_free()Tom Caputi2018-04-106-11/+43
| | | | | | | | | | | | | | | | | | Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable, leading to crashes. This patch adds the ability for dnodes to track which txg they were last dirtied in and adds a check for this before performing the reclaim. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7147 Closes #7388
* Linux compat 4.16: blk_queue_flag_{set,clear}Giuseppe Di Natale2018-04-104-4/+60
| | | | | | | | | | queue_flag_{set,clear}_unlocked are now private interfaces in the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14). Use blk_queue_flag_{set,clear} interfaces which were introduced as of https://github.com/torvalds/linux/commit/8814ce8. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7410
* Correct swapped keylocation error messagesTom Caputi2018-04-091-4/+4
| | | | | | | | | | | This patch corrects a small issue where two error messages in the code that checks for invalid keylocations were swapped. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7418
* Revert "Handle zap_add() failures in mixed ... "Tony Hutter2018-04-099-290/+32
| | | | | | | | | | | | This reverts commit cc63068e95ee725cce03b1b7ce50179825a6cda5. Under certain circumstances this change can result in an ENOSPC error when adding new files to a directory. See #7401 for full details. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Issue #7401 Cloes #7416
* Fix 'zfs send/recv' hang with 16M blocksBrian Behlendorf2018-04-082-4/+37
| | | | | | | | | | | When using 16MB blocks the send/recv queue's aren't quite big enough. This change leaves the default 16M queue size which a good value for most pools. But it additionally ensures that the queue sizes are at least twice the allowed zfs_max_recordsize. Reviewed-by: loli10K <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7365 Closes #7404
* Clean up (k)shlib and cfg file shebangsGiuseppe Di Natale2018-04-0825-30/+27
| | | | | | | | | | | | Most kshlib files are imported by other scripts and do not have a shebang at the top of their files. Make all kshlib follow this convention. Remove shebangs from cfg files as well. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Close #7406
* Fix "file is executable, but no shebang" warningsTony Hutter2018-04-0683-148/+296
| | | | | | | | | | | | | Fedora 28's RPM build checks warn when executable files don't have a shebang line. These warnings are caused when we (incorrectly) include data & config files in the_SCRIPTS automake lines. Files in _SCRIPTS are marked executable by automake. This patch fixes the issue by including non-executable scripts in a _DATA line instead. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7359 Closes #7395
* Exclude python scripts from RPM shebang checkTony Hutter2018-04-061-0/+10
| | | | | | | | | | | | | | | | | | The newest Fedora packaging rules print warnings for scripts using the /usr/bin/python shebang: *** WARNING: mangling shebang in /usr/bin/arc_summary.py from #!/usr/bin/python to #!/usr/bin/python2. This will become an ERROR, fix it manually! Fedora wants all cross compatible scripts to pick python3. Since we don't want our users to have to pick a specific version of python, we exclude our scripts from the RPM build check. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7360 Closes #7399
* systemd mount generator and tracking ZEDLETAntonio Russo2018-04-0612-3/+282
| | | | | | | | | | | | | | | | | | | | | | | | zfs-mount-generator implements the "systemd generator" protocol, producing systemd.mount units from the cached outputs of zfs list, during early boot, integrating with systemd. Each pool has an indpendent cache of the command zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool which is kept synchronized by the ZEDLET history_event-zfs-list-cacher.sh Datasets not in the cache will be loaded later in the boot process by zfs-mount.service, including pools without a cache. Among other things, this allows for complex mount hierarchies. Reviewed-by: Fabian Grünbichler <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #7329
* Fixes for SNPRINTF_BLKPTR with encrypted BP'sMatthew Ahrens2018-04-063-17/+27
| | | | | | | | | | | | | mdb doesn't have dmu_ot[], so we need a different mechanism for its SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated. Additionally, since it already relies on BP_IS_ENCRYPTED (etc), SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own, rather than making the caller do so. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #7390
* Fix divide-by-zero in mmp_delay_update()Olaf Faaland2018-04-061-1/+1
| | | | | | | | | | | | vdev_count_leaves() in the denominator may return 0, caught by Coverity. Introduced by * 533ea04 Update mmp_delay on sync or skipped, failed write Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7391
* Make encrypted "zfs mount -a" failures consistentTom Caputi2018-04-062-3/+26
| | | | | | | | | | | Currently, "zfs mount -a" will print a warning and fail to mount any encrypted datasets that do not have a key loaded. This patch makes the behavior of this failure consistent with other failure modes ("zfs mount -a" will silently continue, explict "zfs mount" will print a message and return an error code. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7382
* Update mmp_delay on sync or skipped, failed writeOlaf Faaland2018-04-046-47/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an MMP write is skipped, or fails, and time since mts->mmp_last_write is already greater than mts->mmp_delay, increase mts->mmp_delay. The original code only updated mts->mmp_delay when a write succeeded, but this results in the write(s) after delays and failed write(s) reporting an ub_mmp_delay which is too low. Update mmp_last_write and mmp_delay if a txg sync was successful. At least one uberblock was written, thus extending the time we can be sure the pool will not be imported by another host. Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) / vdev_count_leaves()) so that a period of frequent successful MMP writes, e.g. due to frequent txg syncs, does not result in an import activity check so short it is not reliable based on mmp thread writes alone. Remove unnecessary local variable, start. We do not use the start time of the loop iteration. Add a debug message in spa_activity_check() to allow verification of the import_delay value and to prove the activity check occurred. Alter the tests that import pools and attempt to detect an activity check. Calculate the expected duration of spa_activity_check() based on module parameters at the time the import is performed, rather than a fixed time set in mmp.cfg. The fixed time may be wrong. Also, use the default zfs_multihost_interval value so the activity check is longer and easier to recognize. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7330
* Fedora 28: Fix misc bounds check compiler warningsTony Hutter2018-04-0410-34/+77
| | | | | | | | | Fix a bunch of (mostly) sprintf/snprintf truncation compiler warnings that show up on Fedora 28 (GCC 8.0.1). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7361 Closes #7368
* Fix spa reference leak in zfs_ioc_pool_scanLOLi2018-04-031-3/+3
| | | | | | | | | | | | zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a valid pool_scrub_cmd_t: this could happen if the userland binaries and ZFS kernel module differ in version and would prevent the pool from being exported. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7380
* Fix add_nested_replacing_spare test caseLOLi2018-04-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Use 'zpool reopen' instead of 'zpool scrub' to kick in the spare device: this is required to avoid spurious failures caused by a race condition in events processing by the ZFS Event Daemon: P1 (zpool scrub) P2 (zed) --- zfs_ioc_pool_scan() -> dsl_scan() -> vdev_reopen() -> vdev_set_state(VDEV_STATE_CANT_OPEN) zfs_ioc_vdev_attach() -> spa_vdev_attach() -> dsl_resilver_restart() -> dsl_sync_task() -> dsl_scan_setup_check() <- dsl_scan_setup_check(): EBUSY Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7247 Closes #7342
* Remove ASSERT() in l2arc_apply_transforms()Tim Chase2018-03-311-4/+4
| | | | | | | | | | The ASSERT was erroneously copied from the next section of code. The buffer's size should be expanded from "psize" to "asize" if necessary. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #7375
* Decryption error handling improvementsTom Caputi2018-03-319-23/+68
| | | | | | | | | | | | | | | | | | Currently, the decryption and block authentication code in the ZIO / ARC layers is a bit inconsistent with regards to the ereports that are produces and the error codes that are passed to calling functions. This patch ensures that all of these errors (which begin as ECKSUM) are converted to EIO before they leave the ZIO or ARC layer and that ereports are correctly generated on each decryption / authentication failure. In addition, this patch fixes a bug in zio_decrypt() where ECKSUM never gets written to zio->io_error. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7372
* Encrypted dnode blocks should be prefetched rawTom Caputi2018-03-311-2/+7
| | | | | | | | | | | | | | | Encrypted dnode blocks are always initially read as raw data and converted to decrypted data when an encrypted bonus buffer is needed. This allows the DMU to be used for things like fetching the DMU master node without requiring keys to be loaded. However, dbuf_issue_final_prefetch() does not currently read the data as raw. The end result of this is that prefetched dnode blocks are read twice from disk: once decrypted and then again as raw data. This patch corrects the issue by adding the flag when appropriate. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7362
* Fix hung z_zvol tasks during 'zfs receive'LOLi2018-03-303-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During a receive operation zvol_create_minors_impl() can wait needlessly for the prefetch thread because both share the same tasks queue. This results in hung tasks: <3>INFO: task z_zvol:5541 blocked for more than 120 seconds. <3> Tainted: P O 3.16.0-4-amd64 <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. The first z_zvol:5541 (zvol_task_cb) is waiting for the long running traverse_prefetch_thread:260 root@linux:~# cat /proc/spl/taskq taskq act nthr spwn maxt pri mina spl_system_taskq/0 1 2 0 64 100 1 active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40) wait: 5541 spl_delay_taskq/0 0 1 0 4 100 1 delay: spa_deadman [zfs](0xffff880039924000) z_zvol/1 1 1 0 1 120 1 active: [5541]zvol_task_cb [zfs](0xffff88001fde6400) pend: zvol_task_cb [zfs](0xffff88001fde6800) This change adds a dedicated, per-pool, prefetch taskq to prevent the traverse code from monopolizing the global (and limited) system_taskq by inappropriately scheduling long running tasks on it. Reviewed-by: Albert Lee <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6330 Closes #6890 Closes #7343
* zfs-functions: skip lines where mntpnt is 'none'Georgy Yakovlev2018-03-301-0/+2
| | | | | | | | | | | | | | This fixes zfs-mount initscript trying to mount swap volumes as filesystems or anything that has 'none' as a mountpoint in /etc/fstab. Additionally, fixes trying to mount swap volumes as a filesystem on RHEL. RHEL defines mountpoint for swap as `swap`. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Georgy Yakovlev <[email protected]> Closes #7346 Closes #7347
* OpenZFS 9164 - assert: newds == os->os_dsl_datasetAndriy Gapon2018-03-306-25/+32
| | | | | | | | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Don Brady <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> Porting Notes: * Re-enabled and tweaked the zpool_upgrade_007_pos test case to successfully run in under 5 minutes. OpenZFS-issue: https://www.illumos.org/issues/9164 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a Closes #6112 Closes #7336
* Add support for nvme based devidsDon Brady2018-03-291-2/+13
| | | | | | | | | | | Adds a devid for nvme devices. This is very similar to how the other 'bus' (scsi|sata|usb) devids are generated. The devid resides in a name/value pair in the leaf vdevs in a zpool config. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7356
* Resolve QAT issues with incompressible dataTom Caputi2018-03-293-37/+106
| | | | | | | | | | | | | | | | | | | | | | Currently, when ZFS wants to accelerate compression with QAT, it passes a destination buffer of the same size as the source buffer. Unfortunately, if the data is incompressible, QAT can actually "compress" the data to be larger than the source buffer. When this happens, the QAT driver will return a FAILED error code and print warnings to dmesg. This patch fixes these issues by providing the QAT driver with an additional buffer to work with so that even completely incompressible source data will not cause an overflow. This patch also resolves an error handling issue where incompressible data attempts compression twice: once by QAT and once in software. To fix this issue, a new (and fake) error code CPA_STATUS_INOMPRESSIBLE has been added so that the calling code can correctly account for the difference between a hardware failure and data that simply cannot be compressed. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Weigang Li <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7338
* Fix ASSERT in dsl_scan_fini() and cleanup commentsTom Caputi2018-03-281-13/+13
| | | | | | | | | | | | | | | This patch fixes an issue where dsl_scan_prefetch_cb() might add more prefetch I/Os to the prefetch queue after prefetching has been completed. This was happening because that code was checking scn->scn_suspending instead of scn->scn_prefetch_stop. This occasionally triggered an ASSERT during ztest runs in dsl_scan_fini() when the code attempted to destroy an AVL tree that still had entires in it. This patch also includes a number of spelling corrections and comment cleanups throughout dsl_scan.c Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7353
* chmod -x on etc/init.d/zfs-*.in automake filesTony Hutter2018-03-284-0/+0
| | | | | | | | | | | | Clear executable bit on zfs-import.in, zfs-mount.in, zfs-share.in, and zfs-zed.in. These are automake files and should not be marked executable. This fixes a RPM build error on Fedora 28. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7355 Closes #7327
* Fix mmap / libaio deadlockBrian Behlendorf2018-03-2814-5/+184
| | | | | | | | | | | | | | Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7335 Closes #7339
* Remove libattr requirementDeHackEd2018-03-274-15/+2
| | | | | | | | | | RHEL/CentOS 6 supports sys/xattr.h eliminating the need for libattr-devel as a dependency. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #7344 Closes #7351
* OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by ↵Allan Jude2018-03-261-2/+2
| | | | | | | | | | | | | | | | the wrong value Authored by: Allan Jude <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9321 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18 Closes #7333
* Fedora 28: Fix "Macro %_dracutdir has empty body"Tony Hutter2018-03-251-4/+17
| | | | | | | | | | | If you run ./configure --with-config=srpm, it will not trigger the user m4 scripts to populate the dracut and udev directories. This causes a build error on Fedora 28. Make the dracut and udev lines conditional to get around this. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7326 Closes #7328
* Don't count embedded bps in read statsTom Caputi2018-03-231-1/+3
| | | | | | | | | | | | | | Currently, ZFS tracks statistics about calls to arc_read() via the /proc/spl/kstat/zfs/<pool>/reads file for debugging. Unfortunately, this file currently counts embedded bps as disk reads since they are technically processed by the ZIO layer. This pollutes the log since the ARC will never cache embedded bps. This patch corrects this issue by preventing the logging of embedded bp reads. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7334
* OpenZFS 9193 - bootcfg -C doesn't workPaul Dagnelie2018-03-221-2/+3
| | | | | | | | | | | | | | | | | | | | When given an empty string as a rootds value, bootcfg -C fails with the error message 'could not set nextboot: '' is an invalid name'. This should be allowed because it represents clearing the nextboot configuration. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Chris Williamson <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9193 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/504645d227 Closes #7230
* Clarify zpool actions for an intent log devicePeter Ashford2018-03-221-2/+3
| | | | | | | | | | Updated the "Intent Log" section of the "zpool" manual page to properly reflect the actions that may be performed. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Peter Ashford <[email protected]> Closes #6938 Closes #7318
* modprobe zfs during dracut mountkpande2018-03-221-0/+1
| | | | | | | | | | | Resolves importing root pool during boot in dracut. This case was inadvertently broken with the module autoloading change in #7287. Reviewed-by: Matthew Thode <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Kash Pande <[email protected]> Closes #7322
* enable zfs_dbgmsg() by default, without dprintf()Matthew Ahrens2018-03-215-28/+38
| | | | | | | | | | | | | | | | | | | | | | | zfs_dbgmsg() should record a message by default. As a general principal, these messages shouldn't be too verbose. Furthermore, the amount of memory used is limited to 4MB (by default). dprintf() should only record a message if this is a debug build, and ZFS_DEBUG_DPRINTF is set in zfs_flags. This flag is not set by default (even on debug builds). These messages are extremely verbose, and sometimes nontrivial to compute. SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set in zfs_flags. This flag is not set by default (even on debug builds). This brings our behavior in line with illumos. Note that the message format is unchanged (including file, line, and function, even though these are not recorded on illumos). Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #7314
* Fix spelling errors in commentsTom Caputi2018-03-214-47/+48
| | | | | | | | | | This patch simply corrects some spelling / grammar errors in the QAT and encryption code comments. No functional changes Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7319
* Add support for nvme disk detectiontimor2018-03-211-0/+14
| | | | | | | | | | | | | This treats /dev/nvme.. devices the same way as /dev/sd... devices. The motivation behind this is that whole disk detection did not work on nvme SSDs without that, because it DKC_UNKNOWN was returned for such devices. Perhaps there should be a separate DKC_ type for this, but I don't know enough about the code to know the implications of that. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: timor <[email protected]> Closes #7304