aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Fix missing dkms modules after upgradesTony Hutter2019-01-081-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If you were upgrading from say, fc28->fc29, on ZFS version X, the RPMs macros would get called like this: %post X.fc29 - This is the step where fc29 gets built by dkms. As part of the build, dkms automatically removes the previous modules before building the new ones. It then builds the new modules. %preun X.fc28 - Right before this step, X.fc29 is be built and installed, but since it has the same X, it's files get inadvertently removed by fc28's uninstall. %postun X.fc28 This patch updates %preun X.fc28 to see if we're upgrading or uninstalling. If we're uninstalling, then remove our files. If we're upgrading then do nothing, since will know dkms will have already removed our files in %post X.fc29. Note that since this fixes the %preun step, it's effect isn't going to be noticed immediately. It will only be seen when packages with this fix are upgraded to a newer version. Reviewed-by: Ralf Ertzinger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6902 Closes #8216
* Bump commit subject length to 72 charactersNeal Gompa (ニール・ゴンパ)2019-01-082-4/+4
| | | | | | | | | | | | | There's not really a reason to keep the subject length so short, since the reason to make it this short was for making nice renders of a summary list of the git log. With 72 characters, this still works out fine, so let's just raise it to that so that it's easier to give slightly more descriptive change summaries. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Neal Gompa <[email protected]> Closes #8250
* zfs.8 uses wrong snapshot names in Example 15Benjamin Gentil2019-01-071-4/+4
| | | | | | | Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: bunder2015 <[email protected]> Signed-off-by: Benjamin Gentil <[email protected]> Closes #8241
* Add 'zpool status -i' optionBrian Behlendorf2019-01-073-43/+64
| | | | | | | | | | | | | | | | | | | | | Only display the full details of the vdev initialization state in 'zpool status' output when requested with the -i option. By default display '(initializing)' after vdevs when they are being actively initialized. This is consistent with the established precident of appending '(resilvering), etc' and fits within the default 80 column terminal width making it easy to read. Additionally, updated the 'zpool initialize' documentation to make it clear the options are mutually exclusive, but allow duplicate options like all other zfs/zpool commands. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8230
* zfs initialize performance enhancementsGeorge Wilson2019-01-0710-79/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PROBLEM ======== When invoking "zpool initialize" on a pool the command will create a thread to initialize each disk. Unfortunately, it does this serially across many transaction groups which can result in commands taking a long time to return to the user and may appear hung. The same thing is true when trying to suspend/cancel the operation. SOLUTION ========= This change refactors the way we invoke the initialize interface to ensure we can start or stop the intialization in just a few transaction groups. When stopping or cancelling a vdev initialization perform it in two phases. First signal each vdev initialization thread that it should exit, then after all threads have been signaled wait for them to exit. On a pool with 40 leaf vdevs this reduces the vdev initialize stop/cancel time from ~10 minutes to under a second. The reason for this is spa_vdev_initialize() no longer needs to wait on multiple full TXGs per leaf vdev being stopped. This commit additionally adds some missing checks for the passed "initialize_vdevs" input nvlist. The contents of the user provided input "initialize_vdevs" nvlist must be validated to ensure all values are uint64s. This is done in zfs_ioc_pool_initialize() in order to keep all of these checks in a single location. Updated the innvl and outnvl comments to match the formatting used for all other new sytle ioctls. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #8230
* OpenZFS 9102 - zfs should be able to initialize storage devicesGeorge Wilson2019-01-0754-30/+2743
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PROBLEM ======== The first access to a block incurs a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are "thick provisioned", where supported by the platform (VMware). This can create a large delay in getting a new virtual machines up and running (or adding storage to an existing Engine). If the thick provision step is omitted, write performance will be suboptimal until all blocks on the LUN have been written. SOLUTION ========= This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty. When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design: - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...] - start, suspend, or cancel initialization - Creates new open-context thread for each vdev - Thread iterates through all metaslabs in this vdev - Each metaslab: - select a metaslab - load the metaslab - mark the metaslab as being zeroed - walk all free ranges within that metaslab and translate them to ranges on the leaf vdev - issue a "zeroing" I/O on the leaf vdev that corresponds to a free range on the metaslab we're working on - continue until all free ranges for this metaslab have been "zeroed" - reset/unmark the metaslab being zeroed - if more metaslabs exist, then repeat above tasks. - if no more metaslabs, then we're done. - progress for the initialization is stored on-disk in the vdev’s leaf zap object. The following information is stored: - the last offset that has been initialized - the state of the initialization process (i.e. active, suspended, or canceled) - the start time for the initialization - progress is reported via the zpool status command and shows information for each of the vdevs that are initializing Porting notes: - Added zfs_initialize_value module parameter to set the pattern written by "zpool initialize". - Added zfs_vdev_{initializing,removal}_{min,max}_active module options. Authored by: George Wilson <[email protected]> Reviewed by: John Wren Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: loli10K <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Signed-off-by: Tim Chase <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9102 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb Closes #8230
* Python 2 and 3 compatibilityBrian Behlendorf2019-01-0645-1395/+1594
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | With Python 2 (slowly) approaching EOL and its removal from distribitions already being planned (Fedora), the existing Python 2 code needs to be transitioned to Python 3. This patch stack updates the Python code to be compatible with Python 2.7, 3.4, 3.5, 3.6, and 3.7. Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Wren Kennedy <[email protected]> Reviewed-by: Antonio Russo <[email protected]> Closes #8096
| * test-runner: python3 supportJohn Wren Kennedy2019-01-061-95/+100
| | | | | | | | | | | | | | | | | | | | | | Updated to be compatible with Python 2.6, 2.7, 3.5 or newer. Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Wren Kennedy <[email protected]> Closes #8096
| * arc_summary: consolidate test caseBrian Behlendorf2019-01-067-72/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we're only installing one version of arc_summary we only need one test case. Update the test to determine which version is available and then test its supported flags. Remove files for misc tests which should have been cleaned up. Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8096
| * pyzfs: python3 support (build system)Brian Behlendorf2019-01-0629-162/+339
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Almost all of the Python code in the respository has been updated to be compatibile with Python 2.6, Python 3.4, or newer. The only exceptions are arc_summery3.py which requires Python 3, and pyzfs which requires at least Python 2.7. This allows us to maintain a single version of the code and support most default versions of python. This change does the following: * Sets the default shebang for all Python scripts to python3. If only Python 2 is available, then at install time scripts which are compatible with Python 2 will have their shebangs replaced with /usr/bin/python. This is done for compatibility until Python 2 goes end of life. Since only the installed versions are changed this means Python 3 must be installed on the system for test-runner when testing in-tree. * Added --with-python=<2|3|3.4,etc> configure option which sets the PYTHON environment variable to target a specific python version. By default the newest installed version of Python will be used or the preferred distribution version when creating pacakges. * Fixed --enable-pyzfs configure checks so they are run when --enable-pyzfs=check and --enable-pyzfs=yes. * Enabled pyzfs for Python 3.4 and newer, which is now supported. * Renamed pyzfs package to python<VERSION>-pyzfs and updated to install in the appropriate site location. For example, when building with --with-python=3.4 a python34-pyzfs will be created which installs in /usr/lib/python3.4/site-packages/. * Renamed the following python scripts according to the Fedora guidance for packaging utilities in /bin - dbufstat.py -> dbufstat - arcstat.py -> arcstat - arc_summary.py -> arc_summary - arc_summary3.py -> arc_summary3 * Updated python-cffi package name. On CentOS 6, CentOS 7, and Amazon Linux it's called python-cffi, not python2-cffi. For Python3 it's called python3-cffi or python3x-cffi. * Install one version of arc_summary. Depending on the version of Python available install either arc_summary2 or arc_summary3 as arc_summary. The user output is only slightly different. Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8096
| * pyzfs: python3 support (unit tests)Brian Behlendorf2019-01-062-944/+978
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Updated unit tests to be compatbile with python 2 or 3. In most cases all that was required was to add the 'b' prefix to existing strings to convert them to type bytes for python 3 compatibility. * There were several places where the python version need to be checked to remain compatible with pythong 2 and 3. Some one more seasoned with Python may be able to find a way to rewrite these statements in a compatible fashion. Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: John Wren Kennedy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8096
| * pyzfs: python3 support (library 2/2)Brian Behlendorf2019-01-064-23/+23
| | | | | | | | | | | | | | | | | | | | | | * All pool, dataset, and nvlist keys must be of type bytes. Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: John Kennedy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8096
| * pyzfs: python3 support (library 1/2)Antonio Russo2019-01-0613-122/+141
|/ | | | | | | | | | | | | | | | | | These changes are efficient and valid in python 2 and 3. For the most part, they are also pythonic. * 2to3 conversion * add __future__ imports * iterator changes * integer division * relative import fixes Reviewed-by: John Ramsden <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #8096
* Add zfs module feature and property compatibilityBrian Behlendorf2019-01-031-6/+30
| | | | | | | | | This change is required to ease the transition when upgrading from 0.7.x to 0.8.x. It allows 0.8.x user space utilities to remain compatible with 0.7.x and older kernel modules. Reviewed-by: Don Brady <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8231
* Add missing MMP status code to libzfs_statusbunder20152019-01-033-21/+39
| | | | | | | | | | | | When MMP was merged the status codes in libzfs_status were not updated to add the status code for ZPOOL_STATUS_IO_FAILURE_MMP. This commit corrects this and adds comments to help keep track of which code is used for which status. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #8148 Closes #8222
* Fix 'zpool remap' freeing raceBrian Behlendorf2019-01-022-10/+32
| | | | | | | | | | | | | | | | | | | | | | | | The dmu_objset_remap_indirects_impl() logic depends on dnode_hold() returning ENOENT for dnodes which will be freed and should be skipped. This behavior can only be relied upon when taking a new hold and while the caller has an open transaction. This ensures that the open txg cannot advance and that a concurrent free will end up in the same txg (which is critical). Relying on an existing hold will not prevent dnode_free() from succeeding. The solution is to take an additional dnode_hold() after assigning the transaction. This ensures the remap will never dirty the dnode if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT). Randomly set zfs_object_remap_one_indirect_delay_ms in ztest. This increases the likelihood of an operation racing with the remap. Converted from ticks to milliseconds. Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8215
* Minor tweaks to zfs.8 man page for POSIX ACLsRichard Laager2018-12-261-6/+6
| | | | | | | | | | | | | | | | * Capitalize POSIX in POSIX ACLs. This change makes the POSIX in POSIX ACLs all caps, which is both correct and consistent with the rest of the man page. * Slightly reword part of zfs.8. I tweaked a sentence to add a missing comma, and as long as I was editing, removed a couple unnecessary words. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: bunder2015 <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes #8220
* OpenZFS 9284 - arc_reclaim_thread has 2 jobsBrad Lewis2018-12-267-172/+285
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following the fix for 9018 (Replace kmem_cache_reap_now() with kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks while reaping. However, the code is still confusing and error-prone, because this thread has two responsibilities. We should instead separate this into two threads each with their own responsibility: 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which improves `arc_is_overflowing()` 2. keep enough free memory in the system, by calling `arc_kmem_reap_now()` plus `arc_shrink()`, which improves `arc_available_memory()`. Furthermore, we can use the zthr infrastructure to separate the "should we do something" from "do it" parts of the logic, and normalize the start up / shut down of the threads. Authored by: Brad Lewis <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Tim Kordas <[email protected]> Reviewed by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported-by: Brad Lewis <[email protected]> Signed-off-by: Brad Lewis <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9284 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9 Closes #8165
* Fix zfs_dirty_data_sync_percent documentationTom Caputi2018-12-181-2/+3
| | | | | | | | | | | In dfbe2675 zfs_dirty_data_sync was changed to a new tunable named zfs_dirty_data_sync_percent. Unfortunately, the module parameter documentation is the code was not updated accordingly. This patch simply corrects that. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8212
* Add enclosure_symlinks option to vdev_idTony Hutter2018-12-144-3/+112
| | | | | | | | | | | | | | | | Add an 'enclosure_symlinks' option to vdev_id.conf. This creates consistently named symlinks to the enclosure devices (/dev/sg*) based off the configuration in vdev_id.conf. The enclosure symlinks show up in /dev/by-enclosure/<prefix>-<channel><num>. The links make it make it easy to run sg_ses on a particular enclosure device. The enclosure links are created in addition to the normal /dev/disk/by-vdev links. 'enclosure_symlinks' is only valid in sas_direct configurations. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Simon Guest <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #8194
* ZTS: fix wait_scrubbed()Tom Caputi2018-12-142-10/+5
| | | | | | | | | | | | Currently, wait_scrubbed() is the only function of its kind that accepts a timeout, which is 10s by default. This timeout is pretty short for a scrub and causes test failures if we run too long. This patch removes the timeout, instead leaning on the global test suite timeout to ensure the tests keep moving. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8210
* Fix zap_update() ASSERT from ztestTom Caputi2018-12-141-12/+0
| | | | | | | | | | | This patch simply removes an invalid assert from the zap_update() function. The ASSERT is invalid because it does not hold the zap lock from the time it fetches the old value to the time it confirms that it is what it should be. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8209
* ztest: ENOSPC in ztest_objset_destroy_cb()Brian Behlendorf2018-12-141-2/+6
| | | | | | | | | | | | While unlikely it is possible for dsl_destroy_head() to return ENOSPC in the ztest_objset_destroy_cb(). This can occur even when ZFS_SPACE_CHECK_DESTROY is used with the dsl_sync_task(). Both the existence of a checkpoint and pending deferred frees can cause this. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8206
* OpenZFS 9559 - zfs diff handles files on delete queue in fromsnap poorlyPaul Dagnelie2018-12-141-6/+6
| | | | | | | | | | | | Authored by: Paul Dagnelie <[email protected]> Reviewed by: Joshua M. Clulow <[email protected]> Reviewed by: Tom Caputi <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9559 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d7e45412 Closes #8211
* OpenZFS 9630 - add lzc_rename and lzc_destroy to libzfs_coreAndriy Gapon2018-12-148-51/+81
| | | | | | | | | | | | | | | | | | | | Porting Notes: * Additional changes to recv_rename_impl() were required due to encryption code not being merged in OpenZFS yet. * libzfs_core python bindings (pyzfs) were updated to fully support both lzc_rename() and lzc_destroy() Authored by: Andriy Gapon <[email protected]> Reviewed by: Andy Stormont <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: loli10K <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9630 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63 Closes #8207
* Add `cut` binary to the initramfsBen Cordero2018-12-132-2/+3
| | | | | | | | | | | | | | Since the `cut -b` command is used by `parse-zfs.sh`, ensure that it is copied to the initramfs. Fix spl_hostid when set by cmdline. This follows a similar logic from the `zgenhostid` script, using `echo` instead of `printf`. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ben Cordero <[email protected]> Closes #8197
* Fix resilver writes in vdev_indirect_io_startTom Caputi2018-12-131-8/+15
| | | | | | | | | | | | | This patch addresses an issue found in ztest where resilver write zios that were passed to an indirect vdev would end up being handled as though they were resilver read zios. This caused issues where the zio->io_abd would be both read to and written from at the same time, causing asserts to fail. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8193
* Check for strlcat and strlcpyBrian Behlendorf2018-12-116-50/+126
| | | | | | | | | | | | | | | | This partially reverts commit 8005ca4 by moving the strlcat() and strlcpy() compatibility implementations back to their original location. In addition, these two functions were added to the AC_CHECK_FUNCS macro. When these functions are available from the C library, HAVE_STRLCAT and HAVE_STRLCPY will be defined and library version used. Otherwise the compatibility version is built. Reviewed-by: Sebastian Gottschall <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8157 Closes #8202
* Seeing negative values for wlentime and rlentimeRichard Elling2018-12-111-3/+4
| | | | | | | | | | | | | | | | Linux kstat IO and TIMER printed values as signed. However the counters only increment. Thus humans looking at the data can be confused when the counters roll over. Note: The recommended use of these values is to monitor the derivative, which don't really care about the sign. See explanations related to non-negative derivatives in the various time-series databases. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8131 Closes #8198
* Rename macro ZFS_MINOR due to Lustre conflictOlaf Faaland2018-12-102-8/+8
| | | | | | | | | | | | | | | | | Macro ZFS_MINOR, introduced in commit a6cc9756 to record the chosen static minor number for /dev/zfs, conflicts with an existing macro in Lustre. The lustre macro (along with _MAJOR, _PATCH, _FIX) is used to record the zfsonlinux version Lustre is being built against. Since the Lustre macro came first, and is used in past versions of lustre at least going back to 2.10, it makes sense to rename the macro in ZFS instead of doing so in Lustre which would require backporting the patch. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #8195
* OpenZFS 9962 - zil_commit should omit cache thrashPrakash Surya2018-12-076-78/+206
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a result of the changes made in 8585, it's possible for an excessive amount of vdev flush commands to be issued under some workloads. Specifically, when the workload consists of mostly async write activity, interspersed with some sync write and/or fsync activity, we can end up issuing more flush commands to the underlying storage than is actually necessary. As a result of these flush commands, the write latency and overall throughput of the pool can be poorly impacted (latency increases, throughput decreases). Currently, any time an lwb completes, the vdev(s) written to as a result of that lwb will be issued a flush command. The intenion is so the data written to that vdev is on stable storage, prior to communicating to any waiting threads that their data is safe on disk. The problem with this scheme, is that sometimes an lwb will not have any threads waiting for it to complete. This can occur when there's async activity that gets "converted" to sync requests, as a result of calling the zil_async_to_sync() function via zil_commit_impl(). When this occurs, the current code may issue many lwbs that don't have waiters associated with them, resulting in many flush commands, potentially to the same vdev(s). For example, given a pool with a single vdev, and a single fsync() call that results in 10 lwbs being written out (e.g. due to other async writes), that will result in 10 flush commands to that single vdev (a flush issued after each lwb write completes). Ideally, we'd only issue a single flush command to that vdev, after all 10 lwb writes completed. Further, and most important as it pertains to this change, since the flush commands are often very impactful to the performance of the pool's underlying storage, unnecessarily issuing these flush commands can poorly impact the performance of the lwb writes themselves. Thus, we need to avoid issuing flush commands when possible, in order to acheive the best possible performance out of the pool's underlying storage. This change attempts to address this problem by changing the ZIL's logic to only issue a vdev flush command when it detects an lwb that has a thread waiting for it to complete. When an lwb does not have threads waiting for it, the responsibility of issuing the flush command to the vdevs involved with that lwb's write is passed on to the "next" lwb. It's only once a write for an lwb with waiters completes, do we issue the vdev flush command(s). As a result, now when we issue the flush(s), we will issue them to the vdevs involved with that specific lwb's write, but potentially also to vdevs involved with "previous" lwb writes (i.e. if the previous lwbs did not have waiters associated with them). Thus, in our prior example with 10 lwbs, it's only once the last lwb completes (which will be the lwb containing the waiter for the thread that called fsync) will we issue the vdev flush command; all of the other lwbs will find they have no waiters, so they'll pass the responsibility of the flush to the "next" lwb (until reaching the last lwb that has the waiter). Porting Notes: * Reconciled conflicts with the fastwrite feature. Authored by: Prakash Surya <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Ported-by: Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9962 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6 Closes #8188
* OpenZFS 9963 - Separate tunable for disabling ZIL vdev flushPrakash Surya2018-12-073-12/+35
| | | | | | | | | | | | | | | | | | | | Porting Notes: * Add options to zfs-module-parameters(5) man page. * zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since the latter doesn't get built for user space. Authored by: Prakash Surya <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: George Melikov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9963 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125 Closes #8186
* OpenZFS 9993 - zil writes can get delayed in zio pipelineGeorge Wilson2018-12-071-1/+2
| | | | | | | | | | | | | | | Authored by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: George Melikov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9993 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2258ad0b Closes #8185
* OpenZFS 9880 - Race in ZFS parallel mountAndy Fiddaman2018-12-071-3/+31
| | | | | | | | | | | | | | | | | Porting Notes: * Not required for Linux since the zone is always global. But we'll want this change if we start using the zones code. Authored by: Andy Fiddaman <[email protected]> Reviewed by: Jason King <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Tom Caputi <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9880 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bc4c0ff134 Closes #8189
* Fix error message when zfs module is not loadedTom Caputi2018-12-071-3/+3
| | | | | | | | | | | This patch corrects a small issue where the wrong error message was being displayed when the zfs kernel module was not loaded. This also avoids waiting for the (by default) 10s timeout to see if the /dev/zfs device appears. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8187
* Do not enable stack tracer for ZFS performance testTony Nguyen2018-12-072-6/+21
| | | | | | | | | | | | | | Linux ZFS test suite runs with /proc/sys/kernel/stack_tracer_enabled=1, via zfs.sh script, which has negative performance impact, up to 40%. Since large stack is a rare issue now, preferred behavior would be: - making stack tracer an opt-in feature for zfs.sh - zfs-test.sh enables stack tracer only when requested Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> #8173
* Ensure dsl scan prefetch queue is emptiedTom Caputi2018-12-061-0/+20
| | | | | | | | | This patch simply ensures that scn->scn_prefetch_queue is emptied before the kernel module is unloaded and when scanning completes. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8178
* Fix 'zfs receive -F' message when destination has snapshotsloli10K2018-12-051-1/+1
| | | | | | | | | | | | | | | | | | When receiving a send stream with forced rollback on a dataset with snapshots zfs suggests said snapshots must be removed to successfully receive the stream; however the message is misleading because it prints the dataset name instead of one of its snapshots. $ sudo zfs snap pp/recvfs@snap-orig $ sudo zfs recv -F pp/recvfs < sendstream cannot receive new filesystem stream: destination has snapshots (eg. pp/recvfs) must destroy them to overwrite it This change simply restores the snapshot name in the error message. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8167
* Use autoconf variable for C preprocessorBen Wolsieffer2018-12-051-1/+1
| | | | | | | | | | This fixes the build when cross-compiling, where the preprocessor might be prefixed. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Ben Wolsieffer <[email protected]> Closes #8180
* Move assert in dump_dir() in zdbTom Caputi2018-12-051-3/+3
| | | | | | | | | | | This one line patch moves an assert in the function dump_dir() below an error check that ensures it ran correctly. This ensures zdb dumps the error that actually caused the problem, as opposed to one of its symptoms. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8171
* Fix dnode_hold() freeing dnode behaviorBrian Behlendorf2018-12-051-2/+4
| | | | | | | | | | | | | | | | | | | | | Commit 4c5b89f59 refactored dnode_hold() and in the process accidentally introduced a slight change in behavior which was not intended. The required behavior is that once the ZPL, or other consumer, declares its intent to free a dnode then dnode_hold() should immediately start failing. This updated code wouldn't return the failure until after it was freed. When DNODE_MUST_BE_ALLOCATED is set it must return ENOENT, and when DNODE_MUST_BE_FREE is set it must return EEXIST; This issue was uncovered by ztest_remap() which attempted to remap a freeing object which should have been skipped as described by the comment in dmu_objset_remap_indirects_impl(). Reviewed-by: George Melikov <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8172
* Fix 'zpool list -v' alignmentBrian Behlendorf2018-12-042-52/+111
| | | | | | | | | | | | | | | The verbose output of 'zpool list' was not correctly aligned due to differences in the vdev name lengths. Minimally update the code the correct the alignment using the same strategy employed by 'zpool status'. Missing dashes were added for the empty defaults columns, and the vdev state is now printed for all vdevs. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7308 Closes #8147
* zfs-functions.in: is_mounted() always returns 1TerraTech2018-12-041-2/+7
| | | | | | | | | | | | | | The 'while read line; ...; done' loop is run in a piped subshell therefore the 'return 0' would not cause a return from the is_mounted() function. In all cases, this function will always return 1. The fix is to 'return 1' from the subshell on a successful match (no match == return 0), and then negating the final return value. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: TerraTech <[email protected]> Closes #8151
* Fix ztest deadlock in spa_vdev_remove()Tom Caputi2018-12-041-12/+19
| | | | | | | | | | | | | | This patch corrects an issue where spa_vdev_remove() would call spa_history_log_internal() while holding the spa config lock. This function may decide to block until the next txg if the current one seems too full. However, since the thread is holding the config log, the txg sync thread cannot progress and the system ends up deadlocked. This patch simply moves all calls to spa_history_log_internal() outside of the config lock. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8162
* Fix ztest deadlock in ztest_zil_remount()Tom Caputi2018-12-041-0/+8
| | | | | | | | | | | | This patch fixes a small race condition in ztest_zil_remount() that could result in a deadlock. ztest_device_removal() calls spa_vdev_remove() which may eventually call spa_reset_logs(). If ztest_zil_remount() attempts to call zil_close() while this is happening, it may fail when it asserts !zilog_is_dirty(zilog). This patch simply adds locking to correct the issue. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8154
* Fix ASSERT in zfs_receive_one()LOLi2018-12-045-5/+128
| | | | | | | | | | | | | | | This commit fixes the following ASSERT in zfs_receive_one() when receiving a send stream from a root dataset with the "-e" option: $ sudo zfs snap source@snap $ sudo zfs send source@snap | sudo zfs recv -e destination/recv chopprefix > drrb->drr_toname ASSERT at libzfs_sendrecv.c:3804:zfs_receive_one() Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8121
* Detect IO errors during device removalBrian Behlendorf2018-12-049-19/+350
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Detect IO errors during device removal While device removal cannot verify the checksums of individual blocks during device removal, it can reasonably detect hard IO errors from the leaf vdevs. Failure to perform this error checking can result in device removal completing successfully, but moving no data which will permanently corrupt the pool. Situation 1: faulted/degraded vdevs In the configuration shown below, the removal of mirror-0 will permanently corrupt the pool. Device removal will preferentially copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. Which in this case will result in nothing being copied since one vdev in each of those groups in unavailable. However, device removal will complete successfully since all IO errors are ignored. tank DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 /var/tmp/vdev1 FAULTED 0 0 0 external fault /var/tmp/vdev2 ONLINE 0 0 0 mirror-1 DEGRADED 0 0 0 /var/tmp/vdev3 ONLINE 0 0 0 /var/tmp/vdev4 FAULTED 0 0 0 external fault This issue is resolved by updating the source child selection logic to exclude unreadable leaf vdevs. Additionally, unwritable destination child vdevs which can never succeed are skipped to prevent generating a large number of write IO errors. Situation 2: individual hard IO errors During removal if an unexpected hard IO error is encountered when either reading or writing the child vdev the entire removal operation is cancelled. While it may be possible to reconstruct the data after removal that cannot be guaranteed. The only strictly safe thing to do is to cancel the removal. As a future improvement we may want to instead suspend the removal process and allow the damaged region to be retried. But that work is left for another time, hard IO errors during the removal process are expected to be exceptionally rare. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #6900 Closes #8161
* Fix consistency of ztest_device_removal_activeTom Caputi2018-11-283-10/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | ztest currently uses the boolean flag ztest_device_removal_active to protect some tests that may not run successfully if they occur at the same time as ztest_device_removal(). Unfortunately, in the event that ztest is in the middle of a device removal when it decides to issue a SIGKILL, the device removal will be automatically restarted (without setting the flag) when the pool is re-imported on the next run. This patch corrects this by ensuring that any in-progress removals are completed before running further tests after the re-import. This patch also makes a few small changes to prevent race conditions involving the creation and destruction of spa->spa_vdev_removal, since this field is not protected by any locks. Some checks that may run concurrently with setting / unsetting this field have been updated to check spa->spa_removing_phys.sr_state instead. The most significant change here is that spa_removal_get_stats() no longer accounts for in-flight work done, since that could result in a NULL pointer dereference. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8105
* zfs_dbgmsg() is not safe from every contextLOLi2018-11-281-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit reverts to using printk() instead of zfs_dbgmsg() to log messages in vdev_disk_error(): this is necessary because the latter can be called from interrupt context where we are not allowed to sleep. Unfortunately zfs_dbgmsg() performs its allocations calling kmalloc() with the KM_SLEEP flag which may result in the following oops: BUG: scheduling while atomic: swapper/4/0/0x10000100 Call Trace: <IRQ> [<0>] dump_stack+0x19/0x1b ... [<0>] spl_kmem_alloc+0xdf/0x140 [spl] <-- kmem_alloc(size, KM_SLEEP) [<0>] __dprintf+0x69/0x150 [zfs] [<0>] ? kmem_cache_free+0x1e2/0x200 [<0>] vdev_disk_error.part.15+0x5f/0x70 [zfs] [<0>] vdev_disk_io_flush_completion+0x48/0x70 [zfs] [<0>] bio_endio+0x67/0xb0 [<0>] blk_update_request+0x90/0x360 ... [<0>] scsi_finish_command+0xdc/0x140 [<0>] scsi_softirq_done+0x132/0x160 [<0>] blk_done_softirq+0x96/0xc0 [<0>] __do_softirq+0xf5/0x280 [<0>] call_softirq+0x1c/0x30 [<0>] do_softirq+0x65/0xa0 [<0>] irq_exit+0x105/0x110 [<0>] do_IRQ+0x56/0xf0 [<0>] common_interrupt+0x162/0x162 <EOI> [<0>] ? cpuidle_enter_state+0x54/0xd0 [<0>] cpuidle_idle_call+0xde/0x230 [<0>] arch_cpu_idle+0xe/0xb0 [<0>] cpu_startup_entry+0x14a/0x1e0 [<0>] start_secondary+0x1f7/0x270 [<0>] start_cpu+0x5/0x14 Reviewed-by: Olaf Faaland <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8137 Closes #8150
* Remove races from scrub / resilver testsTom Caputi2018-11-2814-83/+65
| | | | | | | | | | | | | | | | | | | | | | Currently, several tests in the ZFS Test Suite that attempt to test scrub and resilver behavior occasionally fail. A big reason for this is that these tests use a combination of zinject and zfs_scan_vdev_limit to attempt to slow these operations enough to verify their test commands. This method works most of the time, but provides no guarantees and leads to flaky behavior. This patch adds a new tunable, zfs_scan_suspend_progress, that ensures that scans make no progress, guaranteeing that tests can be run without racing. This patch also changes zfs_remove_max_bytes_pause to match this new tunable. This provides some consistency between these two similar tunables and ensures that the tunable will not misbehave on 32-bit systems. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8111