aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Fix txg_wait_open() load average inflationBrian Behlendorf2019-04-042-5/+25
| | | | | | | | | | | | | | | | | Callers of txg_wait_open() which set should_quiesce=B_TRUE should be accounted for as iowait time. Otherwise, the caller is understood to be idle and cv_wait_sig() is used to prevent incorrectly inflating the system load average. Similarly txg_wait_wait() has been updated to use cv_wait_io() to be accounted against iowait. Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8550 Closes #8558
* Move dracut specifics to dracut moduleMichael Niewöhner2019-04-026-5/+25
| | | | | | | | | | | | Dracut depends on the environment variable BOOTFS to be set after pool import. This dracut specific systemd ExecStartPost command should not be called for any non-dracut systems, so let's move it to a static systemd unit that. Reviewed-by: Manuel Amador (Rudd-O) <[email protected]> Reviewed-by: Matthew Thode <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #8510
* Fix man(1) warningsJosh Soref2019-04-021-3/+3
| | | | | | | | | | | The macOS man app strenuously objects to blank lines in man files. mdoc warning: Empty input line #xyz Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: bunder2015 <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Josh Soref <[email protected]> Closes #8559
* Append snapshot name to "TIME SENT SNAPSHOT" outputTerraTech2019-04-011-1/+2
| | | | | | | | | Simply appends zhp->zfs_name to the "TIME SENT SNAPSHOT" output. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: TerraTech <[email protected]> Closes #8543
* Do not iterate through filesystems unnecessarilyTom Caputi2019-04-014-6/+45
| | | | | | | | | | | | | | | | | | | | | | | | Currently, when attempting to list snapshots ZFS may do a lot of extra work checking child datasets. This is because the code does not realize that it will not be able to reach any snapshots contained within snapshots that are at the depth limit since the snapshots of those datasets are counted as an additional layer deeper. This patch corrects this issue. In addition, this patch adds the ability to do perform the commands: $ zfs list -t snapshot <dataset> $ zfs get -t snapshot <prop> <dataset> as a convenient way to list out properties of all snapshots of a given dataset without having to use the depth limit. Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8539
* Fix systemd-import servicesMichael Niewöhner2019-03-292-2/+2
| | | | | | | | | | | | | | | | On debian, systemd complains about missing /bin/awk because it actually is located at /usr/bin/awk. It is not a good idea to hardcode binary paths because different linux distros use different paths. According to systemd's man page it is absolutely safe to miss paths for binaries located at standard locations (/bin, /sbin, /usr/bin, ...). Further, replace this more or less complicated awk command by grep. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Issue #8510
* Remove hard dependency on bashMichael Niewöhner2019-03-292-2/+2
| | | | | | | | | | zfs-import-* services have a hard dependency on bash while not everyone has bash installed. At this point /bin/sh is sufficient, so use that. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Issue #8510
* Update raw send documentationTom Caputi2019-03-291-4/+14
| | | | | | | | | | This patch simply clarifies some of the limitations related to raw sends in the man page. No functional changes. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jason Cohen <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8503 Closes #8544
* Add TRIM supportBrian Behlendorf2019-03-2991-430/+5584
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Contributions-by: Saso Kiselkov <[email protected]> Contributions-by: Tim Chase <[email protected]> Contributions-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8419 Closes #598
* Send stream should only list included snapsTom Caputi2019-03-281-24/+83
| | | | | | | | | | | | | | | | | Currently, zfs send streams will include a list of all snapshots on the source side if the '-p' option is provided. This can cause performance problems on the receive side, especially if those snapshots aren't present on the destination. These problems arise because guid_to_name(), which is used for several receive side functions, will search the entire receive-side pool if it can't find a snapshot with a matching guid. This patch corrects the issue by ensuring only streams that require this list of snapshots include them. Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8533
* Fix issues with truncated files in raw sendsTom Caputi2019-03-277-50/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a few issues with raw receives involving truncated files: * dnode_reallocate() now calls dnode_set_blksz() instead of dnode_setdblksz(). This ensures that any remaining dbufs with blkid 0 are resized along with their containing dnode upon reallocation. * One of the calls to dmu_free_long_range() in receive_object() needs to check that the object it is about to free some contents or hasn't been completely removed already by a previous call to dmu_free_long_object() in the same function. * The same call to dmu_free_long_range() in the previous point needs to ensure it uses the object's current block size and not the new block size. This ensures the blocks of the object that are supposed to be freed are completely removed and not simply partially zeroed out. This patch also adds handling for DRR_OBJECT_RANGE records to dprintf_drr() for debugging purposes. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7378 Closes #8528
* Update valid vdev types for get_disklistRichard Elling2019-03-261-1/+1
| | | | | | | Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: John Kennedy <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8532
* ZTS: Detect e2fsprogs verity issueBrian Behlendorf2019-03-262-0/+20
| | | | | | | | | | | | | | | | | | The projectid_001_pos and projecttree_001_pos test cases use the lsattr command to detect that the project quota bit is set correctly. Due to a bug in e2fsprogs-1.44.4 setting the Project 'P' bit also results in the Verity 'V' bit being reported as set. This will result in the test case failing. The issue has been resolved in e2fsprogs but in order to avoid testing failures these two test cases are skipped when e2fsprogs-1.44.4 is installed. https://github.com/tytso/e2fsprogs/commit/7e5a95e3d Reviewed-by: Richard Elling <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8534
* Correct a very minor grammar issueEvan Allrich2019-03-261-2/+2
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Evan Allrich <[email protected]> Closes #8535
* git ignore python 3.7 virtual environment directoriesRichard Elling2019-03-251-0/+2
| | | | | | Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8525
* Fix vd_path and error in spa_vdev_remove()Igor K2019-03-221-5/+12
| | | | | | | | | | | Make a local copy of the vd_path and preserve the removal error for use in spa_history_log_internal(). This is required because after spa_vdev_exit() there is nothing preventing the vdev state from changing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes #8522
* Panic when running 'zpool split'Roman Strashkin2019-03-221-0/+12
| | | | | | | | | | | Added missing remove of detachable VDEV from txg's DTL list to avoid use-after-free for the split VDEV Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Roman Strashkin <[email protected]> Closes #5565 Closes #7856
* ZFS Reads may result in unneccesary calls to zil_commitGeorge Wilson2019-03-223-3/+9
| | | | | | | | | | | | | | | ZFS supports O_RSYNC for read operations and when specified will ensure the same level of data integrity that O_DSYNC and O_SYNC provides for writes. O_RSYNC by itself has no effect so it must be combined with either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect and causes unnecessary calls to zil_commit. Only platforms which support O_RSYNC should implement the zil_commit functionality in the read code path. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #8523
* MMP interval and fail_intervals in uberblockOlaf Faaland2019-03-2117-175/+692
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When Multihost is enabled, and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, is not enough information. If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but does not depend on ub_mmp_delay: zfs_multihost_fail_intervals * zfs_multihost_interval and if zfs_multihost_fail_intervals == 0 on that node, there is no such well defined safe duration, but the importing host cannot tell whether mmp_delay is high due to I/O delays, or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary. This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. It calculates the activity test duration differently depending on whether the new fields are present or not; for importing pools with only ub_mmp_delay, it uses (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals Which results in an activity test duration less sensitive to the leaf count. In addition, it makes a few other improvements: * It updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner, when the pool is idle, as it is not limited to the granularity of ub_timestamp (1 second). * It issues writes immediately when zfs_multihost_interval is changed so remote hosts see the updated value as soon as possible. * It fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension. * Update tests to verify activity check duration is based on recorded tunable values, not tunable values on importing host. * Update tests to verify the expected number of uberblocks have valid MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number), that sequence number is incrementing, and that uberblock values match tunable settings. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7842
* Mutex leak in dsl_dataset_hold_obj()Jorgen Lundman2019-03-211-0/+4
| | | | | | | | | | | | | | In addition to dsl_dataset_evict_async() releasing a hold, there is an error case in dsl_dataset_hold_obj() which had missed 4 additional release calls. This was introduced in a1d477c24. openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c Authored by: Jorgen Lundman <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Closes #8517
* QAT: Allocate digest_buffer using QAT_PHYS_CONTIG_ALLOC()cfzhu2019-03-211-1/+10
| | | | | | | | | | | | | If the buffer 'digest_buffer' is allocated in the qat_checksum() stack, it can't ensure that the address is physically contiguous, and the DMA result of the buffer may be handled incorrectly. Using QAT_PHYS_CONTIG_ALLOC() ensures a physically contiguous allocation. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Chengfei, Zhu <[email protected]> Closes #8323 Closes #8521
* Report holes when there are only metadata changesBrian Behlendorf2019-03-211-3/+28
| | | | | | | | | | | | Update the dirty check in dmu_offset_next() such that dnode's are only considered dirty for the purpose or reporting holes when there are pending data blocks or frees to be synced. This ensures that when there are only metadata updates to be synced (atime) that holes are reported. Reviewed-by: Debabrata Banerjee <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6958 Closes #8505
* Improve `zpool labelclear`Brian Behlendorf2019-03-2112-58/+246
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) As implemented the `zpool labelclear` command overwrites the calculated offsets of all four vdev labels even when only a single valid label is found. If the device as been re-purposed but still contains a valid label this can result in space no longer owned by ZFS being zeroed. Prevent this by verifying every label removed is intact before it's overwritten. 2) Address a small bug in zpool_do_labelclear() which prevented labelclear from working on file vdevs. Only block devices support BLKFLSBUF, try the ioctl() but when it's reported as unsupported this should not be fatal. 3) Fix `zpool labelclear` so it can be run on vdevs which were removed from the pool with `zpool remove`. Additionally, allow intact but partial labels to be cleared as in the case of a failed `zpool attach` or `zpool replace`. 4) Remove LABELCLEAR and LABELREAD variables for test cases. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8500 Closes #8373 Closes #6261
* Add missing dmu_zfetch_fini() in dnode_move_impl()Julian Heuking2019-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | As it turns out, on the Windows platform when rw_init() is called (rather its bedrock call ExInitializeResourceLite) it is placed on an active-list of locks, and is removed at rw_destroy() time. dnode_move() has logic to copy over the old-dnode to new-dnode, including calling dmu_zfetch_init(new-dnode). But due to the missing dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the memory (and in debug builds fill pattern 0xdeadbeef) over the Windows active-lock's prev/next list pointers, making Windows sad. But on other platforms, the contents of dmu_zfetch_fini() is one call to list_destroy() and one to rw_destroy(), which is effectively a no-op call and is not required. This commit is mostly for "correctness" and can be skipped there. Porting Notes: * This leak exists on Linux but currently can never happen because the dnode_move() functionality is not supported. openzfsonosx-commit: openzfsonosx/zfs@d95fe517 Authored by: Julian Heuking <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #8519
* Add space in error messageTom Caputi2019-03-191-2/+2
| | | | | | | | | | | This patch simply adds a missing space in the ZFS_ERR_FROM_IVSET_GUID_MISSING error message. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8514
* Fix l2arc_evict() destroy raceBrian Behlendorf2019-03-151-24/+30
| | | | | | | | | | | | | | | | | When destroying an arc_buf_hdr_t its identity cannot be discarded until it is entirely undiscoverable. This not only includes being unhashed, but also being removed from the l2arc header list. Discarding the header's identify prematurely renders the hash lock useless because it will always hash to bucket zero. This change resolves a race with l2arc_evict() by discarding the identity after it has been removed from the l2arc header list. This ensures either the header is not on the list or contains the correct identify. Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7688 Closes #8144
* Multiple DVA Scrubbing FixTom Caputi2019-03-1510-95/+415
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there is an issue in the sequential scrub code which prevents self healing from working in some cases. The scrub code will split up all DVA copies of a bp and issue each of them separately. The problem is that, since each of the DVAs is no longer associated with the others, the self healing code doesn't have the opportunity to repair problems that show up in one of the DVAs with the data from the others. This patch fixes this issue by ensuring that all IOs issued by the sequential scrub code include all DVAs. Initially, only the first DVA of each is attempted. If an issue arises, the IO is retried with all available copies, giving the self healing code a chance to correct the issue. To test this change, this patch also adds the ability for zinject to specify individual DVAs to inject read errors into. We then add a new test case that utilizes this functionality to ensure scrubs and self-healing reads can handle and transparently fix issues with individual copies of blocks. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8453
* Make zpool status counters match error events countTony Hutter2019-03-146-16/+176
| | | | | | | | | | | | | | | | | | | The number of IO and checksum events should match the number of errors seen in zpool status. Previously there was a mismatch between the two counts because zpool status would only count unrecovered errors, while zpool events would get an event for *all* errors (recovered or not). This lead to situations where disks could be faulted for "too many errors", while at the same time showing zero errors in zpool status. This fixes the zpool status error counters to increment at the same times we post the error events. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #4851 Closes #7817
* Fix memory leaks in zfsvfs_create_impl()Tom Caputi2019-03-141-9/+14
| | | | | | | | | | This patch simply fixes some small memory leaks that can happen during error handling in zfsvfs_create_impl(). If the function fails, it frees all the memory / references it created. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8490
* zfs-import: should be before swapHenrik Riomar2019-03-141-0/+1
| | | | | | | | zfs-import must be done before swap in order for swap on zvol to work Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Henrik Riomar <[email protected]> Closes #8502
* Better user experience for errata 4Tom Caputi2019-03-144-12/+39
| | | | | | | | | | | | | | | | | | | | This patch attempts to address some user concerns that have arisen since errata 4 was introduced. * The errata warning has been made less scary for users without any encrypted datasets. * The errata warning now clears itself without a pool reimport if the bookmark_v2 feature is enabled and no encrypted datasets exist. * It is no longer possible to create new encrypted datasets without enabling the bookmark_v2 feature, thus helping to ensure that the errata is resolved. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Issue ##8308 Closes #8504
* Update commented zed.rc values to defaultskpande2019-03-141-3/+4
| | | | | | | | | Update zed.rc values reflect their default value. This helps avoid confusion if a user expects functionality to be enabled. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Kash Pande <[email protected]> Closes #8498
* Use 'printf %s' instead of 'echo -n' for compatibilityIgor K2019-03-133-5/+8
| | | | | | | | | | | The ksh 'echo -n' behavior on Illumos and Linux differs. For compatibility with others platforms switch to "printf '%s' ". Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes #8501
* Add separate aggregation limit for non-rotating mediaAlexander Motin2019-03-132-1/+21
| | | | | | | | | | | | | | | Before sequential scrub patches ZFS never aggregated I/Os above 128KB. Sequential scrub bumped that to 1MB, supposedly to reduce number of head seeks for spinning disks. But for SSDs it makes little to no sense, especially on FreeBSD, where due to MAXPHYS limitation device will likely still see bunch of 128KB I/Os instead of one large. Having more strict aggregation limit for SSDs allows to avoid allocation of large memory buffer and copy to/from it, that is a serious problem when throughput reaches gigabytes per second. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #8494
* Update CONTRIBUTING to point users to IRC as well as mailing listkpande2019-03-131-4/+5
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Kash Pande <[email protected]> Closes #8466
* Make zstreamdump -v more greppableTom Caputi2019-03-133-32/+11
| | | | | | | | | | | | | | | | Currently, the verbose output of zstreamdump includes new line characters within some individual records. Presumably, this was originally done to keep the output from getting too wide to fit on a terminal. However, since new flags and struct members have been added, these rules have not been maintained consistently. In addition, these newlines can make it hard to grep the output in some scenarios. This patch simply removes these newlines, making the output easier to grep and removing the inconsistency. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed by: Allan Jude <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8493
* OpenZFS 9914 - NV_UNIQUE_NAME_TYPE broken after 9580Andrew Stormont2019-03-131-1/+2
| | | | | | | | | | | | | | | Authored by: Andrew Stormont <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Reviewed by: Andy Fiddaman <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9914 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b8a5bee18 Closes #8496
* Detect and prevent mixed raw and non-raw sendsTom Caputi2019-03-1327-53/+607
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know about how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset. This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem. This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8308
* Add bookmark v2 on-disk featureTom Caputi2019-03-135-3/+59
| | | | | | | | | | | | | | This patch adds the bookmark v2 feature to the on-disk format. This feature will be needed for the upcoming redacted sends and for an upcoming fix that for raw receives. The feature is not currently used by any code and thus this change is a no-op, aside from the fact that the user can now enable the feature. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Issue #8308
* Fix handling of maxblkid for raw sendsTom Caputi2019-03-1310-55/+156
| | | | | | | | | | | | | | | | | | | | | | Currently, the receive code can create an unreadable dataset from a correct raw send stream. This is because it is currently impossible to set maxblkid to a lower value without freeing the associated object. This means truncating files on the send side to a non-0 size could result in corruption. This patch solves this issue by adding a new 'force' flag to dnode_new_blkid() which will allow the raw receive code to force the DMU to accept the provided maxblkid even if it is a lower value than the existing one. For testing purposes the send_encrypted_files.ksh test has been extended to include a variety of truncated files and multiple snapshots. It also now leverages the xattrtest command to help ensure raw receives correctly handle xattrs. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8168 Closes #8487
* Fix typo in arc_summary3jwittlincohen2019-03-131-1/+1
| | | | | | | | | | This is a simple fix for a typo ("perfetch" rather than "prefetch") in arc_summary3. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Jason Cohen <[email protected]> Closes #8499
* Increase default zfs_multihost_fail_intervals and import_intervalsOlaf Faaland2019-03-132-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, when multihost is enabled for a pool, the pool is suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms pass without a successful MMP write. This is the recommended configuration. The default value for zfs_multihost_fail_intervals has been 5, and the default value for zfs_multihost_interval has been 1000, so pool suspension occurred at 5 seconds. There have been multiple cases where a single misbehaving device in a pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This in turn caused MMP to suspend the pool. In the cases observed, the rest of the devices were healthy and the pool was otherwise correctly performing I/O. The reset was handled correctly by ZFS, and by suspending the pool MMP made replacing the device more difficult as well as forcing the host to be rebooted. Increase the default value of zfs_multihost_fail_intervals to 10, so that MMP tolerates up to 10 seconds of failed MMP writes before suspending the pool. Increase the default value of zfs_multihost_import_intervals to 20, to maintain the 2:1 safety factor. This results in a force import taking approximately 20 seconds when MMP is enabled, with default values. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7709 Closes #8495
* Fix most zfs_arc_* mod params not actually being modifiable at runtimeJustin Gottula2019-03-121-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the zfs_arc_* module parameters do not have their values used by the ARC code directly. Instead, there is a function, arc_tuning_update, which is called during module initialization and periodically thereafter, whose job is to fetch the module parameter values, clamp/ limit them appropriately, and then assign those values to a separate set of internal variables that are actually referenced by the ARC code. Commit 3ec34e55 featured an overhaul of arc_reclaim_thread, which is the former location where the post-init-time calls to arc_tuning_update would occur. The rework split the work previously done by the arc_reclaim_thread into a pair of replacement threads; and unfortunately, the call to arc_tuning_update fell through the cracks and was lost in the reorganization. This meant that changing almost any ARC-related zfs module parameter via /sys/module/zfs/parameters/ would result in the module parameter value itself appearing to change; however the modification would not actually propagate to the ARC code and have any real effect. This commit reinstates the post-init-time call to arc_tuning_update. It is now called during arc_adjust_cb_check; this should be equivalent to its former call location in arc_reclaim_thread. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Justin Gottula <[email protected]> Closes #8405 Closes #8463
* Avoid retrieving unused snapshot propsAlek P2019-03-128-74/+199
| | | | | | | | | | | | | | This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #8077
* Fix vdev_initialize_restart / removal raceBrian Behlendorf2019-03-121-2/+4
| | | | | | | | | | | Resolve a vdev_initialize crash uncovered by ztest. Similar to when starting a new initialization verify that a removal is not in progress. Additionally, do not restart when the thread already exists. This check is now congruent with the POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl(). Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8477
* MMP writes rotate over leavesOlaf Faaland2019-03-129-67/+160
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of choosing a leaf vdev quasi-randomly, by starting at the root vdev and randomly choosing children, rotate over leaves to issue MMP writes. This fixes an issue in a pool whose top-level vdevs have different numbers of leaves. The issue is that the frequency at which individual leaves are chosen for MMP writes is based not on the total number of leaves but based on how many siblings the leaves have. For example, in a pool like this: root-vdev +------+---------------+ vdev1 vdev2 | | | +------+-----+-----+----+ disk1 disk2 disk3 disk4 disk5 disk6 vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1 is chosen, disk1 will be chosen. However, every time vdev2 is chosen, disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as many MMP writes as disk2. This may create wear issues in the case of SSDs. It also reduces the effectiveness of MMP as it depends on the writes being evenly distributed for the case where some devices fail or are partitioned. The new code maintains a list of leaf vdevs in the pool. MMP records the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list, continuing from the head if the tail is reached. It stops when a suitable leaf is found or all leaves have been examined. Added a test to verify MMP write distribution is even. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7953
* zfs does not honor NFS sync write semanticsGeorge Wilson2019-03-111-2/+30
| | | | | | | | | | | | | | | The linux kernel's nfsd implementation use RWF_SYNC to determine if the write is synchronous or not. This flag is used to set the kernel's I/O control block flags. Unfortunately, ZFS was not updated to inspect these flags so NFS sync writes were not being honored. This change maps the IOCB_* flags to the ZFS equivalent. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #8474 Closes #8452 Closes #8486
* Fix lockdep between ds_lock and dd_lock in dsl_dataset_namelen()mzhivich2019-03-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Booting debug kernel found an inconsistent lock dependency between dataset's ds_lock and its directory's dd_lock. [ 32.215336] ====================================================== [ 32.221859] WARNING: possible circular locking dependency detected [ 32.221861] 4.14.90+ #8 Tainted: G O [ 32.221862] ------------------------------------------------------ [ 32.221863] dynamic_kernel_/4667 is trying to acquire lock: [ 32.221864] (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs] [ 32.221941] but task is already holding lock: [ 32.221941] (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs] [ 32.221983] which lock already depends on the new lock. [ 32.221983] the existing dependency chain (in reverse order) is: [ 32.221984] -> #1 (&dd->dd_lock){+.+.}: [ 32.221992] __mutex_lock+0xef/0x14c0 [ 32.222049] dsl_dir_namelen+0xd4/0x2d0 [zfs] [ 32.222093] dsl_dataset_namelen+0x2f1/0x430 [zfs] [ 32.222142] verify_dataset_name_len+0xd/0x40 [zfs] [ 32.222184] dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs] [ 32.222226] dmu_objset_find_dp_cb+0x40/0x60 [zfs] [ 32.222235] taskq_thread+0x969/0x1460 [spl] [ 32.222238] kthread+0x2fb/0x400 [ 32.222241] ret_from_fork+0x3a/0x50 [ 32.222241] -> #0 (&ds->ds_lock){+.+.}: [ 32.222246] lock_acquire+0x14f/0x390 [ 32.222248] __mutex_lock+0xef/0x14c0 [ 32.222291] dsl_dataset_check_quota+0x9e/0x8a0 [zfs] [ 32.222355] dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs] [ 32.222392] dmu_tx_assign+0xa61/0xdb0 [zfs] [ 32.222436] zfs_create+0x4e6/0x11d0 [zfs] [ 32.222481] zpl_create+0x194/0x340 [zfs] [ 32.222484] lookup_open+0xa86/0x16f0 [ 32.222486] path_openat+0xe56/0x2490 [ 32.222488] do_filp_open+0x17f/0x260 [ 32.222490] do_sys_open+0x195/0x310 [ 32.222491] SyS_open+0xbf/0xf0 [ 32.222494] do_syscall_64+0x191/0x4f0 [ 32.222496] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 32.222497] other info that might help us debug this: [ 32.222497] Possible unsafe locking scenario: [ 32.222498] CPU0 CPU1 [ 32.222498] ---- ---- [ 32.222499] lock(&dd->dd_lock); [ 32.222500] lock(&ds->ds_lock); [ 32.222502] lock(&dd->dd_lock); [ 32.222503] lock(&ds->ds_lock); [ 32.222504] *** DEADLOCK *** [ 32.222505] 3 locks held by dynamic_kernel_/4667: [ 32.222506] #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0 [ 32.222511] #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490 [ 32.222515] #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs] The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by acquiring dd_lock on ds->ds_dir in dsl_dir_namelen(). However, ds->ds_dir should not be protected by ds_lock, so releasing it before call to dsl_dir_namelen() prevents the lockdep issue Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: Michael Zhivich <[email protected]> Closes #8413
* Reorder ZFS ioctls to fix cross-version compatibilityLorenz Brun2019-03-092-6/+6
| | | | | | | | | | | Reorder ZFS ioctls to fix cross-version compatibility. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Signed-off-by: Lorenz Brun <[email protected]> Closes #8484
* Linux 5.1 compat: get_ds() removedBrian Behlendorf2019-03-071-3/+3
| | | | | | | | | | | Commit torvalds/linux@736706bee has removed the get_fs() function as a bit of cleanup. It has been defined as KERNEL_DS on all architectures for all supported kernels. Replace get_fs() with KERNEL_DS as was done in the kernel. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8479