openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	Added encryption support for zfs recv -o / -x	Tom Caputi	2018-08-15	3	-48/+119
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	One small integration that was absent from b52563 was support for zfs recv -o / -x with regards to encryption parameters. The main use cases of this are as follows: * Receiving an unencrypted stream as encrypted without needing to create a "dummy" encrypted parent so that encryption can be inheritted. * Allowing users to change their keylocation on receive, so long as the receiving dataset is an encryption root. * Allowing users to explicitly exclude or override the encryption property from an unencrypted properties stream, allowing it to be received as encrypted. * Receiving a recursive heirarchy of unencrypted datasets, encrypting the top-level one and forcing all children to inherit the encryption. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7650
*	Fix comment on calculating blkid	Tomohiro Kusumi	2018-08-13	1	-1/+1
\| \| \| \| \| \| \| \| \|	Fix comment on calculating blkid at level n within dnode's blkptrs. "(2^(level*(indblkshift - SPA_BLKPTRSHIFT)" is part of divisor in this division. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7768
*	Allow inherited properties in zfs_check_settable()	LOLi	2018-08-03	1	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \|	This change modifies how 'checksum' and 'dedup' properties are verified in zfs_check_settable() handling the case where they are explicitly inherited in the dataset hierarchy when receiving a recursive send stream. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7755 Closes #7576 Closes #7757
*	zfs_ioc_unload_key can drop extra spa ref	Don Brady	2018-08-03	2	-5/+14
\| \| \| \| \| \| \|	Reviewed by: Thomas Caputi <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7759
*	Reduce taskq and context-switch cost of zio pipe	Matthew Ahrens	2018-08-02	1	-124/+144
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the logical zio_read(), and then a physical zio. Currently, each of these results in a separate taskq_dispatch(zio_execute). On high-read-iops workloads, this causes a significant performance impact. By processing all 3 ZIO's in a single taskq entry, we reduce the overhead on taskq locking and context switching. We accomplish this by allowing zio_done() to return a "next zio to execute" to zio_execute(). This results in a ~12% performance increase for random reads, from 96,000 iops to 108,000 iops (with recordsize=8k, on SSD's). Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-59292 Closes #7736
*	Add missing checks to zpl_xattr_* functions	John Gallagher	2018-08-02	1	-6/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	Linux specific zpl_* entry points, such as xattrs, must include the same unmounted and sa handle checks as the common zfs_ entry points. The additional ZPL_* wrappers are identical to their ZFS_ counterparts except the errno is negated since they are expected to be used at the zpl_ layer. Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #5866 Closes #7761
*	Fix OpenZFS 9337 mismerge	George Wilson	2018-08-02	3	-13/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change reintroduces logic required by OpenZFS 9577. When OpenZFS 9337, zfs get all is slow due to uncached metadata, was merged in it ended up removing logic required by OpenZFS 9577, remove zfs_dbuf_evict_key, and inadvertently reintroduced the bug that 9577 was designed to fix. This change re-enables the "evicting" flag to dbuf_rele_and_unlock and dnode_rele_and_unlock and updates all callers to provide the correct parameter. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #7758
*	Fix deadlock between zfs umount & snapentry_expire	Rohan Puri	2018-08-01	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a writer and calls zfsctl_snapshot_unmount_cancel(), which waits for snapentry_expire() if present (when snap is automounted). This snapentry_expire() itself then waits for zfs_snapshot_lock as a reader, resulting in a deadlock. The fix is to only hold the zfs_snapshot_lock over the tree lookup and removal. After a successful lookup the lock can be dropped and zfs_snapentry_t will remain valid until the reference taken by the lookup is released. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rohan Puri <[email protected]> Closes #7751 Closes #7752
*	OpenZFS 9112 - Improve allocation performance on high-end systems	Paul Dagnelie	2018-07-31	8	-175/+505
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Overview ======== We parallelize the allocation process by creating the concept of "allocators". There are a certain number of allocators per metaslab group, defined by the value of a tunable at pool open time. Each allocator for a given metaslab group has up to 2 active metaslabs; one "primary", and one "secondary". The primary and secondary weight mean the same thing they did in in the pre-allocator world; primary metaslabs are used for most allocations, secondary metaslabs are used for ditto blocks being allocated in the same metaslab group. There is also the CLAIM weight, which has been separated out from the other weights, but that is less important to understanding the patch. The active metaslabs for each allocator are moved from their normal place in the metaslab tree for the group to the back of the tree. This way, they will not be selected for use by other allocators searching for new metaslabs unless all the passive metaslabs are unsuitable for allocations. If that does happen, the allocators will "steal" from each other to ensure that IOs don't fail until there is truly no space left to perform allocations. In addition, the alloc queue for each metaslab group has been broken into a separate queue for each allocator. We don't want to dramatically increase the number of inflight IOs on low-end systems, because it can significantly increase txg times. On the other hand, we want to ensure that there are enough IOs for each allocator to allow for good coalescing before sending the IOs to the disk. As a result, we take a compromise path; each allocator's alloc queue max depth starts at a certain value for every txg. Every time an IO completes, we increase the max depth. This should hopefully provide a good balance between the two failure modes, while not dramatically increasing complexity. We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause very similar contention when selecting IOs to allocate. This parallelization uses the same allocator scheme as metaslab selection. Performance Results =================== Performance improvements from this change can vary significantly based on the number of CPUs in the system, whether or not the system has a NUMA architecture, the speed of the drives, the values for the various tunables, and the workload being performed. For an fio async sequential write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB SSDs, there is a roughly 25% performance improvement. Future Work =========== Analysis of the performance of the system with this patch applied shows that a significant new bottleneck is the vdev disk queues, which also need to be parallelized. Prototyping of this change has occurred, and there was a performance improvement, but more work needs to be done before its stability has been verified and it is ready to be upstreamed. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Alexander Motin <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Paul Dagnelie <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Porting Notes: * Fix reservation test failures by increasing tolerance. OpenZFS-issue: https://illumos.org/issues/9112 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3 Closes #7682
*	OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system	Don Brady	2018-07-30	3	-16/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the case of one pool being built on another pool, we want to make sure we don't end up throttling the lower (backing) pool when the upper pool is the majority contributor to dirty data. To insure we make forward progress during throttling, we also check the current pool's net dirty data and only throttle if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty data in the cache. Authored by: Don Brady <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: * The new global variables zfs_arc_dirty_limit_percent, zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent were intentially not added as tunable module parameters. OpenZFS-issue: https://illumos.org/issues/9465 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef Closes #7749
*	OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations	Serapheim Dimitropoulos	2018-07-30	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	= Motivation While dealing with another performance issue (see 126118f) we noticed that we spend a lot of time in various places in the kernel when constructing long nvlists. The problem is that when an nvlist is created with the NV_UNIQUE_NAME set (which is the case most of the time), we do a linear search through the whole list to ensure uniqueness for every entry we add. An example of the above scenario can be seen in the following flamegraph, where more than have the time of the zfsdev_ioctl() is spent on constructing nvlists. Flamegraph: https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg Adding a table to speed up lookups will help situations where we just construct an nvlist (like the scenario above), in addition to regular lookups and removals. = What this patch does In this diff we've implemented a hash-table on top of the nvlist code that converts most nvlist operations from O(# number of entries) to O(1)* (the start is for amortized time as the hash-table grows and shrinks depending on the # of entries - plain lookup is strictly O(1)). = Performance Analysis To analyze the performance improvement I just used the setup from the snapshot deletion issue mentioned above in the Motivation section. Basically I created 10K filesystems with one snapshot each and then I just used the API of libZFS_Core to pass down an nvlist of all the snapshots to have them deleted. The reason I used my own driver program was to have clean performance results of what actually happens in the kernel. The flamegraphs and wall clock times mentioned below were gathered from the start to the end of the driver program's run. Between trials the testpool used was completely destroyed, the system was rebooted and the testpool was completely recreated. The reason for this dance was to get consistent results. == Results (before patch): === Sampling Flamegraphs [Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg [Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg [Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg === Wall clock times (in seconds) ``` [Trial 4] real 5.3 user 0.4 sys 2.3 [Trial 5] real 8.2 user 0.4 sys 2.4 [Trial 6] real 6.0 user 0.5 sys 2.3 ``` == Results (after patch): === Sampling Flamegraphs [Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg [Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg [Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg === Wall clock times (in seconds) ``` [Trial 4] real 4.9 user 0.0 sys 0.9 [Trial 5] real 3.8 user 0.0 sys 0.9 [Trial 6] real 3.6 user 0.0 sys 0.9 ``` == Analysis The results between the trials are consistent so in this sections I will only talk about the flamegraph results from trial-1 and the wall-clock results from trial-4. From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996 samples counts. Specifically, the samples from fnvlist_add_nvlist() and spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples), leaving zfs_ioc_destroy_snaps() to dominate most samples from zfs_dev_ioctl(). From trial-4 we see that the user time dropped to 0 secods. I believe the consistent 0.4 seconds before my patch was applied was due to my driver program constructing the long nvlist of snapshots so it can pass it to the kernel. As for the system time, the effect there is more clear (2.3 down to 0.9 seconds). Porting Notes: * DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and zpool_do_events_nvprint(). Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9580 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1 Closes #7748
*	OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block	Matthew Ahrens	2018-07-30	2	-4/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Follow up commit for OpenZFS 9438. See the OpenZFS-issue link below for a complete analysis. Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9439 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/779220d External-issue: DLPX-46861 Closes #7746
*	OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth ↵	Paul Dagnelie	2018-07-30	3	-24/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	times As reported by https://github.com/zfsonlinux/zfs/issues/4996, there is yet another hole birth issue. In this one, if a block is entirely holes, but the birth times are not all the same, we lose that information by creating one hole with the current txg as its birth time. The ZoL PR's fix approach is incorrect. Ultimately, the problem here is that when you truncate and write a file in the same transaction group, the dbuf for the indirect block will be zeroed out to deal with the truncation, and then written for the write. During this process, we will lose hole birth time information for any holes in the range. In the case where a dnode is being freed, we need to determine whether the block should be converted to a higher-level hole in the zio pipeline, and if so do it when the dnode is being synced out. Porting Notes: * The DMU_OBJECT_END change in zfs_znode.c was already applied. * Added test cases from #5675 provided by @rincebrain for hole_birth issues. These test cases should be pushed upstream to OpenZFS. * Updated mk_files which is used by several rsend tests so the files created are a little more interesting and may contain holes. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9438 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/738e2a3c External-issue: DLPX-46861 Closes #7746
*	OpenZFS 8906 - uts: illumos rootfs should support salted cksum	Toomas Soome	2018-07-27	1	-10/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Porting notes: * As of grub-2.02 these checksums are not supported. However, as pointed out in #6501 there are alternatives such as EFISTUB which work and have no such restriction. A warning was added to the checksum property section of the zfs.8 man page. Authored by: Toomas Soome <[email protected]> Reviewed by: C Fraire <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/8906 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f Closes #6501 Closes #7714
*	OpenZFS 9442 - decrease indirect block size of spacemaps	Matthew Ahrens	2018-07-25	2	-18/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Matthew Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Albert Lee <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: George Melikov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Updates to indirect blocks of spacemaps can contribute significantly to write inflation. Therefore we want to reduce the indirect block size of spacemaps from 128K to 16K. Porting notes: * Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs() and dmu_object_alloc_dnsize() functions to use a common shared dmu_object_alloc_impl() function. OpenZFS-issue: https://www.illumos.org/issues/9442 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b Closes #7712
*	Introduce kstat dmu_tx_dirty_frees_delay	Feng Sun	2018-07-25	2	-0/+2
\| \| \| \| \| \| \| \| \|	It is helpful to tune zfs_per_txg_dirty_frees_percent for commit 539d33c7(OpenZFS 6569 - large file delete can starve out write ops). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Richard Elling <[email protected]> Signed-off-by: Feng Sun <[email protected]> Closes #7718
*	OpenZFS 9338 - moved dnode has incorrect dn_next_type	Matthew Ahrens	2018-07-24	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Matthew Ahrens <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Giuseppe Di Natale <[email protected]> Reviewed by: George Melikov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> While investigating a different problem, I noticed that moved dnodes (those processed by dnode_move_impl() via kmem_move()) have an incorrect dn_next_type. This could cause the on-disk dn_type to be changed to an invalid value. The fix to copy the dn_next_type in dnode_move_impl(). Porting notes: * For the moment this potential issue cannot occur on Linux since the SPL does not provide the kmem_move() functionality. OpenZFS-issue: https://illumos.org/issues/9338 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0717e6f13 Closes #7715
*	Refactor arc_hdr_realloc_crypt()	Tom Caputi	2018-07-24	1	-3/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The arc_hdr_realloc_crypt() function is responsible for converting a "full" arc header to an extended "crypt" header and visa versa. This code was originally written with a bcopy() so that any new members added to arc headers would automatically be included without requiring a code change. However, in practice this (along with small differences in kmem_cache implementations between various platforms) has caused a number of hard-to-find problems in ports to other operating systems. This patch solves this problem by making all member copies explicit and adding ASSERTs for fields that cannot be set during the transfer. It also manually resets the old header after the reallocation is finished so it can be properly reallocated and reused. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7711
*	dsl_scan_scrub_cb: don't double-account non-embedded blocks	Steven Noonan	2018-07-24	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We were doing count_block() twice inside this function, once unconditionally at the beginning (intended to catch the embedded block case) and once near the end after processing the block. The double-accounting caused the "zpool scrub" progress statistics in "zpool status" to climb from 0% to 200% instead of 0% to 100%, and showed double the I/O rate it was actually seeing. This was apparently a regression introduced in commit 00c405b4b5e8, which was an incorrect port of this OpenZFS commit: https://github.com/openzfs/openzfs/commit/d8a447a7 Reviewed by: Thomas Caputi <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Steven Noonan <[email protected]> Closes #7720 Closes #7738
*	Add support for autoexpand property	Brian Behlendorf	2018-07-23	3	-145/+174
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While the autoexpand property may seem like a small feature it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that with a few modifications for Linux it can be supported. Auto-expand works as follows; when a block device is modified (re-sized, closed after being open r/w, etc) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid. From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size. Finally, if the property autoexpand=on a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size which allows the new capacity to be used. In order to make all of the above possible the following changes were required: * Updated the zpool_expand_001_pos and zpool_expand_003_pos tests. These tests now create a pool which is layered on a loopback, scsi_debug, and file vdev. This allows for testing of non- partitioned block device (loopback), a partition block device (scsi_debug), and a file which does not receive udev change events. This provided for better test coverage, and by removing the layering on ZFS volumes there issues surrounding layering one pool on another are avoided. * zpool_find_vdev_by_physpath() updated to accept a vdev guid. This allows for matching by guid rather than path which is a more reliable way for the ZED to reference a vdev. * Fixed zfs_zevent_wait() signal handling which could result in the ZED spinning when a signal was not handled. * Removed vdev_disk_rrpart() functionality which can be abandoned in favor of kernel provided blkdev_reread_part() function. * Added a rwlock which is held as a writer while a disk is being reopened. This is important to prevent errors from occurring for any configuration related IOs which bypass the SCL_ZIO lock. The zpool_reopen_007_pos.ksh test case was added to verify IO error are never observed when reopening. This is not expected to impact IO performance. Additional fixes which aren't critical but were discovered and resolved in the course of developing this functionality. * Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for ZFS volumes. This is as good as a unique physical path, while the volumes are not used in the test cases anymore for other reasons this improvement was included. Reviewed by: Richard Elling <[email protected]> Signed-off-by: Sara Hartse <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #120 Closes #2437 Closes #5771 Closes #7366 Closes #7582 Closes #7629
*	OpenZFS 9337 - zfs get all is slow due to uncached metadata	Matthew Ahrens	2018-07-12	8	-156/+325
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much). There are two parts: The dbuf_metadata_cache. We identify what to put into the cache based on the object type of each dbuf. Caching objset properties os {version,normalization,utf8only,casesensitivity} in the objset_t. The reason these needed to be cached is that although they are queried frequently, they aren't stored in a dbuf type which we can easily recognize and cache in the dbuf layer; instead, we have to explicitly store them. There's already existing infrastructure for maintaining cached properties in the objset setup code, so I simply used that. Performance Testing: - Disabled kmem_flags - Tuned dbuf_cache_max_bytes very low (128K) - Tuned zfs_arc_max very low (64M) Created test pool with 400 filesystems, and 100 snapshots per filesystem. Later on in testing, added 600 more filesystems (with no snapshots) to make sure scaling didn't look different between snapshots and filesystems. Results: \| Test \| Time (trunk / diff) \| I/Os (trunk / diff) \| +------------------------+---------------------+---------------------+ \| zpool import \| 0:05 / 0:06 \| 12.9k / 12.9k \| \| zfs get all (uncached) \| 1:36 / 0:53 \| 16.7k / 5.7k \| \| zfs get all (cached) \| 1:36 / 0:51 \| 16.0k / 6.0k \| Authored by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Thomas Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Alek Pinchuk <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> OpenZFS-issue: https://illumos.org/issues/9337 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f Closes #7668
*	OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap	Don Brady	2018-07-11	1	-27/+55
\| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Don Brady <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9426 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1 Closes #7700
*	OpenZFS 9479 - fix wrong format specifier for vdev_id	Andriy Gapon	2018-07-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Andriy Gapon <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Giuseppe Di Natale <[email protected]> Reviewed by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9479 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/20aa447c Closes #7699
*	Fix zpl_mount() deadlock	Brian Behlendorf	2018-07-11	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 93b43af10 inadvertently introduced the following scenario which can result in a deadlock. This issue was most easily reproduced by LXD containers using a ZFS storage backend but should be reproducible under any workload which is frequently mounting and unmounting. -- THREAD A -- spa_sync() spa_sync_upgrades() rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B -- THREAD B -- mount_fs() zpl_mount() zpl_mount_impl() dmu_objset_hold() dmu_objset_hold_flags() dsl_pool_hold() dsl_pool_config_enter() rrw_enter(&dp->dp_config_rwlock, RW_READER, tag); sget() sget_userns() grab_super() down_write(&s->s_umount); <- Waiting on C -- THREAD C -- cleanup_mnt() deactivate_super() down_write(&s->s_umount); deactivate_locked_super() zpl_kill_sb() kill_anon_super() generic_shutdown_super() sync_filesystem() zpl_sync_fs() zfs_sync() zil_commit() txg_wait_synced() <- Waiting on A Reviewed by: Alek Pinchuk <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7598 Closes #7659 Closes #7691 Closes #7693
*	Fix kernel unaligned access on sparc64	Brian Behlendorf	2018-07-11	1	-15/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update the SA_COPY_DATA macro to check if architecture supports efficient unaligned memory accesses at compile time. Otherwise fallback to using the sa_copy_data() function. The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is used to determine availability in kernel space. In user space the x86_64, x86, powerpc, and sometimes arm architectures will define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro. Signed-off-by: Brian Behlendorf <[email protected]> Closes #7642 Closes #7684
*	OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid ↵	Matthew Ahrens	2018-07-10	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	value type 'function' for key 'error')" Ztest failed with the following crash. ::status debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter file: /usr/bin/amd64/ztest initial argv: /usr/bin/amd64/ztest threading model: raw lwps status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1 panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid value type 'function' for key 'error') ::stack libc.so.1`_lwp_kill+0xa() libc.so.1`_assfail+0x182(fffffd7fffdfe8d0, 0, 0) libc.so.1`assfail+0x19(fffffd7fffdfe8d0, 0, 0) libzpool.so.1`vpanic+0x3d(fffffd7ffaa58c20, fffffd7fffdfeb00) 0xfffffd7ffaa28146() 0xfffffd7ffaa0a109() libzpool.so.1`luaD_throw+0x86(3011a48, 2) 0xfffffd7ffa9350d3() 0xfffffd7ffa93e3f1() libzpool.so.1`zcp_lua_to_nvlist+0x33(3011a48, 1, 2686470, fffffd7ffaa2e2c3) libzpool.so.1`zcp_convert_return_values+0xa4(3011a48, 2686470, fffffd7ffaa2e2c3, fffffd7fffdfedd0) libzpool.so.1`zcp_pool_error+0x59(fffffd7fffdfedd0, 1e0f450) libzpool.so.1`zcp_eval+0x6f8(1e0f450, fffffd7ffaa483f8, 1, 0, 6400000, 1d33b30) libzpool.so.1`dsl_destroy_snapshots_nvl+0x12c(2786b60, 0, 484750) libzpool.so.1`dsl_destroy_snapshot+0x4f(fffffd7fffdfef70, 0) ztest_dsl_dataset_cleanup+0xea(fffffd7fffdff4c0, 1) ztest_dataset_destroy+0x53(1) ztest_run+0x59f(fffffd7fff0e0498) main+0x7ff(1, fffffd7fffdffa88) _start+0x6c() The problem is that zcp_convert_return_values() assumes that there's exactly one value on the stack, but that isn't always true. It ends up putting the wrong thing on the stack which is then consumed by zcp_convert_return values, which either adds the wrong message to the nvlist, or blows up. The fix is to make sure that callers of zcp_convert_return_values() clear the stack before pushing their error message, and zcp_convert_return_values() should VERIFY that the stack is the expected size. Authored by: Matthew Ahrens <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Don Brady <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9424 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb7e57429 Closes #7696
*	OpenZFS 9454 - ::zfs_blkstats should count embedded blocks	Matthew Ahrens	2018-07-10	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we do a scrub or resilver, ZFS counts the different types of blocks, which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to count embedded blocks. Porting notes: * Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional as part of the sequential resilver functionality. Since phys_birth would be zero that case should never happen as described above. This is confirmed by the code coverage analysis. Remove the conditional to realign that aspect of this function with OpenZFS. Authored by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: George Melikov <[email protected]> Reviewed by: Tom Caputi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9454 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d8a447a7 Closes #7697
*	OpenZFS 9456 - ztest failure in zil_commit_waiter_timeout	Prakash Surya	2018-07-10	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem ======= Illumos bug 8373 was integrated, which now presents a code path where "dmu_tx_assign" can fail. When "dmu_tx_assign" fails, it will not issue the lwb that was passed in to "zil_lwb_write_issue". As a result, when "zil_lwb_write_issue" returns, the lwb will still be in the "opened" state, just as it was when "zil_lwb_write_issue" was originally called. Solution ======== As a result of this new call path, the failed assertion needs to be modified to be aware of this new possibility. Thus, we can only assert that the lwb is no longer in the "opened" state if the returned lwb is non-null, since we cannot differentiate between the case of "dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to "zil_lwb_write_issue". Authored by: Prakash Surya <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Approved by: Matt Ahrens <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9456 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8b09f4e Closes #7695
*	OpenZFS 9330 - stack overflow when creating a deeply nested dataset	Serapheim Dimitropoulos	2018-07-09	2	-5/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Datasets that are deeply nested (~100 levels) are impractical. We just put a limit of 50 levels to newly created datasets. Existing datasets should work without a problem. The problem can be seen by attempting to create a dataset using the -p option with many levels: panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff fffffffffbc3aa60 unix:die+100 () fffffffffbc3ab70 unix:trap+157d () ffffff00083d7020 unix:_patch_xrstorq_rbx+196 () ffffff00083d7050 zfs:dbuf_rele+2e () ... ffffff00083d7080 zfs:dsl_dir_close+32 () ffffff00083d70b0 zfs:dsl_dir_evict+30 () ffffff00083d70d0 zfs:dbuf_evict_user+4a () ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 () ffffff00083d7130 zfs:dbuf_rele+2e () ... The block above repeats once per directory in the ... ... create -p command, working towards the root ... ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 () ffffff00083dba20 zfs:dsl_dataset_rele+42 () ffffff00083dba70 zfs:dmu_objset_prefetch+e4 () ffffff00083dbaa0 zfs:findfunc+23 () ffffff00083dbb80 zfs:dmu_objset_find_spa+38c () ffffff00083dbbc0 zfs:dmu_objset_find+40 () ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b () ffffff00083dbcc0 zfs:zfsdev_ioctl+347 () ffffff00083dbd00 genunix:cdev_ioctl+45 () ffffff00083dbd40 specfs:spec_ioctl+5a () ffffff00083dbdc0 genunix:fop_ioctl+7b () ffffff00083dbec0 genunix:ioctl+18e () ffffff00083dbf10 unix:brand_sys_sysenter+1c9 () Porting notes: * Added zfs_max_dataset_nesting module option with documentation. * Updated zfs_rename_014_neg.ksh for Linux. * Increase the zfs.sh stack warning to 15K. Enough time has passed that 16K can be reasonably assumed to be the default value. It was increased in the 3.15 kernel released in June of 2014. Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Approved by: Garrett D'Amore <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9330 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a Closes #7681
*	OpenZFS 9238 - ZFS Spacemap Encoding V2	Serapheim Dimitropoulos	2018-07-05	6	-276/+645
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Motivation ========== The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). New encoding ============ The layout can be found at space_map.h and it remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps (expected to be used in the log spacemap project). One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. Other references: ================= The new encoding is also discussed towards the end of the Log Space Map presentation from 2017's OpenZFS summit. Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d OpenZFS-issue: https://www.illumos.org/issues/9238 Closes #7665
*	Fix coverity defects: CID 176037	Tom Caputi	2018-07-02	1	-3/+4
\| \| \| \| \| \| \| \| \| \|	CID 176037: Uninitialized scalar variable This patch fixes an uninitialized variable defect caught by coverity and introduced in 69830602 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7667
*	Fix 'zfs recv' of non large_dnode send streams	Tom Caputi	2018-06-28	2	-8/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, there is a bug where older send streams without the DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly. The code in receive_object() fails to handle cases where drro->drr_dn_slots is set to 0, which is always the case when the sending code does not support this feature flag. This patch fixes the issue by ensuring that that a value of 0 is treated as DNODE_MIN_SLOTS. Tested-by: DHE <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7617 Closes #7662
*	Raw receive fix and encrypted objset security fix	Tom Caputi	2018-06-28	3	-38/+116
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes two problems with the encryption code. First, the current code does not correctly prohibit the DMU from updating dn_maxblkid during object truncation within a raw receive. This usually only causes issues when the truncating DRR_FREE record is aggregated with DRR_FREE records later in the receive, so it is relatively hard to hit. Second, this patch fixes a security issue where reading blocks within an encrypted object did not guarantee that the dnode block itself had ever been verified against its MAC. Usually the verification happened anyway when the bonus buffer was read, but some use cases (notably zvols) might never perform the check. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7632
*	OpenZFS 9166 - zfs storage pool checkpoint	Serapheim Dimitropoulos	2018-06-26	28	-459/+2319
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Details about the motivation of this feature and its usage can be found in this blogpost: https://sdimitro.github.io/post/zpool-checkpoint/ A lightning talk of this feature can be found here: https://www.youtube.com/watch?v=fPQA8K40jAM Implementation details can be found in big block comment of spa_checkpoint.c Side-changes that are relevant to this commit but not explained elsewhere: * renames members of "struct metaslab trees to be shorter without losing meaning * space_map_{alloc,truncate}() accept a block size as a parameter. The reason is that in the current state all space maps that we allocate through the DMU use a global tunable (space_map_blksz) which defauls to 4KB. This is ok for metaslab space maps in terms of bandwirdth since they are scattered all over the disk. But for other space maps this default is probably not what we want. Examples are device removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this review. Both of these have a 1:1 relationship with each vdev and could benefit from a bigger block size. Porting notes: * The part of dsl_scan_sync() which handles async destroys has been moved into the new dsl_process_async_destroys() function. * Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write to block device backed pools. * ZTS: * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg". * Don't use large dd block sizes on /dev/urandom under Linux in checkpoint_capacity. * Adopt Delphix-OS's setting of 4 (spa_asize_inflation = SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed its attempts to fill the pool * Create the base and nested pools with sync=disabled to speed up the "setup" phase. * Clear labels in test pool between checkpoint tests to avoid duplicate pool issues. * The import_rewind_device_replaced test has been marked as "known to fail" for the reasons listed in its DISCLAIMER. * New module parameters: zfs_spa_discard_memory_limit, zfs_remove_max_bytes_pause (not documented - debugging only) vdev_max_ms_count (formerly metaslabs_per_vdev) vdev_min_ms_count Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-issue: https://illumos.org/issues/9166 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8 Closes #7570
*	OpenZFS 9591 - ms_shift can be incorrectly changed	Serapheim Dimitropoulos	2018-06-21	2	-3/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ms_shift can be incorrectly changed changed in MOS config for indirect vdevs that have been historically expanded According to spa_config_update() we expect new vdevs to have vdev_ms_array equal to 0 and then we go ahead and set their metaslab size. The problem is that indirect vdevs also have vdev_ms_array == 0 because their metaslabs are destroyed once their removal is done. As a result, if a vdev was expanded and then removed may have its ms_shift changed if another vdev was added after its removal. Fortunately this behavior does not cause any type of crash or bad behavior in the kernel but it can confuse zdb and anyone doing any kind of analysis of the history of the pools. Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651 OpenZFS-issue: https://illumos.org/issues/9591a External-issue: DLPX-58879 Closes #7644
*	Remove suffix from zio taskq names	Matthew Ahrens	2018-06-20	1	-8/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For zio taskq's which have multiple instances (e.g. z_rd_int_0, z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix). This makes performance analysis more difficult, because by default, `perf` includes the thread name (which is the same as the taskq name) in the stack trace. This means that we get 8 different stacks, all of which are doing the same thing, but are executed from different taskq's. We should remove the suffix of the taskq name, so that all the read-interrupt threads are named z_rd_int. Note that we already support multiple taskq's with the same name. This happens when there are multiple pools. In this case the taskq has a different tq_instance, which shows up in /proc/spl/taskq-all. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #7646
*	Linux 4.14 compat: blk_queue_stackable()	Brian Behlendorf	2018-06-19	1	-14/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The blk_queue_stackable() function was replaced in the 4.14 kernel by queue_is_rq_based(), commit torvalds/linux@5fdee212. This change resulted in the default elevator being used which can negatively impact performance. Rather than adding additional compatibility code to detect the new interface unconditionally attempt to set the elevator. Since we expect this to fail for block devices without an elevator the error message has been moved in to zfs_dbgmsg(). Finally, it was observed that the elevator_change() was removed from the 4.12 kernel, commit torvalds/linux@c033269. Update the comment to clearly specify which are expected to export the elevator_change() symbol. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7645
*	Linux 4.18 compat: inode timespec -> timespec64	Brian Behlendorf	2018-06-19	7	-12/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime, and i_ctime members form timespec's to timespec64's to make them 2038 safe. As part of this change the current_time() function was also updated to return the timespec64 type. Resolve this issue by introducing a new inode_timespec_t type which is defined to match the timespec type used by the inode. It should be used when working with inode timestamps to ensure matching types. The timestruc_t type under Illumos was used in a similar fashion but was specified to always be a timespec_t. Rather than incorrectly define this type all timespec_t types have been replaced by the new inode_timespec_t type. Finally, the kernel and user space 'sys/time.h' headers were aligned with each other. They define as appropriate for the context several constants as macros and include static inline implementation of gethrestime(), gethrestime_sec(), and gethrtime(). Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7643
*	Add ASSERT to debug encryption key mapping issues	Tom Caputi	2018-06-18	2	-0/+23
\| \| \| \| \| \| \| \| \| \| \|	This patch simply adds an ASSERT that confirms that the last decrypting reference on a dataset waits until the dataset is no longer dirty. This should help to debug issues where the ZIO layer cannot find encryption keys after a dataset has been disowned. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7637
*	Add tunables for channel programs	John Gallagher	2018-06-15	1	-3/+15
\| \| \| \| \| \| \| \| \| \| \| \|	This patch adds tunables for modifying the maximum memory limit and maximum instruction limit that can be specified when running a channel program. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected] Reviewed-by: Sara Hartse <[email protected]> Signed-off-by: John Gallagher <[email protected]> External-issue: LX-1085 Closes #7618
*	Linux compat 4.18: check_disk_size_change()	Brian Behlendorf	2018-06-15	1	-144/+115
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Added support for the bops->check_events() interface which was added in the 2.6.38 kernel to replace bops->media_changed(). Fully implementing this functionality allows the volume resize code to rely on revalidate_disk(), which is the preferred mechanism, and removes the need to use check_disk_size_change(). In order for bops->check_events() to lookup the zvol_state_t stored in the disk->private_data the zvol_state_lock needs to be held. Since the check events interface may poll the mutex has been converted to a rwlock for better concurrently. The rwlock need only be taken as a writer in the zvol_free() path when disk->private_data is set to NULL. The configure checks for the block_device_operations structure were consolidated in a single kernel-block-device-operations.m4 file. The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks and assoicated dead code was removed. This interface was added to the 2.6.28 kernel which predates the oldest supported 2.6.32 kernel and will therefore always be available. Updated maximum Linux version in META file. The 4.17 kernel was released on 2018-06-03 and ZoL is compatible with the finalized kernel. Reviewed-by: Boris Protopopov <[email protected]> Reviewed-by: Sara Hartse <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7611
*	OpenZFS 9577 - remove zfs_dbuf_evict_key tsd	Matthew Ahrens	2018-06-13	3	-51/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can instead pass a flag down in a few places to prevent recursive dbuf eviction. Making this change has 3 benefits: 1. The code semantics are easier to understand. 2. On Linux, performance is improved, because creating/removing TSD values (by setting to NULL vs non-NULL) is expensive, and we do it very often. 3. According to Nexenta, the current semantics can cause a deadlock when concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they are working on a "parallel unmount" change that triggers this more easily): Porting Notes: * Minor conflict with OpenZFS 9337 which has not yet been ported. Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9577 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645 External-issue: DLPX-58547 Closes #7602
*	Wrong error message when removing log device	Paul Zuchowski	2018-06-07	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	In the case where the pool is loaded without the crypto keys necessary to playback the intent log, and log device removal is attempted, a generic busy message is received. Change the message to inform the user that the datasets must be mounted. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #7518
*	Fix preemptible warning in aggsum_add()	Brian Behlendorf	2018-06-07	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the new aggsum counters the CPU_SEQID macro should be surrounded by kpreempt_disable)() and kpreempt_enable() calls to prevent a Linux kernel BUG warning. The addsum_add() function use the cpuid to minimize lock contention when selecting a bucket, after selection the bucket is protected by a mutex and it is safe to reschedule the process to a different processor at any time. Reviewed-by: Matthew Thode <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7609 Closes #7610
*	Don't panic on bad SA_MAGIC in sa_build_index	Nathaniel Clark	2018-06-07	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \|	If sa_build_index() encounters a corrupt buffer, don't panic. Add info to zfs ring buffer and return EIO. This allows for a cleaner error recovery path. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Nathaniel Clark <[email protected]> Issue #6500 Closes #7487
*	Update the correct abd in l2arc_read_done()	Tom Caputi	2018-06-06	1	-9/+12
\| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes an issue where l2arc_read_done() would always write data to b_pabd, even if raw encrypted data was requested. This only occured in cases where the L2ARC device had a different ashift than the main pool. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7586 Closes #7593
*	Raw receive functions must not decrypt data	Tom Caputi	2018-06-06	4	-5/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a small bug found where receive_spill() sometimes attempted to decrypt spill blocks when doing a raw receive. In addition, this patch fixes another small issue in arc_buf_fill()'s error handling where a decryption failure (which could be caused by the first bug) would attempt to set the arc header's IO_ERROR flag without holding the header's lock. Reviewed-by: Matthew Thode <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7564 Closes #7584 Closes #7592
*	OpenZFS 8484 - Implement aggregate sum and use for arc counters	Paul Dagnelie	2018-06-06	7	-85/+483
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pursuit of improving performance on multi-core systems, we should implements fanned out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Paul Dagnelie <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8484 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7 Issue #3752 Closes #7462
*	Add pool state /proc entry, "SUSPENDED" pools	Tony Hutter	2018-06-06	2	-0/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Add a proc entry to display the pool's state: $ cat /proc/spl/kstat/zfs/tank/state ONLINE This is done without using the spa config locks, so it will never hang. 2. Fix 'zpool status' and 'zpool list -o health' output to print "SUSPENDED" instead of "ONLINE" for suspended pools. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Richard Elling <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7331 Closes #7563
*	OpenZFS 9464 - txg_kick() fails to see that we are quiescing	Serapheim Dimitropoulos	2018-06-04	2	-7/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes Creating a fragmented pool in a DCenter VM and continuously writing to it with multiple instances of randwritecomp, we get the following output from txg.d: 0ms 311MB in 4114ms (95% p1) 75MB/s 544MB (76%) 336us 153ms 0ms 0ms 8MB in 51ms ( 0% p1) 163MB/s 474MB (66%) 129us 34ms 0ms 0ms 366MB in 4454ms (93% p1) 82MB/s 572MB (79%) 498us 20ms 0ms 0ms 406MB in 5212ms (95% p1) 77MB/s 591MB (82%) 661us 37ms 0ms 0ms 340MB in 5110ms (94% p1) 66MB/s 622MB (86%) 1048us 41ms 1ms 0ms 3MB in 61ms ( 0% p1) 51MB/s 419MB (58%) 33us 0ms 0ms 0ms 361MB in 3555ms (88% p1) 101MB/s 542MB (75%) 335us 40ms 0ms 0ms 356MB in 4592ms (92% p1) 77MB/s 561MB (78%) 430us 89ms 1ms 0ms 11MB in 129ms (13% p1) 90MB/s 507MB (70%) 222us 15ms 0ms 0ms 281MB in 2520ms (89% p1) 111MB/s 542MB (75%) 334us 42ms 0ms 0ms 383MB in 3666ms (91% p1) 104MB/s 557MB (77%) 411us 133ms 0ms 0ms 404MB in 5757ms (94% p1) 70MB/s 635MB (88%) 1274us 123ms 2ms 4ms 367MB in 4172ms (89% p1) 88MB/s 556MB (77%) 401us 51ms 0ms 0ms 42MB in 470ms (44% p1) 90MB/s 557MB (77%) 412us 43ms 0ms 0ms 261MB in 2273ms (88% p1) 114MB/s 556MB (77%) 407us 27ms 0ms 0ms 394MB in 3646ms (85% p1) 108MB/s 552MB (77%) 393us 304ms 0ms 0ms 275MB in 2416ms (89% p1) 113MB/s 510MB (71%) 200us 53ms 0ms 0ms 9MB in 53ms ( 0% p1) 169MB/s 483MB (67%) 140us 100ms 1ms The TXGs that are getting synced and don't have lots of changes are pushed by txg_kick() which basically forces the current open txg to get to the quiesced state: if (tx->tx_syncing_txg == 0 && tx->tx_quiesce_txg_waiting <= tx->tx_open_txg && tx->tx_sync_txg_waiting <= tx->tx_synced_txg && tx->tx_quiesced_txg <= tx->tx_synced_txg) { tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1; cv_broadcast(&tx->tx_quiesce_more_cv); } The problem is that the above code doesn't check if we are currently quiescing anything (only if a quiesce or a sync has been requested, ..etc) so the following scenario can happen: 1] We have an open txg A that had enough dirty data (more than zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened a new txg B. No txg is currently being synced. 2] Immediately after the opening of B, txg_kick() was run by some other write (and because of A's dirty data) and saw that we are not currently syncing any txg and no one has requested quiescing so it requests one by bumping tx_quiesce_txg_waiting and broadcasts the quiesce thread. 3] The quiesce thread just passed txg A to be synced and sees that a quiescing request has been sent to it so it immediately grabs B without letting it gather enough data, putting it in a quiesced state and opening a new txg C. In this scenario txg B, is an example of how the entries of interest show up in the txg.d output. Ideally we would like txg_kick() to get triggered only when we are sure that we are not syncing AND not quiescing any txg. This way we can kick an open TXG to the quiescing state when we are sure that there is nothing going on and we would benefit from the different states running concurrently. Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9464 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635b Closes #7587