Commit message | Author | Age | Files | Lines
* Fix issues with raw receive_write_byref() | Tom Caputi | 2018-08-20 | 7 | -43/+144
This patch fixes two issues with raw, deduplicated send streams. The first is that datasets that had been completely received earlier in the stream were no longer marked as raw receives. This caused problems when newly received datasets attempted to fetch raw data from these datasets without this flag set.

The second problem was that the arc freeze checksum code was not consistent about which locks needed to be held while performing its asserts. The proper locking needed to run these asserts is actually fairly nuanced, since the asserts touch the linked list of buffers (requiring the header lock), the arc_state (requiring the b_evict_lock), and the b_freeze_cksum (requiring the b_freeze_lock). Taking all of these locks seems like a large performance sacrifice and a lot of unneeded complexity just to verify that this relatively small debug feature is working as intended, so this patch simply removes these asserts instead.

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #7701
* pyzfs: add missing libzfs_core functions | LOLi | 2018-08-20 | 7 | -1/+148
| | | | | | | | | This change adds the following libzfs_core functions to pyzfs: lzc_remap, lzc_pool_checkpoint, lzc_pool_checkpoint_discard Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7793 Closes #7800
* Skip import activity test in more zdb code paths | Olaf Faaland | 2018-08-20 | 4 | -17/+110
| | | | | | | | | | | | | | | | | | | | | | Since zdb opens the pools read-only, it cannot damage the pool in the event the pool is already imported either on the same host or on another one. If the pool vdev structure is changing while zdb is importing the pool, it may cause zdb to crash. However this is unlikely, and in any case it's a user space process and can simply be run again. For this reason, zdb should disable the multihost activity test on import that is normally run. This commit fixes a few zdb code paths where that had been overlooked. It also adds tests to ensure that several common use cases handle this properly in the future. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Gu Zheng <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7797 Closes #7801
* Don't modify argv[] in user tools | DeHackEd | 2018-08-20 | 2 | -4/+32
| | | | | | | | | | | | argv[] gets modified during string parsing for input arguments. This is reflected in the live process listing. Don't do that. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #7760
* Introduce read/write kstats per dataset | Serapheim Dimitropoulos | 2018-08-20 | 16 | -78/+368
| | | | | | | | | | | | | The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number. The motivation for this change is to provide some preliminary analytics on dataset usage/performance. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #7705
* ZTS: events path cleanup | bunder2015 | 2018-08-18 | 1 | -5/+5
| | | | | | | | | Removing hardcoded paths in events.cfg Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7805
* ZTS: largest_pool_001 path cleanup | bunder2015 | 2018-08-18 | 1 | -5/+5
| | | | | | | | | Removing hardcoded paths in largest_pool_001 Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7804
* ZTS: privilege group path cleanup | bunder2015 | 2018-08-18 | 4 | -8/+8
| | | | | | | | Removing hardcoded paths in privilege group tests Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7803
* ZTS: Fix import_cache_device_replaced | Brian Behlendorf | 2018-08-18 | 10 | -19/+24
Allow the 'zpool replace' to run slowly without overwhelming the vdev queues by setting zfs_scan_vdev_limit=128k. This limits the number of concurrent slow IOs which need to be handled. The net effect is that the test case runs approximately 3x faster, putting it well under the 10 minute per-test time limit.

Rename import_cache* test cases to import_cachefile*. Originally these were renamed due to a maximum tar name limit; this limit was removed by commit 1dfde3d9b.

Replaced instances of /var/tmp in zpool_import.cfg with $TEST_BASE_DIR.

Reviewed-by: bunder2015 <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7765
Closes #7802
* 'zfs holds' scripted mode is not documented | LOLi | 2018-08-18 | 2 | -5/+8
| | | | | | | | | | This change simply documents the existing "scripted mode" option in both command help and man page. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7798
* Fix arcstat.py handling of unsupported options | LOLi | 2018-08-18 | 1 | -1/+1
| | | | | | | | | | | This change allows the arcstat.py script to handle unsupported options gracefully and print both error and usage messages when one such option is provided. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7799
* ZTS: Fix reservation_001_pos | Brian Behlendorf | 2018-08-17 | 1 | -1/+1
| | | | | | | | | | It's possible for an unrelated process, like blkid, to have the volume open when 'zfs destroy' is run. Switch the cleanup function to the destroy_dataset() helper which handles this case by retrying the destroy when the dataset is busy. Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7796
* Fix traverse_impl() kmem leak | Brian Behlendorf | 2018-08-15 | 1 | -2/+2
| | | | | | | | | | | | The error path must free the memory allocated by this function or it will be leaked. In practice, this would leak only a few bytes of memory under rare circumstances and thus is unlikely to have caused any real problems. This issue was caught by the kmemleak. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7791
* Use posix format for dist tarballs | Brian Behlendorf | 2018-08-15 | 65 | -71/+75
Traditionally Automake has defaulted to the V7 tar format when creating tarballs for distributions. One of the many limitations of this format is a 99 character maximum path + file name limit. This can cause problems when adding new test cases to the ZTS due to the depth of the sub-tree and descriptive test names.

This change switches the build system to the posix (aliased as pax) tar format which conforms to the POSIX.1-2001 specification. This format does not suffer from the V7 limitations, was designed to be compatible, and will become the default format in future versions of GNU tar.

https://www.gnu.org/software/tar/manual/html_chapter/tar_8.html

As part of this change the blockfiles directories which were originally removed due to this limit have been re-added.

Reviewed by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7767
* Check encrypted dataset + embedded recv earlier | Tom Caputi | 2018-08-15 | 5 | -14/+57
| | | | | | | | | | | | | | This patch fixes a bug where attempting to receive a send stream with embedded data into an encrypted dataset would not cleanup that dataset when the error was reached. The check was moved into dmu_recv_begin_check(), preventing this issue. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7650
* Added encryption support for zfs recv -o / -x | Tom Caputi | 2018-08-15 | 17 | -93/+507
One small integration that was absent from b52563 was support for zfs recv -o / -x with regards to encryption parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without needing to create a "dummy" encrypted parent so that encryption can be inherited.
* Allowing users to change their keylocation on receive, so long as the receiving dataset is an encryption root.
* Allowing users to explicitly exclude or override the encryption property from an unencrypted properties stream, allowing it to be received as encrypted.
* Receiving a recursive hierarchy of unencrypted datasets, encrypting the top-level one and forcing all children to inherit the encryption.

Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #7650
* Fix comment on calculating blkid | Tomohiro Kusumi | 2018-08-13 | 1 | -1/+1
| | | | | | | | | Fix comment on calculating blkid at level n within dnode's blkptrs. "(2^(level*(indblkshift - SPA_BLKPTRSHIFT)" is part of divisor in this division. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7768
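The relationship that the corrected comment describes can be sketched in Python. This is only an illustration, assuming that the level-n blkid is the byte offset divided by the data block size multiplied by 2^(level*(indblkshift - SPA_BLKPTRSHIFT)), and that SPA_BLKPTRSHIFT is 7 (128-byte block pointers). The function name blkid_at_level is hypothetical and does not appear in the source tree.

```python
# Assumption: block pointers are 128 bytes, so SPA_BLKPTRSHIFT is 7.
SPA_BLKPTRSHIFT = 7


def blkid_at_level(offset, level, datablkshift, indblkshift):
    """Return the block id that covers byte `offset` at indirection `level`.

    Each indirect block holds 2^(indblkshift - SPA_BLKPTRSHIFT) block
    pointers, so one level-n block spans that many level-(n-1) blocks.
    The whole product below is the divisor, expressed here as a shift.
    """
    pointers_shift = indblkshift - SPA_BLKPTRSHIFT
    return offset >> (datablkshift + level * pointers_shift)
```

With 128K data and indirect blocks (both shifts equal to 17), each indirect block holds 1024 pointers, so a 128 MiB offset falls in data block 1024 and in level-1 block 1.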
* ZTS: delegate group path cleanup | bunder2015 | 2018-08-13 | 2 | -7/+7
| | | | | | | | | Removing hardcoded paths in delegate group tests Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7778
* ZTS: acl group path cleanup | bunder2015 | 2018-08-13 | 3 | -7/+6
| | | | | | | | | Removing hardcoded paths in acl group tests Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7777
* ZTS: inuse_004 path cleanup | bunder2015 | 2018-08-13 | 1 | -1/+1
| | | | | | | | | Removing hardcoded path in inuse_004 Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7775
* ZTS: projectquota_002 path cleanup | bunder2015 | 2018-08-13 | 1 | -1/+1
| | | | | | | | | Removing hardcoded path in projectquota_002 Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #7774
* MMP should not suspend pool in ztest | Olaf Faaland | 2018-08-13 | 1 | -0/+2
| | | | | | | | | | | | | | | | | When running ztest, never suspend the pool due to failed or delayed MMP writes. There are many sources of long delays within ztest, such as device opens, closes, etc. which in combination, may delay MMP writes too long and cause MMP to suspend the pool. Some of these delays also affect real pools, and should be fixed. That is being worked separately. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7776
* ZTS: Test case reliability | Brian Behlendorf | 2018-08-12 | 2 | -13/+14
* Both cli_root/zpool_import/import_cache_device_replaced and redundancy/redundancy_004_neg have been observed to fail for spurious reasons ~1% of the time. Add them to the exception list and reference the open Github issue.
* Speed up replacement/replacement_001_pos to prevent it from exceeding the 10 minute per-test limit and getting KILLED. File vdev creation switched to truncate -s, redundant raidz1 testing pass dropped, and some minor formatting issues fixed.

Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7766
* Allow inherited properties in zfs_check_settable() | LOLi | 2018-08-03 | 2 | -15/+17
| | | | | | | | | | | | | This change modifies how 'checksum' and 'dedup' properties are verified in zfs_check_settable() handling the case where they are explicitly inherited in the dataset hierarchy when receiving a recursive send stream. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7755 Closes #7576 Closes #7757
* zfs_ioc_unload_key can drop extra spa ref | Don Brady | 2018-08-03 | 2 | -5/+14
| | | | | | | Reviewed by: Thomas Caputi <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7759
* ZTS: Fix zfs_create_007_pos | Brian Behlendorf | 2018-08-03 | 1 | -4/+2
| | | | | | | | | | | | It's possible for an unrelated process, like blkid, to have the volume open when 'zfs destroy' is run. Switch the cleanup function to the destroy_dataset() helper which handles this case by retrying the destroy when the dataset is busy. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed by: John Kennedy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7763
* Reduce taskq and context-switch cost of zio pipe | Matthew Ahrens | 2018-08-02 | 2 | -126/+146
| | | | | | | | | | | | | | | | | | | | When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the logical zio_read(), and then a physical zio. Currently, each of these results in a separate taskq_dispatch(zio_execute). On high-read-iops workloads, this causes a significant performance impact. By processing all 3 ZIO's in a single taskq entry, we reduce the overhead on taskq locking and context switching. We accomplish this by allowing zio_done() to return a "next zio to execute" to zio_execute(). This results in a ~12% performance increase for random reads, from 96,000 iops to 108,000 iops (with recordsize=8k, on SSD's). Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-59292 Closes #7736
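The idea of letting zio_done() hand back a "next zio to execute" can be illustrated with a toy model in Python. This is a sketch only: the Zio class, the run() driver and the single-step "pipeline" are hypothetical simplifications, and a plain deque stands in for the taskq. It is not how the real ZIO pipeline stages are implemented.

```python
from collections import deque


class Zio:
    """Toy I/O node that only tracks its parent and unfinished children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.pending_children = 0
        if parent is not None:
            parent.pending_children += 1


def zio_done(zio, log):
    """Finish `zio`; return the parent if it became runnable, else None."""
    log.append(zio.name)
    parent = zio.parent
    if parent is not None:
        parent.pending_children -= 1
        if parent.pending_children == 0:
            return parent
    return None


def zio_execute(zio, log):
    """Keep running whatever zio_done() hands back, in the same task."""
    while zio is not None:
        zio = zio_done(zio, log)


def run(chain_in_one_entry):
    """Complete a null -> logical -> physical read and count dispatches."""
    null = Zio("zio_null")
    logical = Zio("zio_read", parent=null)
    physical = Zio("physical_read", parent=logical)
    taskq = deque([physical])  # stand-in for the taskq
    dispatches = 0
    log = []
    while taskq:
        zio = taskq.popleft()
        dispatches += 1
        if chain_in_one_entry:
            zio_execute(zio, log)
        else:
            # Old behaviour in this model: one dispatch per ZIO.
            next_zio = zio_done(zio, log)
            if next_zio is not None:
                taskq.append(next_zio)
    return dispatches, log
```

In this model, run(False) needs three dispatches to complete the three ZIOs, while run(True) completes them in the same order with a single dispatch.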
* Add missing checks to zpl_xattr_* functions | John Gallagher | 2018-08-02 | 2 | -28/+38
| | | | | | | | | | | | | Linux specific zpl_* entry points, such as xattrs, must include the same unmounted and sa handle checks as the common zfs_ entry points. The additional ZPL_* wrappers are identical to their ZFS_ counterparts except the errno is negated since they are expected to be used at the zpl_ layer. Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #5866 Closes #7761
* Add support for selecting encryption backend | Nathan Lewis | 2018-08-02 | 18 | -1586/+2296
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl) that control the crypto implementation. At the moment there is a choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family 15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated with, as key schedules generated with various implementations are not necessarily interchangeable.

Reviewed by: Gvozden Neskovic <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Nathaniel R. Lewis <[email protected]>
Closes #7102
Closes #7103
* Fix OpenZFS 9337 mismerge | George Wilson | 2018-08-02 | 5 | -15/+16
| | | | | | | | | | | | | | | | This change reintroduces logic required by OpenZFS 9577. When OpenZFS 9337, zfs get all is slow due to uncached metadata, was merged in it ended up removing logic required by OpenZFS 9577, remove zfs_dbuf_evict_key, and inadvertently reintroduced the bug that 9577 was designed to fix. This change re-enables the "evicting" flag to dbuf_rele_and_unlock and dnode_rele_and_unlock and updates all callers to provide the correct parameter. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #7758
* Fix deadlock between zfs umount & snapentry_expire | Rohan Puri | 2018-08-01 | 1 | -6/+5
| | | | | | | | | | | | | | | | | | zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a writer and calls zfsctl_snapshot_unmount_cancel(), which waits for snapentry_expire() if present (when snap is automounted). This snapentry_expire() itself then waits for zfs_snapshot_lock as a reader, resulting in a deadlock. The fix is to only hold the zfs_snapshot_lock over the tree lookup and removal. After a successful lookup the lock can be dropped and zfs_snapentry_t will remain valid until the reference taken by the lookup is released. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rohan Puri <[email protected]> Closes #7751 Closes #7752
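The locking change in the deadlock fix can be mimicked with a small Python model. This is a hedged sketch, not the kernel code: a plain threading.Lock stands in for the reader/writer zfs_snapshot_lock, reference counting is omitted, the expiry task is started inside unmount() purely to keep the example deterministic, and the names SnapEntry, snapshots and the wait_under_lock flag are hypothetical.

```python
import threading


class SnapEntry:
    """Toy stand-in for an automounted snapshot entry."""

    def __init__(self, name):
        self.name = name
        self.expired = threading.Event()


snapshot_lock = threading.Lock()  # stand-in for zfs_snapshot_lock
snapshots = {}                    # stand-in for the snapshot tree


def snapentry_expire(entry):
    """Expiry task: it needs the lock before it can finish."""
    with snapshot_lock:
        pass
    entry.expired.set()


def unmount(name, wait_under_lock=False, timeout=0.5):
    """Remove `name` and wait for its expiry task.

    Returns True if the expiry task finished while we were waiting.
    With wait_under_lock=True the wait happens with the lock held,
    which models the old behaviour and can only time out.
    """
    snapshot_lock.acquire()
    entry = snapshots.pop(name)  # lookup and removal under the lock
    if not wait_under_lock:
        snapshot_lock.release()  # the fix: drop the lock before waiting
    task = threading.Thread(target=snapentry_expire, args=(entry,), daemon=True)
    task.start()
    task.join(timeout)
    finished = entry.expired.is_set()
    if wait_under_lock:
        snapshot_lock.release()
        task.join()  # let the blocked task finish once the lock is free
    return finished
```

Calling unmount() with the default arguments returns True, because the expiry task can take the lock as soon as the lookup and removal are done. With wait_under_lock=True the wait times out and the function returns False, which is the model's analogue of the deadlock.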
* OpenZFS 9112 - Improve allocation performance on high-end systems | Paul Dagnelie | 2018-07-31 | 16 | -212/+605
Overview
========

We parallelize the allocation process by creating the concept of "allocators". There are a certain number of allocators per metaslab group, defined by the value of a tunable at pool open time. Each allocator for a given metaslab group has up to 2 active metaslabs; one "primary", and one "secondary". The primary and secondary weight mean the same thing they did in the pre-allocator world; primary metaslabs are used for most allocations, secondary metaslabs are used for ditto blocks being allocated in the same metaslab group. There is also the CLAIM weight, which has been separated out from the other weights, but that is less important to understanding the patch.

The active metaslabs for each allocator are moved from their normal place in the metaslab tree for the group to the back of the tree. This way, they will not be selected for use by other allocators searching for new metaslabs unless all the passive metaslabs are unsuitable for allocations. If that does happen, the allocators will "steal" from each other to ensure that IOs don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken into a separate queue for each allocator. We don't want to dramatically increase the number of inflight IOs on low-end systems, because it can significantly increase txg times. On the other hand, we want to ensure that there are enough IOs for each allocator to allow for good coalescing before sending the IOs to the disk. As a result, we take a compromise path; each allocator's alloc queue max depth starts at a certain value for every txg. Every time an IO completes, we increase the max depth. This should hopefully provide a good balance between the two failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause very similar contention when selecting IOs to allocate. This parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based on the number of CPUs in the system, whether or not the system has a NUMA architecture, the speed of the drives, the values for the various tunables, and the workload being performed. For an fio async sequential write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows that a significant new bottleneck is the vdev disk queues, which also need to be parallelized. Prototyping of this change has occurred, and there was a performance improvement, but more work needs to be done before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Alexander Motin <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
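The per-allocator queue depth policy described in the OpenZFS 9112 overview can be sketched in Python. This is an illustrative model only: the class name AllocQueue, the starting depth, and the increment of one per completed IO are assumptions, since the commit message says only that the max depth starts at a certain value each txg and grows as IOs complete.

```python
class AllocQueue:
    """Toy per-allocator alloc queue with a growing depth limit."""

    def __init__(self, initial_max_depth):
        self.initial_max_depth = initial_max_depth
        self.max_depth = initial_max_depth
        self.inflight = 0

    def start_txg(self):
        """Reset the depth limit at the start of every txg."""
        self.max_depth = self.initial_max_depth

    def try_issue(self):
        """Issue one IO if the current depth limit allows it."""
        if self.inflight >= self.max_depth:
            return False
        self.inflight += 1
        return True

    def io_done(self):
        """A completed IO frees a slot and raises the limit.

        Assumption: the limit grows by exactly one per completion.
        """
        self.inflight -= 1
        self.max_depth += 1
```

Starting from a depth of 2, the queue accepts two IOs and refuses a third. After one completion the limit is 3, so two more IOs can be issued before the queue refuses again. A new txg drops the limit back to 2.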
* Add missing zfs-dracut RPM dependencies | Brian Behlendorf | 2018-07-31 | 1 | -1/+3
| | | | | | | | | | | The zfs-dracut package requires the hostid, basename, head, awk, and grep utilities be installed. The first three are provided by coreutils but additional dependencies are required for awk and grep. Reviewed-by: Manuel Amador (Rudd-O) <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7729 Closes #7747
* Use zfs-import.target in contrib/dracut | Antonio Russo | 2018-07-31 | 2 | -5/+10
| | | | | | | | | The new zfs-import.target should be used in place of the zfs-import-*.service units. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Manuel Amador (Rudd-O) <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #6964
* OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system | Don Brady | 2018-07-30 | 6 | -17/+50
In the case of one pool being built on another pool, we want to make sure we don't end up throttling the lower (backing) pool when the upper pool is the majority contributor to dirty data. To ensure we make forward progress during throttling, we also check the current pool's net dirty data and only throttle if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty data in the cache.

Authored by: Don Brady <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Prashanth Sreenivasa <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent, zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent were intentionally not added as tunable module parameters.

OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef
Closes #7749
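The per-pool throttle check from the OpenZFS 9465 commit can be sketched in Python. This is a guess at the shape of the condition, not the actual ARC code: the way the three percentages are combined, the default values of 50, 25 and 20, and the function and parameter names are all assumptions made for illustration. Only the three variable names and the per-pool comparison come from the commit message.

```python
def should_throttle(arc_c, reserve, tempreserve, anon_size, pool_dirty_anon,
                    dirty_limit_percent=50,   # assumed default
                    anon_limit_percent=25,    # assumed default
                    pool_dirty_percent=20):   # assumed default
    """Decide whether a pool's writer should be throttled.

    `pool_dirty_anon` is the dirty anonymous data owned by the pool
    making the request; `anon_size` is the total across all pools.
    """
    total_dirty = reserve + tempreserve + anon_size
    too_much_dirty = total_dirty > arc_c * dirty_limit_percent // 100
    too_much_anon = anon_size > arc_c * anon_limit_percent // 100
    # Per-pool check: a pool that contributes little of the anonymous
    # dirty data (for example a backing pool) is not throttled.
    pool_is_contributor = pool_dirty_anon > anon_size * pool_dirty_percent // 100
    return too_much_dirty and too_much_anon and pool_is_contributor
```

With arc_c = 1000, reserve = 100 and anon_size = 600, a backing pool that owns 50 units of the dirty data is not throttled under these assumed percentages, while the upper pool that owns 550 units is.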
* OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations | Serapheim Dimitropoulos | 2018-07-30 | 6 | -55/+348
= Motivation

While dealing with another performance issue (see 126118f) we noticed that we spend a lot of time in various places in the kernel when constructing long nvlists. The problem is that when an nvlist is created with the NV_UNIQUE_NAME set (which is the case most of the time), we do a linear search through the whole list to ensure uniqueness for every entry we add.

An example of the above scenario can be seen in the following flamegraph, where more than half the time of the zfsdev_ioctl() is spent on constructing nvlists.

Flamegraph: https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg

Adding a table to speed up lookups will help situations where we just construct an nvlist (like the scenario above), in addition to regular lookups and removals.

= What this patch does

In this diff we've implemented a hash-table on top of the nvlist code that converts most nvlist operations from O(number of entries) to O(1)* (the star is for amortized time, as the hash-table grows and shrinks depending on the number of entries; plain lookup is strictly O(1)).

= Performance Analysis

To analyze the performance improvement I just used the setup from the snapshot deletion issue mentioned above in the Motivation section. Basically I created 10K filesystems with one snapshot each and then I just used the API of libZFS_Core to pass down an nvlist of all the snapshots to have them deleted. The reason I used my own driver program was to have clean performance results of what actually happens in the kernel. The flamegraphs and wall clock times mentioned below were gathered from the start to the end of the driver program's run. Between trials the testpool used was completely destroyed, the system was rebooted and the testpool was completely recreated. The reason for this dance was to get consistent results.

== Results (before patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real 5.3
user 0.4
sys 2.3

[Trial 5]
real 8.2
user 0.4
sys 2.4

[Trial 6]
real 6.0
user 0.5
sys 2.3
```

== Results (after patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real 4.9
user 0.0
sys 0.9

[Trial 5]
real 3.8
user 0.0
sys 0.9

[Trial 6]
real 3.6
user 0.0
sys 0.9
```

== Analysis

The results between the trials are consistent, so in this section I will only talk about the flamegraph results from trial-1 and the wall-clock results from trial-4.

From trial-1 we can see that zfsdev_ioctl() goes from 2,331 to 996 sample counts. Specifically, the samples from fnvlist_add_nvlist() and spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples), leaving zfs_ioc_destroy_snaps() to dominate most samples from zfsdev_ioctl().

From trial-4 we see that the user time dropped to 0 seconds. I believe the consistent 0.4 seconds before my patch was applied was due to my driver program constructing the long nvlist of snapshots so it can pass it to the kernel. As for the system time, the effect there is more clear (2.3 down to 0.9 seconds).

Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and zpool_do_events_nvprint().

Authored by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
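The cost difference described in the OpenZFS 9580 motivation can be made concrete with a small Python model. This is a sketch under stated assumptions, not the nvlist implementation: LinearNvlist and HashedNvlist are hypothetical classes, a Python dict stands in for the hash table, and "add" is modelled as replacing any existing pair with the same name, which is how unique-name lists are assumed to behave here.

```python
class LinearNvlist:
    """Unique-name list that scans every pair on each add."""

    def __init__(self):
        self.pairs = []
        self.comparisons = 0

    def add(self, name, value):
        for index, (existing, _) in enumerate(self.pairs):
            self.comparisons += 1
            if existing == name:
                del self.pairs[index]  # replace the old pair
                break
        self.pairs.append((name, value))


class HashedNvlist:
    """Unique-name list backed by a hash table (a dict in this model)."""

    def __init__(self):
        self.pairs = {}
        self.lookups = 0

    def add(self, name, value):
        self.lookups += 1
        self.pairs.pop(name, None)  # drop any existing pair, O(1) expected
        self.pairs[name] = value


def build(nvlist, count):
    """Add `count` distinct snapshot names, as the driver program might."""
    for i in range(count):
        nvlist.add("pool/fs%d@snap" % i, True)
    return nvlist
```

Building a list of 1,000 distinct names costs 499,500 name comparisons in the linear model (0 + 1 + ... + 999) and 1,000 table lookups in the hashed one, which is the quadratic versus linear behaviour the commit message describes.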
* OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block | Matthew Ahrens | 2018-07-30 | 2 | -4/+20
| | | | | | | | | | | | | | | | Follow up commit for OpenZFS 9438. See the OpenZFS-issue link below for a complete analysis. Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9439 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/779220d External-issue: DLPX-46861 Closes #7746
* OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth times | Paul Dagnelie | 2018-07-30 | 7 | -31/+270
As reported by https://github.com/zfsonlinux/zfs/issues/4996, there is yet another hole birth issue. In this one, if a block is entirely holes, but the birth times are not all the same, we lose that information by creating one hole with the current txg as its birth time.

The ZoL PR's fix approach is incorrect. Ultimately, the problem here is that when you truncate and write a file in the same transaction group, the dbuf for the indirect block will be zeroed out to deal with the truncation, and then written for the write. During this process, we will lose hole birth time information for any holes in the range. In the case where a dnode is being freed, we need to determine whether the block should be converted to a higher-level hole in the zio pipeline, and if so do it when the dnode is being synced out.

Porting Notes:
* The DMU_OBJECT_END change in zfs_znode.c was already applied.
* Added test cases from #5675 provided by @rincebrain for hole_birth issues. These test cases should be pushed upstream to OpenZFS.
* Updated mk_files which is used by several rsend tests so the files created are a little more interesting and may contain holes.

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9438
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/738e2a3c
External-issue: DLPX-46861
Closes #7746
* ZTS: Fix reservation_017_pos | Brian Behlendorf | 2018-07-30 | 1 | -1/+1
| | | | | | | | | | | It's possible for an unrelated process, like blkid, to have the volume open when 'zfs destroy' is run. Switch the cleanup function to the destroy_dataset() helper which handles this case by retrying the destroy when the dataset is busy. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7750
* Add rwsem_tryupgrade for 4.9.20-rt16 kernel | Brian Behlendorf | 2018-07-30 | 3 | -11/+20
The RT rwsem implementation was changed to allow multiple readers as of the 4.9.20-rt16 patch set. This results in a build failure because the existing implementation was forced to directly access the rwsem structure which has changed.

While this could be accommodated by adding additional compatibility code, this patch resolves the build issue by simply assuming the rwsem can never be upgraded. This functionality is a performance optimization and all callers must already handle this case.

Converting the last remaining use of __SPIN_LOCK_UNLOCKED to spin_lock_init() was additionally required to get a clean build.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7589
* Fix initramfs missing systemd binaries | George Diamantopoulos | 2018-07-27 | 1 | -0/+2
| | | | | | | | | | | | | Systemd binaries necessary for mounting an encrypted root dataset weren't copied to initramfs generated by dracut. This patch fixes this and copies these binaries unconditionally, that is regardless of whether native ZFS encryption is used for the root dataset. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Diamantopoulos <[email protected]> Closes #7607 Closes #7719
* OpenZFS 8906 - uts: illumos rootfs should support salted cksum
  Author: Toomas Soome, Age: 2018-07-27, Files: 3, Lines: -26/+12

  Porting notes:
  * As of grub-2.02 these checksums are not supported. However, as
    pointed out in #6501 there are alternatives such as EFISTUB which
    work and have no such restriction. A warning was added to the
    checksum property section of the zfs.8 man page.

  Authored by: Toomas Soome <[email protected]>
  Reviewed by: C Fraire <[email protected]>
  Reviewed by: Robert Mustacchi <[email protected]>
  Reviewed by: Yuri Pankov <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>
  OpenZFS-issue: https://illumos.org/issues/8906
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
  Closes #6501
  Closes #7714
* OpenZFS 9442 - decrease indirect block size of spacemaps
  Author: Matthew Ahrens, Age: 2018-07-25, Files: 3, Lines: -18/+50

  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: Serapheim Dimitropoulos <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Albert Lee <[email protected]>
  Reviewed by: Igor Kozhukhov <[email protected]>
  Reviewed by: George Melikov <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>

  Updates to indirect blocks of spacemaps can contribute significantly to
  write inflation. Therefore we want to reduce the indirect block size of
  spacemaps from 128K to 16K.

  Porting notes:
  * Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
    and dmu_object_alloc_dnsize() functions to use a common shared
    dmu_object_alloc_impl() function.

  OpenZFS-issue: https://www.illumos.org/issues/9442
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b
  Closes #7712
* ZTS: Add reservation_008_pos exception
  Author: Brian Behlendorf, Age: 2018-07-25, Files: 1, Lines: -0/+1

  The reservation_008_pos test case has been observed to fail in a
  non-dangerous way in approximately 5% of automated test runs. Add the
  test case to the list of possible expected failures until the test case
  can be made perfectly reliable.

  Reviewed by: Giuseppe Di Natale <[email protected]>
  Reviewed by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #7741
  Closes #7742
* Introduce kstat dmu_tx_dirty_frees_delay
  Author: Feng Sun, Age: 2018-07-25, Files: 3, Lines: -0/+3

  It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
  539d33c7 (OpenZFS 6569 - large file delete can starve out write ops).

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed by: Richard Elling <[email protected]>
  Signed-off-by: Feng Sun <[email protected]>
  Closes #7718
* OpenZFS 9457 - libzfs_import.c:add_config() has a memory leak
  Author: sara hartse, Age: 2018-07-24, Files: 1, Lines: -18/+6

  A memory leak occurs on lines 209 and 213 because the config is not
  freed in the error case. The interface to add_config() seems less than
  ideal - it would be better if it copied any data necessary from the
  config and the caller freed it.

  Porting notes:
  * This issue had already been resolved on Linux by adding the missing
    calls to nvlist_free(). But we'll adopt the upstream fix to keep the
    behavior of the code consistent.

  Authored by: Sara Hartse <[email protected]>
  Reviewed by: Matt Ahrens <[email protected]>
  Reviewed by: Serapheim Dimitropoulos <[email protected]>
  Reviewed by: Giuseppe Di Natale <[email protected]>
  Reviewed by: George Melikov <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>
  OpenZFS-issue: https://illumos.org/issues/9457
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/be86bb8a
  Closes #7713
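The ownership rule the message calls preferable (callee copies, caller frees) might look like the sketch below. It is not the libzfs code: config_t, pool_list_t, this add_config() and import_one() are hypothetical stand-ins that use a plain string where the real function handles an nvlist.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; the real code passes an nvlist_t config. */
typedef struct config {
	char *name;
} config_t;

typedef struct pool_list {
	char *names[8];
	int count;
} pool_list_t;

/*
 * Copies what it needs and never takes ownership of cfg, so there is
 * no error path on which the callee has to remember to free it.
 */
static int
add_config(pool_list_t *pl, const config_t *cfg)
{
	size_t len;
	char *copy;

	if (cfg->name == NULL || pl->count == 8)
		return (-1);

	len = strlen(cfg->name) + 1;
	if ((copy = malloc(len)) == NULL)
		return (-1);
	memcpy(copy, cfg->name, len);
	pl->names[pl->count++] = copy;
	return (0);
}

/* The caller owns the config and frees it on success and error alike. */
static int
import_one(pool_list_t *pl, const char *name)
{
	config_t cfg = { NULL };
	int err;

	if (name != NULL) {
		size_t len = strlen(name) + 1;
		if ((cfg.name = malloc(len)) == NULL)
			return (-1);
		memcpy(cfg.name, name, len);
	}

	err = add_config(pl, &cfg);
	free(cfg.name);		/* free(NULL) is a no-op */
	return (err);
}
```

With a single free at the call site, adding a new early return to add_config() cannot reintroduce the leak.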
* OpenZFS 9338 - moved dnode has incorrect dn_next_type
  Author: Matthew Ahrens, Age: 2018-07-24, Files: 1, Lines: -0/+2

  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: Prashanth Sreenivasa <[email protected]>
  Reviewed by: Serapheim Dimitropoulos <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Giuseppe Di Natale <[email protected]>
  Reviewed by: George Melikov <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>

  While investigating a different problem, I noticed that moved dnodes
  (those processed by dnode_move_impl() via kmem_move()) have an
  incorrect dn_next_type. This could cause the on-disk dn_type to be
  changed to an invalid value. The fix is to copy the dn_next_type in
  dnode_move_impl().

  Porting notes:
  * For the moment this potential issue cannot occur on Linux since the
    SPL does not provide the kmem_move() functionality.

  OpenZFS-issue: https://illumos.org/issues/9338
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0717e6f13
  Closes #7715
* Refactor arc_hdr_realloc_crypt()
  Author: Tom Caputi, Age: 2018-07-24, Files: 1, Lines: -3/+60

  The arc_hdr_realloc_crypt() function is responsible for converting a
  "full" arc header to an extended "crypt" header and vice versa. This
  code was originally written with a bcopy() so that any new members
  added to arc headers would automatically be included without requiring
  a code change. However, in practice this (along with small differences
  in kmem_cache implementations between various platforms) has caused a
  number of hard-to-find problems in ports to other operating systems.

  This patch solves this problem by making all member copies explicit and
  adding ASSERTs for fields that cannot be set during the transfer. It
  also manually resets the old header after the reallocation is finished
  so it can be properly reallocated and reused.

  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Signed-off-by: Tom Caputi <[email protected]>
  Closes #7711
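As a rough illustration of "explicit member copies plus ASSERTs" versus one bulk copy, consider the sketch below. The hdr_t layout and hdr_transfer() are hypothetical and far simpler than the real arc_buf_hdr_t; only the pattern is meant to carry over.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified header; not the real ARC layout. */
typedef struct hdr {
	uint64_t size;
	uint32_t flags;
	void *waiters;	/* must be NULL while the header is moved */
	int is_crypt;
} hdr_t;

static void
hdr_transfer(hdr_t *nhdr, hdr_t *ohdr, int to_crypt)
{
	/* Fields that cannot be set during the transfer are asserted. */
	assert(ohdr->waiters == NULL);

	/* Each member is copied by name instead of one bulk copy. */
	nhdr->size = ohdr->size;
	nhdr->flags = ohdr->flags;
	nhdr->waiters = NULL;
	nhdr->is_crypt = to_crypt;

	/* Reset the old header so its allocator can hand it out again. */
	memset(ohdr, 0, sizeof (*ohdr));
}
```

The trade-off is the one the message names: a newly added member is no longer copied automatically, but the copy no longer depends on layout or allocator details that differ between platforms.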
* dsl_scan_scrub_cb: don't double-account non-embedded blocks
  Author: Steven Noonan, Age: 2018-07-24, Files: 1, Lines: -2/+3

  We were doing count_block() twice inside this function, once
  unconditionally at the beginning (intended to catch the embedded block
  case) and once near the end after processing the block.

  The double-accounting caused the "zpool scrub" progress statistics in
  "zpool status" to climb from 0% to 200% instead of 0% to 100%, and
  showed double the I/O rate it was actually seeing.

  This was apparently a regression introduced in commit 00c405b4b5e8,
  which was an incorrect port of this OpenZFS commit:

  https://github.com/openzfs/openzfs/commit/d8a447a7

  Reviewed by: Thomas Caputi <[email protected]>
  Reviewed by: Matt Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Steven Noonan <[email protected]>
  Closes #7720
  Closes #7738
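The arithmetic behind the 200% figure can be reproduced with a small model. Everything here is a hypothetical stand-in (scan_stats_t, the two callbacks, percent_done()); the real callback takes block pointers and issues I/O, and the actual patch may restructure the code differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the scan statistics. */
typedef struct scan_stats {
	uint64_t examined;	/* bytes accounted as scanned */
} scan_stats_t;

static void
count_block(scan_stats_t *st, uint64_t size)
{
	st->examined += size;
}

/* Buggy shape: counted up front, then again after processing. */
static void
scrub_cb_buggy(scan_stats_t *st, uint64_t size, bool embedded)
{
	count_block(st, size);
	if (embedded)
		return;
	/* ... scrub I/O for the block would be issued here ... */
	count_block(st, size);
}

/* Fixed shape: every block is counted exactly once. */
static void
scrub_cb_fixed(scan_stats_t *st, uint64_t size, bool embedded)
{
	if (embedded) {
		count_block(st, size);
		return;
	}
	/* ... scrub I/O for the block would be issued here ... */
	count_block(st, size);
}

static unsigned
percent_done(const scan_stats_t *st, uint64_t total)
{
	return ((unsigned)(st->examined * 100 / total));
}
```

Feeding four non-embedded 100-byte blocks of a 400-byte pool through the buggy callback accounts 800 bytes, which percent_done() reports as 200; the fixed callback accounts 400 bytes and reports 100.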
* PR's should provide motivation & context first
  Author: Matthew Ahrens, Age: 2018-07-23, Files: 1, Lines: -3/+3

  It's often necessary to understand why a change is made, before
  understanding the exact changes that are made. Context provides
  background, which by definition is necessary to understand prior to the
  substance of the Pull Request.

  Change the PR template to request "Motivation and Context" first,
  before "Description".

  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #7737