path: root/module
Commit message (Author, Date, Files changed, Lines -/+)
* Fix LZ4_uncompress_unknownOutputSize caused panic (Feng Sun, 2017-05-19, 1 file, -8/+19)
| | | | | | | | | | | | | | | | Sync with kernel patches for lz4 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/lib/lz4 4a3a99 lz4: add overrun checks to lz4_uncompress_unknownoutputsize() d5e7ca LZ4 : fix the data abort issue bea2b5 lib/lz4: Pull out constant tables 99b7e9 lz4: fix system halt at boot kernel on x86_64 Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Feng Sun <[email protected]> Closes #5975 Closes #5973
* Implemented zpool sync command (Alek P, 2017-05-19, 1 file, -0/+44)
| | | | | | | | | | | This addition will enable us to sync an open TXG to the main pool on demand. The functionality is similar to 'sync(2)' but 'zpool sync' will return when data has hit the main storage instead of potentially just the ZIL as is the case with the 'sync(2)' cmd. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #6122
* Force fault a vdev with 'zpool offline -f' (Tony Hutter, 2017-05-19, 4 files, -12/+76)
| | | | | | | | | | | | | This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6094
* Fixed small memory leak in ereport handling (Tom Caputi, 2017-05-18, 1 file, -6/+6)
| | | | | | | | | | | One pre-check in zfs_ereport_start() was being called after the nvlists were being allocated. This simply corrects that issue. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #6140
* Introduce zv_state_lock (Boris Protopopov, 2017-05-16, 1 file, -71/+124)
| | | | | | | | | | | | The lock is designed to protect internal state of zvol_state_t and to avoid taking spa_namespace_lock (e.g. in dmu_objset_own() code path) while holding zvol_state_lock. Refactor the code accordingly. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3484 Closes #6065 Closes #6134
* Revert commit 1ee159f4 (Boris Protopopov, 2017-05-16, 1 file, -2/+29)
| | | | | | | | | | | | Revert the earlier fix for lock order inversion with zvol_open(), as it did not account for the use of zvols as vdevs. That use case resulted in lock order inversion deadlocks involving spa_namespace_lock and bdev->bd_mutex. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #6065 Issue #6134
* Skip spurious resilver IO on raidz vdev (Isaac Huang, 2017-05-12, 8 files, -33/+122)
| | | | | | | | | | | | | | | | On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data. This patch eliminates such spurious IOs. Reviewed-by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Isaac Huang <[email protected]> Closes #5316
* OpenZFS 8063 - verify that we do not attempt to access inactive txg (Matthew Ahrens, 2017-05-10, 6 files, -23/+38)
| | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Porting Notes: - ASSERTV added to txg_sync_waiting() for unused variable. OpenZFS-issue: https://www.illumos.org/issues/8063 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46 Closes #6109
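The per-txg pattern referenced above can be sketched in a few lines of ordinary C: state lives in a small ring indexed by txg, and an assertion rejects access to any txg outside the active window. The constants and field names below are illustrative choices for the sketch, not the ZFS definitions.
```c
/*
 * Illustrative user-space sketch of the "per-txg state" pattern, not ZFS
 * code.  State for a txg lives in a small ring indexed by txg & TXG_MASK;
 * the assertion rejects access to a txg that is not one of the active ones.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TXG_SIZE 4                           /* ring size, power of two */
#define TXG_MASK (TXG_SIZE - 1)

typedef struct {
    uint64_t last_synced_txg;                /* newest txg fully on disk */
    uint64_t open_txg;                       /* txg currently accepting writes */
    uint64_t dirty_bytes[TXG_SIZE];          /* example per-txg state */
} pool_state_t;

static uint64_t *
per_txg_dirty(pool_state_t *ps, uint64_t txg)
{
    /* Only the active txgs (last_synced+1 .. open) may be touched. */
    assert(txg > ps->last_synced_txg);
    assert(txg <= ps->open_txg);
    return &ps->dirty_bytes[txg & TXG_MASK];
}

int
main(void)
{
    pool_state_t ps = { .last_synced_txg = 96, .open_txg = 99 };

    *per_txg_dirty(&ps, 99) += 4096;         /* open txg: fine */
    *per_txg_dirty(&ps, 97) += 512;          /* syncing txg: fine */
    printf("open txg dirty: %llu\n",
        (unsigned long long)ps.dirty_bytes[99 & TXG_MASK]);
    /* *per_txg_dirty(&ps, 95) would trip the assertion: txg 95 is inactive. */
    return 0;
}
```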
* OpenZFS 8166 - zpool scrub thinks it repaired offline device (Matthew Ahrens, 2017-05-10, 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Matthew Ahrens <[email protected]> If we do a scrub while a leaf device is offline (via "zpool offline"), we will inadvertently clear the DTL (dirty time log) of the offline device, even though it is still damaged. When the device comes back online, we will incompletely resilver it, thinking that the scrub repaired blocks written before the scrub was started. The incomplete resilver can lead to data loss if there is a subsequent failure of a different leaf device. The fix is to never clear the DTL of offline devices. Note that if a device is onlined while a scrub is in progress, the scrub will be restarted. The problem can be worked around by running "zpool scrub" after "zpool online". OpenZFS-issue: https://www.illumos.org/issues/8166 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/372 Closes #5806 Closes #6103
* Add missing arc_free_cksum() to arc_release() (Tom Caputi, 2017-05-10, 1 file, -0/+4)
| | | | | | | | | | | | | | | | | | The arc layer tracks checksums of its data in the arc header so that it can ensure that buffers haven't changed when they're not supposed to. This checksum is only maintained while there is an uncompressed buffer still attached to the header. Unfortunately there is a missing call to arc_free_cksum() in arc_release() that can trigger ASSERTs. This has not been a common issue because the checksums are only maintained for debug builds and triggering the bug requires writing a block (and therefore calling arc_release()) while a compressed buffer is still being used on a debug build. This simply corrects the issue. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #6105
* Linux 4.12 compat: CURRENT_TIME removed (Brian Behlendorf, 2017-05-10, 4 files, -11/+14)
| | | | | | | | Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6114
* Add property overriding (-o|-x) to 'zfs receive' (LOLi, 2017-05-09, 1 file, -31/+165)
| | | | | | | | | | | | | | | | | | | | This allows users to specify "-o property=value" to override and "-x property" to exclude properties when receiving a zfs send stream. Both native and user properties can be specified. This is useful when using zfs send/receive for periodic backup/replication because it lets users change properties such as canmount, mountpoint, or compression without modifying the source. References: https://www.illumos.org/issues/2745 https://www.illumos.org/issues/3753 Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #1350 Closes #5349
* Make createtxg and guid properties public (Christian Schwarz, 2017-05-09, 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Document the existence of `createtxg` and `guid` native properties in man pages and zfs command output. One of the great features of ZFS is incremental replication of snapshots, possibly between pools on different machines. Shell scripts are commonly used to automate this procedure. They have to find the most recent common snapshot between both sides and then perform incremental send & recv. Currently, scripts rely on the sorting order of `zfs list`, which defaults to `createtxg`, and the assumption that snapshot names on either side do not change. By making `createtxg` and `guid` part of the public ZFS interface, scripts are enabled to use a) `createtxg` to determine the logical & temporal order of snapshots (the creation property is not an equivalent substitute since multiple snapshots may be created within one second) b) `guid` to uniquely identify a snapshot, independent of its current display name. This has the potential to make scripts safer and correct. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: DHE <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #6102
* Linux 4.12 compat: PF_FSTRANS was removed (Chunwei Chen, 2017-05-09, 1 file, -1/+1)
| | | | | | | | zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related check, so we use it accordingly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #6113
* Fix unused variable warning (Brian Behlendorf, 2017-05-05, 1 file, -3/+2)
| | | | | | | | | | | Remove the lz4_ac local variable from dmu_write_policy() to resolve the following unused variable warning on non-debug builds. dmu.c: In function ‘dmu_write_policy’: dmu.c:1892:12: warning: unused variable ‘lz4_ac’ [-Wunused-variable] boolean_t lz4_ac = spa_feature_is_active(os->os_spa, Signed-off-by: Brian Behlendorf <[email protected]>
* Add missing *_destroy/*_fini calls (Gvozden Neskovic, 2017-05-04, 11 files, -7/+61)
| | | | | | | | | The proposed debugging enhancements in zfsonlinux/spl#587 identified the following missing *_destroy/*_fini calls. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Closes #5428
* Default to zvol_request_sync=0 (Brian Behlendorf, 2017-05-04, 1 file, -1/+1)
| | | | | | | | | Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #5902
* Enable Linux read-ahead for a single page on ZVOLs (Richard Yao, 2017-05-04, 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | | Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #5902
* Disable write merging on ZVOLs (RageLtMan, 2017-05-04, 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | The current ZVOL implementation does not explicitly set merge options on ZVOL device queues, which results in the default merge behavior. Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the ZIO pipeline to do its work. Initial benchmarks (tiotest with no O_DIRECT) show random write performance going up almost 3X on 8K ZVOLs, even after significant rewrites of the logical space allocation. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: RageLtMan <rageltman@sempervictus> Issue #5902
* More ashift improvements (LOLi, 2017-05-03, 1 file, -4/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit allows higher ashift values (up to 16) in 'zpool create'. The ashift value was previously limited to 13 (8K block) in b41c990 because the limited number of uberblocks we could fit in the statically sized (128K) vdev label ring buffer could prevent the ability to safely roll back a pool to recover it. Since b02fe35 the largest uberblock size we support is 8K: this allows us to store a minimum number of 16 uberblocks in the vdev label, even with higher ashift values. Additionally change 'ashift' pool property behaviour: if set it will be used as the default hint value in subsequent vdev operations ('zpool add', 'attach' and 'replace'). A custom ashift value can still be specified from the command line, if desired. Finally, fix a bug in add-o_ashift.ksh caused by a missing variable. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #2024 Closes #4205 Closes #4740 Closes #5763
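The uberblock arithmetic behind this change can be verified with a short stand-alone sketch; the 128K ring and the 1K/8K uberblock bounds are taken from the commit message above, so treat the constants as assumptions rather than the ZFS header values.
```c
/*
 * Back-of-the-envelope check of the uberblock math: the ring is 128K, an
 * uberblock slot occupies max(1 << ashift, 1K) capped at 8K, so even at
 * ashift=16 there is room for 16 copies per label.
 */
#include <stdio.h>

#define UB_RING_SIZE   (128 * 1024)   /* uberblock ring in each label */
#define UB_MIN_SHIFT   10             /* 1K minimum uberblock slot */
#define UB_MAX_SHIFT   13             /* 8K maximum uberblock slot (b02fe35) */

static int
uberblocks_per_label(int ashift)
{
    int shift = ashift;

    if (shift < UB_MIN_SHIFT)
        shift = UB_MIN_SHIFT;
    if (shift > UB_MAX_SHIFT)
        shift = UB_MAX_SHIFT;
    return UB_RING_SIZE >> shift;
}

int
main(void)
{
    for (int ashift = 9; ashift <= 16; ashift++)
        printf("ashift=%2d -> %3d uberblocks per label\n",
            ashift, uberblocks_per_label(ashift));
    return 0;   /* ashift=16 still leaves 16 uberblocks */
}
```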
* Write label 2,3 uberblocks when vdev expands (Olaf Faaland, 2017-05-02, 2 files, -0/+68)
| | | | | | | | | | | | | | | | | | | | | | When vdev_psize increases, the location of labels 2 and 3 changes because their location is relative to the end of the device. The configs for labels 2 and 3 are written during the next spa_sync() because the vdev is added to the dirty config list. However, the uberblock rings are not re-written in their new location, leaving the device vulnerable to the beginning of the device being overwritten or damaged. This patch copies the uberblock ring from label 0 to labels 2 and 3, in their new locations, at the next sync after vdev_psize increases. Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks are copied. Reviewed-by: BearBabyLiu <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #5108
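A small sketch of the offset arithmetic involved: labels 2 and 3 are addressed from the end of the device, so growing vdev_psize moves them. The 256K label size matches the on-disk format description; the alignment ZFS applies to the end-of-device offset is omitted, so this is illustrative arithmetic only.
```c
/*
 * Sketch of why labels 2 and 3 move when a vdev grows: their offsets are
 * computed relative to the end of the device.
 */
#include <stdio.h>
#include <stdint.h>

#define VDEV_LABELS     4
#define LABEL_SIZE      (256ULL * 1024)

static uint64_t
label_offset(uint64_t psize, int l)
{
    /* Labels 0,1 at the front; labels 2,3 relative to the end. */
    return (l < VDEV_LABELS / 2) ?
        (uint64_t)l * LABEL_SIZE :
        psize - (uint64_t)(VDEV_LABELS - l) * LABEL_SIZE;
}

int
main(void)
{
    uint64_t before = 100ULL << 30;     /* 100 GiB LUN */
    uint64_t after  = 200ULL << 30;     /* grown to 200 GiB */

    for (int l = 2; l < VDEV_LABELS; l++)
        printf("label %d: %llu -> %llu\n", l,
            (unsigned long long)label_offset(before, l),
            (unsigned long long)label_offset(after, l));
    return 0;
}
```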
* Allow scaling of arc in proportion to pagecache (Debabrata Banerjee, 2017-05-02, 1 file, -2/+19)
| | | | | | | | | | | | | | | When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Closes #6035
* Correct signed operation (Debabrata Banerjee, 2017-05-02, 1 file, -2/+2)
| | | | | | | | | | | Could return the wrong pages value AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
* Don't run the reaper if we didn't shrink the cache (Debabrata Banerjee, 2017-05-02, 1 file, -6/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling it when nothing is evictable will cause extra kswapd cpu. Also if we didn't shrink it's unlikely to have memory to reap because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload: 34.95% [zfs] [k] arc_kmem_reap_now 5.40% [spl] [k] spl_kmem_cache_reap_now 3.79% [kernel] [k] _raw_spin_lock 2.86% [spl] [k] __spl_kmem_cache_generic_shrinker.isra.7 2.70% [kernel] [k] shrink_slab.part.37 1.93% [kernel] [k] isolate_lru_pages.isra.43 1.55% [kernel] [k] __wake_up_bit 1.20% [kernel] [k] super_cache_count 1.20% [kernel] [k] __radix_tree_lookup With ZFS just mounted but only ext4/pagecache memory pressure arc_kmem_reap_now still consumes excessive CPU: 12.69% [kernel] [k] isolate_lru_pages.isra.43 10.76% [kernel] [k] free_pcppages_bulk 7.98% [kernel] [k] drop_buffers 7.31% [kernel] [k] shrink_page_list 6.44% [zfs] [k] arc_kmem_reap_now 4.19% [kernel] [k] free_hot_cold_page 4.00% [kernel] [k] __slab_free 3.95% [kernel] [k] __isolate_lru_page 3.09% [kernel] [k] __radix_tree_lookup Same pagecache only workload as above with this patch series: 11.58% [kernel] [k] isolate_lru_pages.isra.43 11.20% [kernel] [k] drop_buffers 9.67% [kernel] [k] free_pcppages_bulk 8.44% [kernel] [k] shrink_page_list 4.86% [kernel] [k] __isolate_lru_page 4.43% [kernel] [k] free_hot_cold_page 4.00% [kernel] [k] __slab_free 3.44% [kernel] [k] __radix_tree_lookup (arc_kmem_reap_now has 0 samples in perf) AKAMAI: zfs: CR 3695042 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
* Only wakeup waiters if we've actually done work (Debabrata Banerjee, 2017-05-02, 1 file, -5/+5)
| | | | | | | | | AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
* Do not stop kernel shrinker on lock contention (Debabrata Banerjee, 2017-05-02, 1 file, -1/+1)
| | | | | | | | | | | | | | | | Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches. Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated. AKAMAI: zfs: CR 3593801 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
* Stop double reclaiming or not reclaiming at all (Debabrata Banerjee, 2017-05-02, 1 file, -2/+3)
| | | | | | | | | | | | | | | | Move the arcstat_need_free increment from all direct calls to the case where arc_reclaim_lock is busy and we exit without doing anything; the data will be reclaimed in the reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
* Make arc_need_free updates atomic (Debabrata Banerjee, 2017-05-02, 1 file, -6/+7)
| | | | | | | | | | | Ensures proper accounting of bytes we requested to free AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
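A minimal user-space sketch of this kind of accounting, using C11 atomics rather than the kernel/SPL primitives the ARC uses; the function names are invented for the example.
```c
/*
 * A shrinker-like producer adds bytes it wants freed; the reclaim side
 * subtracts what it actually freed, clamping at zero.  Illustrative only.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t need_free;     /* bytes requested to be freed */

static void
request_free(uint64_t bytes)
{
    atomic_fetch_add(&need_free, bytes);
}

static void
note_freed(uint64_t bytes)
{
    /* Subtract without letting the counter wrap below zero. */
    uint64_t cur = atomic_load(&need_free);
    while (cur != 0) {
        uint64_t sub = bytes < cur ? bytes : cur;
        if (atomic_compare_exchange_weak(&need_free, &cur, cur - sub))
            break;
    }
}

int
main(void)
{
    request_free(1 << 20);             /* shrinker asks for 1 MiB */
    request_free(1 << 20);             /* and asks again */
    note_freed(3 << 20);               /* reclaim frees more than requested */
    printf("outstanding: %llu\n", (unsigned long long)need_free);
    return 0;
}
```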
* Don't report ghost buffers as evictable mem (Debabrata Banerjee, 2017-05-02, 1 file, -7/+2)
| | | | | | | | | | | Ghost meta/data buffers are not actually allocated AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Issue #6035
* minor improvement to abd_free_pages() (jxiong, 2017-05-02, 1 file, -8/+6)
| | | | | | | | | | | There is no need to loop over the pages of a single scatterlist entry, because the entry holds either a single page or a compound page; in both cases the pages can be freed with one call to __free_pages(). Reviewed-by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Jinshan Xiong <[email protected]> Closes #6057
* Guarantee PAGESIZE alignment for large zio buffers (jxiong, 2017-05-02, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | In the current implementation, only zio buffers of 16KB and larger are guaranteed PAGESIZE alignment. This breaks Lustre since it assumes that 'arc_buf_t::b_data' must be page aligned when zio buffers are greater than or equal to PAGESIZE. This patch makes zio buffers PAGESIZE aligned whenever their size is not less than PAGESIZE. This change may waste a little memory, but that should be fine because after ABD is introduced, zio buffers are used to hold data temporarily and live in memory for a short while. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jinshan Xiong <[email protected]> Signed-off-by: Jinshan Xiong <[email protected]> Closes #6084
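The alignment property itself is easy to illustrate outside ZFS; the sketch below (not ZFS code) shows a page-aligned allocation of at least PAGESIZE and the check a consumer such as Lustre effectively relies on for b_data.
```c
/*
 * Not ZFS code: a user-space illustration of the alignment property the
 * commit guarantees for zio buffers of at least PAGESIZE.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t size = 8192;                         /* >= PAGESIZE */
    void *buf = aligned_alloc((size_t)pagesize, size);

    if (buf == NULL)
        return 1;
    /* What a PAGESIZE-alignment consumer effectively asserts: */
    int aligned = (((uintptr_t)buf & (uintptr_t)(pagesize - 1)) == 0);
    printf("buf=%p page-aligned=%d\n", buf, aligned);
    free(buf);
    return 0;
}
```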
* Linux 4.12 compat: super_setup_bdi_name() (Brian Behlendorf, 2017-05-02, 1 file, -5/+6)
| | | | | | | | | | All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6089
* OpenZFS 7786 - zfs`vdev_online() needs better notification about state changes (Yuri Pankov, 2017-05-01, 1 file, -6/+8)
| | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Albert Lee <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: bunder2015 <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7786 OpenZFS-commit: http://github.com/openzfs/openzfs/commit/db8498f Closes #6074
* Limit zfs_dirty_data_max_max to 4G (Brian Behlendorf, 2017-05-01, 1 file, -3/+3)
| | | | | | | | | | Reinstate default 4G zfs_dirty_data_max_max limit. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6072 Closes #6081
* Reinstate zvol_taskq to fix aio on zvol (Chunwei Chen, 2017-04-26, 1 file, -80/+165)
| | | | | | | | | | | | | | | | | | | | | | | Commit 37f9dac removed the zvol_taskq for processing zvol requests. This was removed as part of switching to make_request_fn and was motivated by a concern at the time over dispatch latency. However, this also made all bio requests synchronous, and caused serious performance issues as the bio submitter would wait for every bio it submitted, effectively making the IO depth 1. This patch reinstates zvol_taskq, and to make sure overlapped I/Os are ordered properly, we take the range lock in zvol_request and pass it along with the bio to the I/O functions zvol_{write,discard,read}. In order to facilitate benchmarks a zvol_request_sync module option was added to switch between sync and async request handling. For the moment, the default behavior is synchronous but this is likely to change pending additional testing. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5824
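The performance point here is generic: handling each request inline caps the effective queue depth at one per submitter, while handing requests to a worker pool lets them queue up. The sketch below is a stand-alone pthread illustration of that sync/async switch, not the zvol code; names like request_sync are analogues invented for the example.
```c
/* Generic sync/async dispatch illustration; the queue is a trivial ring. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QDEPTH 64
static int queue[QDEPTH], qhead, qtail, done;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcv = PTHREAD_COND_INITIALIZER;
static int request_sync = 0;            /* module-option analogue */

static void
handle_request(int id)
{
    usleep(1000);                       /* stand-in for the actual I/O */
    printf("request %d handled by %#lx\n", id, (unsigned long)pthread_self());
}

static void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qhead == qtail && !done)
            pthread_cond_wait(&qcv, &qlock);
        if (qhead == qtail && done) {
            pthread_mutex_unlock(&qlock);
            return NULL;
        }
        int id = queue[qhead++ % QDEPTH];
        pthread_mutex_unlock(&qlock);
        handle_request(id);
    }
}

static void
submit(int id)
{
    if (request_sync) {
        handle_request(id);             /* caller waits: depth stays at 1 */
        return;
    }
    pthread_mutex_lock(&qlock);         /* async: queue and return at once */
    queue[qtail++ % QDEPTH] = id;
    pthread_cond_signal(&qcv);
    pthread_mutex_unlock(&qlock);
}

int
main(void)
{
    pthread_t tid[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 32; i++)
        submit(i);
    pthread_mutex_lock(&qlock);
    done = 1;
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qlock);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```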
* OpenZFS 7252 - compressed zfs send / receive (Dan Kimmel, 2017-04-26, 5 files, -52/+84)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OpenZFS 7252 - compressed zfs send / receive OpenZFS 7628 - create long versions of ZFS send / receive options Authored by: Dan Kimmel <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: David Quigley <[email protected]> Reviewed by: Thomas Caputi <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed by: David Quigley <[email protected]> Reviewed-by: loli10K <[email protected]> Ported-by: bunder2015 <[email protected]> Ported-by: Don Brady <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - Most of 7252 was already picked up during ABD work. This commit represents the gap from the final commit to openzfs. - Fixed split_large_blocks check in do_dump() - An alternate version of the write_compressible() function was implemented for Linux which does not depend on fio. The behavior of fio differs significantly based on the exact version. - mkholes was replaced with truncate for Linux. OpenZFS-issue: https://www.illumos.org/issues/7252 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294 Closes #6067
* Change U16 to U32 due to atomic_inc_32_nv (wli5, 2017-04-25, 1 file, -2/+2)
| | | | | | | | After running for a long time with QAT compression, the variable "inst_num" is overflowed by "atomic_inc_32_nv", which causes its neighboring variable to be overwritten. Change its definition from U16 to U32. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Weigang Li <[email protected]> Closes #6051
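The failure mode is easy to reproduce in user space: a 32-bit increment applied to a counter that only owns 16 bits of storage carries into the adjacent field. The union below models that layout on a little-endian machine; it illustrates the bug class, not the QAT code itself.
```c
#include <stdint.h>
#include <stdio.h>

union two_u16 {
    uint32_t word;                 /* what a 32-bit atomic operates on */
    struct {
        uint16_t inst_num;         /* the counter that was declared U16 */
        uint16_t neighbor;         /* innocent bystander */
    } fields;
};

int
main(void)
{
    union two_u16 u = { .fields = { .inst_num = 0xFFFF, .neighbor = 7 } };

    u.word += 1;                   /* 32-bit increment, as the atomic does */

    /* inst_num wraps to 0 and the carry lands in the neighboring field. */
    printf("inst_num=%u neighbor=%u\n",
        (unsigned)u.fields.inst_num, (unsigned)u.fields.neighbor);
    return 0;
}
```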
* OpenZFS 8025 - dbuf_read() creates unnecessary zio_root() for bonus buf (Matthew Ahrens, 2017-04-24, 1 file, -5/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. The fix is to only create/destroy the zio_root() in dbuf_read() if the blkptr is not NULL and not a hole. Porting Notes: - The error handling for when dbuf_read_impl() fails which was originally added in commit 5f6d0b6f5 has been preserved. OpenZFS-issue: https://www.illumos.org/issues/8025 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8ec5c7c Closes #6048
* Fix lseek result when dnode is dirty (dbavatar, 2017-04-24, 1 file, -3/+7)
| | | | | | | | | | | Fixup commit 66aca24. We should return the same values as generic_file_llseek() and advance to the end of the file. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Tested-by: bunder2015 <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Closes #6050 Closes #6053
* Correct lock ASSERTs in vdev_label_read/write (Olaf Faaland, 2017-04-21, 1 file, -6/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing assertions in vdev_label_read() and vdev_label_write(), testing which config locks are held, are incorrect. The assertions test for locks which exceed what is required for safety. Both vdev_label_{read,write}() are changed to assert SCL_STATE is held as RW_READER or RW_WRITER. This is safe because: Changes to the vdev tree occur under SCL_ALL as RW_WRITER, via spa_vdev_enter() and spa_vdev_exit(). Changes to vdev state occur under SCL_STATE_ALL as RW_WRITER, via spa_vdev_state_enter() and spa_vdev_state_exit(). Therefore, the new assertions guarantee that the vdev cannot change out from under a zio, and I/O to a specified leaf vdev's label is safe. Furthermore, this is consistent with the SPA locking discussion in spa_misc.c, "For any zio operation that takes an explicit vdev_t argument ... zio_read_phys(), or zio_write_phys() ... SCL_STATE as reader suffices." Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #5983
* Increase zfs_vdev_async_write_min_active to 2 (DHE, 2017-04-14, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | Resilver operations frequently cause only a small amount of dirty data to be written to disk at a time, resulting in the IO scheduler issuing only one write at a time to the resilvering disk. When it is rotational media the drive will often travel past the next sector to be written before receiving a write command from ZFS, significantly delaying the write of the next sector. Raise zfs_vdev_async_write_min_active so that drives are kept fed during resilvering. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Issue #4825 Closes #5926
* OpenZFS 8061 - sa_find_idx_tab can be declared more type-safely (Matthew Ahrens, 2017-04-14, 1 file, -6/+5)
| | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Chris Williamson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> sa_find_idx_tab() is declared as taking and returning "void *" parameters. These can be declared to be the specific types. OpenZFS-issue: https://www.illumos.org/issues/8061 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4e64aff Closes #6017
* OpenZFS 6101 - attempt to lzc_create() a filesystem under a volume results in a panic (Andriy Gapon, 2017-04-14, 2 files, -1/+6)
| | | | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> When querying ZPL properties verify that the objset is of type DMU_OST_ZFS. OpenZFS-issue: https://www.illumos.org/issues/6101 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce2243a Closes #6015
* OpenZFS 8026 - retire zfs_throttle_delay and zfs_throttle_resolution (Andriy Gapon, 2017-04-14, 1 file, -3/+0)
| | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Approved by: Richard Lowe <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8026 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9b33e07 Closes #6014
* SEEK_HOLE should not block on txg_wait_synced() (Debabrata Banerjee, 2017-04-13, 2 files, -7/+46)
| | | | | | | | | | | | | | | Forced flushing of txgs can be painfully slow when competing for disk IO, since this is a process meant to execute asynchronously. Optimize this path by allowing data/hole seeking if the file is clean; if it is dirty, fall back to the old logic. This is a compromise to disabling the feature entirely. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Closes #4306 Closes #5962
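For reference, the user-visible interface on this path is plain lseek(2) with SEEK_HOLE/SEEK_DATA; the example below walks a file's data regions using that standard Linux API and is independent of the ZFS-internal change.
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    off_t end = lseek(fd, 0, SEEK_END);
    off_t data = lseek(fd, 0, SEEK_DATA);
    while (data >= 0 && data < end) {
        off_t hole = lseek(fd, data, SEEK_HOLE);   /* end of this data run */
        printf("data: %lld..%lld\n", (long long)data, (long long)hole);
        data = lseek(fd, hole, SEEK_DATA);         /* next data run, if any */
    }
    close(fd);
    return 0;
}
```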
* OpenZFS 5120 - zfs should allow large block/gzip/raidz boot pool (loader project) (Brian Behlendorf, 2017-04-13, 3 files, -34/+10)
| | | | | | | | | | | | | | | | | | | | | | Authored by: Toomas Soome <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Don Brady <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - grub-2.02-beta2-422-gcad5cc0 includes support for large blocks. - Commit 8aab121 allowed GZIP[1-9]. - Grub allows pools with multiple top-level vdevs. OpenZFS-issue: https://www.illumos.org/issues/5120 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd Closes #6007
* OpenZFS 7503 - zfs-test should tail ::zfs_dbgmsg on test failure (Brian Behlendorf, 2017-04-12, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | Authored by: Pavel Zakharov <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Approved by: Gordon Ross <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - Enable internal log for DEBUG builds and in zfs-tests.sh. - callbacks/zfs_dbgmsg.ksh - Dump internal log via kstat. - callbacks/zfs_dmesg.ksh - Dump dmesg log. - default.cfg - 'Test Suite Specific Commands' dropped. OpenZFS-issue: https://www.illumos.org/issues/7503 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/55a1300 Closes #6002
* Skip rate limiting events in zfs_ereport_post (Giuseppe Di Natale, 2017-04-11, 1 file, -3/+3)
| | | | | | | | In zfs_ereport_post, if an event is a rate limiting event, immediately return before any processing is done. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #5998
* Fix size inflation in spa_get_worst_case_asize() (LOLi, 2017-04-10, 1 file, -2/+10)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we try assign a new transaction to a TXG we must know beforehand if there is sufficient free space on disk. This is to decide, in dmu_tx_assign(), if we should reject the TX with ENOSPC. We rely on spa_get_worst_case_asize() to inflate the size of our logical writes by a factor of spa_asize_inflation which is calculated as: (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2 == 24 The problem with the current implementation is that we don't take into account what happens with very small writes on VDEVs with large physical block sizes. Consider the case of writes to a dataset with recordsize=512, copies=3 on a VDEV with ashift=13 (usually SSD with 8K block size): every logical IO will end up allocating 3 * 8K = 24K on disk, so 512 bytes multiplied by 48, which is double the size we account for. If we allow this kind of writes to be assigned a TX it is possible, when the pool is almost full, to trigger an allocation failure (ENOSPC) in the ZIO pipeline, which will in turn result in the whole pool being suspended. The bug is fixed by using, in spa_get_worst_case_asize(), the MAX() value chosen between the logical io size from zfs_write() and the maximum physical block size used among our VDEVs. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #5941
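The inflation arithmetic from the message above, worked as a small program: the old estimate scales only the logical size by the inflation factor, while the fix bases the estimate on the larger of the logical size and the biggest VDEV block. The constants and the 512-byte/ashift=13 example come from the commit message; the function names are illustrative.
```c
#include <stdint.h>
#include <stdio.h>

#define VDEV_RAIDZ_MAXPARITY 3
#define SPA_DVAS_PER_BP      3
#define ASIZE_INFLATION      ((VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2)

static uint64_t
worst_case_asize_old(uint64_t lsize)
{
    return lsize * ASIZE_INFLATION;
}

static uint64_t
worst_case_asize_new(uint64_t lsize, int max_ashift)
{
    uint64_t min_block = 1ULL << max_ashift;   /* largest vdev block */
    uint64_t base = lsize > min_block ? lsize : min_block;
    return base * ASIZE_INFLATION;
}

int
main(void)
{
    uint64_t lsize = 512;       /* recordsize=512 write */
    int ashift = 13;            /* 8K-sector SSD vdev */

    /* copies=3 really allocates 3 * 8K = 24K for this 512-byte write. */
    printf("actual allocation: %u bytes\n", 3 * (1u << ashift));
    printf("old estimate: %llu bytes\n",
        (unsigned long long)worst_case_asize_old(lsize));
    printf("new estimate: %llu bytes\n",
        (unsigned long long)worst_case_asize_new(lsize, ashift));
    return 0;
}
```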
* OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations (Matthew Ahrens, 2017-04-10, 1 file, -5/+14)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Matt Ahrens <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Don Brady <[email protected]> Ported-by: Matt Ahrens <[email protected]> RAID-Z requires that space be allocated in multiples of P+1 sectors, because this is the minimum size block that can have the required amount of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2 sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector is a unit of 2^ashift bytes, typically 512B or 4KB. To satisfy this constraint, the allocation size is rounded up to the proper multiple, resulting in up to 3 "pad sectors" at the end of some blocks. The contents of these pad sectors are not used, so we do not need to read or write these sectors. However, some storage hardware performs much worse (around 1/2 as fast) on mostly-contiguous writes when there are small gaps of non-overwritten data between the writes. Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that include pad sectors. If writing a pad sector will fill the gap between two (required) writes, we will issue the optional zio, thus doubling performance. The gap-filling performance improvement was introduced in July 2009. Writing the optional zio is done by the io aggregation code in vdev_queue.c. The problem is that it is also subject to the limit on the size of aggregate writes, zfs_vdev_aggregation_limit, which is by default 128KB. For a given block, if the amount of data plus padding written to a leaf device exceeds zfs_vdev_aggregation_limit, the optional zio will not be written, resulting in a ~2x performance degradation. The problem occurs only for certain values of ashift, compressed block size, and RAID-Z configuration (number of parity and data disks). It cannot occur with the default recordsize=128KB. If compression is enabled, all configurations with recordsize=1MB or larger will be impacted to some degree. The problem notably occurs with recordsize=1MB, compression=off, with 10 disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem". The problem also occurs with the following configurations: With recordsize=512KB or 256KB, compression=off, the problem occurs only in rarely-used configurations: * 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors) * 4-wide RAIDZ2 (either recordsize, either ashift) * 5-wide RAIDZ2 with recordsize=512KB (either ashift) * 6-wide RAIDZ2 with recordsize=512KB (either ashift) With recordsize=1MB, compression=off, ashift=9 (512B sectors) * RAIDZ1 with 4 or 8 disks * RAIDZ2 with 4, 8, or 10 disks * RAIDZ3 with 6, 8, 9, or 10 disks With recordsize=1MB, compression=off, ashift=12 (4KB sectors) * RAIDZ1 with 7 or 8 disks * RAIDZ2 with 4, 5, or 10 disks * RAIDZ3 with 6, 9, or 10 disks With recordsize=2MB and larger (which can only be selected by changing kernel tunables), many configurations are affected, including with higher numbers of disks (up to 18 disks with recordsize=2MB). Increase zfs_vdev_aggregation_limit to allow the optional zio to be aggregated, thus eliminating the problem. Setting it to 256KB fixes all commonly-used configurations. 
The solution is to aggregate optional zio's regardless of the aggregation size limit. Analysis sponsored by Intel Corp. OpenZFS-issue: https://www.illumos.org/issues/8005 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321 Closes #5931
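The pad-sector behaviour described above follows from simple rounding: a RAID-Z allocation is rounded up to a multiple of (nparity + 1) sectors. The sketch below works that arithmetic for a couple of configurations; it is illustrative only and not the vdev_raidz implementation.
```c
#include <stdint.h>
#include <stdio.h>

static void
raidz_asize(uint64_t psize, int ashift, int nparity, int ndisks)
{
    uint64_t sectors = (psize + (1ULL << ashift) - 1) >> ashift;  /* data */
    uint64_t ndata = (uint64_t)(ndisks - nparity);
    /* nparity parity sectors per row of data sectors */
    uint64_t parity = (sectors + ndata - 1) / ndata * (uint64_t)nparity;
    uint64_t total = sectors + parity;
    uint64_t mult = (uint64_t)nparity + 1;
    uint64_t rounded = (total + mult - 1) / mult * mult;

    printf("psize=%lluK ashift=%d raidz%d width=%d: %llu sectors + %llu pad\n",
        (unsigned long long)(psize >> 10), ashift, nparity, ndisks,
        (unsigned long long)total, (unsigned long long)(rounded - total));
}

int
main(void)
{
    raidz_asize(1ULL << 20, 12, 2, 10);   /* the "1MB 10-wide RAIDZ2" case */
    raidz_asize(128ULL << 10, 12, 2, 10); /* default 128K recordsize */
    return 0;
}
```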