path: root/module/zfs
Commit log (each entry: commit message, author, date, files changed, -deleted/+added lines)
* Fix zvol_init error handling (Richard Yao, 2017-06-13, 1 file, -0/+1)
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
* Make zvol operations use _by_dnode routines (Richard Yao, 2017-06-13, 2 files, -14/+12)
  This continues what was started in 0eef1bde31d67091d3deed23fe2394f5a8bf2276 by fully converting zvols to avoid unnecessary dnode_hold() calls. This saves a small amount of CPU time and slightly improves latencies of operations on zvols.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #6058
* Reduce stack usage of dsl_dir_tempreserve_impl (DeHackEd, 2017-06-12, 1 file, -6/+19)
  Buildbots and zfs-tests regularly see 7 kilobytes of stack usage with this function. Convert the recursive calls to iteration.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: DHE <[email protected]>
  Closes #6219
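  The general shape of such a conversion, as a minimal standalone C sketch (hypothetical types and names, not the actual dsl_dir code): walking the parent chain in a loop uses one stack frame regardless of depth, where the recursive form uses one frame per ancestor.

    #include <stddef.h>

    struct dir {
        struct dir *parent;
        long reserved;
    };

    /* Recursive form: one stack frame per ancestor directory. */
    static long
    reserve_recursive(struct dir *dd, long delta)
    {
        if (dd == NULL)
            return (0);
        dd->reserved += delta;
        return (delta + reserve_recursive(dd->parent, delta));
    }

    /* Iterative form: constant stack usage regardless of depth. */
    static long
    reserve_iterative(struct dir *dd, long delta)
    {
        long total = 0;

        for (; dd != NULL; dd = dd->parent) {
            dd->reserved += delta;
            total += delta;
        }
        return (total);
    }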
* OpenZFS 8056 - zfs send size estimate is inaccurate for some zvols (Paul Dagnelie, 2017-06-09, 1 file, -2/+13)
  Authored by: Paul Dagnelie <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Pavel Zakharov <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Kash Pande <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Ported-by: Giuseppe Di Natale <[email protected]>
  The send size estimate for a zvol can be too low if the size of the record headers (dmu_replay_record_t's) is a significant portion of the size. This is typically the case when the data is highly compressible, especially with embedded blocks. The problem is that dmu_adjust_send_estimate_for_indirects() assumes that blocks are the size of the "recordsize" property (128KB). However, for zvols, the blocks are the size of the "volblocksize" property (8KB). Therefore, we estimate that there will be 16x fewer record headers than there really will be. The fix is to check the type of the object set (whether it is a zvol or not) and pick the appropriate property. In addition, while we are at it, we also add the size of the BEGIN and END records to the estimate.
  OpenZFS-issue: https://www.illumos.org/issues/8056
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faf09cd
  Closes #6205
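  The arithmetic, as a hedged standalone C sketch (names and the header size are illustrative, not the dmu_send code): the overhead has to be computed from the block size the dataset actually uses, volblocksize for zvols rather than recordsize.

    #include <stdint.h>

    #define RECORD_HEADER_SIZE 312  /* illustrative size of one stream record header */

    /*
     * Estimate stream header overhead: one WRITE record per data block,
     * plus the BEGIN and END records.
     */
    static uint64_t
    estimate_header_overhead(uint64_t data_size, int is_zvol,
        uint64_t recordsize, uint64_t volblocksize)
    {
        uint64_t blksz = is_zvol ? volblocksize : recordsize;
        uint64_t nrecords = (data_size + blksz - 1) / blksz;

        /* Using 128K recordsize for an 8K-block zvol undercounts headers 16x. */
        return ((nrecords + 2) * RECORD_HEADER_SIZE);
    }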
* OpenZFS 8156 - dbuf_evict_notify() does not need dbuf_evict_lock (Matthew Ahrens, 2017-06-09, 1 file, -11/+7)
  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Paul Dagnelie <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: Giuseppe Di Natale <[email protected]>
  dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do the eviction itself (because the evict thread is not able to keep up). This can result in massive lock contention. It isn't necessary to hold the lock, because if we make the wrong choice occasionally, nothing bad will happen. This commit results in a ~60% performance improvement for ARC-cached sequential reads.
  OpenZFS-issue: https://www.illumos.org/issues/8156
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f73e5d9
  Closes #6204
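  The pattern, as a minimal standalone sketch (hypothetical names, not the arc/dbuf code): read the shared size without the lock and only take the lock for the actual work; a momentarily stale read at worst causes one extra or one delayed eviction pass, which is harmless.

    #include <stdint.h>
    #include <pthread.h>

    static pthread_mutex_t evict_lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile uint64_t cache_size;          /* updated elsewhere */
    static uint64_t cache_limit = 64 << 20;

    static void
    evict_one_pass(void)
    {
        cache_size = cache_limit;  /* placeholder for real eviction work */
    }

    static void
    evict_notify(void)
    {
        /*
         * Unlocked check: if we race and read a slightly stale value we
         * may do or skip one extra pass, but nothing breaks, and we stop
         * contending on evict_lock for every cache insert.
         */
        if (cache_size <= cache_limit)
            return;

        pthread_mutex_lock(&evict_lock);
        if (cache_size > cache_limit)  /* recheck under the lock */
            evict_one_pass();
        pthread_mutex_unlock(&evict_lock);
    }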
* OpenZFS 8199 - multi-threaded dmu_object_alloc() (Matthew Ahrens, 2017-06-09, 3 files, -70/+130)
  dmu_object_alloc() is single-threaded, so when multiple threads are creating files in a single filesystem, they spend a lot of time waiting for the os_obj_lock. To improve performance of multi-threaded file creation, we must make dmu_object_alloc() typically not grab any filesystem-wide locks.
  The solution is to have a "next object to allocate" for each CPU. Each of these "next object"s is in a different block of the dnode object, so that concurrent allocation holds dnodes in different dbufs. When a thread's "next object" reaches the end of a chunk of objects (by default 4 blocks worth -- 128 dnodes), it will be reset to the per-objset os_obj_next, which will be increased by a chunk of objects (128). Only when manipulating the os_obj_next will we need to grab the os_obj_lock. This decreases lock contention dramatically, because each thread only needs to grab the os_obj_lock briefly, once per 128 allocations.
  This results in a 70% performance improvement to multi-threaded object creation (where each thread is creating objects in its own directory), from 67,000/sec to 115,000/sec, with 8 CPUs.
  Work sponsored by Intel Corp.
  Authored by: Matthew Ahrens <[email protected]>
  Reviewed-by: Ned Bass <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/8199
  OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
  Closes #4703
  Closes #6117
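  A standalone pthreads sketch of the allocation scheme (hypothetical names; the real code keys the cursor off the CPU rather than the thread): each allocator refills a private cursor from the shared counter in chunks of 128, so the global lock is taken once per chunk instead of once per object.

    #include <stdint.h>
    #include <pthread.h>

    #define OBJ_CHUNK 128  /* objects handed out per refill, as in the commit */

    static pthread_mutex_t os_obj_lock = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t os_obj_next;                 /* shared "next chunk" cursor */

    static __thread uint64_t my_obj_next;        /* per-thread cursor */
    static __thread uint64_t my_obj_end;         /* end of this thread's chunk */

    static uint64_t
    object_alloc(void)
    {
        if (my_obj_next == my_obj_end) {
            /* Chunk exhausted: grab a fresh one under the global lock. */
            pthread_mutex_lock(&os_obj_lock);
            my_obj_next = os_obj_next;
            os_obj_next += OBJ_CHUNK;
            pthread_mutex_unlock(&os_obj_lock);
            my_obj_end = my_obj_next + OBJ_CHUNK;
        }
        return (my_obj_next++);
    }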
* OpenZFS 7578 - Fix/improve some aspects of ZIL writing (Giuseppe Di Natale, 2017-06-09, 4 files, -114/+107)
  - After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not being updated when async itx'es were upgraded to sync. Because of other changes around that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables.
  - The original idea of zil_slog_limit was to reduce the chance of SLOG abuse by a single heavy logger, which increased latency for other (more latency-critical) loggers, by pushing heavy log out into the main pool instead of the SLOG. Besides a huge latency increase for heavy writers, this implementation caused a double write of all data, since the log records were explicitly prepared for the SLOG. Since we now have an I/O scheduler, I've found it can be much more efficient to reduce the priority of heavy-logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leaving them on the SLOG.
  - The existing ZIL implementation had a problem with space efficiency when it had to write large chunks of data into log blocks of limited size. In some cases efficiency dropped to almost as low as 50%. With a ZIL stored on spinning rust, that also cut log write speed in half, since the head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z*_log_write() to zil_lwb_commit(), which knows the real state of log block allocation and can split large requests into pieces much more efficiently. As a side effect it also removes one of the two data copy operations done by the ZIL code in the WR_COPIED case.
  - While there, untangle and unify the code of the z*_log_write() functions. Also, zfs_log_write(), like zvol_log_write(), can now handle writes crossing a block boundary, which may also improve efficiency if the ZPL is made to do that.
  Sponsored by: iXsystems, Inc.
  Authored by: Alexander Motin <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Andriy Gapon <[email protected]>
  Reviewed by: Steven Hartland <[email protected]>
  Reviewed by: Brad Lewis <[email protected]>
  Reviewed by: Richard Elling <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Ported-by: Giuseppe Di Natale <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/7578
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
  Closes #6191
* OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffers (Matthew Ahrens, 2017-06-07, 4 files, -23/+15)
  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: Giuseppe Di Natale <[email protected]>
  When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t.
  OpenZFS-issue: https://www.illumos.org/issues/8155
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58
  Closes #6200
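  The change reduces to a small override at write time; a hedged C sketch with hypothetical types (not the arc.c structures):

    /* Sketch: trust the buffer's actual compression instead of the policy's guess. */
    enum compress_algo { COMPRESS_OFF, COMPRESS_LZ4 };

    struct write_policy {
        enum compress_algo compress;   /* what the policy requested */
    };

    struct data_buf {
        enum compress_algo compress;   /* how the buffer is actually compressed */
        /* ... data ... */
    };

    static void
    prepare_write(struct write_policy *wp, const struct data_buf *buf)
    {
        /* Override whatever the policy requested with what the buffer is. */
        if (buf->compress != COMPRESS_OFF)
            wp->compress = buf->compress;
    }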
* Linux 4.9 compat: fix zfs_ctldir xattr handling (LOLi, 2017-06-05, 1 file, -0/+3)
  Since torvalds/linux@d0a5b99 IOP_XATTR is used to indicate the inode has xattr support: clear it for the ctldir inodes to avoid EIO errors.
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #6189
* Fix "snapdev" property issuesLOLi2017-06-022-62/+71
| | | | | | | | | | | | | | | | | | | | When inheriting the "snapdev" property to we don't always call zfs_prop_set_special(): this prevents device nodes from being created in certain situations. Because "snapdev" is the only *special* property that is also inheritable we need to call zfs_prop_set_special() even when we're not reverting it to the received value ('zfs inherit -S'). Additionally, fix a NULL pointer dereference accidentally introduced in 5559ba0 that can be triggered when setting the "snapdev" property to the value "hidden" twice. Finally, add a new test case "zvol_misc_snapdev" to the ZFS Test Suite. Reviewed by: Boris Protopopov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6131 Closes #6175 Closes #6176
* Fix import wrong spare/l2 device when path change (Chunwei Chen, 2017-06-01, 1 file, -6/+0)
  If, for example, your aux device was /dev/sdc, but it has since been removed and /dev/sdc now points to some other device, zpool import will still use that device and corrupt it. The problem is that spa_validate_aux() in spa_import(), rather than validating the on-disk label, would actually write a label to disk. We remove these calls since spa_load_{spares,l2cache} seems to do everything we need and actually validates the on-disk label.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #6158
* Fix memory leak in zvol_set_volsize() (LOLi, 2017-05-31, 1 file, -1/+2)
  Move kmem_free() so it's called for every error path: this is preferred over making `dmu_object_info_t doi` local to accommodate older kernels with limited stacks.
  Reviewed by: Boris Protopopov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #6177
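  The underlying pattern is the classic single-exit cleanup: allocate once and route every error through a label that frees the allocation, so no early return can leak it; a generic standalone C sketch (hypothetical names, not the zvol code):

    #include <stdlib.h>
    #include <errno.h>

    struct object_info { long blocksize; };

    static int
    get_info(struct object_info *oi)
    {
        oi->blocksize = 8192;  /* stand-in for a lookup that may fail */
        return (0);
    }

    static int
    set_volsize(long newsize)
    {
        struct object_info *oi = malloc(sizeof (*oi));
        int error;

        if (oi == NULL)
            return (ENOMEM);

        error = get_info(oi);
        if (error != 0)
            goto out;               /* a bare return here would leak oi */

        error = (newsize % oi->blocksize != 0) ? EINVAL : 0;
    out:
        free(oi);                   /* one free covers every path */
        return (error);
    }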
* Fix ida leak in zvol_create_minor_impl (Boris Protopopov, 2017-05-26, 1 file, -0/+7)
  Added missing ida_simple_remove() in the error handling path.
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Boris Protopopov <[email protected]>
  Closes #6159
  Closes #6172
* Don't dirty bpobj if it has no entries (Alek P, 2017-05-26, 1 file, -0/+4)
  In certain cases (dsl_scan_sync() is one), we may end up calling bpobj_iterate() on an empty bpobj. Even though we don't end up modifying the bpobj it still gets dirtied, causing unneeded writes to the pool. This patch adds an early bail from bpobj_iterate_impl() if the bpobj is empty, to prevent unneeded writes.
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alek Pinchuk <[email protected]>
  Closes #6164
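  A minimal sketch of the guard (hypothetical types, not the bpobj code): check for emptiness before doing anything that would mark the object dirty.

    #include <stddef.h>

    struct block_obj {
        size_t num_entries;
        int    dirty;
    };

    static int
    block_obj_iterate(struct block_obj *bo)
    {
        if (bo->num_entries == 0)
            return (0);      /* nothing to visit, so don't dirty anything */

        bo->dirty = 1;       /* stand-in for the writes iteration would cause */
        /* ... walk the entries ... */
        return (0);
    }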
* Revert "Fix "snapdev" property inheritance behaviour"Brian Behlendorf2017-05-262-67/+57
| | | | | | | | | | | This reverts commit 959f56b99366c8727647b5b19fb3d47555c96cf3. An issue was uncovered by the new zvol_misc_snapdev test case which needs to be investigated and resolved. Reviewed-by: loli10K <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6174 Issue #6131
* OpenZFS 8070 - Add some ZFS comments (Alan Somers, 2017-05-25, 2 files, -0/+7)
  Authored by: Alan Somers <[email protected]>
  Reviewed by: Yuri Pankov <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: bunder2015 <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/8070
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40713f2
  Closes #6160
* Fix "snapdev" property inheritance behaviourLOLi2017-05-252-57/+67
| | | | | | | | | | | When inheriting the "snapdev" property to we don't always call zfs_prop_set_special(): this prevents device nodes from being created in certain situations. Because "snapdev" is the only *special* property that is also inheritable we need to call zfs_prop_set_special() even when we're not reverting it to the received value ('zfs inherit -S'). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6131
* Linux 4.12 compat: fix super_setup_bdi_name() call (LOLi, 2017-05-25, 1 file, -2/+1)
  Provide a format parameter to super_setup_bdi_name() so we don't create duplicate names in the '/devices/virtual/bdi' sysfs namespace, which would prevent us from mounting more than one ZFS filesystem at a time.
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #6147
* Fix LZ4_uncompress_unknownOutputSize caused panic (Feng Sun, 2017-05-19, 1 file, -8/+19)
  Sync with kernel patches for lz4:
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/lib/lz4
  4a3a99 lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
  d5e7ca LZ4 : fix the data abort issue
  bea2b5 lib/lz4: Pull out constant tables
  99b7e9 lz4: fix system halt at boot kernel on x86_64
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Feng Sun <[email protected]>
  Closes #5975
  Closes #5973
* Implemented zpool sync command (Alek P, 2017-05-19, 1 file, -0/+44)
  This addition will enable us to sync an open TXG to the main pool on demand. The functionality is similar to 'sync(2)' but 'zpool sync' will return when data has hit the main storage instead of potentially just the ZIL as is the case with the 'sync(2)' cmd.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alek Pinchuk <[email protected]>
  Closes #6122
* Force fault a vdev with 'zpool offline -f' (Tony Hutter, 2017-05-19, 4 files, -12/+76)
  This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tony Hutter <[email protected]>
  Closes #6094
* Fixed small memory leak in ereport handling (Tom Caputi, 2017-05-18, 1 file, -6/+6)
  One pre-check in zfs_ereport_start() was being called after the nvlists were being allocated. This simply corrects that issue.
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tom Caputi <[email protected]>
  Closes #6140
* Introduce zv_state_lock (Boris Protopopov, 2017-05-16, 1 file, -71/+124)
  The lock is designed to protect the internal state of zvol_state_t and to avoid taking spa_namespace_lock (e.g. in the dmu_objset_own() code path) while holding zvol_state_lock. Refactor the code accordingly.
  Signed-off-by: Boris Protopopov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3484
  Closes #6065
  Closes #6134
* Revert commit 1ee159f4 (Boris Protopopov, 2017-05-16, 1 file, -2/+29)
  Fix a lock order inversion with zvol_open(), which did not account for the use of zvols as vdevs. That use case resulted in lock order inversion deadlocks involving spa_namespace_lock and bdev->bd_mutex.
  Signed-off-by: Boris Protopopov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #6065
  Issue #6134
* Skip spurious resilver IO on raidz vdev (Isaac Huang, 2017-05-12, 8 files, -33/+122)
  On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, the current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data. This patch eliminates such spurious IOs.
  Reviewed-by: Gvozden Neskovic <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Signed-off-by: Isaac Huang <[email protected]>
  Closes #5316
* OpenZFS 8063 - verify that we do not attempt to access inactive txg (Matthew Ahrens, 2017-05-10, 6 files, -23/+38)
  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: Serapheim Dimitropoulos <[email protected]>
  Reviewed by: Pavel Zakharov <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: George Melikov <[email protected]>
  A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's.
  Porting Notes:
  - ASSERTV added to txg_sync_waiting() for unused variable.
  OpenZFS-issue: https://www.illumos.org/issues/8063
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46
  Closes #6109
* OpenZFS 8166 - zpool scrub thinks it repaired offline device (Matthew Ahrens, 2017-05-10, 1 file, -0/+3)
  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed-by: loli10K <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: Matthew Ahrens <[email protected]>
  If we do a scrub while a leaf device is offline (via "zpool offline"), we will inadvertently clear the DTL (dirty time log) of the offline device, even though it is still damaged. When the device comes back online, we will incompletely resilver it, thinking that the scrub repaired blocks written before the scrub was started. The incomplete resilver can lead to data loss if there is a subsequent failure of a different leaf device. The fix is to never clear the DTL of offline devices. Note that if a device is onlined while a scrub is in progress, the scrub will be restarted. The problem can be worked around by running "zpool scrub" after "zpool online".
  OpenZFS-issue: https://www.illumos.org/issues/8166
  OpenZFS-commit: https://github.com/openzfs/openzfs/pull/372
  Closes #5806
  Closes #6103
* Add missing arc_free_cksum() to arc_release() (Tom Caputi, 2017-05-10, 1 file, -0/+4)
  The arc layer tracks checksums of its data in the arc header so that it can ensure that buffers haven't changed when they're not supposed to. This checksum is only maintained while there is an uncompressed buffer still attached to the header. Unfortunately there is a missing call to arc_free_cksum() in arc_release() that can trigger ASSERTs. This has not been a common issue because the checksums are only maintained for debug builds and triggering the bug requires writing a block (and therefore calling arc_release()) while a compressed buffer is still being used on a debug build. This simply corrects the issue.
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tom Caputi <[email protected]>
  Closes #6105
* Linux 4.12 compat: CURRENT_TIME removed (Brian Behlendorf, 2017-05-10, 4 files, -11/+14)
  Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12.
  Reviewed-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #6114
* Add property overriding (-o|-x) to 'zfs receive' (LOLi, 2017-05-09, 1 file, -31/+165)
  This allows users to specify "-o property=value" to override and "-x property" to exclude properties when receiving a zfs send stream. Both native and user properties can be specified. This is useful when using zfs send/receive for periodic backup/replication because it lets users change properties such as canmount, mountpoint, or compression without modifying the source.
  References:
  https://www.illumos.org/issues/2745
  https://www.illumos.org/issues/3753
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed-by: Alek Pinchuk <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #1350
  Closes #5349
* Linux 4.12 compat: PF_FSTRANS was removed (Chunwei Chen, 2017-05-09, 1 file, -1/+1)
  zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related check, so we use them accordingly.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #6113
* Fix unused variable warning (Brian Behlendorf, 2017-05-05, 1 file, -3/+2)
  Remove the lz4_ac local variable from dmu_write_policy() to resolve the following unused variable warning on non-debug builds.
    dmu.c: In function ‘dmu_write_policy’:
    dmu.c:1892:12: warning: unused variable ‘lz4_ac’ [-Wunused-variable]
      boolean_t lz4_ac = spa_feature_is_active(os->os_spa,
  Signed-off-by: Brian Behlendorf <[email protected]>
* Add missing *_destroy/*_fini calls (Gvozden Neskovic, 2017-05-04, 7 files, -3/+11)
  The proposed debugging enhancements in zfsonlinux/spl#587 identified the following missing *_destroy/*_fini calls.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Gvozden Neskovic <[email protected]>
  Closes #5428
* Default to zvol_request_async=0 (Brian Behlendorf, 2017-05-04, 1 file, -1/+1)
  Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release.
  Reviewed-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #5902
* Enable Linux read-ahead for a single page on ZVOLs (Richard Yao, 2017-05-04, 1 file, -0/+3)
  Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS.
  Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads.
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Issue #5902
* Disable write merging on ZVOLs (RageLtMan, 2017-05-04, 1 file, -0/+3)
  The current ZVOL implementation does not explicitly set merge options on ZVOL device queues, which results in the default merge behavior. Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues, allowing the ZIO pipeline to do its work.
  Initial benchmarks (tiotest with no O_DIRECT) show random write performance going up almost 3X on 8K ZVOLs, even after significant rewrites of the logical space allocation.
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: RageLtMan <rageltman@sempervictus>
  Issue #5902
* Write label 2,3 uberblocks when vdev expands (Olaf Faaland, 2017-05-02, 2 files, -0/+68)
  When vdev_psize increases, the location of labels 2 and 3 changes because their location is relative to the end of the device. The configs for labels 2 and 3 are written during the next spa_sync() because the vdev is added to the dirty config list. However, the uberblock rings are not re-written in their new location, leaving the device vulnerable to the beginning of the device being overwritten or damaged.
  This patch copies the uberblock ring from label 0 to labels 2 and 3, in their new locations, at the next sync after vdev_psize increases.
  Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks are copied.
  Reviewed-by: BearBabyLiu <[email protected]>
  Reviewed-by: Andreas Dilger <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Olaf Faaland <[email protected]>
  Closes #5108
* Allow scaling of arc in proportion to pagecache (Debabrata Banerjee, 2017-05-02, 1 file, -2/+19)
  When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain a proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure.
  AKAMAI: zfs: CR 3695072
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Closes #6035
* Correct signed operation (Debabrata Banerjee, 2017-05-02, 1 file, -2/+2)
  Could return the wrong pages value.
  AKAMAI: zfs: CR 3695072
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
* Don't run the reaper if we didn't shrink the cache (Debabrata Banerjee, 2017-05-02, 1 file, -6/+5)
  Calling it when nothing is evictable will cause extra kswapd cpu. Also, if we didn't shrink, it's unlikely there is memory to reap because we likely just called it microseconds ago. The exception is if we are in direct reclaim.
  You can see how hard this is being hit in kswapd with a light test workload:
    34.95%  [zfs]     [k] arc_kmem_reap_now
     5.40%  [spl]     [k] spl_kmem_cache_reap_now
     3.79%  [kernel]  [k] _raw_spin_lock
     2.86%  [spl]     [k] __spl_kmem_cache_generic_shrinker.isra.7
     2.70%  [kernel]  [k] shrink_slab.part.37
     1.93%  [kernel]  [k] isolate_lru_pages.isra.43
     1.55%  [kernel]  [k] __wake_up_bit
     1.20%  [kernel]  [k] super_cache_count
     1.20%  [kernel]  [k] __radix_tree_lookup
  With ZFS just mounted but only ext4/pagecache memory pressure, arc_kmem_reap_now still consumes excessive CPU:
    12.69%  [kernel]  [k] isolate_lru_pages.isra.43
    10.76%  [kernel]  [k] free_pcppages_bulk
     7.98%  [kernel]  [k] drop_buffers
     7.31%  [kernel]  [k] shrink_page_list
     6.44%  [zfs]     [k] arc_kmem_reap_now
     4.19%  [kernel]  [k] free_hot_cold_page
     4.00%  [kernel]  [k] __slab_free
     3.95%  [kernel]  [k] __isolate_lru_page
     3.09%  [kernel]  [k] __radix_tree_lookup
  Same pagecache-only workload as above with this patch series:
    11.58%  [kernel]  [k] isolate_lru_pages.isra.43
    11.20%  [kernel]  [k] drop_buffers
     9.67%  [kernel]  [k] free_pcppages_bulk
     8.44%  [kernel]  [k] shrink_page_list
     4.86%  [kernel]  [k] __isolate_lru_page
     4.43%  [kernel]  [k] free_hot_cold_page
     4.00%  [kernel]  [k] __slab_free
     3.44%  [kernel]  [k] __radix_tree_lookup
  (arc_kmem_reap_now has 0 samples in perf)
  AKAMAI: zfs: CR 3695042
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
* Only wakeup waiters if we've actually done work (Debabrata Banerjee, 2017-05-02, 1 file, -5/+5)
  AKAMAI: zfs: CR 3695072
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
* Do not stop kernel shrinker on lock contention (Debabrata Banerjee, 2017-05-02, 1 file, -1/+1)
  Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches.
  Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated.
  AKAMAI: zfs: CR 3593801
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
* Stop double reclaiming or not reclaiming at all (Debabrata Banerjee, 2017-05-02, 1 file, -2/+3)
  Move the arcstat_need_free increment from all direct calls to the case where arc_reclaim_lock is busy and we exit without doing anything; the data will then be reclaimed by the reclaim thread. The previous location meant that we both reclaimed the memory in this thread and also scheduled the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim.
  AKAMAI: zfs: CR 3695072
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
* Make arc_need_free updates atomic (Debabrata Banerjee, 2017-05-02, 1 file, -6/+7)
  Ensures proper accounting of bytes we requested to free.
  AKAMAI: zfs: CR 3695072
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
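  A combined standalone sketch of the pattern in this entry and the previous one (hypothetical names, pthreads and C11 atomics rather than the kernel primitives): reclaim inline when the lock is free, otherwise defer the exact byte count, once, to the reclaim thread through an atomic counter.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <pthread.h>

    static pthread_mutex_t reclaim_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic uint64_t need_free;   /* bytes deferred to the reclaim thread */

    static void
    reclaim_bytes(uint64_t n)
    {
        (void) n;  /* placeholder for the actual eviction work */
    }

    /* Shrinker entry point: reclaim inline, or defer exactly once. */
    static void
    shrink(uint64_t nbytes)
    {
        if (pthread_mutex_trylock(&reclaim_lock) != 0) {
            /* Lock busy: record the request for the reclaim thread. */
            atomic_fetch_add(&need_free, nbytes);
            return;
        }
        reclaim_bytes(nbytes);  /* done here, so nothing is added to need_free */
        pthread_mutex_unlock(&reclaim_lock);
    }

    /* Reclaim thread: consume the deferred amount atomically. */
    static void
    reclaim_thread_iteration(void)
    {
        uint64_t n = atomic_exchange(&need_free, 0);

        if (n != 0) {
            pthread_mutex_lock(&reclaim_lock);
            reclaim_bytes(n);
            pthread_mutex_unlock(&reclaim_lock);
        }
    }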
* Don't report ghost buffers as evictable mem (Debabrata Banerjee, 2017-05-02, 1 file, -7/+2)
  Ghost meta/data buffers are not actually allocated.
  AKAMAI: zfs: CR 3695072
  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Debabrata Banerjee <[email protected]>
  Issue #6035
* minor improvement to abd_free_pages() (jxiong, 2017-05-02, 1 file, -8/+6)
  There is no need for a loop to free the pages in a single scatterlist entry, because the entry holds either a single page or a compound page. The pages can be freed with one call to __free_pages() in both cases.
  Reviewed-by: Gvozden Neskovic <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Richard Yao <[email protected]>
  Signed-off-by: Jinshan Xiong <[email protected]>
  Closes #6057
* Guarantee PAGESIZE alignment for large zio buffers (jxiong, 2017-05-02, 1 file, -2/+2)
  In the current implementation, only zio buffers of 16KB and bigger are guaranteed PAGESIZE alignment. This breaks Lustre, since it assumes that 'arc_buf_t::b_data' must be page aligned when zio buffers are greater than or equal to PAGESIZE.
  This patch makes zio buffers PAGESIZE aligned whenever their size is not less than PAGESIZE. The change may waste a little memory, but that should be fine because after ABD is introduced, zio buffers are only used to hold data temporarily and live in memory for a short while.
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Jinshan Xiong <[email protected]>
  Signed-off-by: Jinshan Xiong <[email protected]>
  Closes #6084
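  The alignment rule itself is small; a hedged userspace sketch (posix_memalign stands in for the kernel slab setup, names are illustrative):

    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Allocate an I/O buffer, guaranteeing page alignment for any buffer
     * of at least one page, rather than only for 16K and larger.
     */
    static void *
    iobuf_alloc(size_t size)
    {
        size_t pagesize = (size_t) sysconf(_SC_PAGESIZE);
        size_t align = (size >= pagesize) ? pagesize : sizeof (void *);
        void *buf = NULL;

        if (posix_memalign(&buf, align, size) != 0)
            return (NULL);
        return (buf);
    }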
* Linux 4.12 compat: super_setup_bdi_name() (Brian Behlendorf, 2017-05-02, 1 file, -5/+6)
  All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI.
  Reviewed-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #6089
* OpenZFS 7786 - zfs`vdev_online() needs better notification about state changes (Yuri Pankov, 2017-05-01, 1 file, -6/+8)
  Authored by: Yuri Pankov <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Approved by: Albert Lee <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: bunder2015 <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/7786
  OpenZFS-commit: http://github.com/openzfs/openzfs/commit/db8498f
  Closes #6074
* Limit zfs_dirty_data_max_max to 4G (Brian Behlendorf, 2017-05-01, 1 file, -3/+3)
  Reinstate the default 4G zfs_dirty_data_max_max limit.
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #6072
  Closes #6081