path: root/module
* metaslab_verify_weight_and_frag() shouldn't cause side-effects (Serapheim Dimitropoulos, 2019-09-05, 1 file changed, -15/+24)
  `metaslab_verify_weight_and_frag()` is a verification function, and by the end of it there shouldn't be any side effects. The function calls `metaslab_weight()`, which in turn calls `metaslab_set_fragmentation()`. The latter can dirty an otherwise not-dirty metaslab for the next TXG and set `metaslab_condense_wanted` if the spacemaps were just upgraded (meaning we just enabled the SPACEMAP_HISTOGRAM feature through upgrade). This patch adds a new flag as a parameter to `metaslab_weight()` and `metaslab_set_fragmentation()`, making the dirtying of the metaslab optional.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9185 Closes #9282
* OpenZFS restructuring - move platform specific headers (Matthew Macy, 2019-09-05, 21 files changed, -24/+27)
  Move platform specific Linux headers under include/os/linux/. Update the build system accordingly to detect the platform. This lays some of the initial groundwork for supporting builds for other platforms. As part of this change it was necessary to create both a user and kernel space sys/simd.h header which can be included in either context. No functional change, the source has been refactored and the relevant #include's updated.
  Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9198
* Fix panic on DilOS with kstat per dataset statistics (Igor K, 2019-09-03, 1 file changed, -0/+1)
  Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value is ignored in the Linux kstat code but resolves the issue for other platforms.
  Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes #9254 Closes #9151
* Always refuse receiving non-resume stream when resume state exists (Andriy Gapon, 2019-09-03, 1 file changed, -3/+7)
  This fixes a hole in the situation where the resume state is left over from receiving a new dataset, and so the state is set on the dataset itself (as opposed to the %recv child). Additionally, distinguish between incremental and resume streams in error messages.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #9252
* Fix Intel QAT / ZFS compatibility on v4.7.1+ kernels (loli10K, 2019-09-03, 2 files changed, -3/+3)
  This change uses the compat code introduced in 9cc1844a.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9268 Closes #9269
* maxinflight can overflow in spa_load_verify_cb() (George Wilson, 2019-09-02, 1 file changed, -1/+2)
  When running on larger memory systems, we can overflow the value of maxinflight. This can result in maxinflight having a value of 0, causing the system to hang.
  Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #9272
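A minimal, hypothetical sketch (not the actual spa_load_verify_cb() code) of the failure mode described above: an inflight cap derived from system memory wraps to zero when computed in 32-bit arithmetic, while 64-bit arithmetic keeps the intended value. The memory size, block size, and multiplier below are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t memsize = 1ULL << 40;	/* pretend the system has 1 TiB of RAM */

	/* 32-bit arithmetic: 2^40 / 4096 * 16 == 2^32, which wraps to 0. */
	uint32_t narrow = (uint32_t)(memsize / 4096) * 16;

	/* 64-bit arithmetic preserves the full value. */
	uint64_t wide = (memsize / 4096) * 16;

	printf("32-bit cap: %u (a zero cap stalls all I/O)\n", narrow);
	printf("64-bit cap: %llu\n", (unsigned long long)wide);
	return (0);
}
```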
* Fix typos in module/zfs/ (Andrea Gelmini, 2019-09-02, 52 files changed, -114/+114)
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #9240
* Fix typos in module/ (Andrea Gelmini, 2019-08-30, 10 files changed, -16/+16)
  Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #9241
* Fix typos in modules/icp/ (Andrea Gelmini, 2019-08-30, 12 files changed, -22/+22)
  Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #9239
* Prevent metaslab_sync panic due to spa_final_dirty_txg (Paul Dagnelie, 2019-08-30, 1 file changed, -2/+9)
  If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9185 Closes #9186 Closes #9231 Closes #9253
* Keep more metaslabs loaded (Paul Dagnelie, 2019-08-29, 1 file changed, -30/+39)
  With the other metaslab changes loaded onto a system, we can significantly reduce the memory usage of each loaded metaslab and unload them on demand if there is memory pressure. However, none of those changes actually result in us keeping more metaslabs loaded. If we don't keep more metaslabs loaded, we will still have to wait for demand-loading to finish when no loaded metaslab can satisfy our allocation, which can cause ZIL performance issues. In addition, performance is traditionally measured by IOs per unit time, while unloading is currently done on a txg-count basis. Txgs can take a widely varying range of times, from tenths of a second to several seconds. This can result in confusing, hard to predict behavior. This change simply adds a time-based component to metaslab unloading. A metaslab will remain loaded for one minute and 8 txgs (by default) after it was last used, unless it is evicted due to memory pressure.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> External-issue: DLPX-65016 External-issue: DLPX-65047 Closes #9197
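A minimal model of the unload policy described in the entry above, using hypothetical tunable and parameter names (the real code lives in metaslab.c and uses different identifiers): a metaslab is only considered stale once enough txgs and enough wall-clock time have both elapsed since it was last selected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables mirroring the defaults described above. */
uint64_t unload_delay_txgs = 8;
uint64_t unload_delay_ms = 60 * 1000;	/* one minute */

/*
 * Stale by BOTH measures: enough txgs have passed since the metaslab was
 * last selected AND enough wall-clock time. Memory-pressure eviction
 * (not shown) can still unload it earlier.
 */
bool
metaslab_stale(uint64_t current_txg, uint64_t selected_txg,
    uint64_t now_ms, uint64_t selected_ms)
{
	return (current_txg > selected_txg + unload_delay_txgs &&
	    now_ms > selected_ms + unload_delay_ms);
}
```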
* Use smaller default slack/delta value for schedule_hrtimeout_range() (Tony Nguyen, 2019-08-28, 2 files changed, -18/+38)
  For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta for calls to schedule_hrtimeout_range(). This 100us slack can be costly for small writes. This change improves small write performance by passing the resolution `res` parameter to schedule_hrtimeout_range() to be used as the delta/slack. A new tunable, `spl_schedule_hrtimeout_slack_us`, is added to preserve the old behavior when desired.
  Performance observations on an 8K recordsize filesystem:
  - 8K random writes at 1-64 threads: up to 60% improvement for one thread, with smaller gains as thread count increases. At >64 threads, a 2-5% decrease in performance was observed.
  - 8K sequential writes: a similar 60% improvement for one thread, leveling out around 64 threads. At >64 threads, a 5-10% decrease in performance was observed.
  - 128K sequential writes: a 1-5% improvement, with no observed regression at high thread count.
  Testing was done on Ubuntu 18.04 with the 4.15 kernel, 8 vCPUs and SSD storage on VMware ESX.
  Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Closes #9217
* Tag ABD pages for exclusion in kernel crash dumps (Don Brady, 2019-08-28, 1 file changed, -0/+29)
  Tag the ABD data pages so that they can be identified for exclusion from kernel crash dumps. Eliminating the zfs file data allows for significantly smaller crash dump files. Note that ZFS in illumos has always excluded the zfs data pages from a kernel crash dump. This change tags ARC scatter data pages so they can be identified by the makedumpfile(8) command. That command is used to create smaller dump files by ignoring some memory regions and using compression. It already filters file data from the VFS page cache and will now be able to exclude ZFS file data pages from the dump file. A corresponding change to makedumpfile(8) is required to identify ZFS data pages.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #8899
* Fix zil replay panic when TX_REMOVE followed by TX_CREATE (Chunwei Chen, 2019-08-28, 2 files changed, -16/+41)
  If TX_REMOVE is followed by TX_CREATE on the same object id, we need to make sure the object removal is completely finished before creation. The current implementation relies on dnode_hold_impl with DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seemed to work fine before, in the current version it does not guarantee that the object removal is completed. We fix this by instead checking that DNODE_MUST_BE_FREE returns successfully. Also add a test and remove dead code in dnode_hold_impl.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #7151 Closes #8910 Closes #9123 Closes #9145
* zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets (Andriy Gapon, 2019-08-27, 1 file changed, -10/+16)
  Previously, the permissions were checked on the pool, which was obviously incorrect. After this change, zfs_check_userprops() only validates the properties without any permission checks. The permissions are checked individually for each snapshotted dataset.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #9179 Closes #9180
* Fix deadlock in 'zfs rollback' (Tom Caputi, 2019-08-27, 2 files changed, -1/+16)
  Currently, the 'zfs rollback' code can end up deadlocked due to the way the kernel handles unreferenced inodes on a suspended fs. Essentially, the zfs_resume_fs() code path may cause zfs to spawn new threads as it reinstantiates the suspended fs's zil. When a new thread is spawned, the kernel may attempt to free memory for that thread by freeing some unreferenced inodes. If it happens to select inodes that are a part of the suspended fs, a deadlock will occur because freeing inodes requires holding the fs's z_teardown_inactive_lock, which is still held from the suspend. This patch corrects this issue by adding an additional reference to all inodes that are still present when a suspend is initiated. This prevents them from being freed by the kernel for any reason.
  Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9203
* Linux 5.3: Fix switch() fall through compiler errors (Tony Hutter, 2019-08-21, 3 files changed, -3/+11)
  Fix some switch() fall-through compiler errors:
    abd.c:1504:9: error: this statement may fall through
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #9170
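For illustration only (the file name and case labels below are made up, not the actual abd.c code): newer GCC with -Wimplicit-fallthrough treats an unannotated fall through as an error under -Werror, and annotating the intentional fall through with the fallthrough attribute (or a recognized comment) satisfies the check.

```c
#include <stdio.h>

/* Build with: gcc -Wimplicit-fallthrough -Werror fallthrough.c */
static int
classify(int level)
{
	int score = 0;

	switch (level) {
	case 2:
		score += 10;
		/* Without the annotation below, GCC 7+ reports:
		 * "error: this statement may fall through" */
		__attribute__((fallthrough));
	case 1:
		score += 1;
		break;
	default:
		score = -1;
		break;
	}
	return (score);
}

int
main(void)
{
	printf("%d %d %d\n", classify(2), classify(1), classify(0));
	return (0);
}
```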
* Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj (Matthew Ahrens, 2019-08-20, 2 files changed, -34/+168)
  When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from `zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting in poor performance because we are holding the dp_config_rwlock the entire time, blocking spa_sync() from continuing. With around ten thousand snapshots, we've seen up to 500 seconds in this ioctl, iterating over up to 50,000,000 bpobjs, ~99% of which are the empty bpobj. By creating a fast path for zfs_ioc_space_snaps() handling of the empty_bpobj, we can achieve a ~5x performance improvement of this ioctl (when there are many snapshots, and the deadlist is mostly empty_bpobj's).
  Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-58348 Closes #8744
* Fix lockdep circular locking false positive involving sa_lock (jdike, 2019-08-19, 1 file changed, -1/+1)
  There are two different deadlock scenarios, but they share a common link, which is thread 1 holding sa_lock and trying to get zap->zap_rwlock:
    zap_lockdir_impl+0x858/0x16c0 [zfs]
    zap_lockdir+0xd2/0x100 [zfs]
    zap_lookup_norm+0x7f/0x100 [zfs]
    zap_lookup+0x12/0x20 [zfs]
    sa_setup+0x902/0x1380 [zfs]
    zfsvfs_init+0x3d6/0xb20 [zfs]
    zfsvfs_create+0x5dd/0x900 [zfs]
    zfs_domount+0xa3/0xe20 [zfs]
  and thread 2 trying to get sa_lock, either in sa_setup:
    sa_setup+0x742/0x1380 [zfs]
    zfsvfs_init+0x3d6/0xb20 [zfs]
    zfsvfs_create+0x5dd/0x900 [zfs]
    zfs_domount+0xa3/0xe20 [zfs]
  or in sa_build_index:
    sa_build_index+0x13d/0x790 [zfs]
    sa_handle_get_from_db+0x368/0x500 [zfs]
    zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
    zfs_znode_alloc+0x3da/0x1a40 [zfs]
    zfs_zget+0x39a/0x6e0 [zfs]
    zfs_root+0x101/0x160 [zfs]
    zfs_domount+0x91f/0xea0 [zfs]
  From there, there are different locking paths back to something holding zap->zap_rwlock. The deadlock scenarios involve multiple different ZFS filesystems being mounted. sa_lock is common to these scenarios, and the sa struct involved is private to a mount. Therefore, these must be referring to different sa_lock instances and these deadlocks can't occur in practice. The fix, from Brian Behlendorf, is to remove sa_lock from lockdep coverage by initializing it with MUTEX_NOLOCKDEP.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jeff Dike <[email protected]> Closes #9110
* Linux 5.3 compat: Makefile subdir-m no longer supported (Dominic Pearson, 2019-08-19, 1 file changed, -12/+12)
  Uses obj-m instead, due to kernel changes. See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Dominic Pearson <[email protected]> Closes #9169
* Cap metaslab memory usage (Paul Dagnelie, 2019-08-16, 7 files changed, -55/+258)
  On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations. First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9128
* dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() (Serapheim Dimitropoulos, 2019-08-15, 4 files changed, -18/+60)
  Even though the bug's writeup (Github issue #9136) is very detailed, we still don't know exactly how we got to that state, thus I wasn't able to reproduce the bug. That said, we can make an educated guess combining the information in the filed issue with the code. From the fact that `dp_dirty_total` was 0 (which is less than `zfs_dirty_data_max`) we know that there was one thread that set it to 0 and then signaled one of the waiters of `dp_spaceavail_cv` [see `dsl_pool_dirty_delta()` which is also the only place that `dp_dirty_total` is changed]. Thus, the only logical explanation for the bug being hit is that the waiter that just got awakened didn't go through `dsl_pool_dirty_data()`. Given that this function is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`, I can only think of two possible ways of the above scenario happening:
  [1] The waiter didn't call into either of the two functions - which I find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin with?).
  [2] The waiter did call one of the above functions but passed 0 as the space/delta to be dirtied (or undirtied) and then the callee returned immediately (both `dsl_pool_dirty_space()` and `dsl_pool_undirty_space()` return immediately when space is 0).
  In any case and no matter how we got there, the easy fix would be to just broadcast to all waiters whenever `dp_dirty_total` hits 0. That said, and given that we've never hit this before, it would make sense to think more about why the above situation occurred.
  Attempting to mimic what Prakash was doing in the filed issue, I created a dataset with `sync=always` and started doing contiguous writes in a file within that dataset. I observed with DTrace that even though we update the pool's dirty data accounting when we dirty stuff, the accounting wouldn't be decremented incrementally as we were done with the ZIOs of those writes (the reason being that `dbuf_write_physdone()` isn't called as we go through the override code paths, and thus `dsl_pool_undirty_space()` is never called). As a result we'd have to wait until we get to `dsl_pool_sync()` where we zero out all dirty data accounting for the pool and the current TXG's metadata. In addition, as Matt noted and I later verified, the same issue would arise when using dedup. In both cases (sync & dedup) we shouldn't have to wait until `dsl_pool_sync()` zeros out the accounting data. According to the comment in that part of the code, the reasons why we do the zeroing have nothing to do with what we observe:
```
	/*
	 * We have written all of the accounted dirty data, so our
	 * dp_space_towrite should now be zero. However, some seldom-used
	 * code paths do not adhere to this (e.g. dbuf_undirty(), also
	 * rounding error in dbuf_write_physdone).
	 * Shore up the accounting of any dirtied space now.
	 */
	dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
```
  Ideally what we want to do is to undirty in the accounting exactly what we dirty (I use the word ideally as we can still have rounding errors). This would make the behavior of the system more clear and predictable. Another interesting issue that I observed with DTrace was that we wouldn't update any of the pool's dirty data accounting whenever we would dirty and/or undirty MOS data. In addition, every time we would change the size of a dbuf through `dbuf_new_size()` we wouldn't update the accounted space dirtied in the appropriate dirty record, so when ZIOs are done we would undirty less than we dirtied from the pool's accounting point of view.
  For the first two issues observed (sync & dedup) this patch ensures that we still update the pool's accounting when we undirty data, regardless of the write being physical or not. For changes in the MOS, we first ensure to zero out the pool's dirty data accounting in `dsl_pool_sync()` after we sync the MOS. Then we can go ahead and enable the update of the pool's dirty data accounting whenever we change MOS data. Another fix is that we now update the accounting explicitly for counting errors in `dbuf_write_done()`. Finally, `dbuf_new_size()` now updates the accounted space of the appropriate dirty record correctly.
  The problem is that we still don't know how the bug came up in the filed issue. That said, the issues fixed seem to be very relevant, so instead of going with the broadcasting solution right away, I decided to leave this patch as is.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> External-issue: DLPX-47285 Closes #9137
* Improve write performance by using dmu_read_by_dnode() (Tony Nguyen, 2019-08-15, 1 file changed, -2/+7)
  In zfs_log_write(), we can use dmu_read_by_dnode() rather than dmu_read(), thus avoiding unnecessary dnode_hold() calls. We get a 2-5% performance gain for large sequential_writes tests, >=128K writes to files with recordsize=8K. Testing done on Ubuntu 18.04 with the 4.15 kernel, 8 vCPUs and SSD storage on VMware ESX.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Closes #9156
* Assert that a dnode's bonuslen never exceeds its recorded size (Serapheim Dimitropoulos, 2019-08-15, 2 files changed, -0/+52)
  This patch introduces an assertion that can catch pitfalls in development where there is a mismatch between the size of reads and writes between a *_phys structure and its respective in-core structure when bonus buffers are used. This debugging aid should be complementary to the verification done by ztest in ztest_verify_dnode_bt(). A side effect of this patch is that we now clear out any extra bytes past a bonus buffer's new size when the buffer is shrinking.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #8348
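A minimal sketch of the shrink-and-clear behavior mentioned above, with made-up names and outside any ZFS data structure: when a buffer's logical length shrinks, the tail past the new length is zeroed so stale bytes are never read back (or tripped over by an assertion) later.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Shrink the logical length of a fixed-capacity buffer, scrubbing the
 * now-unused tail so it never holds stale data.
 */
void
buf_shrink(uint8_t *buf, size_t old_len, size_t new_len)
{
	if (new_len < old_len)
		memset(buf + new_len, 0, old_len - new_len);
}
```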
* Make txg_wait_synced conditional in zfsvfs_teardown (Paul Zuchowski, 2019-08-15, 1 file changed, -1/+10)
  The call to txg_wait_synced in zfsvfs_teardown should be made conditional on the objset having dirty data. This can prevent unnecessary txg_wait_synced during some unmount operations.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #9115
* Prevent race in blkptr_verify against device removal (Paul Dagnelie, 2019-08-13, 3 files changed, -16/+23)
  When we check the vdev of the blkptr in zfs_blkptr_verify, we can run into a race condition where that vdev is temporarily unavailable. This happens during a device removal operation, when the old vdev_t has been removed from the array but the new indirect vdev has not yet been inserted. We hold the spa_config_lock while doing our sensitive verification. To ensure that we don't deadlock, we only grab the lock if we don't have config_writer held. In addition, I had to const the tags of the refcounts and the spa_config_lock arguments.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9112
* Fix out-of-order ZIL txtype lost on hardlinked files (Chunwei Chen, 2019-08-13, 3 files changed, -14/+18)
  We should only call zil_remove_async when an object is removed. However, in the current implementation, it is called whenever TX_REMOVE is called. In the case of a hardlinked file, every unlink will generate TX_REMOVE, causing operations to be dropped even when the object is not removed. We fix this by only calling zil_remove_async when the file is fully unlinked.
  Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #8769 Closes #9061
* Mark dsl_livelist_should_disable() static (Allan Jude, 2019-08-13, 1 file changed, -1/+1)
  This function is not used outside of dsl_dataset.c.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed by: Sara Hartse <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #9154
* spa_load_verify() may consume too much memory (George Wilson, 2019-08-13, 1 file changed, -9/+15)
  When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory, or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue so many I/Os that it consumes all the memory of the system, resulting in an OOM message or a hang. To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: George Wilson [email protected] External-issue: DLPX-65237 External-issue: DLPX-65238 Closes #9146
* Change boolean-like uint8_t fields in znode_t to boolean_t (Tomohiro Kusumi, 2019-08-13, 3 files changed, -29/+29)
  Given znode_t is an in-core structure, it's more readable to have them as boolean. Also co-locate existing boolean fields with them for space efficiency (expecting 8 booleans to be packed/aligned).
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #9092
* Drop KMC_NOEMERGENCY (Richard Yao, 2019-08-13, 1 file changed, -1/+1)
  This is not implemented. If it were implemented, using it would risk deadlocks on pre-3.18 kernels. Let's just drop it.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Michael Niewöhner <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #9119
* Introduce getting holds and listing bookmarks through ZCP (Serapheim Dimitropoulos, 2019-08-12, 1 file changed, -21/+253)
  Consumers of ZFS Channel Programs can now list bookmarks and get holds from datasets. A minor refactoring was also applied to distinguish between user and system properties in ZCP.
  Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Ported-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Dan Kimmel <[email protected]> OpenZFS-issue: https://illumos.org/issues/8862 Closes #7902
* Sort log spacemap tunables in alphabetical order (Serapheim Dimitropoulos, 2019-08-12, 1 file changed, -32/+32)
  Besides the whole commit being a nit, in reality it should bring the diff of the spa_log_spacemap.c source file between ZoL and delphix/zfs to 0.
  Reviewed-by: George Melikov <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9143
* Metaslab max_size should be persisted while unloaded (Paul Dagnelie, 2019-08-05, 2 files changed, -37/+162)
  When we unload metaslabs today in ZFS, the cached max_size value is discarded. We instead use the histogram to determine whether or not we think we can satisfy an allocation from the metaslab. This can result in situations where, if we're doing I/Os of a size not aligned to a histogram bucket, a metaslab is loaded even though it cannot satisfy the allocation we think it can. For example, a metaslab with 16 entries in the 16k-32k bucket may have entirely 16kB entries. If we try to allocate a 24kB buffer, we will load that metaslab because we think it should be able to handle the allocation. Doing so is expensive in CPU time, disk reads, and average IO latency. This is exacerbated if the write being attempted is a sync write. This change makes ZFS cache the max_size after the metaslab is unloaded. If we ever get a free (or a coalesced group of frees) larger than the max_size, we will update it. Otherwise, we leave it as is. When attempting to allocate, we use the max_size as a lower bound, and respect it unless we are in try_hard. However, we do age the max_size out at some point, since we expect the actual max_size to increase as we do more frees. A more sophisticated algorithm here might be helpful, but this works reasonably well.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9055
* Don't wakeup unnecessarily in 'zpool events -f' (DeHackEd, 2019-08-05, 1 file changed, -2/+1)
  ZED can prevent CPUs from properly sleeping. Rather than periodically waking up in the zevents code, just go to sleep and wait for a wakeup.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: DHE <[email protected]> Closes #9091
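A generic userspace sketch of the pattern described above (plain pthreads, not the actual zevent code): the consumer blocks on a condition variable until a producer posts an event, instead of waking on a timer just to poll.

```c
#include <pthread.h>

static pthread_mutex_t ev_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ev_cv = PTHREAD_COND_INITIALIZER;
static int ev_pending;

/* Consumer: sleep until an event exists; no periodic wakeups. */
void
wait_for_event(void)
{
	pthread_mutex_lock(&ev_lock);
	while (ev_pending == 0)
		pthread_cond_wait(&ev_cv, &ev_lock);
	ev_pending--;
	pthread_mutex_unlock(&ev_lock);
}

/* Producer: queue an event and wake exactly one sleeping consumer. */
void
post_event(void)
{
	pthread_mutex_lock(&ev_lock);
	ev_pending++;
	pthread_cond_signal(&ev_cv);
	pthread_mutex_unlock(&ev_lock);
}
```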
* lockdep false positive - move txg_kick() outside of ->dp_lock (jdike, 2019-07-31, 1 file changed, -5/+5)
  This fixes a lockdep warning by breaking a link between ->tx_sync_lock and ->dp_lock. The deadlock envisioned by lockdep is this:
  thread 1 holds db->db_mtx and tries to get dp->dp_lock:
    dsl_pool_dirty_space+0x70/0x2d0 [zfs]
    dbuf_dirty+0x778/0x31d0 [zfs]
  thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
    dmu_buf_will_dirty_impl
    dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
    bpobj_iterate_impl+0xbe6/0x1410 [zfs]
  thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
    bpobj_space+0x63/0x470 [zfs]
    dsl_scan_active+0x340/0x3d0 [zfs]
    txg_sync_thread+0x3f2/0x1370 [zfs]
  thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock:
    txg_kick+0x61/0x420 [zfs]
    dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]
  This patch is originally from Brian Behlendorf and slightly simplified by me. It breaks this cycle in thread 4 by moving the call from dsl_pool_need_dirty_delay to txg_kick outside the section controlled by dp->dp_lock.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Jeff Dike <[email protected]> Closes #9094
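A simplified, hypothetical pthread analogue of the fix (the real code uses ZFS's kmutex_t and different function bodies): the decision is made under dp_lock, but the kick that needs tx_sync_lock is issued only after dp_lock is dropped, so no thread ever holds dp_lock while acquiring tx_sync_lock.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t dp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tx_sync_lock = PTHREAD_MUTEX_INITIALIZER;

static void
kick_sync_thread(void)
{
	pthread_mutex_lock(&tx_sync_lock);
	/* ... wake the txg sync thread ... */
	pthread_mutex_unlock(&tx_sync_lock);
}

bool
need_dirty_delay(long dirty, long dirty_max)
{
	bool kick;

	pthread_mutex_lock(&dp_lock);
	/* Decide under dp_lock, but do not kick while holding it. */
	kick = (dirty > dirty_max / 2);
	pthread_mutex_unlock(&dp_lock);

	if (kick)
		kick_sync_thread();	/* dp_lock already released */

	return (dirty > dirty_max);
}
```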
* 9072 handle error of zap_cursor_retrieve() for log spacemap zap (Serapheim Dimitropoulos, 2019-07-30, 1 file changed, -2/+28)
  In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve() to return errors other than the expected ENOENT (e.g. when we are at the end of the zap). Ensure that these error cases are handled correctly by the import path.
  Reviewed by: Brian Behlendorf <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9074
* mismerged log spacemap comment for metaslab_verify_weight_and_frag (Serapheim Dimitropoulos, 2019-07-30, 1 file changed, -1/+9)
  When the log spacemap commit was merged in ZoL, the metaslab_verify_unflushed_changes() debugging function was deleted as the feature was pretty much stable by then. Unfortunately though, there was a reference to it from a comment in metaslab_verify_weight_and_frag(). This patch deletes the reference and pastes that comment as is.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9097
* Improve performance by using dmu_tx_hold_*_by_dnode() (Matthew Ahrens, 2019-07-30, 3 files changed, -9/+15)
  In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode() instead of dmu_tx_hold_*(), since we already have a dbuf from the target dnode in hand. This eliminates some calls to dnode_hold(), which can be expensive. This is especially impactful if several threads are accessing objects that are in the same block of dnodes, because they will contend for that dbuf's lock. We are seeing 10-20% performance wins for the sequential_writes tests in the performance test suite, when doing >=128K writes to files with recordsize=8K. This also removes some unnecessary casts that are in the area.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #9081
* Fix channel programs on s390x (Brian Behlendorf, 2019-07-28, 1 file changed, -1/+1)
  When adapting the original sources for s390x, the JMP_BUF_CNT was mistakenly halved due to an incorrect assumption about the size of an unsigned long. It is 8 bytes on the s390x architecture. Increase JMP_BUF_CNT accordingly.
  Authored-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reported-by: Colin Ian King <canonical.com> Tested-by: Colin Ian King <canonical.com> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8992 Closes #9080
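A small standalone check, purely illustrative (it uses the C library's jmp_buf rather than the Lua setjmp shim the commit actually touches): sizing a register-save area as a count of unsigned longs only works if the count reflects the platform's word size, which is 8 bytes on s390x and other 64-bit targets.

```c
#include <setjmp.h>
#include <stdio.h>

int
main(void)
{
	printf("sizeof(unsigned long) = %zu bytes\n", sizeof(unsigned long));
	printf("sizeof(jmp_buf)       = %zu bytes (%zu unsigned longs)\n",
	    sizeof(jmp_buf), sizeof(jmp_buf) / sizeof(unsigned long));
	return (0);
}
```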
* Implement secpolicy_vnode_setid_retain() (Tomohiro Kusumi, 2019-07-26, 1 file changed, -1/+1)
  Don't unconditionally return 0 (i.e. retain SUID/SGID); test the CAP_FSETID capability instead. https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t, which expects SUID/SGID to be dropped on write(2) by a non-owner, fails without this. Most filesystems make this decision within the VFS by using a generic file write for fops.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #9035 Closes #9043
* Fast Clone Deletion (Sara Hartse, 2019-07-26, 13 files changed, -124/+1524)
  Deleting a clone requires finding blocks that are clone-only, not shared with the snapshot. This was done by traversing the entire block tree, which results in a large performance penalty for sparsely written clones. This new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it's time to delete, the clone-specific blocks are already at hand. We see performance improvements because deletion work is now proportional to the number of clone-modified blocks, not the size of the original dataset.
  Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Sara Hartse <[email protected]> Closes #8416
* Don't directly cast unsigned long to void* (Tomohiro Kusumi, 2019-07-25, 1 file changed, -2/+3)
  Cast to uintptr_t first for portability on integer to/from pointer conversion.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #9065
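A tiny illustration of the portable idiom (generic C, not the specific call site the commit changed): round-trip an integer token through a void pointer via uintptr_t rather than casting the unsigned long directly.

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	unsigned long token = 42;

	/* Go through uintptr_t in both directions. */
	void *arg = (void *)(uintptr_t)token;
	unsigned long back = (unsigned long)(uintptr_t)arg;

	printf("%lu\n", back);	/* prints 42 */
	return (0);
}
```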
* Replace zf_rwlock with a mutex (Matthew Ahrens, 2019-07-25, 2 files changed, -22/+11)
  The rwlock implementation on linux does not perform as well as mutexes. We can realize a performance benefit by replacing the zf_rwlock with a mutex. Local microbenchmarks show ~50% improvement, and over NFS we see ~5% improvement on several of the ZFS Performance Tests cases, especially randwrite and seq_write.
  Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #9062
* Fix module_param() type for zfs_read_chunk_size (Tomohiro Kusumi, 2019-07-19, 1 file changed, -2/+4)
  zfs_read_chunk_size is unsigned long.
  Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #9051
* Tricky semantics of ms_max_size in metaslab_should_allocate() (Serapheim Dimitropoulos, 2019-07-19, 1 file changed, -7/+10)
  metaslab_should_allocate() is used in two places: [1] when trying to select a metaslab to allocate from, and [2] when trying to allocate from a metaslab. In [2] we always expect the metaslab to be loaded, and after the refactoring of the log spacemap changes, whenever we load a metaslab we set ms_max_size to the biggest range in the ms_allocatable tree. Thus, when it is used in [2], if that field is 0, it means that the metaslab doesn't have any segments that can be used for allocations now (though it may have some free space, but that space can be in the freeing, freed, or deferred trees). In [1] a metaslab can be loaded or unloaded, at which point 0 can either mean the metaslab doesn't have any space or the metaslab is just not loaded, thus we go ahead and try to make an estimation based on its weight. The issue here is that when we call the above function for [2] and the metaslab doesn't have any allocatable space, we still go ahead and check its ms_weight, which may be out of date because we haven't run metaslab_sync_done() yet. At that point we are allowing an allocation to be attempted even though we know there is no range that is allocatable. This patch fixes this issue by explicitly checking if the metaslab is loaded and, if it is, using the ms_max_size.
  Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9045
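A stripped-down model of the decision described above, with hypothetical struct and field names rather than the real metaslab_t: when the metaslab is loaded, the cached largest-free-segment size is authoritative and the weight-based estimate is not consulted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant metaslab fields. */
struct ms_model {
	bool loaded;
	uint64_t max_size;	/* largest free segment; valid when loaded */
	uint64_t weight_est;	/* coarse estimate derived from the weight */
};

bool
should_allocate(const struct ms_model *ms, uint64_t asize)
{
	if (ms->loaded) {
		/*
		 * Loaded: trust max_size. A value of 0 means no segment can
		 * satisfy an allocation right now, so don't fall back to the
		 * (possibly stale) weight.
		 */
		return (asize <= ms->max_size);
	}
	/* Unloaded: all we have is the weight-based estimate. */
	return (asize <= ms->weight_est);
}
```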
* Race condition between spa async threads and export (Serapheim Dimitropoulos, 2019-07-18, 1 file changed, -1/+17)
  In the past we've seen multiple race conditions that have to do with open-context async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue #9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating in this way any race condition bugs introduced by concurrent calls to this function.
  Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9015 Closes #9044
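A generic single-entry gate in the spirit of the description above, using pthreads and a stand-in error value (the actual spa_export_common() change uses ZFS's own locks and its own dedicated error code): the first caller proceeds, and later concurrent callers fail fast instead of racing.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

#define ERR_EXPORT_IN_PROGRESS	EBUSY	/* stand-in for the new error code */

static pthread_mutex_t export_lock = PTHREAD_MUTEX_INITIALIZER;
static bool export_in_progress;

int
export_common(void)
{
	pthread_mutex_lock(&export_lock);
	if (export_in_progress) {
		pthread_mutex_unlock(&export_lock);
		return (ERR_EXPORT_IN_PROGRESS);	/* fail fast */
	}
	export_in_progress = true;
	pthread_mutex_unlock(&export_lock);

	/* ... main body of the export runs here, single-threaded ... */

	pthread_mutex_lock(&export_lock);
	export_in_progress = false;
	pthread_mutex_unlock(&export_lock);
	return (0);
}
```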
* hdr_recl calls zthr_wakeup() on destroyed zthr (Serapheim Dimitropoulos, 2019-07-18, 1 file changed, -4/+16)
  There exists a race condition where hdr_recl() calls zthr_wakeup() on a destroyed zthr. The timeline is the following:
  [1] hdr_recl() runs first and goes into zthr_wakeup() because arc_initialized is set.
  [2] arc_fini() is called by another thread, zeroes that flag, destroying the zthr, and goes into buf_init().
  [3] hdr_recl() tries to enter the destroyed mutex and we blow up.
  This patch ensures that the ARC's zthrs are not handed any new work once arc_initialized is cleared, and destroys them only after all of the ARC state has been deleted.
  Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9047
* Fix wrong comment on zcr_blksz_{min,max} (Tomohiro Kusumi, 2019-07-18, 1 file changed, -5/+6)
  These aren't tunable; illumos has this comment fixed in "3742 zfs comments need cleaner, more consistent style", so sync with that.
  Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #9052
* Retire unused spl_{mutex,rwlock}_{init_fini} (Brian Behlendorf, 2019-07-17, 4 files changed, -85/+13)
  These functions are unused and can be removed along with the spl-mutex.c and spl-rwlock.c source files.
  Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tomohiro Kusumi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9029