* Don't wakeup unnecessarily in 'zpool events -f' (DeHackEd, 2019-08-05; 1 file changed, -2/+1)

    ZED can prevent CPUs from properly sleeping. Rather than
    periodically waking up in the zevents code, just go to sleep and
    wait for a wakeup.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: DHE <[email protected]>
    Closes #9091
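A minimal sketch of the shape of this change, using the SPL condvar API; the exact call site and the 10ms interval are assumptions for illustration, not a quote of the diff:

```
/* Before: poll every 10ms, waking otherwise-idle CPUs. */
(void) cv_timedwait_sig(&zevent_cv, &zevent_lock,
    ddi_get_lbolt() + MSEC_TO_TICK(10));

/* After: block until the zevent code does cv_broadcast() when a
 * new event is posted. */
(void) cv_wait_sig(&zevent_cv, &zevent_lock);
```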
* Test cancelling a removal in ZTS (Serapheim Dimitropoulos, 2019-08-05; 3 files changed, -4/+104)

    This patch adds a new test that sanity checks cancelling a removal.

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: John Kennedy <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9101
* lockdep false positive - move txg_kick() outside of ->dp_lock (jdike, 2019-07-31; 1 file changed, -5/+5)

    This fixes a lockdep warning by breaking a link between ->tx_sync_lock
    and ->dp_lock.

    The deadlock envisioned by lockdep is this:

    thread 1 holds db->db_mtx and tries to get dp->dp_lock:
        dsl_pool_dirty_space+0x70/0x2d0 [zfs]
        dbuf_dirty+0x778/0x31d0 [zfs]

    thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
        dmu_buf_will_dirty_impl
        dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
        bpobj_iterate_impl+0xbe6/0x1410 [zfs]

    thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
        bpobj_space+0x63/0x470 [zfs]
        dsl_scan_active+0x340/0x3d0 [zfs]
        txg_sync_thread+0x3f2/0x1370 [zfs]

    thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock:
        txg_kick+0x61/0x420 [zfs]
        dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]

    This patch is originally from Brian Behlendorf and slightly
    simplified by me. It breaks this cycle in thread 4 by moving the
    call from dsl_pool_need_dirty_delay() to txg_kick() outside the
    section controlled by dp->dp_lock.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Signed-off-by: Jeff Dike <[email protected]>
    Closes #9094
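A hedged sketch of the restructured caller: the dirty total is read under dp_lock, the lock is dropped, and only then is txg_kick() (which takes tx_sync_lock) invoked. Variable and threshold names are assumptions for illustration:

```
boolean_t
dsl_pool_need_dirty_delay(dsl_pool_t *dp)
{
    uint64_t delay_min_bytes =
        zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;
    uint64_t dirty_min_bytes =
        zfs_dirty_data_max * zfs_dirty_data_sync_percent / 100;

    mutex_enter(&dp->dp_lock);
    uint64_t dirty = dp->dp_dirty_total;
    mutex_exit(&dp->dp_lock);

    /* txg_kick() acquires tx_sync_lock; calling it here, after
     * dp_lock has been dropped, removes the dp_lock -> tx_sync_lock
     * edge from lockdep's graph. */
    if (dirty > dirty_min_bytes)
        txg_kick(dp);
    return (dirty > delay_min_bytes);
}
```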
* List log_spacemap feature in zpool-features.5 manual (Serapheim Dimitropoulos, 2019-07-31; 1 file changed, -0/+22)

    Update zpool-features.5 manpage to describe the log_spacemap feature.

    Reviewed-by: Matthew Ahrens <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Pavel Zakharov <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9096
* Add channel program for property based snapshots (Clint Armstrong, 2019-07-30; 4 files changed, -2/+79)

    Channel programs that many users find useful should be included with
    zfs in the /contrib directory. This is the first of these
    contributions: a channel program to recursively take snapshots of
    datasets with the property com.sun:auto-snapshot=true.

    Reviewed-by: Kash Pande <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Clint Armstrong <[email protected]>
    Closes #8443
    Closes #9050
* 9072 handle error of zap_cursor_retrieve() for log spacemap zap (Serapheim Dimitropoulos, 2019-07-30; 1 file changed, -2/+28)

    In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve()
    to return errors other than the expected ENOENT (e.g. when we are at
    the end of the zap). Ensure that these error cases are handled
    correctly by the import path.

    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed by: Sara Hartse <[email protected]>
    Reviewed by: Matt Ahrens <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9074
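The usual shape of a defensive zap walk, sketched under the assumption that the import code iterates the log spacemap zap with a standard cursor loop (object and variable names are illustrative):

```
zap_cursor_t zc;
zap_attribute_t za;
int error;

for (zap_cursor_init(&zc, mos, zap_obj);
    (error = zap_cursor_retrieve(&zc, &za)) == 0;
    zap_cursor_advance(&zc)) {
    /* ... record one log spacemap entry ... */
}
zap_cursor_fini(&zc);

/* ENOENT just means end-of-zap; anything else is a real failure
 * that the import path must propagate instead of ignoring. */
if (error != ENOENT)
    return (error);
```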
* mismerged log spacemap comment for metaslab_verify_weight_and_frag (Serapheim Dimitropoulos, 2019-07-30; 1 file changed, -1/+9)

    When the log spacemap commit was merged in ZoL, the
    metaslab_verify_unflushed_changes() debugging function was deleted,
    as the feature was pretty much stable by then. Unfortunately, though,
    there was a reference to it from a comment in
    metaslab_verify_weight_and_frag(). This patch deletes the reference
    and pastes that comment as is.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Igor Kozhukhov <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9097
* install path fixes (Michael Niewöhner, 2019-07-30; 5 files changed, -6/+6)

    * rpm: correct pkgconfig path

      pkgconfig files get installed to $datarootdir/pkgconfig but rpm
      expects them to be at $datadir. This works when
      $datarootdir==$datadir, which is the case most of the time, but
      will fail when they differ.

    * install: make initramfs-tools path static

      Since initramfs-tools' path is nothing we can control, as it is an
      external package, it does not make any sense to install zfs
      additions anywhere else. Simply use /usr/share/initramfs-tools as
      the path.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Signed-off-by: Michael Niewöhner <[email protected]>
    Closes #9087
* Increase default zcmd allocation to 256K (Michael Niewöhner, 2019-07-30; 1 file changed, -1/+1)

    When creating hundreds of clones (for example using containers with
    LXD) cloning slows down as the number of clones increases over time.
    The reason for this is that fetching the clone information using a
    small zcmd buffer requires two expensive ioctl() calls: one to
    determine the size of the data, and a second to actually gather it
    and return it to userspace. Instead, make the default buffer size
    much larger: 256K.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Colin Ian King <[email protected]>
    Signed-off-by: Michael Niewöhner <[email protected]>
    Closes #9084
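A hedged sketch of the userland pattern involved (the zfs_cmd_t members shown are assumptions about the relevant fields): with a large initial destination buffer, the common case completes in a single ioctl instead of a size probe plus a refetch.

```
/* Allocate a generous nvlist destination up front. */
zc.zc_nvlist_dst_size = 256 * 1024;
zc.zc_nvlist_dst = (uint64_t)(uintptr_t)malloc(zc.zc_nvlist_dst_size);

while (ioctl(fd, ZFS_IOC_OBJSET_STATS, &zc) != 0 && errno == ENOMEM) {
    /* Rare case: the kernel wrote the required size back into
     * zc_nvlist_dst_size; grow the buffer and retry. */
    zc.zc_nvlist_dst = (uint64_t)(uintptr_t)realloc(
        (void *)(uintptr_t)zc.zc_nvlist_dst, zc.zc_nvlist_dst_size);
}
```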
* Improve performance by using dmu_tx_hold_*_by_dnode() (Matthew Ahrens, 2019-07-30; 3 files changed, -9/+15)

    In zfs_write() and dmu_tx_hold_sa(), we can use
    dmu_tx_hold_*_by_dnode() instead of dmu_tx_hold_*(), since we already
    have a dbuf from the target dnode in hand. This eliminates some calls
    to dnode_hold(), which can be expensive. This is especially impactful
    if several threads are accessing objects that are in the same block
    of dnodes, because they will contend for that dbuf's lock.

    We are seeing 10-20% performance wins for the sequential_writes tests
    in the performance test suite, when doing >=128K writes to files with
    recordsize=8K.

    This also removes some unnecessary casts that are in the area.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Nguyen <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    Closes #9081
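A hedged before/after sketch; the offset, length, and dnode variables are illustrative:

```
/* Before: dmu_tx_hold_write() must translate (objset, object) back
 * into a dnode with dnode_hold(). */
dmu_tx_hold_write(tx, zp->z_id, woff, nbytes);

/* After: pass the dnode we already hold, skipping the lookup and the
 * contended lock on the shared dnode-block dbuf. */
dmu_tx_hold_write_by_dnode(tx, dn, woff, nbytes);
```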
* Revert "Develop tests for issues #5866 and #8858" (Brian Behlendorf, 2019-07-29; 11 files changed, -158/+2)

    This reverts commit 693c1fc478cc8118dd0168c4815c0ae3be41c9c3. This
    change resulted in a kmem leak being observed in existing code which
    needs to be identified and addressed.

    Reviewed-by: Paul Zuchowski <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #8978
    Closes #9090
* Fix channel programs on s390x (Brian Behlendorf, 2019-07-28; 1 file changed, -1/+1)

    When adapting the original sources for s390x, the JMP_BUF_CNT was
    mistakenly halved due to an incorrect assumption of the size of an
    unsigned long. They are 8 bytes for the s390x architecture. Increase
    JMP_BUF_CNT accordingly.

    Authored-by: Don Brady <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reported-by: Colin Ian King <canonical.com>
    Tested-by: Colin Ian King <canonical.com>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8992
    Closes #9080
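A self-contained illustration of the mistaken assumption; on s390x and other LP64 targets this prints 8:

```
#include <stdio.h>

int
main(void)
{
    /* jmp_buf slots hold registers as unsigned longs, so
     * JMP_BUF_CNT must be sized for 8-byte entries on LP64. */
    printf("sizeof(unsigned long) = %zu\n", sizeof (unsigned long));
    return (0);
}
```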
* Race between zfs-share and zfs-mount services (George Wilson, 2019-07-28; 1 file changed, -0/+1)

    When a system boots, the zfs-mount.service and the zfs-share.service
    can start simultaneously. What may be unclear is that sharing a
    filesystem will first mount the filesystem if it's not already
    mounted. This means that both services can race to mount the same
    filesystem. This race can result in SEGFAULT or EBUSY conditions.

    This change explicitly defines the start ordering between the two
    services such that the zfs-mount.service is solely responsible for
    mounting filesystems, eliminating the race between "zfs mount -a"
    and "zfs share -a" commands.

    Reviewed-by: Sebastien Roy <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: George Wilson <[email protected]>
    Closes #9083
* Develop tests for issues #5866 and #8858 (Paul Zuchowski, 2019-07-26; 11 files changed, -2/+158)

    Provide zfstest coverage for these two issues, which were a panic
    when accessing extended attributes and a problem comparing 64-bit
    and 32-bit generation numbers.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Paul Zuchowski <[email protected]>
    Issue #5866
    Issue #8858
    Closes #8978
* Implement secpolicy_vnode_setid_retain() (Tomohiro Kusumi, 2019-07-26; 13 files changed, -1/+435)

    Don't unconditionally return 0 (i.e. retain SUID/SGID); test for the
    CAP_FSETID capability instead.

    https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t, which
    expects SUID/SGID to be dropped on write(2) by a non-owner, fails
    without this. Most filesystems make this decision within the VFS by
    using a generic file write for fops.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9035
    Closes #9043
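A hedged sketch of what such a policy hook can look like on Linux; the capable() check is an assumption about the implementation, not a quote of it. Returning 0 means "retain the SUID/SGID bits", so they are kept only when the writer holds CAP_FSETID:

```
int
secpolicy_vnode_setid_retain(const cred_t *cr, boolean_t issuidroot)
{
    /* Retain SUID/SGID across the write only for privileged
     * callers; everyone else gets the bits stripped. */
    return (capable(CAP_FSETID) ? 0 : EPERM);
}
```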
* zed crashes when devid not present (Matthew Ahrens, 2019-07-26; 1 file changed, -1/+2)

    zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
    gs_devid is NULL, but the nvl has a "devid" entry.

    zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER
    is present in nvl, but then later it and zfs_agent_iter_vdev() assume
    that DEV_IDENTIFIER is present and thus gs_devid is set.

    Typically this is not a problem because usually either all vdevs have
    devid's, or none of them do. Since zfs_agent_iter_vdev() first checks
    if the vdev has a devid before dereferencing gs_devid, the problem
    isn't typically encountered. However, if some vdevs have devid's and
    some do not, then the problem is easily reproduced. This can happen
    if the pool has been moved from a system that has devid's to one
    that does not.

    The fix is for zfs_agent_iter_vdev() to only try to match the devid's
    if both nvl and gsp have devid's present.

    Reviewed-by: Prashanth Sreenivasa <[email protected]>
    Reviewed-by: Don Brady <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: loli10K <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    External-issue: DLPX-65090
    Closes #9054
    Closes #9060
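A hedged sketch of the guard; the nvlist key and field names follow the description above, and the surrounding function is abridged:

```
char *nvl_devid;

/* Only attempt a devid match when both the search state and this
 * vdev's nvlist actually carry one. */
if (gsp->gs_devid != NULL &&
    nvlist_lookup_string(nvl, DEV_IDENTIFIER, &nvl_devid) == 0 &&
    strcmp(gsp->gs_devid, nvl_devid) == 0) {
    /* ... record the match ... */
}
```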
* Fast Clone Deletion (Sara Hartse, 2019-07-26; 38 files changed, -203/+2581)

    Deleting a clone requires finding the blocks that are clone-only,
    i.e. not shared with the snapshot. This was done by traversing the
    entire block tree, which results in a large performance penalty for
    sparsely written clones.

    This new method keeps track of clone blocks when they are modified
    in a "Livelist" so that, when it's time to delete, the
    clone-specific blocks are already at hand. We see performance
    improvements because now deletion work is proportional to the number
    of clone-modified blocks, not the size of the original dataset.

    Reviewed-by: Sean Eric Fagan <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Serapheim Dimitropoulos <[email protected]>
    Signed-off-by: Sara Hartse <[email protected]>
    Closes #8416
* Don't directly cast unsigned long to void* (Tomohiro Kusumi, 2019-07-25; 1 file changed, -2/+3)

    Cast to uintptr_t first for portability on integer to/from pointer
    conversion.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9065
* Replace zf_rwlock with a mutex (Matthew Ahrens, 2019-07-25; 3 files changed, -23/+12)

    The rwlock implementation on linux does not perform as well as
    mutexes. We can realize a performance benefit by replacing the
    zf_rwlock with a mutex. Local microbenchmarks show ~50% improvement,
    and over NFS we see ~5% improvement on several of the ZFS Performance
    Tests cases, especially randwrite and seq_write.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Nguyen <[email protected]>
    Reviewed-by: Olaf Faaland <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    Closes #9062
* Fix module_param() type for zfs_read_chunk_size (Tomohiro Kusumi, 2019-07-19; 1 file changed, -2/+4)

    zfs_read_chunk_size is unsigned long.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9051
* Move some tests to cli_user/zpool_status (Tony Hutter, 2019-07-19; 11 files changed, -10/+81)

    The tests in tests/functional/cli_root/zpool_status should all
    require root. However, linux.run has "user =" specified for those
    tests, which means they run as a normal user. When I removed that
    line to run them as root, the following tests did not pass:

        zpool_status_003_pos
        zpool_status_-c_disable
        zpool_status_-c_homedir
        zpool_status_-c_searchpath

    These tests need to be run as a normal user. To fix this, move these
    tests to a new tests/functional/cli_user/zpool_status directory.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Giuseppe Di Natale <[email protected]>
    Signed-off-by: Tony Hutter <[email protected]>
    Closes #9057
* Tricky semantics of ms_max_size in metaslab_should_allocate() (Serapheim Dimitropoulos, 2019-07-19; 1 file changed, -7/+10)

    metaslab_should_allocate() is used in two places:
    [1] When trying to select a metaslab to allocate from
    [2] When trying to allocate from a metaslab

    In [2] we always expect the metaslab to be loaded, and after the
    refactoring of the log spacemap changes, whenever we load a metaslab
    we set ms_max_size to the biggest range in the ms_allocatable tree.
    Thus, when it is used in [2], if that field is 0, it means that the
    metaslab doesn't have any segments that can be used for allocations
    now (though it may have some free space, that space can be in the
    freeing, freed, or deferred trees).

    In [1] a metaslab can be loaded or unloaded, at which point 0 can
    either mean the metaslab doesn't have any space or the metaslab is
    just not loaded, thus we go ahead and try to make an estimation
    based on its weight.

    The issue here is when we call the above function for [2] and the
    metaslab doesn't have any allocatable space: we still go ahead and
    check its ms_weight, which may be out of date because we haven't run
    metaslab_sync_done() yet. At that point we are allowing an
    allocation to be attempted even though we know there is no range
    that is allocatable.

    This patch fixes this issue by explicitly checking if the metaslab
    is loaded, and if it is, using the ms_max_size.

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9045
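A hedged sketch of the decision logic after the fix; metaslab_weight_allows() is a hypothetical stand-in for the weight-based fallback:

```
boolean_t
metaslab_should_allocate(metaslab_t *msp, uint64_t asize)
{
    if (msp->ms_loaded) {
        /* ms_max_size is authoritative for a loaded metaslab;
         * 0 means nothing is allocatable right now, even if the
         * (possibly stale) weight says otherwise. */
        return (asize <= msp->ms_max_size);
    }

    /* Unloaded: estimate from the weight, as before. */
    return (metaslab_weight_allows(msp, asize));
}
```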
* Race condition between spa async threads and export (Serapheim Dimitropoulos, 2019-07-18; 6 files changed, -2/+42)

    In the past we've seen multiple race conditions that have to do with
    open-context threads, async threads, and concurrent calls to
    spa_export()/spa_destroy() (including the one referenced in issue
    #9015). This patch ensures that only one thread can execute the main
    body of spa_export_common() at a time, with subsequent threads
    returning with a new error code created just for this situation,
    eliminating this way any race condition bugs introduced by
    concurrent calls to this function.

    Reviewed by: Matt Ahrens <[email protected]>
    Reviewed by: Brian Behlendorf <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9015
    Closes #9044
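A hedged sketch of what such single-entry serialization can look like; the flag and error names are assumptions based on the description, not a quote of the patch:

```
mutex_enter(&spa_namespace_lock);
if (spa->spa_is_exporting) {
    /* Another thread is already running spa_export_common();
     * fail fast with the purpose-made error code. */
    mutex_exit(&spa_namespace_lock);
    return (SET_ERROR(ZFS_ERR_EXPORT_IN_PROGRESS));
}
spa->spa_is_exporting = B_TRUE;
mutex_exit(&spa_namespace_lock);
```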
* hdr_recl calls zthr_wakeup() on destroyed zthr (Serapheim Dimitropoulos, 2019-07-18; 1 file changed, -4/+16)

    There exists a race condition where hdr_recl() calls zthr_wakeup()
    on a destroyed zthr. The timeline is the following:

    [1] hdr_recl() runs first and goes into zthr_wakeup() because
        arc_initialized is set.
    [2] arc_fini() is called by another thread, zeroes that flag,
        destroying the zthr, and goes into buf_init().
    [3] hdr_recl() tries to enter the destroyed mutex and we blow up.

    This patch ensures that the ARC's zthrs are not offloaded any new
    work once arc_initialized has been cleared, and then destroys them
    after all of the ARC state has been deleted.

    Reviewed by: Matt Ahrens <[email protected]>
    Reviewed by: Brian Behlendorf <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9047
* zdb: don't print log spacemap stats in pools without the feature (Serapheim Dimitropoulos, 2019-07-18; 1 file changed, -0/+6)

    Creating a pool with no features enabled and running `zdb -mmmmmm`
    on it before the patch:

    ```
    Log Space Maps in Pool:

    Log Space Map Obsolete Entry Statistics:
    0        valid entries out of 0 - txg 0
    0        valid entries out of 0 - total
    ```

    After this patch the above output goes away.

    Reviewed by: Matt Ahrens <[email protected]>
    Reviewed by: Sara Hartse <[email protected]>
    Reviewed by: Brian Behlendorf <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #9048
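The obvious shape of such a guard, sketched as an assumption (the function name is illustrative; spa_feature_is_active() and SPA_FEATURE_LOG_SPACEMAP are real identifiers):

```
static void
dump_log_spacemaps(spa_t *spa)
{
    /* Nothing to report unless the pool actually uses the feature. */
    if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
        return;

    /* ... print the obsolete-entry statistics ... */
}
```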
* Fix wrong comment on zcr_blksz_{min,max} (Tomohiro Kusumi, 2019-07-18; 1 file changed, -5/+6)

    These aren't tunable; illumos has this comment fixed in "3742 zfs
    comments need cleaner, more consistent style", so sync with that.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9052
* New service that waits on zvol links to be created (Pavel Zakharov, 2019-07-17; 11 files changed, -3/+145)

    The zfs-volume-wait.service scans existing zvols and waits for their
    links under /dev to be created. Any service that depends on zvol
    links to be there should add a dependency on zfs-volumes.target. By
    default, this target is not enabled.

    Reviewed-by: Fabian Grünbichler <[email protected]>
    Reviewed-by: Antonio Russo <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: loli10K <[email protected]>
    Reviewed-by: John Gallagher <[email protected]>
    Reviewed-by: George Wilson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Pavel Zakharov <[email protected]>
    Closes #8975
* Retire unused spl_{mutex,rwlock}_{init,fini} (Brian Behlendorf, 2019-07-17; 6 files changed, -92/+13)

    These functions are unused and can be removed along with the
    spl-mutex.c and spl-rwlock.c source files.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Tomohiro Kusumi <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #9029
* Linux 5.3 compat: retire rw_tryupgrade() (Brian Behlendorf, 2019-07-17; 2 files changed, -154/+7)

    The Linux kernel's rwsems have never provided an interface to allow
    a reader to be upgraded to a writer. Historically, this
    functionality has been implemented by a SPL wrapper function.
    However, this approach depends on internal knowledge of the
    rw_semaphore and is therefore rather brittle.

    Since the ZFS code must always be able to fall back to rw_exit() and
    rw_enter() when an rw_tryupgrade() fails, this functionality isn't
    critical. Furthermore, the only potentially performance sensitive
    consumer is dmu_zfetch(), and no decrease in performance was
    observed with this change applied. See the PR comments for
    additional testing details.

    Therefore, it is being retired to make the build more robust and to
    simplify the rwlock implementation.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Tomohiro Kusumi <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #9029
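The fallback pattern referred to above, sketched for a hypothetical reader that decides it needs write access (the lock name is borrowed from dmu_zfetch for illustration):

```
if (!rw_tryupgrade(&zf->zf_rwlock)) {
    /* Upgrade unavailable (after this change: always): drop the
     * reader lock and reacquire as a writer, then revalidate any
     * state observed while the lock was held as a reader. */
    rw_exit(&zf->zf_rwlock);
    rw_enter(&zf->zf_rwlock, RW_WRITER);
}
```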
* Linux 5.3 compat: rw_semaphore owner (Brian Behlendorf, 2019-07-17; 2 files changed, -66/+5)

    Commit https://github.com/torvalds/linux/commit/94a9717b updated the
    rwsem's owner field to contain additional flags describing the
    rwsem's state. Rather than updating the wrappers to mask out these
    bits, the code no longer relies on the owner stored by the kernel.
    This does increase the size of a krwlock_t, but it makes the
    implementation less sensitive to future kernel changes.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Tomohiro Kusumi <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #9029
* Fix lockdep recursive locking false positive in dbuf_destroy (jdike, 2019-07-17; 3 files changed, -1/+6)

    lockdep reports a possible recursive lock in dbuf_destroy. It is
    true that dbuf_destroy is acquiring the dn_dbufs_mtx on one dnode
    while holding it on another dnode. However, it is impossible for
    these to be the same dnode because, among other things, dbuf_destroy
    checks MUTEX_HELD before acquiring the mutex.

    This fix defines a class NESTED_SINGLE == 1 and changes that lock to
    call mutex_enter_nested with a subclass of NESTED_SINGLE. In order
    to make the userspace code compile, include/sys/zfs_context.h now
    defines mutex_enter_nested and NESTED_SINGLE.

    This is the lockdep report:

    [  122.950921] ============================================
    [  122.950921] WARNING: possible recursive locking detected
    [  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O
    [  122.950921] --------------------------------------------
    [  122.950921] dbu_evict/1457 is trying to acquire lock:
    [  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
    [  122.950921] but task is already holding lock:
    [  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
    [  122.950921] other info that might help us debug this:
    [  122.950921]  Possible unsafe locking scenario:
    [  122.950921]        CPU0
    [  122.950921]        ----
    [  122.950921]   lock(&dn->dn_dbufs_mtx);
    [  122.950921]   lock(&dn->dn_dbufs_mtx);
    [  122.950921]  *** DEADLOCK ***
    [  122.950921]  May be due to missing lock nesting notation
    [  122.950921] 1 lock held by dbu_evict/1457:
    [  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
    [  122.950921] stack backtrace:
    [  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1
    [  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009
    [  122.950921] Call Trace:
    [  122.950921]  dump_stack+0x91/0xeb
    [  122.950921]  __lock_acquire+0x2ca7/0x4f10
    [  122.950921]  lock_acquire+0x153/0x330
    [  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
    [  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
    [  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
    [  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
    [  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
    [  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
    [  122.950921]  taskq_thread+0x979/0x1480 [spl]
    [  122.950921]  kthread+0x2e7/0x3e0
    [  122.950921]  ret_from_fork+0x27/0x50

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Jeff Dike <[email protected]>
    Closes #8984
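A hedged sketch of the annotation described above; the userspace mapping follows the commit's description of zfs_context.h:

```
#define NESTED_SINGLE 1

/* Kernel build: tell lockdep the second dn_dbufs_mtx belongs to a
 * different subclass, suppressing the self-deadlock warning. */
mutex_enter_nested(&dn->dn_dbufs_mtx, NESTED_SINGLE);

/* Userspace build (zfs_context.h): no lockdep, so the annotation
 * degrades to a plain mutex_enter(). */
#define mutex_enter_nested(mp, subclass) mutex_enter(mp)
```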
* Fix CONFIG_X86_DEBUG_FPU build failure (Brian Behlendorf, 2019-07-17; 1 file changed, -0/+9)

    When CONFIG_X86_DEBUG_FPU is defined, the alternatives_patched
    symbol is pulled in as a dependency, which results in a build
    failure. To prevent this, undefine CONFIG_X86_DEBUG_FPU to disable
    the WARN_ON_FPU() macro and rely on the WARN_ON_ONCE debugging
    checks which were previously added.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #9041
    Closes #9049
* Add missing __GFP_HIGHMEM flag to vmalloc (Michael Niewöhner, 2019-07-17; 1 file changed, -1/+2)

    Make use of the __GFP_HIGHMEM flag in vmem_alloc, which is required
    for some 32-bit systems to make use of the full available memory.
    While kernel versions >=4.12-rc1 add this flag implicitly, older
    kernels do not.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Sebastian Gottschall <[email protected]>
    Signed-off-by: Michael Niewöhner <[email protected]>
    Closes #9031
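A hedged sketch against the pre-5.8 three-argument __vmalloc() interface; the exact flag combination in the patch may differ:

```
/* Pre-4.12 kernels do not add __GFP_HIGHMEM implicitly, so pass it
 * explicitly to let 32-bit systems allocate from highmem. */
void *ptr = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
```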
* Use zfsctl_snapshot_hold() wrapper (Tomohiro Kusumi, 2019-07-17; 1 file changed, -3/+3)

    zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this
    file.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9039
* Minor style cleanup (Brian Behlendorf, 2019-07-16; 9 files changed, -42/+57)

    Resolve an assortment of style inconsistencies including use of
    white space, typos, capitalization, and line wrapping. There is no
    functional change.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #9030
* Fix get_special_prop() build failure (Brian Behlendorf, 2019-07-16; 1 file changed, -4/+2)

    The cast of the size_t returned by strlcpy() to a uint64_t by the
    VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE
    is set. This is due to the additional hardening. Since the token is
    expected to always fit in strval, the VERIFY3U has been removed. If
    somehow it doesn't, it will still be safely truncated.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Don Brady <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #8999
    Closes #9020
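A self-contained demonstration of why dropping the VERIFY is safe: strlcpy() always NUL-terminates the destination and reports the length it would have needed (strlcpy() is assumed available on the build host, e.g. via libbsd on glibc):

```
#include <stdio.h>
#include <string.h>     /* strlcpy(); assumed provided by the platform */

int
main(void)
{
    char strval[8];

    /* Truncation is safe: the destination is always a valid,
     * NUL-terminated string. */
    size_t need = strlcpy(strval, "token-that-is-too-long",
        sizeof (strval));
    printf("stored \"%s\", needed %zu bytes\n", strval, need + 1);
    return (0);
}
```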
* Add zfs create dryrun (Mike Gerdts, 2019-07-16; 6 files changed, -29/+523)

    Adds the ability to sanity check zfs create arguments and to see the
    value of any additional properties that will be local to the
    dataset. For example, automation that may need to adjust quota on a
    parent filesystem before creating a volume may call
    `zfs create -nP -V <size> <volume>` to obtain the value of
    refreservation. This adds the following options to zfs create:

      -n  dry-run (no-op)
      -v  verbose
      -P  parseable (implies verbose)

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Jerry Jelinek <[email protected]>
    Signed-off-by: Mike Gerdts <[email protected]>
    Closes #8974
* Log Spacemap Project (Serapheim Dimitropoulos, 2019-07-16; 41 files changed, -331/+3194)

    = Motivation

    At Delphix we've seen a lot of customer systems where fragmentation
    is over 75% and random writes take a performance hit because a lot
    of time is spent on I/Os that update on-disk space accounting
    metadata. Specifically, we've seen cases where 20% to 40% of sync
    time is spent after sync pass 1 and ~30% of the I/Os on the system
    are spent updating spacemaps.

    The problem is that these pools have existed long enough that we've
    touched almost every metaslab at least once, and random writes
    scatter frees across all metaslabs every TXG, thus appending to
    their spacemaps and resulting in many I/Os. To give an example,
    assuming that every VDEV has 200 metaslabs and our writes fit within
    a single spacemap block (generally 4K), we have 200 I/Os. Then if we
    assume 2 levels of indirection, we need 400 additional I/Os, and
    since we are talking about metadata for which we keep 2 extra copies
    for redundancy, we need to triple that number, leading to a total of
    1800 I/Os per VDEV every TXG.

    We could try and decrease the number of metaslabs so we have fewer
    I/Os per TXG, but then each metaslab would cover a wider range on
    disk and thus would take more time to be loaded in memory from
    disk. In addition, after it's loaded, its range tree would consume
    more memory.

    Another idea would be to just increase the spacemap block size,
    which would allow us to fit more entries within an I/O block,
    resulting in fewer I/Os per metaslab and a speedup in loading time.
    The problem is still that we don't deal with the number of I/Os
    going up as the number of metaslabs is increasing, and the fact is
    that we generally write a lot to a few metaslabs and a little to
    the rest of them. Thus, just increasing the block size would
    actually waste bandwidth because we won't be utilizing our bigger
    block size.

    = About this patch

    This patch introduces the Log Spacemap project, which provides the
    solution to the above problem while taking into account all the
    aforementioned tradeoffs. The details on how it achieves that can
    be found in the references section below and in the code (see the
    Big Theory Statement in spa_log_spacemap.c).

    Even though the change is fairly constrained within the metaslab
    and lower-level SPA codepaths, there is one side-change that is
    user-facing: VDEV IDs from VDEV holes will no longer be reused. To
    give some background and reasoning for this, when a log device is
    removed and its VDEV structure was replaced with a hole (or was
    compacted, if at the end of the vdev array), its vdev_id could be
    reused by devices added after that. Now, with the pool-wide space
    maps recording the vdev ID, this behavior can cause problems (e.g.
    is this entry referring to a segment in the new vdev or the removed
    log?). Thus, to simplify things, the ID reuse behavior is gone and
    vdev IDs for top-level vdevs are now truly unique within a pool.

    = Testing

    The illumos implementation of this feature has been used internally
    for a year and has been in production for ~6 months. For this patch
    specifically there don't seem to be any regressions introduced to
    ZTS, and I have been running zloop for a week without any related
    problems.

    = Performance Analysis (Linux Specific)

    All performance results and analysis for illumos can be found in
    the links of the references. Redoing the same experiments in Linux
    gave similar results. Below are the specifics of the Linux run.

    After the pool reached stable state, the percentage of the time
    spent in pass 1 per TXG was 64% on average for the stock bits, while
    the log spacemap bits stayed at 95% during the experiment (graph:
    sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

    Sync times per TXG were 37.6 seconds on average for the stock bits
    and 22.7 seconds for the log spacemap bits (related graph:
    sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result,
    the log spacemap bits were able to push more TXGs, which is also
    the reason why all graphs quantified per TXG have more entries for
    the log spacemap bits.

    Another interesting aspect in terms of txg syncs is that the stock
    bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
    and 20% reach 9. The log space map bits reached sync pass 4 in 79%
    of their TXGs, sync pass 7 in 19%, and sync pass 8 in 1%. This
    emphasizes the fact that not only do we spend less time on metadata,
    we also iterate fewer times to convergence in spa_sync() dirtying
    objects.
    [related graphs:
    stock: sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
    lsm:   sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

    Finally, the improvement in IOPS that the userland gains from the
    change is approximately 40%. There is a consistent win in IOPS, as
    you can see from the graphs below, but the absolute amount of
    improvement that the log spacemap gives varies within each minute
    interval.
    sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
    sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

    = Porting to Other Platforms

    For people that want to port this commit to other platforms, below
    is a list of ZoL commits that this patch depends on:

    Make zdb results for checkpoint tests consistent
    db587941c5ff6dea01932bb78f70db63cf7f38ba

    Update vdev_is_spacemap_addressable() for new spacemap encoding
    419ba5914552c6185afbe1dd17b3ed4b0d526547

    Simplify spa_sync by breaking it up to smaller functions
    8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834

    Factor metaslab_load_wait() in metaslab_load()
    b194fab0fb6caad18711abccaff3c69ad8b3f6d3

    Rename range_tree_verify to range_tree_verify_not_present
    df72b8bebe0ebac0b20e0750984bad182cb6564a

    Change target size of metaslabs from 256GB to 16GB
    c853f382db731e15a87512f4ef1101d14d778a55

    zdb -L should skip leak detection altogether
    21e7cf5da89f55ce98ec1115726b150e19eefe89

    vs_alloc can underflow in L2ARC vdevs
    7558997d2f808368867ca7e5234e5793446e8f3f

    Simplify log vdev removal code
    6c926f426a26ffb6d7d8e563e33fc176164175cb

    Get rid of space_map_update() for ms_synced_length
    425d3237ee88abc53d8522a7139c926d278b4b7f

    Introduce auxiliary metaslab histograms
    928e8ad47d3478a3d5d01f0dd6ae74a9371af65e

    Error path in metaslab_load_impl() forgets to drop ms_sync_lock
    8eef997679ba54547f7d361553d21b3291f41ae7

    = References

    Background, Motivation, and Internals of the Feature
    - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ
    - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

    Flushing Algorithm Internals & Performance Results (Illumos Specific)
    - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/
    - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw
    - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

    Upstream Delphix Issues:
    DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227,
    DLPX-59320, DLPX-63385

    Reviewed-by: Sean Eric Fagan <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: George Wilson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #8442
* Enable zfs-mount-generator by default (Antonio Russo, 2019-07-15; 1 file changed, -0/+1)

    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: Fabian Grünbichler <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Antonio Russo <[email protected]>
    Closes #8750
    Closes #8848
* systemd encryption key support (Antonio Russo, 2019-07-15; 3 files changed, -5/+55)

    Modify zfs-mount-generator to produce a dependency on new
    zfs-import-key-*.service units, dynamically created at boot to call
    zfs load-key for the encryption root, before attempting to mount any
    encrypted datasets.

    These units are created by zfs-mount-generator, and RequiresMountsFor
    on the keyfile, if present, or call systemd-ask-password if a
    passphrase is requested.

    This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb
    and @rlaager, as well an adaptation of @rlaager's script to retry on
    incorrect password entry.

    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: Fabian Grünbichler <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Antonio Russo <[email protected]>
    Closes #8750
    Closes #8848
* Drop redundant POSIX ACL check in zpl_init_acl() (Tomohiro Kusumi, 2019-07-15; 1 file changed, -7/+4)

    ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(), so
    there is no need to test again on POSIX ACL access.

    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9009
* Export dnode symbols (Brian Behlendorf, 2019-07-15; 1 file changed, -0/+10)

    External consumers such as Lustre require access to the dnode
    interfaces in order to correctly manipulate dnodes.

    Reviewed-by: James Simmons <[email protected]>
    Reviewed-by: Olaf Faaland <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #8994
    Closes #9027
* Ensure dsl_destroy_head() decrypts objsets (Tom Caputi, 2019-07-15; 1 file changed, -3/+4)

    This patch corrects a small issue where the dsl_destroy_head() code
    that runs when the async_destroy feature is disabled would not
    properly decrypt the dataset before beginning processing. If the
    dataset is not able to be decrypted, the optimization code now
    simply does not run and the dataset is completely destroyed in the
    DSL sync task.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Closes #9021
* Disable unused pathname::pn_path* (unneeded in Linux) (Tomohiro Kusumi, 2019-07-15; 2 files changed, -4/+13)

    struct pathname is originally from Solaris VFS, and it has been used
    in ZoL merely to call VOPs from the Linux VFS interface without API
    change; therefore pathname::pn_path* are unused and unneeded.
    Technically, struct pathname is a wrapper for a C string in ZoL.

    This saves a bit of stack on lookup and unlink. (Members are #if0'd
    instead of removed since comments refer to them.)

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Richard Elling <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #9025
* Linux 5.0 compat: SIMD compatibility (Brian Behlendorf, 2019-07-12; 30 files changed, -204/+454)

    Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0
    and newer kernels. This is accomplished by leveraging the fact that
    by definition dedicated kernel threads never need to concern
    themselves with saving and restoring the user FPU state. Therefore,
    they may use the FPU as long as we can guarantee user tasks always
    restore their FPU state before context switching back to user space.

    For the 5.0 and 5.1 kernels, disabling preemption and local
    interrupts is sufficient to allow the FPU to be used. All non-kernel
    threads will restore the preserved user FPU state.

    For 5.2 and later kernels, the user FPU state restoration will be
    skipped if the kernel determines the registers have not changed.
    Therefore, for these kernels we need to perform the additional step
    of saving and restoring the FPU registers. Invalidating the per-cpu
    global tracking the FPU state would force a restore, but that
    functionality is private to the core x86 FPU implementation and
    unavailable.

    In practice, restricting SIMD to kernel threads is not a major
    restriction for ZFS. The vast majority of SIMD operations are
    already performed by the IO pipeline. The remaining cases are
    relatively infrequent and can be handled by the generic code without
    significant impact. The two most noteworthy cases are:

      1) Decrypting the wrapping key for an encrypted dataset, i.e.
         `zfs load-key`. All other encryption and decryption operations
         will use the SIMD optimized implementations.

      2) Generating the payload checksums for a `zfs send` stream.

    In order to avoid making any changes to the higher layers of ZFS,
    all of the `*_get_ops()` functions were updated to take into
    consideration the calling context. This allows for the fastest
    implementation to be used as appropriate (see kfpu_allowed()).

    The only other notable instance of SIMD operations being used
    outside a kernel thread was at module load time. This code was
    moved in to a taskq in order to accommodate the new kernel thread
    restriction.

    Finally, a few other modifications were made in order to further
    harden this code and facilitate testing. They include updating each
    implementation's operations structure to be declared as a constant,
    and allowing "cycle" to be set when selecting the preferred ops in
    the kernel as well as user space.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8754
    Closes #8793
    Closes #8965
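A hedged sketch of the kind of context check kfpu_allowed() implies; the real implementation is kernel-version dependent and more involved than this:

```
static inline boolean_t
kfpu_allowed(void)
{
    /* Only dedicated kernel threads, which never carry user FPU
     * state, may enter a kfpu_begin()/kfpu_end() section. */
    return (!!(current->flags & PF_KTHREAD));
}
```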
* Fixes: #8934 Large kmem_alloc (Nick Mattis, 2019-07-10; 1 file changed, -4/+4)

    Large allocation over the spl_kmem_alloc_warn value was being
    performed. Switched to the vmem_alloc interface as specified for
    large allocations. Changed the subsequent frees to match.

    Reviewed-by: Tom Caputi <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: nmattis <[email protected]>
    Closes #8934
    Closes #9011
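The switch follows the standard SPL allocation pattern; a minimal sketch:

```
/* vmem_alloc() is the SPL interface for allocations large enough to
 * trip spl_kmem_alloc_warn; frees must use the matching vmem_free()
 * with the same size. */
buf = vmem_alloc(size, KM_SLEEP);
/* ... use buf ... */
vmem_free(buf, size);
```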
* Fix ZTS killed processes detection (Attila Fülöp, 2019-07-10; 1 file changed, -4/+4)

    log_neg_expect was using the wrong exit status to detect if a
    process got killed by SIGSEGV or SIGBUS, resulting in false
    positives.

    Reviewed-by: loli10K <[email protected]>
    Reviewed by: John Kennedy <[email protected]>
    Reviewed by: Brian Behlendorf <[email protected]>
    Signed-off-by: Attila Fülöp <[email protected]>
    Closes #9003
* pkg-utils python sitelib for SLES15 (Shaun Tancheff, 2019-07-09; 1 file changed, -2/+3)

    Use python -Esc to set __python_sitelib.

    Reviewed-by: Neal Gompa <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Shaun Tancheff <[email protected]>
    Closes #8969
* Fix race in parallel mount's thread dispatching algorithm (Tomohiro Kusumi, 2019-07-09; 4 files changed, -3/+123)

    The strategy of parallel mount is as follows.

    1) Initial thread dispatching selects sets of mount points that
       don't have dependencies on other sets, hence threads can/should
       run lock-less and shouldn't race with other threads for other
       sets. Each thread dispatched corresponds to a top level directory
       which may or may not have datasets to be mounted on sub
       directories.

    2) Subsequent recursive thread dispatching for each thread from 1)
       mounts datasets for each set of mount points. The mount points
       within each set have dependencies (i.e. child directories), so
       child directories are processed only after the parent directory
       completes.

    The problem is that the initial thread dispatching in
    zfs_foreach_mountpoint() can be multi-threaded when it needs to be
    single-threaded, and this puts threads under race condition. This
    race appeared as mount/unmount issues on ZoL, because ZoL has
    different timing regarding mount(2) execution due to
    fork(2)/exec(2) of mount(8). `zfs unmount -a`, which expects a
    proper mount order, can't unmount if the mounts were reordered by
    the race condition.

    There are currently two known patterns of the input list `handles`
    in `zfs_foreach_mountpoint(..,handles,..)` which cause the race
    condition.

    1) The #8833 case, where the input is `/a /a /a/b` after sorting.
       The problem is that libzfs_path_contains() can't correctly handle
       an input list with two identical top level directories. There is
       a race between two POSIX threads A and B,
         * ThreadA for "/a" for test1 and "/a/b"
         * ThreadB for "/a" for test0/a
       and in case of #8833, ThreadA won the race. Two threads were
       created because "/a" wasn't considered as `"/a" contains "/a"`.

    2) The #8450 case, where the input is `/ /var/data /var/data/test`
       after sorting. The problem is that libzfs_path_contains() can't
       correctly handle an input list containing "/". There is a race
       between two POSIX threads A and B,
         * ThreadA for "/" and "/var/data/test"
         * ThreadB for "/var/data"
       and in case of #8450, ThreadA won the race. Two threads were
       created because "/var/data" wasn't considered as `"/" contains
       "/var/data"`. In other words, if there is (at least one) "/" in
       the input list, the initial thread dispatching must be
       single-threaded, since every directory is a child of "/", meaning
       they all directly or indirectly depend on "/".

    In both cases, the first non_descendant_idx() call fails to
    correctly determine "path1-contains-path2", and as a result the
    initial thread dispatching creates another thread when it needs to
    be single-threaded. Fix a conditional in libzfs_path_contains() to
    consider the above two cases.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed by: Sebastien Roy <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8450
    Closes #8833
    Closes #8878
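A hedged sketch of the corrected predicate, close to but not necessarily verbatim the actual fix: equal paths contain each other, and "/" contains every path.

```
static boolean_t
libzfs_path_contains(const char *path1, const char *path2)
{
    return (strcmp(path1, path2) == 0 || strcmp(path1, "/") == 0 ||
        (strstr(path2, path1) == path2 && path2[strlen(path1)] == '/'));
}
```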
* Fix dracut Debian/Ubuntu packaging (loli10K, 2019-07-09; 1 file changed, -4/+4)

    This commit ensures make(1) targets that build .deb packages fail if
    alien(1) can't convert all .rpm files; additionally it also updates
    the zfs-dracut package name which was changed to "noarch" in
    ca4e5a7.

    Reviewed-by: Neal Gompa <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Olaf Faaland <[email protected]>
    Signed-off-by: loli10K <[email protected]>
    Closes #8990
    Closes #8991