openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	Restore :: in Makefile.am	Ryan Moeller	2019-08-26	12	-1/+16
\| \| \| \| \| \| \| \| \| \| \| \|	The double-colon looked like a typo, but it's actually an obscure feature. Rules with :: may appear multiple times and are run independently of one another in the order they appear. The use of :: for distclean-local was conventional, not accidental. Add comments to indicate the intentional use of double-colon rules. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9210
*	Add regression test for "zpool list -p"	Paul Dagnelie	2019-08-25	4	-3/+115
\| \| \| \| \| \| \| \| \| \|	Other than this test, zpool list -p is not well tested by any of the automated tests. Add a test for zpool list -p. Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9134
*	Split argument list, satisfy shellcheck SC2086	Ryan Moeller	2019-08-25	1	-6/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Split the arguments for ${TEST_RUNNER} across multiple lines for clarity. Also added quotes in the message to match the invoked command. Unquoted variables in argument lists are subject to splitting. In this particular case we can't quote the variable because it is an optional argument. Use the method suggested in the description linked below, instead. The technique is to use an unquoted variable with an alternate value. https://github.com/koalaman/shellcheck/wiki/SC2086 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9212
*	ZTS: Fix in-tree dbufstats test case	Brian Behlendorf	2019-08-22	4	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	Commit a887d653 updated the dbufstats such that escalated privileges are required. Since all tests under cli_user are run with normal privileges, move this test case to a location where it will be run with the required privileges. Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Michael Niewöhner <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9118 Closes #9196
*	Fix install error introduced by #9089 (#9205)	Brian Behlendorf	2019-08-22	1	-1/+1
\|\ \| \| \| \|	Signed-off-by: Paul Dagnelie <[email protected]>
\| *	Fix install error introduced by #9089	Paul Dagnelie	2019-08-22	1	-1/+1
\| \| \| \| \| \| \| \|	Signed-off-by: Paul Dagnelie <[email protected]>
* \|	Make slog test setup more robust	Ryan Moeller	2019-08-22	19	-10/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The slog tests fail when attempting to create pools using file vdevs that already exist from previous test runs. Remove these files in the setup for the test. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9194
* \|	zfs-mount-generator: dependencies should be space-separated	yshui	2019-08-22	1	-1/+1
\|/ \| \| \| \| \|	Reviewed-by: Antonio Russo <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Yuxuan Shui <[email protected]> Closes #9174
*	ZTS: Use decimal values when setting tunables	Brian Behlendorf	2019-08-22	4	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \|	The mdb_set_uint32 function requires that the values passed in be decimal. This was overlooked initially because the matching Linux function accepts both decimal and hexadecimal values. Reviewed-by: John Kennedy <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes #9125 Closes #9195
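The decimal-versus-hexadecimal difference described above can be illustrated with the standard C `strtoul()` function. This is only an analogy, not the code of `mdb_set_uint32` or of the Linux helper: `example_parse_uint32` is a hypothetical name, and base 10 versus base 0 stands in for "decimal only" versus "decimal or hexadecimal".

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse `s` as an unsigned 32-bit value in `base`.
 * Returns the value, or -1 if `s` is not a complete number in that base.
 * Base 10 accepts decimal only; base 0 lets strtoul() auto-detect a
 * leading "0x" and so accepts both decimal and hexadecimal input.
 */
long long
example_parse_uint32(const char *s, int base)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, base);
	if (errno != 0 || end == s || *end != '\0' || v > 0xffffffffUL)
		return (-1);
	return ((long long)v);
}
```

With base 10, the string "0x10" is rejected because parsing stops at the `x`; with base 0 the same string yields 16. Passing decimal strings works in both modes, which is why the commit standardizes on decimal.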
*	Fix automake program name transformations (#9190)	Brian Behlendorf	2019-08-22	1	-2/+8
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Automake can perform program name transformations at install time. However, arc_summary has its own name transformation taking place, which interferes with the automake transforms. The automake transforms must be taken into account in order to resolve the conflict. Signed-off-by: Ryan Moeller <[email protected]>
\| *	Fix automake program name transformations	Ryan Moeller	2019-08-20	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Automake can perform program name transformations at install time. However, arc_summary has its own name transformation taking place, which interferes with the automake transforms. The automake transforms must be taken into account in order to resolve the conflict. Signed-off-by: Ryan Moeller <[email protected]>
* \|	Document ZFS_DKMS_ENABLE_DEBUGINFO in userland configuration	Mauricio Faria de Oliveira	2019-08-22	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Document the ZFS_DKMS_ENABLE_DEBUGINFO option in the userland configuration file, as done with the other ZFS_DKMS_* options. It has been introduced with commit e45c1734a665 ("dkms: Enable debuginfo option to be set with zfs sysconfig file") but isn't mentioned anywhere other than the 'dkms.conf' file (generated). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mauricio Faria de Oliveira <[email protected]> Closes #9191
* \|	Dedup IOC enum values in libzfs_input_check	Ryan Moeller	2019-08-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reuse enum value ZFS_IOC_BASE for `('Z' << 8)`. This is helpful on FreeBSD where ZFS_IOC_BASE has a different value and `('Z' << 8)` is wrong. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9188
* \|	Enhance ioctl number checks	Ryan Moeller	2019-08-22	1	-87/+99
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When checking ZFS_IOC_* numbers, print which numbers are wrong rather than silently failing. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9187
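The two ioctl commits above (reusing the base enum value, and reporting which numbers are wrong) can be sketched together as follows. All names here (`EXAMPLE_IOC_*`, `example_check_ioc_numbers`, `example_self_check`) are hypothetical stand-ins, not the identifiers used in libzfs_input_check, and the table is far smaller than the real one.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical ioctl numbering; every value is derived from the base. */
enum example_ioc {
	EXAMPLE_IOC_BASE = ('Z' << 8),
	EXAMPLE_IOC_POOL_CREATE = EXAMPLE_IOC_BASE,
	EXAMPLE_IOC_POOL_DESTROY,
	EXAMPLE_IOC_POOL_IMPORT
};

struct example_ioc_expectation {
	const char *name;
	unsigned int actual;	/* value the headers define */
	unsigned int offset;	/* expected distance from the base */
};

/* Print every mismatch instead of failing silently; return the count. */
int
example_check_ioc_numbers(const struct example_ioc_expectation *tbl,
    size_t n, unsigned int base)
{
	int bad = 0;

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].actual != base + tbl[i].offset) {
			(void) fprintf(stderr, "%s: expected 0x%x, got 0x%x\n",
			    tbl[i].name, base + tbl[i].offset, tbl[i].actual);
			bad++;
		}
	}
	return (bad);
}

/* Run the check against a small built-in table using the given base. */
int
example_self_check(unsigned int base)
{
	const struct example_ioc_expectation tbl[] = {
		{ "EXAMPLE_IOC_POOL_CREATE", EXAMPLE_IOC_POOL_CREATE, 0 },
		{ "EXAMPLE_IOC_POOL_DESTROY", EXAMPLE_IOC_POOL_DESTROY, 1 },
		{ "EXAMPLE_IOC_POOL_IMPORT", EXAMPLE_IOC_POOL_IMPORT, 2 }
	};

	return (example_check_ioc_numbers(tbl, sizeof (tbl) / sizeof (tbl[0]),
	    base));
}
```

Passing the enum's own base yields no mismatches, while a hard-coded base that differs from the platform's value flags every entry by name, which is the situation the FreeBSD note describes.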
* \|	ZTS: Fix vdev_zaps_005_pos on CentOS 6	Brian Behlendorf	2019-08-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ancient version of blkid (v2.17.2) used in CentOS 6 will not detect the newly created pool unless it has been written to. Force a pool sync so `zpool import` will detect the newly created pool. Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9199
* \|	Linux 5.3: Fix switch() fall-through compiler errors	Tony Hutter	2019-08-21	3	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix some switch() fall-through compiler errors: abd.c:1504:9: error: this statement may fall through Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #9170
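For readers unfamiliar with this class of diagnostic, here is a minimal sketch of intentional fall-through marked so that GCC's -Wimplicit-fallthrough accepts it. The function `example_flags` is hypothetical, and the marker comment is one common convention; the source does not say which form the actual patch used.

```c
/*
 * Hypothetical example: each level also enables everything below it,
 * so the cases fall through on purpose. The comments tell the compiler
 * (and the reader) that the missing break is intentional.
 */
int
example_flags(int level)
{
	int flags = 0;

	switch (level) {
	case 2:
		flags |= 0x4;
		/* FALLTHROUGH */
	case 1:
		flags |= 0x2;
		/* FALLTHROUGH */
	case 0:
		flags |= 0x1;
		break;
	default:
		break;
	}
	return (flags);
}
```

Without the marker comments (or an explicit break), a build that treats warnings as errors stops with "this statement may fall through", as in the abd.c message quoted above.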
* \|	Minor cleanup in Makefile.am	Ryan Moeller	2019-08-21	1	-6/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Split long lines where adding license info to dist archive. Remove extra colon from target line. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9189
* \|	zfs-functions.in: in_mtab() always returns 1	Alexey Smirnoff	2019-08-20	1	-2/+5
\|/ \| \| \| \| \| \| \| \| \| \|	$fs was used with the wrong sed command where $mntpnt should be used instead, to match the variable exported by read_mtab(). The fix mostly reuses the sed command found in read_mtab(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Michael Niewöhner <[email protected]> Signed-off-by: Alexey Smirnoff <[email protected]> Closes #9168
*	Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj	Matthew Ahrens	2019-08-20	3	-35/+181
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from `zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting in poor performance because we are holding the dp_config_rwlock the entire time, blocking spa_sync() from continuing. With around ten thousand snapshots, we've seen up to 500 seconds in this ioctl, iterating over up to 50,000,000 bpobjs, ~99% of which are the empty bpobj. By creating a fast path for zfs_ioc_space_snaps() handling of the empty_bpobj, we can achieve a ~5x performance improvement of this ioctl (when there are many snapshots, and the deadlist is mostly empty_bpobj's). Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-58348 Closes #8744
*	Fix lockdep circular locking false positive involving sa_lock	jdike	2019-08-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are two different deadlock scenarios, but they share a common link, which is thread 1 holding sa_lock and trying to get zap->zap_rwlock: zap_lockdir_impl+0x858/0x16c0 [zfs] zap_lockdir+0xd2/0x100 [zfs] zap_lookup_norm+0x7f/0x100 [zfs] zap_lookup+0x12/0x20 [zfs] sa_setup+0x902/0x1380 [zfs] zfsvfs_init+0x3d6/0xb20 [zfs] zfsvfs_create+0x5dd/0x900 [zfs] zfs_domount+0xa3/0xe20 [zfs] and thread 2 trying to get sa_lock, either in sa_setup: sa_setup+0x742/0x1380 [zfs] zfsvfs_init+0x3d6/0xb20 [zfs] zfsvfs_create+0x5dd/0x900 [zfs] zfs_domount+0xa3/0xe20 [zfs] or in sa_build_index: sa_build_index+0x13d/0x790 [zfs] sa_handle_get_from_db+0x368/0x500 [zfs] zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs] zfs_znode_alloc+0x3da/0x1a40 [zfs] zfs_zget+0x39a/0x6e0 [zfs] zfs_root+0x101/0x160 [zfs] zfs_domount+0x91f/0xea0 [zfs] From there, there are different locking paths back to something holding zap->zap_rwlock. The deadlock scenarios involve multiple different ZFS filesystems being mounted. sa_lock is common to these scenarios, and the sa struct involved is private to a mount. Therefore, these must be referring to different sa_lock instances and these deadlocks can't occur in practice. The fix, from Brian Behlendorf, is to remove sa_lock from lockdep coverage by initializing it with MUTEX_NOLOCKDEP. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jeff Dike <[email protected]> Closes #9110
*	Linux 5.3 compat: Makefile subdir-m no longer supported	Dominic Pearson	2019-08-19	2	-12/+23
\| \| \| \| \| \| \| \| \| \|	Uses obj-m instead, due to kernel changes. See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Dominic Pearson <[email protected]> Closes #9169
*	Set "none" scheduler if available (initramfs)	colmbuckley	2019-08-19	1	-6/+13
\| \| \| \| \| \| \| \| \| \| \| \| \|	Existing zfs initramfs script logic will attempt to set the 'noop' scheduler if it's available on the vdev block devices. Newer kernels have the similar 'none' scheduler on multiqueue devices; this change alters the initramfs script logic to also attempt to set this scheduler if it's available. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Garrett Fields <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes #9042
*	Add more refquota tests	Paul Dagnelie	2019-08-19	4	-2/+137
\| \| \| \| \| \| \| \| \| \| \| \| \|	It used to be possible for zfs receive (and other operations related to clone swap) to bypass refquotas. This can cause a number of issues, and there should be an automated test for it. Added tests for rollback and receive not overriding refquota. Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9139
*	Cap metaslab memory usage	Paul Dagnelie	2019-08-16	11	-58/+289
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations. First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9128
*	initramfs: fixes for (debian) initramfs	Michael Niewöhner	2019-08-16	4	-9/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions At least Debian needs /etc/default/zfs and /etc/zfs/zfs-functions for its initramfs. Include both in the build when initramfs is configured. * contrib/initramfs: include 60-zvol.rules and zvol_id Include 60-zvol.rules and zvol_id and set udev as a predependency instead of Debian's zdev. This makes Debian's additional zdev hook unneeded. * Correct initconfdir substitution for some distros Not every Linux distro uses @sysconfdir@/default; some use @initconfdir@, which is already determined by configure. Let's use it. * systemd: prevent possible conflict between systemd and sysvinit Systemd will not load a sysvinit service if a unit exists with the same name. This prevents conflicts between sysvinit and systemd. In ZFS there is one sysvinit service that does not have a systemd service but a target counterpart, zfs-import.target. Usually it does not make any sense to install both, but it is possible. Let's prevent any conflict by masking zfs-import.service by default. This does no harm even if init.d/zfs-import does not exist. Reviewed-by: Chris Wedgwood <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Tested-by: Alex Ingram <[email protected]> Tested-by: Dreamcat4 <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #7904 Closes #9089
*	dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()	Serapheim Dimitropoulos	2019-08-15	4	-18/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even though the bug's writeup (Github issue #9136) is very detailed, we still don't know exactly how we got to that state, thus I wasn't able to reproduce the bug. That said, we can make an educated guess combining the information in the filed issue with the code. From the fact that `dp_dirty_total` was 0 (which is less than `zfs_dirty_data_max`) we know that there was one thread that set it to 0 and then signaled one of the waiters of `dp_spaceavail_cv` [see `dsl_pool_dirty_delta()` which is also the only place that `dp_dirty_total` is changed]. Thus, the only logical explanation for the bug being hit is that the waiter that was just awakened didn't go through `dsl_pool_dirty_delta()`. Given that this function is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()` I can only think of two possible ways of the above scenario happening: [1] The waiter didn't call into either of the two functions - which I find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin with?). [2] The waiter did call one of the above functions but it passed 0 as the space/delta to be dirtied (or undirtied) and then the callee returned immediately (e.g. both `dsl_pool_dirty_space()` and `dsl_pool_undirty_space()` return immediately when space is 0). In any case and no matter how we got there, the easy fix would be to just broadcast to all waiters whenever `dp_dirty_total` hits 0. That said and given that we've never hit this before, it would make sense to think more on why the above situation occurred. Attempting to mimic what Prakash was doing in the issue filed, I created a dataset with `sync=always` and started doing contiguous writes in a file within that dataset. 
I observed with DTrace that even though we update the pool's dirty data accounting when we would dirty stuff, the accounting wouldn't be decremented incrementally as we were done with the ZIOs of those writes (the reason being that `dbuf_write_physdone()` isn't called as we go through the override code paths, and thus `dsl_pool_undirty_space()` is never called). As a result we'd have to wait until we get to `dsl_pool_sync()` where we zero out all dirty data accounting for the pool and the current TXG's metadata. In addition, as Matt noted and I later verified, the same issue would arise when using dedup. In both cases (sync & dedup) we shouldn't have to wait until `dsl_pool_sync()` zeros out the accounting data. According to the comment in that part of the code, the reasons why we do the zeroing have nothing to do with what we observe: ```` /* * We have written all of the accounted dirty data, so our * dp_space_towrite should now be zero. However, some seldom-used * code paths do not adhere to this (e.g. dbuf_undirty(), also * rounding error in dbuf_write_physdone). * Shore up the accounting of any dirtied space now. */ dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg); ```` Ideally what we want to do is to undirty in the accounting exactly what we dirty (I use the word ideally as we can still have rounding errors). This would make the behavior of the system more clear and predictable. Another interesting issue that I observed with DTrace was that we wouldn't update any of the pool's dirty data accounting whenever we would dirty and/or undirty MOS data. In addition, every time we would change the size of a dbuf through `dbuf_new_size()` we wouldn't update the accounted space dirtied in the appropriate dirty record, so when ZIOs are done we would undirty less than we dirtied from the pool's accounting point of view. 
For the first two issues observed (sync & dedup) this patch ensures that we still update the pool's accounting when we undirty data, regardless of the write being physical or not. For changes in the MOS, we first make sure to zero out the pool's dirty data accounting in `dsl_pool_sync()` after we synced the MOS. Then we can go ahead and enable the update of the pool's dirty data accounting whenever we change MOS data. Another fix is that we now update the accounting explicitly for counting errors in `dbuf_write_done()`. Finally, `dbuf_new_size()` updates the accounted space of the appropriate dirty record correctly now. The problem is that we still don't know how the bug came up in the issue filed. That said, the issues fixed seem to be very relevant, so instead of going with the broadcasting solution right away, I decided to leave this patch as is. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> External-issue: DLPX-47285 Closes #9137
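The signal-versus-broadcast reasoning in this commit message can be sketched with a single-threaded model. This is a simplified illustration under stated assumptions, not the ZFS implementation: `struct example_pool` and the `example_*` functions are hypothetical, waiters are just a counter, and "waking" a waiter only decrements that counter.

```c
#include <stdint.h>

/* Hypothetical model of the pool's dirty data accounting. */
struct example_pool {
	uint64_t dirty_total;	/* bytes currently accounted as dirty */
	uint64_t dirty_max;	/* limit below which writers may proceed */
	int waiters;		/* threads blocked waiting for space */
};

/*
 * Apply a delta and wake waiters. With broadcast_on_zero unset, at most
 * one waiter is woken per call (signal). With it set, every waiter is
 * woken once the total reaches 0 (broadcast). Returns how many woke.
 */
int
example_dirty_delta(struct example_pool *p, int64_t delta,
    int broadcast_on_zero)
{
	int woken = 0;

	p->dirty_total += (uint64_t)delta;
	if (p->dirty_total < p->dirty_max && p->waiters > 0) {
		if (broadcast_on_zero && p->dirty_total == 0)
			woken = p->waiters;
		else
			woken = 1;
		p->waiters -= woken;
	}
	return (woken);
}

/* A zero-sized undirty returns early, so nobody is woken at all. */
int
example_undirty_space(struct example_pool *p, uint64_t space,
    int broadcast_on_zero)
{
	if (space == 0)
		return (0);
	return (example_dirty_delta(p, -(int64_t)space, broadcast_on_zero));
}

/* Start with 100 dirty bytes and 3 waiters, then undirty `space`. */
int
example_wakeups(uint64_t space, int broadcast_on_zero)
{
	struct example_pool p = { 100, 1000, 3 };

	return (example_undirty_space(&p, space, broadcast_on_zero));
}
```

In this model, draining all 100 bytes wakes one waiter with signal semantics and all three with broadcast, while a zero-sized call wakes none. That matches scenario [2] above: a woken waiter that never passes a nonzero delta never forwards the wakeup to the next waiter.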
*	Improve write performance by using dmu_read_by_dnode()	Tony Nguyen	2019-08-15	1	-2/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In zfs_log_write(), we can use dmu_read_by_dnode() rather than dmu_read() thus avoiding unnecessary dnode_hold() calls. We get a 2-5% performance gain for large sequential_writes tests, >=128K writes to files with recordsize=8K. Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Closes #9156
*	Assert that a dnode's bonuslen never exceeds its recorded size	Serapheim Dimitropoulos	2019-08-15	2	-0/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch introduces an assertion that can catch pitfalls in development where there is a mismatch between the size of reads and writes between a *_phys structure and its respective in-core structure when bonus buffers are used. This debugging aid should be complementary to the verification done by ztest in ztest_verify_dnode_bt(). A side effect of this patch is that we now clear out any extra bytes past a bonus buffer's new size when the buffer is shrinking. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #8348
*	Make txg_wait_synced conditional in zfsvfs_teardown	Paul Zuchowski	2019-08-15	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \|	The call to txg_wait_synced in zfsvfs_teardown should be made conditional on the objset having dirty data. This can prevent unnecessary txg_wait_synced during some unmount operations. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #9115
*	Prevent race in blkptr_verify against device removal	Paul Dagnelie	2019-08-13	5	-28/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we check the vdev of the blkptr in zfs_blkptr_verify, we can run into a race condition where that vdev is temporarily unavailable. This happens when a device removal operation is in progress and the old vdev_t has been removed from the array, but the new indirect vdev has not yet been inserted. We hold the spa_config_lock while doing our sensitive verification. To ensure that we don't deadlock, we only grab the lock if we don't have config_writer held. In addition, I had to make the tags of the refcounts and the spa_config_lock arguments const. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9112
*	Fix out-of-order ZIL txtype lost on hardlinked files	Chunwei Chen	2019-08-13	5	-15/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We should only call zil_remove_async when an object is removed. However, in the current implementation, it is called whenever TX_REMOVE is called. In the case of a hardlinked file, every unlink generates TX_REMOVE, causing operations to be dropped even when the object is not removed. We fix this by only calling zil_remove_async when the file is fully unlinked. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #8769 Closes #9061
*	Fix device expansion when VM is powered off	Prakash Surya	2019-08-13	1	-25/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When running on an ESXi based VM, I've found that "zpool online -e" will not expand the zpool, if the disk was expanded in ESXi while the VM was powered off. For example, take the following scenario: 1. VM running on top of VMware ESXi 2. ZFS pool created with a given device "sda" of size 8GB 3. VM powered off 4. Device "sda" size expanded to 16GB 5. VM powered on 6. "zpool online -e" used on device "sda" In this situation, after (2) the zpool will be roughly 8GB in size. After (6), the expectation is the zpool's size will expand to roughly 16GB in size; i.e. expand to the new size of the "sda" device. Unfortunately, I've seen that after (6), the zpool size does not change. What's happening is after (5), the EFI label of the "sda" device will be such that fields "efi_last_u_lba", "efi_last_lba", and "efi_altern_lba" all reflect the new size of the disk; i.e. "33554398", "33554431", and "33554431" respectively. Thus, the check that we perform in "efi_use_whole_disk": if ((efi_label->efi_altern_lba == 1) \|\| (efi_label->efi_altern_lba >= efi_label->efi_last_lba)) { This will return true, and then we return from the function without having expanded the size of the zpool/device. In contrast, if we remove steps (3) and (5) in the sequence above, i.e. the device is expanded while the VM is powered on, things change. In that case, the fields "efi_last_u_lba" and "efi_altern_lba" do not change (i.e. they still reflect the old 8GB device size), but the "efi_last_lba" field does change (i.e. it now reflects the new 16GB device size). Thus, when we evaluate the same conditional in "efi_use_whole_disk", it'll return false, so the zpool is expanded. Taking all of this into account, this PR updates "efi_use_whole_disk" to properly expand the zpool when the underlying disk is expanded while the VM is powered off. 
Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #9111
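The two outcomes of the "efi_use_whole_disk" conditional described in this commit can be worked through with the numbers from the message. The sketch below is a hypothetical model, not the libefi code: `struct example_efi_label` holds only the three fields mentioned, and the 8GB values in the powered-on case (16777182 and 16777215) are assumed for illustration, taking 512-byte sectors and the same offsets as the quoted 16GB values.

```c
#include <stdint.h>

/* Hypothetical subset of an EFI label; field names follow the message. */
struct example_efi_label {
	uint64_t efi_last_lba;		/* last LBA of the device */
	uint64_t efi_last_u_lba;	/* last usable LBA */
	uint64_t efi_altern_lba;	/* LBA of the backup label */
};

/* The pre-fix check: nonzero means "return without expanding". */
int
example_skips_expansion(const struct example_efi_label *l)
{
	return ((l->efi_altern_lba == 1) ||
	    (l->efi_altern_lba >= l->efi_last_lba));
}

/* Disk grown while the VM was off: all fields show the 16GB size. */
int
example_powered_off_case(void)
{
	struct example_efi_label l = {
		.efi_last_lba = 33554431,
		.efi_last_u_lba = 33554398,
		.efi_altern_lba = 33554431
	};

	return (example_skips_expansion(&l));
}

/* Disk grown while the VM was on: only efi_last_lba shows 16GB. */
int
example_powered_on_case(void)
{
	struct example_efi_label l = {
		.efi_last_lba = 33554431,
		.efi_last_u_lba = 16777182,	/* assumed old 8GB value */
		.efi_altern_lba = 16777215	/* assumed old 8GB value */
	};

	return (example_skips_expansion(&l));
}
```

In the powered-off case the backup label already sits at the last LBA, so the old check returns early and the pool stays at roughly 8GB; in the powered-on case the check is false and expansion proceeds.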
*	Mark dsl_livelist_should_disable() static	Allan Jude	2019-08-13	1	-1/+1
\| \| \| \| \| \| \| \| \|	This function is not used outside of dsl_dataset.c Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed by: Sara Hartse <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #9154
*	spa_load_verify() may consume too much memory	George Wilson	2019-08-13	4	-21/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue so many I/Os that it consumes all the memory of the system, resulting in an OOM message or a hang. To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: George Wilson [email protected] External-issue: DLPX-65237 External-issue: DLPX-65238 Closes #9146
*	Change boolean-like uint8_t fields in znode_t to boolean_t	Tomohiro Kusumi	2019-08-13	4	-37/+37
\| \| \| \| \| \| \| \| \|	Given znode_t is an in-core structure, it's more readable to have them as boolean. Also co-locate existing boolean fields with them for space efficiency (expecting 8 booleans to be packed/aligned). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #9092
*	Drop KMC_NOEMERGENCY	Richard Yao	2019-08-13	2	-3/+1
\| \| \| \| \| \| \| \| \|	This is not implemented. If it were implemented, using it would risk deadlocks on pre-3.18 kernels. Let's just drop it. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Michael Niewöhner <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #9119
*	Introduce getting holds and listing bookmarks through ZCP	Serapheim Dimitropoulos	2019-08-12	10	-25/+593
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Consumers of ZFS Channel Programs can now list bookmarks and get holds from datasets. A minor refactoring was also applied to distinguish between user and system properties in ZCP. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Ported-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Dan Kimmel <[email protected]> OpenZFS-issue: https://illumos.org/issues/8862 Closes #7902
*	Sort log spacemap tunables in alphabetical order	Serapheim Dimitropoulos	2019-08-12	1	-32/+32
\| \| \| \| \| \| \| \| \| \| \|	Besides the whole commit being a nit, in reality it should bring the diffs of the spa_log_spacemap.c source file between ZoL and delphix/zfs to 0. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9143
*	Metaslab max_size should be persisted while unloaded	Paul Dagnelie	2019-08-05	7	-40/+190
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we unload metaslabs today in ZFS, the cached max_size value is discarded. We instead use the histogram to determine whether or not we think we can satisfy an allocation from the metaslab. This can result in situations where, if we're doing I/Os of a size not aligned to a histogram bucket, a metaslab is loaded even though it cannot satisfy the allocation we think it can. For example, a metaslab with 16 entries in the 16k-32k bucket may have entirely 16kB entries. If we try to allocate a 24kB buffer, we will load that metaslab because we think it should be able to handle the allocation. Doing so is expensive in CPU time, disk reads, and average IO latency. This is exacerbated if the write being attempted is a sync write. This change makes ZFS cache the max_size after the metaslab is unloaded. If we ever get a free (or a coalesced group of frees) larger than the max_size, we will update it. Otherwise, we leave it as is. When attempting to allocate, we use the max_size as a lower bound, and respect it unless we are in try_hard. However, we do age the max_size out at some point, since we expect the actual max_size to increase as we do more frees. A more sophisticated algorithm here might be helpful, but this works reasonably well. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9055
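The 16kB/24kB example in this commit message can be modeled in a few lines. This is a simplified sketch, not the metaslab code: the power-of-two histogram layout, the bucket count, and all `example_*` names are assumptions made for illustration.

```c
#include <stdint.h>

#define	EXAMPLE_BUCKETS	64

/* Index of the power-of-two bucket holding `size`: floor(log2(size)). */
int
example_bucket_of(uint64_t size)
{
	int b = 0;

	while (size >>= 1)
		b++;
	return (b);
}

/*
 * Histogram-only guess: any free segment counted in the bucket that
 * contains `size`, or in a larger bucket, might satisfy the allocation.
 */
int
example_histogram_may_satisfy(const uint64_t *hist, uint64_t size)
{
	for (int b = example_bucket_of(size); b < EXAMPLE_BUCKETS; b++) {
		if (hist[b] != 0)
			return (1);
	}
	return (0);
}

/* Cached max_size check: exact for the largest known free segment. */
int
example_max_size_may_satisfy(uint64_t max_size, uint64_t size)
{
	return (max_size >= size);
}

/* The scenario above: 16 free segments, all in the 16k-32k bucket. */
int
example_histogram_guess(uint64_t size)
{
	uint64_t hist[EXAMPLE_BUCKETS] = { 0 };

	hist[14] = 16;	/* bucket 14 covers [16K, 32K) */
	return (example_histogram_may_satisfy(hist, size));
}
```

For a 24kB request the histogram guess says "maybe" because bucket 14 is non-empty, yet every segment there is exactly 16kB, so a cached max_size of 16384 correctly says "no". Keeping max_size across unloads avoids loading the metaslab only to find that out.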
*	Don't wakeup unnecessarily in 'zpool events -f'	DeHackEd	2019-08-05	1	-2/+1
\| \| \| \| \| \| \| \| \| \|	ZED can prevent CPUs from properly sleeping. Rather than periodically waking up in the zevents code, just go to sleep and wait for a wakeup. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: DHE <[email protected]> Closes #9091
*	Test cancelling a removal in ZTS	Serapheim Dimitropoulos	2019-08-05	3	-4/+104
\| \| \| \| \| \| \| \| \|	This patch adds a new test that sanity checks cancelling a removal. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9101
*	lockdep false positive - move txg_kick() outside of ->dp_lock	jdike	2019-07-31	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixes a lockdep warning by breaking a link between ->tx_sync_lock and ->dp_lock. The deadlock envisioned by lockdep is this: thread 1 holds db->db_mtx and tries to get dp->dp_lock: dsl_pool_dirty_space+0x70/0x2d0 [zfs] dbuf_dirty+0x778/0x31d0 [zfs] thread 2 holds bpo->bpo_lock and tries to get db->db_mtx: dmu_buf_will_dirty_impl dmu_buf_will_dirty+0x6b/0x6c0 [zfs] bpobj_iterate_impl+0xbe6/0x1410 [zfs] thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock: bpobj_space+0x63/0x470 [zfs] dsl_scan_active+0x340/0x3d0 [zfs] txg_sync_thread+0x3f2/0x1370 [zfs] thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock: txg_kick+0x61/0x420 [zfs] dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs] This patch is originally from Brian Behlendorf and slightly simplified by me. It breaks this cycle in thread 4 by moving the call from dsl_pool_need_dirty_delay to txg_kick outside the section controlled by dp->dp_lock. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Jeff Dike <[email protected]> Closes #9094
*	List log_spacemap feature in zpool-features.5 manual	Serapheim Dimitropoulos	2019-07-31	1	-0/+22
\| \| \| \| \| \| \| \| \|	Update zpool-features.5 manpage to describe the log_spacemap feature. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9096
*	Add channel program for property based snapshots	Clint Armstrong	2019-07-30	4	-2/+79
\| \| \| \| \| \| \| \| \| \| \| \|	Channel programs that many users find useful should be included with zfs in the /contrib directory. This is the first of these contributions. A channel program to recursively take snapshots of datasets with the property com.sun:auto-snapshot=true. Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Clint Armstrong <[email protected]> Closes #8443 Closes #9050
*	9072 handle error of zap_cursor_retrieve() for log spacemap zap	Serapheim Dimitropoulos	2019-07-30	1	-2/+28
\| \| \| \| \| \| \| \| \| \| \| \|	In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve() to return errors other than the expected ENOENT (which is returned when we are at the end of the zap). Ensure that these error cases are handled correctly by the import path. Reviewed by: Brian Behlendorf <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9074
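The error handling described above can be illustrated with a small sketch that treats ENOENT as the normal end of iteration and propagates every other error to the caller. The cursor type and its functions here are hypothetical and only mimic the general shape of a retrieve/advance loop; they are not the ZAP API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical cursor: each slot in `codes` is the result of retrieving
 * that entry (0 = entry present, ENOENT = end, anything else = failure).
 */
struct cursor {
	const int *codes;
	size_t pos;
};

static int
cursor_retrieve(const struct cursor *c)
{
	return (c->codes[c->pos]);
}

static void
cursor_advance(struct cursor *c)
{
	c->pos++;
}

/*
 * Walk the cursor until retrieval stops succeeding. ENOENT means the
 * iteration finished normally; any other error is returned unchanged so
 * the caller can fail the load instead of silently using partial data.
 */
static int
load_all(struct cursor *c, size_t *loaded)
{
	int err;

	assert(c != NULL && loaded != NULL);

	*loaded = 0;
	while ((err = cursor_retrieve(c)) == 0) {
		(*loaded)++;
		cursor_advance(c);
	}

	return (err == ENOENT ? 0 : err);
}
```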
*	mismerged log spacemap comment for metaslab_verify_weight_and_frag	Serapheim Dimitropoulos	2019-07-30	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the log spacemap commit was merged in ZoL, the metaslab_verify_unflushed_changes() debugging function was deleted as the feature was pretty much stable by then. Unfortunately though there was a reference to it from a comment in metaslab_verify_weight_and_frag(). This patch deletes the reference and pastes that comment as is. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9097
*	install path fixes	Michael Niewöhner	2019-07-30	5	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* rpm: correct pkgconfig path pkgconfig files get installed to $datarootdir/pkgconfig but rpm expects them to be at $datadir. This works when $datarootdir==$datadir which is the case most of the time but will fail when they differ. * install: make initramfs-tools path static Since initramfs-tools' path is nothing we can control as it is an external package it does not make any sense to install zfs additions anywhere else. Simply use /usr/share/initramfs-tools as path. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #9087
*	Increase default zcmd allocation to 256K	Michael Niewöhner	2019-07-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When creating hundreds of clones (for example using containers with LXD) cloning slows down as the number of clones increases over time. The reason for this is that the fetching of the clone information using a small zcmd buffer requires two ioctl calls, one to determine the size and a second to return the data. However, this requires gathering the data twice, once to determine the size and again to populate the zcmd buffer to return it to userspace. These are expensive ioctl() calls, so instead, make the default buffer size much larger: 256K. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #9084
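The cost described above comes from a probe-then-retry pattern: a call with too small a buffer fails but reports the required size, and the caller has to repeat the whole call. A rough sketch of that pattern follows, assuming a hypothetical `fake_ioctl()` that stands in for the real ioctl and always produces 100K of data; the 16K starting size in the usage note is an arbitrary illustration, not the previous ZFS default.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Counts how often the (expensive) producer is invoked. */
static int fake_ioctl_calls;

/*
 * Hypothetical stand-in for an ioctl that fills a caller-supplied buffer.
 * It always reports the size it needs, and fails with ENOMEM when the
 * buffer is too small, so the caller must call again with a larger one.
 */
static int
fake_ioctl(char *buf, size_t buflen, size_t *needed)
{
	const size_t payload = 100 * 1024; /* pretend 100K of clone data */

	fake_ioctl_calls++;
	*needed = payload;
	if (buflen < payload)
		return (ENOMEM);
	memset(buf, 'x', payload);
	return (0);
}

/*
 * Fetch the data starting from `initial` bytes of buffer, growing and
 * retrying on ENOMEM. Returns a malloc'd buffer (caller frees) or NULL.
 */
static char *
fetch(size_t initial, size_t *out_len)
{
	size_t len = initial, needed = 0;
	char *buf;

	assert(out_len != NULL);

	buf = malloc(len);
	if (buf == NULL)
		return (NULL);

	while (fake_ioctl(buf, len, &needed) == ENOMEM) {
		char *bigger = realloc(buf, needed);
		if (bigger == NULL) {
			free(buf);
			return (NULL);
		}
		buf = bigger;
		len = needed;
	}

	*out_len = needed;
	return (buf);
}
```

With a starting size smaller than the payload, for example `fetch(16 * 1024, &n)`, the producer runs twice; with `fetch(256 * 1024, &n)` it runs once, which is the saving the larger default is after.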
*	Improve performance by using dmu_tx_hold_*_by_dnode()	Matthew Ahrens	2019-07-30	3	-9/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode() instead of dmu_tx_hold_*(), since we already have a dbuf from the target dnode in hand. This eliminates some calls to dnode_hold(), which can be expensive. This is especially impactful if several threads are accessing objects that are in the same block of dnodes, because they will contend for that dbuf's lock. We are seeing 10-20% performance wins for the sequential_writes tests in the performance test suite, when doing >=128K writes to files with recordsize=8K. This also removes some unnecessary casts that are in the area. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #9081
*	Revert "Develop tests for issues #5866 and #8858"	Brian Behlendorf	2019-07-29	11	-158/+2
\| \| \| \| \| \| \| \| \| \|	This reverts commit 693c1fc478cc8118dd0168c4815c0ae3be41c9c3. This change resulted in a kmem leak being observed in existing code which needs to be identified and addressed. Reviewed-by: Paul Zuchowski <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #8978 Closes #9090