openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	Implement ZPOOL_IMPORT_UDEV_TIMEOUT_MS	Richard Yao	2019-10-11	2	-1/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since 0.7.0, zpool import would unconditionally block on udev for 30 seconds. This introduced a regression in initramfs environments that lack udev (particularly mdev based environments), yet use a zfs userland tools intended for the system that had been built against udev. Gentoo's genkernel is the main example, although custom user initramfs environments would be similarly impacted unless special builds of the ZFS userland utilities were done for them. Such environments already have their own mechanisms for blocking until device nodes are ready (such as genkernel's scandelay parameter), so it is unnecessary for zpool import to block on a non-existent udev until a timeout is reached inside of them. Rather than trying to intelligently determine whether udev is available on the system to avoid unnecessarily blocking in such environments, it seems best to just allow the environment to override the timeout. I propose that we add an environment variable called ZPOOL_IMPORT_UDEV_TIMEOUT_MS. Setting it to 0 would restore the 0.6.x behavior that was more desirable in mdev based initramfs environments. This allows the system user land utilities to be reused when building mdev-based initramfs archives. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Georgy Yakovlev <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #9436
*	Clarify loop variable name in zfs copies test	Ryan Moeller	2019-10-11	1	-2/+2
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9445
*	Fix some style nits in tests	Ryan Moeller	2019-10-11	7	-17/+15
\| \| \| \| \| \| \| \|	Mostly whitespace changes, no functional changes intended. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9447
*	Fix pool creation with feature@allocation_classes disabled	loli10K	2019-10-10	4	-1/+44
\| \| \| \| \| \| \| \| \| \| \|	When "feature@allocation_classes" is not enabled on the pool no vdev with "special" or "dedup" allocation type should be allowed to exist in the vdev tree. Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9427 Closes #9429
*	Move get_temporary_prop to platform code	Matthew Macy	2019-10-10	3	-81/+84
\| \| \| \| \| \| \| \| \| \|	Temporary property handling at the VFS layer requires platform specific code. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9401
*	Add kmem cache accessors	Matthew Macy	2019-10-10	3	-4/+21
\| \| \| \| \| \| \| \| \| \| \| \|	Make the metaslab platform agnostic again by adding accessor functions which can be implemented by each platform. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9404
*	Make `zil_async_to_sync` visible to platform code	Matthew Macy	2019-10-10	2	-3/+2
\| \| \| \| \| \| \| \| \|	FreeBSD's zvol platform code requires access to the zil_async_to_sync() function. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9440
*	Update `zfs program` command usage	Brian Behlendorf	2019-10-10	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update the zfs(8) man page to clearly describe that arguments for channel programs are to be listed after the -- sentinel which terminates argument processing. This behavior is supported by getopt on Linux, FreeBSD, and Illumos according to each platforms respective man pages. zfs program [-jn] [-t instruction-limit] [-m memory-limit] pool script [--] arg1 ... Reviewed-by: Clint Armstrong <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9056 Closes #9428
*	Fix strdup conflict on other platforms	Matthew Macy	2019-10-10	21	-75/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the FreeBSD kernel the strdup signature is: ``` char strdup(const char __restrict, struct malloc_type *); ``` It's unfortunate that the developers have chosen to change the signature of libc functions - but it's what I have to deal with. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9433
*	Make clang happy with vdev_raidz_ code	Matthew Macy	2019-10-10	2	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The macros are used to generate code for conditions without a corresponding branch. This is not a problem in practice, but clang has no way of knowing that. Add a default branch with a VERIFY(0) to indicate that it "can't happen" ``` In file included from \ /usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_sse2.c:607: /usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_impl.h:281:3: \ error: no case matching constant switch condition '3' [-Werror] ``` Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9434
*	Sort list of generated files in configure.ac	Ryan Moeller	2019-10-09	1	-91/+91
\| \| \| \| \| \| \|	No functional change, just a quality of life improvement. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9426
*	Move platform independent tests to a shared runfile	Ryan Moeller	2019-10-09	6	-949/+1036
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Tests that aren't limited to running on Linux can be moved to a common runfile to be shared with other platforms. The test runner and wrapper script are enhanced to allow specifying multiple runfiles as a comma-separated list. The default runfiles are now "common.run,PLATFORM.run" where PLATFORM is determined at run time. Sections in runfiles that share a path with another runfile can append a colon separator and an identifier to the path in the section name, ie `[tests/functional/atime:Linux]`, to avoid overriding the tests specified by other runfiles. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9391
*	Reduce loaded range tree memory usage	Paul Dagnelie	2019-10-09	50	-638/+3722
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch implements a new tree structure for ZFS, and uses it to store range trees more efficiently. The new structure is approximately a B-tree, though there are some small differences from the usual characterizations. The tree has core nodes and leaf nodes; each contain data elements, which the elements in the core nodes acting as separators between its children. The difference between core and leaf nodes is that the core nodes have an array of children, while leaf nodes don't. Every node in the tree may be only partially full; in most cases, they are all at least 50% full (in terms of element count) except for the root node, which can be less full. Underfull nodes will steal from their neighbors or merge to remain full enough, while overfull nodes will split in two. The data elements are contained in tree-controlled buffers; they are copied into these on insertion, and overwritten on deletion. This means that the elements are not independently allocated, which reduces overhead, but also means they can't be shared between trees (and also that pointers to them are only valid until a side-effectful tree operation occurs). The overhead varies based on how dense the tree is, but is usually on the order of about 50% of the element size; the per-node overheads are very small, and so don't make a significant difference. The trees can accept arbitrary records; they accept a size and a comparator to allow them to be used for a variety of purposes. The new trees replace the AVL trees used in the range trees today. Currently, the range_seg_t structure contains three 8 byte integers of payload and two 24 byte avl_tree_node_ts to handle its storage in both an offset-sorted tree and a size-sorted tree (total size: 64 bytes). In the new model, the range seg structures are usually two 4 byte integers, but a separate one needs to exist for the size-sorted and offset-sorted tree. Between the raw size, the 50% overhead, and the double storage, the new btrees are expected to use 81.52 = 24 bytes per record, or 33.3% as much memory as the AVL trees (this is for the purposes of storing metaslab range trees; for other purposes, like scrubs, they use ~50% as much memory). We reduced the size of the payload in the range segments by teaching range trees about starting offsets and shifts; since metaslabs have a fixed starting offset, and they all operate in terms of disk sectors, we can store the ranges using 4-byte integers as long as the size of the metaslab divided by the sector size is less than 2^32. For 512-byte sectors, this is a 2^41 (or 2TB) metaslab, which with the default settings corresponds to a 256PB disk. 4k sector disks can handle metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not anticipate disks of this size in the near future, there should be almost no cases where metaslabs need 64-byte integers to store their ranges. We do still have the capability to store 64-byte integer ranges to account for cases where we are storing per-vdev (or per-dnode) trees, which could reasonably go above the limits discussed. We also do not store fill information in the compact version of the node, since it is only used for sorted scrub. We also optimized the metaslab loading process in various other ways to offset some inefficiencies in the btree model. While individual operations (find, insert, remove_from) are faster for the btree than they are for the avl tree, remove usually requires a find operation, while in the AVL tree model the element itself suffices. Some clever changes actually caused an overall speedup in metaslab loading; we use approximately 40% less cpu to load metaslabs in our tests on Illumos. Another memory and performance optimization was achieved by changing what is stored in the size-sorted trees. When a disk is heavily fragmented, the df algorithm used by default in ZFS will almost always find a number of small regions in its initial cursor-based search; it will usually only fall back to the size-sorted tree to find larger regions. If we increase the size of the cursor-based search slightly, and don't store segments that are smaller than a tunable size floor in the size-sorted tree, we can further cut memory usage down to below 20% of what the AVL trees store. This also results in further reductions in CPU time spent loading metaslabs. The 16KiB size floor was chosen because it results in substantial memory usage reduction while not usually resulting in situations where we can't find an appropriate chunk with the cursor and are forced to use an oversized chunk from the size-sorted tree. In addition, even if we do have to use an oversized chunk from the size-sorted tree, the chunk would be too small to use for ZIL allocations, so it isn't as big of a loss as it might otherwise be. And often, more small allocations will follow the initial one, and the cursor search will now find the remainder of the chunk we didn't use all of and use it for subsequent allocations. Practical testing has shown little or no change in fragmentation as a result of this change. If the size-sorted tree becomes empty while the offset sorted one still has entries, it will load all the entries from the offset sorted tree and disregard the size floor until it is unloaded again. This operation occurs rarely with the default setting, only on incredibly thoroughly fragmented pools. There are some other small changes to zdb to teach it to handle btrees, but nothing major. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed by: Sebastien Roy [email protected] Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9181
*	ZTS: Fix mmp_hostid test	Igor K	2019-10-08	1	-1/+5
\| \| \| \| \| \| \| \|	Correctly use the `mntpnt_fs` variable, and include additional logic to ensure the /etc/hostid is correct set up and cleaned up. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes #9349
*	module/Makefile.in: don't run xargs if empty	George Melikov	2019-10-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	If stdin if empty - don't run xargs command, otherwise we can get `cp: missing file operand` error. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9418
*	ZTS: Fix trim/trim_config and trim/autotrim_config	Brian Behlendorf	2019-10-04	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	There have been occasional CI failures which occur when the trimmed vdev size exactly matches the target size. Resolve this by slightly relaxing the conditional and checking for -ge rather than -gt. In all of the cases observer, the values match exactly. For example: Failure /mnt/trim-vdev1 is 768 MB which is not -gt than 768 MB Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9399
*	Fix automount for root filesystems	Brian Behlendorf	2019-10-04	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	Commit 093bb64 resolved an automount failures for chroot'd processes but inadvertently broke automounting for root filesystems where the vfs_mntpoint is NULL. Resolve the issue by checking for NULL in order to generate the correct path. Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9381 Closes #9384
*	Rename rangelock_ functions to zfs_rangelock_	Matthew Macy	2019-10-03	6	-82/+84
\| \| \| \| \| \| \| \| \|	A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9402
*	dbuf_hold_impl() cleanup to improve cached read performance	Tony Nguyen	2019-10-03	1	-154/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently every dbuf_hold_impl() incurs kmem_alloc() and kmem_free() which can be costly for cached read performance. This change reverts the dbuf_hold_impl() fix stack commit, i.e. fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61 to eliminate the extra kmem_alloc() and kmem_free() operations and improve cached read performance. With the change, each dbuf_hold_impl() frame uses 40 bytes more, total of 800 for 20 recursive levels. Linux kernel stack sizes are 8K and 16K for 32bit and 64bit, respectively, so stack overrun risk is limited. Sample stack output comparisons with 50 PB file and recordsize=512 Current code 11) 2240 64 arc_alloc_buf+0x4a/0xd0 [zfs] 12) 2176 264 dbuf_read_impl.constprop.16+0x2e3/0x7f0 [zfs] 13) 1912 120 dbuf_read+0xe5/0x520 [zfs] 14) 1792 56 dbuf_hold_impl_arg+0x572/0x630 [zfs] 15) 1736 64 dbuf_hold_impl_arg+0x508/0x630 [zfs] 16) 1672 64 dbuf_hold_impl_arg+0x508/0x630 [zfs] 17) 1608 40 dbuf_hold_impl+0x23/0x40 [zfs] 18) 1568 40 dbuf_hold_level+0x32/0x60 [zfs] 19) 1528 16 dbuf_hold+0x16/0x20 [zfs] dbuf_hold_impl() cleanup 11) 2320 64 arc_alloc_buf+0x4a/0xd0 [zfs] 12) 2256 264 dbuf_read_impl.constprop.17+0x2e3/0x7f0 [zfs] 13) 1992 120 dbuf_read+0xe5/0x520 [zfs] 14) 1872 96 dbuf_hold_impl+0x50f/0x5e0 [zfs] 15) 1776 104 dbuf_hold_impl+0x4df/0x5e0 [zfs] 16) 1672 104 dbuf_hold_impl+0x4df/0x5e0 [zfs] 17) 1568 40 dbuf_hold_level+0x32/0x60 [zfs] 18) 1528 16 dbuf_hold+0x16/0x20 [zfs] Performance observations on 8K recordsize filesystem: - 8/128/1024K at 1-128 sequential cached read, ~3% improvement Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Closes #9351
*	OpenZFS restructuring - libzfs	Matthew Macy	2019-10-03	11	-797/+1026
\| \| \| \| \| \| \| \| \|	Factor Linux specific functionality out of libzfs. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Closes #9377
*	OpenZFS restructuring - libzutil	Matthew Macy	2019-10-03	8	-1337/+1508
\| \| \| \| \| \| \| \|	Factor Linux specific functionality out of libzutil. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9356
*	ZTS: Fix upgrade_readonly_pool	Brian Behlendorf	2019-10-03	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update cleanup_upgrade to use destroy_dataset and destroy_pool when performing cleanup. These wrappers retry if the pool is busy preventing occasional failures like those observed when running tests upgrade_readonly_pool. For example: SUCCESS: test enabled == enabled User accounting upgrade is not executed on readonly pool NOTE: Performing local cleanup via log_onexit (cleanup_upgrade) cannot destroy 'testpool': pool is busy ERROR: zpool destroy testpool exited 1 Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9400
*	Workaround to avoid a race when /var/lib is a persistent dataset	Didier Roche	2019-10-02	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If /var/lib is a dataset not under <pool>/ROOT/<root_dataset>, as proposed in the ubuntu root on zfs upstream guide (https://github.com/zfsonlinux/zfs/wiki/Ubuntu-18.04-Root-on-ZFS), we end up with a race where some services, like systemd-random-seed are writing under /var/lib, while zfs-mount is called. zfs mount will then potentially fail because of /var/lib isn't empty and so, can't be mounted. Order those 2 units for now (more may be needed) as we can't declare virtually a provide mount point to match "RequiresMountsFor=/var/lib/systemd/random-seed" from systemd-random-seed.service. The optional generator for zfs 0.8 fixes it, but it's not enabled by default nor necessarily required. Example: - rpool/ROOT/ubuntu (mountpoint = /) - rpool/var/ (mountpoint = /var) - rpool/var/lib (mountpoint = /var/lib) Both zfs-mount.service and systemd-random-seed.service are starting After=systemd-remount-fs.service. zfs-mount.service should be done before local-fs.target while systemd-random-seed.service should finish before sysinit.target (which is a later target). Ideally, we would have a way for zfs mount -a unit to declare all paths or move systemd-random-seed after local-fs.target. Reviewed-by: Antonio Russo <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Didier Roche <[email protected]> Closes #9360
*	OpenZFS restructuring - libspl	Matthew Macy	2019-10-02	65	-443/+127
\| \| \| \| \| \| \| \| \|	Factor Linux specific pieces out of libspl. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9336
*	Add inode accessors to common code	Matthew Macy	2019-10-02	5	-15/+23
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9389
*	OpenZFS restructuring - arc_stats	Matthew Macy	2019-10-01	2	-281/+282
\| \| \| \| \| \| \| \|	Make arc_stats visible to platform code. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9386
*	Enable clang to use intrinsics for lz4	Matthew Macy	2019-10-01	1	-3/+3
\| \| \| \| \| \| \| \|	Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9385
*	Fix for zfs-dracut regression	dacianstremtan	2019-10-01	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Line 31 and 32 overwrote the ${root} variable which broke mount-zfs.sh We have create a new variable for the dataset instead of overwriting the ${root} variable in zfs-load-key.sh${root} variable in zfs-load-key.sh Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Garrett Fields <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Dacian Reece-Stremtan <[email protected]> Closes #8913 Closes #9379
*	Perform KABI checks in parallel	Brian Behlendorf	2019-10-01	118	-2127/+3245
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reduce the time required for ./configure to perform the needed KABI checks by allowing kbuild to compile multiple test cases in parallel. This was accomplished by splitting each test's source code from the logic handling whether that code could be compiled or not. By introducing this split it's possible to minimize the number of times kbuild needs to be invoked. As importantly, it means all of the tests can be built in parallel. This does require a little extra care since we expect some tests to fail, so the --keep-going (-k) option must be provided otherwise some tests may not get compiled. Furthermore, since a failure during the kbuild modpost phase will result in an early exit; the final linking phase is limited to tests which passed the initial compilation and produced an object file. Once everything has been built the configure script proceeds as previously. The only significant difference is that it now merely needs to test for the existence of a .ko file to determine the result of a given test. This vastly speeds up the entire process. New test cases should use ZFS_LINUX_TEST_SRC to declare their test source code and ZFS_LINUX_TEST_RESULT to check the result. All of the existing kernel-*.m4 files have been updated accordingly, see config/kernel-current-time.m4 for a basic example. The legacy ZFS_LINUX_TRY_COMPILE macro has been kept to handle special cases but it's use is not encouraged. master (secs) patched (secs) ------------- ---------------- autogen.sh 61 68 configure 137 24 (~17% of current run time) make -j $(nproc) 44 44 make rpms 287 150 Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8547 Closes #9132 Closes #9341
*	Timeout waiting for ZVOL device to be created	Prakash Surya	2019-10-01	2	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We've seen cases where after creating a ZVOL, the ZVOL device node in "/dev" isn't generated after 20 seconds of waiting, which is the point at which our applications gives up on waiting and reports an error. The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the same time, based on a policy set by the user. This refresh operation will destroy the ZVOL, and re-create it based on a snapshot. When this occurs, we see many hundreds of entries on the "z_zvol" taskq (based on inspection of the /proc/spl/taskq-all file). Many of the entries on the taskq end up in the "zvol_remove_minors_impl" function, and I've measured the latency of that function: Function = zvol_remove_minors_impl msecs : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 1 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 1 \| \| 128 -> 255 : 45 \|**************************************\| 256 -> 511 : 5 \|** \| That data is from a 10 second sample, using the BCC "funclatency" tool. As we can see, in this 10 second sample, most calls took 128ms at a minimum. Thus, some basic math tells us that in any 20 second interval, we could only process at most about 150 removals, which is much less than the 400+ that'll occur based on the workload. As a result of this, and since all ZVOL minor operations will go through the single threaded "z_zvol" taskq, the latency for creating a single ZVOL device can be unreasonably large due to other ZVOL activity on the system. In our case, it's large enough to cause the application to generate an error and fail the operation. When profiling the "zvol_remove_minors_impl" function, I saw that most of the time in the function was spent off-cpu, blocked in the function "taskq_wait_outstanding". How this works, is "zvol_remove_minors_impl" will dispatch calls to "zvol_free" using the "system_taskq", and then the "taskq_wait_outstanding" function is used to wait for all of those dispatched calls to occur before "zvol_remove_minors_impl" will return. As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have to wait for all calls to "zvol_free" to occur before it returns. Thus, this change removes the call to "taskq_wait_oustanding", so that calls to "zvol_free" don't affect the latency of "zvol_remove_minors_impl". Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #9380
*	OpenZFS restructuring - zpool	Matthew Macy	2019-09-30	8	-352/+445
\| \| \| \| \| \| \| \| \| \| \|	Factor Linux specific functions out of the zpool command. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9333
*	OpenZFS restructuring - zfs_ioctl	Matthew Macy	2019-09-27	9	-335/+509
\| \| \| \| \| \| \| \| \|	Refactor the zfs ioctls in to platform dependent and independent bits. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9301
*	Adding slack notifier	Ben McGough	2019-09-26	2	-0/+86
\| \| \| \| \| \| \| \| \| \|	Allow ZED notification via slack incoming webhook. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Ben McGough <[email protected]> Closes #9076 Closes #9350
*	Fix encryption hierarchy issues with zfs recv -d	Tom Caputi	2019-09-25	5	-45/+107
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, the recv_fix_encryption_hierarchy() function accepts 'destsnap' as one of its parameters. Originally, this was intended to be the top-level dataset of a receive (whether or not the receive was recursive). Unfortunately, this parameter actually is simply the input that is passed in from the command line. When the user specifies 'zfs recv -d', this string is actually only the name of the receiving pool since the rest of the name is derived from the send stream. This causes the function to fail, leaving some datasets with an invalid encryption hierarchy. This patch resolves this problem by passing in the top_zfs variable instead. In order to make this work, this patch also includes some changes that ensure the value is always present when we need it. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9273 Closes #9309
*	ZTS: Fix typos in zfs_destroy tests cleanup	Ryan Moeller	2019-09-25	2	-2/+2
\| \| \| \| \| \| \| \| \| \|	lot_must -> log_must Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9362
*	ZTS: harden xattr/cleanup.ksh	Brian Behlendorf	2019-09-25	2	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the xattr/cleanup.ksh script is unable to remove the test group due to an active process then it will not call default_cleanup. This will result in a zvol_ENOSPC/setup failure when attempting to create the /mnt/testdir directory which will already exist. Resolve the issue by performing the default_cleanup before removing the test user and group to ensure this step always happens. Also allow one more retry to further minimize the likelihood of the cleanup failing. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9358
*	Add warning for zfs_vdev_elevator option removal	Brian Behlendorf	2019-09-25	2	-3/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibly code to set the elevator via a usermodehelper. While well intentioned this introduced a bug which could cause a system hang, that issue was subsequently fixed by commit 2a0d4188. In order to avoid future bugs in this area, and to simplify the code, this functionality is being deprecated. A console warning has been added to notify any existing consumers and the documentation updated accordingly. This option will remain for the lifetime of the 0.8.x series for compatibility but if planned to be phased out of master. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #8664 Closes #9317
*	OpenZFS restructuring - zvol	Matthew Macy	2019-09-25	9	-1112/+1297
\| \| \| \| \| \| \| \| \| \|	Refactor the zvol in to platform dependent and independent bits. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9295
*	diff_cb() does not handle large dnodes	loli10K	2019-09-24	2	-4/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try to access its interior slots when dnodesize > sizeof(dnode_phys_t). This is normally not an issue because the interior slots are zero-filled, which report_dnode() handles calling report_free_dnode_range(). However this is not the case for encrypted large dnodes or filesystem using many SA based xattrs where the extra data past the legacy dnode size boundary is interpreted as a dnode_phys_t. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7678 Closes #8931 Closes #9343
*	Use signed types to prevent subtraction overflow	Ryan Moeller	2019-09-22	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The difference between the sizes could be positive or negative. Leaving the types as unsigned means the result overflows when the difference is negative and removing the labs() means we'll have introduced a bug. The subtraction results in the correct value when the unsigned integer is interpreted as a signed integer by labs(). Clang doesn't see that we're doing a subtraction and abusing the types. It sees the result of the subtraction, an unsigned value, being passed to an absolute value function and emits a warning which we treat as an error. Reviewed by: Youzhong Yang <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9355
*	Disabled resilver_defer feature leads to looping resilvers	Kody A Kantor	2019-09-22	1	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a disk is replaced with another on a pool with the resilver_defer feature present, but not enabled the resilver activity restarts during each spa_sync. This patch checks to make sure that the resilver_defer feature is first enabled before requesting a deferred resilver. This was originally fixed in illumos-joyent as OS-7982. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Signed-off-by: Kody A Kantor <[email protected]> External-issue: illumos-joyent OS-7982 Closes #9299 Closes #9338
*	Refactor libzfs_error_init newlines	Ryan Moeller	2019-09-18	5	-9/+9
\| \| \| \| \| \| \| \| \|	Move the trailing newlines from the error message strings to the format strings to more closely match the other error messages. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9330
*	Fix dsl_scan_ds_clone_swapped logic	Andriy Gapon	2019-09-18	1	-31/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The was incorrect with respect to swapping dataset IDs both in the on-disk ZAP object and the in-memory queue. In both cases, if ds1 was already present, then it would be first replaced with ds2 and then ds would be replaced back with ds1. Also, both cases did not properly handle a situation where both ds1 and ds2 are already queued. A duplicate insertion would be attempted and its failure would result in a panic. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #9140 Closes #9163
*	Device removal of indirect vdev panics the kernel	loli10K	2019-09-16	4	-2/+65
\| \| \| \| \| \| \| \| \| \|	This commit fixes a NULL pointer dereference triggered in spa_vdev_remove_top_check() by trying to "zpool remove" an indirect vdev. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9327
*	Prevent gcc -Werror=maybe-uninitialized warnings in spa_wait_common()	loli10K	2019-09-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit fixes the following build failure detected on Debian9 (GCC 6.3.0): CC [M] module/zfs/spa.o module/zfs/spa.c: In function ‘spa_wait_common.part.31’: module/zfs/spa.c:9468:6: error: ‘in_progress’ may be used uninitialized in this function [-Werror=maybe-uninitialized] if (!in_progress \|\| spa->spa_waiters_cancel \|\| error) ^ cc1: all warnings being treated as errors Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9326
*	ZTS: Fix /usr/bin/env: 'python2': No such file or directory	loli10K	2019-09-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	Since 4f342e45 env(1) must be able to find a "python2" executable in the "constrained path" on systems configured with --with-python=2.x otherwise the ZFS Test Suite won't be able to use Python scripts. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9325
*	Fix clone handling with encryption roots	Tom Caputi	2019-09-16	5	-24/+149
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, spa_keystore_change_key_sync_impl() does not recurse into clones when updating encryption roots for either a call to 'zfs promote' or 'zfs change-key'. This can cause children of these clones to end up in a state where they point to the wrong dataset as the encryption root. It can also trigger ASSERTs in some cases where the code checks reference counts on wrapping keys. This patch fixes this issue by ensuring that this function properly recurses into clones during processing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9267 Closes #9294
*	Scrubbing root pools may deadlock on kernels without elevator_change() (#9321)	loli10K	2019-09-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibly code to set the elevator via a usermodehelper. Unfortunately changing the evelator via usermodehelper requires reading some userland binaries, most notably modprobe(8) or sh(1), from a zfs dataset on systems with root-on-zfs. This can deadlock the system if used during the following call path because it may need, if the data is not already cached in the ARC, reading directly from disk while holding the spa config lock as a writer: zfs_ioc_pool_scan() -> spa_scan() -> spa_scan() -> vdev_reopen() -> vdev_elevator_switch() -> call_usermodehelper() While the usermodehelper waits sh(1), modprobe(8) is blocked in the ZIO pipeline trying to read from disk: INFO: task modprobe:2650 blocked for more than 10 seconds. Tainted: P OE 5.2.14 modprobe D 0 2650 206 0x00000000 Call Trace: ? __schedule+0x244/0x5f0 schedule+0x2f/0xa0 cv_wait_common+0x156/0x290 [spl] ? do_wait_intr_irq+0xb0/0xb0 spa_config_enter+0x13b/0x1e0 [zfs] zio_vdev_io_start+0x51d/0x590 [zfs] ? tsd_get_by_thread+0x3b/0x80 [spl] zio_nowait+0x142/0x2f0 [zfs] arc_read+0xb2d/0x19d0 [zfs] ... zpl_iter_read+0xfa/0x170 [zfs] new_sync_read+0x124/0x1b0 vfs_read+0x91/0x140 ksys_read+0x59/0xd0 do_syscall_64+0x4f/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This commit changes how we use the usermodehelper functionality from synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents scrubs, and other vdev_elevator_switch() consumers, from triggering the aforementioned issue. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Issue #8664 Closes #9321
*	Add subcommand to wait for background zfs activity to complete	John Gallagher	2019-09-13	61	-144/+2662
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #9162
*	QAT related bug fixes	Chengfei ZHu	2019-09-12	4	-32/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Fix issue: Kernel BUG with QAT during decompression #9276. Now it is uninterruptible for a specific given QAT request, but Ctrl-C interrupt still works in user-space process. 2. Copy the digest result to the buffer only when doing encryption, and vise-versa for decryption. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chengfei Zhu <[email protected]> Closes #9276 Closes #9303