aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs
Commit message (Collapse)AuthorAgeFilesLines
* Improve logging of 128KB writesAlexander Motin2019-11-111-7/+13
| | | | | | | | | | | | | | | | | | | | | | Before my ZIL space optimization few years ago 128KB writes were logged as two 64KB+ records in two 128KB log blocks. After that change it became ~127KB+/1KB+ in two 128KB log blocks to free space in the second block for another record. Unfortunately in case of 128KB only writes, when space in the second block remained unused, that change increased write latency by unbalancing checksum computation and write times between parallel threads. It also didn't help with SLOG space efficiency in that case. This change introduces new 68KB log block size, used for both writes below 67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB block to not increase processing overhead. Writes above 131KB are still using full 128KB blocks, since possible saving there is small. Mixed loads will likely also fall back to previous 128KB, since code uses maximum of the last 16 requested block sizes. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #9409
* Move platform specific parts of zfs_znode.h to platform codeMatthew Macy2019-11-061-0/+15
| | | | | | | | | Some of the znode fields are different and functions consuming an inode don't exist on FreeBSD. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9536
* Enable use of DTRACE_PROBE* macros in "spl" modulePrakash Surya2019-11-0113-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | This change modifies some of the infrastructure for enabling the use of the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module. Currently, when the DTRACE_PROBE* macros are used, they get expanded to create new functions, and these dynamically generated functions become part of the "zfs" module. Since the "spl" module does not depend on the "zfs" module, the use of DTRACE_PROBE* in the "spl" module would result in undefined symbols being used in the "spl" module. Specifically, DTRACE_PROBE* would turn into a function call, and the function being called would be a symbol only contained in the "zfs" module; which results in a linker and/or runtime error. Thus, this change adds the necessary logic to the "spl" module, to mirror the tracing functionality available to the "zfs" module. After this change, we'll have a "trace_zfs.h" header file which defines the probes available only to the "zfs" module, and a "trace_spl.h" header file which defines the probes available only to the "spl" module. Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #9525
* Prefix struct rangelockMatthew Macy2019-11-011-45/+51
| | | | | | | | | | A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. This change is a follow up to 2cc479d0. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9534
* Include prototypes for vdev_initializeMatthew Macy2019-10-311-1/+2
| | | | | | | | Address two prototype related warnings emitted by clang. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9535
* Factor Linux specific code out of spa_misc.cMatthew Macy2019-10-311-76/+11
| | | | | | | | | | Move these Linux module parameter get/set helpers in to platform specific code. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9457
* Fix 'zfs change-key' with unencrypted childTom Caputi2019-10-301-2/+6
| | | | | | | | | | | | | | Currently, when you call 'zfs change-key' on an encrypted dataset that has an unencrypted child, the code will trigger a VERIFY. This VERIFY is leftover from before we allowed unencrypted datasets to exist underneath encrypted ones. This patch fixes the issue by simply replacing the VERIFY with an early return when recursing through datasets. Reviewed by: Jason King <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9524
* Minor spa portability fixesMatthew Macy2019-10-281-3/+3
| | | | | | | | | | | | - FreeBSD's rootpool import code uses spa_config_parse - Move the zvol_create_minors call out from under the spa_namespace_lock in spa_import. It isn't needed and it causes a lock order reversal on FreeBSD. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9499
* Fix for ARC sysctls ignored at runtimeloli10K2019-10-262-32/+37
| | | | | | | | | | | | | | This change leverage module_param_call() to run arc_tuning_update() immediately after the ARC tunable has been updated as suggested in cffa8372 code review. A simple test case is added to the ZFS Test Suite to prevent future regressions in functionality. Reviewed-by: Matt Macy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9487 Closes #9489
* Remove non-portable pointer is valid assertMatthew Macy2019-10-251-1/+0
| | | | | | | | | | This assert makes non portable assumptions about the state of memory returned by the memory allocator. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9506
* Move final zvol_remove_minors to common codeMatthew Macy2019-10-251-0/+10
| | | | | | | | | | This logic is not platform dependent and should reside in the common code. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9505
* Remove sdt.hMatthew Macy2019-10-251-1/+0
| | | | | | | | | It's mostly a noop on ZoL and it conflicts with platforms that support dtrace. Remove this header to resolve the conflict. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9497
* Linux 4.14, 4.19, 5.0+ compat: SIMD save/restoreBrian Behlendorf2019-10-243-27/+15
| | | | | | | | | | | | | | | | | | | | Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #9346 Closes #9403
* Don't call arc_buf_destroy on unallocated arc_bufchrisrd2019-10-231-4/+5
| | | | | | | | | | Fixes an obvious issue of calling arc_buf_destroy() on an unallocated arc_buf. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #9453
* OpenZFS restructuring - ARC memory pressureMatthew Macy2019-10-181-438/+22
| | | | | | | | | | | Factor Linux specific memory pressure handling out of ARC. Each platform will have different available interfaces for managing memory pressure. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9472
* Make zfsdev_getminor signature cross platformMatthew Macy2019-10-161-6/+1
| | | | | | | | Only pass the file descriptor to make zfsdev_get_miror() portable. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9466
* Move linux specific mmp module_param_call handler to platform codeMatthew Macy2019-10-161-28/+0
| | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9465
* Fix signature for private functions without header declarationsMatthew Macy2019-10-162-2/+2
| | | | | | | | Clang will complain if a function has no prior declaration Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9467
* Remove dead code and cleanup scoping in dmu_send.cMatthew Macy2019-10-131-16/+5
| | | | | | | | | | | | | This addresses a number of problems with dmu_send.c: * bp_span is unused which makes clang complain * dump_write conflicts with FreeBSD's existing core dump code * range_alloc is private to the file and not declared in any headers causing clang to complain Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9432
* Don't call sizeof on voidPaul Dagnelie2019-10-131-1/+1
| | | | | | | | | We get the sizeof the appropriate type, and don't cast away const. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9455
* Move zfs_onexit_fd_hold to platform codeMatthew Macy2019-10-131-33/+0
| | | | | | | | FreeBSD has a very different implementation. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9442
* Move zio_delay_interrupt to platform codeMatthew Macy2019-10-131-71/+0
| | | | | | | | FreeBSD has its own implementation as do other platforms. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9439
* Typo fix in comment: dso_dryrunchrisrd2019-10-111-1/+1
| | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #9452
* Function name and comment updatesPaul Dagnelie2019-10-113-8/+27
| | | | | | | | | | | Rename certain functions for more consistency when they share common features. Make comments clearer about what arguments should be passed to the insert and add functions. Reviewed by: Sara Hartse <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9441
* Expose dmu_buf_hold_array_by_dnode to platform codeMatthew Macy2019-10-111-1/+1
| | | | | | | | FreeBSD uses this in its pager ops routines Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9431
* Fix pool creation with feature@allocation_classes disabledloli10K2019-10-101-0/+10
| | | | | | | | | | | When "feature@allocation_classes" is not enabled on the pool no vdev with "special" or "dedup" allocation type should be allowed to exist in the vdev tree. Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9427 Closes #9429
* Move get_temporary_prop to platform codeMatthew Macy2019-10-101-81/+7
| | | | | | | | | | Temporary property handling at the VFS layer requires platform specific code. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9401
* Add kmem cache accessorsMatthew Macy2019-10-101-4/+5
| | | | | | | | | | | | Make the metaslab platform agnostic again by adding accessor functions which can be implemented by each platform. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9404
* Make `zil_async_to_sync` visible to platform codeMatthew Macy2019-10-101-3/+1
| | | | | | | | | FreeBSD's zvol platform code requires access to the zil_async_to_sync() function. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9440
* Fix strdup conflict on other platformsMatthew Macy2019-10-109-35/+35
| | | | | | | | | | | | | | | | In the FreeBSD kernel the strdup signature is: ``` char *strdup(const char *__restrict, struct malloc_type *); ``` It's unfortunate that the developers have chosen to change the signature of libc functions - but it's what I have to deal with. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9433
* Make clang happy with vdev_raidz_ codeMatthew Macy2019-10-102-0/+13
| | | | | | | | | | | | | | | | | | The macros are used to generate code for conditions without a corresponding branch. This is not a problem in practice, but clang has no way of knowing that. Add a default branch with a VERIFY(0) to indicate that it "can't happen" ``` In file included from \ /usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_sse2.c:607: /usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_impl.h:281:3: \ error: no case matching constant switch condition '3' [-Werror] ``` Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9434
* Reduce loaded range tree memory usagePaul Dagnelie2019-10-0932-545/+3137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements a new tree structure for ZFS, and uses it to store range trees more efficiently. The new structure is approximately a B-tree, though there are some small differences from the usual characterizations. The tree has core nodes and leaf nodes; each contain data elements, which the elements in the core nodes acting as separators between its children. The difference between core and leaf nodes is that the core nodes have an array of children, while leaf nodes don't. Every node in the tree may be only partially full; in most cases, they are all at least 50% full (in terms of element count) except for the root node, which can be less full. Underfull nodes will steal from their neighbors or merge to remain full enough, while overfull nodes will split in two. The data elements are contained in tree-controlled buffers; they are copied into these on insertion, and overwritten on deletion. This means that the elements are not independently allocated, which reduces overhead, but also means they can't be shared between trees (and also that pointers to them are only valid until a side-effectful tree operation occurs). The overhead varies based on how dense the tree is, but is usually on the order of about 50% of the element size; the per-node overheads are very small, and so don't make a significant difference. The trees can accept arbitrary records; they accept a size and a comparator to allow them to be used for a variety of purposes. The new trees replace the AVL trees used in the range trees today. Currently, the range_seg_t structure contains three 8 byte integers of payload and two 24 byte avl_tree_node_ts to handle its storage in both an offset-sorted tree and a size-sorted tree (total size: 64 bytes). In the new model, the range seg structures are usually two 4 byte integers, but a separate one needs to exist for the size-sorted and offset-sorted tree. Between the raw size, the 50% overhead, and the double storage, the new btrees are expected to use 8*1.5*2 = 24 bytes per record, or 33.3% as much memory as the AVL trees (this is for the purposes of storing metaslab range trees; for other purposes, like scrubs, they use ~50% as much memory). We reduced the size of the payload in the range segments by teaching range trees about starting offsets and shifts; since metaslabs have a fixed starting offset, and they all operate in terms of disk sectors, we can store the ranges using 4-byte integers as long as the size of the metaslab divided by the sector size is less than 2^32. For 512-byte sectors, this is a 2^41 (or 2TB) metaslab, which with the default settings corresponds to a 256PB disk. 4k sector disks can handle metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not anticipate disks of this size in the near future, there should be almost no cases where metaslabs need 64-byte integers to store their ranges. We do still have the capability to store 64-byte integer ranges to account for cases where we are storing per-vdev (or per-dnode) trees, which could reasonably go above the limits discussed. We also do not store fill information in the compact version of the node, since it is only used for sorted scrub. We also optimized the metaslab loading process in various other ways to offset some inefficiencies in the btree model. While individual operations (find, insert, remove_from) are faster for the btree than they are for the avl tree, remove usually requires a find operation, while in the AVL tree model the element itself suffices. Some clever changes actually caused an overall speedup in metaslab loading; we use approximately 40% less cpu to load metaslabs in our tests on Illumos. Another memory and performance optimization was achieved by changing what is stored in the size-sorted trees. When a disk is heavily fragmented, the df algorithm used by default in ZFS will almost always find a number of small regions in its initial cursor-based search; it will usually only fall back to the size-sorted tree to find larger regions. If we increase the size of the cursor-based search slightly, and don't store segments that are smaller than a tunable size floor in the size-sorted tree, we can further cut memory usage down to below 20% of what the AVL trees store. This also results in further reductions in CPU time spent loading metaslabs. The 16KiB size floor was chosen because it results in substantial memory usage reduction while not usually resulting in situations where we can't find an appropriate chunk with the cursor and are forced to use an oversized chunk from the size-sorted tree. In addition, even if we do have to use an oversized chunk from the size-sorted tree, the chunk would be too small to use for ZIL allocations, so it isn't as big of a loss as it might otherwise be. And often, more small allocations will follow the initial one, and the cursor search will now find the remainder of the chunk we didn't use all of and use it for subsequent allocations. Practical testing has shown little or no change in fragmentation as a result of this change. If the size-sorted tree becomes empty while the offset sorted one still has entries, it will load all the entries from the offset sorted tree and disregard the size floor until it is unloaded again. This operation occurs rarely with the default setting, only on incredibly thoroughly fragmented pools. There are some other small changes to zdb to teach it to handle btrees, but nothing major. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed by: Sebastien Roy [email protected] Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9181
* Rename rangelock_ functions to zfs_rangelock_Matthew Macy2019-10-032-39/+41
| | | | | | | | | A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9402
* dbuf_hold_impl() cleanup to improve cached read performanceTony Nguyen2019-10-031-154/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently every dbuf_hold_impl() incurs kmem_alloc() and kmem_free() which can be costly for cached read performance. This change reverts the dbuf_hold_impl() fix stack commit, i.e. fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61 to eliminate the extra kmem_alloc() and kmem_free() operations and improve cached read performance. With the change, each dbuf_hold_impl() frame uses 40 bytes more, total of 800 for 20 recursive levels. Linux kernel stack sizes are 8K and 16K for 32bit and 64bit, respectively, so stack overrun risk is limited. Sample stack output comparisons with 50 PB file and recordsize=512 Current code 11) 2240 64 arc_alloc_buf+0x4a/0xd0 [zfs] 12) 2176 264 dbuf_read_impl.constprop.16+0x2e3/0x7f0 [zfs] 13) 1912 120 dbuf_read+0xe5/0x520 [zfs] 14) 1792 56 dbuf_hold_impl_arg+0x572/0x630 [zfs] 15) 1736 64 dbuf_hold_impl_arg+0x508/0x630 [zfs] 16) 1672 64 dbuf_hold_impl_arg+0x508/0x630 [zfs] 17) 1608 40 dbuf_hold_impl+0x23/0x40 [zfs] 18) 1568 40 dbuf_hold_level+0x32/0x60 [zfs] 19) 1528 16 dbuf_hold+0x16/0x20 [zfs] dbuf_hold_impl() cleanup 11) 2320 64 arc_alloc_buf+0x4a/0xd0 [zfs] 12) 2256 264 dbuf_read_impl.constprop.17+0x2e3/0x7f0 [zfs] 13) 1992 120 dbuf_read+0xe5/0x520 [zfs] 14) 1872 96 dbuf_hold_impl+0x50f/0x5e0 [zfs] 15) 1776 104 dbuf_hold_impl+0x4df/0x5e0 [zfs] 16) 1672 104 dbuf_hold_impl+0x4df/0x5e0 [zfs] 17) 1568 40 dbuf_hold_level+0x32/0x60 [zfs] 18) 1528 16 dbuf_hold+0x16/0x20 [zfs] Performance observations on 8K recordsize filesystem: - 8/128/1024K at 1-128 sequential cached read, ~3% improvement Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Closes #9351
* Add inode accessors to common codeMatthew Macy2019-10-024-14/+14
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9389
* OpenZFS restructuring - arc_statsMatthew Macy2019-10-011-281/+6
| | | | | | | | Make arc_stats visible to platform code. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9386
* Enable clang to use intrinsics for lz4Matthew Macy2019-10-011-3/+3
| | | | | | | | Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9385
* Timeout waiting for ZVOL device to be createdPrakash Surya2019-10-011-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've seen cases where after creating a ZVOL, the ZVOL device node in "/dev" isn't generated after 20 seconds of waiting, which is the point at which our applications gives up on waiting and reports an error. The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the same time, based on a policy set by the user. This refresh operation will destroy the ZVOL, and re-create it based on a snapshot. When this occurs, we see many hundreds of entries on the "z_zvol" taskq (based on inspection of the /proc/spl/taskq-all file). Many of the entries on the taskq end up in the "zvol_remove_minors_impl" function, and I've measured the latency of that function: Function = zvol_remove_minors_impl msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 1 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 1 | | 128 -> 255 : 45 |****************************************| 256 -> 511 : 5 |**** | That data is from a 10 second sample, using the BCC "funclatency" tool. As we can see, in this 10 second sample, most calls took 128ms at a minimum. Thus, some basic math tells us that in any 20 second interval, we could only process at most about 150 removals, which is much less than the 400+ that'll occur based on the workload. As a result of this, and since all ZVOL minor operations will go through the single threaded "z_zvol" taskq, the latency for creating a single ZVOL device can be unreasonably large due to other ZVOL activity on the system. In our case, it's large enough to cause the application to generate an error and fail the operation. When profiling the "zvol_remove_minors_impl" function, I saw that most of the time in the function was spent off-cpu, blocked in the function "taskq_wait_outstanding". How this works, is "zvol_remove_minors_impl" will dispatch calls to "zvol_free" using the "system_taskq", and then the "taskq_wait_outstanding" function is used to wait for all of those dispatched calls to occur before "zvol_remove_minors_impl" will return. As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have to wait for all calls to "zvol_free" to occur before it returns. Thus, this change removes the call to "taskq_wait_oustanding", so that calls to "zvol_free" don't affect the latency of "zvol_remove_minors_impl". Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #9380
* OpenZFS restructuring - zfs_ioctlMatthew Macy2019-09-271-329/+83
| | | | | | | | | Refactor the zfs ioctls in to platform dependent and independent bits. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9301
* OpenZFS restructuring - zvolMatthew Macy2019-09-253-1106/+59
| | | | | | | | | | Refactor the zvol in to platform dependent and independent bits. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9295
* diff_cb() does not handle large dnodesloli10K2019-09-241-2/+3
| | | | | | | | | | | | | | | | | | Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try to access its interior slots when dnodesize > sizeof(dnode_phys_t). This is normally not an issue because the interior slots are zero-filled, which report_dnode() handles calling report_free_dnode_range(). However this is not the case for encrypted large dnodes or filesystem using many SA based xattrs where the extra data past the legacy dnode size boundary is interpreted as a dnode_phys_t. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7678 Closes #8931 Closes #9343
* Disabled resilver_defer feature leads to looping resilversKody A Kantor2019-09-221-9/+11
| | | | | | | | | | | | | | | | | | When a disk is replaced with another on a pool with the resilver_defer feature present, but not enabled the resilver activity restarts during each spa_sync. This patch checks to make sure that the resilver_defer feature is first enabled before requesting a deferred resilver. This was originally fixed in illumos-joyent as OS-7982. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Signed-off-by: Kody A Kantor <[email protected]> External-issue: illumos-joyent OS-7982 Closes #9299 Closes #9338
* Fix dsl_scan_ds_clone_swapped logicAndriy Gapon2019-09-181-31/+69
| | | | | | | | | | | | | | | | The was incorrect with respect to swapping dataset IDs both in the on-disk ZAP object and the in-memory queue. In both cases, if ds1 was already present, then it would be first replaced with ds2 and then ds would be replaced back with ds1. Also, both cases did not properly handle a situation where both ds1 and ds2 are already queued. A duplicate insertion would be attempted and its failure would result in a panic. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #9140 Closes #9163
* Device removal of indirect vdev panics the kernelloli10K2019-09-161-0/+4
| | | | | | | | | | This commit fixes a NULL pointer dereference triggered in spa_vdev_remove_top_check() by trying to "zpool remove" an indirect vdev. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9327
* Prevent gcc -Werror=maybe-uninitialized warnings in spa_wait_common()loli10K2019-09-161-1/+1
| | | | | | | | | | | | | | | | | This commit fixes the following build failure detected on Debian9 (GCC 6.3.0): CC [M] module/zfs/spa.o module/zfs/spa.c: In function ‘spa_wait_common.part.31’: module/zfs/spa.c:9468:6: error: ‘in_progress’ may be used uninitialized in this function [-Werror=maybe-uninitialized] if (!in_progress || spa->spa_waiters_cancel || error) ^ cc1: all warnings being treated as errors Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9326
* Fix clone handling with encryption rootsTom Caputi2019-09-161-19/+50
| | | | | | | | | | | | | | | | Currently, spa_keystore_change_key_sync_impl() does not recurse into clones when updating encryption roots for either a call to 'zfs promote' or 'zfs change-key'. This can cause children of these clones to end up in a state where they point to the wrong dataset as the encryption root. It can also trigger ASSERTs in some cases where the code checks reference counts on wrapping keys. This patch fixes this issue by ensuring that this function properly recurses into clones during processing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9267 Closes #9294
* Add subcommand to wait for background zfs activity to completeJohn Gallagher2019-09-139-4/+395
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #9162
* Move objnode handling to common codeMatthew Macy2019-09-121-1/+69
| | | | | | | | objnode is OS agnostic and used only by dmu_redact.c. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9315
* Enable compiler to typecheck loggingMatthew Macy2019-09-129-24/+30
| | | | | | | | | | Annotate spa logging declarations with printflike Workaround gcc bug (non disable-able warning) by replacing "" with " " Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9316
* OpenZFS restructuring - move linux tracing code to platform directoriesMatthew Macy2019-09-1115-68/+13
| | | | | | | | | | | Move Linux specific tracing headers and source to platform directories and update the build system. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9290