| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.
A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.
Reviewed-by: Matt Macy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9487
Closes #9489
|
|
|
|
|
|
|
|
|
|
| |
This assert makes non portable assumptions about the state of memory
returned by the memory allocator.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9506
|
|
|
|
|
|
|
|
|
|
| |
This logic is not platform dependent and should reside in the
common code.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9505
|
|
|
|
|
|
|
|
|
| |
It's mostly a noop on ZoL and it conflicts with platforms that
support dtrace. Remove this header to resolve the conflict.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9497
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.
This has the additional advantage of allowing us to use the FPU
again in user threads. So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.
Reviewed-by: Fabian Grünbichler <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #9346
Closes #9403
|
|
|
|
|
|
|
|
|
|
| |
Fixes an obvious issue of calling arc_buf_destroy() on an
unallocated arc_buf.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Closes #9453
|
|
|
|
|
|
|
|
|
|
|
| |
Factor Linux specific memory pressure handling out of ARC. Each
platform will have different available interfaces for managing memory
pressure.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9472
|
|
|
|
|
|
|
|
| |
Only pass the file descriptor to make zfsdev_get_miror() portable.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9466
|
|
|
|
|
|
| |
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9465
|
|
|
|
|
|
|
|
| |
Clang will complain if a function has no prior declaration
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9467
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This addresses a number of problems with dmu_send.c:
* bp_span is unused which makes clang complain
* dump_write conflicts with FreeBSD's existing core dump code
* range_alloc is private to the file and not declared in any headers
causing clang to complain
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9432
|
|
|
|
|
|
|
|
|
| |
We get the sizeof the appropriate type, and don't cast away const.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Matt Macy <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9455
|
|
|
|
|
|
|
|
| |
FreeBSD has a very different implementation.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9442
|
|
|
|
|
|
|
|
| |
FreeBSD has its own implementation as do other platforms.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9439
|
|
|
|
|
|
| |
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Closes #9452
|
|
|
|
|
|
|
|
|
|
|
| |
Rename certain functions for more consistency when they share common
features. Make comments clearer about what arguments should be passed
to the insert and add functions.
Reviewed by: Sara Hartse <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9441
|
|
|
|
|
|
|
|
| |
FreeBSD uses this in its pager ops routines
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9431
|
|
|
|
|
|
|
|
|
|
|
| |
When "feature@allocation_classes" is not enabled on the pool no vdev
with "special" or "dedup" allocation type should be allowed to exist in
the vdev tree.
Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9427
Closes #9429
|
|
|
|
|
|
|
|
|
|
| |
Temporary property handling at the VFS layer requires
platform specific code.
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9401
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make the metaslab platform agnostic again by adding
accessor functions which can be implemented by each
platform.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9404
|
|
|
|
|
|
|
|
|
| |
FreeBSD's zvol platform code requires access to the
zil_async_to_sync() function.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9440
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the FreeBSD kernel the strdup signature is:
```
char *strdup(const char *__restrict, struct malloc_type *);
```
It's unfortunate that the developers have chosen to change
the signature of libc functions - but it's what I have to
deal with.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9433
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The macros are used to generate code for conditions without a
corresponding branch. This is not a problem in practice, but
clang has no way of knowing that. Add a default branch with a
VERIFY(0) to indicate that it "can't happen"
```
In file included from \
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_sse2.c:607:
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_impl.h:281:3: \
error: no case matching constant switch condition '3' [-Werror]
```
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9434
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed by: Sebastien Roy [email protected]
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9181
|
|
|
|
|
|
|
|
|
| |
A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9402
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently every dbuf_hold_impl() incurs kmem_alloc() and kmem_free()
which can be costly for cached read performance.
This change reverts the dbuf_hold_impl() fix stack commit, i.e.
fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61 to eliminate the extra
kmem_alloc() and kmem_free() operations and improve cached read
performance. With the change, each dbuf_hold_impl() frame uses 40 bytes
more, total of 800 for 20 recursive levels. Linux kernel stack sizes are
8K and 16K for 32bit and 64bit, respectively, so stack overrun risk is
limited.
Sample stack output comparisons with 50 PB file and recordsize=512
Current code
11) 2240 64 arc_alloc_buf+0x4a/0xd0 [zfs]
12) 2176 264 dbuf_read_impl.constprop.16+0x2e3/0x7f0 [zfs]
13) 1912 120 dbuf_read+0xe5/0x520 [zfs]
14) 1792 56 dbuf_hold_impl_arg+0x572/0x630 [zfs]
15) 1736 64 dbuf_hold_impl_arg+0x508/0x630 [zfs]
16) 1672 64 dbuf_hold_impl_arg+0x508/0x630 [zfs]
17) 1608 40 dbuf_hold_impl+0x23/0x40 [zfs]
18) 1568 40 dbuf_hold_level+0x32/0x60 [zfs]
19) 1528 16 dbuf_hold+0x16/0x20 [zfs]
dbuf_hold_impl() cleanup
11) 2320 64 arc_alloc_buf+0x4a/0xd0 [zfs]
12) 2256 264 dbuf_read_impl.constprop.17+0x2e3/0x7f0 [zfs]
13) 1992 120 dbuf_read+0xe5/0x520 [zfs]
14) 1872 96 dbuf_hold_impl+0x50f/0x5e0 [zfs]
15) 1776 104 dbuf_hold_impl+0x4df/0x5e0 [zfs]
16) 1672 104 dbuf_hold_impl+0x4df/0x5e0 [zfs]
17) 1568 40 dbuf_hold_level+0x32/0x60 [zfs]
18) 1528 16 dbuf_hold+0x16/0x20 [zfs]
Performance observations on 8K recordsize filesystem:
- 8/128/1024K at 1-128 sequential cached read, ~3% improvement
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Closes #9351
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9389
|
|
|
|
|
|
|
|
| |
Make arc_stats visible to platform code.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9386
|
|
|
|
|
|
|
|
| |
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9385
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We've seen cases where after creating a ZVOL, the ZVOL device node in
"/dev" isn't generated after 20 seconds of waiting, which is the point
at which our applications gives up on waiting and reports an error.
The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the
same time, based on a policy set by the user. This refresh operation
will destroy the ZVOL, and re-create it based on a snapshot.
When this occurs, we see many hundreds of entries on the "z_zvol" taskq
(based on inspection of the /proc/spl/taskq-all file). Many of the
entries on the taskq end up in the "zvol_remove_minors_impl" function,
and I've measured the latency of that function:
Function = zvol_remove_minors_impl
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 45 |****************************************|
256 -> 511 : 5 |**** |
That data is from a 10 second sample, using the BCC "funclatency" tool.
As we can see, in this 10 second sample, most calls took 128ms at a
minimum. Thus, some basic math tells us that in any 20 second interval,
we could only process at most about 150 removals, which is much less
than the 400+ that'll occur based on the workload.
As a result of this, and since all ZVOL minor operations will go through
the single threaded "z_zvol" taskq, the latency for creating a single
ZVOL device can be unreasonably large due to other ZVOL activity on the
system. In our case, it's large enough to cause the application to
generate an error and fail the operation.
When profiling the "zvol_remove_minors_impl" function, I saw that most
of the time in the function was spent off-cpu, blocked in the function
"taskq_wait_outstanding". How this works, is "zvol_remove_minors_impl"
will dispatch calls to "zvol_free" using the "system_taskq", and then
the "taskq_wait_outstanding" function is used to wait for all of those
dispatched calls to occur before "zvol_remove_minors_impl" will return.
As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have
to wait for all calls to "zvol_free" to occur before it returns. Thus,
this change removes the call to "taskq_wait_oustanding", so that calls
to "zvol_free" don't affect the latency of "zvol_remove_minors_impl".
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Signed-off-by: Prakash Surya <[email protected]>
Closes #9380
|
|
|
|
|
|
|
|
|
| |
Refactor the zfs ioctls in to platform dependent and independent bits.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Sean Eric Fagan <[email protected]>
Signed-off-by: Matthew Macy <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #9301
|
|
|
|
|
|
|
|
|
|
| |
Refactor the zvol in to platform dependent and independent bits.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9295
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try
to access its interior slots when dnodesize > sizeof(dnode_phys_t).
This is normally not an issue because the interior slots are
zero-filled, which report_dnode() handles calling
report_free_dnode_range(). However this is not the case for encrypted
large dnodes or filesystem using many SA based xattrs where the extra
data past the legacy dnode size boundary is interpreted as a
dnode_phys_t.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #7678
Closes #8931
Closes #9343
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.
This was originally fixed in illumos-joyent as OS-7982.
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Signed-off-by: Kody A Kantor <[email protected]>
External-issue: illumos-joyent OS-7982
Closes #9299
Closes #9338
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.
In both cases, if ds1 was already present, then it would be first
replaced with ds2 and then ds would be replaced back with ds1.
Also, both cases did not properly handle a situation where both ds1 and
ds2 are already queued. A duplicate insertion would be attempted and
its failure would result in a panic.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Andriy Gapon <[email protected]>
Closes #9140
Closes #9163
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes a NULL pointer dereference triggered in
spa_vdev_remove_top_check() by trying to "zpool remove" an indirect
vdev.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9327
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes the following build failure detected on Debian9
(GCC 6.3.0):
CC [M] module/zfs/spa.o
module/zfs/spa.c: In function ‘spa_wait_common.part.31’:
module/zfs/spa.c:9468:6: error: ‘in_progress’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (!in_progress || spa->spa_waiters_cancel || error)
^
cc1: all warnings being treated as errors
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9326
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, spa_keystore_change_key_sync_impl() does not recurse
into clones when updating encryption roots for either a call to
'zfs promote' or 'zfs change-key'. This can cause children of
these clones to end up in a state where they point to the wrong
dataset as the encryption root. It can also trigger ASSERTs in
some cases where the code checks reference counts on wrapping
keys. This patch fixes this issue by ensuring that this function
properly recurses into clones during processing.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #9267
Closes #9294
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: John Gallagher <[email protected]>
Closes #9162
|
|
|
|
|
|
|
|
| |
objnode is OS agnostic and used only by dmu_redact.c.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9315
|
|
|
|
|
|
|
|
|
|
| |
Annotate spa logging declarations with printflike
Workaround gcc bug (non disable-able warning) by
replacing "" with " "
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9316
|
|
|
|
|
|
|
|
|
|
|
| |
Move Linux specific tracing headers and source to platform directories
and update the build system.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9290
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, the DSL scan code figures out when it should suspend
processing and allow a txg to continue by calling the function
dsl_scan_check_suspend(). Unfortunately, this function only
allows the scan to suspend at a level 0 block. In the event that
the system is scanning a bunch of empty snapshots or a resilver
is running with a high enough scn_cur_min_txg, the scan will
stop processing each dataset at the root level, deciding it
has nothing left to do. This means that the check_suspend
function is never called and the txg remains stuck until a
dataset is found that has data to scan.
This patch fixes the problem by allowing scans to suspend at
the root level of the objset. For backwards compatibility, we
use the bookmark <objsetid, 0, 0, 0> when we suspend here so
that older versions of the code will work as intended.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #9300
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify(). A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool. This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer. For example, to perform a
`zpool attach` as shown in the abbreviated stack below.
To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property. The cached
value is then relied upon for subsequent accesses.
Call Trace:
spa_config_enter+0x1e8/0x350 [zfs]
zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
zio_read+0x6c/0x140 [zfs]
...
vfs_read+0xfc/0x1e0
kernel_read+0x50/0x90
...
spa_get_hostid+0x1c/0x38 [zfs]
spa_config_generate+0x1a0/0x610 [zfs]
vdev_label_init+0xa0/0xc80 [zfs]
vdev_create+0x98/0xe0 [zfs]
spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock
Reviewed-by: loli10K <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #9256
Closes #9285
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context. This was done
intentionally since it's a relatively infrequent operation. However,
this also meant that the encryption context templates were initialized
using the generic operations. Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.
Resolve the issue by initializing the context templates in an I/O
pipeline thread. And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context. For example, when performing a read from the ARC.
Tested-by: Attila Fülöp <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #9215
Closes #9296
|
|
|
|
|
|
|
|
|
|
|
|
| |
Move platform specific Linux source under module/os/linux/
and update the build system accordingly. Additional code
restructuring will follow to make the common code fully
portable.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Macy <[email protected]>
Closes #9206
|
|
|
|
|
|
|
|
|
|
|
| |
Adds ZFS_MODULE_PARAM to abstract module parameter
setting to operating systems other than Linux.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #9230
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`metaslab_verify_weight_and_frag()` a verification function and
by the end of it there shouldn't be any side-effects.
The function calls `metaslab_weight()` which in turn calls
`metaslab_set_fragmentation()`. The latter can dirty and otherwise
not dirty metaslab fro the next TXGand set `metaslab_condense_wanted`
if the spacemaps were just upgraded (meaning we just enabled the
SPACEMAP_HISTOGRAM feature through upgrade).
This patch adds a new flag as a parameter to `metaslab_weight()` and
`metaslab_set_fragmentation()` making the dirtying of the metaslab
optional.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #9185
Closes #9282
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.
As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context. No functional change, the source has been
refactored and the relevant #include's updated.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Matthew Macy <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #9198
|
|
|
|
|
|
|
|
|
|
|
| |
Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value
is ignored in the Linux kstat code but resolves the issue for other
platforms.
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Igor Kozhukhov <[email protected]>
Closes #9254
Closes #9151
|