| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO
to read data from the original storage device. Unfortunately pointer
to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if
some other read try to bump the ZIO priority, it will crash.
The problem is reproducible by corrupting L2ARC content and reading
some data with prefetch if l2arc_noprefetch tunable is changed to 0.
With the default setting the issue is probably not reproducible now.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #9648
|
|
|
|
|
|
|
|
|
| |
The module_param_call() functionality is currently still
Linux-specific and should be wrapped accordingly.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9666
|
|
|
|
|
|
|
|
| |
The write_record() function is private and should be marked as such.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9665
|
|
|
|
|
|
|
|
|
|
|
| |
FreeBSD needs to cope with multiple version of the zfs_cmd_t
structure. Allowing the platform code to pre and post
process the cmd structure makes it possible to work with
legacy tooling.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9624
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Reviewed-by: Kjeld Schouten <[email protected]>
Signed-off-by: John Poduska <[email protected]>
Issue #840
Closes #9155
Closes #9378
Closes #9551
Closes #9588
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Provide a common zfs_file_* interface which can be implemented on all
platforms to perform normal file access from either the kernel module
or the libzpool library.
This allows all non-portable vnode_t usage in the common code to be
replaced by the new portable zfs_file_t. The associated vnode and
kobj compatibility functions, types, and macros have been removed
from the SPL. Moving forward, vnodes should only be used in platform
specific code when provided by the native operating system.
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9556
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it
became ~127KB+/1KB+ in two 128KB log blocks to free space in the second
block for another record. Unfortunately in case of 128KB only writes,
when space in the second block remained unused, that change increased
write latency by unbalancing checksum computation and write times
between parallel threads. It also didn't help with SLOG space
efficiency in that case.
This change introduces new 68KB log block size, used for both writes
below 67KB and 128KB-sharp writes. Writes of 68-127KB are still using
one 128KB block to not increase processing overhead. Writes above
131KB are still using full 128KB blocks, since possible saving there
is small. Mixed loads will likely also fall back to previous 128KB,
since code uses maximum of the last 16 requested block sizes.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #9409
|
|
|
|
|
|
|
|
|
| |
Some of the znode fields are different and functions
consuming an inode don't exist on FreeBSD.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9536
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.
Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.
Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.
Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.
Reviewed by: Brad Lewis <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Prakash Surya <[email protected]>
Closes #9525
|
|
|
|
|
|
|
|
|
|
| |
A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
This change is a follow up to 2cc479d0.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9534
|
|
|
|
|
|
|
|
| |
Address two prototype related warnings emitted by clang.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9535
|
|
|
|
|
|
|
|
|
|
| |
Move these Linux module parameter get/set helpers in to
platform specific code.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9457
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, when you call 'zfs change-key' on an encrypted dataset
that has an unencrypted child, the code will trigger a VERIFY.
This VERIFY is leftover from before we allowed unencrypted
datasets to exist underneath encrypted ones. This patch fixes the
issue by simply replacing the VERIFY with an early return when
recursing through datasets.
Reviewed by: Jason King <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #9524
|
|
|
|
|
|
|
|
|
|
|
|
| |
- FreeBSD's rootpool import code uses spa_config_parse
- Move the zvol_create_minors call out from under the
spa_namespace_lock in spa_import. It isn't needed and it causes
a lock order reversal on FreeBSD.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9499
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.
A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.
Reviewed-by: Matt Macy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9487
Closes #9489
|
|
|
|
|
|
|
|
|
|
| |
This assert makes non portable assumptions about the state of memory
returned by the memory allocator.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9506
|
|
|
|
|
|
|
|
|
|
| |
This logic is not platform dependent and should reside in the
common code.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9505
|
|
|
|
|
|
|
|
|
| |
It's mostly a noop on ZoL and it conflicts with platforms that
support dtrace. Remove this header to resolve the conflict.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9497
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.
This has the additional advantage of allowing us to use the FPU
again in user threads. So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.
Reviewed-by: Fabian Grünbichler <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #9346
Closes #9403
|
|
|
|
|
|
|
|
|
|
| |
Fixes an obvious issue of calling arc_buf_destroy() on an
unallocated arc_buf.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Closes #9453
|
|
|
|
|
|
|
|
|
|
|
| |
Factor Linux specific memory pressure handling out of ARC. Each
platform will have different available interfaces for managing memory
pressure.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9472
|
|
|
|
|
|
|
|
| |
Only pass the file descriptor to make zfsdev_get_miror() portable.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9466
|
|
|
|
|
|
| |
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9465
|
|
|
|
|
|
|
|
| |
Clang will complain if a function has no prior declaration
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9467
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This addresses a number of problems with dmu_send.c:
* bp_span is unused which makes clang complain
* dump_write conflicts with FreeBSD's existing core dump code
* range_alloc is private to the file and not declared in any headers
causing clang to complain
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9432
|
|
|
|
|
|
|
|
|
| |
We get the sizeof the appropriate type, and don't cast away const.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Matt Macy <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9455
|
|
|
|
|
|
|
|
| |
FreeBSD has a very different implementation.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9442
|
|
|
|
|
|
|
|
| |
FreeBSD has its own implementation as do other platforms.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9439
|
|
|
|
|
|
| |
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Closes #9452
|
|
|
|
|
|
|
|
|
|
|
| |
Rename certain functions for more consistency when they share common
features. Make comments clearer about what arguments should be passed
to the insert and add functions.
Reviewed by: Sara Hartse <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9441
|
|
|
|
|
|
|
|
| |
FreeBSD uses this in its pager ops routines
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9431
|
|
|
|
|
|
|
|
|
|
|
| |
When "feature@allocation_classes" is not enabled on the pool no vdev
with "special" or "dedup" allocation type should be allowed to exist in
the vdev tree.
Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9427
Closes #9429
|
|
|
|
|
|
|
|
|
|
| |
Temporary property handling at the VFS layer requires
platform specific code.
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9401
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make the metaslab platform agnostic again by adding
accessor functions which can be implemented by each
platform.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9404
|
|
|
|
|
|
|
|
|
| |
FreeBSD's zvol platform code requires access to the
zil_async_to_sync() function.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9440
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the FreeBSD kernel the strdup signature is:
```
char *strdup(const char *__restrict, struct malloc_type *);
```
It's unfortunate that the developers have chosen to change
the signature of libc functions - but it's what I have to
deal with.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9433
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The macros are used to generate code for conditions without a
corresponding branch. This is not a problem in practice, but
clang has no way of knowing that. Add a default branch with a
VERIFY(0) to indicate that it "can't happen"
```
In file included from \
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_sse2.c:607:
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_impl.h:281:3: \
error: no case matching constant switch condition '3' [-Werror]
```
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9434
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed by: Sebastien Roy [email protected]
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9181
|
|
|
|
|
|
|
|
|
| |
A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9402
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently every dbuf_hold_impl() incurs kmem_alloc() and kmem_free()
which can be costly for cached read performance.
This change reverts the dbuf_hold_impl() fix stack commit, i.e.
fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61 to eliminate the extra
kmem_alloc() and kmem_free() operations and improve cached read
performance. With the change, each dbuf_hold_impl() frame uses 40 bytes
more, total of 800 for 20 recursive levels. Linux kernel stack sizes are
8K and 16K for 32bit and 64bit, respectively, so stack overrun risk is
limited.
Sample stack output comparisons with 50 PB file and recordsize=512
Current code
11) 2240 64 arc_alloc_buf+0x4a/0xd0 [zfs]
12) 2176 264 dbuf_read_impl.constprop.16+0x2e3/0x7f0 [zfs]
13) 1912 120 dbuf_read+0xe5/0x520 [zfs]
14) 1792 56 dbuf_hold_impl_arg+0x572/0x630 [zfs]
15) 1736 64 dbuf_hold_impl_arg+0x508/0x630 [zfs]
16) 1672 64 dbuf_hold_impl_arg+0x508/0x630 [zfs]
17) 1608 40 dbuf_hold_impl+0x23/0x40 [zfs]
18) 1568 40 dbuf_hold_level+0x32/0x60 [zfs]
19) 1528 16 dbuf_hold+0x16/0x20 [zfs]
dbuf_hold_impl() cleanup
11) 2320 64 arc_alloc_buf+0x4a/0xd0 [zfs]
12) 2256 264 dbuf_read_impl.constprop.17+0x2e3/0x7f0 [zfs]
13) 1992 120 dbuf_read+0xe5/0x520 [zfs]
14) 1872 96 dbuf_hold_impl+0x50f/0x5e0 [zfs]
15) 1776 104 dbuf_hold_impl+0x4df/0x5e0 [zfs]
16) 1672 104 dbuf_hold_impl+0x4df/0x5e0 [zfs]
17) 1568 40 dbuf_hold_level+0x32/0x60 [zfs]
18) 1528 16 dbuf_hold+0x16/0x20 [zfs]
Performance observations on 8K recordsize filesystem:
- 8/128/1024K at 1-128 sequential cached read, ~3% improvement
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Closes #9351
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9389
|
|
|
|
|
|
|
|
| |
Make arc_stats visible to platform code.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9386
|
|
|
|
|
|
|
|
| |
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9385
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We've seen cases where after creating a ZVOL, the ZVOL device node in
"/dev" isn't generated after 20 seconds of waiting, which is the point
at which our applications gives up on waiting and reports an error.
The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the
same time, based on a policy set by the user. This refresh operation
will destroy the ZVOL, and re-create it based on a snapshot.
When this occurs, we see many hundreds of entries on the "z_zvol" taskq
(based on inspection of the /proc/spl/taskq-all file). Many of the
entries on the taskq end up in the "zvol_remove_minors_impl" function,
and I've measured the latency of that function:
Function = zvol_remove_minors_impl
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 45 |****************************************|
256 -> 511 : 5 |**** |
That data is from a 10 second sample, using the BCC "funclatency" tool.
As we can see, in this 10 second sample, most calls took 128ms at a
minimum. Thus, some basic math tells us that in any 20 second interval,
we could only process at most about 150 removals, which is much less
than the 400+ that'll occur based on the workload.
As a result of this, and since all ZVOL minor operations will go through
the single threaded "z_zvol" taskq, the latency for creating a single
ZVOL device can be unreasonably large due to other ZVOL activity on the
system. In our case, it's large enough to cause the application to
generate an error and fail the operation.
When profiling the "zvol_remove_minors_impl" function, I saw that most
of the time in the function was spent off-cpu, blocked in the function
"taskq_wait_outstanding". How this works, is "zvol_remove_minors_impl"
will dispatch calls to "zvol_free" using the "system_taskq", and then
the "taskq_wait_outstanding" function is used to wait for all of those
dispatched calls to occur before "zvol_remove_minors_impl" will return.
As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have
to wait for all calls to "zvol_free" to occur before it returns. Thus,
this change removes the call to "taskq_wait_oustanding", so that calls
to "zvol_free" don't affect the latency of "zvol_remove_minors_impl".
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Signed-off-by: Prakash Surya <[email protected]>
Closes #9380
|
|
|
|
|
|
|
|
|
| |
Refactor the zfs ioctls in to platform dependent and independent bits.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Sean Eric Fagan <[email protected]>
Signed-off-by: Matthew Macy <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #9301
|
|
|
|
|
|
|
|
|
|
| |
Refactor the zvol in to platform dependent and independent bits.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9295
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try
to access its interior slots when dnodesize > sizeof(dnode_phys_t).
This is normally not an issue because the interior slots are
zero-filled, which report_dnode() handles calling
report_free_dnode_range(). However this is not the case for encrypted
large dnodes or filesystem using many SA based xattrs where the extra
data past the legacy dnode size boundary is interpreted as a
dnode_phys_t.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #7678
Closes #8931
Closes #9343
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.
This was originally fixed in illumos-joyent as OS-7982.
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Signed-off-by: Kody A Kantor <[email protected]>
External-issue: illumos-joyent OS-7982
Closes #9299
Closes #9338
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.
In both cases, if ds1 was already present, then it would be first
replaced with ds2 and then ds would be replaced back with ds1.
Also, both cases did not properly handle a situation where both ds1 and
ds2 are already queued. A duplicate insertion would be attempted and
its failure would result in a panic.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Andriy Gapon <[email protected]>
Closes #9140
Closes #9163
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes a NULL pointer dereference triggered in
spa_vdev_remove_top_check() by trying to "zpool remove" an indirect
vdev.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9327
|