| Commit message | Author | Age | Files | Lines |
|
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
|
This continues what was started in
0eef1bde31d67091d3deed23fe2394f5a8bf2276 by fully converting zvols
to avoid unnecessary dnode_hold() calls. This saves a small amount
of CPU time and slightly improves latencies of operations on zvols.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #6058
|
Buildbots and zfs-tests regularly see 7 kilobytes of stack
usage with this function. Convert self-calls to iteration.
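The function itself is not named in this entry, so the following is only a minimal sketch of the general technique on a hypothetical linked chain (`node_t` and both function names are invented for illustration). It shows a self-call rewritten as a loop so that stack depth no longer grows with the length of the chain.

```c
#include <stddef.h>

/* Hypothetical node type, used only for this illustration. */
typedef struct node {
	struct node *next;
	int value;
} node_t;

/* Recursive form: one stack frame per node visited. */
int
sum_recursive(const node_t *n)
{
	if (n == NULL)
		return (0);
	return (n->value + sum_recursive(n->next));
}

/* Iterative form: the self-call becomes a loop, so stack depth is constant. */
int
sum_iterative(const node_t *n)
{
	int total = 0;

	while (n != NULL) {
		total += n->value;
		n = n->next;
	}
	return (total);
}
```

Both forms return the same result; only the iterative one has a stack footprint independent of the input size.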
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: DHE <[email protected]>
Closes #6219
|
Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Kash Pande <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
The send size estimate for a zvol can be too low, if the size of the
record headers (dmu_replay_record_t's) is a significant portion of the
size. This is typically the case when the data is highly compressible,
especially with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes
that blocks are the size of the "recordsize" property (128KB). However,
for zvols, the blocks are the size of the "volblocksize" property (8KB).
Therefore, we estimate that there will be 16x fewer record headers than
there really will be.
The fix is to check the type of the object set (whether it is a zvol or
not) and pick the appropriate property. In addition, while we are at it,
we also add the size of the BEGIN and END records to the estimate.
OpenZFS-issue: https://www.illumos.org/issues/8056
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faf09cd
Closes #6205
|
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should
do the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention. It isn't necessary to hold
the lock, because if we make the wrong choice occasionally, nothing bad
will happen. This commit results in a ~60% performance improvement for
ARC-cached sequential reads.
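A minimal sketch of the idea, assuming a hypothetical cache structure (`cache_state_t`, `high_water`, and `should_evict_directly()` are invented names, not the ZFS ones): the size is read without any lock, and a stale value only leads to an occasional wrong choice.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache state; the names are illustrative, not the ZFS ones. */
typedef struct cache_state {
	_Atomic uint64_t size;		/* bytes currently cached */
	uint64_t high_water;		/* above this, callers evict directly */
} cache_state_t;

/*
 * Decide whether the calling thread should evict by itself. The size is
 * read without taking any lock: a slightly stale value only means the
 * caller occasionally evicts (or skips evicting) when it did not have to.
 */
bool
should_evict_directly(cache_state_t *cs)
{
	uint64_t size = atomic_load_explicit(&cs->size, memory_order_relaxed);

	return (size > cs->high_water);
}
```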
OpenZFS-issue: https://www.illumos.org/issues/8156
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f73e5d9
Closes #6204
|
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock. To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.
The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs. When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128). Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.
This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.
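The chunking scheme can be sketched as follows. This is a single-threaded simulation under stated assumptions: `objset_sketch_t`, `NCPU`, and `lock_grabs` are hypothetical, and the counter merely stands in for taking os_obj_lock. A real implementation needs an actual mutex and CPU-local access.

```c
#include <stdint.h>

#define	NCPU		4	/* assumed CPU count for the sketch */
#define	CHUNK_OBJS	128	/* dnodes per chunk, the default quoted above */

typedef struct objset_sketch {
	uint64_t os_obj_next;		/* next unclaimed chunk start */
	uint64_t cpu_next[NCPU];	/* per-CPU "next object" cursor */
	uint64_t lock_grabs;		/* simulated os_obj_lock acquisitions */
} objset_sketch_t;

uint64_t
object_alloc_sketch(objset_sketch_t *os, int cpu)
{
	uint64_t obj = os->cpu_next[cpu];

	/* Cursor at a chunk boundary: claim a fresh chunk under the lock. */
	if (obj % CHUNK_OBJS == 0) {
		os->lock_grabs++;	/* stands in for taking os_obj_lock */
		obj = os->os_obj_next;
		os->os_obj_next += CHUNK_OBJS;
	}
	os->cpu_next[cpu] = obj + 1;
	return (obj);
}
```

Starting from a zeroed structure, 256 allocations on one CPU take the simulated lock only twice, matching the "once per 128 allocations" behavior described above.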
Work sponsored by Intel Corp.
Authored by: Matthew Ahrens <[email protected]>
Reviewed-by: Ned Bass <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Matthew Ahrens <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
|
- After some ZIL changes 6 years ago, zil_slog_limit got partially broken
because zl_itx_list_sz was not updated when async itxs were upgraded to sync.
Because of other changes made around that time, zl_itx_list_sz is no longer
required to implement the functionality, so this patch removes
some unneeded broken code and variables.
- The original idea of zil_slog_limit was to reduce the chance of SLOG abuse by
a single heavy logger, which increased latency for other (more latency-critical)
loggers, by pushing the heavy log out into the main pool instead of the SLOG.
Besides a huge latency increase for heavy writers, this implementation caused
a double write of all data, since the log records were explicitly prepared for
the SLOG. Since we now have an I/O scheduler, I've found it can be much more
efficient to reduce the priority of heavy logger SLOG writes from
ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leaving them
on the SLOG.
- The existing ZIL implementation had a problem with space efficiency when it
had to write large chunks of data into log blocks of limited size. In some
cases efficiency dropped to almost as low as 50%. For a ZIL stored on
spinning rust, that also cut log write speed in half, since the head had to
uselessly fly over allocated but unwritten areas. This change improves
the situation by offloading the problematic operations from z*_log_write() to
zil_lwb_commit(), which knows the real state of log block allocation and
can split large requests into pieces much more efficiently. As a side
effect, it also removes one of the two data copy operations done by the ZIL
code in the WR_COPIED case.
- While there, untangle and unify the code of the z*_log_write() functions.
Also, zfs_log_write(), like zvol_log_write(), can now handle writes crossing
a block boundary, which may also improve efficiency if the ZPL is made to do that.
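The priority demotion in the second item can be sketched like this. The enum, the function, and both parameter names are assumptions made for illustration; the sketch only shows the decision, not how the ZIL tracks how much a logger has written.

```c
#include <stdint.h>

/* Hypothetical priority classes mirroring the two named above. */
typedef enum write_prio {
	PRIO_SYNC_WRITE,
	PRIO_ASYNC_WRITE
} write_prio_t;

/*
 * Pick the I/O priority for a log write. `logged_bytes` is how much the
 * current commit has already logged and `slog_limit` is the tunable
 * threshold; both names are assumptions made for this sketch.
 */
write_prio_t
log_write_priority(uint64_t logged_bytes, uint64_t slog_limit)
{
	/* Light loggers stay latency-sensitive; heavy ones are demoted. */
	if (logged_bytes <= slog_limit)
		return (PRIO_SYNC_WRITE);
	return (PRIO_ASYNC_WRITE);
}
```

In both cases the write stays on the SLOG; only its scheduling class changes.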
Sponsored by: iXsystems, Inc.
Authored by: Alexander Motin <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Andriy Gapon <[email protected]>
Reviewed by: Steven Hartland <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
Closes #6191
|
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
When writing pre-compressed buffers, arc_write() requires that
the compression algorithm used to compress the buffer matches
the compression algorithm requested by the zio_prop_t, which is
set by dmu_write_policy(). This makes dmu_write_policy() and its
callers a bit more complicated.
We simplify this by making arc_write() trust the caller to supply
the type of pre-compressed buffer that it wants to write,
and override the compression setting in the zio_prop_t.
OpenZFS-issue: https://www.illumos.org/issues/8155
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58
Closes #6200
|
Since torvalds/linux@d0a5b99 IOP_XATTR is used to indicate the inode
has xattr support: clear it for the ctldir inodes to avoid EIO errors.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6189
|
When inheriting the "snapdev" property we don't always call
zfs_prop_set_special(): this prevents device nodes from being created in
certain situations. Because "snapdev" is the only *special* property
that is also inheritable we need to call zfs_prop_set_special() even
when we're not reverting it to the received value ('zfs inherit -S').
Additionally, fix a NULL pointer dereference accidentally introduced in
5559ba0 that can be triggered when setting the "snapdev" property to
the value "hidden" twice.
Finally, add a new test case "zvol_misc_snapdev" to the ZFS Test Suite.
Reviewed by: Boris Protopopov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6131
Closes #6175
Closes #6176
|
If, for example, your aux device was /dev/sdc, but the aux device has since
been removed and /dev/sdc now points to another device, zpool import will
still use that device and corrupt it.
The problem is that spa_validate_aux() in spa_import(), rather than
validating the on-disk label, actually writes a label to disk. We
remove these calls since spa_load_{spares,l2cache} seem to do everything we
need, and they do validate the on-disk label.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #6158
|
Move kmem_free() so it's called for every error path: this is
preferred over making `dmu_object_info_t doi` local to accommodate
older kernels with limited stacks.
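A userspace sketch of the pattern, with malloc()/free() standing in for kmem_alloc()/kmem_free(). `info_t`, `setup_sketch()`, and the `frees_seen` counter are hypothetical and exist only to make the single exit point visible.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for dmu_object_info_t. */
typedef struct info {
	unsigned block_size;
} info_t;

int frees_seen;	/* counts frees, for the sketch only */

int
setup_sketch(int fail_lookup, int fail_check)
{
	int error = 0;
	/* Heap allocation keeps the structure off a small kernel stack. */
	info_t *doi = malloc(sizeof (*doi));

	if (doi == NULL)
		return (ENOMEM);

	if (fail_lookup) {
		error = EIO;
		goto out;
	}
	if (fail_check) {
		error = EINVAL;
		goto out;
	}
	doi->block_size = 8192;
out:
	/* Single exit point: the buffer is released on success and on error. */
	free(doi);
	frees_seen++;
	return (error);
}
```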
Reviewed by: Boris Protopopov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6177
|
Added missing ida_simple_remove() in the error handling path.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Closes #6159
Closes #6172
|
In certain cases (dsl_scan_sync() is one), we may end up calling
bpobj_iterate() on an empty bpobj. Even though we don't end up
modifying the bpobj it still gets dirtied, causing unneeded writes
to the pool.
This patch adds an early bail from bpobj_iterate_impl() if bpobj
is empty to prevent unneeded writes.
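The early bail can be illustrated with a much-simplified stand-in. `bpobj_sketch_t` and `bpobj_iterate_sketch()` are hypothetical; the real bpobj_iterate_impl() does considerably more than set a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, much-simplified stand-in for a bpobj. */
typedef struct bpobj_sketch {
	uint64_t num_blkptrs;	/* entries stored in the object */
	bool dirty;		/* set when the object must be written out */
} bpobj_sketch_t;

/* Returns the number of entries visited. */
uint64_t
bpobj_iterate_sketch(bpobj_sketch_t *bpo)
{
	/* Early bail: nothing to visit, so do not dirty the object. */
	if (bpo->num_blkptrs == 0)
		return (0);

	bpo->dirty = true;	/* iteration may modify the object */
	return (bpo->num_blkptrs);
}
```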
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #6164
|
This reverts commit 959f56b99366c8727647b5b19fb3d47555c96cf3.
An issue was uncovered by the new zvol_misc_snapdev test case
which needs to be investigated and resolved.
Reviewed-by: loli10K <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6174
Issue #6131
|
Authored by: Alan Somers <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: bunder2015 <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8070
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40713f2
Closes #6160
|
When inheriting the "snapdev" property we don't always call
zfs_prop_set_special(): this prevents device nodes from being created in
certain situations. Because "snapdev" is the only *special* property
that is also inheritable we need to call zfs_prop_set_special() even
when we're not reverting it to the received value ('zfs inherit -S').
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6131
|
Provide a format parameter to super_setup_bdi_name() so we don't
create duplicate names in '/devices/virtual/bdi' sysfs namespace which
would prevent us from mounting more than one ZFS filesystem at a time.
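A userspace sketch of why a format parameter helps, using snprintf() in place of the kernel interface. `make_bdi_name()` and the `bdi_seq` counter are hypothetical; the point is only that a per-mount suffix makes every name distinct.

```c
#include <stdio.h>

static unsigned long bdi_seq;	/* hypothetical sequence counter */

/*
 * Build a name of the form "<base>-<n>". Each call consumes a new
 * sequence number, so two mounts never produce the same name.
 */
int
make_bdi_name(char *buf, size_t len, const char *base)
{
	return (snprintf(buf, len, "%s-%lu", base, ++bdi_seq));
}
```

A kernel version would need an atomic counter rather than a plain static variable.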
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6147
|
Sync with kernel patches for lz4
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/lib/lz4
4a3a99 lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
d5e7ca LZ4 : fix the data abort issue
bea2b5 lib/lz4: Pull out constant tables
99b7e9 lz4: fix system halt at boot kernel on x86_64
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Feng Sun <[email protected]>
Closes #5975
Closes #5973
|
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)', but 'zpool sync'
returns only when data has hit the main storage, instead of potentially
just the ZIL as is the case with 'sync(2)'.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #6122
|
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline. Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED. The -f faults
persist across imports, unless they were set with the temporary
(-t) flag. Both persistent and temporary faults can be cleared
with zpool clear.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #6094
|
One pre-check in zfs_ereport_start() was performed after
the nvlists had been allocated. This simply corrects that
issue.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #6140
|
The lock is designed to protect internal state of zvol_state_t and
to avoid taking spa_namespace_lock (e.g. in dmu_objset_own() code path)
while holding zvol_stat_lock. Refactor the code accordingly.
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3484
Closes #6065
Closes #6134
|
Fix a lock order inversion in zvol_open(), which did not account
for the use of zvols as vdevs. That use case resulted in
lock order inversion deadlocks involving spa_namespace_lock
and bdev->bd_mutex.
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #6065
Issue #6134
|
On a raidz vdev, a block that does not span all child vdevs, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, the current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.
This patch eliminates such spurious IOs.
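The test for whether a block touches a given child can be sketched with simplified geometry. This is not the raidz map code: the helper below is hypothetical and assumes the block occupies `nsectors` consecutive columns (data plus parity, skip sectors excluded) starting at `first_col`, with both `first_col` and `child` less than `ncols`.

```c
#include <stdbool.h>
#include <stdint.h>

bool
block_touches_child(uint64_t first_col, uint64_t nsectors, uint64_t ncols,
    uint64_t child)
{
	if (nsectors >= ncols)
		return (true);	/* block spans every child */

	/* Distance from the first column to the child, wrapping around. */
	uint64_t dist = (child + ncols - first_col) % ncols;

	return (dist < nsectors);
}
```

With 6 children, a 3-sector block starting at column 4 lands on columns 4, 5, and 0, so an outage of child 1 would not require resilvering it.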
Reviewed-by: Gvozden Neskovic <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Isaac Huang <[email protected]>
Closes #5316
|
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: George Melikov <[email protected]>
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.
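The check being asserted can be sketched as a predicate. This is an assumption-laden illustration, not the ported code: it supposes the three active TXGs are the open TXG and the two immediately before it, and `txg_is_active()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define	ACTIVE_TXGS	3	/* open, quiescing, syncing */

bool
txg_is_active(uint64_t open_txg, uint64_t txg)
{
	/* Guard the subtraction for the first few TXGs. */
	uint64_t oldest = open_txg >= ACTIVE_TXGS - 1 ?
	    open_txg - (ACTIVE_TXGS - 1) : 0;

	return (txg >= oldest && txg <= open_txg);
}
```

With the open TXG at 10, TXGs 8 through 10 pass the check while 7 and 11 do not.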
Porting Notes:
- ASSERTV added to txg_sync_waiting() for unused variable.
OpenZFS-issue: https://www.illumos.org/issues/8063
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46
Closes #6109
|
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed-by: loli10K <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Matthew Ahrens <[email protected]>
If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged. When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started. The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.
The fix is to never clear the DTL of offline devices. Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.
The problem can be worked around by running "zpool scrub" after
"zpool online".
OpenZFS-issue: https://www.illumos.org/issues/8166
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/372
Closes #5806
Closes #6103
|
The arc layer tracks checksums of its data in the arc header
so that it can ensure that buffers haven't changed when they're
not supposed to. This checksum is only maintained while there
is an uncompressed buffer still attached to the header.
Unfortunately there is a missing call to arc_free_cksum() in
arc_release() that can trigger ASSERTs. This has not been a
common issue because the checksums are only maintained for
debug builds and triggering the bug requires writing a block
(and therefore calling arc_release()) while a compressed buffer
is still being used on a debug build. This simply corrects the
issue.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #6105
|
Linux 4.9 added current_time() as the preferred interface to get
the filesystem time. CURRENT_TIME was retired in Linux 4.12.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6114
|
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.
This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.
References:
https://www.illumos.org/issues/2745
https://www.illumos.org/issues/3753
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #1350
Closes #5349
|
zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related
check, so we use it accordingly.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #6113
|
Remove the lz4_ac local variable from dmu_write_policy() to resolve
the following unused variable warning on non-debug builds.
dmu.c: In function ‘dmu_write_policy’:
dmu.c:1892:12: warning: unused variable ‘lz4_ac’ [-Wunused-variable]
boolean_t lz4_ac = spa_feature_is_active(os->os_spa,
Signed-off-by: Brian Behlendorf <[email protected]>
|
The proposed debugging enhancements in zfsonlinux/spl#587
identified the following missing *_destroy/*_fini calls.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Closes #5428
|
Change the default ZVOL behavior so requests are handled asynchronously.
This behavior is functionally the same as in the zfs-0.6.4 release.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #5902
|
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.
Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Issue #5902
|
The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.
Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.
Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
|
When vdev_psize increases, the location of labels 2 and 3 changes
because their location is relative to the end of the device.
The configs for labels 2 and 3 are written during the next spa_sync()
because the vdev is added to the dirty config list. However, the
uberblock rings are not re-written in their new location, leaving the
device vulnerable to the beginning of the device being overwritten or
damaged.
This patch copies the uberblock ring from label 0 to labels 2 and 3,
in their new locations, at the next sync after vdev_psize increases.
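Why labels 2 and 3 move when the device grows can be seen from a small offset calculation. The sketch assumes four labels of 256 KiB each (constants named `LABEL_COUNT` and `LABEL_SIZE` here for illustration) and a device size that is a multiple of the label size; `label_offset()` is a hypothetical helper.

```c
#include <stdint.h>

#define	LABEL_COUNT	4
#define	LABEL_SIZE	(256ULL * 1024)	/* assumed on-disk label size */

uint64_t
label_offset(uint64_t psize, int label)
{
	uint64_t off = (uint64_t)label * LABEL_SIZE;

	/* Labels 2 and 3 are addressed from the end of the device. */
	if (label >= LABEL_COUNT / 2)
		off += psize - LABEL_COUNT * LABEL_SIZE;
	return (off);
}
```

Growing a device from 1 GiB to 2 GiB leaves labels 0 and 1 in place but shifts labels 2 and 3 by 1 GiB, which is why their uberblock rings have to be written again at the new location.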
Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks
are copied.
Reviewed-by: BearBabyLiu <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #5108
|
When multiple filesystems are in use, memory pressure causes arc_cache
to collapse to a minimum. Allow arc_cache to maintain proportional size
even when hit rates are disproportionate. We do this only via evictable
size from the kernel shrinker, thus it's only in effect under memory
pressure.
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Closes #6035
|
Could return the wrong pages value
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
Calling it when nothing is evictable will cause extra kswapd CPU usage. Also,
if we didn't shrink, it's unlikely that there is memory to reap because we
likely just called it microseconds ago. The exception is if we are in
direct reclaim.
You can see how hard this is being hit in kswapd with a light test
workload:
34.95% [zfs] [k] arc_kmem_reap_now
5.40% [spl] [k] spl_kmem_cache_reap_now
3.79% [kernel] [k] _raw_spin_lock
2.86% [spl] [k] __spl_kmem_cache_generic_shrinker.isra.7
2.70% [kernel] [k] shrink_slab.part.37
1.93% [kernel] [k] isolate_lru_pages.isra.43
1.55% [kernel] [k] __wake_up_bit
1.20% [kernel] [k] super_cache_count
1.20% [kernel] [k] __radix_tree_lookup
With ZFS just mounted but only ext4/pagecache memory pressure
arc_kmem_reap_now still consumes excessive CPU:
12.69% [kernel] [k] isolate_lru_pages.isra.43
10.76% [kernel] [k] free_pcppages_bulk
7.98% [kernel] [k] drop_buffers
7.31% [kernel] [k] shrink_page_list
6.44% [zfs] [k] arc_kmem_reap_now
4.19% [kernel] [k] free_hot_cold_page
4.00% [kernel] [k] __slab_free
3.95% [kernel] [k] __isolate_lru_page
3.09% [kernel] [k] __radix_tree_lookup
Same pagecache only workload as above with this patch series:
11.58% [kernel] [k] isolate_lru_pages.isra.43
11.20% [kernel] [k] drop_buffers
9.67% [kernel] [k] free_pcppages_bulk
8.44% [kernel] [k] shrink_page_list
4.86% [kernel] [k] __isolate_lru_page
4.43% [kernel] [k] free_hot_cold_page
4.00% [kernel] [k] __slab_free
3.44% [kernel] [k] __radix_tree_lookup
(arc_kmem_reap_now has 0 samples in perf)
AKAMAI: zfs: CR 3695042
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
Lock contention, by itself, shouldn't indicate a stop condition to the
kernel's slab shrinker. Doing so can cause stalls when the kernel is
trying to free large parts of the cache, such as is done by drop_caches.
Also, perhaps arc_reclaim_lock should be a spinlock, and this code
eliminated.
AKAMAI: zfs: CR 3593801
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
Move the arcstat_need_free increment from all direct calls to the case where
arc_reclaim_lock is busy and we exit without doing anything. Data will
be reclaimed in the reclaim thread. The previous location meant that we
both reclaim the memory in this thread, and also schedule the same
amount of memory for reclaim in arc_reclaim, effectively doubling the
requested reclaim.
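The accounting change can be sketched as follows. Everything here is hypothetical (`arc_sketch_t`, the `reclaim_lock_busy` flag standing in for a failed trylock on arc_reclaim_lock, and the function name); it only shows that a request is either reclaimed directly or queued for the reclaim thread, never both.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct arc_sketch {
	bool reclaim_lock_busy;	/* stands in for a failed trylock */
	uint64_t need_free;	/* bytes deferred to the reclaim thread */
	uint64_t evicted;	/* bytes freed directly by the caller */
} arc_sketch_t;

/* Returns the number of bytes freed by this call. */
uint64_t
shrinker_scan_sketch(arc_sketch_t *arc, uint64_t bytes)
{
	if (arc->reclaim_lock_busy) {
		/* Could not reclaim here: hand the request over. */
		arc->need_free += bytes;
		return (0);
	}
	/* Reclaimed directly, so nothing is queued for the reclaim thread. */
	arc->evicted += bytes;
	return (bytes);
}
```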
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
Ensures proper accounting of bytes we requested to free
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
Ghost meta/data buffers are not actually allocated
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Debabrata Banerjee <[email protected]>
Issue #6035
|
There is no need for a loop to free the pages in a single scatterlist
entry, because each entry is either a single page or a compound page. The
pages can be freed with one call to __free_pages() in both cases.
Reviewed-by: Gvozden Neskovic <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Signed-off-by: Jinshan Xiong <[email protected]>
Closes #6057
|
In the current implementation, only zio buffers of 16KB and larger are
guaranteed PAGESIZE alignment. This breaks Lustre since it assumes
that 'arc_buf_t::b_data' must be page aligned when zio buffers are
greater than or equal to PAGESIZE.
This patch makes zio buffers PAGESIZE aligned when
their size is at least PAGESIZE.
This change may waste a little memory, but that should be
fine because, since ABD was introduced, zio buffers are used to hold
data temporarily and live in memory for only a short while.
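The alignment rule can be sketched as a small helper. The constants and the function name are assumptions (a 4 KiB page and a 512-byte minimum alignment for smaller buffers); the actual cache setup may choose alignments for small buffers differently.

```c
#include <stddef.h>

#define	PAGESIZE_SKETCH	4096	/* assumed page size */
#define	MINBLOCK_SKETCH	512	/* assumed alignment for small buffers */

/* Alignment to request for a zio buffer of the given size. */
size_t
zio_buf_align_sketch(size_t size)
{
	return (size >= PAGESIZE_SKETCH ? PAGESIZE_SKETCH : MINBLOCK_SKETCH);
}
```

Under this rule an 8KB buffer is page aligned, whereas before the change only buffers of 16KB and larger had that guarantee.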
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jinshan Xiong <[email protected]>
Signed-off-by: Jinshan Xiong <[email protected]>
Closes #6084
|
All filesystems were converted to dynamically allocated BDIs. The
destruction of backing_dev_info structures is handled as part of
super block destruction. Refactor the code to abstract away the
details of creating and destroying a BDI.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6089
|
Authored by: Yuri Pankov <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Albert Lee <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: bunder2015 <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7786
OpenZFS-commit: http://github.com/openzfs/openzfs/commit/db8498f
Closes #6074
|
Reinstate default 4G zfs_dirty_data_max_max limit.
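A sketch of how such a cap typically combines with a memory-relative limit. Only the 4G figure comes from this entry; the percentage parameter and the helper name are assumptions made for illustration.

```c
#include <stdint.h>

#define	DIRTY_MAX_MAX_CAP	(4ULL << 30)	/* the 4G limit named above */

/* Smaller of the 4G cap and a percentage of physical memory. */
uint64_t
dirty_data_max_max_sketch(uint64_t physmem_bytes, unsigned percent)
{
	uint64_t by_percent = physmem_bytes / 100 * percent;

	return (by_percent < DIRTY_MAX_MAX_CAP ? by_percent :
	    DIRTY_MAX_MAX_CAP);
}
```

On a machine with 64 GiB of memory and an assumed 25% setting, the percentage alone would allow about 16 GiB, so the 4G cap is what takes effect.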
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6072
Closes #6081
|