| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
The mmp_exported_import and mmp_inactive_import tests depend on
ztest simulating an active pool. If ztest unexpectedly terminates
due to an unrelated issue the test case will fail. Since ztest is
not yet 100% reliable I've added these tests to the maybe exception
list. They can be removed when the issues with ztest are resolved
or if the test cases are updated to handle these unexpected failures.
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #10726
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10701
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented
caches are `KMC_KVMEM`.
KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we
don't use kmalloc to back the SPL caches, instead we use kvmalloc
(KMC_KVMEM). The flag, module parameter, /proc entries, and associated
code are removed.
KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to
vmalloc(). The flag, /proc entries, and associated code are removed.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10673
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Cast void * to uintptr_t before casting to boolean_t.
* Avoid clashing definition of __asm when not on Linux to
prevent duplicate __volatile__. This was already done in
some places but not all.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Macy <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10723
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10721
|
|
|
|
|
|
|
| |
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #10725
|
|
|
|
|
|
|
|
|
| |
Up until now zpool.cache has always lived in /boot on FreeBSD.
For the sake of compatibility fallback to /boot if zpool.cache
isn't found in /etc/zfs.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10720
|
|
|
|
|
|
|
|
|
|
| |
arc_summary3 reports L2ARC writes in bytes. However, the related
arc_stat is reported as hits. arc_summary2 report this correctly.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #10717
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`thread_create` on FreeBSD stringifies the argument passed as the
thread function to create a name for the thread. The thread name for
`l2arc_dev_rebuild_start` ended up with `(void (*)(void *))` in it.
Change the type signature so the function does not need to be cast
when creating the thread. Rename the function to
`l2arc_dev_rebuild_thread` for clarity and consistency, as well.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10716
|
|
|
|
|
|
|
|
| |
Stepping stone toward re-enabling spa_thread on FreeBSD.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10715
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When reading compressed blocks from the L2ARC, with
compressed ARC disabled, arc_hdr_size() returns
LSIZE rather than PSIZE, but the actual read is PSIZE.
This causes l2arc_read_done() to compare the checksum
against the wrong size, resulting in checksum failure.
This manifests as an increase in the kstat l2_cksum_bad
and the read being retried from the main pool, making the
L2ARC ineffective.
Add new L2ARC tests with Compressed ARC enabled/disabled
Blocks are handled differently depending on the state of the
zfs_compressed_arc_enabled tunable.
If a block is compressed on-disk, and compressed_arc is enabled:
- the block is read from disk
- It is NOT decompressed
- It is added to the ARC in its compressed form
- l2arc_write_buffers() may write it to the L2ARC (as is)
- l2arc_read_done() compares the checksum to the BP (compressed)
However, if compressed_arc is disabled:
- the block is read from disk
- It is decompressed
- It is added to the ARC (uncompressed)
- l2arc_write_buffers() will use l2arc_apply_transforms() to
recompress the block, before writing it to the L2ARC
- l2arc_read_done() compares the checksum to the BP (compressed)
- l2arc_read_done() will use l2arc_untransform() to uncompress it
This test writes out a test file to a pool consisting of one disk
and one cache device, then randomly reads from it. Since the arc_max
in the tests is low, this will feed the L2ARC, and result in reads
from the L2ARC.
We compare the value of the kstat l2_cksum_bad before and after
to determine if any blocks failed to survive the trip through the
L2ARC.
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Closes #10693
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux and FreeBSD will most likely never see this issue.
On macOS when kext is unloaded, but zed is still connected, zed
will be issued ENODEV. As the cdevsw is released, the kernel
will not have zfsdev_release() called to release minor/onexit/events,
and it "leaks". This ensures it is cleaned up before unload.
Changed the for loop from zsprev, to zsnext style, for less
code duplication.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jorgen Lundman <[email protected]>
Closes #10700
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use github workflow to run checkstyle
- use free (for OS projects) resources
- starts for every commit and branch
- work on forks, contributors may use it
before creating PRs
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #10705
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #10705
|
|
|
|
|
|
|
|
|
|
| |
- It doesn't work now.
- It has to be manually edited on tests changes.
(even on test runtime changes!)
- Travis gives too small time to run to be useful.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #10704
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Metaslabs are now (usually) loaded and unloaded infrequently, but when
that is not the case, it is useful to have a log of when and why these
events happened.
This commit enables the zfs_dbgmsg() in metaslab_load(), and adds a
zfs_dbgmsg() in metaslab_unload().
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10683
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The arc_adapt() function tunes LRU/MLU balance according to 4 types of
cache hits (which is passed as state agrument): ghost LRU, LRU, MRU,
ghost MRU. If this function is called with wrong cache hit (state),
adaptation will be sub-optimal and performance will suffer.
Some time ago upstream received this commit:
6950 ARC should cache compressed data) in arc_read() do next
sequence (access to ghost buffer)
Before this commit, hit to any ghost list was passed arc_adapt() before
call to arc_access() which revive element in cache and change state from
ghost to real hit.
After this commit, the order of calls was reverted and arc_adapt() is
now called only with «real» hits even if hit was in one of two ghost
lists, which renders ghost lists useless and breaks the ARC algorithm.
FreeBSD fixed this problem locally in Change D19094 / Commit r348772.
This change is an adaptation of the above commit to the current arc
code.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10548
Closes #10618
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 'zfs share -a' currently skips any filesystems which
have 'canmount=noauto' set. This behavior is unexpected since the
one would expect 'zfs share -a' to share any mounted filesystem
that has the 'sharenfs' property already set.
This changes the behavior of 'zfs share -a' to allow the sharing
of 'canmount=noauto' datasets if they are mounted.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: George Wilson <[email protected]>
External-issue: DLPX-71313
Closes #10688
|
|
|
|
|
|
|
|
|
|
|
| |
The KMOD name is "zfs" instead of "openzfs" when building in FreeBSD.
Define a ZFS_KMOD symbol as "zfs" when IN_BASE is defined, otherwise
"openzfs".
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Ryan Moeller <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10699
|
|
|
|
|
|
|
|
|
|
| |
The make_request_fn and associated API was replaced recently in a
Linux 5.9 merge, to replace its functionality with a new submit_bio
member in struct block_device_operations.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Coleman Kane <[email protected]>
Closes #10696
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change appears to primarily be a name change for the enum. Had
to update the test logic so that it works so long as either one of
these is present (favoring the newer one). Additionally, as this is
newer, it only shows up in node_page_item, so this commit doesn't
test zone_page_item for the same enum.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Coleman Kane <[email protected]>
Closes #10696
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Many of the block device operations (often functions with bdev in
the name) were moved into linux/blkdev.h from linux/fs.h. Seems
that this header is already included where needed in the code, but
in the autoconf tests it was missing causing false negatives. This
commit has those tests include linux/fs.h (old location) and now
also linux/blkdev.h (new locations).
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Coleman Kane <[email protected]>
Closes #10696
|
|
|
|
|
|
| |
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Closes #10694
|
|
|
|
|
|
|
|
|
|
| |
This was previously moved because nothing else in-tree uses it, but
evidently DilOS uses it out of tree.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10361
Closes #10685
|
|
|
|
|
|
| |
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10682
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In various other pieces of logic have resulted in situations where
we double-free space in ZFS. This in turn results in a double-add
to the range trees. These issues have been much more difficult to
diagnose than they should have been, because the error handling
around this case is much weaker than around the double remove case.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #10654
|
|
|
|
|
|
|
|
| |
Bring zfs-tests.sh in to compliance with the other scripts
by converting it /bin/sh for to avoid a dependency on bash.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10640
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
| |
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Ahrens <[email protected]>
Closes #10650
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pool-wide metadata is stored in the MOS (Meta Object Set). This
metadata is stored in triplicate, in addition to any pool-level
reduncancy (e.g. RAIDZ). However, if all 3+ copies of this metadata are
not available, we can still get EIO/ECKSUM when reading from the MOS.
If we encounter such an error in syncing context, we have typically
already committed to making a change that we now can't do because of the
corrupt/missing metadata. We typically "handle" this with a `VERIFY()`
or `zfs_panic_recover()`. This prevents the system from continuing on
in an undefined state, while minimizing the amount of error-handling
code.
However, there are some code paths that ignore these i/o errors, or
`ASSERT()` that they don't happen. Since assertions are disabled on
non-debug builds, they effectively ignore them as well. This can lead
to ZFS continuing on in an incorrect state, potentially leading to
on-disk inconsistencies.
This commit adds handling for these i/o errors on MOS metadata,
typically with a `VERIFY()`:
* Handle error return from `zap_cursor_retrieve()` in 4 places in
`dsl_deadlist.c`.
* Handle error return from `zap_contains()` in `dsl_dir_hold_obj()`.
Turns out this call isn't necessary because we can always call
`zap_lookup()`.
* Handle error return from `zap_lookup()` in `dsl_fs_ss_limit_check()`.
* Handle error return from `zap_remove()` in `dsl_dir_rename_sync()`.
* Handle error return from `zap_lookup()` in
`dsl_dir_remove_livelist()`.
* Handle error return from `dsl_process_sub_livelist()` in
`spa_livelist_delete_cb()`.
Additionally:
* Augment the internal history log message for `zfs destroy` to note
which method is used (e.g. bptree, livelist, or, synchronous) and the
mintxg.
* Correct a comment in `dbuf_init()`.
* Correct indentation in `dsl_dir_remove_livelist()`.
Reviewed by: Sara Hartse <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10643
|
|
|
|
|
|
| |
Authored-by: mjg <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10657
|
|
|
|
|
|
|
| |
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Ryan Moeller <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10668
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
3442c2a02d added new `arc_wait_for_eviction` tracepoint, which fails to
compile, when tracepoints are enabled.
The tracepoint definition begins with `DEFINE_ARC_WAIT_FOR_EVICTION_EVENT`
and is a multi-line definition, so this fixes the backslash
and parenthesis accordingly.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pavel Snajdr <[email protected]>
Closes #10669
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a minor change to the systemd service templates that verifies
the zfs kernel module is loaded by the kernel prior to attempting to
import any zpool.
The services check for the presence of /sys/module/zfs which indicates
the zfs is module is loaded. This uses the systemd built-in check
ConditionPathIsDirectory.
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Thode <[email protected]>
Signed-off-by: Jonathon Fernyhough <[email protected]>
Closes #10663
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In case the L2ARC rebuild was canceled, do not log to spa history
log as the pool may be in the process of being removed and a panic
may occur:
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:spa_history_log_internal+0xb1/0x120 [zfs]
Call Trace:
l2arc_rebuild+0x464/0x7c0 [zfs]
l2arc_dev_rebuild_start+0x2d/0x130 [zfs]
? l2arc_rebuild+0x7c0/0x7c0 [zfs]
thread_generic_wrapper+0x78/0xb0 [spl]
kthread+0xfb/0x130
? IS_ERR+0x10/0x10 [spl]
? kthread_park+0x90/0x90
ret_from_fork+0x35/0x40
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #10659
|
|
|
|
|
|
|
|
|
| |
zfs_jail was not using zfs_ioctl so failed to map the IOC number
correctly. Use zfs_ioctl to perform the jail ioctl and add a test
case for FreeBSD.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10658
|
|
|
|
|
|
|
|
|
|
|
| |
Must acquire the z_teardown_lock before accessing the zfsvfs_t object.
I can't reproduce this panic on demand, but this looks like the
correct solution.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Authored-by: asomers <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10656
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist
With a compression= property outside of the understood range:
Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.
After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Closes #10631
|
|
|
|
|
|
|
|
|
| |
A collection of header changes to enable FreeBSD to build
with vendored OpenZFS.
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #10635
|
|
|
|
|
|
|
|
|
|
|
| |
Change some comments copied from the Linux code to describe
the appropriate methods on FreeBSD.
Convert some tunables to ZFS_MODULE_PARAM so they get created
on FreeBSD.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10647
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10633
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: George Wilson <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10600
|
|
|
|
|
|
|
| |
Mark this as a known issue.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10655
|
|
|
|
|
|
|
|
|
|
| |
FreeBSD recently integrated a change which causes \s in a regex to
throw an error instead of silently being misinterpreted as an s.
Change the regex in zpool_colors.ksh to use [[:space:]].
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10651
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
FreeBSD uses more stack space in debug configurations and can overflow
the stack while formatting the error message when the call depth limit
of 20 frames is reached. This is readily reproduced by running the
gsub recursion test with increased kstack size. I hit the panic with
16 pages per kstack, and noticed it go away when bumped to 17.
Reserve an additional 64 bytes on the stack when building for FreeBSD.
This is enough to avoid the panic with a deep stack while not wasting
too much space when the default stack size is used.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #10634
|