| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When building on musl, we get:
```
In file included from tests/zfs-tests/cmd/getversion.c:22:
/usr/include/sys/fcntl.h:1:2: error: #warning redirecting incorrect
#include <sys/fcntl.h> to <fcntl.h> [-Werror=cpp]
1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>
In file included from module/os/linux/zfs/vdev_file.c:36:
/usr/include/sys/fcntl.h:1:2: error: #warning redirecting incorrect
#include <sys/fcntl.h> to <fcntl.h> [-Werror=cpp]
1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>
```
Bug: https://bugs.gentoo.org/925235
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Sam James <[email protected]>
Closes #15925
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
a10e552 updated abd_free_linear_page() to no longer call
abd_update_scatter_stat(). This meant that linear pages that were not
attached to Direct I/O requests were not doing waste accounting for the
ARC. This led to performance issues due to incorrect ARC accounting that
resulted in 100% of CPU time being spent in arc_evict() during prolonged
I/O workloads with the ARC.
The call to abd_update_scatter_stats() is now conditionally called in
abd_free_linear_page() when the ABD is not from a Direct I/O request.
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Closes #16729
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, even though send_reader_thread prefetches spill block,
do_dump() will not use it and issues its own blocking arc_read. This
causes significant performance degradation when sending datasets with
lots of spill blocks.
For unmodified spill blocks, we also create send_range struct for them
in send_reader_thread and issue prefetches for them. We piggyback them
on the dnode send_range instead of enqueueing them so we don't break
send_range_after check.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Co-authored-by: david.chen <[email protected]>
Closes #16701
|
|
|
|
|
|
|
|
|
|
|
|
| |
Avoids using fallback_migrate_folio, which starts unnecessary writeback
(leading to BUG in migrate_folio_extra).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes #16568
Closes #16723
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit b052035990594408899fa32fd4ad6603b75b6c6d.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes #16568
Closes #16723
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the previous patch this yields
$ size -G ./module/zfs.ko ./module/zfs.new.ko
text data bss total filename
2865126 1597982 755768 5218876 ./module/zfs.ko
2864038 1429784 755768 5049590 ./module/zfs.new.ko
-1088 -168198
-1k -164k
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #16704
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Small block workloads may use a very large number of dirty records.
During simple block cloning test due to BRT still using 4KB blocks
I can easily see up to 2.5M of those used. Before this change
dbuf_dirty_record_t structures representing them were allocated via
kmem_zalloc(), that rounded their size up to 512 bytes.
Introduction of specialized kmem cache allows to reduce the size
from 512 to 408 bytes. Additionally, since override and raw params
in dirty records are mutually exclusive, puting them into a union
allows to reduce structure size down to 368 bytes, increasing the
saving to 28%, that can be a 0.5GB or more of RAM.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16694
|
|
|
|
|
|
|
|
|
|
|
| |
I think we've done enough experiments.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16189
Closes #16712
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #16702
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we can handle these different alignments, we don't this
workaround.
This reverts commit aefc2da8a594d7a8059c862eab464d5f798393b3.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Freeing an ABD can take sleeping locks to update various stats. We
aren't allowed to sleep on an interrupt handler. So, move the free off
to the io_done callback.
We should never have been freeing things in the interrupt handler, but
we got away with it because we were usually freeing a linear ABD, which
at most is returning two objects to a cache and never sleeping. Scatter
ABDs can be used now, and those have more complex locking.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.
This updates the alignment check function to enforce these rules (to the
extent possible).
Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.
So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16687 #16631 #15646 #15533 #14533
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While reading some code @grwilson came across the above function that
seemingly had no consumers besides a ztest callback that ensures that
the tx_callback infrastructure works correctly. It turns out that Lustre
is the main (and potentially the only) consumer of this. Refer to
`osd_trans_commit_cb` of `lustre/osd-zfs/osd_handler.c` in the Lustre
repo for more info. Let's add a comment highlighting this before someone
removes it by mistake.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #16698
|
|
|
|
|
|
|
|
|
|
|
|
| |
If on the first open device's logical ashift is bigger than set
by pool's ashift property, ignore the last as unusable instead of
creating vdev that will fail most of I/Os due to misalignment.
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16690
|
|
|
|
|
|
|
|
|
|
|
| |
In FreeBSD's `zio_do_crypt_data()`, ensure that two `struct uio`
variables are cleared before copying data out of them. This avoids
accessing garbage data, and fixes gcc `-Wuninitialized` warnings.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Toomas Soome <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Dimitry Andric <[email protected]>
Closes #16688
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in some cases not linearizing buffers with disk sector crossing a
page boundary. It is fine for hardware, but somehow required by LUKS.
It is not typical for ZFS to produce such buffers, but it may happen
if 6KB block is compressed to 4KB, while still having 2KB alignment.
Banning the 6KB buffers helps vdevs with ashifh=12.
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before 4.20, kernel_siginfo_t was just called siginfo_t. This was
causing the kthread_dequeue_signal_3arg_task check, which uses
kernel_siginfo_t, to fail on older kernels.
In d6b8c17f1, we started checking for the "new" three-arg
dequeue_signal() by testing for the "old" version. Because that test is
explicitly using kernel_siginfo_t, it would fail, leading to the build
trying to use the new three-arg version, which would then not compile.
This commit fixes that by avoiding checking for the old 3-arg
dequeue_signal entirely. Instead, we check for the new one, as well as
the 4-arg form, and we use the old form as a fallback. This way, we
never have to test for it explicitly, and once we're building
HAVE_SIGINFO will make sure we get the right kernel_siginfo_t for it, so
everything works out nice.
Original-patch-by: Finix <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16666
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While mounting ZFS root during boot on Linux distributions from initrd,
mount from busybox is effectively used which executes mount system call
directly. This skips the ZFS helper mount.zfs, which checks and enables
the mount options as specified in dataset properties. As a result,
datasets mounted during boot from initrd do not have correct mount
options as specified in ZFS dataset properties.
There has been an attempt to use mount.zfs in zfs initrd script,
responsible for mounting the ZFS root filesystem (PR#13305). This was
later reverted (PR#14908) after discovering that using mount.zfs breaks
mounting of snapshots on root (/) and other child datasets of root have
the same issue (Issue#9461).
This happens because switching from busybox mount to mount.zfs correctly
parses the mount options but also adds 'mntpoint=/root' to the mount
options, which is then prepended to the snapshot mountpoint in
'.zfs/snapshot'. '/root' is the directory on Debian with initramfs-tools
where root filesystem is mounted before pivot_root. When Linux runtime
is reached, trying to access the snapshots on root results in
automounting the snapshot on '/root/.zfs/*', which fails.
This commit attempts to fix the automounting of snapshots on root, while
using mount.zfs in initrd script. Since the mountpoint of dataset is
stored in vfs_mntpoint field, we can check if current mountpoint of
dataset and vfs_mntpoint are same or not. If they are not same, reset
the vfs_mntpoint field with current mountpoint. This fixes the
mountpoints of root dataset and children in respective vfs_mntpoint
fields when we try to access the snapshots of root dataset or its
children. With correct mountpoint for root dataset and children stored
in vfs_mntpoint, all snapshots of root dataset are mounted correctly
and become accessible.
This fix will come into play only if current process, that is trying to
access the snapshots is not in chroot context. The Linux kernel API
that is used to convert struct path into char format (d_path), returns
the complete path for given struct path. It works in chroot environment
as well and returns the correct path from original filesystem root.
However d_path fails to return the complete path if any directory from
original root filesystem is mounted using --bind flag or --rbind flag
in chroot environment. In this case, if we try to access the snapshot
from outside the chroot environment, d_path returns the path correctly,
i.e. it returns the correct path to the directory that is mounted with
--bind flag. However inside the chroot environment, it only returns the
path inside chroot.
For now, there is not a better way in my understanding that gives the
complete path in char format and handles the case where directories from
root filesystem are mounted with --bind or --rbind on another path which
user will later chroot into. So this fix gets enabled if current
process trying to access the snapshot is not in chroot context.
With the snapshots issue fixed for root filesystem, using mount.zfs in
ZFS initrd script, mounts the datasets with correct mount options.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes #16646
|
|
|
|
|
|
|
|
|
|
| |
This partially reverts commit 41210597. Now that b4e4cbeb2 has
been merged Direct IO can be enabled by default for Linux, but
for FreeBSD there still remains a potentially insufficient range
locking in zfs_getpages() which needs to be resolved.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #16629
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead checksum
verify failures. However, the disk contents are still correct, and this
would lead to false reporting of checksum validation failures.
To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, then the I/O request will be
reissued though the ARC. This allows for actual validation to happen and
removes any possibility of the buffer being manipulated after the I/O
has been issued.
Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported though zpool status -d in
the DIO column. Also the zevent has been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes
2. dio_verify_rd -> Checksum verification failure for reads.
This allows for determining what I/O operation was the culprit for the
checksum verification failure. All DIO errors are reported only on the
top-level VDEV.
Even though FreeBSD can write protect pages (stable pages) it still has
the same issue as Linux with Direct I/O reads.
This commit updates the following:
1. Propogates checksum failures for reads all the way up to the
top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for
read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
handle ABD buffer contents validation the same as Linux.
5. Updated manipulate_user_buffer.c to also manipulate a buffer while a
Direct I/O read is taking place.
6. Adds a new ZTS test case dio_read_verify that stress tests the new
code.
7. Updated man pages.
8. Added an IMPLY statement to zio_checksum_verify() to make sure that
Direct I/O reads are not issued as speculative.
9. Removed self healing through mirror, raidz, and dRAID VDEVs for
Direct I/O reads.
This issue was first observed when installing a Windows 11 VM on a ZFS
dataset with the dataset property direct set to always. The zpool
devices would report checksum failures, but running a subsequent zpool
scrub would not repair any data and report no errors.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Closes #16598
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`zvol_rename_minors()` needs to be given the full path not just the
snapshot name. Use code removed in a0bd735ad as a guide
to providing the necessary values.
Add ZTS check for /dev changes after snapshot rename. After
renaming a snapshot with 'snapdev=visible' ensure that the /dev
entries are updated to reflect the rename.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: James Dingwall <[email protected]>
Closes #14223
Closes #16600
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since arc_evict() run can take some time, arc_c change during it
may result in undesired shift in ARC states balance. Primarily in
case of arc_c reduction it may cause eviction from MFU data state
despite its being below the target already. Instead we should
evict as originally planned and if needed do another round after.
Reviewed-by: Theera K. <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16576
Closes #16605
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Restart a resilver from scratch, if the current one in progress is
below a new tunable, zfs_resilver_defer_percent (defaulting to 10%).
The original rationale for deferring additional resilvers, when there is
already one in progress, was to help achieving data redundancy sooner
for the data that gets scanned at the end of the resilver.
But in case the admin wants to attach multiple disks to a single vdev,
it wasn't immediately obvious the admin is supposed to run
`zpool resilver` afterwards to reset the deferred resilvers and start
a new one from scratch.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pavel Snajdr <[email protected]>
Closes #15810
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a4b21eadec we added the zap_micro_max_size tuneable to raise the size
at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block)
ZAPs. Before this, a microZAP was limited to 128KiB, which was the old
largest block size. The side effect of raising the max size past 128KiB
is that it be stored in a large block, requiring the large_blocks
feature.
Unfortunately, this means that a backup stream created without the
--large-block (-L) flag to zfs send would split the microZAP block into
smaller blocks and send those, as is normal behaviour for large blocks.
This would be received correctly, but since microZAPs are limited to the
first block in the object by definition, the entries in the later blocks
would be inaccessible. For directory ZAPs, this gives the appearance of
files being lost.
This commit adds a feature flag, large_microzap, that must be enabled
for microZAPs to grow beyond 128KiB, and which will be activated the
first time that occurs. This feature is later checked when generating
the stream and if active, the send operation will abort unless
--large-block has also been requested.
Changing the limit still requires zap_micro_max_size to be changed. The
state of this flag effectively sets the upper value for this tuneable,
that is, if the feature is disabled, the tuneable will be clamped to
128KiB.
A stream flag is also added to ensure that the receiver also activates
its own feature flag upon receiving the stream. This is not strictly
necessary to _use_ the received microZAP, since it doesn't care how
large its block is, but it is required to send the microZAP object on,
otherwise the original problem occurs again.
Because it's difficult to reliably distinguish a microZAP from a fatZAP
from outside the ZAP code, and because it seems unlikely that most
users are affected (a fairly niche tuneable combined with what should be
an uncommon use of send), and for the sake of expediency, this change
activates the feature the first time a microZAP grows to use a large
block, and is never deactivated after that. This can be improved in the
future.
This commit changes nothing for existing pools that already have large
microZAPs. The feature will not be retroactively applied, but will be
activated the next time a microZAP grows past the limit.
Don't use large_blocks feature for enable/disable tests. The
large_microzap depends on large_blocks, so it gets enabled as a
dependency, breaking the test. Instead use feature "longname", which has
the exact same feature characteristics.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16593
|
|
|
|
|
|
|
|
|
| |
While some remaining issues are resolved with the recently merged
Direct IO functionality disable it by default.
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #16597
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In some environments, just making the .zfs control dir hidden from sight
might not be enough. In particular, the following scenarios might
warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents
Introducing a new 'disabled' value that not only hides the control dir,
but prevents access to its contents by returning ENOENT solves all of
the above.
The new property value takes advantage of 'iuv' semantics ("ignore
unknown value") to automatically fall back to the old default value when
a pool is accessed by an older version of ZFS that doesn't yet know
about 'disabled' semantics.
I think that technically the zfs_dirlook change is enough to prevent
access, but preventing lookups and dir entries in an already opened .zfs
handle might also be a good idea to prevent races when modifying the
property at runtime.
Add zfs_snapshot_no_setuid parameter to control whether automatically
mounted snapshots have the setuid mount option set or not.
this could be considered a partial fix for one of the scenarios
mentioned in desired.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Co-authored-by: Fabian Grünbichler <[email protected]>
Closes #3963
Closes #16587
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Compiling with -O0 (no proper optimizations), strlen() call
in loops for comparing the size, isn't being called/initialized
before the actual loop gets started, which causes n-numbers of
strlen() calls (as long as the string is). Keeping the length
before entering in the loop is a good idea.
On some places, even with -O2, both GCC and Clang can't
recognize this pattern, which seem to happen in an array
of char pointer.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: rilysh <[email protected]>
Closes #16584
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
torvalds/linux@09022bc196d2 removes the flag, and the corresponding
SetPageError() and ClearPageError() macros, with no replacement offered.
Going back through the upstream history, use of this flag has been
gradually removed over the last year as part of the long tail of
converting everything to folios. Interesting tidbit comments from
torvalds/linux@29e9412b250e and torvalds/linux@420e05d0de18 suggest that
this flag has not been used meaningfully since page writeback failures
started being recorded in errseq_t instead (the whole "fsyncgate" thing,
~2017, around torvalds/linux@8ed1e46aaf1b).
Given that, it's possible that since perhaps Linux 4.13 we haven't been
getting anything by setting the flag. I don't know if that's true and/or
if there's something we should be doing instead, but my gut feel is that
its probably fine we only use the page cache as a proxy to allow mmap()
to work, rather than backing IO with it.
As such, I'm expecting that removing this will do no harm, but I'm
leaving it in for older kernels to maintain status quo, and if there is
an overall better way, that is left for a future change.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16582
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See torvalds/linux@a2b80ce87a87. It claims the task arg is always
`current`, and so it is with us, so this is a safe change to make. The
only spanner is that we also support the older pre-5.17 3-arg
dequeue_signal() which had different meaning, so we have to check the
types to get the right one.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16582
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds the ability for zfs to support file/dir name up to 1023
bytes. This number is chosen so we can support up to 255 4-byte
characters. This new feature is represented by the new feature flag
feature@longname.
A new dataset property "longname" is also introduced to toggle longname
support for each dataset individually. This property can be disabled,
even if it contains longname files. In such case, new file cannot be
created with longname but existing longname files can still be looked
up.
Note that, to my knowledge native Linux filesystems don't support name
longer than 255 bytes. So there might be programs not able to work with
longname.
Note that NFS server may needs to use exportfs_get_name to reconnect
dentries, and the buffer being passed is limit to NAME_MAX+1 (256). So
NFS may not work when longname is enabled.
Note, FreeBSD vfs layer imposes a limit of 255 name lengh, so even
though we add code to support it here, it won't actually work.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #15921
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch is preparatory work for long name feature. It changes all
users of zap_attribute_t to allocate it from kmem instead of stack. It
also make zap_attribute_t and zap_name_t structure variable length.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #15921
|
|
|
|
|
|
|
|
|
|
|
| |
Specifically, a child in a replacing vdev won't count when assessing
the dtl during a vdev_fault()
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #16569
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux 6.10+ with CONFIG_FORTIFY_SOURCE notices memcpy() accessing past
the end of TString, because it has no indication that there there may be
an additional allocation there.
There's no appropriate upstream change for this (ancient) version of
Lua, so this is the narrowest change I could come up with to add a flex
array field to the end of TString to satisfy the check. It's loosely
based on changes from lua/lua@ca41b43f and lua/lua@9514abc2.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16541
Closes #16583
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZIL log record structs (lr_XX_t) are frequently allocated with extra
space after the struct to carry variable-sized "payload" items.
Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime
bounds checking on memcpy() calls. Because these types had no indicator
that they might use more space than their simple definition,
__fortify_memcpy_chk will frequently complain about overruns eg:
memcpy: detected field-spanning write (size 7) of single field
"lr + 1" at zfs_log.c:425 (size 0)
memcpy: detected field-spanning write (size 9) of single field
"(char *)(lr + 1)" at zfs_log.c:593 (size 0)
memcpy: detected field-spanning write (size 4) of single field
"(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
memcpy: detected field-spanning write (size 7) of single field
"lr + 1" at zfs_log.c:425 (size 0)
memcpy: detected field-spanning write (size 9) of single field
"(char *)(lr + 1)" at zfs_log.c:593 (size 0)
memcpy: detected field-spanning write (size 4) of single field
"(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
memcpy: detected field-spanning write (size 7) of single field
"lr + 1" at zfs_log.c:425 (size 0)
memcpy: detected field-spanning write (size 9) of single field
"(char *)(lr + 1)" at zfs_log.c:593 (size 0)
memcpy: detected field-spanning write (size 4) of single field
"(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
To fix this, this commit adds flex array fields to all lr_XX_t structs
that require them, and then uses those fields to access that
end-of-struct area rather than more complicated casts and pointer
addition.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16501
Closes #16539
|
|
|
|
|
|
|
|
|
|
|
| |
Linux page migration code won't wait for writeback to complete unless
it needs to call release_folio. Call SetPagePrivate wherever
PageUptodate is set and define .release_folio, to cause
fallback_migrate_folio to wait for us.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes #15140
Closes #16568
|
|
|
|
|
|
|
|
|
|
|
| |
Since dsl_crypto_key_open() references the key, 0d23f5e2e4 should
have called dsl_crypto_key_rele() to drop it first instead of
calling dsl_crypto_key_free() directly. The final result should
actually be the same, but without triggering dck_holds assertion.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16567
|
|
|
|
|
|
|
|
|
|
|
| |
Couple places in the code depend on 0 returned only if the task was
actually cancelled. Doing otherwise could lead to extra references
being dropped. The race could be small, but I believe CI hit it
from time to time.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16565
|
|
|
|
|
|
|
|
|
| |
This adds the HAVE_KERNEL_NEON and HAVE_KERNEL_FPU_INTERNAL
guards to simd_stat.c defaulted to 0 to make it build again.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Shengqi Chen <[email protected]>
Signed-off-by: Sebastian Wuerl <[email protected]>
Closes #16558
|
|
|
|
|
|
|
|
|
|
|
|
| |
Without updating 'm' we evict from MFU metadata all that we wanted
to evict from all metadata, including already evicted MRU metadata
('m' is the total amount of metadata we had at the beginning,
and 'w' is the total amount of metadata we want to have).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Theera K. <[email protected]>
Closes #16521
Closes #16546
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since the person using the kernel may not be the person who built it,
show a warning at module load too, in case they aren't aware that it
might be weird.
Reviewed-by: Robert Evans <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15986
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's the main recommendation to set xattr=sa
even in man pages, so let's set it by default.
xattr=sa don't use feature flag, so in the worst
case we'll have non-readable xattrs by other
non-openzfs platforms.
Non-overridden default `xattr` prop of existing pools
will automatically use `sa` after this commit too.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #15147
|
|
|
|
|
|
|
|
| |
Evidently while reworking it on aarch64, I broke it on x86 and
didn't notice.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #16556
|
|
|
|
|
|
|
|
|
|
|
| |
I accidentally removed this in c22d56e3e, and didn't notice because it
doesn't fail the build, but does fail to load into the kernel because it
can't link it.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16554
|
|
|
|
|
|
|
|
|
| |
This change accidentally broke the FreeBSD build due to
a conflict between the simd_stat_init()/simd_stat_fini()
macros on FreeBSD and the extern function prototype.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #16552
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Too many times, people's performance problems have amounted to
"somehow your SIMD support isn't working", and determining that
at runtime is difficult to describe to people.
This adds a /proc/spl/kstat/zfs/simd node, which exposes
metadata about which instructions ZFS thinks it can use,
on AArch64 and x86_64 Linux, to make investigating things
like this much easier.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #16530
|
|
|
|
|
|
|
|
|
|
|
| |
On compression we could be more explicit here for cases
where we can not recompress the data.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Co-authored-by: Alexander Motin <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #9416
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZLE compressor needs additional bytes to process
d_len argument efficiently.
Don't use BPE_PAYLOAD_SIZE as d_len with it
before we rework zle compressor somehow.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #9416
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now default compression is lz4, which can stop
compression process by itself on incompressible data.
If there are additional size checks -
we will only make our compressratio worse.
New usable compression thresholds are:
- less than BPE_PAYLOAD_SIZE (embedded_data feature);
- at least one saved sector.
Old 12.5% threshold is left to minimize affect
on existing user expectations of CPU utilization.
If data wasn't compressed - it will be saved as
ZIO_COMPRESS_OFF, so if we really need to recompress
data without ashift info and check anything -
we can just compress it with zero threshold.
So, we don't need a new feature flag here!
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes #9416
|
|
|
|
|
|
|
|
|
|
| |
Just nice and simple, with room to grow.
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
|
|
|
|
|
|
|
|
|
|
|
|
| |
For now, userspace has no znode implementation. Some of the property and
path handling code is used there though and is the same on all
platforms, so we only need a single copy of it.
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
|