| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Chroot'd process fails to automount snapshots due to realpath(3)
failure in mount.zfs(8).
Construct a mount point path from sb of the ctldir inode and dirent
name, instead of from d_path(), so that chroot'd process doesn't get
affected by its view of fs.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8903
Closes #8966
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After device removal, performing nopwrites on a dmu_sync-ed block
will result in a panic. This panic can show up in two ways:
1. an attempt to issue an IOCTL in vdev_indirect_io_start()
2. a failed comparison of zio->io_bp and zio->io_bp_orig in
zio_done()
To resolve both of these panics, nopwrites of blocks on indirect
vdevs should be ignored and new allocations should be performed on
concrete vdevs.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: George Wilson <[email protected]>
Closes #8957
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the new parallel allocators scheme, there is a possibility for
a problem where two threads, allocating from the same allocator at
the same time, conflict with each other. There are two primary cases
to worry about. First, another thread working on another allocator
activates the same metaslab that the first thread was trying to
activate. This results in the first thread needing to go back and
reselect a new metaslab, even though it may have waited a long time
for this metaslab to load. Second, another thread working on the same
allocator may have activated a different metaslab while the first
thread was waiting for its metaslab to load. Both of these cases
can cause the first thread to be significantly delayed in issuing
its IOs. The second case can also cause metaslab load/unload churn;
because the metaslab is loaded but not fully activated, we never set
the selected_txg, which results in the metaslab being immediately
unloaded again. This process can repeat many times, wasting disk and
cpu resources. This is more likely to happen when the IO of the first
thread is a larger one (like a ZIL write) and the other thread is
doing a smaller write, because it is more likely to find an
acceptable metaslab quickly.
There are two primary changes. The first is to always proceed with
the allocation when returning from metaslab_activate if we were
preempted in either of the ways described in the previous section.
The second change is to set the selected_txg before we do the call
to activate so that even if the metaslab is not used for an
allocation, we won't immediately attempt to unload it.
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
External-issue: DLPX-61314
Closes #8843
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.
This change adds check for empty sublists to avoid this.
Reviewed by: Sean Eric Fagan <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8909
|
|
|
|
|
|
|
|
|
|
| |
With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8951
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The "zfs remap" command was disabled by
6e91a72fe3ff8bb282490773bd687632f3e8c79d, because it has little utility
and introduced some tricky bugs. This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.
Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8944
|
|
|
|
|
|
|
|
|
| |
This patch corrects the error message reported when attempting
to promote a dataset outside of its encryption root.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8905
Closes #8935
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.
Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete. Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.
Porting notes: Added missing return value from cv_wait_sig()
Authored by: Don Brady <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Sara Hartse <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Don Brady <[email protected]>
Signed-off-by: Don Brady <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926
Closes #8904
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock
while assigning the tx, because this can lead to deadlock. Specifically,
if this dnode is already assigned to an earlier txg, this thread may
need to wait for that txg to sync (the ERESTART case below). The other
thread that has assigned this dnode to an earlier txg prevents this txg
from syncing until its tx can complete (calling dmu_tx_commit()), but it
may need to acquire the dn_struct_rwlock to do so (e.g. via
dmu_buf_hold*()).
This commit adds an assertion to dmu_tx_try_assign() to ensure that this
deadlock is not inadvertently introduced.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8929
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When exporting ZVOLs as SCSI LUNs, by default Windows will not
issue them UNMAP commands. This reduces storage efficiency in
many cases.
We add the SCSI_PASSTHROUGH flag to the zvol's device queue,
which lets the SCSI target logic know that it can handle SCSI
commands.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #8933
|
|
|
|
|
|
|
|
|
|
|
|
| |
`show_str` could be a pointer to a local variable in stack
which is out-of-scope by the time
`return (snprintf(buf, buflen, "%s\n", show_str));`
is called.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8924
Closes #8940
|
|
|
|
|
|
|
|
|
|
|
|
| |
The logic to handle strong checksum collisions where the data doesn't
match is incorrect. It is not clearing the dedup bit of the blkptr,
which can cause a panic later in zio_ddt_free() due to the dedup table
not matching what is in the blkptr.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-48097
Closes #8936
|
|
|
|
|
|
|
|
| |
Align vdev_ops_t from illumos for better compatibility.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Igor Kozhukhov <[email protected]>
Closes #8925
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.
Reviewed-by: Jason King <[email protected]>
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8737
Closes #8870
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x). The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.
ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful. The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad. This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.
Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.
Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).
In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection. It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).
We should remove the dedupditto functionality. For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect. For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate. However, ZFS won't write any new dedupditto
blocks.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Issue #8270
Closes #8310
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Prashanth Sreenivasa <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Chris Williamson <[email protected]>
Reviewed-by: Pavel Zhakarov <[email protected]>
Reviewed-by: Sebastien Roy <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #7958
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For busy ARC situation when arc_size close to arc_c is desired. But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result. Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.
Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.
I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8901
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.
On debug bits this assertion will fail in spa_vdev_copy_segment():
ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);
On nondebug, the system panics with a stack like:
metaslab_free_concrete()
metaslab_free_impl()
metaslab_free_impl_cb()
vdev_indirect_remap()
free_from_removing_vdev()
metaslab_free_impl()
metaslab_free_dva()
metaslab_free()
Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default. We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.
This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Sebastien Roy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-61342
Closes #8893
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable
compression (including of metadata). Ostensibly this helps the sync
passes to converge (i.e. for a sync pass to not need to allocate
anything because it is 100% overwrites).
However, in practice it increases the average number of sync passes,
because when we turn compression off, a lot of block's size will change
and thus we have to re-allocate (not overwrite) them. It also increases
the number of 128KB allocations (e.g. for indirect blocks and spacemaps)
because these will not be compressed. The 128K allocations are
especially detrimental to performance on highly fragmented systems,
which may have very few free segments of this size, and may need to load
new metaslabs to satisfy 128K allocations.
We should increase zfs_sync_pass_dont_compress. In practice on a highly
fragmented system we see a few 5-pass txg's, a tiny number of 6-pass
txg's, and no txg's with more than 6 passes.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-63431
Closes #8892
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.
Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8890
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On fragmented pools with high-performance storage, the looping in
metaslab_block_picker() can become the performance-limiting bottleneck.
When looking for a larger block (e.g. a 128K block for the ZIL), we may
search through many free segments (up to hundreds of thousands) to find
one that is large enough to satisfy the allocation. This can take a long
time (up to dozens of ms), and is done while holding the ms_lock, which
other threads may spin waiting for.
When this performance problem is encountered, profiling will show
high CPU time in metaslab_block_picker, as well as in mutex_enter from
various callers.
The problem is very evident on a test system with a sync write workload
with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage,
84% full and 88% fragmented. It has also been observed on production
systems with 90TB of storage, 76% full and 87% fragmented.
The fix is to change metaslab_df_alloc() to search only up to 16MB from
the previous allocation (of this alignment). After that, we will pick a
segment that is of the exact size requested (or larger). This reduces
the number of iterations to a few hundred on fragmented pools (a ~100x
improvement).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-62324
Closes #8877
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When running zloop, we occasionally see the following crash:
dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0)
ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3]
The error value 0x1c is ENOSPC.
The transaction used by spa_vdev_remove_thread() should not be able to
fail due to being out of space. i.e. we should not call
dmu_tx_hold_space(). This will allow the removal thread to schedule its
work even when the pool is low on space. The "slop space" will provide
enough free space to sync out the txg.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-37853
Closes #8889
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
sysfs_attr_init() is required to make lockdep happy for dynamically
allocated sysfs attributes. This fixed #8868 on Fedora 29 running
kernel-debug.
This requirement was introduced in 2.6.34.
See include/linux/sysfs.h for what it actually does.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Olaf Faaland <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8868
Closes #8884
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-58347
Closes #8862
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sometimes the target ARC size is reduced to arc_c_min, which impacts
performance. We've seen this happen as part of the random_reads
performance regression test, where the ARC size is reduced before the
reads test starts which impacts how long it takes for system to reach
good IOPS performance.
We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE,
and arc_available_memory() is less than arc_c>>arc_shrink_shift.
However, arc_available_memory() could easily be low, even when arc_c is
low, because we can have tons of unused bufs in the abd kmem cache. This
would be especially true just after the DMU requests a bunch of stuff be
evicted from the ARC (e.g. due to "zpool export").
To fix this, the ARC should reduce arc_c by the requested amount, not
all the way down to arc_size (or arc_c_min), which can be very small.
Reviewed-by: Tim Chase <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-59431
Closes #8864
|
|
|
|
|
|
|
|
|
| |
Fix typo in vdev_raidz_math.c
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brad Forschinger <[email protected]>
Closes #8875
Closes #8880
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Scatter ABD's are allocated from a number of pages. In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer. Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer. This can have a
measurable performance overhead on some workloads.
https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter. For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path. For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).
This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.
All single-page (4K) ABD's can be represented this way. Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages). This is often the case for 2-page (8K)
ABD's.
Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.
Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space. This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8580
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations. The large allocations are for ZIL blocks. If there is a
lot of fragmentation, the large allocations can be hard to satisfy.
The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more. In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.
To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.
External-issue: DLPX-61719
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8865
|
|
|
|
|
|
|
|
|
|
|
|
| |
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, that due to unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use
of signed comparison there fixes the problem.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8873
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch fixes an incorrect error message that comes up when
doing a non-forcing, raw, incremental receive into a dataset
that has a newer snapshot than the "from" snapshot. In this
case, the current code prints a confusing message about an IVset
guid mismatch.
This functionality is supported by non-raw receives as an
undocumented feature, but was never supported by the raw receive
code. If this is desired in the future, we can probably figure
out a way to make it work.
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Issue #8758
Closes #8863
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure,
on heavily fragmented pools they can still consume a significant
amount of memory. This problem is amplified when we fail to unload
metaslabs that we aren't using. Currently, we only unload a metaslab
during metaslab_sync_done; in order for that function to be called
on a given metaslab in a given txg, we have to have dirtied that
metaslab in that txg. If the dirtying was the result of an allocation,
we wouldn't be unloading it (since it wouldn't be 8 txgs since it
was selected), so in effect we only unload a metaslab during txgs
where it's being freed from.
We move the unload logic from sync_done to a new function, and
call that function on all metaslabs in a given vdev during
vdev_sync_done().
Reviewed-by: Richard Elling <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #8837
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch re-adds a check that was removed in 369aa50. The check
confirms that a raw receive is not occuring before truncating an
object's dn_maxblkid. At the time, it was believed that all cases
that would hit this code path would be handled in other places,
but that was not the case.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8852
Closes #8857
|
|
|
|
|
|
|
|
|
| |
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Closes #8822
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Historically while doing performance testing we've noticed that IOPS
can be significantly reduced when all vdevs in the pool are hitting
the zfs_mg_fragmentation_threshold percentage. Specifically in a
hypothetical pool with two vdevs, what can happen is the following:
Vdev A would go above that threshold and only vdev B would be used.
Then vdev B would pass that threshold but vdev A would go below it
(we've been freeing from A to allocate to B). The allocations would
go back and forth utilizing one vdev at a time with IOPS taking a hit.
Empirically, we've seen that our vdev selection for allocations is
good enough that fragmentation increases uniformly across all vdevs
the majority of the time. Thus we set the threshold percentage high
enough to avoid hitting the speed bump on pools that are being pushed
to the edge. We effectively disable its effect in the majority of the
cases but we don't remove (at least for now) just in case we hit any
weird behavior in the future.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8859
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ZFS on-disk format stores each inode's generation ID as a 64
bit number on disk and in-core. However, the Linux kernel's inode
is only a 32 bit number. In most places, the code handles this
correctly, but the cast is missing in zfs_rezget(). For many pools,
this isn't an issue since the generation ID is computed as the
current txg when the inode is created and many pools don't have
more than 2^32 txgs.
For the pools that have more txgs, this issue causes any inode with
a high enough generation number to report IO errors after a call to
"zfs rollback" while holding the file or directory open. This patch
simply adds the missing cast.
Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8858
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since zfs_znode_alloc() already takes dmu_buf_t*, taking another
uint64_t argument for objid is redundant. inode's ->i_ino does and
needs to match znode's ->z_id.
zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument
since vnode doesn't have vnode# in VFS (hence ->z_id exists).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8841
|
|
|
|
|
|
|
|
| |
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: DHE <[email protected]>
Closes #8733
Closes #8752
|
|
|
|
|
|
|
|
|
| |
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: TulsiJain <[email protected]>
Closes #8829
Closes #8289
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit ec4f9b8f30 which introduced a narrow race which
can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO. Resolve
the issue by revering this change to restore the previous behavior
which depends solely on checking the dirty list.
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8816
Closes #8834
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and
vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading
zfs.ko.
The rest of initialization functions being called here after cwd set
to / don't depend on cwd of the process except for spa_config_load().
spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when
`rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /,
so just unconditionally use the absolute path without "./", so that
`vn_set_pwd("/")` as well as the entire functions can be removed.
This is also what FreeBSD does.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8826
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When opening a log device during import its allocation bias will
not yet have been set by vdev_load(). This results in the log
device's ashift being incorrectly applied to the maximum ashift
of the vdevs in the normal class. Which in turn prevents the
removal of any top-level devices due to the ashift check in the
spa_vdev_remove_top_check() function.
This issue is resolved by including vdev_islog in the check since
it will be set correctly during vdev_open().
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8735
|
|
|
|
|
|
|
|
|
|
| |
dn->dn_datablksz type is uint32_t and need to be casted to uint64_t
to avoid an overflow when the record size is greater than 4 MiB.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olivier Mazouffre <[email protected]>
Closes #8778
Closes #8797
|
|
|
|
|
|
|
|
|
|
| |
This commits fixes a double-free in zfs_ioc_pool_create() triggered by
specifying an unsupported combination of properties when creating a pool
with encryption enabled.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8791
|
|
|
|
|
|
|
|
| |
These descriptions are not uptodate with the code.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8767
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, count_block() does not correctly account for the
possibility that the bp that is passed to it could be embedded.
These blocks shouldn't be counted since the work of scanning
these blocks in already handled when the containing block is
scanned. This patch simply resolves this issue by returning
early in this case.
Reviewed by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Authored-by: Bill Sommerfeld <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8800
Closes #8766
|
|
|
|
|
|
|
|
|
|
|
| |
wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd.
Directly call wait_on_page_bit() without using wait_on_page_writeback()
interface, given zfs_putpage() is the only caller for now.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: loli10K <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8794
|
|
|
|
|
|
|
|
| |
It's accessible via <sys/mntent.h>.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8765
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The issue is caused by an incorrect usage of the sizeof() operator
in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue
since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on
32-bit systems pointers are 4 bytes long which is not supported by
zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system
will cause the following failure:
VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22)
PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete()
Showing stack for process 1315
CPU: 6 PID: 1315 Comm: txg_sync Tainted: P OE 4.4.69+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0
d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341
da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76
Call Trace:
[<>] dump_stack+0x58/0x7c
[<>] spl_dumpstack+0x23/0x27 [spl]
[<>] spl_panic.cold.0+0x5/0x41 [spl]
[<>] ? dbuf_rele+0x3e/0x90 [zfs]
[<>] ? zap_lookup_norm+0xbe/0xe0 [zfs]
[<>] ? zap_lookup+0x57/0x70 [zfs]
[<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs]
[<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs]
[<>] ? txg_verify+0x1d/0x160 [zfs]
[<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs]
[<>] vdev_sync+0xbf/0x550 [zfs]
[<>] ? mutex_lock+0x10/0x30
[<>] ? txg_list_remove+0x9f/0x1a0 [zfs]
[<>] ? zap_contains+0x4d/0x70 [zfs]
[<>] spa_sync+0x9f1/0x1b10 [zfs]
...
[<>] ? kthread_stop+0x110/0x110
This commit simply corrects the "integer_size" parameter used to lookup
the vdev's ZAP object.
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8790
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CID 186143: Memory - illegal accesses (USE_AFTER_FREE)
This patch fixes an use-after-free in spa_import_progress_destroy()
moving the kmem_free() call at the end of the function.
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8788
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When reading kstats, the health (aka state) of the pool is stored into
/proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name().
However, during import/export there is a case where the spa exists,
but the root vdev does not exist. This fix checks that case and sets
the state to "TRANSITIONING"
Unfortunately, it is not easy to reproduce a test for this. It was
detected randomly during ZTS runs while kstats were also being sampled
regularly. After this change, further testing did not trip on the case
and the TRANSITIONING state was collected at least once by the kstats.
For posterity, the backtrace prior to this fix is:
[Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs]
...
Mon May 13 17:21:00 2019] Call Trace:
[Mon May 13 17:21:00 2019] spa_state_data+0x1a/0x40 [zfs]
[Mon May 13 17:21:00 2019] kstat_seq_show+0x117/0x440 [spl]
[Mon May 13 17:21:00 2019] seq_read+0xe5/0x430
[Mon May 13 17:21:00 2019] proc_reg_read+0x45/0x70
[Mon May 13 17:21:00 2019] __vfs_read+0x1b/0x40
[Mon May 13 17:21:00 2019] vfs_read+0x8e/0x130
[Mon May 13 17:21:00 2019] SyS_read+0x55/0xc0
[Mon May 13 17:21:00 2019] ? SyS_fcntl+0x5d/0xb0
[Mon May 13 17:21:00 2019] do_syscall_64+0x73/0x130
[Mon May 13 17:21:00 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Richard Elling <[email protected]>
Closes #8746
|