When a pool is imported it will scan the pool to verify the integrity
of the data and metadata. The amount it scans will depend on the
import flags provided. On systems with small amounts of memory or
when importing a pool from the crash kernel, it's possible for
spa_load_verify to issue so many I/Os that it consumes all the memory
of the system, resulting in an OOM message or a hang.
To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: George Wilson <[email protected]>
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
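
A minimal sketch of the throttling pattern described above, with hypothetical names (the real limit defaults to 1/16th of the ARC):
```c
/* Sketch: cap the bytes held by in-flight verify I/Os (names hypothetical). */
#include <stdint.h>
#include <pthread.h>

static pthread_mutex_t verify_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  verify_cv   = PTHREAD_COND_INITIALIZER;
static uint64_t verify_inflight;                 /* bytes currently in flight */
static uint64_t verify_limit = 64 * 1024 * 1024; /* e.g. arc_max / 16 */

void verify_io_start(uint64_t size)
{
	pthread_mutex_lock(&verify_lock);
	while (verify_inflight + size > verify_limit)
		pthread_cond_wait(&verify_cv, &verify_lock); /* throttle issuer */
	verify_inflight += size;
	pthread_mutex_unlock(&verify_lock);
}

void verify_io_done(uint64_t size)
{
	pthread_mutex_lock(&verify_lock);
	verify_inflight -= size;
	pthread_cond_broadcast(&verify_cv);          /* wake throttled issuers */
	pthread_mutex_unlock(&verify_lock);
}
```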
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may contain only 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.
This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #9055
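
A rough sketch of the caching-and-aging idea, with illustrative names and a made-up TTL; the real heuristics live in the metaslab weighting code:
```c
/* Sketch: keep a metaslab's largest-free-segment size across unload. */
#include <stdint.h>
#include <stdbool.h>
#include <time.h>

struct ms_cache {
	uint64_t max_size;    /* largest free segment last known */
	time_t   unload_time; /* when the metaslab was unloaded */
};

#define MS_CACHE_TTL 300      /* illustrative: age the value out after 5 min */

/* A free (or coalesced group of frees) larger than the cache raises it. */
void ms_cache_note_free(struct ms_cache *mc, uint64_t size)
{
	if (size > mc->max_size)
		mc->max_size = size;
}

/* Treat the cached value as a lower bound unless we're in try_hard. */
bool ms_cache_may_satisfy(const struct ms_cache *mc, uint64_t asize,
    bool try_hard)
{
	if (try_hard)
		return true;   /* try_hard ignores the cached bound */
	if (time(NULL) - mc->unload_time > MS_CACHE_TTL)
		return true;   /* stale: frees since unload may have grown it */
	return asize <= mc->max_size;
}
```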
zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
gs_devid is NULL, but the nvl has a "devid" entry.
zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is
present in nvl, but then later it and zfs_agent_iter_vdev() assume that
DEV_IDENTIFIER is present and thus gs_devid is set.
Typically this is not a problem because usually either all vdevs have
devids, or none of them do. Since zfs_agent_iter_vdev() first checks if
the vdev has a devid before dereferencing gs_devid, the problem isn't
typically encountered. However, if some vdevs have devids and some do
not, the problem is easily reproduced. This can happen if the pool
has been moved from a system that has devids to one that does not.
The fix is for zfs_agent_iter_vdev() to only try to match the devids if
both nvl and gsp have devids present.
Reviewed-by: Prashanth Sreenivasa <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: loli10K <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
External-issue: DLPX-65090
Closes #9054
Closes #9060
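
The shape of the fix, sketched with simplified stand-ins for the nvlist lookup and the guid-search structure:
```c
/* Sketch: match on devid only when both sides actually carry one. */
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

struct guid_search { const char *gs_devid; };

static bool devid_match(const struct guid_search *gsp, const char *nvl_devid)
{
	/* Before the fix, gs_devid could be dereferenced while NULL here. */
	if (gsp->gs_devid == NULL || nvl_devid == NULL)
		return false;  /* no devid on one side: not a devid match */
	return strcmp(gsp->gs_devid, nvl_devid) == 0;
}
```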
Deleting a clone requires finding blocks that are clone-only, not
shared with the snapshot. This was done by traversing the entire block
tree, which results in a large performance penalty for sparsely
written clones.
This new method keeps track of clone blocks as they are
modified in a "Livelist" so that, when it's time to delete,
the clone-specific blocks are already at hand.
We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Sara Hartse <[email protected]>
Closes #8416
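
A toy sketch of the bookkeeping idea, with a growable array standing in for the on-disk livelist:
```c
/* Sketch: record blocks as the clone modifies them so deletion can free
 * them directly instead of traversing the whole block tree. */
#include <stdint.h>
#include <stdlib.h>

struct livelist {
	uint64_t *blocks;  /* identifiers of clone-modified blocks */
	size_t count, cap;
};

void livelist_record(struct livelist *ll, uint64_t blkid)
{
	if (ll->count == ll->cap) {
		size_t ncap = ll->cap ? ll->cap * 2 : 64;
		uint64_t *nb = realloc(ll->blocks, ncap * sizeof (uint64_t));
		if (nb == NULL)
			abort();   /* sketch: no graceful recovery */
		ll->blocks = nb;
		ll->cap = ncap;
	}
	ll->blocks[ll->count++] = blkid;
}
/* Deletion work is then proportional to ll->count, not dataset size. */
```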
In the past we've seen multiple race conditions that have
to do with open-context threads, async threads, and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).
This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning a new error code created just for
this situation, thereby eliminating any race condition
bugs introduced by concurrent calls to this function.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #9015
Closes #9044
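
A minimal sketch of the single-flight pattern; EBUSY stands in for the new error code the patch introduces:
```c
/* Sketch: only one thread may run the export body at a time. */
#include <pthread.h>
#include <stdbool.h>
#include <errno.h>

static pthread_mutex_t export_lock = PTHREAD_MUTEX_INITIALIZER;
static bool export_in_progress;

int spa_export_sketch(void)
{
	pthread_mutex_lock(&export_lock);
	if (export_in_progress) {
		pthread_mutex_unlock(&export_lock);
		return EBUSY;  /* concurrent caller bails out immediately */
	}
	export_in_progress = true;
	pthread_mutex_unlock(&export_lock);

	/* ... main body of the export runs here, race-free ... */

	pthread_mutex_lock(&export_lock);
	export_in_progress = false;
	pthread_mutex_unlock(&export_lock);
	return 0;
}
```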
Creating a pool with no features enabled and running
`zdb -mmmmmm` on it before the patch:
```
Log Space Maps in Pool:
Log Space Map Obsolete Entry Statistics:
0 valid entries out of 0 - txg 0
0 valid entries out of 0 - total
```
After this patch the above output goes away.
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Sara Hartse <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #9048
The zfs-volume-wait.service scans existing zvols and waits for their
links under /dev to be created. Any service that depends on zvol
links being present should add a dependency on zfs-volumes.target.
By default, this target is not enabled.
Reviewed-by: Fabian Grünbichler <[email protected]>
Reviewed-by: Antonio Russo <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: loli10K <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pavel Zakharov <[email protected]>
Closes #8975
Adds the ability to sanity check zfs create arguments and to see the
value of any additional properties that will be local to the dataset. For
example, automation that may need to adjust quota on a parent filesystem
before creating a volume may call `zfs create -nP -V <size> <volume>` to
obtain the value of refreservation. This adds the following options to
zfs create:
- -n dry-run (no-op)
- -v verbose
- -P parseable (implies verbose)
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Jerry Jelinek <[email protected]>
Signed-off-by: Mike Gerdts <[email protected]>
Closes #8974
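
A sketch of the flag parsing, reduced to just these three options (the real command accepts many more):
```c
/* Sketch: -n dry-run, -v verbose, -P parseable (implies verbose). */
#include <stdbool.h>
#include <unistd.h>

struct create_opts { bool dryrun, verbose, parseable; };

int parse_create_opts(int argc, char **argv, struct create_opts *o)
{
	int c;
	while ((c = getopt(argc, argv, "nvP")) != -1) {
		switch (c) {
		case 'n': o->dryrun = true; break;
		case 'v': o->verbose = true; break;
		case 'P': o->parseable = true; o->verbose = true; break;
		default: return -1;  /* unknown option */
		}
	}
	return 0;
}
```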
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spent on I/Os that update on-disk space accounting metadata.
Specifically, we've seen cases where 20% to 40% of sync time is spent
after sync pass 1 and ~30% of the I/Os on the system are spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try to decrease the number of metaslabs so we have fewer
I/Os per TXG, but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded into memory from disk.
In addition, after it's loaded, its range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem remains that the number of I/Os still goes up
as the number of metaslabs increases, and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constrained within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 in 1%. This
emphasizes the fact that not only do we spend less time on metadata
but we also iterate fewer times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPS that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8442
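
For reference, the arithmetic from the motivation section, spelled out:
```c
/* Worked version of the 1800-I/Os-per-vdev-per-TXG estimate above. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t metaslabs = 200;               /* per top-level vdev */
	uint64_t spacemap_ios = metaslabs;      /* one appended block each */
	uint64_t indirect_ios = 2 * metaslabs;  /* 2 levels of indirection */
	uint64_t copies = 3;                    /* metadata kept in triplicate */

	printf("%llu I/Os per vdev per TXG\n", (unsigned long long)
	    ((spacemap_ios + indirect_ios) * copies));  /* prints 1800 */
	return 0;
}
```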
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Fabian Grünbichler <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Antonio Russo <[email protected]>
Closes #8750
Closes #8848
Modify zfs-mount-generator to produce a dependency on new
zfs-import-key-*.service units, dynamically created at boot to call
zfs load-key for the encryption root, before attempting to mount any
encrypted datasets.
These units are created by zfs-mount-generator; they declare
RequiresMountsFor on the keyfile, if present, or call
systemd-ask-password if a passphrase is requested.
This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and
@rlaager, as well an adaptation of @rlaager's script to retry on
incorrect password entry.
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Fabian Grünbichler <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Antonio Russo <[email protected]>
Closes #8750
Closes #8848
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels. This is accomplished by leveraging
the fact that by definition dedicated kernel threads never need
to concern themselves with saving and restoring the user FPU state.
Therefore, they may use the FPU as long as we can guarantee user
tasks always restore their FPU state before context switching back
to user space.
For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used. All non-kernel
threads will restore the preserved user FPU state.
For 5.2 and later kernels the user FPU state restoration will be
skipped if the kernel determines the registers have not changed.
Therefore, for these kernels we need to perform the additional
step of saving and restoring the FPU registers. Invalidating the
per-cpu global tracking the FPU state would force a restore but
that functionality is private to the core x86 FPU implementation
and unavailable.
In practice, restricting SIMD to kernel threads is not a major
restriction for ZFS. The vast majority of SIMD operations are
already performed by the IO pipeline. The remaining cases are
relatively infrequent and can be handled by the generic code
without significant impact. The two most noteworthy cases are:
1) Decrypting the wrapping key for an encrypted dataset,
i.e. `zfs load-key`. All other encryption and decryption
operations will use the SIMD optimized implementations.
2) Generating the payload checksums for a `zfs send` stream.
In order to avoid making any changes to the higher layers of ZFS
all of the `*_get_ops()` functions were updated to take into
consideration the calling context. This allows for the fastest
implementation to be used as appropriate (see kfpu_allowed()).
The only other notable instance of SIMD operations being used
outside a kernel thread was at module load time. This code
was moved into a taskq in order to accommodate the new kernel
thread restriction.
Finally, a few other modifications were made in order to further
harden this code and facilitate testing. They include updating
each implementation's operations structure to be declared as a
constant, and allowing "cycle" to be set when selecting the
preferred ops in the kernel as well as user space.
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8754
Closes #8793
Closes #8965
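
A simulation of the dispatch pattern: the context check (kfpu_allowed() in the real code) is reduced here to a boolean parameter:
```c
/* Sketch: pick the SIMD implementation only when FPU use is permitted. */
#include <stdbool.h>

typedef struct { int (*compute)(const void *buf, int len); } ops_t;

static int simd_impl(const void *buf, int len)    { (void)buf; return len; }
static int generic_impl(const void *buf, int len) { (void)buf; return len; }

/* Operations structures declared const, as the commit describes. */
static const ops_t simd_ops    = { .compute = simd_impl };
static const ops_t generic_ops = { .compute = generic_impl };

const ops_t *get_ops(bool fpu_allowed)
{
	return fpu_allowed ? &simd_ops : &generic_ops;
}
```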
Due to some changes introduced in 30af21b 'zfs send' can crash when
provided with invalid inputs: this change attempts to add more checks
to the affected code paths.
Reviewed-by: Attila Fülöp <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #9001
* zfs redact error messages do not end with newline character
* 30af21b0 inadvertently removed some ZFS_PROP comments
* man/zfs: zfs redact <redaction_snapshot> is not optional
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8988
When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation. There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:
1) Skip blocks may be added to keep each allocation a multiple
of parity + 1. This is the dominating factor when volblocksize
is close to 2^ashift.
2) raidz deflation for 128 KB blocks differs from that of most
other block sizes.
See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.
Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>
Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy
OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0
Closes #8973
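
A sketch of deficiency (1), the round-up of each allocation to a multiple of parity + 1 sectors:
```c
#include <stdint.h>

/* Round an allocation up to a multiple of nparity + 1 sectors. */
uint64_t roundup_parity(uint64_t sectors, uint64_t nparity)
{
	uint64_t mult = nparity + 1;
	return ((sectors + mult - 1) / mult) * mult;
}
/* e.g. raidz2 (nparity = 2): a 4-sector allocation pads out to 6. */
```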
This patch adds the ability for the user to unload keys for
datasets as they are being unmounted. This is analogous to
'zfs mount -l'.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes: #8917
Closes: #8952
ztest creates some extremely large files as part of its
operation. When zdb tries to dump a large enough file, it
can run out of memory or spend an extremely long time
attempting to print millions or billions of uint64_ts.
We cap the amount of data from a uint64 object that we
are willing to read and print.
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
External-issue: DLPX-53814
Closes #8947
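
A sketch of the capping pattern; the limit here is illustrative, not the value zdb uses:
```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define DUMP_MAX_UINT64S (1024 * 1024)  /* illustrative cap */

void dump_uint64s(const uint64_t *data, size_t count)
{
	size_t n = count < DUMP_MAX_UINT64S ? count : DUMP_MAX_UINT64S;
	for (size_t i = 0; i < n; i++)
		printf("%llu\n", (unsigned long long)data[i]);
	if (n < count)
		printf("... %zu more entries not shown\n", count - n);
}
```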
When used with verbosity >= 4 zdb fails an assertion in dump_bookmarks()
because it expects snprintf() to return 0 on success.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8948
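
The corrected check, for reference; snprintf() returns the would-be string length, not 0:
```c
#include <stdio.h>

int format_bookmark(char *buf, size_t buflen, const char *name)
{
	int n = snprintf(buf, buflen, "bookmark: %s", name);
	if (n < 0 || (size_t)n >= buflen)
		return -1;  /* error or truncation; "n != 0" is not failure */
	return 0;
}
```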
The "zfs remap" command was disabled by
6e91a72fe3ff8bb282490773bd687632f3e8c79d, because it has little utility
and introduced some tricky bugs. This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.
Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8944
Resolve the incorrect use of srcdir and builddir references for
various files in the build system. These have crept in over time
and went unnoticed because when building in the top level directory
srcdir and builddir are identical.
With this change it's again possible to build in a subdirectory.
$ mkdir obj
$ cd obj
$ ../configure
$ make
Reviewed-by: loli10K <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8921
Closes #8943
The zfs-mount service can unexpectedly fail to start when zfs
encounters a mount that is in progress. This service uses
zfs mount -a, which has a window between the time it checks if
the dataset was mounted and when the actual mount (via mount.zfs
binary) occurs.
The reason for the racing mounts is that both zfs-mount.target
and zfs-share.target are allowed to execute concurrently after
the import. This is more of an issue with the relatively recent
addition of parallel mounting, and we should consider serializing
the mount and share targets.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #8881
Count the bytes of payload for each replication record type
Count the bytes of overhead (replication records themselves)
Include these counters in the output summary at the end of the run.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Sponsored-By: Klara Systems and Catalogic
Closes #8432
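
A sketch of the accounting, with illustrative record-type names:
```c
#include <stdint.h>
#include <stdio.h>

enum { REC_WRITE, REC_FREE, REC_OBJECT, REC_TYPES };
static const char *rec_name[REC_TYPES] = { "WRITE", "FREE", "OBJECT" };

static uint64_t payload_bytes[REC_TYPES];   /* data carried by records */
static uint64_t overhead_bytes[REC_TYPES];  /* the records themselves */

void account_record(int type, uint64_t hdr_size, uint64_t payload_size)
{
	overhead_bytes[type] += hdr_size;
	payload_bytes[type] += payload_size;
}

void print_summary(void)
{
	for (int t = 0; t < REC_TYPES; t++)
		printf("%-8s payload %llu overhead %llu\n", rec_name[t],
		    (unsigned long long)payload_bytes[t],
		    (unsigned long long)overhead_bytes[t]);
}
```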
Since 30af21b0 was merged, the 'zfs send' help message format is broken
and lists "-r" as a valid option: this commit corrects these
small issues.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8942
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x). The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.
ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful. The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad. This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.
Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.
Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).
In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection. It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).
We should remove the dedupditto functionality. For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect. For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate. However, ZFS won't write any new dedupditto
blocks.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Issue #8270
Closes #8310
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Signed-off-by: Michael Niewöhner <[email protected]>
Closes #8897
Closes #8911
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Prashanth Sreenivasa <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Chris Williamson <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Sebastien Roy <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #7958
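
A sketch of the send-side filtering step, with the redaction bookmark's block list reduced to a sorted array of block identifiers:
```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* True if blkid was modified by a redaction snapshot (binary search). */
bool redacted(const uint64_t *list, size_t n, uint64_t blkid)
{
	size_t lo = 0, hi = n;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (list[mid] < blkid)
			lo = mid + 1;
		else if (list[mid] > blkid)
			hi = mid;
		else
			return true;
	}
	return false;
}
/* In the send loop: if (redacted(bm_list, bm_len, blkid)) skip the block. */
```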
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations. The large allocations are for ZIL blocks. If there is a
lot of fragmentation, the large allocations can be hard to satisfy.
The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more. In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.
To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.
External-issue: DLPX-61719
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #8865
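
A sketch of the clamp; the tunable name matches the change's description, but the surrounding code is illustrative:
```c
#include <stdint.h>

/* Tunable: maximum ZIL block size; lower it on fragmented pools. */
static uint64_t zil_maxblocksize = 128 * 1024;

uint64_t zil_pick_blksz(uint64_t wanted)
{
	return wanted < zil_maxblocksize ? wanted : zil_maxblocksize;
}
```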
This matches the behavior of other python scripts, such as arcstat and
dbufstat, which are always installed but whose install-exec-hook actions
will simply touch up the shebang if a python interpreter was configured
*and* that interpreter is a python2 interpreter.
Fixes installation in a minimal build chroot without python available.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Eli Schwartz <[email protected]>
Closes #8851
Previously, --without-python would cause ./configure to fail. Now it is
able to proceed, and the Python scripts will not be built.
Use portable parameter expansion matching instead of nonstandard
substring matching to detect the Python version. This test is
duplicated in several places, so define a function for it.
Don't assume the full path to binaries, since different platforms do
install things in different places. Use AC_CHECK_PROGS instead.
When building without Python, also build without pyzfs.
Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Eli Schwartz <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #8809
Closes #8731
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Signed-off-by: Josh Soref <[email protected]>
Closes #8818
s/get_vdev_spec/make_root_vdev
The former doesn't exist anymore.
Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #8759
On systems where "char" is an unsigned type the value returned by
getopt() will never be negative (-1), leading to an endless loop:
this issue prevents both 'zpool remove' and 'zstreamdump' from
working on some systems.
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8789
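
The portable idiom, for reference:
```c
#include <unistd.h>

void parse(int argc, char **argv)
{
	int c;  /* must be int, not char: getopt() returns an int (-1 at end) */
	while ((c = getopt(argc, argv, "v")) != -1) {
		/* with an unsigned char, -1 becomes 255 and the loop never ends */
	}
}
```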
Currently, 'zfs mount -a' will always attempt to parallelize
work related to mounting as best it can. Unfortunately, when
the user passes the '-l' option to load keys, this causes
all threads to prompt the user for their keys at once,
resulting in a confusing and racy user experience. This patch
simply disables parallel mounting when using the '-l' flag.
Reviewed by: Sebastien Roy <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8762
Closes #8811
This commit adds the undocumented "-t" option to the zpool(8) help message.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8782
This commit simply adds a missing newline ("\n") character to the error
message printed by the zfs command when the provided pool parameter
can't be found.
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8783
This commit removes the documented but not handled "-p" option from
the zpool(8) help message.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #8781
The code never returned a match when comparing two datasets (not
snapshots). As a result, uu_avl_find(), called from zfs_callback(),
never succeeded, allowing the same dataset to be added to the list
multiple times, for example:
# zfs get name pers pers pers@z pers@z
NAME PROPERTY VALUE SOURCE
pers name pers -
pers name pers -
pers@z name pers@z -
With the patch:
# zfs get name pers pers pers@z pers@z
NAME PROPERTY VALUE SOURCE
pers name pers -
pers@z name pers@z -
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8723
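
The corrected comparator shape, sketched over plain name strings:
```c
#include <string.h>

/* Equal names must compare equal so uu_avl_find() can detect duplicates. */
int dataset_compare(const char *a, const char *b)
{
	int c = strcmp(a, b);
	if (c == 0)
		return 0;   /* the previously missing case: a match */
	return c < 0 ? -1 : 1;
}
```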
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Justin Scholz <[email protected]>
Closes #8712
Closes #8721
Users of existing pools, especially pools with top-level encrypted
datasets, could run into trouble trying to work around Errata #4.
Clarify that removing encrypted snapshots and bookmarks is enough
to clear the errata.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Justin Scholz <[email protected]>
Closes #8682
Closes #8683
Currently, it is possible for the 'zpool scrub' command to
progress slightly beyond 100% due to concurrent changes
happening on the live pool. This behavior is expected, but
the userspace code for 'zpool status' would subtract the
expected amount of data from the amount of data already
scrubbed, resulting in a negative integer being cast to a
large positive one. This number was then used to calculate
the estimated completion time, resulting in wildly wrong
results. This patch changes the behavior so that 'zpool status'
does not attempt to report an estimate during this period.
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8611
Closes #8687
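
A sketch of the guard; the key point is to skip the estimate rather than let the subtraction wrap:
```c
#include <stdint.h>
#include <stdio.h>

void print_scrub_estimate(uint64_t total, uint64_t examined, uint64_t rate)
{
	if (examined >= total || rate == 0) {
		printf("no estimated completion time\n");
		return;  /* scanned past 100%: don't compute (total - examined) */
	}
	printf("%llu seconds to go\n",
	    (unsigned long long)((total - examined) / rate));
}
```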
sigignore(3) isn't portable.
This code fails to compile on platforms without sigignore(3).
Use sigaction(2).
--
zfs_main.c: In function 'zfs_do_diff':
zfs_main.c:7178:9: error: implicit declaration of function 'sigignore' [-Werror=implicit-function-declaration]
(void) sigignore(SIGPIPE);
^~~~~~~~~
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tomohiro Kusumi <[email protected]>
Closes #8593
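
The sigaction(2) equivalent, for reference:
```c
#include <signal.h>

void ignore_sigpipe(void)
{
	struct sigaction sa;
	sa.sa_handler = SIG_IGN;  /* same effect as sigignore(SIGPIPE) */
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	(void) sigaction(SIGPIPE, &sa, NULL);
}
```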
Add the 'zfs version' and 'zpool version' subcommands to display
the version of the user space utilities and loaded zfs kernel
module. For example:
$ zfs version
zfs-0.8.0-rc3_169_g67e0366b88
zfs-kmod-0.8.0-rc3_169_g67e0366b88
The '-V' and '--version' aliases were added to support the
common convention of using 'zfs --version` to obtain the version
information.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: TerraTech <[email protected]>
Closes #2501
Closes #8567
This patch cleans up a nit and corrects an error message issue, both
of which were introduced in the Multiple DVA scrub patch.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8619
Commit df583073 introduced the ability to list the snapshots for a
specified dataset. This change inadvertently resulted in only the top-
level snapshots being listed when no dataset was specified. Fix this
issue by adding an additional check to determine if a dataset was
provided to avoid incorrectly restricting the depth.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8591
Closes #8594
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values are obfuscated
to prevent information leaks, which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Gallagher <[email protected]>
Signed-off-by: sara hartse <[email protected]>
Closes #8467
Closes #8476
Currently, when attempting to list snapshots ZFS may do a lot of
extra work checking child datasets. This is because the code does
not realize that it will not be able to reach any snapshots
contained within datasets that are at the depth limit since the
snapshots of those datasets are counted as an additional layer
deeper. This patch corrects this issue.
In addition, this patch adds the ability to perform the commands:
$ zfs list -t snapshot <dataset>
$ zfs get -t snapshot <prop> <dataset>
as a convenient way to list out properties of all snapshots of a
given dataset without having to use the depth limit.
Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8539
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends. By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.
This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool. The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience. The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.
The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq. This new type makes it
straightforward to add the platform-specific TRIM/UNMAP calls to
vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipeline; one
exception is that TRIM zios may exceed the 16M block size limit
since they contain no data.
In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property. It relies on the exact same infrastructure as the
manual TRIM. However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab. When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_trim tree. The
ms_trim tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.
Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`. This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them. An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Contributions-by: Saso Kiselkov <[email protected]>
Contributions-by: Tim Chase <[email protected]>
Contributions-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8419
Closes #598
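
A toy sketch of the autotrim bookkeeping, with fixed-size arrays standing in for the range trees:
```c
#include <stdint.h>
#include <stdbool.h>

struct extent { uint64_t start, size; };

#define MAX_EXT 1024
static struct extent ms_allocatable[MAX_EXT], ms_trim[MAX_EXT];
static int n_alloc, n_trim;

static const uint64_t trim_min_size = 32 * 1024;  /* illustrative threshold */

void metaslab_free_range(uint64_t start, uint64_t size, bool autotrim)
{
	if (n_alloc < MAX_EXT)
		ms_allocatable[n_alloc++] = (struct extent){ start, size };
	if (autotrim && n_trim < MAX_EXT)
		ms_trim[n_trim++] = (struct extent){ start, size };
}

/* Periodically run by the autotrim thread for this metaslab. */
void autotrim_drain(void (*issue_trim)(uint64_t start, uint64_t size))
{
	for (int i = 0; i < n_trim; i++) {
		if (ms_trim[i].size >= trim_min_size)
			issue_trim(ms_trim[i].start, ms_trim[i].size);
		/* smaller extents are dropped; a manual `zpool trim`
		 * can still reclaim them later */
	}
	n_trim = 0;
}
```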
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value alone is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #7842
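
An approximation of the duration selection described above (names and units illustrative; the exact formulas live in the import code):
```c
#include <stdint.h>
#include <stdbool.h>

uint64_t activity_test_ms(bool have_mmp_config, uint64_t rec_interval,
    uint64_t rec_fail_intervals, uint64_t ub_mmp_delay,
    uint64_t import_intervals, uint64_t local_interval)
{
	if (have_mmp_config) {
		/* Use values recorded by the last importer, not local ones. */
		if (rec_fail_intervals > 0)
			return rec_fail_intervals * rec_interval;
		return rec_interval * import_intervals;
	}
	/* Legacy uberblocks carry only ub_mmp_delay. */
	return (local_interval + ub_mmp_delay) * import_intervals;
}
```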
1) As implemented, the `zpool labelclear` command overwrites
the calculated offsets of all four vdev labels even when only a
single valid label is found. If the device has been re-purposed
but still contains a valid label this can result in space no
longer owned by ZFS being zeroed. Prevent this by verifying
every label removed is intact before it's overwritten.
2) Address a small bug in zpool_do_labelclear() which prevented
labelclear from working on file vdevs. Only block devices support
BLKFLSBUF, try the ioctl() but when it's reported as unsupported
this should not be fatal.
3) Fix `zpool labelclear` so it can be run on vdevs which were
removed from the pool with `zpool remove`. Additionally, allow
intact but partial labels to be cleared as in the case of a failed
`zpool attach` or `zpool replace`.
4) Remove LABELCLEAR and LABELREAD variables for test cases.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8500
Closes #8373
Closes #6261
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.
This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.
To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #8453
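
The retry flow, sketched with stub I/O functions:
```c
#include <stdbool.h>

static bool read_dva(int dva) { (void)dva; return false; }   /* stub */
static bool read_all_dvas_and_repair(void) { return true; }  /* stub */

/* Scrub reads try the first DVA alone; on failure, reissue with every
 * copy so the self-healing code can repair the damaged one. */
bool scrub_read_block(void)
{
	if (read_dva(0))
		return true;  /* common case: first copy verifies */
	return read_all_dvas_and_repair();
}
```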