| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Porting Notes:
* Add options to zfs-module-parameters(5) man page.
* zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since
the latter doesn't get built for user space.
Authored by: Prakash Surya <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Patrick Mooney <[email protected]>
Reviewed by: Tom Caputi <[email protected]>
Reviewed by: George Melikov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Signed-off-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9963
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125
Closes #8186
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Detect IO errors during device removal
While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs. Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.
Situation 1: faulted/degraded vdevs
In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool. Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. Which
in this case will result in nothing being copied since one vdev
in each of those groups in unavailable. However, device removal
will complete successfully since all IO errors are ignored.
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/var/tmp/vdev1 FAULTED 0 0 0 external fault
/var/tmp/vdev2 ONLINE 0 0 0
mirror-1 DEGRADED 0 0 0
/var/tmp/vdev3 ONLINE 0 0 0
/var/tmp/vdev4 FAULTED 0 0 0 external fault
This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs. Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.
Situation 2: individual hard IO errors
During removal if an unexpected hard IO error is encountered when
either reading or writing the child vdev the entire removal
operation is cancelled. While it may be possible to reconstruct
the data after removal that cannot be guaranteed. The only
strictly safe thing to do is to cancel the removal.
As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried. But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #6900
Closes #8161
|
|
|
|
|
|
|
|
| |
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: bunder2015 <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes #8153
|
|
|
|
|
|
|
|
| |
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Christian Schwarz <[email protected]>
Closes #8134
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.
NAME STATE READ WRITE CKSUM SLOW
testpool ONLINE 0 0 0 -
mirror-0 ONLINE 0 0 0 -
loop0 ONLINE 0 0 0 20
loop1 ONLINE 0 0 0 0
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #7756
Closes #6885
|
|
|
|
|
|
|
|
|
|
| |
Update zfs-events.5 with info from PSARC 2009/497 regarding ereport fields.
Also updates ZIO_STAGE_* and ZIO_FLAG_* descriptions to match current source.
Reviewed by: loli10K <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Elling <[email protected]>
Closes #8057
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.
This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: @mmaybee
Signed-off-by: Tom Caputi <[email protected]>
Closes #7732
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Andrew Stormont <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Porting notes:
* Two of the three changes from the upstream patch were already
applied for Linux. Only the last one is required.
Authored by: Yuri Pankov <[email protected]>
Reviewed by: Robert Mustacchi <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9763
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8a702e55
Closes #7972
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to a flaw in 4589f3ae the number of unique combinations
could be calculated incorrectly. This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.
This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.
The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest. It is implemented such it will never
make a known recoverable block unrecoverable.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #6900
Closes #7934
|
|
|
|
|
|
|
|
| |
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: DHE <[email protected]>
Closes #7938
|
|
|
|
|
|
|
| |
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: DHE <[email protected]>
Closes #7920
|
|
|
|
|
|
|
|
|
|
| |
The man pages for zpool and zfs (get command)
listed the pool/dataset parameter as required,
but these are optional. Fixed that.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gregor Kopka <[email protected]>
Closes #7916
|
|
|
|
|
|
|
|
|
|
|
| |
Update zpool(8) to clarify what type of vdevs may be safely
removed and that the existence of any top-level raidz device
which is part of the primary pool will prevent device removal.
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7880
Closes #7893
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When no permission set is defined for a dataset the create time
permissions are incorrectly shown as if they were a permission set.
This change simply correct how allow permissions are displayed.
This commit also fixes a small manpage formatting issue and adds the
"zfs_allow_003_pos" test case to the ZFS Test Suite.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #7519
Closes #7860
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: HÃ¥kan Johansson <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: DHE <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Reviewed-by: Gregor Kopka <[email protected]>
Reviewed-by: Kash Pande <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #5182
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Assertion failed in arc_buf_destroy() when concurrently reading
block with checksum error.
Porting notes:
* The ability to zinject decompression errors has been added, but
this only works at the zio_decompress() level, where we have all
of the info we need to match against the user's zinject options.
* The decompress_fault test has been added to test the new zinject
functionality
* We attempted to set zio_decompress_fail_fraction to (1 << 18) in
ztest for further test coverage. Although this did uncover a few
low priority issues, this unfortuantely also causes ztest to
ASSERT in many locations where the code is working correctly since
it is designed to fail on IO errors. Developers can manually set
this variable with the '-o' option to find and debug issues.
Authored by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Matt Ahrens <[email protected]>
Ported-by: Tom Caputi <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9403
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9
Closes #7822
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.
Reviewed-by: Richard Elling <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #7705
|
|
|
|
|
|
|
|
|
|
| |
This change simply documents the existing "scripted mode" option in
both command help and man page.
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #7798
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:
* Receiving an unencrypted stream as encrypted without
needing to create a "dummy" encrypted parent so that
encryption can be inheritted.
* Allowing users to change their keylocation on receive,
so long as the receiving dataset is an encryption root.
* Allowing users to explicitly exclude or override the
encryption property from an unencrypted properties stream,
allowing it to be received as encrypted.
* Receiving a recursive heirarchy of unencrypted datasets,
encrypting the top-level one and forcing all children to
inherit the encryption.
Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Elling <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #7650
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Porting notes:
* As of grub-2.02 these checksums are not supported. However, as
pointed out in #6501 there are alternatives such as EFISTUB which
work and have no such restriction. A warning was added to the
checksum property section of the zfs.8 man page.
Authored by: Toomas Soome <[email protected]>
Reviewed by: C Fraire <[email protected]>
Reviewed by: Robert Mustacchi <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #6501
Closes #7714
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:
The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf. Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.
Performance Testing:
- Disabled kmem_flags
- Tuned dbuf_cache_max_bytes very low (128K)
- Tuned zfs_arc_max very low (64M)
Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.
Results:
| Test | Time (trunk / diff) | I/Os (trunk / diff) |
+------------------------+---------------------+---------------------+
| zpool import | 0:05 / 0:06 | 12.9k / 12.9k |
| zfs get all (uncached) | 1:36 / 0:53 | 16.7k / 5.7k |
| zfs get all (cached) | 1:36 / 0:51 | 16.0k / 6.0k |
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Thomas Caputi <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported-by: Alek Pinchuk <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Don Brady <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1
Closes #7700
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.
The problem can be seen by attempting to create a dataset using the -p
option with many levels:
panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff
fffffffffbc3aa60 unix:die+100 ()
fffffffffbc3ab70 unix:trap+157d ()
ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
ffffff00083d7050 zfs:dbuf_rele+2e ()
...
ffffff00083d7080 zfs:dsl_dir_close+32 ()
ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
ffffff00083d7130 zfs:dbuf_rele+2e ()
... The block above repeats once per directory in the ...
... create -p command, working towards the root ...
ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
ffffff00083dbaa0 zfs:findfunc+23 ()
ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
ffffff00083dbd00 genunix:cdev_ioctl+45 ()
ffffff00083dbd40 specfs:spec_ioctl+5a ()
ffffff00083dbdc0 genunix:fop_ioctl+7b ()
ffffff00083dbec0 genunix:ioctl+18e ()
ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()
Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K. Enough time has passed
that 16K can be reasonably assumed to be the default value. It
was increased in the 3.15 kernel released in June of 2014.
Authored by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Approved by: Garrett D'Amore <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Motivation
==========
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
New encoding
============
The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.
Other references:
=================
The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ
Authored by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
|
|
|
|
|
|
|
|
|
| |
The formatting of the features beginning with large_blocks was broken
when the zpool_checkpoint feature was added.
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7658
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add checkpoint field in the default list of the zpool-list man page
Authored by: Eitan Adler <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: kpande <[email protected]>
Reviewed by: Tim Chase <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9521
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5a860f7b
Closes #7658
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Details about the motivation of this feature and its usage can
be found in this blogpost:
https://sdimitro.github.io/post/zpool-checkpoint/
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
* renames members of "struct metaslab trees to be shorter without
losing meaning
* space_map_{alloc,truncate}() accept a block size as a
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for metaslab
space maps in terms of bandwirdth since they are scattered all
over the disk. But for other space maps this default is probably
not what we want. Examples are device removal's vdev_obsolete_sm
or vdev_chedkpoint_sm from this review. Both of these have a
1:1 relationship with each vdev and could benefit from a bigger
block size.
Porting notes:
* The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.
* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
to block device backed pools.
* ZTS:
* Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".
* Don't use large dd block sizes on /dev/urandom under Linux in
checkpoint_capacity.
* Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
its attempts to fill the pool
* Create the base and nested pools with sync=disabled to speed up
the "setup" phase.
* Clear labels in test pool between checkpoint tests to avoid
duplicate pool issues.
* The import_rewind_device_replaced test has been marked as "known
to fail" for the reasons listed in its DISCLAIMER.
* New module parameters:
zfs_spa_discard_memory_limit,
zfs_remove_max_bytes_pause (not documented - debugging only)
vdev_max_ms_count (formerly metaslabs_per_vdev)
vdev_min_ms_count
Authored by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
|
|
|
|
|
|
|
|
| |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: bunder2015 <[email protected]>
Signed-off-by: ajs124 <[email protected]>
Closes #7649
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]
Reviewed-by: Sara Hartse <[email protected]>
Signed-off-by: John Gallagher <[email protected]>
External-issue: LX-1085
Closes #7618
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zpool and zed place scripts in subdirectories of libexecdir. Some
distributions locate architecture independent scripts in other locations
(e.g. Debian). To avoid these paths getting out of sync, centralize the
definitions.
Build zfs-test's default.cfg by Makefile. Use the new directory
logic building tests/zfs-tests/include/default.cfg.in.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Antonio Russo <[email protected]>
Closes #7597
|
|
|
|
|
|
|
|
|
| |
This patch collects some minor inconsistencies and typos in the
documentation, logging and testing infrastructure.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Antonio Russo <[email protected]>
Closes #7608
|
|
|
|
|
|
|
|
|
| |
Fixed some highlighting in the zpool man page
Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: bunder2015 <[email protected]>
Closes #7596
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.
Build system and packaging:
* Renamed SPL_* autoconf m4 macros to ZFS_*.
* Removed redundant SPL_* autoconf m4 macros.
* Updated the RPM spec files to remove SPL package dependency.
* The zfs package obsoletes the spl package, and the zfs-kmod
package obsoletes the spl-kmod package.
* The zfs-kmod-devel* packages were updated to add compatibility
symlinks under /usr/src/spl-x.y.z until all dependent packages
can be updated. They will be removed in a future release.
* Updated copy-builtin script for in-kernel builds.
* Updated DKMS package to include the spl.ko.
* Updated stale AUTHORS file to include all contributors.
* Updated stale COPYRIGHT and included the SPL as an exception.
* Renamed README.markdown to README.md
* Renamed OPENSOLARIS.LICENSE to LICENSE.
* Renamed DISCLAIMER to NOTICE.
Required code changes:
* Removed redundant HAVE_SPL macro.
* Removed _BOOT from nvpairs since it doesn't apply for Linux.
* Initial header cleanup (removal of empty headers, refactoring).
* Remove SPL repository clone/build from zimport.sh.
* Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
to build issues when forcing C99 compilation.
* Replaced legacy ACCESS_ONCE with READ_ONCE.
* Include needed headers for `current` and `EXPORT_SYMBOL`.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Merge a minimal version of the zfsonlinux/spl repository in to the
zfsonlinux/zfs repository. Care was taken to prevent file conflicts
when merging and to preserve the spl repository history. The spl
kernel module remains under the GPLv2 license as documented by the
additional THIRDPARTYLICENSE.gplv2 file.
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit removes everything from the repository except the core
SPL implementation for Linux. Those files which remain have been
moved to non-conflicting locations to facilitate the merge.
The README.md and associated files have been updated accordingly.
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Update spl module parameters man5 with the following missing parameter
details for spl_panic_halt.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alex Braunegg <[email protected]>
Closes #664
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There is no '.sh' macro in troff/groff/man, only '.SH' for section
headers. I assume .sp for a line break was intended here like
in the rest of the man page.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Closes #643
|
| |
| |
| |
| |
| |
| | |
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Closes #642
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
To prevent holding tq_lock for too long.
Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq
would easily cause a lockup. While that bug has been fixed. It's probably
still a good idea to do this just in case task lists grow too large.
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #586
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #529
|
| |
| |
| |
| |
| |
| |
| |
| | |
The spl_kmem_cache_kmem_threads module option was accidentally
omitted from the documentation. Add it.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #512
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority. Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority. This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.
All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri. The vast majority of callers were
part of the test suite which won't have an external impact. The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri. This makes it more likely the process will be scheduled
which may help performance.
To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added. When disabled (0) all newly created kernel
threads will use the default kernel thread priority. When enabled (1)
the specified taskq priority will be used. By default this value is
enabled (1).
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics. Initially only a single worker thread will be created
to service tasks dispatched to the queue. As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'. When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.
Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.
* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned. Default 4.
* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system. This is provided
for the purposes of debugging and troubleshooting. Default 1
(enabled).
This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #458
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This change is designed to improve the memory utilization of
slabs by more carefully setting their size. The way the code
currently works is problematic for slabs which contain large
objects (>1MB). This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.
The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator. Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space. However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity. In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).
Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.
This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches. But depending on the workload it may result in slabs
being allocated and freed more frequently. In practice, this
has been shown to be a better default for most workloads.
Also the maximum slab size has been reduced to 4MB on 32-bit
systems. Due to the limited virtual address space it's critical
the we be as frugal as possible. A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.
Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments. Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The spl-module-parameters(5) was not kept up to date. Refresh
the man page so that it lists all the possible module options,
describes what the do, and justify why the default values are
set they way the are.
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Many people have noticed that the kmem cache implementation is slow
to release its memory. This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty. Unused
objects which are cached in the magazines will still prevent a slab
from being freed.
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch achieves the following goals:
1. It replaces the preprocessor kmem flag to gfp flag mapping with
proper translation logic. This eliminates the potential for
surprises that were previously possible where kmem flags were
mapped to gfp flags.
2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
sized less than or equal to the newly-added spl_kmem_alloc_max
parameter. This ensures that small allocations will not contend
on a single global lock, large allocations can still be handled,
and potentially limited virtual address space will not be squandered.
This behavior is entirely different than under Illumos due to
different memory management strategies employed by the respective
kernels. However, this functionally provides the semantics required.
3. The --disable-debug-kmem, --enable-debug-kmem (default), and
--enable-debug-kmem-tracking allocators have been unified in to
a single spl_kmem_alloc_impl() allocation function. This was
done to simplify the code and make it more maintainable.
4. Improve portability by exposing an implementation of the memory
allocations functions that can be safely used in the same way
they are used on Illumos. Specifically, callers may safely
use KM_SLEEP in contexts which perform filesystem IO. This
allows us to eliminate an entire class of Linux specific changes
which were previously required to avoid deadlocking the system.
This change will be largely transparent to existing callers but there
are a few caveats:
1. Because the headers were refactored and extraneous includes removed
callers may find they need to explicitly add additional #includes.
In particular, kmem_cache.h must now be explicitly includes to
access the SPL's kmem cache implementation. This behavior is
different from Illumos but it was done to avoid always masking
the Linux slab functions when kmem.h is included.
2. Callers, like Lustre, which made assumptions about the definitions
of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
Other callers such as ZFS which did not will not require changes.
3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It retains
its original meaning of allowing allocations to access reserved
memory. KM_PUSHPAGE callers can be converted back to KM_SLEEP.
4. The KM_NODEBUG flags has been retired and the default warning
threshold increased to 32k.
5. The kmem_virt() functions has been removed. For callers which
need to distinguish between a physical and virtual address use
is_vmalloc_addr().
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.
Signed-off-by: Brian Behlendorf <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The problem is described in commit aeeb4e0c0ae75b99ebbaa3056f0afc8e12949532.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.
The implementation based on assumption that task queues initialization
performed in serial manner.
Signed-off-by: Andrey Vesnovaty <[email protected]>
Signed-off-by: Andrey Vesnovaty <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #336
|