| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
The parameters dsl_dataset_t *os in function prototype should be
renamed to dsl_dataset_t *ds.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Damian Wojsław <[email protected]>
Closes #6756
Closes #6273
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
All objects after the last written or freed object are not supposed to
exist after receiving the stream. Free them accordingly, as if a
freeobjects record for them had been included in the stream.
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Closes #5699
Closes #6507
Closes #6616
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit d3c2ae1 introduced a dbuf cache with a default size of the
minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted
using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache
is counted as metadata for the purposes of ARC size calculations.
On a 1GB box the ARC maximum size defaults to c_max 493M which gives a
dbuf cache default minimum size of 15.4M, and the ARC metadata defaults
to minimum 16M. I.e. the dbuf cache is an significant proportion of the
minimum metadata size. With other overheads involved this actually means
the ARC metadata doesn't get down to the minimum.
This patch dynamically scales the dbuf cache to the target ARC size
instead of statically scaling it to the maximum ARC size. (The scale is
still set by dbuf_cache_max_shift and the maximum size is still fixed by
dbuf_cache_max_bytes.) Using the target ARC size rather than the current
ARC size is done to help the ARC reach the target rather than simply
focusing on the current size.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #6506
Closes #6561
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZFS buildbot STYLE builder was moved to Ubuntu 17.04
which has a newer version of cppcheck. Handle the
new cppcheck errors.
uu_* functions removed in this commit were unused
and effectively dead code. They are now retired.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Giuseppe Di Natale <[email protected]>
Closes #6653
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
generic_start_io_acct/generic_end_io_acct in the master
branch of the linux kernel requires that the request_queue
be provided.
Move the logic from freemem in the spl to arc_free_memory
in arc.c. Do this so we can take advantage of global_page_state
interface checks in zfs.
Upstream kernel replaced struct block_device with
struct gendisk in struct bio. Determine if the
function bio_set_dev exists during configure
and have zfs use that if it exists.
bio_set_dev https://github.com/torvalds/linux/commit/74d4699
global_node_page_state https://github.com/torvalds/linux/commit/75ef718
io acct https://github.com/torvalds/linux/commit/d62e26b
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Giuseppe Di Natale <[email protected]>
Closes #6635
Signed-off-by: Brian Behlendorf <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.
* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately. This
eliminates most lock contention when taking the hold and creating the
dnode.
* Correct detection of the chunk boundary to work properly with large
dnodes.
* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.
* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk. Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.
* Add several kstats to allow the behavior of the code to be examined.
* Verify dnode packing in large_dnode_008_pos.ksh. Since the test is
purely creates, it should leave very few holes in the metadnode.
* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.
With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.
name type data
dnode_hold_dbuf_hold 4 0
dnode_hold_dbuf_read 4 0
dnode_hold_alloc_hits 4 3804690
dnode_hold_alloc_misses 4 216
dnode_hold_alloc_interior 4 3
dnode_hold_alloc_lock_retry 4 0
dnode_hold_alloc_lock_misses 4 0
dnode_hold_alloc_type_none 4 0
dnode_hold_free_hits 4 203105
dnode_hold_free_misses 4 4
dnode_hold_free_lock_misses 4 0
dnode_hold_free_lock_retry 4 0
dnode_hold_free_overflow 4 0
dnode_hold_free_refcount 4 57
dnode_hold_free_txg 4 0
dnode_allocate 4 203154
dnode_reallocate 4 0
dnode_buf_evict 4 23918
dnode_alloc_next_chunk 4 4887
dnode_alloc_race 4 0
dnode_alloc_next_block 4 18
The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
Update the dbuf__evict__one() tracepoint so that it can safely
handle a NULL dmu_buf_impl_t pointer.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: loli10K <[email protected]>
Signed-off-by: gaurkuma <[email protected]>
Closes #6463
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Redefine the SET_ERROR macro in terms of __dprintf() so the error
return codes get logged as both tracepoint events (if tracepoints are
enabled) and as ZFS debug log entries. This also allows us to use
the same definition of SET_ERROR() in kernel and user space.
Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise
or'd into zfs_flags. Setting this flag enables both dprintf() and
SET_ERROR() messages in the debug log. That is, setting
ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF|ZFS_DEBUG_SET_ERROR are
equivalent (this was done for sake of simplicity). Leaving
ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which
helps avoid cluttering up the logs.
To enable SET_ERROR() logging, run:
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 512 > /sys/module/zfs/parameters/zfs_flags
Remove the zfs_set_error_class tracepoints event class since
SET_ERROR() now uses __dprintf(). This sacrifices a bit of
granularity when selecting individual tracepoint events to enable but
it makes the code simpler.
Include file, function, and line number information in debug log
entries. The information is now added to the message buffer in
__dprintf() and as a result the zfs_dprintf_class tracepoints event
class was changed from a 4 parameter interface to a single parameter.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #6400
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a callback to wake all running mmp threads when
zfs_multihost_interval is changed.
This is necessary when the interval is changed from a very large value
to a significantly lower one, while pools are imported that have the
multihost property enabled.
Without this commit, the mmp thread does not wake up and detect the new
interval until after it has waited the old multihost interval time. A
user monitoring mmp writes via the provided kstat would be led to
believe that the changed setting did not work.
Added a test in the ZTS under mmp to verify the new functionality is
working.
Added a test to ztest which starts and stops mmp threads, and calls into
the code to signal sleeping mmp threads, to test for deadlocks or
similar locking issues.
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #6387
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
merging zpool checkpoint & MMP in ZFS
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds three new fields
in the uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed
and deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.
Porting Notes: Preserved MMP comments in uberblock struct.
Authored by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Reviewed by: Olaf Faaland <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8491
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d84fa5f
Closes #6390
|
|
|
|
|
|
|
|
|
| |
Commit torvalds/linux@4e4cbee9. The bio->bi_error field was
replaced with bio->bi_status which is an enum that describes
all possible error types.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6351
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Ned Bass <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #745
Closes #6279
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Dave Eddy <[email protected]>
Reviewed by: Patrick Mooney <[email protected]>
Reviewed by: Joshua M. Clulow <[email protected]>
Reviewed by: Josh Wilsdon <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Alan Somers <[email protected]>
Reviewed by: Andrew Stormont <[email protected]>
Approved by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/6939
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b
Closes #6328
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The volmode property may be set to control the visibility of ZVOL
block devices.
This allow switching ZVOL between three modes:
full - existing fully functional behaviour (default)
dev - hide partitions on ZVOL block devices
none - not exposing volumes outside ZFS
Additionally the new zvol_volmode module parameter can be used to
control the default behaviour.
This functionality can be used, for instance, on "backup" pools to
avoid cluttering /dev with unneeded zd* devices.
Original-patch-by: mav <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: loli10K <[email protected]>
Signed-off-by: loli10K <[email protected]>
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb
Closes #1796
Closes #3438
Closes #6233
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Yuri Pankov <[email protected]>
Reviewed by: Robert Mustacchi <[email protected]>
Approved by: Joshua M. Clulow <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Porting Notes:
* All hunks unrelated to ZFS were dropped.
OpenZFS-issue: https://www.illumos.org/issues/5428
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130
Closes #6326
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Alex Reece <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8067
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085
Closes #6319
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, there is no way to pause a scrub. Pausing may
be useful when the pool is busy with other I/O to preserve
bandwidth.
This patch adds the ability to pause and resume scrubbing.
This is achieved by maintaining a persistent on-disk scrub state.
While the state is 'paused' we do not scrub any more blocks.
We do however perform regular scan housekeeping such as
freeing async destroyed and deadlist blocks while paused.
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Thomas Caputi <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #6167
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Andriy Gapon <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
The existing kernel-side code only provides a method to rollback to a
latest snapshot, whatever it happens to be at the time when the rollback
is actually done. That could be unsafe or confusing in environments
where concurrent DSL changes are possible as the resulting state could
correspond to a newer or older snapshot than the originally requested
one.
This change allows to amend that method such that the rollback is
performed only when the latest snapshot has a specific name. That is,
if a new snapshot is concurrently created or the target snapshot is
destroyed, then no rollback is done and EXDEV error is returned.
New libzfs_core function lzc_rollback_to() is provided for the new
functionality. libzfs is changed to use lzc_rollback_to() to implement
zfs rollback command.
Perhaps we should return different errors to distinguish the case where
the desired snapshot exists but it's not the latest snapshot and the
case where the desired snapshot does not exist.
OpenZFS-issue: https://www.illumos.org/issues/7600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb
Closes #6292
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Andriy Gapon <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Alek Pinchuk <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8416
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/589c189
Closes #6288
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Andriy Gapon <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/37359a6
Closes #6287
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Andrew Stormont <[email protected]>
Reviewed by: Andriy Gapon <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Approved by: Dan McDonald <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8264
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a
Closes #6254
|
|
|
|
|
|
|
|
|
|
|
|
| |
This prints dashes instead of zeros for zero latency values in
'zpool iostat -p'. You'll get zero latencies reported when the
disk is idle, but technically a zero latency is invalid, since you
can't measure the latency of doing nothing.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #6210
|
|
|
|
|
|
|
|
|
|
|
| |
In the original form of device error injection, it was an all or nothing
situation. To help simulate intermittent error conditions, you can now
specify a real number percentage value. This is also very useful for our
ZFS fault diagnosis testing and for injecting intermittent errors during
load testing.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #6227
|
|
|
|
|
|
|
|
|
|
| |
This continues what was started in
0eef1bde31d67091d3deed23fe2394f5a8bf2276 by fully converting zvols
to avoid unnecessary dnode_hold() calls. This saves a small amount
of CPU time and slightly improves latencies of operations on zvols.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #6058
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock. To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.
The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs. When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128). Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.
This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.
Work sponsored by Intel Corp.
Authored by: Matthew Ahrens <[email protected]>
Reviewed-by: Ned Bass <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Matthew Ahrens <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
- Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
- Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
- While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.
Sponsored by: iXsystems, Inc.
Authored by: Alexander Motin <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Andriy Gapon <[email protected]>
Reviewed by: Steven Hartland <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
Closes #6191
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
When writing pre-compressed buffers, arc_write() requires that
the compression algorithm used to compress the buffer matches
the compression algorithm requested by the zio_prop_t, which is
set by dmu_write_policy(). This makes dmu_write_policy() and its
callers a bit more complicated.
We simplify this by making arc_write() trust the caller to supply
the type of pre-compressed buffer that it wants to write,
and override the compression setting in the zio_prop_t.
OpenZFS-issue: https://www.illumos.org/issues/8155
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58
Closes #6200
|
|
|
|
|
|
|
|
|
|
| |
Provide a format parameter to super_setup_bdi_name() so we don't
create duplicate names in '/devices/virtual/bdi' sysfs namespace which
would prevent us from mounting more than one ZFS filesystem at a time.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6147
|
|
|
|
|
|
|
|
|
|
|
| |
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)' but 'zpool sync'
will return when data has hit the main storage instead of potentially
just the ZIL as is the case with the 'sync(2)' cmd.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Alek Pinchuk <[email protected]>
Closes #6122
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline. Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED. The -f faults
persist across imports, unless they were set with the temporary
(-t) flag. Both persistent and temporary faults can be cleared
with zpool clear.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #6094
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bit 21 of the send stream flags was inadvertently used for two
different features under concurrent development. To avoid any
future compatibility problems the large dnode flag is being
switched to bit 23 which is unused.
The large dnode feature has only been present in pre-releases of
ZoL and dnodesize defaults to legacy which is compatible with
existing OpenZFS implementations. Users with dnodesize=auto
needing to use zfs send/recv must update ZoL on both the
source and destination systems.
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Ned Bass <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6139
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On a raidz vdev, a block that does not span all child vdevs, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.
This patch eliminates such spurious IOs.
Reviewed-by: Gvozden Neskovic <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Signed-off-by: Isaac Huang <[email protected]>
Closes #5316
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: George Melikov <[email protected]>
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.
Porting Notes:
- ASSERTV added to txg_sync_waiting() for unused variable.
OpenZFS-issue: https://www.illumos.org/issues/8063
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46
Closes #6109
|
|
|
|
|
|
|
|
| |
Linux 4.9 added current_time() as the preferred interface to get
the filesystem time. CURRENT_TIME was retired in Linux 4.12.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6114
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.
This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.
References:
https://www.illumos.org/issues/2745
https://www.illumos.org/issues/3753
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #1350
Closes #5349
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Document the existence of `createtxg` and `guid` native properties
in man pages and zfs command output.
One of the great features of ZFS is incremental replication of
snapshots, possibly between pools on different machines.
Shell scripts are commonly used to auomate this procedure. They have to
find the most recent common snapshot between both sides and then
perform incremental send & recv.
Currently, scripts rely on the sorting order of `zfs list`, which
defaults to `createtxg`, and the assumption that snapshot names on
either side do not change.
By making `createtxg` and `guid` part of the public ZFS interface,
scripts are enabled to use
a) `createtxg` to determine the logical & temporal order of snapshots
(the creation property is not an equivalent substitute since
multiple snapshots may be created within one second)
b) `guid` to uniquely identify a snapshot, independent of its current
display name
This has the potential of making scripts safer and correct.
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: DHE <[email protected]>
Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Christian Schwarz <[email protected]>
Closes #6102
|
|
|
|
|
|
|
|
| |
zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related
check, so we use them accordingly.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #6113
|
|
|
|
|
|
|
|
|
| |
The proposed debugging enhancements in zfsonlinux/spl#587
identified the following missing *_destroy/*_fini calls.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Closes #5428
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.
Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Issue #5902
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit allow higher ashift values (up to 16) in 'zpool create'
The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability the safely roll back a pool to recover it.
Since b02fe35 the largest uberblock size we support is 8K: this
allow us to store a minimum number of 16 uberblocks in the vdev
label, even with higher ashift values.
Additionally change 'ashift' pool property behaviour: if set it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.
Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #2024
Closes #4205
Closes #4740
Closes #5763
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When vdev_psize increases, the location of labels 2 and 3 changes
because their location is relative to the end of the device.
The configs for labels 2 and 3 are written during the next spa_sync()
because the vdev is added to the dirty config list. However, the
uberblock rings are not re-written in their new location, leaving the
device vulnerable to the beginning of the device being overwritten or
damaged.
This patch copies the uberblock ring from label 0 to labels 2 and 3,
in their new locations, at the next sync after vdev_psize increases.
Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks
are copied.
Reviewed-by: BearBabyLiu <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #5108
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add zfs_nicebytes() to print human-readable sizes
Some 'zfs', 'zpool' and 'zdb' output strings can be confusing to the
user when no units are specified. This add a new zfs_nicenum_format
"ZFS_NICENUM_BYTES" used to print bytes in their human-readable form.
Additionally, update some test cases to use machine-parsable 'zfs get'.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #2414
Closes #3185
Closes #3594
Closes #6032
|
|
|
|
|
|
|
|
|
|
| |
All filesystems were converted to dynamically allocated BDIs. The
destruction of backing_dev_info structures is handled as part of
super block destruction. Refactor the code to abstract away the
details of creating and destroying a BDI.
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6089
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 37f9dac removed the zvol_taskq for processing zvol requests.
This was removed as part of switching to make_request_fn and was
motivated by a concern at the time over dispatch latency.
However, this also made all bio request synchronous, and caused
serious performance issues as the bio submitter would wait for
every bio it submitted, effectively making the IO depth 1.
This patch reinstate zvol_taskq, and to make sure overlapped I/Os
are ordered properly, we take range lock in zvol_request, and pass
it along with bio to the I/O functions zvol_{write,discard,read}.
In order to facilitate benchmarks a zvol_request_sync module
option was added to switch between sync and async request handling.
For the moment, the default behavior is synchronous but this is
likely to change pending additional testing.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #5824
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch updates the "zpool status/iostat -c" commands to only run
"pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever
you install to). The scripts can only be run from -c as an unprivileged
user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is
set by root). This was done to encourage scripts to be written is such
a way that normal users can use them, and to be cautious. If your
script needs to run a privileged command, consider adding the
appropriate line in /etc/sudoers. See zpool(8) for an example of how
to do this.
The patch also allows the scripts to output custom column names. If
the script outputs a line like:
name=value
then "name" is used for the column name, and "value" is its value.
Multiple columns can be specified by outputting multiple lines. Column
names and values can have spaces. If the value is empty, a dash (-) is
printed instead.
After all the "name=value" lines are read (if any), zpool will take the
next the next line of output (if any) and print it without a column
header. After that, no more lines will be processed. This can be
useful for printing errors.
Lastly, this patch also disables the -c option with the latency and
request size histograms, since it produced awkward output and made the
code harder to maintain.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #5852
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
project)
Authored by: Toomas Soome <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Reviewed by: Andrew Stormont <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Porting Notes:
- grub-2.02-beta2-422-gcad5cc0 includes support for large blocks.
- Commit 8aab121 allowed GZIP[1-9].
- Grub allows pools with multiple top-level vdevs.
OpenZFS-issue: https://www.illumos.org/issues/5120
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd
Closes #6007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Yuri Pankov <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Reviewed-by: loli10K <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Porting Notes:
- Updated 'zpool labelclear' and 'zdb -l' such that they attempt
to find a vdev given solely its short name. This behavior is
consistent with the upstream OpenZFS code and the test cases
depend on it. The actual implementation differs slightly due
to device naming conventions on Linux.
- auto_online_001_pos, auto_replace_001_pos and add-o_ashift
test cases updated to expect failure when no label exists.
- read_efi_label() and zpool_label_disk_check() are read-only
operations and should use O_RDONLY at open time to enforce this.
- zpool_label_disk() and zpool_relabel_disk() write the partition
information using O_DIRECT an fsync() and page cache invalidation
to ensure a consistent view of the device.
- dump_label() in zdb should invalidate the page cache in order
to get the authoritative label from disk.
OpenZFS-issue: https://www.illumos.org/issues/6865
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95076c
Closes #5981
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: Bill Pijewski <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/2932
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b
Closes #5984
Closes #5216
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Authored by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Saso Kiselkov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
We don't want to dirty any data when we're in the final txgs of the pool
export logic. This change introduces checks to make sure that no data is
dirtied after a certain point. It also addresses the culprit of this
specific bug – the space map cannot be upgraded when we're in final
stages of pool export. If we encounter a space map that wants to be
upgraded in this phase, then we simply ignore the request as it will get
retried the next time we set the fragmentation metric on that metaslab.
OpenZFS-issue: https://www.illumos.org/issues/8023
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ef00f5
Closes #5991
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fail with the loader project bits
Authored by: Toomas Soome <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: Marcel Telka <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
Porting Notes:
- Removed gzip and zle compression restriction on bootfs
datasets. Grub added support for these long ago. Ay
version of grub which understands lz4 also supports this.
- Enabled rootpool tests in runfile but skipped by default
in setup on Linux since they modify the rootpool.
- bootfs_006_pos.ksh, striped pools are allowed as bootfs.
OpenZFS-issue: https://www.illumos.org/issues/7404
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/55a424c
Closes #5982
|