aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs
Commit message (Collapse)AuthorAgeFilesLines
* Fix volmode=none property behavior at import timeLOLi2017-07-311-0/+3
| | | | | | | | | | | At import time spa_import() calls zvol_create_minors() directly: with the current implementation we have no way to avoid device node creation when volmode=none. Fix this by enforcing volmode=none directly in zvol_alloc(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6426
* zfs promote|rename .../%recv should be an errorLOLi2017-07-281-2/+10
| | | | | | | | | | | | | | If we are in the middle of an incremental 'zfs receive', the child .../%recv will exist. If we run 'zfs promote' .../%recv, it will "work", but then zfs gets confused about the status of the new dataset. Attempting to do this promote should be an error. Similarly renaming .../%recv datasets should not be allowed. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #4843 Closes #6339
* OpenZFS 7915 - checks in l2arc_evict could use some cleaning upAndriy Gapon2017-07-281-15/+9
| | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Prakash Surya <[email protected]> Approved by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7915 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/836a00c Closes #6375
* OpenZFS 8373 - TXG_WAIT in ZIL commit pathAndriy Gapon2017-07-281-1/+18
| | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8373 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7f04961 Closes #6403
* OpenZFS 8508 - Mounting a zpool on 32-bit platforms panicsGiuseppe Di Natale2017-07-261-1/+1
| | | | | | | | | | | | Authored by: Justin Hibbits <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8508 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/15fc257 Closes #6404
* Add line info and SET_ERROR() to ZFS debug logNed Bass2017-07-251-7/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redefine the SET_ERROR macro in terms of __dprintf() so the error return codes get logged as both tracepoint events (if tracepoints are enabled) and as ZFS debug log entries. This also allows us to use the same definition of SET_ERROR() in kernel and user space. Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise or'd into zfs_flags. Setting this flag enables both dprintf() and SET_ERROR() messages in the debug log. That is, setting ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF|ZFS_DEBUG_SET_ERROR are equivalent (this was done for sake of simplicity). Leaving ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which helps avoid cluttering up the logs. To enable SET_ERROR() logging, run: echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable echo 512 > /sys/module/zfs/parameters/zfs_flags Remove the zfs_set_error_class tracepoints event class since SET_ERROR() now uses __dprintf(). This sacrifices a bit of granularity when selecting individual tracepoint events to enable but it makes the code simpler. Include file, function, and line number information in debug log entries. The information is now added to the message buffer in __dprintf() and as a result the zfs_dprintf_class tracepoints event class was changed from a 4 parameter interface to a single parameter. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #6400
* Some additional send stream validity checkingNed Bass2017-07-251-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | Check in the DMU whether an object record in a send stream being received contains an unsupported dnode slot count, and return an error if it does. Failure to catch an unsupported dnode slot count would result in a panic when the SPA attempts to increment the reference count for the large_dnode feature and the pool has the feature disabled. This is not normally an issue for a well-formed send stream which would have the DMU_BACKUP_FEATURE_LARGE_DNODE flag set if it contains large dnodes, so it will be rejected as unsupported if the required feature is disabled. This change adds a missing object record field validation. Add missing stream feature flag checks in dmu_recv_resume_begin_check(). Consolidate repetitive comment blocks in dmu_recv_begin_check(). Update zstreamdump to print the dnode slot count (dn_slots) for an object record when running in verbose mode. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #6396
* Fix 'zpool clear' on suspended poolsBrian Behlendorf2017-07-251-1/+2
| | | | | | | | | | | | | | | | | 'zpool clear' should be able to resume I/O on suspended, but otherwise healthy, pools. 4a283c7 accidentally introduced a new code path where we call txg_wait_synced() on the suspended pool before we had the chance to resume I/O via zio_resume(): this results in the 'zpool clear' command hanging indefinitely, waiting for a TXG that cannot be synced. Fix this by avoiding the call to txg_wait_synced(). Reviewed-by: George Melikov <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6399
* Report MMP_STATE_NO_HOSTID immediatelyOlaf Faaland2017-07-251-6/+6
| | | | | | | | | | | | There is no need to perform the activity check before detecting that the user must set the system hostid, because the pool's multihost property is on, but spa_get_hostid() returned 0. The initial call to vdev_uberblock_load() provided the information required. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #6388
* Add callback for zfs_multihost_intervalOlaf Faaland2017-07-251-1/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | Add a callback to wake all running mmp threads when zfs_multihost_interval is changed. This is necessary when the interval is changed from a very large value to a significantly lower one, while pools are imported that have the multihost property enabled. Without this commit, the mmp thread does not wake up and detect the new interval until after it has waited the old multihost interval time. A user monitoring mmp writes via the provided kstat would be led to believe that the changed setting did not work. Added a test in the ZTS under mmp to verify the new functionality is working. Added a test to ztest which starts and stops mmp threads, and calls into the code to signal sleeping mmp threads, to test for deadlocks or similar locking issues. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #6387
* Release SCL_STATE in map_write_done()Olaf Faaland2017-07-251-5/+6
| | | | | | | | | | | | | | | | | The config lock must be held for the duration of the MMP write. Since the I/Os are executed via map_nowait(), the done function is the only place where we know the write has completed. Since SCL_STATE is taken as reader, overlapping I/Os do not create a deadlock. The refcount is simply increased when new I/Os are queued and decreased when I/Os complete. Test case added which exercises the probe IO call path to verify the fix and prevent a regression. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #6394
* Revert Fix vdev_probe() call wrt SCL_STATE_ALLOlaf Faaland2017-07-251-1/+1
| | | | | | | | | This reverts commit cc9c6bc, which has been causing intermittent test failures on buildbot. A correct fix for this locking issue has been applied in a separate patch. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]>
* Use correct macro for hz in mmp.cOlaf Faaland2017-07-241-1/+1
| | | | | | | | | | | | | | Commit 379ca9c Multi-modifier protection (MMP) used HZ to convert nanoseconds to ticks for use with cv_timedwait() and ddi_get_lbolt(). The correct macro is hz, which is defined within the SPL for kernel space, and within zfs_context.h for user space. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #6357 Closes #6360
* Fix coverity defects: CID 165755Giuseppe Di Natale2017-07-242-3/+3
| | | | | | | | | CID 165755: Division or modulo by zero (DIVIDE_BY_ZERO) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #6352
* Linux 4.13 compat: bio->bi_status and blk_status_tBrian Behlendorf2017-07-231-5/+6
| | | | | | | | | Commit torvalds/linux@4e4cbee9. The bio->bi_error field was replaced with bio->bi_status which is an enum that describes all possible error types. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6351
* Fix vdev_probe() call outside SCL_STATE_ALL lockBrian Behlendorf2017-07-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an IO fails then zio_vdev_io_done() can call vdev_probe() to determine the health of the vdev. This is safe as long as the original zio was submitted with zio_wait() and holds the SCL_STATE_ALL lock over the operation. If zio_no_wait() was used then the done callback will submit the probe IO outside the SCL_STATE_ALL lock and hit this ASSERT in zio_create() ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER)); Resolve the issue by only allowing vdev_probe() to be called when there's a waiter indicating the caller is using zio_wait(). This assumes that caller is still holding SCL_STATE_ALL. This issue isn't MMP specific but was surfaced when testing. Without this patch it can be reproduced by running: zpool set multihost on <pool> zinject -d <vdev> -e io -T write -f 50 <pool> -L uber Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #745 Closes #6279
* Multi-modifier protection (MMP)Olaf Faaland2017-07-1310-60/+998
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add multihost=on|off pool property to control MMP. When enabled a new thread writes uberblocks to the last slot in each label, at a set frequency, to indicate to other hosts the pool is actively imported. These uberblocks are the last synced uberblock with an updated timestamp. Property defaults to off. During tryimport, find the "best" uberblock (newest txg and timestamp) repeatedly, checking for change in the found uberblock. Include the results of the activity test in the config returned by tryimport. These results are reported to user in "zpool import". Allow the user to control the period between MMP writes, and the duration of the activity test on import, via a new module parameter zfs_multihost_interval. The period is specified in milliseconds. The activity test duration is calculated from this value, and from the mmp_delay in the "best" uberblock found initially. Add a kstat interface to export statistics about Multiple Modifier Protection (MMP) updates. Include the last synced txg number, the timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV label that received the last MMP update, and the VDEV path. Abbreviated output below. $ cat /proc/spl/kstat/zfs/mypool/multihost 31 0 0x01 10 880 105092382393521 105144180101111 txg timestamp mmp_delay vdev_guid vdev_label vdev_path 20468 261337 250274925 68396651780 3 /dev/sda 20468 261339 252023374 6267402363293 1 /dev/sdc 20468 261340 252000858 6698080955233 1 /dev/sdx 20468 261341 251980635 783892869810 2 /dev/sdy 20468 261342 253385953 8923255792467 3 /dev/sdd 20468 261344 253336622 042125143176 0 /dev/sdab 20468 261345 253310522 1200778101278 2 /dev/sde 20468 261346 253286429 0950576198362 2 /dev/sdt 20468 261347 253261545 96209817917 3 /dev/sds 20468 261349 253238188 8555725937673 3 /dev/sdb Add a new tunable zfs_multihost_history to specify the number of MMP updates to store history for. By default it is set to zero meaning that no MMP statistics are stored. When using ztest to generate activity, for automated tests of the MMP function, some test functions interfere with the test. For example, the pool is exported to run zdb and then imported again. Add a new ztest function, "-M", to alter ztest behavior to prevent this. Add new tests to verify the new functionality. Tests provided by Giuseppe Di Natale. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #745 Closes #6279
* OpenZFS 6939 - add sysevents to zfs core for commandsDave Eddy2017-07-126-49/+182
| | | | | | | | | | | | | | | | | | | | Authored by: Dave Eddy <[email protected]> Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Joshua M. Clulow <[email protected]> Reviewed by: Josh Wilsdon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Alan Somers <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Approved by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6939 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b Closes #6328
* Add port of FreeBSD 'volmode' propertyLOLi2017-07-122-13/+175
| | | | | | | | | | | | | | | | | | | | | | | | | The volmode property may be set to control the visibility of ZVOL block devices. This allow switching ZVOL between three modes: full - existing fully functional behaviour (default) dev - hide partitions on ZVOL block devices none - not exposing volumes outside ZFS Additionally the new zvol_volmode module parameter can be used to control the default behaviour. This functionality can be used, for instance, on "backup" pools to avoid cluttering /dev with unneeded zd* devices. Original-patch-by: mav <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: loli10K <[email protected]> Signed-off-by: loli10K <[email protected]> FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb Closes #1796 Closes #3438 Closes #6233
* OpenZFS 5428 - provide fts(), reallocarray(), and strtonum()Yuri Pankov2017-07-086-13/+11
| | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: * All hunks unrelated to ZFS were dropped. OpenZFS-issue: https://www.illumos.org/issues/5428 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130 Closes #6326
* OpenZFS 8126 - ztest assertion failed in dbuf_dirty due to dn_nlevels changingMatthew Ahrens2017-07-071-5/+10
| | | | | | | | | | | | | | | | | | | The sync thread is concurrently modifying dn_phys->dn_nlevels while dbuf_dirty() is trying to assert something about it, without holding the necessary lock. We need to move this assertion further down in the function, after we have acquired the dn_struct_rwlock. Authored by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8126 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0ef125d Closes #6314
* OpenZFS 8067 - zdb should be able to dump literal embedded block pointerMatthew Ahrens2017-07-071-0/+33
| | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8067 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085 Closes #6319
* Fix 'zpool clear' on readonly poolsLOLi2017-07-071-1/+1
| | | | | | | | | | Illumos 4080 inadvertently allows 'zpool clear' on readonly pools: fix this by reintroducing a check (POOL_CHECK_READONLY) in zfs_ioc_clear registration code. Signed-off-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6306
* Implemented zpool scrub pause/resumeAlek P2017-07-064-34/+166
| | | | | | | | | | | | | | | | | | Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Thomas Caputi <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #6167
* Reschedule processes on -ERESTARTSYSArkadiusz Bubała2017-07-061-0/+2
| | | | | | | | | | | | On the single core machine the system may hang when the spa_namespare_lock acquisition fails in the zvol_first_open function. It returns -ERESTARTSYS error what causes the endless loop in __blkdev_get function. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Arkadiusz Bubała <[email protected]> Closes #6283 Closes #6312
* OpenZFS 8378 - crash due to bp in-memory modification of nopwrite blockMatthew Ahrens2017-07-043-32/+53
| | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf it not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. OpenZFS-issue: https://www.illumos.org/issues/8378 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742 Closes #6293
* OpenZFS 7600 - zfs rollback should pass target snapshot to kernelAndriy Gapon2017-07-042-6/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> The existing kernel-side code only provides a method to rollback to a latest snapshot, whatever it happens to be at the time when the rollback is actually done. That could be unsafe or confusing in environments where concurrent DSL changes are possible as the resulting state could correspond to a newer or older snapshot than the originally requested one. This change allows to amend that method such that the rollback is performed only when the latest snapshot has a specific name. That is, if a new snapshot is concurrently created or the target snapshot is destroyed, then no rollback is done and EXDEV error is returned. New libzfs_core function lzc_rollback_to() is provided for the new functionality. libzfs is changed to use lzc_rollback_to() to implement zfs rollback command. Perhaps we should return different errors to distinguish the case where the desired snapshot exists but it's not the latest snapshot and the case where the desired snapshot does not exist. OpenZFS-issue: https://www.illumos.org/issues/7600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb Closes #6292
* OpenZFS 7910 - l2arc_write_buffers() may write beyond target_szAndriy Gapon2017-07-041-44/+44
| | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7910 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb6af4b Closes #6291
* OpenZFS 8377 - Panic in bookmark deletionMatthew Ahrens2017-06-301-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> The problem is that when dsl_bookmark_destroy_check() is executed from open context (the pre-check), it fills in dbda_success based on the existence of the bookmark. But the bookmark (or containing filesystem as in this case) can be destroyed before we get to syncing context. When we re-run dsl_bookmark_destroy_check() in syncing context, it will not add the deleted bookmark to dbda_success, intending for dsl_bookmark_destroy_sync() to not process it. But because the bookmark is still in dbda_success from the open-context call, we do try to destroy it. The fix is that dsl_bookmark_destroy_check() should not modify dbda_success when called from open context. OpenZFS-issue: https://www.illumos.org/issues/8377 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b0b6fe3 Closes #6286
* Clean up large dnode codeMatthew Ahrens2017-06-292-2/+4
| | | | | | | | | | | | Resolves issues discovered when porting to OpenZFS. * Lint warnings. * Made dnode_move_impl() large dnode aware. This functionality is currently unused on Linux. Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #6262
* Set arc_meta_limit, arc_dnode_limit on changechrisrd2017-06-291-18/+24
| | | | | | | | | | | | | | | Make zfs_arc_meta_limit_percent and zfs_arc_dnode_limit_percent behave as you would expect from zfs-module-parameters.5. - recalculate arc_meta_limit if zfs_arc_meta_limit_percent changes - recalculate arc_dnode_limit if zfs_arc_dnode_limit_percent changes - correctly set arc_meta_limit and arc_dnode_limit if zfs_arc_max or zfs_arc_meta_min changes Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #6269
* GCC 7.1 fixesTony Hutter2017-06-282-7/+6
| | | | | | | | | | | GCC 7.1 with will warn when we're not checking the snprintf() return code in cases where the buffer could be truncated. This patch either checks the snprintf return code (where applicable), or simply disables the warnings (ztest.c). Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6253
* Cap maximum aggregate IO sizeBrian Behlendorf2017-06-271-2/+5
| | | | | | | | | | | | | | Commit 8542ef8 allowed optional IOs to be aggregated beyond the specified aggregation limit. Since the aggregation limit was also used to enforce the maximum block size, setting `zfs_vdev_aggregation_limit=16777216` could result in an attempt to allocate an ABD larger than 16M. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6259 Closes #6270
* Refine use of zv_state_lock.Boris Protopopov2017-06-271-68/+165
| | | | | | | | | | Use zv_state_lock to protect all members of zvol_state structure, add relevant ASSERT()s. Take zv_suspend_lock before zv_state_lock, do not hold zv_state_lock across suspend/resume. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Closes #6226
* OpenZFS 5220 - L2ARC does not support devices that do not provide 512B accessGiuseppe Di Natale2017-06-261-11/+56
| | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/5220 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/403a8da Closes #6260
* OpenZFS 8264 - want support for promoting datasets in libzfs_coreGiuseppe Di Natale2017-06-261-3/+36
| | | | | | | | | | | | | Authored by: Andrew Stormont <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8264 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a Closes #6254
* Call cv_signal() with mutex heldBoris Protopopov2017-06-261-1/+1
| | | | | | | | In bqueue_dequeue(), call cv_signal() with bq_lock held. Re-enable rsend_009_pos to test the fix. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Closes #5887
* Add kpreempt_disable/enable around CPU_SEQID usesMorgan Jones2017-06-191-2/+6
| | | | | | | | | | | In zfs/dmu_object and icp/core/kcf_sched, the CPU_SEQID macro should be surrounded by `kpreempt_disable` and `kpreempt_enable` calls to avoid a Linux kernel BUG warning. These code paths use the cpuid to minimize lock contention and is is safe to reschedule the process to a different processor at any time. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Morgan Jones <[email protected]> Closes #6239
* Inject zinject(8) a percentage amount of dev errsDon Brady2017-06-161-7/+30
| | | | | | | | | | | In the original form of device error injection, it was an all or nothing situation. To help simulate intermittent error conditions, you can now specify a real number percentage value. This is also very useful for our ZFS fault diagnosis testing and for injecting intermittent errors during load testing. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #6227
* Avoid 'queue not locked' warning at pool import.Boris Protopopov2017-06-151-1/+1
| | | | | | | | Use queue_flag_set_unlocked() in zvol_alloc(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Issue #6226
* Fix zvol_state_t->zv_open_count raceLOLi2017-06-151-11/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5559ba0 added zv_state_lock to protect zvol_state_t internal data: this, however, doesn't guard zv->zv_open_count and zv->zv_disk->private_data in zvol_remove_minors_impl(). Fix this by taking zv->zv_state_lock before we check its zv_open_count. P1 (z_zvol) P2 (systemd-udevd) --- --- zvol_remove_minors_impl() : zv->zv_open_count==0 zvol_open() ->mutex_enter(zv_state_lock) : zv->zv_open_count++ ->mutex_exit(zv_state_lock) ->mutex_enter(zv->zv_state_lock) ->zvol_remove(zv) ->mutex_exit(zv->zv_state_lock) : zv->zv_disk->private_data = NULL ->zvol_free() -->ASSERT(zv->zv_open_count==0) * zvol_release() : zv = disk->private_data ->ASSERT(zv && zv->zv_open_count>0) * --- --- * ASSERT() fails Reviewed by: Boris Protopopov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected] Signed-off-by: loli10K <[email protected]> Closes #6213
* Fix zvol_init error handlingRichard Yao2017-06-131-0/+1
| | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]>
* Make zvol operations use _by_dnode routinesRichard Yao2017-06-132-14/+12
| | | | | | | | | | This continues what was started in 0eef1bde31d67091d3deed23fe2394f5a8bf2276 by fully converting zvols to avoid unnecessary dnode_hold() calls. This saves a small amount of CPU time and slightly improves latencies of operations on zvols. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #6058
* Reduce stack usage of dsl_dir_tempreserve_implDeHackEd2017-06-121-6/+19
| | | | | | | | Buildbots and zfs-tests regularly see 7 kilobytes of stack usage with this function. Convert self-calls to iterations Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #6219
* OpenZFS 8056 - zfs send size estimate is inaccurate for some zvolsPaul Dagnelie2017-06-091-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> The send size estimate for a zvol can be too low, if the size of the record headers (dmu_replay_record_t's) is a significant portion of the size. This is typically the case when the data is highly compressible, especially with embedded blocks. The problem is that dmu_adjust_send_estimate_for_indirects() assumes that blocks are the size of the "recordsize" property (128KB). However, for zvols, the blocks are the size of the "volblocksize" property (8KB). Therefore, we estimate that there will be 16x less record headers than there really will be. The fix is to check the type of the object set (whether it is a zvol or not) and pick the appropriate property. In addition, while we are at it, we also add the size of the BEGIN and END records to the estimate. OpenZFS-issue: https://www.illumos.org/issues/8056 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faf09cd Closes #6205
* OpenZFS 8156 - dbuf_evict_notify() does not need dbuf_evict_lockMatthew Ahrens2017-06-091-11/+7
| | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do the eviction itself (because the evict thread is not able to keep up). This can result in massive lock contention. It isn't necessary to hold the lock, because if we make the wrong choice occasionally, nothing bad will happen. This commit results in a ~60% performance improvement for ARC-cached sequential reads. OpenZFS-issue: https://www.illumos.org/issues/8156 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f73e5d9 Closes #6204
* OpenZFS 8199 - multi-threaded dmu_object_alloc()Matthew Ahrens2017-06-093-70/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dmu_object_alloc() is single-threaded, so when multiple threads are creating files in a single filesystem, they spend a lot of time waiting for the os_obj_lock. To improve performance of multi-threaded file creation, we must make dmu_object_alloc() typically not grab any filesystem-wide locks. The solution is to have a "next object to allocate" for each CPU. Each of these "next object"s is in a different block of the dnode object, so that concurrent allocation holds dnodes in different dbufs. When a thread's "next object" reaches the end of a chunk of objects (by default 4 blocks worth -- 128 dnodes), it will be reset to the per-objset os_obj_next, which will be increased by a chunk of objects (128). Only when manipulating the os_obj_next will we need to grab the os_obj_lock. This decreases lock contention dramatically, because each thread only needs to grab the os_obj_lock briefly, once per 128 allocations. This results in a 70% performance improvement to multi-threaded object creation (where each thread is creating objects in its own directory), from 67,000/sec to 115,000/sec, with 8 CPUs. Work sponsored by Intel Corp. Authored by: Matthew Ahrens <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Matthew Ahrens <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8199 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374 Closes #4703 Closes #6117
* OpenZFS 7578 - Fix/improve some aspects of ZIL writingGiuseppe Di Natale2017-06-094-114/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. - Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z*_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. - While there, untangle and unify code of z*_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Sponsored by: iXsystems, Inc. Authored by: Alexander Motin <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Steven Hartland <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7578 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac Closes #6191
* OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffersMatthew Ahrens2017-06-074-23/+15
| | | | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t. OpenZFS-issue: https://www.illumos.org/issues/8155 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58 Closes #6200
* Linux 4.9 compat: fix zfs_ctldir xattr handlingLOLi2017-06-051-0/+3
| | | | | | | | | Since torvalds/linux@d0a5b99 IOP_XATTR is used to indicate the inode has xattr support: clear it for the ctldir inodes to avoid EIO errors. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6189