summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* Typo in dsl_dataset.hDamian Wojsław2017-10-171-3/+3
| | | | | | | | | | The parameters dsl_dataset_t *os in function prototype should be renamed to dsl_dataset_t *ds. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Damian Wojsław <[email protected]> Closes #6756 Closes #6273
* Free objects when receiving full stream as cloneFabian Grünbichler2017-10-161-0/+1
| | | | | | | | | | | | | All objects after the last written or freed object are not supposed to exist after receiving the stream. Free them accordingly, as if a freeobjects record for them had been included in the stream. Reviewed by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Closes #5699 Closes #6507 Closes #6616
* Scale the dbuf cache with arc_cchrisrd2017-10-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d3c2ae1 introduced a dbuf cache with a default size of the minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache is counted as metadata for the purposes of ARC size calculations. On a 1GB box the ARC maximum size defaults to c_max 493M which gives a dbuf cache default minimum size of 15.4M, and the ARC metadata defaults to minimum 16M. I.e. the dbuf cache is an significant proportion of the minimum metadata size. With other overheads involved this actually means the ARC metadata doesn't get down to the minimum. This patch dynamically scales the dbuf cache to the target ARC size instead of statically scaling it to the maximum ARC size. (The scale is still set by dbuf_cache_max_shift and the maximum size is still fixed by dbuf_cache_max_bytes.) Using the target ARC size rather than the current ARC size is done to help the ARC reach the target rather than simply focusing on the current size. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #6506 Closes #6561
* Correct cppcheck errors (#6662)Giuseppe Di Natale2017-09-201-6/+0
| | | | | | | | | | | | | ZFS buildbot STYLE builder was moved to Ubuntu 17.04 which has a newer version of cppcheck. Handle the new cppcheck errors. uu_* functions removed in this commit were unused and effectively dead code. They are now retired. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #6653
* Linux 4.14 compat: IO acct, global_page_state, etc (#6655)Brian Behlendorf2017-09-191-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | generic_start_io_acct/generic_end_io_acct in the master branch of the linux kernel requires that the request_queue be provided. Move the logic from freemem in the spl to arc_free_memory in arc.c. Do this so we can take advantage of global_page_state interface checks in zfs. Upstream kernel replaced struct block_device with struct gendisk in struct bio. Determine if the function bio_set_dev exists during configure and have zfs use that if it exists. bio_set_dev https://github.com/torvalds/linux/commit/74d4699 global_node_page_state https://github.com/torvalds/linux/commit/75ef718 io acct https://github.com/torvalds/linux/commit/d62e26b Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #6635 Signed-off-by: Brian Behlendorf <[email protected]>
* Improved dnode allocation and dmu_hold_impl() (#6611)Giuseppe Di Natale2017-09-131-0/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the code, fix errors introduced by commit dbeb879 (PR #6117) interacting badly with large dnodes, and improve performance. * When allocating a new dnode in dmu_object_alloc_dnsize(), update the percpu object ID for the core's metadnode chunk immediately. This eliminates most lock contention when taking the hold and creating the dnode. * Correct detection of the chunk boundary to work properly with large dnodes. * Separate the dmu_hold_impl() code for the FREE case from the code for the ALLOCATED case to make it easier to read. * Fully populate the dnode handle array immediately after reading a block of the metadnode from disk. Subsequently the dnode handle array provides enough information to determine which dnode slots are in use and which are free. * Add several kstats to allow the behavior of the code to be examined. * Verify dnode packing in large_dnode_008_pos.ksh. Since the test is purely creates, it should leave very few holes in the metadnode. * Add test large_dnode_009_pos.ksh, which performs concurrent creates and deletes, to complement existing test which does only creates. With the above fixes, there is very little contention in a test of about 200,000 racing dnode allocations produced by tests 'large_dnode_008_pos' and 'large_dnode_009_pos'. name type data dnode_hold_dbuf_hold 4 0 dnode_hold_dbuf_read 4 0 dnode_hold_alloc_hits 4 3804690 dnode_hold_alloc_misses 4 216 dnode_hold_alloc_interior 4 3 dnode_hold_alloc_lock_retry 4 0 dnode_hold_alloc_lock_misses 4 0 dnode_hold_alloc_type_none 4 0 dnode_hold_free_hits 4 203105 dnode_hold_free_misses 4 4 dnode_hold_free_lock_misses 4 0 dnode_hold_free_lock_retry 4 0 dnode_hold_free_overflow 4 0 dnode_hold_free_refcount 4 57 dnode_hold_free_txg 4 0 dnode_allocate 4 203154 dnode_reallocate 4 0 dnode_buf_evict 4 23918 dnode_alloc_next_chunk 4 4887 dnode_alloc_race 4 0 dnode_alloc_next_block 4 18 The performance is slightly improved for concurrent creates with 16+ threads, and unchanged for low thread counts. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]>
* Crash in dbuf_evict_one with DTRACE_PROBEgaurkuma2017-08-211-19/+39
| | | | | | | | | | | Update the dbuf__evict__one() tracepoint so that it can safely handle a NULL dmu_buf_impl_t pointer. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: gaurkuma <[email protected]> Closes #6463
* Add line info and SET_ERROR() to ZFS debug logNed Bass2017-07-254-95/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redefine the SET_ERROR macro in terms of __dprintf() so the error return codes get logged as both tracepoint events (if tracepoints are enabled) and as ZFS debug log entries. This also allows us to use the same definition of SET_ERROR() in kernel and user space. Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise or'd into zfs_flags. Setting this flag enables both dprintf() and SET_ERROR() messages in the debug log. That is, setting ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF|ZFS_DEBUG_SET_ERROR are equivalent (this was done for sake of simplicity). Leaving ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which helps avoid cluttering up the logs. To enable SET_ERROR() logging, run: echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable echo 512 > /sys/module/zfs/parameters/zfs_flags Remove the zfs_set_error_class tracepoints event class since SET_ERROR() now uses __dprintf(). This sacrifices a bit of granularity when selecting individual tracepoint events to enable but it makes the code simpler. Include file, function, and line number information in debug log entries. The information is now added to the message buffer in __dprintf() and as a result the zfs_dprintf_class tracepoints event class was changed from a 4 parameter interface to a single parameter. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #6400
* Add callback for zfs_multihost_intervalOlaf Faaland2017-07-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Add a callback to wake all running mmp threads when zfs_multihost_interval is changed. This is necessary when the interval is changed from a very large value to a significantly lower one, while pools are imported that have the multihost property enabled. Without this commit, the mmp thread does not wake up and detect the new interval until after it has waited the old multihost interval time. A user monitoring mmp writes via the provided kstat would be led to believe that the changed setting did not work. Added a test in the ZTS under mmp to verify the new functionality is working. Added a test to ztest which starts and stops mmp threads, and calls into the code to signal sleeping mmp threads, to test for deadlocks or similar locking issues. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #6387
* OpenZFS 8491 - uberblock on-disk padding to reserve space for smoothly ↵Serapheim Dimitropoulos2017-07-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | merging zpool checkpoint & MMP in ZFS The zpool checkpoint feature in DxOS added a new field in the uberblock. The Multi-Modifier Protection Pull Request from ZoL adds three new fields in the uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279). As these two changes come from two different sources and once upstreamed and deployed will introduce an incompatibility with each other we want to upstream a change that will reserve the padding for both of them so integration goes smoothly and everyone gets both features. Porting Notes: Preserved MMP comments in uberblock struct. Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Reviewed by: Olaf Faaland <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8491 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d84fa5f Closes #6390
* Linux 4.13 compat: bio->bi_status and blk_status_tBrian Behlendorf2017-07-231-1/+91
| | | | | | | | | Commit torvalds/linux@4e4cbee9. The bio->bi_error field was replaced with bio->bi_status which is an enum that describes all possible error types. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6351
* Multi-modifier protection (MMP)Olaf Faaland2017-07-1312-3/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add multihost=on|off pool property to control MMP. When enabled a new thread writes uberblocks to the last slot in each label, at a set frequency, to indicate to other hosts the pool is actively imported. These uberblocks are the last synced uberblock with an updated timestamp. Property defaults to off. During tryimport, find the "best" uberblock (newest txg and timestamp) repeatedly, checking for change in the found uberblock. Include the results of the activity test in the config returned by tryimport. These results are reported to user in "zpool import". Allow the user to control the period between MMP writes, and the duration of the activity test on import, via a new module parameter zfs_multihost_interval. The period is specified in milliseconds. The activity test duration is calculated from this value, and from the mmp_delay in the "best" uberblock found initially. Add a kstat interface to export statistics about Multiple Modifier Protection (MMP) updates. Include the last synced txg number, the timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV label that received the last MMP update, and the VDEV path. Abbreviated output below. $ cat /proc/spl/kstat/zfs/mypool/multihost 31 0 0x01 10 880 105092382393521 105144180101111 txg timestamp mmp_delay vdev_guid vdev_label vdev_path 20468 261337 250274925 68396651780 3 /dev/sda 20468 261339 252023374 6267402363293 1 /dev/sdc 20468 261340 252000858 6698080955233 1 /dev/sdx 20468 261341 251980635 783892869810 2 /dev/sdy 20468 261342 253385953 8923255792467 3 /dev/sdd 20468 261344 253336622 042125143176 0 /dev/sdab 20468 261345 253310522 1200778101278 2 /dev/sde 20468 261346 253286429 0950576198362 2 /dev/sdt 20468 261347 253261545 96209817917 3 /dev/sds 20468 261349 253238188 8555725937673 3 /dev/sdb Add a new tunable zfs_multihost_history to specify the number of MMP updates to store history for. By default it is set to zero meaning that no MMP statistics are stored. When using ztest to generate activity, for automated tests of the MMP function, some test functions interfere with the test. For example, the pool is exported to run zdb and then imported again. Add a new ztest function, "-M", to alter ztest behavior to prevent this. Add new tests to verify the new functionality. Tests provided by Giuseppe Di Natale. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #745 Closes #6279
* OpenZFS 6939 - add sysevents to zfs core for commandsDave Eddy2017-07-125-3/+51
| | | | | | | | | | | | | | | | | | | | Authored by: Dave Eddy <[email protected]> Reviewed by: Patrick Mooney <[email protected]> Reviewed by: Joshua M. Clulow <[email protected]> Reviewed by: Josh Wilsdon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Alan Somers <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Approved by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6939 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b Closes #6328
* Add port of FreeBSD 'volmode' propertyLOLi2017-07-122-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | The volmode property may be set to control the visibility of ZVOL block devices. This allow switching ZVOL between three modes: full - existing fully functional behaviour (default) dev - hide partitions on ZVOL block devices none - not exposing volumes outside ZFS Additionally the new zvol_volmode module parameter can be used to control the default behaviour. This functionality can be used, for instance, on "backup" pools to avoid cluttering /dev with unneeded zd* devices. Original-patch-by: mav <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: loli10K <[email protected]> Signed-off-by: loli10K <[email protected]> FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb Closes #1796 Closes #3438 Closes #6233
* OpenZFS 5428 - provide fts(), reallocarray(), and strtonum()Yuri Pankov2017-07-081-1/+1
| | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: * All hunks unrelated to ZFS were dropped. OpenZFS-issue: https://www.illumos.org/issues/5428 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130 Closes #6326
* OpenZFS 8067 - zdb should be able to dump literal embedded block pointerMatthew Ahrens2017-07-071-0/+1
| | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8067 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085 Closes #6319
* Implemented zpool scrub pause/resumeAlek P2017-07-065-6/+28
| | | | | | | | | | | | | | | | | | Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Thomas Caputi <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #6167
* OpenZFS 7600 - zfs rollback should pass target snapshot to kernelAndriy Gapon2017-07-042-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> The existing kernel-side code only provides a method to rollback to a latest snapshot, whatever it happens to be at the time when the rollback is actually done. That could be unsafe or confusing in environments where concurrent DSL changes are possible as the resulting state could correspond to a newer or older snapshot than the originally requested one. This change allows to amend that method such that the rollback is performed only when the latest snapshot has a specific name. That is, if a new snapshot is concurrently created or the target snapshot is destroyed, then no rollback is done and EXDEV error is returned. New libzfs_core function lzc_rollback_to() is provided for the new functionality. libzfs is changed to use lzc_rollback_to() to implement zfs rollback command. Perhaps we should return different errors to distinguish the case where the desired snapshot exists but it's not the latest snapshot and the case where the desired snapshot does not exist. OpenZFS-issue: https://www.illumos.org/issues/7600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb Closes #6292
* OpenZFS 8416 - abd.h is not C++ friendlyAndriy Gapon2017-06-301-1/+1
| | | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Alek Pinchuk <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8416 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/589c189 Closes #6288
* OpenZFS 8426 - mark immutable buffer arguments as such in abd.hAndriy Gapon2017-06-301-2/+2
| | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8426 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/37359a6 Closes #6287
* OpenZFS 8264 - want support for promoting datasets in libzfs_coreGiuseppe Di Natale2017-06-261-0/+2
| | | | | | | | | | | | | Authored by: Andrew Stormont <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8264 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a Closes #6254
* Dashes for zero latency values in zpool iostat -pTony Hutter2017-06-221-1/+11
| | | | | | | | | | | | This prints dashes instead of zeros for zero latency values in 'zpool iostat -p'. You'll get zero latencies reported when the disk is idle, but technically a zero latency is invalid, since you can't measure the latency of doing nothing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6210
* Inject zinject(8) a percentage amount of dev errsDon Brady2017-06-161-0/+5
| | | | | | | | | | | In the original form of device error injection, it was an all or nothing situation. To help simulate intermittent error conditions, you can now specify a real number percentage value. This is also very useful for our ZFS fault diagnosis testing and for injecting intermittent errors during load testing. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #6227
* Make zvol operations use _by_dnode routinesRichard Yao2017-06-131-0/+3
| | | | | | | | | | This continues what was started in 0eef1bde31d67091d3deed23fe2394f5a8bf2276 by fully converting zvols to avoid unnecessary dnode_hold() calls. This saves a small amount of CPU time and slightly improves latencies of operations on zvols. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #6058
* OpenZFS 8199 - multi-threaded dmu_object_alloc()Matthew Ahrens2017-06-091-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dmu_object_alloc() is single-threaded, so when multiple threads are creating files in a single filesystem, they spend a lot of time waiting for the os_obj_lock. To improve performance of multi-threaded file creation, we must make dmu_object_alloc() typically not grab any filesystem-wide locks. The solution is to have a "next object to allocate" for each CPU. Each of these "next object"s is in a different block of the dnode object, so that concurrent allocation holds dnodes in different dbufs. When a thread's "next object" reaches the end of a chunk of objects (by default 4 blocks worth -- 128 dnodes), it will be reset to the per-objset os_obj_next, which will be increased by a chunk of objects (128). Only when manipulating the os_obj_next will we need to grab the os_obj_lock. This decreases lock contention dramatically, because each thread only needs to grab the os_obj_lock briefly, once per 128 allocations. This results in a 70% performance improvement to multi-threaded object creation (where each thread is creating objects in its own directory), from 67,000/sec to 115,000/sec, with 8 CPUs. Work sponsored by Intel Corp. Authored by: Matthew Ahrens <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Matthew Ahrens <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8199 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374 Closes #4703 Closes #6117
* OpenZFS 7578 - Fix/improve some aspects of ZIL writingGiuseppe Di Natale2017-06-094-10/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. - Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z*_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. - While there, untangle and unify code of z*_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Sponsored by: iXsystems, Inc. Authored by: Alexander Motin <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Steven Hartland <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7578 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac Closes #6191
* OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffersMatthew Ahrens2017-06-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t. OpenZFS-issue: https://www.illumos.org/issues/8155 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58 Closes #6200
* Linux 4.12 compat: fix super_setup_bdi_name() callLOLi2017-05-251-4/+5
| | | | | | | | | | Provide a format parameter to super_setup_bdi_name() so we don't create duplicate names in '/devices/virtual/bdi' sysfs namespace which would prevent us from mounting more than one ZFS filesystem at a time. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6147
* Implemented zpool sync commandAlek P2017-05-193-0/+8
| | | | | | | | | | | This addition will enable us to sync an open TXG to the main pool on demand. The functionality is similar to 'sync(2)' but 'zpool sync' will return when data has hit the main storage instead of potentially just the ZIL as is the case with the 'sync(2)' cmd. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #6122
* Force fault a vdev with 'zpool offline -f'Tony Hutter2017-05-192-2/+4
| | | | | | | | | | | | | This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6094
* Fix large dnode send stream flag conflictBrian Behlendorf2017-05-181-1/+2
| | | | | | | | | | | | | | | | | | Bit 21 of the send stream flags was inadvertently used for two different features under concurrent development. To avoid any future compatibility problems the large dnode flag is being switched to bit 23 which is unused. The large dnode feature has only been present in pre-releases of ZoL and dnodesize defaults to legacy which is compatible with existing OpenZFS implementations. Users with dnodesize=auto needing to use zfs send/recv must update ZoL on both the source and destination systems. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6139
* Skip spurious resilver IO on raidz vdevIsaac Huang2017-05-122-0/+3
| | | | | | | | | | | | | | | | On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data. This patch eliminates such spurious IOs. Reviewed-by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Isaac Huang <[email protected]> Closes #5316
* OpenZFS 8063 - verify that we do not attempt to access inactive txgMatthew Ahrens2017-05-102-3/+15
| | | | | | | | | | | | | | | | | | | | | Authored by: Matthew Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: George Melikov <[email protected]> A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Porting Notes: - ASSERTV added to txg_sync_waiting() for unused variable. OpenZFS-issue: https://www.illumos.org/issues/8063 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46 Closes #6109
* Linux 4.12 compat: CURRENT_TIME removedBrian Behlendorf2017-05-101-0/+11
| | | | | | | | Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6114
* Add property overriding (-o|-x) to 'zfs receive'LOLi2017-05-091-0/+3
| | | | | | | | | | | | | | | | | | | | This allows users to specify "-o property=value" to override and "-x property" to exclude properties when receiving a zfs send stream. Both native and user properties can be specified. This is useful when using zfs send/receive for periodic backup/replication because it lets users change properties such as canmount, mountpoint, or compression without modifying the source. References: https://www.illumos.org/issues/2745 https://www.illumos.org/issues/3753 Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #1350 Closes #5349
* Make createtxg and guid properties publicChristian Schwarz2017-05-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Document the existence of `createtxg` and `guid` native properties in man pages and zfs command output. One of the great features of ZFS is incremental replication of snapshots, possibly between pools on different machines. Shell scripts are commonly used to auomate this procedure. They have to find the most recent common snapshot between both sides and then perform incremental send & recv. Currently, scripts rely on the sorting order of `zfs list`, which defaults to `createtxg`, and the assumption that snapshot names on either side do not change. By making `createtxg` and `guid` part of the public ZFS interface, scripts are enabled to use a) `createtxg` to determine the logical & temporal order of snapshots (the creation property is not an equivalent substitute since multiple snapshots may be created within one second) b) `guid` to uniquely identify a snapshot, independent of its current display name This has the potential of making scripts safer and correct. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: DHE <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #6102
* Linux 4.12 compat: PF_FSTRANS was removedChunwei Chen2017-05-091-1/+1
| | | | | | | | zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related check, so we use them accordingly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #6113
* Add missing *_destroy/*_fini callsGvozden Neskovic2017-05-041-0/+1
| | | | | | | | | The proposed debugging enhancements in zfsonlinux/spl#587 identified the following missing *_destroy/*_fini calls. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Closes #5428
* Enable Linux read-ahead for a single page on ZVOLsRichard Yao2017-05-041-0/+11
| | | | | | | | | | | | | | | | | | | | Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #5902
* More ashift improvementsLOLi2017-05-031-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit allow higher ashift values (up to 16) in 'zpool create' The ashift value was previously limited to 13 (8K block) in b41c990 because the limited number of uberblocks we could fit in the statically sized (128K) vdev label ring buffer could prevent the ability the safely roll back a pool to recover it. Since b02fe35 the largest uberblock size we support is 8K: this allow us to store a minimum number of 16 uberblocks in the vdev label, even with higher ashift values. Additionally change 'ashift' pool property behaviour: if set it will be used as the default hint value in subsequent vdev operations ('zpool add', 'attach' and 'replace'). A custom ashift value can still be specified from the command line, if desired. Finally, fix a bug in add-o_ashift.ksh caused by a missing variable. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #2024 Closes #4205 Closes #4740 Closes #5763
* Write label 2,3 uberblocks when vdev expandsOlaf Faaland2017-05-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | When vdev_psize increases, the location of labels 2 and 3 changes because their location is relative to the end of the device. The configs for labels 2 and 3 are written during the next spa_sync() because the vdev is added to the dirty config list. However, the uberblock rings are not re-written in their new location, leaving the device vulnerable to the beginning of the device being overwritten or damaged. This patch copies the uberblock ring from label 0 to labels 2 and 3, in their new locations, at the next sync after vdev_psize increases. Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks are copied. Reviewed-by: BearBabyLiu <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #5108
* Add zfs_nicebytes() to print human-readable sizesLOLi2017-05-021-2/+4
| | | | | | | | | | | | | | | | * Add zfs_nicebytes() to print human-readable sizes Some 'zfs', 'zpool' and 'zdb' output strings can be confusing to the user when no units are specified. This add a new zfs_nicenum_format "ZFS_NICENUM_BYTES" used to print bytes in their human-readable form. Additionally, update some test cases to use machine-parsable 'zfs get'. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #2414 Closes #3185 Closes #3594 Closes #6032
* Linux 4.12 compat: super_setup_bdi_name()Brian Behlendorf2017-05-022-10/+78
| | | | | | | | | | All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6089
* Reinstate zvol_taskq to fix aio on zvolChunwei Chen2017-04-261-2/+9
| | | | | | | | | | | | | | | | | | | | | | | Commit 37f9dac removed the zvol_taskq for processing zvol requests. This was removed as part of switching to make_request_fn and was motivated by a concern at the time over dispatch latency. However, this also made all bio request synchronous, and caused serious performance issues as the bio submitter would wait for every bio it submitted, effectively making the IO depth 1. This patch reinstate zvol_taskq, and to make sure overlapped I/Os are ordered properly, we take range lock in zvol_request, and pass it along with bio to the I/O functions zvol_{write,discard,read}. In order to facilitate benchmarks a zvol_request_sync module option was added to switch between sync and async request handling. For the moment, the default behavior is synchronous but this is likely to change pending additional testing. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5824
* Prebaked scripts for zpool status/iostat -cTony Hutter2017-04-211-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates the "zpool status/iostat -c" commands to only run "pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever you install to). The scripts can only be run from -c as an unprivileged user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is set by root). This was done to encourage scripts to be written is such a way that normal users can use them, and to be cautious. If your script needs to run a privileged command, consider adding the appropriate line in /etc/sudoers. See zpool(8) for an example of how to do this. The patch also allows the scripts to output custom column names. If the script outputs a line like: name=value then "name" is used for the column name, and "value" is its value. Multiple columns can be specified by outputting multiple lines. Column names and values can have spaces. If the value is empty, a dash (-) is printed instead. After all the "name=value" lines are read (if any), zpool will take the next the next line of output (if any) and print it without a column header. After that, no more lines will be processed. This can be useful for printing errors. Lastly, this patch also disables the -c option with the latency and request size histograms, since it produced awkward output and made the code harder to maintain. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #5852
* OpenZFS 5120 - zfs should allow large block/gzip/raidz boot pool (loader ↵Brian Behlendorf2017-04-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | project) Authored by: Toomas Soome <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Don Brady <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - grub-2.02-beta2-422-gcad5cc0 includes support for large blocks. - Commit 8aab121 allowed GZIP[1-9]. - Grub allows pools with multiple top-level vdevs. OpenZFS-issue: https://www.illumos.org/issues/5120 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd Closes #6007
* OpenZFS 6865 - want zfs-tests cases for zpool labelclear commandYuri Pankov2017-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: John Kennedy <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: loli10K <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - Updated 'zpool labelclear' and 'zdb -l' such that they attempt to find a vdev given solely its short name. This behavior is consistent with the upstream OpenZFS code and the test cases depend on it. The actual implementation differs slightly due to device naming conventions on Linux. - auto_online_001_pos, auto_replace_001_pos and add-o_ashift test cases updated to expect failure when no label exists. - read_efi_label() and zpool_label_disk_check() are read-only operations and should use O_RDONLY at open time to enforce this. - zpool_label_disk() and zpool_relabel_disk() write the partition information using O_DIRECT an fsync() and page cache invalidation to ensure a consistent view of the device. - dump_label() in zdb should invalidate the page cache in order to get the authoritative label from disk. OpenZFS-issue: https://www.illumos.org/issues/6865 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95076c Closes #5981
* OpenZFS 2932 - support crash dumps to raidz, etc. poolsGiuseppe Di Natale2017-04-101-0/+2
| | | | | | | | | | | | | | | Authored by: Bill Pijewski <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/2932 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b Closes #5984 Closes #5216
* OpenZFS 8023 - Panic destroying a metaslab deferred range treeGeorge Wilson2017-04-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Authored by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> We don't want to dirty any data when we're in the final txgs of the pool export logic. This change introduces checks to make sure that no data is dirtied after a certain point. It also addresses the culprit of this specific bug – the space map cannot be upgraded when we're in final stages of pool export. If we encounter a space map that wants to be upgraded in this phase, then we simply ignore the request as it will get retried the next time we set the fragmentation metric on that metaslab. OpenZFS-issue: https://www.illumos.org/issues/8023 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ef00f5 Closes #5991
* OpenZFS 7404 - rootpool_007_neg, bootfs_006_pos and bootfs_008_neg tests ↵Toomas Soome2017-04-071-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | fail with the loader project bits Authored by: Toomas Soome <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: Marcel Telka <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Richard Lowe <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - Removed gzip and zle compression restriction on bootfs datasets. Grub added support for these long ago. Ay version of grub which understands lz4 also supports this. - Enabled rootpool tests in runfile but skipped by default in setup on Linux since they modify the rootpool. - bootfs_006_pos.ksh, striped pools are allowed as bootfs. OpenZFS-issue: https://www.illumos.org/issues/7404 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/55a424c Closes #5982