path: root/module/zfs/spa.c
Commit message (Author, Age, Files, Lines)
* Fix import wrong spare/l2 device when path change (Chunwei Chen, 2017-06-01, 1 file, -6/+0)
  If, for example, your aux device was /dev/sdc, but the aux device has since been removed and /dev/sdc now points to another device, zpool import will still use that device and corrupt it.

  The problem is that spa_validate_aux in spa_import, rather than validating the on-disk label, actually writes a label to disk. We remove these calls, since spa_load_{spares,l2cache} seems to do everything we need and actually validates the on-disk label.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #6158
* OpenZFS 8063 - verify that we do not attempt to access inactive txg (Matthew Ahrens, 2017-05-10, 1 file, -2/+2)
  Authored by: Matthew Ahrens <[email protected]>
  Reviewed by: Serapheim Dimitropoulos <[email protected]>
  Reviewed by: Pavel Zakharov <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: George Melikov <[email protected]>

  A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXGs (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXGs.

  Porting Notes:
  - ASSERTV added to txg_sync_waiting() for unused variable.

  OpenZFS-issue: https://www.illumos.org/issues/8063
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46
  Closes #6109
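The per-txg convention described in this commit can be illustrated with a small sketch. It assumes the usual ZFS layout in which per-txg state lives in a 4-slot array indexed by `txg & TXG_MASK`. The type and function names (`txg_window_t`, `txg_is_active`, `per_txg_slot`) are hypothetical and the activity check is a simplification, not the assertion added by the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TXG_CONCURRENT_STATES	3	/* open, quiescing, syncing */
#define	TXG_SIZE		4	/* slots in a per-txg array */
#define	TXG_MASK		(TXG_SIZE - 1)

/* Hypothetical, simplified view of the pool's txg window. */
typedef struct txg_window {
	uint64_t open_txg;	/* txg currently accepting changes */
	uint64_t synced_txg;	/* last txg fully written to disk */
} txg_window_t;

/* True if txg is one of the (at most 3) active txgs. */
static bool
txg_is_active(const txg_window_t *w, uint64_t txg)
{
	return (txg <= w->open_txg && txg > w->synced_txg &&
	    txg + TXG_CONCURRENT_STATES > w->open_txg);
}

/* Return the per-txg slot, asserting that the txg is active. */
static uint64_t *
per_txg_slot(const txg_window_t *w, uint64_t state[TXG_SIZE], uint64_t txg)
{
	assert(txg_is_active(w, txg));
	return (&state[txg & TXG_MASK]);
}
```

Because only three txgs can be active at once, a 4-slot array never has two active txgs sharing a slot; the assertion catches a caller that reaches for a slot belonging to a txg outside that window.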
* OpenZFS 5120 - zfs should allow large block/gzip/raidz boot pool (loader project) (Brian Behlendorf, 2017-04-13, 1 file, -6/+2)
  Authored by: Toomas Soome <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Yuri Pankov <[email protected]>
  Reviewed by: Andrew Stormont <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>

  Porting Notes:
  - grub-2.02-beta2-422-gcad5cc0 includes support for large blocks.
  - Commit 8aab121 allowed GZIP[1-9].
  - Grub allows pools with multiple top-level vdevs.

  OpenZFS-issue: https://www.illumos.org/issues/5120
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd
  Closes #6007
* OpenZFS 3821 - Race in rollback, zil close, and zil flush (George Wilson, 2017-03-23, 1 file, -2/+7)
  Authored by: George Wilson <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Pavel Zakharov <[email protected]>
  Reviewed by: Andriy Gapon <[email protected]>
  Approved by: Richard Lowe <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: George Melikov <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/3821
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/43297f9
  Closes #5905
* Fix incorrect spare vdev state after replacing (Isaac Huang, 2017-02-23, 1 file, -0/+5)
  After a hot spare replaces an OFFLINE vdev, the new parent spare vdev state is set incorrectly to OFFLINE. The correct state should be DEGRADED. The incorrect OFFLINE state will prevent the top-level vdev from reading the spare vdev, thus causing unnecessary reconstruction.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Isaac Huang <[email protected]>
  Closes #5766
  Closes #5770
* OpenZFS 7136 - ESC_VDEV_REMOVE_AUX ought to always include vdev information (George Melikov, 2017-01-31, 1 file, -3/+6)
  Authored by: Alan Somers <[email protected]>
  7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Josef 'Jeff' Sipek <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: George Melikov <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/7136
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b72b6bb
  Closes #5691

  Porting notes:
  - Functionally this patch behaves the same as the OpenZFS version, but it was adapted because ZoL doesn't have the same illumos sysevent_t infrastructure and functionality.
* OpenZFS 6922 - Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device (George Melikov, 2017-01-30, 1 file, -1/+1)
  Authored by: Alan Somers <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: George Melikov <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/6922
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/63364b0
  Closes #5690
* OpenZFS 7743 - per-vdev-zaps init path for upgrade (Don Brady, 2017-01-13, 1 file, -3/+8)
  Authored by: Paul Dagnelie <[email protected]>
  Reviewed by: Matt Ahrens <[email protected]>
  Reviewed by: Pavel Zakharov <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Joe Stein <[email protected]>
  Ported-by: Don Brady <[email protected]>

  When loading a pool that had been created before the existence of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized.

  This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty.

  The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs.

  OpenZFS-issue: https://www.illumos.org/issues/7743
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0
  Closes #5582
* OpenZFS 7303 - dynamic metaslab selection (Don Brady, 2017-01-12, 1 file, -1/+14)
  This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based).

  Porting Notes: The metaslab allocation tracing code is conditionally removed on linux (dependent on mdb debugger).

  Authored by: George Wilson <[email protected]>
  Reviewed by: Alex Reece <[email protected]>
  Reviewed by: Chris Siden <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Paul Dagnelie <[email protected]>
  Reviewed by: Pavel Zakharov <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported-by: Don Brady <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/7303
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd
  Closes #5404
* Fix spelling (ka7, 2017-01-03, 1 file, -3/+3)
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Haakan T Johansson <[email protected]>
  Closes #5547
  Closes #5543
* Use cstyle -cpP in `make cstyle` check (Brian Behlendorf, 2016-12-12, 1 file, -0/+1)
  Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes.

  Non-whitespace changes are as follows:
  * 8 times ; to { } in for/while loop
  * fix missing ; in cmd/zed/agents/zfs_diagnosis.c
  * comment (confim -> confirm)
  * change endline , to ; in cmd/zpool/zpool_main.c
  * a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
  * /* CSTYLED */ markers
  * change == 0 to !
  * ulong to unsigned long in module/zfs/dsl_scan.c
  * rearrangement of module_param lines in module/zfs/metaslab.c
  * add { } block around statement after for_each_online_node

  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Håkan Johansson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5465
* Speed up zvol import and export speed (Brian Behlendorf, 2016-12-08, 1 file, -4/+4)
|\
| | Speed up import and export speed by:
| | * Add system delay taskq
| | * Parallel prefetch zvol dnodes during zvol_create_minors
| | * Parallel zvol_free during zvol_remove_minors
| | * Reduce list linear search using ida and hash
| |
| | Reviewed-by: Boris Protopopov <[email protected]>
| | Reviewed-by: Brian Behlendorf <[email protected]>
| | Signed-off-by: Chunwei Chen <[email protected]>
| | Closes #5433
| * Use system_delay_taskq for long delay tasks (Chunwei Chen, 2016-12-01, 1 file, -4/+4)
| | Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire. This frees system_taskq from the above long delay tasks, and allows us to do taskq_wait_outstanding on system_taskq without being blocked forever, making system_taskq more generic and useful.
| |
| | Signed-off-by: Chunwei Chen <[email protected]>
* | DLPX-44812 integrate EP-220 large memory scalability (David Quigley, 2016-11-29, 1 file, -5/+3)
|/
* Fix coverity defects: CID 147540, 147542 (cao, 2016-11-09, 1 file, -2/+1)
  CID 147540: unsigned_compare
  - Cast nsec to an int32_t to properly detect the expected overflow.

  CID 147542: unsigned_compare
  - intval can never be less than ZIO_FAILURE_MODE_WAIT, which is defined to be zero. Remove this useless check.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: cao.xuewen <[email protected]>
  Closes #5379
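The unsigned_compare defect class fixed here can be shown with a minimal sketch. The helper name `nsec_wrapped` is hypothetical and the snippet is not the code from spa.c; it only illustrates why a `< 0` test on an unsigned value needs a signed reinterpretation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The reinterpretation below relies on both types having the same width. */
static_assert(sizeof (int32_t) == sizeof (uint32_t), "width mismatch");

/*
 * Hypothetical helper: report whether an unsigned 32-bit difference
 * wrapped past zero. Writing (nsec < 0) would always be false because
 * an unsigned value is never negative, so the value is viewed as signed.
 * The conversion is implementation-defined in C; mainstream compilers
 * use two's complement wraparound.
 */
static bool
nsec_wrapped(uint32_t nsec)
{
	return ((int32_t)nsec < 0);
}
```

The same reasoning explains the second fix: comparing an unsigned value against a lower bound of zero can never be true, so the check is dead code.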
* Fix coverity defects: CID 147488, 147490 (cao, 2016-10-14, 1 file, -1/+1)
  CID 147488, Type: explicit null dereferenced
  CID 147490, Type: dereference null return value

  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: cao.xuewen <[email protected]>
  Closes #5237
* OpenZFS 7090 - zfs should throttle allocations (Don Brady, 2016-10-13, 1 file, -2/+47)
  Authored by: George Wilson <[email protected]>
  Reviewed by: Alex Reece <[email protected]>
  Reviewed by: Christopher Siden <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Paul Dagnelie <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Sebastien Roy <[email protected]>
  Approved by: Matthew Ahrens <[email protected]>
  Ported-by: Don Brady <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>

  When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage, which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible.

  In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced, as fuller devices will tend to be slower than empty devices.

  The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs' metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied.

  OpenZFS-issue: https://www.illumos.org/issues/7090
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7
  Closes #5258

  Porting Notes:
  - Maintained minimal stack in zio_done
  - Preserve linux-specific io sizes in zio_write_compress
  - Added module params and documentation
  - Updated to use optimized AVL cmp macros
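The reservation idea in this commit message can be sketched as a simple counter capped at top-levels * queue depth. Everything below (`alloc_throttle_t`, `throttle_reserve`, `throttle_unreserve`) is a hypothetical, single-threaded illustration; the real allocator tracks reservations per metaslab class and group and needs atomic updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pool-wide allocation throttle (not the ZFS structures). */
typedef struct alloc_throttle {
	uint64_t max_slots;	/* top-level vdevs * per-vdev queue depth */
	uint64_t reserved;	/* allocations currently reserved */
} alloc_throttle_t;

static void
throttle_init(alloc_throttle_t *t, uint64_t top_levels, uint64_t queue_depth)
{
	t->max_slots = top_levels * queue_depth;
	t->reserved = 0;
}

/* Try to reserve slots; the caller keeps the I/O queued when this fails. */
static bool
throttle_reserve(alloc_throttle_t *t, uint64_t slots)
{
	if (t->reserved + slots > t->max_slots)
		return (false);
	t->reserved += slots;
	return (true);
}

/* Release slots once an allocation completes, making room for queued work. */
static void
throttle_unreserve(alloc_throttle_t *t, uint64_t slots)
{
	assert(t->reserved >= slots);
	t->reserved -= slots;
}
```

A failed reservation is what keeps an I/O waiting in the pool-wide queue, and each release is the point at which a top-level vdev can pull more work from that queue.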
* Fix coverity defects: CID 147571, 147574 (luozhengzheng, 2016-10-13, 1 file, -1/+1)
  CID 147571: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)
  CID 147574: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: luozhengzheng <[email protected]>
  Closes #5268
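An OVERFLOW_BEFORE_WIDEN defect arises when arithmetic is done in a narrow type and only the result is widened. The sketch below uses hypothetical function names and is not the expression that was changed in spa.c; it assumes `uint32_t` is `unsigned int`, as on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Defective pattern: the shift is evaluated in 32 bits, then widened. */
static uint64_t
size_narrow(uint32_t count, unsigned shift)
{
	assert(shift < 32);
	return (count << shift);
}

/* Fixed pattern: widen first so the shift is evaluated in 64 bits. */
static uint64_t
size_widened(uint32_t count, unsigned shift)
{
	assert(shift < 64);
	return ((uint64_t)count << shift);
}
```

With `count = 0x80000000` and `shift = 1`, the narrow version loses the top bit before widening, while the widened version keeps it.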
* Write issue taskq shouldn't be dynamic (Tim Chase, 2016-10-10, 1 file, -1/+2)
  This is as much an upstream compatibility fix as it is a bit of a performance gain.

  The illumos taskq implementation doesn't allow a TASKQ_THREADS_CPU_PCT type to be dynamic and in fact enforces as much with an ASSERT.

  As to performance, if this taskq is dynamic, it can cause excessive contention on tq_lock as the threads are created and destroyed because it can see bursts of many thousands of tasks in a short time, particularly in heavy high-concurrency zvol write workloads.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #5236
* Add support for user/group dnode accounting & quota (Jinshan Xiong, 2016-10-07, 1 file, -0/+12)
  This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting). A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnodes as dirty.

  ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500

  Signed-off-by: Jinshan Xiong <[email protected]>
  Signed-off-by: Johann Lombardi <[email protected]>
* OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (Tony Hutter, 2016-10-03, 1 file, -0/+34)
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Saso Kiselkov <[email protected]>
  Reviewed by: Richard Lowe <[email protected]>
  Approved by: Garrett D'Amore <[email protected]>
  Ported by: Tony Hutter <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/4185
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

  Porting Notes:
  This code is ported on top of the Illumos Crypto Framework code:
  https://github.com/zfsonlinux/zfs/pull/4329/commits/b5e030c8dbb9cd393d313571dee4756fbba8c22d

  The list of porting changes includes:
  - Copied module/icp/include/sha2/sha2.h directly from illumos
  - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init)
  - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter.
  - Added CTASSERT() to assert.h for module/zfs/edonr_zfs.c
  - Added skein & edonr to libicp/Makefile.am
  - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos.
  - Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.
  - In icp/algs/edonr/edonr_byteorder.h, removed the #if defined(__linux) section to not #include the non-existent endian.h.
  - In skein_test.c, rename NULL to 0 in "no test vector" array entries to get around a compiler warning.
  - Fixup test files:
    - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>
    - Remove <note.h> and define NOTE() as NOP.
    - Define u_longlong_t
    - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
    - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning.
    - Remove "for isa in $($ISAINFO); do" stuff
    - Add/update Makefiles
    - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h.
  - EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.
  - Update scripts/zfs2zol-patch.sed
  - include <sys/sha2.h> in sha2_impl.h
  - Add sha2.h to include/sys/Makefile.am
  - Add skein and edonr dirs to icp Makefile
  - Add new checksums to zpool_get.cfg
  - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable()
  - Fix -Wuninitialized error in edonr_byteorder.h on PPC
  - Fix stack frame size errors on ARM32
    - Don't unroll loops in Skein on 32-bit to save stack space
    - Add memory barriers in sha2.c on 32-bit to save stack space
  - Add filetest_001_pos.ksh checksum sanity test
  - Add option to write pseudorandom data in file_write utility
* Fix coverity defects (luozhengzheng, 2016-09-21, 1 file, -1/+1)
  coverity scan CID:147633, type: sizeof not portable
  coverity scan CID:147637, type: sizeof not portable
  coverity scan CID:147638, type: sizeof not portable
  coverity scan CID:147640, type: sizeof not portable

  In these particular cases sizeof (X **) happens to be equal to sizeof (X *), but this is not a portable assumption.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: luozhengzheng <[email protected]>
  Closes #5144
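One common way to avoid the "sizeof not portable" class of defect is to size an allocation from the pointee rather than from a spelled-out type. The sketch below is illustrative only; `vdev_t` is an opaque stand-in and `alloc_vdev_array` is a hypothetical helper, not a function in spa.c.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct vdev vdev_t;	/* opaque stand-in for illustration */

/*
 * Allocate an array of n vdev pointers. The element type of arr is
 * (vdev_t *), so sizeof (*arr) is always the right element size.
 * Spelling it as sizeof (vdev_t **) would name the wrong type and
 * only work where the two pointer types happen to have equal size.
 */
static vdev_t **
alloc_vdev_array(size_t n)
{
	vdev_t **arr;

	assert(n > 0);
	arr = calloc(n, sizeof (*arr));
	return (arr);
}
```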
* Refactor spa_load_l2cache to make build happy (Nikolay Borisov, 2016-09-12, 1 file, -29/+28)
  If sav->sav_config was NULL, the function would skip the iteration over the l2cache devices and just clean up the old devices. However, this wasn't very obvious since the null check was performed after the loop body and after the old devices were cleaned up. Refactor the code so that it's now obvious when the iteration over the l2cache devices is skipped. This fixes the following cppcheck warning:

  [module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Nikolay Borisov <[email protected]>
  Closes #5087
* Performance optimization of AVL tree comparator functions (Gvozden Neskovic, 2016-08-31, 1 file, -9/+4)
  perf: 2.75x faster ddt_entry_compare()
    The first 256 bits of ddt_key_t are a block checksum, which is expected to be close to random data. Hence, on average, the comparison only needs to look at the first few bytes of the keys. To reduce the number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    The sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1}, which is computed efficiently. Synthetic performance evaluation of the original and new algorithm over 1G random keys on a 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3:

      old 6.85789 s
      new 2.49089 s

  perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
    Compute the result directly instead of using conditionals.

  perf: zfs_range_compare()
    Speedup between 1.1x - 2.5x, depending on compiler version and optimization level.

  perf: spa_error_entry_compare()
    `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

  perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
  perf: 2.8x faster zil_bp_compare()
  perf: 2.8x faster mze_compare()
  perf: faster dbuf_compare()
  perf: faster compares in spa_misc
  perf: 2.8x faster layout_hash_compare()
  perf: 2.8x faster space_reftree_compare()
  perf: libzfs: faster avl tree comparators
  perf: guid_compare()
  perf: dsl_deadlist_compare()
  perf: perm_set_compare()
  perf: 2x faster range_tree_seg_compare()
  perf: faster unique_compare()
  perf: faster vdev_cache _compare()
  perf: faster vdev_uberblock_compare()
  perf: faster fuid _compare()
  perf: faster zfs_znode_hold_compare()

  Signed-off-by: Gvozden Neskovic <[email protected]>
  Signed-off-by: Richard Elling <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5033
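The sign trick from this commit message can be written out as a short sketch. The names `sign_of`, `key256_t`, `key256_compare`, and `u64_compare` are hypothetical stand-ins chosen for illustration; they are not the macros or comparators the commit actually touches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sign of an integer without conditional jumps: -1, 0, or 1. */
static int
sign_of(int a)
{
	return ((0 < a) - (a < 0));
}

/* Hypothetical 256-bit key, standing in for a checksum-like key. */
typedef struct key256 {
	uint8_t bytes[32];
} key256_t;

/* AVL-style comparator: normalizes memcmp()'s result to -1, 0, or 1. */
static int
key256_compare(const void *a, const void *b)
{
	assert(a != NULL && b != NULL);
	return (sign_of(memcmp(a, b, sizeof (key256_t))));
}

/* Direct comparison of two integers, again without conditionals. */
static int
u64_compare(uint64_t a, uint64_t b)
{
	return ((a > b) - (a < b));
}
```

`memcmp()` only promises a negative, zero, or positive result, so the normalization step is what makes it usable where a comparator must return exactly -1, 0, or 1.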
* Fix "zpool get guid,freeing,leaked" source (Hajo Möller, 2016-08-30, 1 file, -6/+8)
  `zpool get guid,freeing,leaked` shows SOURCE as `default`; it should be `-`, as those props are not editable.

  Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes.

  Signed-off-by: Hajo Möller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4170
* Fix spa config generate memory leak in spa_load_best function (heary-cao, 2016-08-19, 1 file, -0/+2)
  When a retried spa load succeeds and recovery is requested, the generated config may be leaked in the spa_load_best function. Always free the generated config when it is not assigned to the spa.

  Signed-off-by: cao.xuewen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4940
* OpenZFS 5997 - FRU field not set during pool creation and never updated (Hans Rosenfeld, 2016-08-12, 1 file, -16/+27)
  Authored by: Hans Rosenfeld <[email protected]>
  Reviewed by: Dan Fields <[email protected]>
  Reviewed by: Josef Sipek <[email protected]>
  Reviewed by: Richard Elling <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/5997
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283

  Porting Notes:

  In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS.

  Events which would be logged as sysevents on illumos have been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass.

  The scripts provided by ZoL have been updated; however, users who provide scripts for any of the following events will need to rename them based on the new subclass names.

  ereport.fs.zfs.config.sync          sysevent.fs.zfs.config_sync
  ereport.fs.zfs.zpool.destroy        sysevent.fs.zfs.pool_destroy
  ereport.fs.zfs.zpool.reguid         sysevent.fs.zfs.pool_reguid
  ereport.fs.zfs.vdev.remove          sysevent.fs.zfs.vdev_remove
  ereport.fs.zfs.vdev.clear           sysevent.fs.zfs.vdev_clear
  ereport.fs.zfs.vdev.check           sysevent.fs.zfs.vdev_check
  ereport.fs.zfs.vdev.spare           sysevent.fs.zfs.vdev_spare
  ereport.fs.zfs.vdev.autoexpand      sysevent.fs.zfs.vdev_autoexpand
  ereport.fs.zfs.resilver.start       sysevent.fs.zfs.resilver_start
  ereport.fs.zfs.resilver.finish      sysevent.fs.zfs.resilver_finish
  ereport.fs.zfs.scrub.start          sysevent.fs.zfs.scrub_start
  ereport.fs.zfs.scrub.finish         sysevent.fs.zfs.scrub_finish
  ereport.fs.zfs.bootfs.vdev.attach   sysevent.fs.zfs.bootfs_vdev_attach
* Remove custom root pool import code (Brian Behlendorf, 2016-08-11, 1 file, -205/+0)
  Non-Linux OpenZFS implementations require additional support to be used as a root pool. This code should simply be removed to avoid confusion and improve readability.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #4951
* Fix sync behavior for disk vdevs (Tim Chase, 2016-07-25, 1 file, -3/+2)
  Prior to b39c22b, which first became generally available in the 0.6.5 release, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used.

  In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

  The original rationale for introducing synchronous operations in b39c22b was to hurry certain requests through the BIO layer which would have otherwise been subject to its unplug timer, which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

  To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro is available and is used for the same purpose.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4858
* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name (Paul Dagnelie, 2016-06-28, 1 file, -0/+18)
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Yuri Pankov <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>

  Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes.

  OpenZFS-issue: https://www.illumos.org/issues/6876
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e
* OpenZFS 6314 - buffer overflow in dsl_dataset_name (Igor Kozhukhov, 2016-06-28, 1 file, -4/+2)
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Igor Kozhukhov <[email protected]>
  Approved by: Dan McDonald <[email protected]>
  Ported-by: Brian Behlendorf <[email protected]>

  OpenZFS-issue: https://www.illumos.org/issues/6314
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee
* Implement large_dnode pool feature (Ned Bass, 2016-06-24, 1 file, -1/+16)
  Justification
  -------------
  This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided.

  Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant.

  ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill blocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems.

  Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area.

  Implementation
  --------------
  The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software.

  Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t.

  The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run

    # zfs set dnodesize=auto tank/fish

  The user can also specify literal values for the dnodesize property.
  These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value.

  The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interface names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size.

  New DMU interfaces:
    dmu_object_alloc_dnsize()
    dmu_object_claim_dnsize()
    dmu_object_reclaim_dnsize()

  New ZAP interfaces:
    zap_create_dnsize()
    zap_create_norm_dnsize()
    zap_create_flags_dnsize()
    zap_create_claim_norm_dnsize()
    zap_create_link_dnsize()

  The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool.

  These are a few noteworthy changes to key functions:

  * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE.

    If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. 
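The slot accounting and the slot-based block traversal described under Implementation can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the ZFS code: struct demo_slot and every demo_* helper are hypothetical stand-ins for dnode_phys_t and the real traversal functions. Only the 512-byte slot size, the 16384-byte dnode block, and the "total slots = extra slots + 1" convention are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SIZE	512u	/* bytes per dnode_phys_t slot (from the text) */
#define BLOCK_SIZE	16384u	/* bytes per dnode block (from the text) */
#define SLOTS_PER_BLOCK	(BLOCK_SIZE / SLOT_SIZE)

/* Hypothetical, simplified stand-in for one on-disk slot. */
struct demo_slot {
	uint8_t in_use;		/* nonzero if a dnode starts in this slot */
	uint8_t extra_slots;	/* plays the role of dn_extra_slots */
};

/* In-memory convention: total slots = extra slots + 1. */
static unsigned
demo_num_slots(const struct demo_slot *s)
{
	return ((unsigned)s->extra_slots + 1);
}

/* Size in bytes of the dnode that starts at this slot. */
static unsigned
demo_dnode_bytes(const struct demo_slot *s)
{
	return (demo_num_slots(s) * SLOT_SIZE);
}

/*
 * Count the dnodes that start in one block. Instead of visiting every
 * array index, step over the interior slots of each large dnode so its
 * bonus area is never interpreted as a separate dnode.
 */
static unsigned
demo_count_dnodes(const struct demo_slot *blk)
{
	unsigned count = 0;
	unsigned i = 0;

	while (i < SLOTS_PER_BLOCK) {
		if (!blk[i].in_use) {
			i++;	/* a hole: advance by a single slot */
			continue;
		}
		/* A dnode must not span past the end of its block. */
		assert(i + demo_num_slots(&blk[i]) <= SLOTS_PER_BLOCK);
		count++;
		i += demo_num_slots(&blk[i]);
	}
	return (count);
}
```

With a 4-slot dnode starting at slot 1, slots 2 through 4 are never inspected, whatever bytes they happen to contain. That is the property the ztest fill patterns described above are meant to exercise.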
  Send/Receive
  ------------
  ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully.

  While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream.

  For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure.

  ZIL Replay
  ----------
  The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits.

  Resizing Dnodes
  ---------------
  It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature.
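The bit layout from the ZIL Replay note above (slot count in the uppermost 8 bits of a 64-bit lr_foid, object id in the low 48 bits) can be illustrated with a small packing sketch. The demo_* names and the DEMO_* constants are hypothetical and chosen for this example; they are not the macros ZFS itself uses.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_OBJ_BITS	48	/* object ids are capped at 48 bits (from the text) */
#define DEMO_SLOT_SHIFT	56	/* uppermost 8 bits of a 64-bit field */
#define DEMO_OBJ_MASK	((UINT64_C(1) << DEMO_OBJ_BITS) - 1)

/* Combine an object id and a slot count into one 64-bit value. */
static uint64_t
demo_foid_pack(uint64_t obj, uint8_t slots)
{
	/* The object id must fit in the low 48 bits. */
	assert((obj & ~DEMO_OBJ_MASK) == 0);
	return (obj | ((uint64_t)slots << DEMO_SLOT_SHIFT));
}

/* Recover the object id (low 48 bits). */
static uint64_t
demo_foid_obj(uint64_t foid)
{
	return (foid & DEMO_OBJ_MASK);
}

/* Recover the slot count (top 8 bits). */
static uint8_t
demo_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> DEMO_SLOT_SHIFT));
}
```

A record written before this change has zeros in the upper bits, so demo_foid_slots() returns 0 for it. How a zero slot count is interpreted on replay is not stated in this section, so the sketch leaves that to the caller.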
  Feature Reference Counting
  --------------------------
  The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3542
* Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues  Tony Hutter  2016-05-12  1  -0/+2

  Update the zfs module to collect statistics on average latencies, queue sizes, and keep an internal histogram of all IO latencies. Along with this, update "zpool iostat" with some new options to print out the stats:

  -l: Include average IO latencies stats:

     total_wait     disk_wait    syncq_wait   asyncq_wait   scrub
     read  write   read  write   read  write   read  write   wait
    -----  -----  -----  -----  -----  -----  -----  -----  -----
        -   41ms      -    2ms      -   46ms      -    4ms      -
        -    5ms      -    1ms      -    1us      -    4ms      -
        -    5ms      -    1ms      -    1us      -    4ms      -
        -      -      -      -      -      -      -      -      -
        -   49ms      -    2ms      -   47ms      -      -      -
        -      -      -      -      -      -      -      -      -
        -    2ms      -    1ms      -      -      -    1ms      -
    -----  -----  -----  -----  -----  -----  -----  -----  -----
      1ms    1ms    1ms  413us   16us   25us      -    5ms      -
      1ms    1ms    1ms  413us   16us   25us      -    5ms      -
      2ms    1ms    2ms  412us   26us   25us      -    5ms      -
        -    1ms      -  413us      -   25us      -    5ms      -
        -    1ms      -  460us      -   29us      -    5ms      -
    196us    1ms  196us  370us    7us   23us      -    5ms      -
    -----  -----  -----  -----  -----  -----  -----  -----  -----

  -w: Print out latency histograms:

    sdb            total            disk      sync_queue     async_queue
    latency    read   write    read   write    read   write    read   write   scrub
    ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
    1ns           0       0       0       0       0       0       0       0       0
    ...
    33us          0       0       0       0       0       0       0       0       0
    66us          0       0     107    2486       2     788      12      12       0
    131us         2     797     359    4499      10     558     184     184       6
    262us        22     801     264    1563      10     286     287     287      24
    524us        87     575      71   52086      15    1063     136     136      92
    1ms         152    1190       5   41292       4    1693     252     252     141
    2ms         245    2018       0   50007       0    2322     371     371     220
    4ms         189    7455      22  162957       0    3912    6726    6726     199
    8ms         108    9461       0  102320       0    5775    2526    2526      86
    17ms         23   11287       0   37142       0    8043    1813    1813      19
    34ms          0   14725       0   24015       0   11732    3071    3071       0
    67ms          0   23597       0    7914       0   18113    5025    5025       0
    134ms         0   33798       0     254       0   25755    7326    7326       0
    268ms         0   51780       0      12       0   41593   10002   10002       0
    537ms         0   77808       0       0       0   64255   13120   13120       0
    1s            0  105281       0       0       0   83805   20841   20841       0
    2s            0   88248       0       0       0   73772   14006   14006       0
    4s            0   47266       0       0       0   29783   17176   17176       0
    9s            0   10460       0       0       0    4130    6295    6295       0
    17s           0       0       0       0       0       0       0       0       0
    34s           0       0       0       0       0       0       0       0       0
    69s           0       0       0       0       0       0       0       0       0
    137s          0       0       0       0       0       0       0       0       0
    -------------------------------------------------------------------------------

  -h: Help

  -H: Scripted mode. Do not display headers, and separate fields by a single tab instead of arbitrary space.

  -q: Include current number of entries in sync & async read/write queues, and scrub queue:

      syncq_read   syncq_write   asyncq_read  asyncq_write   scrubq_read
     pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
    -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
        0      0      0      0     78     29      0      0      0      0
        0      0      0      0     78     29      0      0      0      0
        0      0      0      0      0      0      0      0      0      0
        -      -      -      -      -      -      -      -      -      -
        0      0      0      0      0      0      0      0      0      0
        -      -      -      -      -      -      -      -      -      -
        0      0      0      0      0      0      0      0      0      0
    -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
        0      0    227    394      0     19      0      0      0      0
        0      0    227    394      0     19      0      0      0      0
        0      0    108     98      0     19      0      0      0      0
        0      0     19     98      0      0      0      0      0      0
        0      0     78     98      0      0      0      0      0      0
        0      0     19     88      0      0      0      0      0      0
    -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

  -p: Display numbers in parseable (exact) values.

  Also, update iostat syntax to allow the user to specify specific vdevs to show statistics for. The three options for choosing pools/vdevs are:

  Display a list of pools:
      zpool iostat ... [pool ...]

  Display a list of vdevs from a specific pool:
      zpool iostat ... [pool vdev ...]

  Display a list of vdevs from any pools:
      zpool iostat ... [vdev ...]

  Lastly, allow zpool command "interval" value to be floating point:
      zpool iostat -v 0.5

  Signed-off-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4433
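The row labels in the -w histogram above (1ns, ..., 33us, 66us, 131us, 262us, ...) are consistent with power-of-two nanosecond buckets: 2^15 ns is about 33us, 2^16 ns about 66us, and so on up to 2^37 ns, about 137s. The commit message does not state the binning rule, so the sketch below rests on that assumption. It places a latency in bucket floor(log2(ns)); DEMO_BUCKETS and the demo_* functions are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bucket count: labels 1ns (2^0) through 137s (2^37). */
#define DEMO_BUCKETS	38

/*
 * Map a latency in nanoseconds to a power-of-two bucket index,
 * i.e. floor(log2(ns)), clamped to the last bucket. Latencies of
 * 0 or 1 ns land in bucket 0.
 */
static unsigned
demo_latency_bucket(uint64_t ns)
{
	unsigned idx = 0;

	while (ns > 1 && idx < DEMO_BUCKETS - 1) {
		ns >>= 1;
		idx++;
	}
	return (idx);
}

/* Record one completed I/O in a histogram of DEMO_BUCKETS counters. */
static void
demo_histo_add(uint64_t *histo, uint64_t ns)
{
	assert(histo != 0);
	histo[demo_latency_bucket(ns)]++;
}
```

Under this assumption a 100 microsecond I/O falls in bucket 16 (the row labeled 66us) and a 1 millisecond I/O in bucket 19 (the row labeled 524us), because each row label is the lower bound of its bucket.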
* Wrap vdev_count_verify_zaps() with ZFS_DEBUG  Brian Behlendorf  2016-05-06  1  -0/+4

  Commit e0ab3ab introduced two blocks of code which are only needed when debugging is enabled. These blocks should be wrapped with ZFS_DEBUG for clarity and to prevent unused variable warnings in a production build.

  Signed-off-by: Don Brady <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4515
* OpenZFS 6736 - ZFS per-vdev ZAPs  Joe Stein  2016-05-02  1  -7/+227

  6736 ZFS per-vdev ZAPs
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: John Kennedy <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Don Brady <[email protected]>
  Reviewed by: Dan McDonald <[email protected]>

  References:
    https://www.illumos.org/issues/6736
    https://github.com/openzfs/openzfs/commit/215198a

  Ported-by: Don Brady <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4515
* Illumos 6659 - nvlist_free(NULL) is a no-op  Josef 'Jeff' Sipek  2016-04-27  1  -2/+1

  6659 nvlist_free(NULL) is a no-op
  Reviewed by: Toomas Soome <[email protected]>
  Reviewed by: Marcel Telka <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>

  References:
    https://www.illumos.org/issues/6659
    https://github.com/illumos/illumos-gate/commit/aab83bb

  Ported-by: David Quigley <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4566
* Add support for asynchronous zvol minor operations  Boris Protopopov  2016-03-10  1  -7/+30

  zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset

  zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors()

  Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of.

  * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched.

  * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another.

  * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed.

  Support for asynchronous zvol minor operations addresses issue #3681.

  Signed-off-by: Boris Protopopov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #2217
  Closes #3678
  Closes #3681
* Illumos 6527 - Possible access beyond end of string in zpool comment  Jorgen Lundman  2016-01-28  1  -1/+0

  6527 Possible access beyond end of string in zpool comment
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Dan McDonald <[email protected]>
  Approved by: Gordon Ross <[email protected]>

  References:
    https://www.illumos.org/issues/6527
    https://github.com/illumos/illumos-gate/commit/2bd7a8d

  Ported-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
* Illumos 6414 - vdev_config_sync could be simpler  Brian Behlendorf  2016-01-28  1  -8/+2

  6414 vdev_config_sync could be simpler
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>

  References:
    https://www.illumos.org/issues/6414
    https://github.com/illumos/illumos-gate/commit/eb5bb58

  Ported-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
* Illumos 5045 - use atomic_{inc,dec}_* instead of atomic_add_*  Josef 'Jeff' Sipek  2016-01-15  1  -2/+2

  5045 use atomic_{inc,dec}_* instead of atomic_add_*
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Garrett D'Amore <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>

  References:
    https://www.illumos.org/issues/5045
    https://github.com/illumos/illumos-gate/commit/1a5e258

  Porting notes:
  - All changes to non-ZFS files dropped.
  - Changes to zfs_vfsops.c dropped because they were Illumos specific.

  Ported-by: Brian Behlendorf <[email protected]>
  Closes #4220
* Illumos 3749 - zfs event processing should work on R/O root filesystems  Will Andrews  2016-01-12  1  -2/+29

  3749 zfs event processing should work on R/O root filesystems
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Eric Schrock <[email protected]>
  Approved by: Christopher Siden <[email protected]>

  References:
    https://www.illumos.org/issues/3749
    https://github.com/illumos/illumos-gate/commit/3cb69f7

  Porting notes:
  - [include/sys/spa_impl.h]
    - ffe9d38 Add generic errata infrastructure
    - 1421c89 Add visibility in to arc_read
  - [include/sys/fm/fs/zfs.h]
    - 2668527 Add linux events
    - 6283f55 Support custom build directories and move includes
  - [module/zfs/spa_config.c]
    - Updated spa_config_sync() to match illumos with the exception of a Linux specific block.

  Ported-by: kernelOfTruth [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
* Revert "Illumos 3749 - zfs event processing should work on R/O root filesystems"  Brian Behlendorf  2016-01-11  1  -29/+2

  This reverts commit b47637ecdc7b647ec5bd9dfca888179eecfaa72d which introduced a regression in ztest.

      $ ./cmd/ztest/ztest -V
      5 vdevs, 7 datasets, 23 threads, 300 seconds...
      *** Error in `/rpool/home/behlendo/src/git/zfs/cmd/ztest/.libs/lt-ztest': double free or corruption (fasttop): 0x0000000000d339f0 ***
* Illumos 3749 - zfs event processing should work on R/O root filesystems  Will Andrews  2016-01-11  1  -2/+29

  3749 zfs event processing should work on R/O root filesystems
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Eric Schrock <[email protected]>
  Approved by: Christopher Siden <[email protected]>

  References:
    https://www.illumos.org/issues/3749
    https://github.com/illumos/illumos-gate/commit/3cb69f7

  Porting notes:
  - [include/sys/spa_impl.h]
    - ffe9d38 Add generic errata infrastructure
    - 1421c89 Add visibility in to arc_read
  - [include/sys/fm/fs/zfs.h]
    - 2668527 Add linux events
    - 6283f55 Support custom build directories and move includes
  - [module/zfs/spa_config.c]
    - Updated spa_config_sync() to match illumos with the exception of a Linux specific block.

  Ported-by: kernelOfTruth [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5960, 5925  Paul Dagnelie  2016-01-08  1  -1/+1

  5960 zfs recv should prefetch indirect blocks
  5925 zfs receive -o origin=
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>

  References:
    https://www.illumos.org/issues/5960
    https://www.illumos.org/issues/5925
    https://github.com/illumos/illumos-gate/commit/a2cdcdd

  Porting notes:
  - [lib/libzfs/libzfs_sendrecv.c]
    - b8864a2 Fix gcc cast warnings
    - 325f023 Add linux kernel device support
    - 5c3f61e Increase Linux pipe buffer size on 'zfs receive'
  - [module/zfs/zfs_vnops.c]
    - 3558fd7 Prototype/structure update for Linux
    - c12e3a5 Restructure zfs_readdir() to fix regressions
  - [module/zfs/zvol.c]
    - Function @zvol_map_block() isn't needed in ZoL
    - 9965059 Prefetch start and end of volumes
  - [module/zfs/dmu.c]
    - Fixed ISO C90 - mixed declarations and code
    - Function dmu_prefetch() 'int i' is initialized before the following code block (c90 vs. c99)
  - [module/zfs/dbuf.c]
    - fc5bb51 Fix stack dbuf_hold_impl()
    - 9b67f60 Illumos 4757, 4913
    - 34229a2 Reduce stack usage for recursive traverse_visitbp()
  - [module/zfs/dmu_send.c]
    - Fixed ISO C90 - mixed declarations and code
    - b58986e Use large stacks when available
    - 241b541 Illumos 5959 - clean up per-dataset feature count code
    - 77aef6f Use vmem_alloc() for nvlists
    - 00b4602 Add linux kernel memory support

  Ported-by: kernelOfTruth [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
* Make zio_taskq_batch_pct user configurable  DHE  2015-12-18  1  -1/+6

  Adds zio_taskq_batch_pct as an exported module parameter, allowing users to modify it at module load time.

  Signed-off-by: DHE <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4110
* Align thread priority with Linux defaults  Brian Behlendorf  2015-07-28  1  -3/+5

  Under Linux filesystem threads responsible for handling I/O are normally created with the maximum priority. Non-I/O filesystem processes run with the default priority. ZFS should adopt the same priority scheme under Linux to maintain good performance and so that it will compete fairly when other Linux filesystems are active. The priorities have been updated to the following:

      $ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
      -  TS 10743  19 -20 [spl_kmem_cache]
      -  TS 10744  19 -20 [spl_system_task]
      -  TS 10745  19 -20 [spl_dynamic_tas]
      -  TS 10764  19   0 [dbu_evict]
      -  TS 10765  19   0 [arc_prune]
      -  TS 10766  19   0 [arc_reclaim]
      -  TS 10767  19   0 [arc_user_evicts]
      -  TS 10768  19   0 [l2arc_feed]
      -  TS 10769  39   0 [z_unmount]
      -  TS 10770  39 -20 [zvol]
      -  TS 11011  39 -20 [z_null_iss]
      -  TS 11012  39 -20 [z_null_int]
      -  TS 11013  39 -20 [z_rd_iss]
      -  TS 11014  39 -20 [z_rd_int_0]
      -  TS 11022  38 -19 [z_wr_iss]
      -  TS 11023  39 -20 [z_wr_iss_h]
      -  TS 11024  39 -20 [z_wr_int_0]
      -  TS 11032  39 -20 [z_wr_int_h]
      -  TS 11033  39 -20 [z_fr_iss_0]
      -  TS 11041  39 -20 [z_fr_int]
      -  TS 11042  39 -20 [z_cl_iss]
      -  TS 11043  39 -20 [z_cl_int]
      -  TS 11044  39 -20 [z_ioctl_iss]
      -  TS 11045  39 -20 [z_ioctl_int]
      -  TS 11046  39 -20 [metaslab_group_]
      -  TS 11050  19   0 [z_iput]
      -  TS 11121  38 -19 [z_wr_iss]

  Note that under Linux the meaning of a process's priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high values on illumos indicate a _high_ priority.

  In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro, which could also lead to confusion.

  This patch depends on https://github.com/zfsonlinux/spl/pull/466.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Closes #3607
* Illumos 5347 - idle pool may run itself out of space  Matthew Ahrens  2015-07-14  1  -16/+35

  5347 idle pool may run itself out of space
  Reviewed by: Alex Reece <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Steven Hartland <[email protected]>
  Reviewed by: Richard Elling <[email protected]>
  Approved by: Dan McDonald <[email protected]>

  References:
    https://github.com/illumos/illumos-gate/commit/231aab8
    https://github.com/illumos/illumos-gate/commit/4a92375 3642
    https://www.illumos.org/issues/5347
    https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix)
    https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390
    https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643

  Porting notes:
  This is completing the partial fix from FreeBSD

  Ported-by: kernelOfTruth [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3586
* Update all default taskq settings  Brian Behlendorf  2015-06-25  1  -4/+4

  Over the years the default values for the taskqs used on Linux have differed slightly from illumos. In the vast majority of cases this was done to avoid creating an obnoxious number of idle threads which would pollute the process listing.

  With the addition of support for dynamic taskqs all multi-threaded queues should be created as dynamic taskqs. This allows us to get the best of both worlds.

  * The illumos default values for the I/O pipeline can be restored. These values are known to work well for most workloads. The only exception is the zio write interrupt taskq which is changed to ZTI_P(12, 8). At least under Linux more threads have been shown to improve performance, see commit 7e55f4e.

  * Reduces the number of idle threads on the system when it's not under heavy load. The maximum number of threads will only be created when they are required.

  * Remove the vdev_file_taskq and rely on the system_taskq instead which is now dynamic and may have up to 64 threads. Again this brings us back in line with upstream.

  * Tasks dispatched with taskq_dispatch_ent() are allowed to use dynamic taskqs. The Linux taskq implementation supports this.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #3507
* Illumos 5818 - zfs {ref}compressratio is incorrect with 4k sector size  Matthew Ahrens  2015-06-10  1  -0/+2

  5818 zfs {ref}compressratio is incorrect with 4k sector size
  Reviewed by: Alex Reece <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Richard Elling <[email protected]>
  Reviewed by: Steven Hartland <[email protected]>
  Approved by: Albert Lee <[email protected]>

  References:
    https://www.illumos.org/issues/5818
    https://github.com/illumos/illumos-gate/commit/81cd5c5

  Ported-by: Don Brady <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3432
* Illumos 5269 - zpool import slow  Arne Jansen  2015-06-09  1  -5/+6

  5269 zpool import slow
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: George Wilson <[email protected]>
  Reviewed by: Dan McDonald <[email protected]>
  Approved by: Dan McDonald <[email protected]>

  References:
    https://www.illumos.org/issues/5269
    https://github.com/illumos/illumos-gate/commit/12380e1e

  Ported-by: DHE <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #3396