summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Fix coverity defects: CID 147553cao2016-11-011-1/+2
| | | | | | | CID 147553: Type:Dereference null return value Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5305
* Fix coverity defects: CID 147548cao2016-10-313-5/+7
| | | | | | | CID 147548: Type:Dereference null return value Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5321
* Fix coverity defects: CID 152975cao2016-10-311-2/+7
| | | | | | | CID 152975: Type:Dereference null return value Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5322
* Fix coverity defects: CID 147509GeLiXin2016-10-311-2/+12
| | | | | | | | CID 147509: Explicit null dereferenced - l2arc_sublist_lock is fragile as relied on caller too much. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: GeLiXin <[email protected]> Closes #5319
* Update migration testslegend-hua2016-10-311-10/+3
| | | | | | | | | | | Due to the instability of the migration tests, the test will skip. The migration tests focus on migrating test file from fs to ZFS fs. We can create zpool and ext2 directly by loop device, rather than by set_partition Reviewed-by: Sydney Vanda <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: legend-hua <[email protected]> Closes #5315
* Add paxcheck make lint targetJason Zaman2016-10-282-1/+49
| | | | | | | | | | | | | This uses scanelf (from pax-utils) to check for any issues with the binaries. It currently checks for executable stacks and textrels. The checks are in a script so can be extended easily in the future for more checks. Executable stacks and textrels are frequently caused by issues in asm files and lead to security and perf problems. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jason Zaman <[email protected]> Closes #5338
* Tag 0.7.0-rc2zfs-0.7.0-rc2Brian Behlendorf2016-10-261-1/+1
| | | | | | Second release candidate. Signed-off-by: Brian Behlendorf <[email protected]>
* Fix lookup_bdev() on UbuntuHajo Möller2016-10-263-10/+31
| | | | | | | | | | | | Ubuntu added support for checking inode permissions to lookup_bdev() in kernel commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21). Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517 This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and calls the function in the correct way. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Hajo Möller <[email protected]> Closes #5336
* Disable zio_dva_throttle_enabled by defaultBrian Behlendorf2016-10-262-2/+2
| | | | | | | | | | Until it can be determined definitively that a performance regression wasn't introduced accidentally by 3dfb57a this functionality is being disabled by default. It can be re- enabled by setting zio_dva_throttle_enabled=1. Signed-off-by: Brian Behlendorf <[email protected]> Closes #5335 Issue #5289
* Allow for '-o feature@<feature>=disabled' on the command lineLOLi2016-10-257-22/+136
| | | | | | | | | | | | | Sometimes it is desirable to specifically disable one or several features directly on the 'zpool create' command line. $ zpool create -o feature@<feature>=disabled ... Original-patch-by: Turbo Fredriksson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #3460 Closes #5142 Closes #5324
* Do not upgrade userobj accounting for snapshot datasetjxiong2016-10-251-0/+1
| | | | | | | | | | | | | | 'zfs recv' could disown a living objset without calling dmu_objset_disown(). This will cause the problem that the objset would be released while the upgrading thread is still running. This patch avoids the problem by checking if a dataset is a snapshot before calling dmu_objset_userobjspace_upgrade(). Snapshots are immutable and therefore it doesn't make sense to update them. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jinshan Xiong <[email protected]> Closes #5295 Closes #5328
* Fix statechange-led.sh & unnecessary libdevmapper warningTony Hutter2016-10-254-22/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Fix autoreplace behaviour on statechange-led.sh script. ZED sends the following events on an auto-replace: 1. statechange: Disk goes UNAVAIL->ONLINE 2. statechange: Disk goes ONLINE->UNAVAIL 3. vdev_attach: Disk goes ONLINE Events 1-2 happen when ZED first attempts to do an auto-online. When that fails, ZED then tries an auto-replace, generating the vdev_attach event in #3. In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition, assuming it was just the "old" VDEV going offline. This is problematic, as a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want to ignore that. This new patch correctly turns on the fault LED every time a drive becomes UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED when an auto-replaced disk comes online. - Remove unnecessary libdevmapper warning with --with-config=kernel This fixes an unnecessary libdevmapper warning when building --with-config=kernel. Kernel code does not use libdevmapper, so the warning is not needed. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #2375 Closes #5312 Closes #5331
* icp: mark asm files with noexec stackJason Zaman2016-10-251-0/+4
| | | | | | | | Similar to commit a3600a106. Asm files need an explicit note that they do not require an executable stack. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jason Zaman <[email protected]> Closes #5332
* Fix cred leak in zpl_fallocate_commontuxoko2016-10-241-2/+1
| | | | | | | | | This is caught by kmemleak when running compress_004_pos Reviewed-by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5244 Closes #5330
* Disable zpool_upgrade_002_pos test caseBrian Behlendorf2016-10-241-1/+2
| | | | | | | | | | This test case frequently triggers issue #4034. There exists a fix for this which is in the process of being upstreamed. Until that fix is available disable the test case. Reviewed by: George Wilson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5329 Issue #4034
* Fix coverity defects: CID 147511, 147513cao2016-10-242-3/+5
| | | | | | | | CID 147511: Type:Dereference before null check CID 147513: Type:Dereference before null check Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5306
* Fix taskq creation failure in vdev_open_children()Brian Behlendorf2016-10-246-0/+155
| | | | | | | | | | | | | | When creating and destroying pools in tight loop it's possible to exhaust the number of allowed threads on a system. This results in taskq_create() failling and a NULL dereference. Resolve the issue by falling back to opening the vdevs all synchronously. Reviewed-by: Denys Rtveliashvili <[email protected]> Reviewed-by: Håkan Johansson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes zfsonlinux/spl#521 Closes #4637
* Turn on/off enclosure slot fault LED even when disk isn't presentTony Hutter2016-10-2413-80/+215
| | | | | | | | | | | | | | | | | Previously when a drive faulted, the statechange-led.sh script would lookup the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and turn it on. During testing we noticed that if you pulled out a drive, or if the drive was so badly broken that it no longer appeared to Linux, that the /sys/block/sd* path would be removed, and the script could not lookup the LED entry. To fix this, this patch looks up the disks's more persistent "/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then passes that path to the statechange-led script to use, rather than having the script look it up on the fly. This allows the script to turn on/off the slot LEDs even when the drive is missing. Closes #5309 Closes #2375
* Change location of current symlink created by test-runnerGiuseppe Di Natale2016-10-241-1/+2
| | | | | | | | | | | | | | | | test-runner should be creating the current symlink in the directory above the output directory. In a previous commit, the current symlink was placed in the current working directory, which could be inaccessible. It is more likely that the output directory is always accessible. This is needed because without this there's no deterministic way to get the path to ZFS Test Suite results until after the test suite has started. This makes it difficult for buildbot to follow the log file. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #5314
* Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bitsRomain Dolbeau2016-10-216-1/+232
| | | | | | | | | | | | | | | | | | | | | | | | | This is not useful on micro-architecture with a weak NEON implementation (only 64 bits); the native version is slower & the byteswap barely faster than scalar. On A53 or A57, it's a small improvement on scalar but OK for byteswap. Results from an A53 system: 0 0 0x01 -1 0 1499068294333000 1499101101878000 implementation native byteswap scalar 1008227510 755880264 aarch64_neon 1198098720 1044818671 fastest aarch64_neon aarch64_neon Results from a A57 system: 0 0 0x01 -1 0 4407214734807033 4407233933777404 implementation native byteswap scalar 2302071241 1124873346 aarch64_neon 2542214946 2245570352 fastest aarch64_neon aarch64_neon Reviewed-by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Romain Dolbeau <[email protected]> Closes #5248
* Fix userquota_compare() functionBrian Behlendorf2016-10-211-1/+4
| | | | | | | | | | | | | | | | | The AVL tree compare function requires that either -1, 0, or 1 be returned. However the strcmp() function only guarantees that a negative, zero, or positive value is returned. Therefore, the return value of strcmp() needs to be sanitized with AVL_ISIGN. This was initially overlooked because the x86_64 implementation of strcmp() happens to only returns the allowed values. This was observed on an aarch64 platform which behaves correctly but differently as described above. Reviewed-by: Jinshan Xiong <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5311 Closes #5313
* Fix coverity defects: CID 153459luozhengzheng2016-10-201-1/+1
| | | | | | | | CID 153459: Null pointer dereferences (FORWARD_NULL) Accidentally introduced by #5159. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5310
* Fix coverity defects: CID 147551, 147552cao2016-10-202-2/+9
| | | | | | | | CID 147551: Type:dereference null return value CID 147552: Type:dereference null return value Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5279
* Fix coverity defects: CID 147472cao2016-10-203-4/+15
| | | | | | | CID 147472: Type: 'Constant' variable guards dead code Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5288
* Fix coverity defects: CID 150919, 150923luozhengzheng2016-10-202-2/+2
| | | | | | | | | CID 150919: Buffer not null terminated (BUFFER_SIZE_WARNING) CID 150923: Buffer not null terminated (BUFFER_SIZE_WARNING) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5298
* Update migration_004_pos, migration_005_pos, migration_006_poslegend-hua2016-10-203-3/+3
| | | | | | | Log function should be "log_fail", rather than "log_failED" Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: legend-hua <[email protected]> Closes #5300
* Fix make distclean Makefile.am removalBrian Behlendorf2016-10-201-0/+1
| | | | | | | | | | | | | | The file tests/zfs-tests/tests/stress/Makefile.am gets mistakenly removed by the distclean target because it's empty. Adding a `SUBDIRS =` line prevents the removal. This directory is being preserved as the location to add assorted stress tests. These may include but are not limited to. http://kernel.ubuntu.com/~cking/stress-ng/ https://github.com/zfsonlinux/zfsstress/ Signed-off-by: Brian Behlendorf <[email protected]> Closes #5308
* Linux 4.9 compat: inode_change_ok() renamed setattr_prepare()Brian Behlendorf2016-10-204-1/+36
| | | | | | | | | | | | In torvalds/linux@31051c8 the inode_change_ok() function was renamed setattr_prepare() and updated to take a dentry ratheri than an inode. Update the code to call the setattr_prepare() and add a wrapper function which call inode_change_ok() for older kernels. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Requires-spl: refs/pull/581/head
* Linux 4.9 compat: remove iops->{set,get,remove}xattrChunwei Chen2016-10-203-0/+34
| | | | | | | | In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and generic_{set,get,remove}xattr are removed. xattr operations will directly go through sb->s_xattr. Signed-off-by: Chunwei Chen <[email protected]>
* Linux 4.9 compat: iops->rename() wants flagsChunwei Chen2016-10-204-5/+65
| | | | | | | In Linux 4.9, torvalds/linux@2773bf0, iops->rename() and iops->rename2() are merged together into iops->rename(), it now wants flags. Signed-off-by: Chunwei Chen <[email protected]>
* Remove dir inode operations from zpl_inode_operationsChunwei Chen2016-10-201-8/+0
| | | | | | | These operations are dir specific, there's no point putting them in zpl_inode_operations which is for regular files. Signed-off-by: Chunwei Chen <[email protected]>
* Update .gitignoreBrian Behlendorf2016-10-192-0/+2
| | | | | | | Two additional files were recently introduced and should be ignored by git. Signed-off-by: Brian Behlendorf <[email protected]> Closes #5299
* Multipath autoreplace, control enclosure LEDs, event rate limitingTony Hutter2016-10-1924-61/+668
| | | | | | | | | | | | | | | | | | | | | | | | | | 1. Enable multipath autoreplace support for FMA. This extends FMA autoreplace to work with multipath disks. This requires libdevmapper to be installed at build time. 2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must be supported by the Linux SES driver for this to work. The enclosure LED scripts work for multipath devices as well. The scripts will clear the LED when the fault is cleared. 3. Rate limit ZIO delay and checksum events so as not to flood ZED ZIO delay and checksum events are rate limited to 5/sec in the zfs module. Reviewed-by: Richard Laager <[email protected]> Reviewed by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #2449 Closes #3017 Closes #5159
* Fix coverity defects: CID 150926luozhengzheng2016-10-181-2/+8
| | | | | | | | | | CID 150926: Unchecked return value (CHECKED_RETURN) - This case cannot occur given the existing taskq implementation and flags passed to task_dispatch(). Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5272
* Fix unused variableBrian Behlendorf2016-10-181-2/+2
| | | | | | | | | Accidentally introduced by 3dfb57a, when building with debugging disabled several variables are unused. Resolve this by wrapping them in ASSERTV to remove them for non-debug builds. Reviewed by: Don Brady <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5284
* Fix coverity defects: CID 147643, 152204, 49339GeLiXin2016-10-182-2/+4
| | | | | | | | | | | | | | | | | CID 147643: Type: String not null terminated - make sure that the string is null terminated before strlen and fprintf. CID 152204: Type: Copy into fixed size buffer - since strlcpy isn't availabe here, use strncpy and terminate the string manually. CID 49339: Type: Buffer not null terminated - since strlcpy isn't availabe here, terminate the string manually before fprintf. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: GeLiXin <[email protected]> Closes #5283
* Fix coverity defects: CID 49339, 153393cao2016-10-182-1/+3
| | | | | | | | CID 49339: Type:Buffer not null terminated CID 153393: Type:Buffer not null terminated Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: <cao.xuewen [email protected]> Closes #5296
* Create a symlink to current test-runner outputGiuseppe Di Natale2016-10-181-0/+8
| | | | | | | | | | Generate a symlink in the current working directory to test-runner.py output. This will make it easier for the ZFS buildbot to collect logs. Reviewed by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #5293
* Fix coverity defects: CID 150924luozhengzheng2016-10-171-1/+5
| | | | | | | | | | CID 150924: Unchecked return value (CHECKED_RETURN) - On taskq_dispatch failure the reference must be dropped and this entry can be safely skipped. This case should be impossible in the existing implementation but should be handled regardless. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5278
* Properly use the Dracut cleanup hook to order pool shutdownRudd-O2016-10-174-2/+13
| | | | | | | | | | | | | | | | | | When Dracut starts up, it needs to determine whether a pool will remain "hanging open" before the system shuts off. In such a case, then the code to clean up the pool (using the previous export -F work) must be invoked. Since Dracut has had a recent change that makes mount-zfs.sh simply not run when the root dataset is already mounted, we must use the cleanup hook to order Dracut to do shutdown cleanup. Important note: this code will not accomplish its stated goal until this bug is fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1385432 That bug impacts more than just ZFS. It impacts LUKS, dmraid, and unmount during poweroff. It is a Fedora-wide bug. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Manuel Amador (Rudd-O) <[email protected]> Closes #5287
* Pass status_cbdata_t to print_status_config() and friendsHåkan Johansson2016-10-171-68/+66
| | | | | | | | | | | | | | | | | | First rename spare_cbdata_t cb -> spare_cb in print_status_config(), to free up cb. Using the structure removes the explicit parameters namewidth and name_flags from several functions. Also use status_cbdata_t for print_import_config(). This simplifies print_logs(). Remove the parameter 'verbose' for print_logs(). It does not really mean verbose, it selected between the print_status_config and print_import_config() paths. This selection is now done by cb_print_config of spare_cbdata_t. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Håkan Johansson <[email protected]> Closes #5259
* Use -F to export pools so as not to dirty up device labelsRudd-O2016-10-154-10/+11
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Manuel Amador (Rudd-O) <[email protected]> Closes #5228 Closes #5238
* Allow partition aliases in vdev_id.conf (#5266)Brian Behlendorf2016-10-141-0/+1
| | | | | | | | | | | | | When pools are assembled from partitions, vdev_id.conf aliases do not work. The directory /dev/disk/by-vdev is not created because the associated udev rule for parsing vdev_id.conf is never called. Extend to logic to match "disk" and "partition". Patch-proposed-by: @sparksh Reviewed-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3859 Closes #5266
* Fix coverity defects: CID 147488, 147490cao2016-10-142-2/+8
| | | | | | | | | CID 147488, Type:explicit null dereferenced CID 147490, Type:dereference null return value Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5237
* OpenZFS 6877 - zfs_rename_006_pos fails due to missing zvol snapshot device fileAkash Ayare2016-10-142-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | Authored by: Akash Ayare <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed-by: luozhengzheng <[email protected]> Reviewed-by: yuxiang <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Bug was caused due to a change in functionality. At some point, ZFS snapshots no longer created associated device files which were being used in the test. To resolve this issue, a clone of the snapshot can be produced which will also create the expected device files; then, the test will behave as it did historically. OpenZFS-issue: https://www.illumos.org/issues/6877 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2200f27 Closes #5275 Porting Notes: - Hardcoded /dev/zvol/rdsk changed to $ZVOL_RDEVDIR for compatibility. - Enabled in linux runfile.
* Enable zfs_rename_002_pos, zfs_rename_005_neg, zfs_rename_007_posBrian Behlendorf2016-10-143-5/+8
| | | | | | | | | | These tests all pass once updated to wait for udev to create the expected linked under /dev/zvol/. Reviewed-by: luozhengzheng <[email protected]> Reviewed-by: yuxiang <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5275
* Fix coverity defects: CID 150921, 150927luozhengzheng2016-10-141-2/+2
| | | | | | | | CID 150921: Unchecked return value (CHECKED_RETURN) CID 150927 : Unchecked return value (CHECKED_RETURN) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5267
* Enable quota_002_pos, quota_004_pos and quota_005_posliaoyuxiangqin2016-10-144-9/+6
| | | | | | | | | | | | | | | | In this test the 'ls -ls' command was used to print testfile size in blocks. Because the environment variable BLOCK_SIZE was set the 'ls -ls' command detected this and output its block count as the number of 8192 blocks. Rather than change the variable name the -k was was added to force ls to return 1k blocks. This has the additional advantage of behaving consistently across platforms. For additional details on GNU 'ls' behavior regarding block size: https://www.gnu.org/software/coreutils/manual/html_node/Block-size.html Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: yuxiang <[email protected]> Closes #5269
* Enable zfs_receive_011_posBrian Behlendorf2016-10-141-2/+2
| | | | | | | | The zfs_receive_011_pos test can be enabled now that OpenZFS 6562 has been merged. Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5276
* OpenZFS 7090 - zfs should throttle allocationsDon Brady2016-10-1318-212/+1069
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OpenZFS 7090 - zfs should throttle allocations Authored by: George Wilson <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Matthew Ahrens <[email protected]> Ported-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. OpenZFS-issue: https://www.illumos.org/issues/7090 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7 Closes #5258 Porting Notes: - Maintained minimal stack in zio_done - Preserve linux-specific io sizes in zio_write_compress - Added module params and documentation - Updated to use optimize AVL cmp macros