aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* zfs-import: Perform verbatim import using cache fileJames Lee2015-10-132-64/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change modifies the import service to use the default cache file to perform a verbatim import of pools at boot. This fixes code that searches all devices and imported all visible pools. Using the cache file is in keeping with the way ZFS has always worked, how Solaris, Illumos, FreeBSD, and systemd performs imports, and is how it is written in the man page (zpool(1M,8)): All pools in this cache are automatically imported when the system boots. Importantly, the cache contains important information for importing multipath devices, and helps control which pools get imported in more dynamic environments like SANs, which may have thousands of visible and constantly changing pools, which the ZFS_POOL_EXCEPTIONS variable is not equipped to handle. Verbatim imports prevent rogue pools from being automatically imported and mounted where they shouldn't be. The change also stops the service from exporting pools at shutdown. Exporting pools is only meant to be performed explicitly by the administrator of the system. The old behavior of searching and importing all visible pools is preserved and can be switched on by heeding the warning and toggling the ZPOOL_IMPORT_ALL_VISIBLE variable in /etc/default/zfs. Signed-off-by: James Lee <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3777 Closes #3526
* zdb: segfault in dump_bpobj_subobjs()Tim Chase2015-10-131-2/+2
| | | | | | | | | Avoid buffer overrun on all-zero bpobj subobjects by using signed array index. Also fix the type cast on the printf() argument. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3905
* libzfs: handle EDOM errorsDHE2015-10-131-0/+5
| | | | | | | | | | | EDOM may occur if a user tries to set `recordsize` too large without use "zfs set". This can be demonstrated with: > zpool create testpool -O recordsize=32M /dev/... Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3911
* Fix 'arc_c < arc_c_min' panicBrian Behlendorf2015-10-131-1/+2
| | | | | | | | | Strictly enforce keeping 'arc_c >= arc_c_min'. The ASSERTs are left in place to catch this in a debug build but logic has been added to gracefully handle in a production build. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3904
* Rename 'zed.service' to 'zfs-zed.service'Turbo Fredriksson2015-10-023-3/+6
| | | | | | | | | | | | | | | For consistency all systemd unit files and init scripts now share the same names. This prevents an issue where the zed is started twice on systems where both the systemd and sysv infrastructure is installed concurrently. For backward compatibility a 'zed' alias has been added. This allows the user to interact with the service using either the name 'zed' or 'zfs-zed'. Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3837
* Fix zfs-dkms uninstall/updateBrian Behlendorf2015-10-021-5/+6
| | | | | | | | | | | | | Modern versions of dkms cleanup the build directory after installing. This resulted in 'dkms uninstall' never running because the check added by commit 866c162 which verifies the existance of the zfs.release build product would never be true. This patch resolves the issue by updating the conditional to check in the explicitly installed zfs_config.h file for the version. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3862
* zfs_inode_update should not call dmu_object_size_from_db under spinlockRichard Yao2015-09-301-2/+4
| | | | | | | | | | | We should never block when holding a spin lock, but zfs_inode_update can block in the critical section of a spin lock in zfs_inode_update: zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3858
* Remove obsolete zv_lockRichard Yao2015-09-301-2/+0
| | | | | | | | | All users of zv_lock were removed by 37f9dac, but we forgot to remove it. Lets remove it as clean up. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3858
* Init script fixesTurbo Fredriksson2015-09-297-63/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Fix regression - "OVERLAY_MOUNTS" should have been "DO_OVERLAY_MOUNTS". * Fix update-rc.d commands in postinst. Thanx to subzero79@GitHub. * Fix make sure a filesystem exists before trying to mount in mount_fs() * Fix local variable usage. * Fix to read_mtab(): * Strip control characters (space - \040) from /proc/mounts GLOBALY, not just first occurrence. * Don't replace unprintable characters ([/-. ]) for use in the variable name with underscore. No need, just remove them all together. * Add check_boolean() to check if a user configure option is set ('yes', 'Yes', 'YES' or any combination there of) OR '1'. Anything else is considered 'unset'. * Add a ZFS_POOL_IMPORT to the default config. * This is a semi colon separated list of pools to import ONLY. * This is intended for systems which have _a lot_ of pools (from a SAN for example) and it would be to many to put in the ZFS_POOL_EXCEPTIONS variable.. * Add a config option "ZPOOL_IMPORT_OPTS" for adding additional options to "zpool import". * Add documentation and the chance of overriding the ZPOOL_CACHE variable in the config file. * Remove "sort" from find_pools() and setup_snapshot_booting(). Sometimes not available, and not really necessary. Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Issue #3816
* Fix uioskip crash when skip to endChunwei Chen2015-09-291-2/+4
| | | | | | | | | | | When doing uioskip to skip an iovec to the very end, the current loop condition will falsely check pass the end of iovec. We fix this checking uio_iovcnt first. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3806 Closes #3850
* Userspace can pass zero length segments via writev/readvRichard Yao2015-09-251-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userspace can trigger an assertion by passing a zero-length segment when assertions are enabled: [27961.614792] VERIFY3(skip < iov->iov_len) failed (0 < 0) [27961.614795] PANIC at zfs_uio.c:187:uio_prefaultpages() [27961.614805] Call Trace: [27961.614811] dump_stack+0x45/0x57 [27961.614830] spl_dumpstack+0x44/0x50 [spl] [27961.614834] spl_panic+0xbb/0x100 [spl] [27961.614908] uio_prefaultpages+0x134/0x140 [zcommon] [27961.614930] zfs_write+0x1fd/0xe80 [zfs] [27961.615014] zpl_write_common_iovec+0x7f/0x110 [zfs] [27961.615035] zpl_iter_write+0xa0/0xd0 [zfs] [27961.615037] do_iter_readv_writev+0x59/0x80 [27961.615063] do_readv_writev+0x11b/0x260 [27961.615098] vfs_writev+0x39/0x50 [27961.615100] SyS_writev+0x4a/0xe0 [27961.615103] system_call_fastpath+0x16/0x6e The solution is to delete the assertion. This could potentially occur in uiomove as well, which contains analogous assertions that appear similarly unnecessary, so we remove those as well. Reported-by: Jonathan Vasquez <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #3792
* Revert "dmu_objset_userquota_get_ids uses dn_bonus unsafely"Brian Behlendorf2015-09-251-17/+3
| | | | | | | | | | | | This reverts commit 5f8e1e850522ee5cd37366427da4b4101a71c8a8. It was determined that this patch introduced the quota regression described in #3789. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3443 Issue #3789
* Fix synchronous behavior in __vdev_disk_physio()Brian Behlendorf2015-09-253-81/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio based on the ZIO_PRIORITY_* flag passed in. This had the unnoticed side-effect of making the vdev_disk_io_start() synchronous for certain I/Os. This in turn resulted in vdev_disk_io_start() being able to re-dispatch zio's which would result in a RCU stalls when a disk was removed from the system. Additionally, this could negatively impact performance and explains the performance regressions reported in both #3829 and #3780. This patch resolves the issue by making the blocking behavior dependent on a 'wait' flag being passed rather than overloading the passed bio flags. Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to non-rotational devices where there is no benefit to queuing to aggregate the I/O. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3652 Issue #3780 Issue #3785 Issue #3817 Issue #3821 Issue #3829 Issue #3832 Issue #3870
* Avoid blocking in arc_reclaim_thread()Brian Behlendorf2015-09-251-11/+4
| | | | | | | | | | | | | | | | | | | | | | As described in the comment above arc_reclaim_thread() it's critical that the reclaim thread be careful about blocking. Just like it must never wait on a hash lock, it must never wait on a task which can in turn wait on the CV in arc_get_data_buf(). This will deadlock, see issue #3822 for full backtraces showing the problem. To resolve this issue arc_kmem_reap_now() has been updated to use the asynchronous arc prune function. This means that arc_prune_async() may now be called while there are still outstanding arc_prune_tasks. However, this isn't a problem because arc_prune_async() already keeps a reference count preventing multiple outstanding tasks per registered consumer. Functionally, this behavior is the same as the counterpart illumos function dnlc_reduce_cache(). Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Issue #3808 Issue #3834 Issue #3822
* Disable zpl_nr_cached_objects() callbackBrian Behlendorf2015-09-251-14/+1
| | | | | | | | | | | The zpl_nr_cached_objects() function has been disabled because in the current code it doesn't provide any critical functionality and it may result in a deadlock under certain circumstances. However, because we expect to need these hooks in the future this code has not been entirely removed. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3719
* Allow NFS activity to defer snapshot unmountsBrian Behlendorf2015-09-251-2/+13
| | | | | | | | | | Accessing a snapshot via NFS should cause an auto-unmount of that snapshot to be deferred until such as time as the snapshot is idle. This is analogous to the zpl_revalidate logic employed by locally mounted snapshots. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3794
* Linux 4.3 compat: bio_end_io_t / BIO_UPTODATELukas Wunner2015-09-254-40/+34
| | | | | | | | | | | | | | | | | | Commit torvalds/linux@4246a0b63bd8f56a1469b12eafeb875b1041a451 ("block: add a bi_error field to struct bio") dropped the error argument from bio_endio in favor of newly introduced bio->bi_error. This also replaces bio->bi_flags value BIO_UPTODATE. bio_endio was a 3 argument function until Linux 2.6.24, which made it a 2 argument function, and now the prototype has changed yet again to a 1 argument function. Support for pre 2.6.24 kernels was already dropped with 37f9dac592bf ("zvol processing should use struct bio") which assumed the 2 argument version in zvol_request(). Remaining code to support the 3 argument version is hereby removed. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Lukas Wunner <[email protected]> Issue #3799
* Fixed --signal typoyuina8222015-09-221-1/+1
| | | | | | Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3773
* Add extra_started_commands because reload function is not defaultyuina8222015-09-221-0/+2
| | | | | | Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3773
* Add large block support to zpios(1) benchmarkDon Brady2015-09-227-17/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of the large block support effort, it makes sense to add support for large blocks to **zpios(1)**. The specifying of a zfs block size for zpios is optional and will default to 128K if the block size is not specified. `zpios ... -S size | --blocksize size ...` This will use *size* ZFS blocks for each test, specified as a comma delimited list with an optional unit suffix. The supported range is powers of two from 128K through 16M. A range of block sizes can be tested as follows: `-S 128K,256K,512K,1M` Example run below (non realistic results from a VM and output abbreviated for space) ``` --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K --threaddelay=0 --cleanup --human-readable --verbose --cleanup --blocksize=128K,256K,512K,1M th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw --------------------------------------------------------------------- 4 750 8m 1m 128k 5g 90.06m 5g 93.37m 4 750 8m 1m 256k 5g 79.71m 5g 99.81m 4 750 8m 1m 512k 5g 42.20m 5g 93.14m 4 750 8m 1m 1m 5g 35.51m 5g 89.36m 8 750 8m 1m 128k 5g 85.49m 5g 90.81m 8 750 8m 1m 256k 5g 61.42m 5g 99.24m 8 750 8m 1m 512k 5g 49.09m 5g 108.78m 16 750 8m 1m 128k 5g 86.28m 5g 88.73m 16 750 8m 1m 256k 5g 64.34m 5g 93.47m 16 750 8m 1m 512k 5g 68.84m 5g 124.47m 16 750 8m 1m 1m 5g 53.97m 5g 97.20m --------------------------------------------------------------------- ``` Signed-off-by: Don Brady <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3795 Closes #2071
* Tab-indent continuation lines in the "scan:" section of "zpool status".Remy Blank2015-09-191-3/+3
| | | | | | | | | | | All other sections use a tab, which makes them easy to parse. Only the "scan:" section had its continuation lines indented with four spaces. This makes them consistent with the others. Signed-off-by: Remy Blank <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #3769
* Honor xattr=sa dataset propertyNed Bass2015-09-195-8/+28
| | | | | | | | | | | | | | | | | | | | | | | | | ZFS incorrectly uses directory-based extended attributes even when xattr=sa is specified as a dataset property or mount option. Support to honor temporary mount options including "xattr" was added in commit 0282c4137e7409e6d85289f4955adf07fac834f5. There are two issues with the mount option handling: * Libzfs has historically included "xattr" in its list of default mount options. This overrides the dataset property, so the dataset is always configured to use directory-based xattrs even when the xattr dataset property is set to off or sa. Address this by removing "xattr" from the set of default mount options in libzfs. * There was no way to enable system attribute-based extended attributes using temporary mount options. Add the mount options "saxattr" and "dirxattr" which enable the xattr behavior their names suggest. This approach has the advantages of mirroring the valid xattr dataset property values and following existing conventions for mount option names. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3787
* Fix NULL as mount(2) syscall data parameterBrian Behlendorf2015-09-191-23/+24
| | | | | | | | | Passing NULL for the mount data should not result in EINVAL. It should be treated as if an empty string were passed and succeed. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #3771
* Discard on zvols should not exceed the length of a blockRichard Yao2015-09-191-0/+1
| | | | | | | | | | | | 37f9dac592bf5889c3efb305c48ac39b4c7dd140 replaced the end-start calculation with a cached value, but neglected to update it on discard operations. This can cause us to discard data not requested, causing data loss on zvols. Reported-by: Richard Connon <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3798
* Tag zfs-0.6.5zfs-0.6.5Brian Behlendorf2015-09-113-1/+7
| | | | | | META file and release log updated. Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 6214 - zpools going southArne Jansen2015-09-113-34/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | 6214 zpools going south Reviewed by: Igor Kozhukhov <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> References: https://www.illumos.org/issues/6214 http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/ Porting Notes: Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6 the compression flags were moved to the generic b_flags in the arc_buf_hdr_t. This is a problem because l2arc_compress_buf() may manipulate the compression flags and this can only be done safely under the hash lock which is not held. See Illumos 6214 for a detailed analysis of the race. HDR_GET_COMPRESS() macro was removed from arc_buf_info(). Ported-by: Brian Behlendorf <[email protected]> Closes #3757
* Prefetch start and end of volumesBrian Behlendorf2015-09-092-0/+32
| | | | | | | | | | | | When adding a zvol to the system prefetch zvol_prefetch_bytes from the start and end of the volume. Prefetching these regions of the volume is desirable because they are likely to be accessed immediately by blkid(8), the kernel scanning for a partition table, or another task which probes the devices. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #3659
* Reintroduce IO accounting on zvols on Linux 3.19+Richard Yao2015-09-094-5/+45
| | | | | | | | | | | | | | zfsonlinux/zfs@e20cd6f7a8922709b1aa2ecefd783390102d79e0 caused us to lose IO accounting on zvols. When I originally wrote that last year, the symbols we needed to maintain IO accounting were GPL exported, but torvalds/linux@394ffa503bc40e32d7f54a9b817264e81ce131b4 provided suitable symbols for restoring this functionality 4 months later. We can call them to restore the IO accounting on Linux 3.19 and later as well as any older kernels where that patch is backported. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3741
* Force create /run/sendsigs.omit.d link when starting zedSenH2015-09-081-1/+1
| | | | | | | | | | | | | | Resolve the following error when restarting the zed by force creating the /run/sendsigs.omit.d/zed link. sudo /etc/init.d/zfs-zed restart * Stopping ZFS Event Daemon [ OK ] * Starting ZFS Event Daemon ln: failed to create symbolic link `/run/sendsigs.omit.d/zed': File exists Signed-off-by: SenH <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3747
* Add dbgmsg kstatBrian Behlendorf2015-09-046-142/+199
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Internally ZFS keeps a small log to facilitate debugging. By default the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file. Writing 0 to this proc file clears the log. $ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable $ echo 0 >/proc/spl/kstat/zfs/dbgmsg $ zpool import tank $ cat /proc/spl/kstat/zfs/dbgmsg 1 0 0x01 -1 0 2492357525542 2525836565501 timestamp message 1441141408 spa=tank async request task=1 1441141408 txg 70 open pool version 5000; software version 5000/5; ... 1441141409 spa=tank async request task=32 1441141409 txg 72 import pool version 5000; software version 5000/5; ... 1441141414 command: lt-zpool import tank Note the zfs_dbgmsg() and dprintf() functions are both now mapped to the same log. As mentioned above the kernel debug log can be accessed though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers log messages are immediately written to stdout after applying the ZFS_DEBUG environment variable. $ ZFS_DEBUG=on ./cmd/ztest/ztest -V Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #3728
* Support accessing .zfs/snapshot via NFSBrian Behlendorf2015-09-043-0/+141
| | | | | | | | | | | | | | | | | | | | This patch is based on the previous work done by @andrey-ve and @yshui. It triggers the automount by using kern_path() to traverse to the known snapshout mount point. Once the snapshot is mounted NFS can access the contents of the snapshot. Allowing NFS clients to access to the .zfs/snapshot directory would normally mean that a root user on a client mounting an export with 'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate snapshots on the server. To prevent configuration mistakes a zfs_admin_snapshot module option was added which disables the mkdir/rmdir/mv functionally. System administators desiring this functionally must explicitly enable it. Signed-off-by: Brian Behlendorf <[email protected]> Closes #2797 Closes #1655 Closes #616
* Fix invalid fileid for snapshot root dentryAndrey Vesnovaty2015-09-041-0/+10
| | | | | | | | Prevents NFS client from detection of different fileids of snapshot root dentry before & after snapshot mount. Signed-off-by: Andrey Vesnovaty <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Merge branch 'zvol'Brian Behlendorf2015-09-0421-700/+288
|\ | | | | | | | | | | | | | | Performance improvements for zvols. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3720
| * Remove blk_queue_nonrot() autotools checkRichard Yao2015-09-042-26/+0
| | | | | | | | | | | | | | | | | | | | This autotools check was never needed because we can check for the existence of QUEUE_FLAG_NONROT in the kernel headers. Also, the comment in config/kernel-blk-queue-nonrot.m4 is incorrect. This was a Linux 2.6.28 API change, not a Linux 2.6.27 API change. Signed-off-by: Richard Yao <[email protected]>
| * Remove blk_queue_discard() autotools checkRichard Yao2015-09-042-23/+0
| | | | | | | | | | | | | | This autotools check was never needed because we can check for the existence of QUEUE_FLAG_DISCARD in the kernel headers. Signed-off-by: Richard Yao <[email protected]>
| * Remove blk_rq_bytes()/blk_rq_sectors autotools checksRichard Yao2015-09-044-87/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Remove blk_rq_pos() autotools checkRichard Yao2015-09-043-30/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Remove blk_fetch_request() autotools checkRichard Yao2015-09-043-40/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Remove blk_requeue_request() autotools checkRichard Yao2015-09-043-34/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Remove blk_end_request() autotools check.Richard Yao2015-09-043-115/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Remove rq_is_sync() autotools checkRichard Yao2015-09-043-30/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Remove rq_for_each_segment() autotools checkRichard Yao2015-09-043-90/+0
| | | | | | | | Signed-off-by: Richard Yao <[email protected]>
| * Support secure discard on zvolsRichard Yao2015-09-041-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be processed, such that we cannot do optimizations like block alignment. Consequently, the discard semantics prior to 2.6.36 require us to always process unaligned discards. Previously, we would do this optimization regardless. This patch changes things to correctly restrict this optimization to situations where REQ_SECURE exists, but is not included in the flags. Signed-off-by: Richard Yao <[email protected]>
| * zvol processing should use struct bioRichard Yao2015-09-0411-218/+276
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Internally, zvols are files exposed through the block device API. This is intended to reduce overhead when things require block devices. However, the ZoL zvol code emulates a traditional block device in that it has a top half and a bottom half. This is an unnecessary source of overhead that does not exist on any other OpenZFS platform does this. This patch removes it. Early users of this patch reported double digit performance gains in IOPS on zvols in the range of 50% to 80%. Comments in the code suggest that the current implementation was done to obtain IO merging from Linux's IO elevator. However, the DMU already does write merging while arc_read() should implicitly merge read IOs because only 1 thread is permitted to fetch the buffer into ARC. In addition, commercial ZFSOnLinux distributions report that regular files are more performant than zvols under the current implementation, and the main consumers of zvols are VMs and iSCSI targets, which have their own elevators to merge IOs. Some minor refactoring allows us to register zfs_request() as our ->make_request() handler in place of the generic_make_request() function. This eliminates the layer of code that broke IO requests on zvols into a top half and a bottom half. This has several benefits: 1. No per zvol spinlocks. 2. No redundant IO elevator processing. 3. Interrupts are disabled only when actually necessary. 4. No redispatching of IOs when all taskq threads are busy. 5. Linux's page out routines will properly block. 6. Many autotools checks become obsolete. An unfortunate consequence of eliminating the layer that generic_make_request() is that we no longer calls the instrumentation hooks for block IO accounting. Those hooks are GPL-exported, so we cannot call them ourselves and consequently, we lose the ability to do IO monitoring via iostat. Since zvols are internally files mapped as block devices, this should be okay. Anyone who is willing to accept the performance penalty for the block IO layer's accounting could use the loop device in between the zvol and its consumer. Alternatively, perf and ftrace likely could be used. Also, tools like latencytop will still work. Tools such as latencytop sometimes provide a better view of performance bottlenecks than the traditional block IO accounting tools do. Lastly, if direct reclaim occurs during spacemap loading and swap is on a zvol, this code will deadlock. That deadlock could already occur with sync=always on zvols. Given that swap on zvols is not yet production ready, this is not a blocker. Signed-off-by: Richard Yao <[email protected]>
| * VDEV_REQ_FUA should be mapped to REQ_FUARichard Yao2015-09-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | Pre-2.6.37 kernels support REQ_FUA in request flags, but not in BIO flags. zvols are the only consumer of VDEV_REQ_FUA and since they are passed requests, they should be obey the REQ_FUA flag like later kernels. This optimization will only matter on 2.6.36 and 2.6.37 because the zvol rework changes things to use bio, where we no longer are able to distinguish on earlier kernels Signed-off-by: Richard Yao <[email protected]>
* | Prevent reclaim in the traverse prefetch threadTim Chase2015-09-041-0/+2
| | | | | | | | | | | | | | | | | | Reclaim in the traverse prefetch thread, which is run on the system taskq, can overrun the stack. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #3733
* | Add temporary mount optionsBrian Behlendorf2015-09-0313-62/+332
|/ | | | | | | | | | | | | Add the required kernel side infrastructure to parse arbitrary mount options. This enables us to support temporary mount options in largely the same way it is handled on other platforms. See the 'Temporary Mount Point Properties' section of zfs(8) for complete details. Signed-off-by: Brian Behlendorf <[email protected]> Closes #985 Closes #3351
* Dbuf hash table should be sized as is the arc hash tableTim Chase2015-09-022-3/+7
| | | | | | | | | | | Commit 49ddb315066e372f31bda29a5c546a9eccc8b418 added the zfs_arc_average_blocksize parameter to allow control over the size of the arc hash table. The dbuf hash table's size should be determined similarly. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3721
* Add spa_slop_shift module optionBrian Behlendorf2015-09-022-0/+19
| | | | | | | | | | Allow for easy turning of a pools reserved free space. Previous versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity in reserve. Commits 3d45fdd and 0c60cc3 increased this to 1/32. Setting spa_slop_shift=6 will restore the previous default setting. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3724
* Reorder zfs-* services to allow /var on separate datasetJames Lee2015-09-025-46/+45
| | | | | | | | | | | | | | | | | | | | | | | | | ZED depends on /var. When /var is a separate dataset, it must be mounted before starting ZED. This change moves the zfs-zed service from starting first, to starting after zfs-mount, but before zfs-share. As discussed in issue #3513, ZED does not need to start first in order to consume events made during the zfs-import and zfs-mount services. The events will be queued and can be handled later in the boot process. ZED may, however, handle sharing in the future, so it should be started before the zfs-share service. This commit also stops the zfs-import service from writing temp files to /var/tmp on shutdown and it corrects the return code for the OpenRC service. Other OpenRC-specific changes noted in issue #3513 were reitereated in issue #3715 and committed in da619f3. Signed-off-by: James Lee <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3513