summaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Clear QUEUE_FLAG_ADD_RANDOM on zvolsRichard Yao2015-08-301-0/+3
| | | | | | | | | | | zvols should not be an entropy source for the kernel. Disable it to be consistent with the upstream kernel. torvalds/linux@b277da0a8a594308e17881f4926879bd5fca2a2d Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3713
* Fix small typololi10K2015-08-301-1/+1
| | | | | | | | | | Add a missing space to the zfs_vdev_sync_write_min_active module parameter description. Signed-off-by: loli10K <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3714
* Initialize the taskq entry embedded within struct vdevTim Chase2015-08-301-0/+1
| | | | | | | | | | | | | As part of the stack reduction effort in 50b25b2187134ac7b19cf93bd35a420223f1d343, a zio_t containing a taskq_ent was added to struct vdev_queue which itself is part of struct vdev. The taskq entry should be initialized as is currently done in zio_create() for newly-created bare zio_t object. The rationale is the same as is described in f467b05a265abcfb8e5a3269f79d08f36a58646a. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3709
* Allow recovery from corrupted snapshot mapsTim Chase2015-08-281-0/+6
| | | | | | | | | | | | | | | | | | If the ZAP object containing a snapshot map is corrupted due to an unrecoverable checksum error or otherwise, dsl_dataset_name() will normally panic the system due to its VERIFY. This patch attempts to allow a recovery avenue from such situations by manufacturing a descriptive snapshot name and then ignoring the error. Scrubbing a pool with this type of corruption will then show the affected object in the error list rather than panicking. The recovery code is only enabled when the zfs_recover module parameter is set. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3705
* Check large block feature flag on volumesBrian Behlendorf2015-08-282-2/+27
| | | | | | | | | | | | | | | Since ZoL allows large blocks to be used by volumes, unlike upstream illumos, the feature flag must be checked prior to volume creation. This is critical because unlike filesystems, volumes will create a object which uses large blocks as part of the create. Therefore, it cannot be safely checked in zfs_check_settable() after the dataset can been created. In addition this patch updates the relevant error messages to use zfs_nicenum() to print the maximum blocksize. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3591
* Limit max_hw_sectors_kb to 16MBrian Behlendorf2015-08-281-1/+1
| | | | | | | | | | | | | | | | When support for large blocks was added DMU_MAX_ACCESS was increased to allow for blocks of up to 16M to fit in a transaction handle. This had the side effect of increasing the max_hw_sectors_kb for volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M. This is an issue for volumes which by default use an 8K block size because it results in dmu_buf_hold_array_by_dnode() allocating a 64K array for the dbufs. The solution is to restore the maximum size to ~10M. This patch specifically changes it to 16M which is close enough. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3684
* Linux 4.1 compat: loop device on ZFSChunwei Chen2015-08-245-137/+162
| | | | | | | | | | | | | | | | | | | Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into iter_read/iter_write. Notably, the loop device will pass bio_vec to backend filesystem. However, current ZFS code assumes iovec without any check, so it will always crash when using loop device. With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC set. The uio* functions are modified to handle bio_vec case separately. The const uio_iov causes some warning in xuio related stuff, so explicit convert them to non const. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3511 Closes #3640
* Linux 4.2 compat: vfs_rename()Brian Behlendorf2015-08-191-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The spa_config_write() function relies on the classic method of making sure updates to the /etc/zfs/zpool.cache file are atomic. It writes out a temporary version of the file and then uses vn_rename() to switch it in to place. This way there can never exist a partial version of the file, it's all or nothing. Conceptually this is a good strategy and it makes good sense for platforms where it's easy to do a rename within the kernel. Unfortunately, Linux is not one of those platforms. Even doing basic I/O to a file system from within the kernel is strongly discouraged. In order to support this at all the vn_rename() implementation ends up being complex and fragile. So fragile that recent Linux 4.2 changes have broken it. While it is possible to update vn_rename() to work with the latest kernels a better long term strategy is to stop using vn_rename() entirely. Then all this complex, fragile code can be removed. Achieving this is straight forward because config_write() is the only consumer of vn_rename(). This patch reworks spa_config_write() to update the cache file in place. The file will be truncated, written out, and then synced to disk. If an error is encountered the file will be unlinked leaving the system in a consistent state. This does expose a tiny tiny tiny window where a system could crash at exactly the wrong moment could leave a partially written cache file. However, this is highly unlikely because the cache file is 1) infrequently updated, 2) only a few kilobytes in size, and 3) written with a single vn_rdwr() call. If this were to somehow happen it poses no risk to pool. Simply removing the cache file will allow the pool to be imported cleanly. Going forward this will be even less of an issue as we intend to disable the use of a cache file by default. Bottom line not using vn_rename() allows us to make ZoL more robust against upstream kernel changes. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3653
* Handle zap_lookup() failure in ddt_object_load()Brian Behlendorf2015-08-191-3/+4
| | | | | | | | | Failing to lookup a name in the spa_ddt_stat_object should not result in a panic in ddt_object_load(). The error can be safely returned to the caller for handling resulting in a useful user error message. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3370
* Add parenthesis to the ternary operatortuxoko2015-08-191-1/+1
| | | | | | | | | Without the parenthesis, this particular ASSERT will evaluate to "(RW_READER == (!zap->zap_ismicro && fatreader)) ? RW_READER : lti" Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3685
* Update arc_memory_throttle() to check pageoutBrian Behlendorf2015-07-301-7/+37
| | | | | | | | | | | | | | | | | | | | | | | | This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3637
* Update arc_available_memory() to check freememBrian Behlendorf2015-07-301-31/+33
| | | | | | | | | | | | | | | | | | | | | | While Linux doesn't provide detailed information about the state of the VM it does provide us total free pages. This information should be incorporated in to the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can checked in /proc/spl/kstat/zfs/arcstats and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3637
* Bound zvol_threads module optionBrian Behlendorf2015-07-291-3/+4
| | | | | | | | | | The zvol_threads module option should be bounded to a reasonable range. The taskq must have at least 1 thread and shouldn't have more than 1,024 at most. The default value of 32 is a reasonable default. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3614
* Fix "BUG: Bad page state" caused by writeback flagChunwei Chen2015-07-291-0/+50
| | | | | | | | | | | | | | | | | | | | | | | | | Commit d958324 fixed the deadlock between page lock and range lock by unlocking the page lock before acquiring the range lock. However, this created a new issue #3075. The problem is that if we can't set the write back bit before releasing the page lock. Then other processes will be unaware that the page is under active write back. They may therefore truncate the page, invalidate the page, or not honor the sync semantics. To workaround this problem we re-dirty the page before dropping the page lock. While this doesn't prevent the page from being truncated it does ensure it won't be invalidated. Then the range lock and the page lock are reacquired in the correct deadlock-free order. Once both locks are safely held the page state can be rechecked. If all is well and the page is in the expect state the dirty bit can be removed, the write back bit set, and the page removed from the skip count. If not the page will be handled as appropriate. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3075
* Align thread priority with Linux defaultsBrian Behlendorf2015-07-289-16/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under Linux filesystem threads responsible for handling I/O are normally created with the maximum priority. Non-I/O filesystem processes run with the default priority. ZFS should adopt the same priority scheme under Linux to maintain good performance and so that it will complete fairly when other Linux filesystems are active. The priorities have been updated to the following: $ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta' - TS 10743 19 -20 [spl_kmem_cache] - TS 10744 19 -20 [spl_system_task] - TS 10745 19 -20 [spl_dynamic_tas] - TS 10764 19 0 [dbu_evict] - TS 10765 19 0 [arc_prune] - TS 10766 19 0 [arc_reclaim] - TS 10767 19 0 [arc_user_evicts] - TS 10768 19 0 [l2arc_feed] - TS 10769 39 0 [z_unmount] - TS 10770 39 -20 [zvol] - TS 11011 39 -20 [z_null_iss] - TS 11012 39 -20 [z_null_int] - TS 11013 39 -20 [z_rd_iss] - TS 11014 39 -20 [z_rd_int_0] - TS 11022 38 -19 [z_wr_iss] - TS 11023 39 -20 [z_wr_iss_h] - TS 11024 39 -20 [z_wr_int_0] - TS 11032 39 -20 [z_wr_int_h] - TS 11033 39 -20 [z_fr_iss_0] - TS 11041 39 -20 [z_fr_int] - TS 11042 39 -20 [z_cl_iss] - TS 11043 39 -20 [z_cl_int] - TS 11044 39 -20 [z_ioctl_iss] - TS 11045 39 -20 [z_ioctl_int] - TS 11046 39 -20 [metaslab_group_] - TS 11050 19 0 [z_iput] - TS 11121 38 -19 [z_wr_iss] Note that under Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. This patch depends on https://github.com/zfsonlinux/spl/pull/466. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #3607
* Check for NULL in dmu_free_long_range_impl()Brian Behlendorf2015-07-281-1/+5
| | | | | | | | | | A NULL should never be passed as the dnode_t pointer to the function dmu_free_long_range_impl(). Regardless, because we have a reported occurrence of this let's add some error handling to catch this. Better to report a reasonable error to caller than panic the system. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3445
* Minor style cleanupBrian Behlendorf2015-07-231-10/+7
| | | | | | | | | Address minor differences in style between upstream and ZoL. This patch contains no functional differences and is solely designed to minimize the delta from upstream. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3533
* Remove double counting HDR_L2ONLY_SIZEBrian Behlendorf2015-07-231-3/+0
| | | | | | | | | | | Commit d962d5d didn't quite properly resolve the HDR_L2ONLY_SIZE accounting. Accounting is now performed only in the constructor and destructor which is a nice simplification. It should have been removed the from create and destroy functions. This brings up back in sync with upstream. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3533
* Add hdr_recl() reclaim callbackBrian Behlendorf2015-07-231-2/+18
| | | | | | | | | | | Originally removed because it wasn't required under Linux. However, there may still be some utility in signaling the arc reclaim thread under Linux via reclaim. This should already have happened by other means but it's not harmless and reduces another point of divergence with upstream. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3533
* Reinstate zfs_arc_p_min_shiftBrian Behlendorf2015-07-231-2/+14
| | | | | | | | | | | | | | | Commit f521ce1 removed the minimum value for "arc_p" allowing it to drop to zero or grow to "arc_c". This was done to improve specific workload which constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). This change may still be desirable but it needs to be re-investigated. in the context of the recent ARC changes from upstream. Therefore this code is being restored to facilitate benchmarking. By setting "zfs_arc_p_min_shift=64" we easily compare the performance. Signed-off-by: Brian Behlendorf <[email protected]> Issue #3533
* Illumos 5817 - change type of arcs_size from uint64_t to refcount_tPrakash Surya2015-07-231-27/+112
| | | | | | | | | | | | | | | | | 5817 change type of arcs_size from uint64_t to refcount_t Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Garrett D'Amore <[email protected]> References: https://www.illumos.org/issues/5817 https://github.com/illumos/illumos-gate/commit/2fd872a Ported-by: Brian Behlendorf <[email protected]> Issue #3533
* Illumos 5445 - Add more visibility via arcstatsPrakash Surya2015-07-231-40/+151
| | | | | | | | | | | | | | | | | | | | | | | 5445 Add more visibility via arcstats; specifically arc_state_t stats and differentiate between "data" and "metadata" Reviewed by: Basil Crow <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Bayard Bell <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5445 https://github.com/illumos/illumos-gate/commit/4076b1b Porting Notes: This patch is an improved version of cc7f677 which was previously merged in ZoL. This patch incorporates the additional improvements which were made upstream. Ported-by: Brian Behlendorf <[email protected]> Issue #3533
* Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_growMatthew Ahrens2015-07-231-156/+369
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Steven Hartland <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5376 https://github.com/illumos/illumos-gate/commit/2ec99e3 Porting Notes: The good news is that many of the recent changes made upstream to the ARC tackled issues previously observed by ZoL with similar solutions. The bad news is those solution weren't identical to the ones we applied. This patch is designed to split the difference and apply as much of the upstream work as possible. * The arc_available_memory() function was removed previous in ZoL but due to the upstream changes it makes sense to add it back. This function has been customized for Linux so that it can be used to determine a low memory. This provides the same basic functionality as the illumos version allowing us to minimize changes through the rest of the code base. The exact mechanism used to detect a low memory state remains unchanged so this change isn't a significant as it might first appear. * This patch includes the long standing fix for arc_shrink() which was originally proposed in #2167. Since there were related changes to this function it made sense to include that work. * The arc_init() function has been re-factored. As before it sets sane default values for the ARC but then calls arc_tuning_update() to apply user specific tuning made via module options. The arc_tuning_update() function is then called periodically by the arc_reclaim_thread() to apply changes to the tunings made during normal operation. Ported-by: Brian Behlendorf <[email protected]> Closes #3616 Closes #2167
* Add logic to try and recover an inode with an invalid modeBrian Behlendorf2015-07-171-4/+11
| | | | | | | | | | | | | | | | | | | | | | | When an inode is detected with invalid mode bits the safe thing to do is panic the system. This indicates a problem with the contents of a dnode and it should never be possible. This is the default behavior. Unfortunately, due to flaws in the system attribute (SA) implementation (on all platforms) it was possible that ZFS could create a damaged dnode. This was a rare issue which only impacted dnodes which used a spill block. Normally only symlinks and files with ACLs would require a spill block. However, if the dataset had the xattr=sa property set and extended attributes were used this problem could occur. As of the 0.6.4 tag the root cause of this issue has been fixed. For pools which are exhibiting this damage the 'zfs_recover=1' module option may be set. This will cause ZFS to interpret the dnode with invalid mode bits as a normal file. This may allow the files to be accessed for recovery purposes. Signed-off-by: Brian Behlendorf <[email protected]> Closes #3548
* Support parallel build trees (VPATH builds)Turbo Fredriksson2015-07-177-115/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Build products from an out of tree build should be written relative to the build directory. Sources should be referred to by their locations in the source directory. This is accomplished by adding the 'src' and 'obj' variables for the module Makefile.am, using relative paths to reference source files, and by setting VPATH when source files are not co-located with the Makefile. This enables the following: $ mkdir build $ cd build $ ../configure \ --with-spl=$HOME/src/git/spl/ \ --with-spl-obj=$HOME/src/git/spl/build $ make -s This change also has the advantage of resolving the following warning which is generated by modern versions of automake. Makefile.am:00: warning: source file 'xxx' is in a subdirectory, Makefile.am:00: but option 'subdir-objects' is disabled Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #1082
* Update inode under range lockBrian Behlendorf2015-07-171-1/+1
| | | | | | | | | | | | After a successful write the inode must be updated under the range lock. If it is updated after dropping the lock there exists a race where the znode and inode wile disagree about the file size. This could result in narrow window of time where read(2) is able to access data beyond what fstat(2) reports as the file size. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #3601
* Linux 4.2 compat: follow_link() / put_link()Brian Behlendorf2015-07-171-4/+27
| | | | | | | | | | | | | | | | | | | | As of Linux 4.2 the kernel has completely retired the nameidata structure. One of the few remaining consumers of this interface were the follow_link() and put_link() callbacks. This patch adds the required checks to configure to detect the interface change and updates the functions accordingly. Migrating to the simple_follow_link() interface was considered but was decided against ironically due to the increased complexity. It also should be noted that the kernel follow_link() and put_link() interfaces changes several times after 4.1 and but before 4.2. This means there is a narrow range of kernel commits which never appear in an official tag of the Linux kernel which ZoL will not build. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #3596
* Linux 4.2 compat: remove bio->bi_cnt accessBrian Behlendorf2015-07-171-9/+0
| | | | | | | | | | | | Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to bio->__bi_cnt. Because this value is only used once in a block of debug code it simplest just to remove the PANIC. To my knowledge this debugging has never been hit or proved useful so this is no great loss. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #3596
* Illumos 5347 - idle pool may run itself out of spaceMatthew Ahrens2015-07-143-23/+46
| | | | | | | | | | | | | | | | | | | | | | | | 5347 idle pool may run itself out of space Reviewed by: Alex Reece <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Steven Hartland <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://github.com/illumos/illumos-gate/commit/231aab8 https://github.com/illumos/illumos-gate/commit/4a92375 3642 https://www.illumos.org/issues/5347 https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix) https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390 https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643 Porting notes: This is completing the partial fix from FreeBSD Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3586
* Illumos 5610 - zfs clone from different source and target pools produces ↵Alexander Eremin2015-07-141-11/+2
| | | | | | | | | | | | | | | | | | | | coredump 5610 zfs clone from different source and target pools produces coredump Reviewed by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://github.com/illumos/illumos-gate/commit/03b1c29 https://www.illumos.org/issues/5610 https://www.illumos.org/issues/5824 https://github.com/zfsonlinux/zfs/issues/2911 https://github.com/zfsonlinux/zfs/commit/9063f65 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3584
* Failure of userland copy should return EFAULTRichard Yao2015-07-141-1/+1
| | | | | | | | | | | | | | | | | Many key internal functions pass system return codes that are safe to return to userland. In the case of ddi_copyin(9F), an error passes -1 and the documentation states very clearly that drivers should pass EFAULT to userland when this happens. http://illumos.org/man/9F/ddi_copyin This does not happen in the ZFS source code. I believe it should be changed to pass EFAULT. I caught this when writing man pages for the libzfs_core API. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3575
* Translate sync zio to sync bioBoris Protopopov2015-07-131-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Translate zio requests with ZIO_PRIORITY_SYNC_READ and ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns out to have a pronounced effect when writing to an SSD-based SLOG. When WRITE_SYNC is not set (WRITE is set instead), the block trace for a SLOG device looks as follows: ... 130,96 0 3 0.008968390 0 C W 830464 + 136 [0] 130,96 0 4 0.011999161 0 C W 830720 + 136 [0] 130,96 0 5 0.023955549 0 C W 831744 + 136 [0] 130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952 130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6] 130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6] 130,96 0 9 0.024343187 19775 P N [z_wr_iss/6] 130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6] 130,96 0 11 0.026784405 0 UT N [swapper] 1 130,96 0 12 0.026805339 202 U N [kblockd/0] 1 130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0] 130,96 0 14 0.026966948 0 C W 832000 + 136 [0] 130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904 130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19] 130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19] 130,96 3 4 0.000455956 19788 P N [z_wr_iss/19] 130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19] 130,96 3 6 0.002786349 0 UT N [swapper] 1 ... Here the 130,197 is the partition created on the log device when adding it to the pool, whereas the base device is 130,96. As one can see, the writes to the SLOG are not marked synchronous (the S is missing next to W), and the queue unplugs occur based on the timer (UT event) resulting in slightly over 2 msec latency of writes. This results in a sub-par performance of single stream synchronous writes (limited by latency of the SLOG). When the WRITE_SYNC is set, a similar trace looks as follows: ... 130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528 130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)] 130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)] 130,96 4 4 0.000003394 70714 P N [(null)] 130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)] 130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)] 130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784 130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)] 130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)] 130,96 5 4 0.000357635 70713 P N [(null)] 130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)] 130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)] 130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0] 130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040 130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15] 130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15] 130,96 16 88 0.000745817 70718 P N [z_wr_iss/15] 130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15] 130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15] 130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0] 130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0] As one can see, all the writes are synchronous (WS), and I/O completions (e.g. from issue I to completion C) take 160-250 usec, or about 10x faster. Since WRITE_SYNC or READ_SYNC flags are among several factors that are considered when processing bio requests, it seems prudent to mark all the zio requests of synchronous priority with the READ/WRITE_SYNC flags to make them eligible for consideration as such by the Linux block I/O layer. Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3529
* Fix switch-bool warningBrian Behlendorf2015-07-131-1/+1
| | | | | | | | | | | As of gcc version 5.1.1 a new warning has been added to detect the use of a boolean in a switch statement (-Wswitch-bool). Resolve the warning by explicitly casting the value to an integer type. zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request': error: switch condition has boolean value [-Werror=switch-bool] Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabledJustin T. Gibbs2015-07-103-20/+27
| | | | | | | | | | | | | | | | 5661 ZFS: "compression = on" should use lz4 if feature is enabled Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Xin LI <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://github.com/illumos/illumos-gate/commit/db1741f https://www.illumos.org/issues/5661 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3571
* Illumos 4745 - fix AVL code misspellingsJosef 'Jeff' Sipek2015-07-101-5/+6
| | | | | | | | | | | | | | | 4745 fix AVL code misspellings Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://github.com/illumos/illumos-gate/commit/6907ca4 https://www.illumos.org/issues/4745 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3565
* Prevent reclaim in metaslab preload threadsTim Chase2015-07-061-0/+2
| | | | | | | | | Reclaim during metaslab preloading can cause deadlocks involving znode z_lock and ARC buffer header ht_lock. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3532.
* Illumos 5008 - lock contention (rrw_exit) while running a read only loadAlexander Motin2015-07-063-10/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | 5008 lock contention (rrw_exit) while running a read only load Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Richard Yao <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Garrett D'Amore <[email protected]> Porting notes: This patch ported perfectly cleanly to ZoL. During testing 100% cached small-block reads, extreme contention was noticed on rrl->rr_lock from rrw_exit() due to the frequent entering and leaving ZPL. Illumos picked up this patch from FreeBSD and it also helps under Linux. On a 1-minute 4K cached read test with 10 fio processes pinned to a single socket on a 4-socket (10 thread per socket) NUMA system, contentions on rrl->rr_lock were reduced from 508799 to 43085. Ported-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3555
* Illumos 5911 - ZFS "hangs" while deleting fileMatthew Ahrens2015-07-064-32/+89
| | | | | | | | | | | | | | | | | | | | | | 5911 ZFS "hangs" while deleting file Reviewed by: Bayard Bell <[email protected]> Reviewed by: Alek Pinchuk <[email protected]> Reviewed by: Simon Klinkert <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Richard Lowe <[email protected]> References: https://www.illumos.org/issues/5911 https://github.com/illumos/illumos-gate/commit/46e1baa Porting notes: Resolved ISO C90 forbids mixed declarations and code wanting in the dnode_free_range() function. Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3554
* Illumos 5981 - Deadlock in dmu_objset_find_dpArne Jansen2015-07-063-4/+38
| | | | | | | | | | | | | | | 5981 Deadlock in dmu_objset_find_dp Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5981 https://github.com/illumos/illumos-gate/commit/1d3f896 Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3553
* Illumos 5946, 5945Andriy Gapon2015-07-062-0/+12
| | | | | | | | | | | | | | | | | 5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots 5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot Reviewed by: Steven Hartland <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Gordon Ross <[email protected]> References: https://www.illumos.org/issues/5946 https://www.illumos.org/issues/5945 https://github.com/illumos/illumos-gate/commit/24218be Ported-by: Andriy Gapon <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3552
* Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens ↵Andriy Gapon2015-07-061-2/+6
| | | | | | | | | | | | | | | | | in drc_force branch 5870 dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5870 https://github.com/illumos/illumos-gate/commit/beddaa9 Ported-by: Andriy Gapon <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3551
* Illumos 5909 - ensure that shared snap names don't become too long after ↵Andriy Gapon2015-07-061-0/+6
| | | | | | | | | | | | | | | | | promotion 5909 ensure that shared snap names don't become too long after promotion Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5909 https://github.com/illumos/illumos-gate/commit/cb5842f Ported-by: Andriy Gapon <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3550
* Illumos 5912 - full stream can not be force-received into a dataset if it ↵Andriy Gapon2015-07-061-4/+6
| | | | | | | | | | | | | | | | | has a snapshot 5912 full stream can not be force-received into a dataset if it has a snapshot Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5912 https://github.com/illumos/illumos-gate/commit/5bae108 Ported-by: Andriy Gapon <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3549
* Illumos 6033 - arc_adjust() should search MFU listsAlek Pinchuk2015-07-011-1/+1
| | | | | | | | | | | | | | | | | 6033 arc_adjust() should search MFU lists for oldest buffer when adjusting MFU size Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Xin Li <[email protected]> Reviewed by: Prakash Surya <[email protected]> Approved by: Matthew Ahrens <[email protected]> References: https://www.illumos.org/issues/6033 https://github.com/illumos/illumos-gate/commit/31c46cf Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #3545
* Illumos 5175 - implement dmu_read_uio_dbuf() to improve cached read performanceMatthew Ahrens2015-06-292-11/+77
| | | | | | | | | | | | | | | | | | | | | | | | 5175 implement dmu_read_uio_dbuf() to improve cached read performance Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Robert Mustacchi <[email protected]> References: https://www.illumos.org/issues/5175 https://github.com/illumos/illumos-gate/commit/f8554bb Porting notes: This patch doesn't include the changes for the COMSTAR (Common Multiprotocol SCSI Target) - since it's not available for ZoL. http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html Ported by: kernelOfTruth <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3392
* Illumos 5368 - ARC should cache more metadataMatthew Ahrens2015-06-251-0/+6
| | | | | | | | | | | | | | | | | | | | 5368 ARC should cache more metadata Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5368 https://github.com/illumos/illumos-gate/commit/3a5286a Porting Notes: The vast majority of this patch was already merged in the context of the 06358ea changes. This is just a small hunk which was missed. Ported-by: Brian Behlendorf <[email protected]>
* Illumos 5163 - arc should reap range_seg_cacheGeorge Wilson2015-06-252-1/+3
| | | | | | | | | | | | | | | | | | | | 5163 arc should reap range_seg_cache Reviewed by: Christopher Siden <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5163 https://github.com/illumos/illumos-gate/commit/83803b5 Porting Notes: Added umem_cache_reap_now() wrapped to suppress unused variable warning for user space build in arc_kmem_reap_now(). Ported-by: Brian Behlendorf <[email protected]>
* Update all default taskq settingsBrian Behlendorf2015-06-258-31/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Over the years the default values for the taskqs used on Linux have differed slightly from illumos. In the vast majority of cases this was done to avoid creating an obnoxious number of idle threads which would pollute the process listing. With the addition of support for dynamic taskqs all multi-threaded queues should be created as dynamic taskqs. This allows us to get the best of both worlds. * The illumos default values for the I/O pipeline can be restored. These values are known to work well for most workloads. The only exception is the zio write interrupt taskq which is changed to ZTI_P(12, 8). At least under Linux more threads has been shown to improve performance, see commit 7e55f4e. * Reduces the number of idle threads on the system when it's not under heavy load. The maximum number of threads will only be created when they are required. * Remove the vdev_file_taskq and rely on the system_taskq instead which is now dynamic and may have up to 64-threads. Again this brings us back inline with upstream. * Tasks dispatched with taskq_dispatch_ent() are allowed to use dynamic taskqs. The Linux taskq implementation supports this. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #3507
* Account for ashift when gathering buffers to be written to l2arc deviceAndriy Gapon2015-06-251-13/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit 3a17a7a9. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit 3a17a7a9 and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3400 Closes #3433 Closes #3451
* Illumos 5701 - zpool list reports incorrect "alloc" value for cache devicesPrakash Surya2015-06-251-42/+139
| | | | | | | | | | | | | | | | | | | | 5701 zpool list reports incorrect "alloc" value for cache devices Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Alek Pinchuk <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5701 https://github.com/illumos/illumos-gate/commit/a52fc31 Porting Notes: arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr). Ported by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]>