summaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* OpenZFS 9617 - too-frequent TXG sync causes excessive write inflationMatthew Ahrens2018-10-042-7/+12
| | | | | | | | | | | | | | | | | | | | Porting notes: * Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and changed the type to be consistent with the other dirty module params. * Updated zfs-module-parameters.5 accordingly. Authored by: Matthew Ahrens <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Reviewed-by: George Melikov <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9617 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba Closes #7976
* Add new fnvlist_lookup_* functionsPaul Dagnelie2018-10-031-1/+80
| | | | | | Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #7977
* OpenZFS 9677 - panic from zio_write_gang_block()Brad Lewis2018-10-031-6/+14
| | | | | | | | | | | | | | | | Panic from zio_write_gang_block() when creating dump device on fragmented rpool. Authored by: Brad Lewis <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9677 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7341a7d Closes #7975
* Refcounted DSL Crypto Key MappingsTom Caputi2018-10-036-94/+186
| | | | | | | | | | | | | | | | | | | | | | | | Since native ZFS encryption was merged, we have been fighting against a series of bugs that come down to the same problem: Key mappings (which must be present during all I/O operations) are created and destroyed based on dataset ownership, but I/Os can have traditionally been allowed to "leak" into the next txg after the dataset is disowned. In the past we have attempted to solve this problem by trying to ensure that datasets are disowned ater all I/O is finished by calling txg_wait_synced(), but we have repeatedly found edge cases that need to be squashed and code paths that might incur a high number of txg syncs. This patch attempts to resolve this issue differently, by adding a reference to the key mapping for each txg it is dirtied in. By doing so, we can remove many of the unnecessary calls to txg_wait_synced() we have added in the past and ensure we don't need to deal with this problem in the future. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7949
* OpenZFS 9700 - ZFS resilvered mirror does not balance readsJerry Jelinek2018-10-021-0/+4
| | | | | | | | | | | | | | | Authored by: Jerry Jelinek <[email protected]> Reviewed by: Toomas Soome <[email protected]> Reviewed by: Sanjay Nadkarni <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Elling <[email protected]> Approved by: Matthew Ahrens <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9700 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/82f63c3c Closes #7973
* Prefix all refcount functions with zfs_Tim Schumacher2018-10-0120-361/+382
| | | | | | | | | | | | Recent changes in the Linux kernel made it necessary to prefix the refcount_add() function with zfs_ due to a name collision. To bring the other functions in line with that and to avoid future collisions, prefix the other refcount functions as well. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Schumacher <[email protected]> Closes #7963
* Refine split block reconstructionBrian Behlendorf2018-10-011-121/+264
| | | | | | | | | | | | | | | | | | | | | Due to a flaw in 4589f3ae the number of unique combinations could be calculated incorrectly. This could result in the random combinations reconstruction being used when it would have been possible to check all combinations. This change fixes the unique combinations calculation and simplifies the reconstruction logic by maintaining a per- segment list of unique copies. The vdev_indirect_splits_damage() function was introduced to validate both the enumeration and random reconstruction logic with ztest. It is implemented such it will never make a known recoverable block unrecoverable. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #6900 Closes #7934
* Fixes for procfs files backed by linked listsJohn Gallagher2018-09-266-522/+607
| | | | | | | | | | | | | | | | | | | | | | | | | | | There are some issues with the way the seq_file interface is implemented for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool debugging info): * We don't account for the fact that seq_file sometimes visits a node multiple times, which results in missing messages when read through procfs. * We don't keep separate state for each reader of a file, so concurrent readers will receive incorrect results. * We don't account for the fact that entries may have been removed from the list between read syscalls, so reading from these files in procfs can cause the system to crash. This change fixes these issues and adds procfs_list, a wrapper around a linked list which abstracts away the details of implementing the seq_file interface for a list and exposing the contents of the list through procfs. Reviewed by: Don Brady <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> External-issue: LX-1211 Closes #7819
* Linux 4.19-rc3+ compat: Remove refcount_t compatTim Schumacher2018-09-2615-65/+65
| | | | | | | | | | | | | | | torvalds/linux@59b57717f ("blkcg: delay blkg destruction until after writeback has finished") added a refcount_t to the blkcg structure. Due to the refcount_t compatibility code, zfs_refcount_t was used by mistake. Resolve this by removing the compatibility code and replacing the occurrences of refcount_t with zfs_refcount_t. Reviewed-by: Franz Pletz <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Schumacher <[email protected]> Closes #7885 Closes #7932
* Fix small sysfs leakBrian Behlendorf2018-09-261-10/+15
| | | | | | | | | | | | | | | | | | | | | When zfs_kobj_init() is called with an attr_cnt of 0 only the kobj->zko_default_attrs is allocated. It subsequently won't get freed in zfs_kobj_release since the free is wrapped in a kobj->zko_attr_count != 0 conditional. Split the block in zfs_kobj_release() to make sure the kobj->zko_default_attrs are freed in this case. Additionally, fix a minor spelling mistake and typo in zfs_kobj_init() which could also cause a leak but in practice is almost certain not to fail. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: John Gallagher <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7957
* Fix statfs(2) for 32-bit user spaceBrian Behlendorf2018-09-242-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | When handling a 32-bit statfs() system call the returned fields, although 64-bit in the kernel, must be limited to 32-bits or an EOVERFLOW error will be returned. This is less of an issue for block counts since the default reported block size in 128KiB. But since it is possible to set a smaller block size, these values will be scaled as needed to fit in a 32-bit unsigned long. Unlike most other filesystems the total possible file counts are more likely to overflow because they are calculated based on the available free space in the pool. In order to prevent this the reported value must be capped at 2^32-1. This is only for statfs(2) reporting, there are no changes to the internal ZFS limits. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #7927 Closes #7122 Closes #7937
* vdev_disk_error() prints ASCII SOH to debug logLOLi2018-09-211-4/+3
| | | | | | | | | | | | | | | | | | Currently vdev_disk_error() prepends its messages sent to the internal ZFS debug log with KERN_WARNING, which is currently defined as follows: #define KERN_SOH "\001" #define KERN_WARNING KERN_SOH "4" Since "\001" (ASCII Start Of Header) is not printable this results in weird characters displayed when inspecting the debug log. This commit simply removes this superfluous prefix passed to zfs_dbgmsg(). Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7936
* Add limits to spa_slop_shift tunableLOLi2018-09-201-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This change adds limits to the possible spa_slop_shift values set via the sysfs interface. Accepted values are from a minimum of 1 to a maximum of 31 (inclusive): these limits are based on the following values observed on a 128PB file-vdev test pool: spa_slop_shift=1, spa_get_slop_space=63.5PiB spa_slop_shift=2, spa_get_slop_space=31.8PiB spa_slop_shift=3, spa_get_slop_space=15.9PiB spa_slop_shift=4, spa_get_slop_space=7.9PiB spa_slop_shift=5, spa_get_slop_space=4PiB spa_slop_shift=6, spa_get_slop_space=2PiB ... spa_slop_shift=25, spa_get_slop_space=4GiB spa_slop_shift=26, spa_get_slop_space=2GiB spa_slop_shift=27, spa_get_slop_space=1016MiB spa_slop_shift=28, spa_get_slop_space=508MiB spa_slop_shift=29, spa_get_slop_space=254MiB spa_slop_shift=30, spa_get_slop_space=128MiB spa_slop_shift=31, spa_get_slop_space=128MiB spa_slop_shift=32, spa_get_slop_space=128MiB Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7876 Closes #7900
* zpool split can create a corrupted poolRoman Strashkin2018-09-121-3/+3
| | | | | | | | | | | Added vdev_resilver_needed() check to verify VDEVs are fully synced, so that after split the new pool will not be corrupted. Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Roman Strashkin <[email protected]> Closes #7865 Closes #7881
* Fix in-kernel sysfs entriesDon Brady2018-09-063-29/+36
| | | | | | | | | | The recent sysfs zfs properties feature breaks the in-kernel builds of zfs (sans module). When not built as a module add the sysfs entries under /sys/fs/zfs/. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7868 Closes #7872
* Fix zfs_sysfs_live test failureDon Brady2018-09-061-2/+4
| | | | | | | | The ZTS zfs_sysfs_live test fails occasionally due to an uninitialized string on an error path. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7869
* Pool allocation classesDon Brady2018-09-0513-134/+613
| | | | | | | | | | | | | | | | | | | | Allocation Classes add the ability to have allocation classes in a pool that are dedicated to serving specific block categories, such as DDT data, metadata, and small file blocks. A pool can opt-in to this feature by adding a 'special' or 'dedup' top-level VDEV. Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: HÃ¥kan Johansson <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: DHE <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Gregor Kopka <[email protected]> Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #5182
* Correctly handle errors from kern_pathChris Siebenmann2018-09-041-1/+1
| | | | | | | | | | | | | As a regular kernel function, kern_path() returns errors as negative errnos, such as -ELOOP. zfsctl_snapdir_vget() must convert these into the positive errnos used throughout the ZFS code when it returns them to other ZFS functions so that the ZFS code properly sees them as errors. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Siebenmann <[email protected]> Closes #7764 Closes #7864
* Added recalculation of ARC stats mid-evictionRich Ercolani2018-09-041-0/+7
| | | | | | | | | | | Re-adds a recalculation step for the ARC stats after the MRU eviction so that we don't pathologically attempt to evict the MFU. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Authored-by: Mark Johnston <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #7855
* Revert "Update zfs_admin_snapshot default value (disabled)"Brian Behlendorf2018-09-041-1/+1
| | | | | | | | | | This reverts commit a6214a0ae9e78d3cac0e495e2fcf7af0858a872f. Disabling zfs_admin_snapshot by default results in multiple ZTS tests failing which depend on this functionality. Revert this change until the relevant test cases can be updated. Signed-off-by: Brian Behlendorf <[email protected]> Issue #7838
* Update zfs_admin_snapshot default value (disabled)George Melikov2018-09-041-1/+1
| | | | | | | | | | It's disabled by default, update code to reflect the documentation. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Gregor Kopka <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #7835 Closes #7838
* OpenZFS 9751 - Allocation throttling misplacing ditto blocksmav2018-09-021-3/+10
| | | | | | | | | | | | | | | | | | | Relax allocation throttling for ditto blocks. Due to random imbalances in allocation it tends to push block copies to one vdev, that looks slightly better at the moment. Slightly less strict policy allows both improve data security and surprisingly write performance, since we don't need to touch extra metaslabs on each vdev to respect the min distance. Sponsored by: iXsystems, Inc. Authored by: mav <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9751 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/8253837ac3 Closes #7857
* OpenZFS 9738 - Fix third block copy allocations, broken at 9112.mav2018-09-021-7/+5
| | | | | | | | | | | | | | | Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks. Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors later on metaslab_activate_allocator() call, leading to massive load of unneeded metaslabs and write freezes. Authored by: mav <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9738 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/63e7138 Closes #7858
* Add basic zfs ioc input nvpair validationDon Brady2018-09-021-97/+364
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want newer versions of libzfs_core to run against an existing zfs kernel module (i.e. a deferred reboot or module reload after an update). Programmatically document, via a zfs_ioc_key_t, the valid arguments for the ioc commands that rely on nvpair input arguments (i.e. non legacy commands from libzfs_core). Automatically verify the expected pairs before dispatching a command. This initial phase focuses on the non-legacy ioctls. A follow-on change can address the legacy ioctl input from the zfs_cmd_t. The zfs_ioc_key_t for zfs_keys_channel_program looks like: static const zfs_ioc_key_t zfs_keys_channel_program[] = { {"program", DATA_TYPE_STRING, 0}, {"arg", DATA_TYPE_UNKNOWN, 0}, {"sync", DATA_TYPE_BOOLEAN_VALUE, ZK_OPTIONAL}, {"instrlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, {"memlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, }; Introduce four input errors to identify specific input failures (in addition to generic argument value errors like EINVAL, ERANGE, EBADF, and E2BIG). ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7780
* Add zfs module feature and property info to sysfsDon Brady2018-09-026-1/+707
| | | | | | | | | | | | | | | | | | | | | This extends our sysfs '/sys/module/zfs' entry to include feature and property attributes. The primary consumer of this information is user processes, like the zfs CLI, that need to know what the current loaded ZFS module supports. The libzfs binary will consult this information when instantiating the zfs and zpool property tables and the pool features table. This introduces 4 kernel objects (dirs) into '/sys/module/zfs' with corresponding attributes (files): features.runtime features.pool properties.dataset properties.pool Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7706
* Allow ECKSUM in vdev_checkpoint_sm_object()Brian Behlendorf2018-08-311-4/+8
| | | | | | | | | | | The checkpoint space map object may not be accessible from the vdev's ZAP when it has been damaged. This may be the case when performing an extreme rewind when importing the pool. Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7809 Closes #7853
* clean up __dbuf_hold_implMatthew Ahrens2018-08-311-43/+37
| | | | | | | | | | | | | We can simplify the dbuf_hold code by allocating dbuf_hold_arg_t's on demand, rather than allocating a big array of them up front. While this can occasionally increase the number of allocations, typically only one allocation is needed since the indirect block is already cached. The performance test suite gets the same results with this change. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #7841
* OpenZFS 9403 - assertion failed in arc_buf_destroy()Tom Caputi2018-08-294-28/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Assertion failed in arc_buf_destroy() when concurrently reading block with checksum error. Porting notes: * The ability to zinject decompression errors has been added, but this only works at the zio_decompress() level, where we have all of the info we need to match against the user's zinject options. * The decompress_fault test has been added to test the new zinject functionality * We attempted to set zio_decompress_fail_fraction to (1 << 18) in ztest for further test coverage. Although this did uncover a few low priority issues, this unfortuantely also causes ztest to ASSERT in many locations where the code is working correctly since it is designed to fail on IO errors. Developers can manually set this variable with the '-o' option to find and debug issues. Authored by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Matt Ahrens <[email protected]> Ported-by: Tom Caputi <[email protected]> OpenZFS-issue: https://illumos.org/issues/9403 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9 Closes #7822
* Always wait for txg sync when umounting datasetTom Caputi2018-08-273-28/+10
| | | | | | | | | | | | | | | | | | | | | | | | | Currently, when unmounting a filesystem, ZFS will only wait for a txg sync if the dataset is dirty and not readonly. However, this can be problematic in cases where a dataset is remounted readonly immediately before being unmounted, which often happens when the system is being shut down. Since encrypted datasets require that all I/O is completed before the dataset is disowned, this issue causes problems when write I/Os leak into the txgs after the dataset is disowned, which can happen when sync=disabled. While looking into fixes for this issue, it was discovered that dsl_dataset_is_dirty() does not return B_TRUE when the dataset has been removed from the txg dirty datasets list, but has not actually been processed yet. Furthermore, the implementation is comletely different from dmu_objset_is_dirty(), adding to the confusion. Rather than relying on this function, this patch forces the umount code path (and the remount readonly code path) to always perform a txg sync on read-write datasets and removes the function altogether. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7753 Closes #7795
* Small rework of txg_list codeTom Caputi2018-08-272-21/+45
| | | | | | | | | | This patch simply adds some missing locking to the txg_list functions and refactors txg_verify() so that it is only compiled in for debug builds. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7795
* Direct IO supportBrian Behlendorf2018-08-271-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Direct IO via the O_DIRECT flag was originally introduced in XFS by IRIX for database workloads. Its purpose was to allow the database to bypass the page and buffer caches to prevent unnecessary IO operations (e.g. readahead) while preventing contention for system memory between the database and kernel caches. On Illumos, there is a library function called directio(3C) that allows user space to provide a hint to the file system that Direct IO is useful, but the file system is free to ignore it. The semantics are also entirely a file system decision. Those that do not implement it return ENOTTY. Since the semantics were never defined in any standard, O_DIRECT is implemented such that it conforms to the behavior described in the Linux open(2) man page as follows. 1. Minimize cache effects of the I/O. By design the ARC is already scan-resistant which helps mitigate the need for special O_DIRECT handling. Data which is only accessed once will be the first to be evicted from the cache. This behavior is in consistent with Illumos and FreeBSD. Future performance work may wish to investigate the benefits of immediately evicting data from the cache which has been read or written with the O_DIRECT flag. Functionally this behavior is very similar to applying the 'primarycache=metadata' property per open file. 2. O_DIRECT _MAY_ impose restrictions on IO alignment and length. No additional alignment or length restrictions are imposed. 3. O_DIRECT _MAY_ perform unbuffered IO operations directly between user memory and block device. No unbuffered IO operations are currently supported. In order to support features such as transparent compression, encryption, and checksumming a copy must be made to transform the data. 4. O_DIRECT _MAY_ imply O_DSYNC (XFS). O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide O_DSYNC to request synchronous semantics. 5. O_DIRECT _MAY_ disable file locking that serializes IO operations. Applications should avoid mixing O_DIRECT and normal IO or mmap(2) IO to the same file. This is particularly true for overlapping regions. All I/O in ZFS is locked for correctness and this locking is not disabled by O_DIRECT. However, concurrently mixing O_DIRECT, mmap(2), and normal I/O on the same file is not recommended. This change is implemented by layering the aops->direct_IO operations on the existing AIO operations. Code already existed in ZFS on Linux for bypassing the page cache when O_DIRECT is specified. References: * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html * https://blogs.oracle.com/roch/entry/zfs_and_directio * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics * https://illumos.org/man/3c/directio Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #224 Closes #7823
* Fedora 28: Fix misc bounds check compiler warningsJoao Carlos Mendes Luis2018-08-261-1/+1
| | | | | | | | | | | Fix a bunch of truncation compiler warnings that show up on Fedora 28 (GCC 8.0.1). Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Issue #7368 Closes #7826 Closes #7830
* Stack overflow when destroying deeply nested clonesLOLi2018-08-221-30/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Destroy operations on deeply nested chains of clones can overflow the stack: Depth Size Location (221 entries) ----- ---- -------- 0) 15664 48 mutex_lock+0x5/0x30 1) 15616 8 mutex_lock+0x5/0x30 ... 26) 13576 72 dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs] 27) 13504 72 dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs] 28) 13432 72 dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs] ... 185) 2128 72 dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs] 186) 2056 72 dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs] 187) 1984 72 dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs] 188) 1912 136 dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs] 189) 1776 16 dsl_destroy_snapshot_check+0x0/0x90 [zfs] ... 218) 304 128 kthread+0xdf/0x100 219) 176 48 ret_from_fork+0x22/0x40 220) 128 128 kthread+0x0/0x100 Fix this issue by converting dsl_dataset_remove_clones_key() from recursive to iterative. Reviewed-by: Paul Zuchowski <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7279 Closes #7810
* s/VERIFY/VERIFY3S in vdev_checkpoint_sm_objectTim Chase2018-08-211-1/+2
| | | | | | | | | | Using VERIFY3S allows to view the unexpected error value in the system log. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Tim Chase <[email protected]> Issue #7809 Closes #7818
* Fix issues with raw receive_write_byref()Tom Caputi2018-08-204-41/+48
| | | | | | | | | | | | | | | | | | | | | | | This patch fixes 2 issues with raw, deduplicated send streams. The first is that datasets who had been completely received earlier in the stream were not still marked as raw receives. This caused problems when newly received datasets attempted to fetch raw data from these datasets without this flag set. The second problem was that the arc freeze checksum code was not consistent about which locks needed to be held while performing its asserts. The proper locking needed to run these asserts is actually fairly nuanced, since the asserts touch the linked list of buffers (requiring the header lock), the arc_state (requiring the b_evict_lock), and the b_freeze_cksum (requiring the b_freeze_lock). This seems like a large performance sacrifice and a lot of unneeded complexity to verify that this relatively small debug feature is working as intended, so this patch simply removes these asserts instead. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7701
* Introduce read/write kstats per datasetSerapheim Dimitropoulos2018-08-209-76/+282
| | | | | | | | | | | | | The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number. The motivation for this change is to provide some preliminary analytics on dataset usage/performance. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #7705
* Fix traverse_impl() kmem leakBrian Behlendorf2018-08-151-2/+2
| | | | | | | | | | | | The error path must free the memory allocated by this function or it will be leaked. In practice, this would leak only a few bytes of memory under rare circumstances and thus is unlikely to have caused any real problems. This issue was caught by the kmemleak. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7791
* Check encrypted dataset + embedded recv earlierTom Caputi2018-08-154-13/+56
| | | | | | | | | | | | | | This patch fixes a bug where attempting to receive a send stream with embedded data into an encrypted dataset would not cleanup that dataset when the error was reached. The check was moved into dmu_recv_begin_check(), preventing this issue. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7650
* Added encryption support for zfs recv -o / -xTom Caputi2018-08-153-48/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | One small integration that was absent from b52563 was support for zfs recv -o / -x with regards to encryption parameters. The main use cases of this are as follows: * Receiving an unencrypted stream as encrypted without needing to create a "dummy" encrypted parent so that encryption can be inheritted. * Allowing users to change their keylocation on receive, so long as the receiving dataset is an encryption root. * Allowing users to explicitly exclude or override the encryption property from an unencrypted properties stream, allowing it to be received as encrypted. * Receiving a recursive heirarchy of unencrypted datasets, encrypting the top-level one and forcing all children to inherit the encryption. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7650
* Fix comment on calculating blkidTomohiro Kusumi2018-08-131-1/+1
| | | | | | | | | Fix comment on calculating blkid at level n within dnode's blkptrs. "(2^(level*(indblkshift - SPA_BLKPTRSHIFT)" is part of divisor in this division. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7768
* Allow inherited properties in zfs_check_settable()LOLi2018-08-031-13/+13
| | | | | | | | | | | | | This change modifies how 'checksum' and 'dedup' properties are verified in zfs_check_settable() handling the case where they are explicitly inherited in the dataset hierarchy when receiving a recursive send stream. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7755 Closes #7576 Closes #7757
* zfs_ioc_unload_key can drop extra spa refDon Brady2018-08-032-5/+14
| | | | | | | Reviewed by: Thomas Caputi <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7759
* Reduce taskq and context-switch cost of zio pipeMatthew Ahrens2018-08-021-124/+144
| | | | | | | | | | | | | | | | | | | | When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the logical zio_read(), and then a physical zio. Currently, each of these results in a separate taskq_dispatch(zio_execute). On high-read-iops workloads, this causes a significant performance impact. By processing all 3 ZIO's in a single taskq entry, we reduce the overhead on taskq locking and context switching. We accomplish this by allowing zio_done() to return a "next zio to execute" to zio_execute(). This results in a ~12% performance increase for random reads, from 96,000 iops to 108,000 iops (with recordsize=8k, on SSD's). Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-59292 Closes #7736
* Add missing checks to zpl_xattr_* functionsJohn Gallagher2018-08-021-6/+9
| | | | | | | | | | | | | Linux specific zpl_* entry points, such as xattrs, must include the same unmounted and sa handle checks as the common zfs_ entry points. The additional ZPL_* wrappers are identical to their ZFS_ counterparts except the errno is negated since they are expected to be used at the zpl_ layer. Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #5866 Closes #7761
* Add support for selecting encryption backendNathan Lewis2018-08-0213-1582/+2197
| | | | | | | | | | | | | | | | | | - Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl) that control the crypto implementation. At the moment there is a choice between generic and aesni (on platforms that support it). - This enables support for AES-NI and PCLMULQDQ-NI on AMD Family 15h (bulldozer) and newer CPUs (zen). - Modify aes_key_t to track what implementation it was generated with as key schedules generated with various implementations are not necessarily interchangable. Reviewed by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Nathaniel R. Lewis <[email protected]> Closes #7102 Closes #7103
* Fix OpenZFS 9337 mismergeGeorge Wilson2018-08-023-13/+14
| | | | | | | | | | | | | | | | This change reintroduces logic required by OpenZFS 9577. When OpenZFS 9337, zfs get all is slow due to uncached metadata, was merged in it ended up removing logic required by OpenZFS 9577, remove zfs_dbuf_evict_key, and inadvertently reintroduced the bug that 9577 was designed to fix. This change re-enables the "evicting" flag to dbuf_rele_and_unlock and dnode_rele_and_unlock and updates all callers to provide the correct parameter. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #7758
* Fix deadlock between zfs umount & snapentry_expireRohan Puri2018-08-011-6/+5
| | | | | | | | | | | | | | | | | | zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a writer and calls zfsctl_snapshot_unmount_cancel(), which waits for snapentry_expire() if present (when snap is automounted). This snapentry_expire() itself then waits for zfs_snapshot_lock as a reader, resulting in a deadlock. The fix is to only hold the zfs_snapshot_lock over the tree lookup and removal. After a successful lookup the lock can be dropped and zfs_snapentry_t will remain valid until the reference taken by the lookup is released. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rohan Puri <[email protected]> Closes #7751 Closes #7752
* OpenZFS 9112 - Improve allocation performance on high-end systemsPaul Dagnelie2018-07-318-175/+505
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Overview ======== We parallelize the allocation process by creating the concept of "allocators". There are a certain number of allocators per metaslab group, defined by the value of a tunable at pool open time. Each allocator for a given metaslab group has up to 2 active metaslabs; one "primary", and one "secondary". The primary and secondary weight mean the same thing they did in in the pre-allocator world; primary metaslabs are used for most allocations, secondary metaslabs are used for ditto blocks being allocated in the same metaslab group. There is also the CLAIM weight, which has been separated out from the other weights, but that is less important to understanding the patch. The active metaslabs for each allocator are moved from their normal place in the metaslab tree for the group to the back of the tree. This way, they will not be selected for use by other allocators searching for new metaslabs unless all the passive metaslabs are unsuitable for allocations. If that does happen, the allocators will "steal" from each other to ensure that IOs don't fail until there is truly no space left to perform allocations. In addition, the alloc queue for each metaslab group has been broken into a separate queue for each allocator. We don't want to dramatically increase the number of inflight IOs on low-end systems, because it can significantly increase txg times. On the other hand, we want to ensure that there are enough IOs for each allocator to allow for good coalescing before sending the IOs to the disk. As a result, we take a compromise path; each allocator's alloc queue max depth starts at a certain value for every txg. Every time an IO completes, we increase the max depth. This should hopefully provide a good balance between the two failure modes, while not dramatically increasing complexity. We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause very similar contention when selecting IOs to allocate. This parallelization uses the same allocator scheme as metaslab selection. Performance Results =================== Performance improvements from this change can vary significantly based on the number of CPUs in the system, whether or not the system has a NUMA architecture, the speed of the drives, the values for the various tunables, and the workload being performed. For an fio async sequential write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB SSDs, there is a roughly 25% performance improvement. Future Work =========== Analysis of the performance of the system with this patch applied shows that a significant new bottleneck is the vdev disk queues, which also need to be parallelized. Prototyping of this change has occurred, and there was a performance improvement, but more work needs to be done before its stability has been verified and it is ready to be upstreamed. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Alexander Motin <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Paul Dagnelie <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Porting Notes: * Fix reservation test failures by increasing tolerance. OpenZFS-issue: https://illumos.org/issues/9112 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3 Closes #7682
* OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the systemDon Brady2018-07-303-16/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | In the case of one pool being built on another pool, we want to make sure we don't end up throttling the lower (backing) pool when the upper pool is the majority contributor to dirty data. To insure we make forward progress during throttling, we also check the current pool's net dirty data and only throttle if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty data in the cache. Authored by: Don Brady <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: * The new global variables zfs_arc_dirty_limit_percent, zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent were intentially not added as tunable module parameters. OpenZFS-issue: https://illumos.org/issues/9465 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef Closes #7749
* OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operationsSerapheim Dimitropoulos2018-07-302-48/+319
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | = Motivation While dealing with another performance issue (see 126118f) we noticed that we spend a lot of time in various places in the kernel when constructing long nvlists. The problem is that when an nvlist is created with the NV_UNIQUE_NAME set (which is the case most of the time), we do a linear search through the whole list to ensure uniqueness for every entry we add. An example of the above scenario can be seen in the following flamegraph, where more than have the time of the zfsdev_ioctl() is spent on constructing nvlists. Flamegraph: https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg Adding a table to speed up lookups will help situations where we just construct an nvlist (like the scenario above), in addition to regular lookups and removals. = What this patch does In this diff we've implemented a hash-table on top of the nvlist code that converts most nvlist operations from O(# number of entries) to O(1)* (the start is for amortized time as the hash-table grows and shrinks depending on the # of entries - plain lookup is strictly O(1)). = Performance Analysis To analyze the performance improvement I just used the setup from the snapshot deletion issue mentioned above in the Motivation section. Basically I created 10K filesystems with one snapshot each and then I just used the API of libZFS_Core to pass down an nvlist of all the snapshots to have them deleted. The reason I used my own driver program was to have clean performance results of what actually happens in the kernel. The flamegraphs and wall clock times mentioned below were gathered from the start to the end of the driver program's run. Between trials the testpool used was completely destroyed, the system was rebooted and the testpool was completely recreated. The reason for this dance was to get consistent results. == Results (before patch): === Sampling Flamegraphs [Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg [Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg [Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg === Wall clock times (in seconds) ``` [Trial 4] real 5.3 user 0.4 sys 2.3 [Trial 5] real 8.2 user 0.4 sys 2.4 [Trial 6] real 6.0 user 0.5 sys 2.3 ``` == Results (after patch): === Sampling Flamegraphs [Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg [Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg [Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg === Wall clock times (in seconds) ``` [Trial 4] real 4.9 user 0.0 sys 0.9 [Trial 5] real 3.8 user 0.0 sys 0.9 [Trial 6] real 3.6 user 0.0 sys 0.9 ``` == Analysis The results between the trials are consistent so in this sections I will only talk about the flamegraph results from trial-1 and the wall-clock results from trial-4. From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996 samples counts. Specifically, the samples from fnvlist_add_nvlist() and spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples), leaving zfs_ioc_destroy_snaps() to dominate most samples from zfs_dev_ioctl(). From trial-4 we see that the user time dropped to 0 secods. I believe the consistent 0.4 seconds before my patch was applied was due to my driver program constructing the long nvlist of snapshots so it can pass it to the kernel. As for the system time, the effect there is more clear (2.3 down to 0.9 seconds). Porting Notes: * DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and zpool_do_events_nvprint(). Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9580 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1 Closes #7748