path: root/module/os
Commit message | Author | Age | Files | Lines
* linux: zvol: avoid heap allocation for zvol_request_sync=1 | Christian Schwarz | 2021-03-03 | 1 | -29/+64
| | | | | | | | | | | | The spl_kmem_alloc showed up in some flamegraphs in a single-threaded 4k sync write workload at 85k IOPS on an Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz. Certainly not a huge win but I believe the change is clean and easy to maintain down the road. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #11666
* Fix assert in FreeBSD-specific dmu_read_pages | Andriy Gapon | 2021-02-27 | 1 | -1/+1
| | | | | | | | | | | | | The function has three similar pieces of code: for read-behind pages, requested pages and read-ahead pages. All three pieces had an assert to ensure that the page is not mapped. Later the assert was relaxed to require that the page is not mapped for writing. But that was done in two places out of three. This change fixes the third piece, read-ahead. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #11654
* Linux 5.12 compat: bio->bi_disk member moved | Coleman Kane | 2021-02-24 | 2 | -0/+8
| | | | | | | | | | The struct bio member bi_disk was moved underneath a new member named bi_bdev. So all attempts to reference bio->bi_disk need to now become bio->bi_bdev->bd_disk. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11639
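A compat accessor for this kind of member move could be sketched as below. This is a user-space mock, not the code from the commit: the structs are cut-down stand-ins for the kernel's, and `HAVE_BIO_BDEV_DISK` and `bio_disk_compat()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures (mock, not the real layout). */
struct gendisk { int first_minor; };
struct block_device { struct gendisk *bd_disk; };

/* Hypothetical configure result: defined when struct bio has bi_bdev (5.12+). */
#define HAVE_BIO_BDEV_DISK 1

struct bio {
#ifdef HAVE_BIO_BDEV_DISK
	struct block_device *bi_bdev;	/* 5.12+: disk is reached via bi_bdev */
#else
	struct gendisk *bi_disk;	/* older kernels: direct member */
#endif
};

/* One accessor, so callers never touch the moved member directly. */
static inline struct gendisk *
bio_disk_compat(const struct bio *bio)
{
	assert(bio != NULL);
#ifdef HAVE_BIO_BDEV_DISK
	return (bio->bi_bdev->bd_disk);
#else
	return (bio->bi_disk);
#endif
}
```

Keeping the version check inside a single helper means that only one place changes if the member moves again.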
* Linux: increase max nvlist_src size | Brian Behlendorf | 2021-02-24 | 1 | -1/+1
| | | | | | | | | | | | On Linux increase the maximum allowed size of the src nvlist which can be passed to the /dev/zfs ioctl. Originally, this was set to a maximum of KMALLOC_MAX_SIZE (4M) because it was kmalloc'd. Since that time it's been converted to a vmalloc so that's no longer a hard limit, and it's desirable for `zfs send/recv` to allow larger nvlists so more snapshots can be sent at once. Signed-off-by: Brian Behlendorf <[email protected]> Closes #6572 Closes #11638
* Cleaning up uio headers | Brian Atkinson | 2021-02-20 | 1 | -1/+8
| | | | | | | | | Making uio_impl.h the common header interface between Linux and FreeBSD so both OS's can share a common header file. This also helps reduce code duplication for zfs_uio_t for each OS. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes #11622
* Restore FreeBSD resource usage accounting | Ryan Moeller | 2021-02-19 | 3 | -0/+92
| | | | | | | Add zfs_racct_* interfaces for platform-dependent read/write accounting. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11613
* FreeBSD: disable the use of hardware crypto offload drivers for now | Mark Johnston | 2021-02-18 | 1 | -2/+13
| | | | | | | | | | | | | | | | | | | | | First, the crypto request completion handler contains a bug in that it fails to reset fs_done correctly after the request is completed. This is only a problem for asynchronous drivers. Second, some hardware drivers have input constraints which ZFS does not satisfy. For instance, ccp(4) apparently requires the AAD length for AES-GCM to be a multiple of the cipher block size, and with qat(4) the AES-GCM AAD length may not be longer than 240 bytes. FreeBSD's generic crypto framework doesn't have a mechanism to automatically fall back to a software implementation if a hardware driver cannot process a request, and ZFS does not tolerate such errors. The plan is to implement such a fallback mechanism, but with FreeBSD 13.0 approaching we should simply disable the use of hardware drivers for now. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Mark Johnston <[email protected]> Closes #11612
* Remove unused abd_alloc_scatter_offset_chunkcnt | Ryan Libby | 2021-02-17 | 1 | -19/+0
| | | | | | | | | | Remove a function that became unused after the refactoring in e2af2acce3436acdb2b35fdc7c9de1a30ea85514. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Libby <[email protected]> Closes #11614
* Rename zfs_inode_update to zfs_znode_update_vfs | khng300 | 2021-02-09 | 4 | -32/+28
| | | | | | | | | | | zfs_znode_update_vfs is a more platform-agnostic name than zfs_inode_update. Besides that, the function's prototype is moved to include/sys/zfs_znode.h as the function is also used in common code. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ka Ho Ng <[email protected]> Sponsored by: The FreeBSD Foundation Closes #11580
* Add an assert to clarify code | Kleber Tarcísio | 2021-02-09 | 2 | -2/+6
| | | | | | | | | | The first time through the loop prevdb and prevhdl are NULL. They are then both set, but only prevdb is checked. Add an ASSERT to make it clear that prevhdl must be set when prevdb is. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Kleber <[email protected]> Closes #10754 Closes #11575
* fix abd_nr_pages_off for gang abd | Matthew Ahrens | 2021-01-28 | 2 | -33/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | `__vdev_disk_physio()` uses `abd_nr_pages_off()` to allocate a bio with a sufficient number of iovec's to process this zio (i.e. `nr_iovecs`/`bi_max_vecs`). If there are not enough iovec's in the bio, then additional bio's will be allocated. However, this is a sub-optimal code path. In particular, it requires several abd calls (to `abd_nr_pages_off()` and `abd_bio_map_off()`) which will have to walk the constituents of the ABD (the pages or the gang children) because they are looking for offsets > 0. For gang ABD's, `abd_nr_pages_off()` returns the number of iovec's needed for the first constituent, rather than the sum of all constituents (within the requested range). This always under-estimates the required number of iovec's, which causes us to always need several bio's. The end result is that `__vdev_disk_physio()` is usually O(n^2) for gang ABD's (and occasionally O(n^3), when more than 16 bio's are needed). This commit fixes `abd_nr_pages_off()`'s handling of gang ABD's, to correctly determine how many iovec's are needed, by adding up the number of iovec's for each of the gang children in the requested range. Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11536
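The difference between counting only the first constituent and summing over all constituents in the requested range can be sketched in user space as follows. This is an illustration under stated assumptions, not the ZFS implementation: a gang is modeled as an array of child sizes, every child is assumed to start on a page boundary, and `SKETCH_PAGE_SIZE`, `pages_in_range()` and `gang_nr_pages()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Pages touched by [off, off + len) within one page-aligned buffer. */
static size_t
pages_in_range(size_t off, size_t len)
{
	if (len == 0)
		return (0);
	return ((off + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE -
	    off / SKETCH_PAGE_SIZE);
}

/*
 * Sum the pages needed for [off, off + len) across all gang children,
 * rather than reporting the count for the first child only.
 */
static size_t
gang_nr_pages(const size_t *child_size, size_t nchildren, size_t off,
    size_t len)
{
	size_t total = 0;

	assert(child_size != NULL || nchildren == 0);
	for (size_t i = 0; i < nchildren && len > 0; i++) {
		if (off >= child_size[i]) {
			/* Range starts after this child: skip it. */
			off -= child_size[i];
			continue;
		}
		size_t take = child_size[i] - off;
		if (take > len)
			take = len;
		total += pages_in_range(off, take);
		len -= take;
		off = 0;	/* later children are read from their start */
	}
	return (total);
}
```

With children of 8 KiB, 4 KiB and 12 KiB, the first child alone accounts for two pages while the whole range needs six, which is the kind of under-estimate the commit message describes.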
* Fix zrele race in zrele_async that can cause hang | Paul Dagnelie | 2021-01-27 | 1 | -12/+22
| | | | | | | | | | | | | | | | There is a race condition in zfs_zrele_async when we are checking if we would be the one to evict an inode. This can lead to a txg sync deadlock. Instead of calling into iput directly, we attempt to perform the atomic decrement ourselves, unless that would set the i_count value to zero. In that case, we dispatch a call to iput to run later, to prevent a deadlock from occurring. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #11527 Closes #11530
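The "decrement unless that would reach zero" step can be sketched with C11 atomics. This is a user-space analogue under stated assumptions, not the kernel code: the kernel has its own atomic primitives, and `dec_unless_last()` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Drop one reference unless the caller holds the last one.
 * Returns true if the count was decremented, false if the count was 1
 * (or lower) and the final release must be handed off instead.
 */
static bool
dec_unless_last(atomic_int *count)
{
	assert(count != NULL);
	int old = atomic_load(count);

	while (old > 1) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
	}
	return (false);
}
```

A caller that gets `false` back would queue the final release to a worker thread, so the potentially blocking eviction never runs in the calling context.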
* cppcheck: integrate cppcheck | Brian Behlendorf | 2021-01-26 | 1 | -2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order for cppcheck to perform a proper analysis it needs to be aware of how the sources are compiled (source files, include paths/files, extra defines, etc). All the needed information is available from the Makefiles and can be leveraged with a generic cppcheck Makefile target. So let's add one. Additional minor changes: * Removing the cppcheck-suppressions.txt file. With cppcheck 2.3 and these changes it appears to no longer be needed. Some inline suppressions were also removed since they appear not to be needed. We can add them back if it turns out they're needed for older versions of cppcheck. * Added the ax_count_cpus m4 macro to detect at configure time how many processors are available in order to run multiple cppcheck jobs. This value is also now used as a replacement for nproc when executing the kernel interface checks. * "PHONY =" line moved into the Rules.am file which is included at the top of all Makefile.am's. This is just convenient because it allows us to use the += syntax to add phony targets. * One upside of this integration worth mentioning is it now allows `make cppcheck` to be run in any directory to check that subtree. * For the moment, cppcheck is not run against the FreeBSD specific kernel sources. The cppcheck-FreeBSD target will need to be implemented and tested on FreeBSD to support this. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11508
* cppcheck: return value always 0 | Brian Behlendorf | 2021-01-26 | 1 | -1/+1
| | | | | | | | | Identical condition and return expression 'rc', return value is always 0. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11508
* cppcheck: remove redundant ASSERTs | Brian Behlendorf | 2021-01-26 | 1 | -2/+0
| | | | | | | | | The ASSERT that the passed pointer isn't NULL appears after the pointer has already been dereferenced. Remove the redundant check. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11508
* spl-taskq: Make sure thread tsd hash entry is cleared | Matthew Macy | 2021-01-25 | 1 | -0/+1
| | | | | | | | | Like any other thread created by thread_create() we need to call thread_exit() to properly clean it up. In particular, this ensures the tsd hash for the thread is cleared. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #11512
* FreeBSD: upstream changes to VFS interface | Ryan Moeller | 2021-01-23 | 1 | -2/+4
| | | | | | | | | Set VIRF_MOUNTPOINT flag on snapshot mountpoint. Authored-by: Mateusz Guzik <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11458
* Linux 5.10 compat: restore custom uio_prefaultpages() | Brian Behlendorf | 2021-01-21 | 1 | -11/+44
| | | | | | | | | | | | | | | | As part of commit 1c2358c1 the custom uio_prefaultpages() code was removed in favor of using the generic kernel provided iov_iter_fault_in_readable() interface. Unfortunately, it turns out that up until the Linux 4.7 kernel the function would only ever fault in the first iovec of the iov_iter. As a result, uiomove_iov() may hang waiting for the page. This commit effectively restores the custom uio_prefaultpages() code for Linux 4.9 and earlier kernels which contain the troublesome version of iov_iter_fault_in_readable(). Signed-off-by: Brian Behlendorf <[email protected]> Closes #11463 Closes #11484
* Extending FreeBSD UIO Struct | Brian Atkinson | 2021-01-20 | 14 | -194/+235
| | | | | | | | | | | | | | In FreeBSD the struct uio was just a typedef to uio_t. In order to extend this struct, outside of the definition for the struct uio, the struct uio has been embedded inside of a uio_t struct. Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear this is a ZFS interface. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes #11438
* allow callers to allocate and provide the abd_t struct | Matthew Ahrens | 2021-01-20 | 2 | -33/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | The `abd_get_offset_*()` routines create an abd_t that references another abd_t, and doesn't allocate any pages/buffers of its own. In some workloads, these routines may be called frequently, to create many abd_t's representing small pieces of a single large abd_t. In particular, the upcoming RAIDZ Expansion project makes heavy use of these routines. This commit adds the ability for the caller to allocate and provide the abd_t struct to a variant of `abd_get_offset_*()`. This eliminates the cost of allocating the abd_t and performing the accounting associated with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses this for the `rc_abd`, which references the zio's abd. The upcoming RAIDZ Expansion project will leverage this infrastructure to increase performance of reads post-expansion by around 50%. Additionally, some of the interfaces around creating and destroying abd_t's are cleaned up. Most significantly, the distinction between `abd_put()` and `abd_free()` is eliminated; all types of abd_t's are now disposed of with `abd_free()`. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Issue #8853 Closes #11439
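The caller-provided-struct pattern described above can be sketched generically. This is not the abd API: `view_t`, `view_get_offset()` and `view_free()` are hypothetical names for a small descriptor that references part of another buffer and owns no data of its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* A descriptor that references a slice of someone else's buffer. */
typedef struct view {
	const unsigned char *v_data;
	size_t v_len;
	bool v_heap;	/* true if the descriptor itself was malloc'd */
} view_t;

/*
 * Describe base[off, off + len). If 'storage' is non-NULL the caller's
 * struct is used and no allocation happens; otherwise one is allocated.
 */
static view_t *
view_get_offset(view_t *storage, const unsigned char *base, size_t base_len,
    size_t off, size_t len)
{
	view_t *v = storage;

	assert(off <= base_len && len <= base_len - off);
	if (v == NULL) {
		v = malloc(sizeof (*v));
		if (v == NULL)
			return (NULL);
		v->v_heap = true;
	} else {
		v->v_heap = false;
	}
	v->v_data = base + off;
	v->v_len = len;
	return (v);
}

/* One release function for both kinds of descriptor. */
static void
view_free(view_t *v)
{
	if (v != NULL && v->v_heap)
		free(v);
}
```

Recording how the descriptor was obtained is what lets a single free routine serve both cases, in the same spirit as folding `abd_put()` into `abd_free()`.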
* VZ 7 kernel compat: introduce ITER-enabled .direct_IO() via IOVECs | Konstantin Khorenko | 2020-12-30 | 1 | -1/+14
| | | | | | | | | | | | | | | | | Virtuozzo 7 kernels starting 3.10.0-1127.18.2.vz7.163.46 have the following configuration: * no HAVE_VFS_RW_ITERATE * HAVE_VFS_DIRECT_IO_ITER_RW_OFFSET => let's add implementation of zpl_direct_IO() via zpl_aio_{read,write}() in this case. https://bugs.openvz.org/browse/OVZ-7243 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Konstantin Khorenko <[email protected]> Closes #11410 Closes #11411
* Linux 5.11 compat: blk_{un}register_region() | Brian Behlendorf | 2020-12-27 | 1 | -44/+0
| | | | | | | | | | | | As of 5.11 the blk_register_region() and blk_unregister_region() functions have been retired. This isn't a problem since add_disk() has implicitly allocated minor numbers for a very long time. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: revalidate_disk_size() | Brian Behlendorf | 2020-12-27 | 1 | -2/+4
| | | | | | | | | | | | | | | Both revalidate_disk_size() and revalidate_disk() have been removed. Functionally this isn't a problem because we only relied on these functions to call zvol_revalidate_disk() for us and to perform any additional handling which might be needed for that kernel version. When neither are available we know there's no additional handling needed and we can directly call zvol_revalidate_disk(). Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: bdev_whole() | Brian Behlendorf | 2020-12-27 | 1 | -4/+12
| | | | | | | | | | | | | The bd_contains member was removed from the block_device structure. Callers needing to determine if a vdev is a whole block device should use the new bdev_whole() wrapper. For older kernels we provide our own bdev_whole() wrapper which relies on bd_contains for compatibility. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: bio_start_io_acct() / bio_end_io_acct() | Brian Behlendorf | 2020-12-27 | 1 | -16/+33
| | | | | | | | | | | | | | | | | | | The generic IO accounting functions have been removed in favor of the bio_start_io_acct() and bio_end_io_acct() functions which provide a better interface. These new functions were introduced in the 5.8 kernels but it wasn't until the 5.11 kernel that the previous generic IO accounting interfaces were removed. This commit updates the blk_generic_*_io_acct() wrappers to provide an interface similar to the updated kernel interface. It's slightly different because for older kernels we need to pass the request queue as well as the bio. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Linux 5.11 compat: lookup_bdev() | Brian Behlendorf | 2020-12-27 | 1 | -9/+4
| | | | | | | | | | | | | | | The lookup_bdev() function has been updated to require a dev_t be passed as the second argument. This is actually pretty nice since the major number stored in the dev_t was the only part we were interested in. This allows us to avoid handling the bdev entirely. The vdev_lookup_bdev() wrapper was updated to emulate the behavior of the new lookup_bdev() for all supported kernels. Reviewed-by: Rafael Kitover <[email protected]> Reviewed-by: Coleman Kane <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11387 Closes #11390
* Fix maybe uninitialized variable warning | Brian Behlendorf | 2020-12-20 | 1 | -1/+1
| | | | | | | | | | | | | Commit 1c2358c12 restructured this code and introduced a warning about the variable maybe not being initialized. This cannot happen with the updated code but we should initialize the variable anyway to silence the warning. zpl_file.c: In function ‘zpl_iter_write’: zpl_file.c:324:9: warning: ‘count’ may be used uninitialized in this function [-Wmaybe-uninitialized] Signed-off-by: Brian Behlendorf <[email protected]> Closes #11373
* Remove iov_iter_advance() from iter_read | Brian Behlendorf | 2020-12-20 | 1 | -3/+0
| | | | | | | | | | | | | There's no need to call iov_iter_advance() in zpl_iter_read(). This was preserved from the previous code where it wasn't needed but also didn't cause any problems. Now that the iter functions also handle pipes that's no longer the case. When fully reading a pipe buffer, iov_iter_advance() may result in the pipe buf release function being called, which will not be registered, resulting in a NULL dereference. Signed-off-by: Brian Behlendorf <[email protected]> Closes #11375 Closes #11378
* Linux 5.10 compat: also zvol_revalidate_disk() | Michael D Labriola | 2020-12-18 | 1 | -2/+3
| | | | | | | | | | | | Commit 59b68723 added a configure check for 5.10, which removed revalidate_disk(), and conditionally replaced its usage with a call to the new revalidate_disk_size() function. However, the old function also invoked the device's registered callback, in our case zvol_revalidate_disk(). This commit adds a call to zvol_revalidate_disk() in zvol_update_volsize() to make sure the code path stays the same. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Michael D Labriola <[email protected]> Closes #11358
* Linux 5.10 compat: use iov_iter in uio structure | Brian Behlendorf | 2020-12-18 | 7 | -222/+503
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of the 5.10 kernel the generic splice compatibility code has been removed. All filesystems are now responsible for registering a ->splice_read and ->splice_write callback to support this operation. The good news is the VFS provided generic_file_splice_read() and iter_file_splice_write() callbacks can be used provided the ->iter_read and ->iter_write callback support pipes. However, this is currently not the case and only iovecs and bvecs (not pipes) are ever attached to the uio structure. This commit changes that by allowing full iov_iter structures to be attached to uios. Ever since the 4.9 kernel the iov_iter structure has supported iovecs, kvecs, bvecs, and pipes so it's desirable to pass the entire thing when possible. In conjunction with this the uio helper functions (i.e. uiomove(), uiocopy(), etc) have been updated to understand the new UIO_ITER type. Note that using the kernel provided uio_iter interfaces allowed the existing Linux specific uio handling code to be simplified. When there's no longer a need to support kernels older than 4.9, then it will be possible to remove the iovec and bvec members from the uio structure and always use a uio_iter. Until then we need to maintain all of the existing types for older kernels. Some additional refactoring and cleanup was included in this change: - Added checks to configure to detect available iov_iter interfaces. Some are available all the way back to the 3.10 kernel and are used when available. In particular, uio_prefaultpages() now always uses iov_iter_fault_in_readable() which is available for all supported kernels. - The unused UIO_USERISPACE type has been removed. It is no longer needed now that the uio_seg enum is platform specific. - Moved zfs_uio.c from the zcommon.ko module to the Linux specific platform code for the zfs.ko module. 
This gets it out of libzfs where it was never needed and keeps this Linux specific code out of the common sources. - Removed unnecessary O_APPEND handling from zfs_iter_write(), this is redundant and O_APPEND is already handled in zfs_write(); Reviewed-by: Colin Ian King <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11351
* FreeBSD: Fix format of vfs.zfs.arc_no_grow_shift | Ryan Moeller | 2020-12-10 | 1 | -6/+5
| | | | | | | | | | | | | vfs.zfs.arc_no_grow_shift has an invalid type (15) and this causes py-sysctl to format it as a bytearray when it should be an integer. "U" is not a valid format, it should be "I" and the type should match the variable type, int. We can return EINVAL if the value is set below zero. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11318
* Fix possibly uninitialized 'root_inode' variable warning | Brian Behlendorf | 2020-12-10 | 1 | -1/+1
| | | | | | | | | | | | Resolve an uninitialized variable warning when compiling. In function ‘zfs_domount’: warning: ‘root_inode’ may be used uninitialized in this function [-Wmaybe-uninitialized] sb->s_root = d_make_root(root_inode); Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11306
* Implement memory and CPU hotplug | Paul Dagnelie | 2020-12-10 | 3 | -16/+214
| | | | | | | | | | | | | | ZFS currently doesn't react to hotplugging cpu or memory into the system in any way. This patch changes that by adding logic to the ARC that allows the system to take advantage of new memory that is added for caching purposes. It also adds logic to the taskq infrastructure to support dynamically expanding the number of threads allocated to a taskq. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Matthew Ahrens <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #11212
* Bring consistency to ABD chunk count types. | Alexander Motin | 2020-12-06 | 2 | -19/+26
| | | | | | | | | | | | | | | With both abd_size and abd_nents being uint_t it makes no sense for abd_chunkcnt_for_bytes() to return size_t. A random mix of different types used to count chunks looks bad and makes it more difficult for the compiler to optimize the code. In particular on FreeBSD this change allows the compiler to completely optimize out abd_verify_scatter() when built without debug, removing a pointless 64-bit division and an even more pointless empty loop. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #11279
* Fix raw sends on encrypted datasets when copying back snapshots | George Amanakis | 2020-12-04 | 2 | -2/+28
| | | | | | | | | | | | | | When sending raw encrypted datasets the user space accounting is present when it's not expected to be. This leads to the subsequent mount failure due to a checksum error when verifying the local mac. Fix this by clearing the OBJSET_FLAG_USERACCOUNTING_COMPLETE flag and resetting the local mac. This allows the user accounting to be correctly updated on first mount using the normal upgrade process. Reviewed-By: Brian Behlendorf <[email protected]> Reviewed-By: Tom Caputi <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #10523 Closes #11221
* Fix problems in zvol_set_volmode_impl | Matthew Macy | 2020-11-17 | 2 | -27/+67
| | | | | | | | | | | | | - Don't leave fstrans set when passed a snapshot - Don't remove minor if volmode already matches new value - (FreeBSD) Wait for GEOM ops to complete before trying remove (at create time GEOM will be "tasting" in parallel) - (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL - (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #11199
* Linux: Fix ZFS_ENTER/ZFS_EXIT/ZFS_VERIFY_ZP usage | Brian Behlendorf | 2020-11-14 | 3 | -18/+18
| | | | | | | | | | | | | | | | | | | The ZFS_ENTER/ZFS_EXIT/ZFS_VERIFY_ZP macros should not be used in the Linux zpl_*.c source files. They return a positive error value which is correct for the common code, but not for the Linux specific kernel code which expects a negative return value. The ZPL_ENTER/ZPL_EXIT/ZPL_VERIFY_ZP macros should be used instead. Furthermore, the ZPL_EXIT macro has been updated to not call the zfs_exit_fs() function. This prevents a possible deadlock which can occur when a snapshot is automatically unmounted because the zpl_show_devname() must never wait on in-progress automatic snapshot unmounts. Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11169 Closes #11201
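The sign convention at the boundary between common and Linux-specific code can be illustrated with a small user-space sketch. The function names are hypothetical and the real macros do more than flip a sign; the point is only that the common layer reports a positive errno while the Linux VFS layer must hand back a negative one.

```c
#include <assert.h>
#include <errno.h>

/* Common-code convention (as in the message): 0 or a positive errno. */
static int
common_op(int should_fail)
{
	return (should_fail ? EIO : 0);
}

/* Linux VFS convention: 0 or a negative errno. */
static int
zpl_op(int should_fail)
{
	int error = common_op(should_fail);

	assert(error >= 0);	/* the common layer never returns negatives */
	return (-error);
}
```

Returning the common-code value unchanged from a VFS callback would hand the kernel a positive number, which it does not interpret as an error code.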
* Distributed Spare (dRAID) Feature | Brian Behlendorf | 2020-11-13 | 4 | -39/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new top-level vdev type called dRAID, which stands for Distributed parity RAID. This pool configuration allows all dRAID vdevs to participate when rebuilding to a distributed hot spare device. This can substantially reduce the total time required to restore full parity to a pool with a failed device. A dRAID pool can be created using the new top-level `draid` type. Like `raidz`, the desired redundancy is specified after the type: `draid[1,2,3]`. No additional information is required to create the pool and reasonable default values will be chosen based on the number of child vdevs in the dRAID vdev. zpool create <pool> draid[1,2,3] <vdevs...> Unlike raidz, additional optional dRAID configuration values can be provided as part of the draid type as colon separated values. This allows administrators to fully specify a layout for either performance or capacity reasons. The supported options include: zpool create <pool> \ draid[<parity>][:<data>d][:<children>c][:<spares>s] \ <vdevs...> - draid[parity] - Parity level (default 1) - draid[:<data>d] - Data devices per group (default 8) - draid[:<children>c] - Expected number of child vdevs - draid[:<spares>s] - Distributed hot spares (default 0) Abbreviated example `zpool status` output for a 68 disk dRAID pool with two distributed spares using special allocation classes. ``` pool: tank state: ONLINE config: NAME STATE READ WRITE CKSUM slag7 ONLINE 0 0 0 draid2:8d:68c:2s-0 ONLINE 0 0 0 L0 ONLINE 0 0 0 L1 ONLINE 0 0 0 ... U25 ONLINE 0 0 0 U26 ONLINE 0 0 0 spare-53 ONLINE 0 0 0 U27 ONLINE 0 0 0 draid2-0-0 ONLINE 0 0 0 U28 ONLINE 0 0 0 U29 ONLINE 0 0 0 ... 
U42 ONLINE 0 0 0 U43 ONLINE 0 0 0 special mirror-1 ONLINE 0 0 0 L5 ONLINE 0 0 0 U5 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 L6 ONLINE 0 0 0 U6 ONLINE 0 0 0 spares draid2-0-0 INUSE currently in use draid2-0-1 AVAIL ``` When adding test coverage for the new dRAID vdev type the following options were added to the ztest command. These options are leveraged by zloop.sh to test a wide range of dRAID configurations. -K draid|raidz|random - kind of RAID to test -D <value> - dRAID data drives per group -S <value> - dRAID distributed hot spares -R <value> - RAID parity (raidz or dRAID) The zpool_create, zpool_import, redundancy, replacement and fault test groups have all been updated to provide test coverage for the dRAID feature. Co-authored-by: Isaac Huang <[email protected]> Co-authored-by: Mark Maybee <[email protected]> Co-authored-by: Don Brady <[email protected]> Co-authored-by: Matthew Ahrens <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #10102
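The `draid[<parity>][:<data>d][:<children>c][:<spares>s]` syntax described in this message can be read with a small parser. The sketch below is illustrative only and is not the parser used by `zpool`: `draid_cfg_t` and `draid_parse()` are hypothetical names, the defaults are the ones listed in the message, a children value of 0 is used here to mean "not specified", and the `-0` vdev index that appears in `zpool status` names is not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

typedef struct draid_cfg {
	int parity;	/* default 1 */
	int data;	/* data devices per group, default 8 */
	int children;	/* 0 means not specified (sketch convention) */
	int spares;	/* default 0 */
} draid_cfg_t;

/* Parse a create-time spec such as "draid2:8d:68c:2s". 0 on success, -1 on error. */
static int
draid_parse(const char *s, draid_cfg_t *c)
{
	char *end;

	assert(s != NULL && c != NULL);
	c->parity = 1;
	c->data = 8;
	c->children = 0;
	c->spares = 0;

	if (strncmp(s, "draid", 5) != 0)
		return (-1);
	s += 5;

	/* Optional parity level directly after the type name. */
	if (isdigit((unsigned char)*s)) {
		long p = strtol(s, &end, 10);
		if (p < 1 || p > 3)
			return (-1);
		c->parity = (int)p;
		s = end;
	}

	/* Zero or more ":<number><suffix>" options. */
	while (*s == ':') {
		s++;
		if (!isdigit((unsigned char)*s))
			return (-1);
		long v = strtol(s, &end, 10);
		if (v > INT_MAX)
			return (-1);
		switch (*end) {
		case 'd': c->data = (int)v; break;
		case 'c': c->children = (int)v; break;
		case 's': c->spares = (int)v; break;
		default: return (-1);
		}
		s = end + 1;
	}
	return (*s == '\0' ? 0 : -1);
}
```

Parsing `draid2:8d:68c:2s` this way yields parity 2, 8 data devices per group, 68 children and 2 distributed spares, matching the 68 disk example in the status output.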
* Linux: Fix mount/unmount when dataset name has a space | Brian Behlendorf | 2020-11-11 | 1 | -4/+17
| | | | | | | | | | | | | | | The custom zpl_show_devname() helper should translate spaces into the octal escape sequence \040. The getmntent(3) function is aware of this convention and properly translates the escape character back to a space when reading the fsname. Without this change the `zfs mount` and `zfs unmount` commands incorrectly detect when a dataset with a name containing spaces is mounted. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11182 Closes #11187
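The space-to-`\040` translation can be sketched in plain C. This is a user-space illustration, not the kernel helper (which writes into a seq_file rather than a caller buffer); `escape_spaces()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dst, writing every space as the four characters "\040".
 * Returns 0 on success, or -1 if dst is too small; on failure the
 * contents of dst are unspecified.
 */
static int
escape_spaces(const char *src, char *dst, size_t dstlen)
{
	size_t o = 0;

	assert(src != NULL && dst != NULL);
	if (dstlen == 0)
		return (-1);
	for (; *src != '\0'; src++) {
		if (*src == ' ') {
			/* Four escape characters plus the final NUL. */
			if (o + 5 > dstlen)
				return (-1);
			dst[o++] = '\\';
			dst[o++] = '0';
			dst[o++] = '4';
			dst[o++] = '0';
		} else {
			/* One character plus the final NUL. */
			if (o + 2 > dstlen)
				return (-1);
			dst[o++] = *src;
		}
	}
	dst[o] = '\0';
	return (0);
}
```

A dataset named `tank/my data` would be emitted as `tank/my\040data`, which is the form mount table readers expect to translate back to a space.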
* Start snapdir_iterate traversals to begin with the value of zero. | Tony Perkins | 2020-11-11 | 1 | -1/+2
| | | | | | | | | | | | The microzap hash can sometimes be zero for single digit snapnames. The zap cursor can then have a serialized value of two (for . and ..), and skip the first entry in the avl tree for the .zfs/snapshot directory listing, and therefore does not return all snapshots. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Cedric Berger <[email protected]> Signed-off-by: Tony Perkins <[email protected]> Closes #11039
* G/C struct znode -> z_moved | Mateusz Guzik | 2020-11-10 | 3 | -8/+0
| | | | | | | | The field is yet another leftover from unsupported zfs_znode_move. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #11186
* FreeBSD: Simplify zvol_geom_open and zvol_cdev_open | Ryan Moeller | 2020-11-10 | 1 | -36/+16
| | | | | | | | | | | | | | We can consolidate the unlocking procedure into one place by starting with drop_suspend set to B_FALSE and moving the open count check up. While here, a little code cleanup. Match the out labels between zvol_geom_open and zvol_cdev_open, and add a missing period in some comments. Reviewed-by: Matt Macy <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11175
* FreeBSD: Avoid spurious EINTR in zvol_cdev_open | Ryan Moeller | 2020-11-10 | 1 | -1/+20
| | | | | | | | | | | | | zvol_first_open can fail with EINTR if spa_namespace_lock is not held and cannot be taken without waiting. Apply the same logic that was done for zvol_geom_open to take spa_namespace_lock if not already held on first open in zvol_cdev_open. Reviewed-by: Matt Macy <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11175
* Remove redundant oid parameter to update_pages | Ryan Moeller | 2020-11-10 | 2 | -7/+6
| | | | | | | | | | The oid comes from the znode we are already passing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11176
* FreeBSD: Prevent a NULL reference in zvol_cdev_open | Mariusz Zaborski | 2020-11-05 | 1 | -1/+2
| | | | | | | | | Check if the ZVOL has been written before calling zil_async_to_sync. The ZIL will be opened on the first write, not earlier. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Closes #11152
* FreeBSD: Prevent NULL pointer dereference of resid | khng300 | 2020-11-04 | 1 | -1/+2
| | | | | | | | | | spa_config_load() passes NULL into resid when doing zfs_file_read(). This would trip over when vfs.zfs.autoimport_disable=0. Sponsored by: The FreeBSD Foundation Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Ka Ho Ng <[email protected]> Closes #11149
* FreeBSD: zvol_os: Use SET_ERROR more judiciously | Ryan Moeller | 2020-11-03 | 1 | -13/+13
| | | | | | | | SET_ERROR is useful to trace errors, so use it where the errors occur rather than factored out to the end of a function. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11146
* Linux 5.10 compat: revalidate_disk_size() added | Coleman Kane | 2020-11-02 | 1 | -0/+4
| | | | | | | | | | | A new function was added named revalidate_disk_size() and the old revalidate_disk() appears to have been deprecated. As the only ZFS code that calls this function is zvol_update_volsize, swapping the old function call out for the new one should be all that is required. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11085
* Linux 5.10 compat: check_disk_change() removed | Coleman Kane | 2020-11-02 | 2 | -2/+2
| | | | | | | | | | | | | | Kernel 5.10 removed check_disk_change() in favor of callers using the faster bdev_check_media_change() instead, and explicitly forcing bdev revalidation when they desire that behavior. To preserve prior behavior, I have wrapped this into a zfs_check_media_change() macro that calls an inline function for the new API that mimics the old behavior when check_disk_change() doesn't exist, and just calls check_disk_change() if it exists. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11085
* Linux 5.10 compat: percpu_ref added data member | Coleman Kane | 2020-11-02 | 1 | -0/+4
| | | | | | | | | | | | Kernel commit 2b0d3d3e4fcfb brought in some changes to the struct percpu_ref structure that moves most of its fields into a member struct named "data" of type struct percpu_ref_data. This includes the "count" member which is updated by vdev_blkg_tryget(), so update this function to chase the API change, and detect it via configure. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11085