path: root/include/sys
Commit message | Author | Age | Files | Lines
* Replace ZFS on Linux references with OpenZFS (Brian Behlendorf, 2020-10-08; 2 files, -5/+0)
  This change updates the documentation to refer to the project as OpenZFS instead of ZFS on Linux. Web links have been updated to refer to https://github.com/openzfs/zfs. The extraneous zfsonlinux.org web links in the ZED and SPL sources have been dropped.
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11007
* Throw const on some strings (Ryan Moeller, 2020-10-02; 4 files, -11/+12)
  In C, const indicates to the reader that mutation will not occur. It can also serve as a hint about ownership. Add const in a few places where it makes sense.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10997
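As a generic illustration of the const-on-strings idea above (a sketch, not the commit's actual diff):

```c
#include <stdio.h>

/*
 * Illustrative only: const tells the reader (and the compiler) that this
 * function treats the name as read-only and does not take ownership of it.
 */
static void
print_dataset(const char *name)
{
	printf("dataset: %s\n", name);
}

int
main(void)
{
	print_dataset("tank/home");
	return (0);
}
```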
* do a cyclic seek for unused memory objects in pool (Sebastian Gottschall, 2020-09-30; 1 file, -0/+1)
  In irregular use cases, allocated memory might stay persistent in the memory pool. This small patch checks every minute whether there are old objects that can be released from the memory pool.
  Right now the pool is only checked for old objects on each allocation attempt from that pool, so the check is effectively driven by use. Consider what happens if someone writes a lot of files and then stops using the volume, or even unmounts it: the code will no longer check whether objects can be released from the pool, and already allocated objects will stay in the pool cache. This is not a big issue for common use, but the behavior was discovered during testing, so this change is an enhancement rather than a critical fix.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Kjeld Schouten-Lebbing <[email protected]>
  Signed-off-by: Sebastian Gottschall <[email protected]>
  Closes #10938
  Closes #10969
* Fix objtool configure check (Brian Behlendorf, 2020-09-28; 1 file, -1/+2)
  The m4 objtool configure check can incorrectly fail because of a missing header in the test. This appears to be the result of a recent kernel change and was observed on the Fedora 5.8.11-200 kernel.
    In file included from /home/fedora/zfs/build/objtool/objtool.c:75:
    ./arch/x86/include/asm/frame.h:100:57: error: 'struct pt_regs' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
  The consequence of this is that the "stack_frame_non_standard" check is never run and HAVE_STACK_FRAME_NON_STANDARD is set incorrectly, which results in a build failure. This change adds the appropriate header to the "objtool" check so it now behaves as intended.
  Reviewed-by: Kjeld Schouten <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10990
* zfetch: Don't issue new streams when old have not completed (Matthew Macy, 2020-09-27; 2 files, -7/+16)
  The current dmu_zfetch code implicitly assumes that I/Os complete within min_sec_reap seconds. With async dmu and a readonly workload (and thus no exponential backoff in operations from the "write throttle") such as L2ARC rebuild, it is possible to saturate the drives with I/O requests. These are then effectively compounded with prefetch requests.
  This change reference counts streams and prevents them from being recycled after their min_sec_reap timeout if they still have outstanding I/Os.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10900
* FreeBSD: Add support for procfs_list (Matthew Macy, 2020-09-23; 1 file, -0/+1)
  The procfs_list interface is required by several kstats. Implement this functionality for FreeBSD to provide access to these kstats.
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10890
* FreeBSD: Reduce stack usage of Lua (Ryan Moeller, 2020-09-22; 1 file, -4/+0)
  Use the same reduced buffer size for lauxlib that is used on Linux. Fixes a panic on HEAD in the Lua gsub test designed to exhaust stack space. With this we can remove the special case to reserve more stack space on FreeBSD.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Kyle Evans <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10959
* vdev_ashift should only be set once (George Wilson, 2020-09-18; 1 file, -1/+0)
  == Motivation and Context
  The new vdev ashift optimization prevents the removal of devices when a zfs configuration is comprised of disks which have different logical and physical block sizes. This happens because we set 'spa_min_ashift' in vdev_open and then later call 'vdev_ashift_optimize', resulting in an inconsistency between the spa's ashift calculations and those of the top-level vdev. In addition, the optimization logic ignores the overridden ashift value that would be provided by '-o ashift=<val>'.
  == Description
  This change reworks the vdev ashift optimization so that it's only set the first time the device is configured. It still allows the physical and logical ashift values to be set every time the device is opened, but those values are only consulted on first open.
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Cedric Berger <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  External-Issue: DLPX-71831
  Closes #10932
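A rough, standalone sketch of the selection rule described above, with made-up names and tunable values (the real logic lives in the vdev open path and differs in detail): the optimized ashift is chosen once, an explicit override wins, and the result never drops below the device's logical ashift.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real auto-ashift tunables. */
static uint64_t min_auto_ashift = 9;	/* 512B  */
static uint64_t max_auto_ashift = 14;	/* 16KiB */

static uint64_t
choose_ashift(uint64_t logical, uint64_t physical, uint64_t override)
{
	uint64_t ashift;

	if (override != 0)		/* e.g. "-o ashift=<val>" */
		return (override);

	ashift = physical > logical ? physical : logical;
	if (ashift < min_auto_ashift)
		ashift = min_auto_ashift;
	if (ashift > max_auto_ashift)
		ashift = max_auto_ashift;
	if (ashift < logical)		/* never below what the device needs */
		ashift = logical;
	return (ashift);
}

int
main(void)
{
	/* 512e disk: logical 2^9, physical 2^12, no override. */
	printf("ashift=%llu\n",
	    (unsigned long long)choose_ashift(9, 12, 0));
	return (0);
}
```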
* Rename acltype=posixacl to acltype=posix (Ryan Moeller, 2020-09-16; 1 file, -1/+1)
  Prefer acltype=off|posix, retaining the old names as aliases.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10918
* zfs label bootenv should store data as nvlist (Toomas Soome, 2020-09-15; 5 files, -4/+67)
  nvlist does allow us to support different data types and systems. To encapsulate user data to/from nvlist, the libzfsbootenv library is provided.
  Reviewed-by: Arvind Sankar <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Igor Kozhukhov <[email protected]>
  Signed-off-by: Toomas Soome <[email protected]>
  Closes #10774
* Add L2ARC arcstats for MFU/MRU buffers and buffer content type (George Amanakis, 2020-09-14; 1 file, -0/+17)
  Currently the ARC state (MFU/MRU) of cached L2ARC buffers and their content type is unknown. Knowing this information may prove beneficial in adjusting the L2ARC caching policy.
  This commit adds L2ARC arcstats that display the aligned size (in bytes) of L2ARC buffers according to their content type (data/metadata) and according to their ARC state (MRU/MFU or prefetch). It also expands the existing evict_l2_eligible arcstat to differentiate between MFU and MRU buffers.
  L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a buffer, its ARC state (MRU/MFU) is stored in the L2 header (b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size (in bytes) of L2ARC buffers according to their ARC state (based on b_arcs_state). We also account for the case where an L2ARC and ARC cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize arcstat reflects the aligned size (in bytes) of L2ARC buffers that were cached while they had the prefetch flag set in ARC. This is dynamically updated as the prefetch flag of L2ARC buffers changes.
  When buffers are evicted from ARC, if they are determined to be L2ARC eligible then their logical size is recorded in the evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon eviction.
  Persistent L2ARC: When committing an L2ARC buffer to a log block (L2ARC metadata), its b_arcs_state and prefetch flag are also stored. If the buffer changes its ARC state or prefetch flag, this is reflected in the above arcstats. However, the L2ARC metadata cannot currently be updated to reflect this change. Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count this as an MRU buffer. The buffer transitions to MFU. The arcstats are updated to reflect this. Upon pool re-import or on/offlining the L2ARC device the arcstats are cleared and the buffer will now be counted as an MRU buffer, as the L2ARC metadata were not updated.
  Bug fix: If l2arc_noprefetch is set, arc_read_done() clears the L2CACHE flag of an ARC buffer. However, prefetches may be issued in a way that arc_read_done() is bypassed. Instead, move the related code into l2arc_write_eligible() to account for those cases too.
  Also add a test and update manpages for the l2arc_mfuonly module parameter, and update the manpages and code comments for l2arc_noprefetch. Move persist_l2arc tests to l2arc.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Richard Elling <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #10743
* Avoid possibility of division by zero (Ryan Moeller, 2020-09-08; 1 file, -3/+3)
  When hz > 1000, msec / (1000 / hz) results in division by zero. Elsewhere in FreeBSD, howmany(msec * hz, 1000) is used to convert ms to ticks, avoiding the potential for a zero in the divisor.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10894
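A small self-contained sketch of the two conversions (the rounding macro is reproduced locally rather than pulled from <sys/param.h>; the hz and msec values are arbitrary): with hz greater than 1000, the old expression's divisor 1000 / hz is zero, while the rounded-up form stays well defined.

```c
#include <stdio.h>

/* Same round-up idiom as FreeBSD's howmany() macro. */
#define	HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	int hz = 2000;		/* ticks per second */
	int msec = 150;

	/*
	 * Old form: msec / (1000 / hz) would divide by zero here,
	 * because 1000 / 2000 == 0 in integer arithmetic.
	 */

	/* Safe form: scale first, then divide with rounding up. */
	int ticks = HOWMANY(msec * hz, 1000);

	printf("%d ms -> %d ticks at hz=%d\n", msec, ticks, hz);
	return (0);
}
```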
* Avoid posting duplicate zpool events (Don Brady, 2020-09-04; 4 files, -5/+13)
  Duplicate io and checksum ereport events can make things look worse than they actually are. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated with a simple example: with a single bad block in a datafile and just 5 reads of the file, we end up with a degraded vdev, even though there is only one unique error in the pool.
  The proposed solution to the above issue is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error.
  Reviewed by: Brad Lewis <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #10861
* nowait synctask must succeed (Matthew Ahrens, 2020-09-04; 1 file, -2/+2)
  If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with `dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC. However, there is no way to notice or communicate this failure, so it's extremely difficult to use this functionality correctly, and in fact almost all callers use `ZFS_SPACE_CHECK_NONE`.
  This commit removes the `zfs_space_check_t` argument from `dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10855
* Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate (Matthew Macy, 2020-09-03; 1 file, -0/+4)
  There are a number of places where cv_?_sig is used simply for accounting purposes, but the surrounding code has no ability to cope with actually receiving a signal. On FreeBSD it is possible to send signals to individual kernel threads, so this could enable undesirable behavior.
  This patch adds routines on Linux that will do the same idle accounting as _sig without making the task interruptible. On FreeBSD the cv_*_idle routines are all aliases for cv_*.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10843
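On the FreeBSD side the commit describes the new names as plain aliases; a minimal sketch of what that aliasing can look like (illustrative header fragment, not the actual SPL code):

```c
/*
 * Illustrative only: on FreeBSD the _idle variants add no behavior, so they
 * can simply forward to the uninterruptible primitives.  The Linux versions
 * instead perform the idle accounting without marking the task interruptible.
 */
#define	cv_wait_idle(cvp, mp)		cv_wait(cvp, mp)
#define	cv_timedwait_idle(cvp, mp, t)	cv_timedwait(cvp, mp, t)
```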
* Add 'zfs rename -u' to rename without remounting (Ryan Moeller, 2020-09-01; 2 files, -0/+36)
  Allow renaming file systems without remounting when it is possible. This is the case for file systems with the 'mountpoint' property set to 'legacy' or 'none' -- we don't have to change the mount directory for them. Currently such file systems are unmounted on rename and not even mounted back.
  This introduces a layering violation, as we need to update the 'f_mntfromname' field in the statfs structure related to the mountpoint (for the dataset we are renaming and all its children). In my opinion it is worth it, as it allows updating FreeBSD in an even cleaner way: in a ZFS-only configuration the root file system is a ZFS file system with the 'mountpoint' property set to 'legacy'. If the root dataset is named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs), update FreeBSD, and if it doesn't boot we can boot back from system/oldrootfs and rename it back to system/rootfs while it is mounted as /. Before, this was not possible, because unmounting / was not possible.
  Authored by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Ported by: Matt Macy <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10839
* zpool: Change base URL for ZFS messages to openzfs-docs (Ryan Moeller, 2020-08-26; 1 file, -4/+4)
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10820
* Avoid symbol collision with in-kernel zstdlib (Sebastian Gottschall, 2020-08-24; 1 file, -6/+6)
  On Linux, when ZFS is compiled as an in-kernel static variant and the in-kernel zstd library is also compiled statically into the kernel, a symbol collision will occur. This wrapper header renames all of the relevant zstd functions to avoid this problem.
  Reviewed-by: Kjeld Schouten <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Sebastian Gottschall <[email protected]>
  Closes #10775
* Appease GCC sprintf warnings found on Fedora 32/GCC 10.0.1 (Chris McDonough, 2020-08-24; 1 file, -1/+1)
  Increase the size of DDT_NAMELEN and MNT_LINE_MAX to appease GCC snprintf truncation warnings.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chris McDonough <[email protected]>
  Closes #10712
  Closes #10766
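The warning class involved is easy to reproduce and silence in isolation; a hedged, generic example (not the zfs code itself): GCC 10's -Wformat-truncation is satisfied when the destination buffer provably covers the worst-case formatted length.

```c
#include <stdio.h>
#include <inttypes.h>

int
main(void)
{
	uint64_t obj = 123456789ULL;

	/*
	 * Worst case here is the 25-byte prefix plus up to 20 digits plus
	 * NUL, so 64 bytes is provably enough and GCC has nothing to warn
	 * about.  Shrinking this to, say, char name[16] would trigger the
	 * -Wformat-truncation warning the commit is silencing.
	 */
	char name[64];

	snprintf(name, sizeof (name), "DDT-sha256-zap-duplicate-%" PRIu64, obj);
	printf("%s\n", name);
	return (0);
}
```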
* Import vdev ashift optimization from FreeBSD (Ryan Moeller, 2020-08-21; 3 files, -1/+39)
  Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices.
  Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks:
  1. Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device.
  2. The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit.
  Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Matthew Macy <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10619
* FreeBSD: 11.x arc_stats compatibility (Matthew Macy, 2020-08-20; 1 file, -0/+7)
  Removing other_size from arc_stats breaks top in 11.x jails running on HEAD.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10745
* Add zstd support to zfs (Michael Niewöhner, 2020-08-20; 13 files, -14/+241)
  This PR adds two new compression types, based on ZStandard:
  - zstd: A basic ZStandard compression algorithm. Available compression levels for zstd are zstd-1 through zstd-19, where the compression increases with every level, but speed decreases.
  - zstd-fast: A faster version of the ZStandard compression algorithm. zstd-fast is basically a "negative" level of zstd. The compression decreases with every level, but speed increases. Available compression levels for zstd-fast: zstd-fast-1 through zstd-fast-10, zstd-fast-20 through zstd-fast-100 (in increments of 10), and zstd-fast-500 and zstd-fast-1000.
  For more information check the man page.
  Implementation details:
  Rather than treat each level of zstd as a different algorithm (as was done historically with gzip), the block pointer `enum zio_compress` value is simply zstd for all levels, including zstd-fast, since they all use the same decompression function.
  The compress= property (a 64bit unsigned integer) uses the lower 7 bits to store the compression algorithm (matching the number of bits used in a block pointer, as the 8th bit was borrowed for embedded block pointers). The upper bits are used to store the compression level.
  It is necessary to be able to determine what compression level was used when later reading a block back, so the concept used in LZ4, where the first 32 bits of the on-disk value are the size of the compressed data (since the allocation is rounded up to the nearest ashift), was extended, and we store the version of ZSTD and the level as well as the compressed size. This value is returned when decompressing a block, so that if the block needs to be recompressed (L2ARC, nop-write, etc.), the same parameters will be used to produce the matching checksum.
  All of the internal ZFS code (`arc_buf_hdr_t`, `objset_t`, `zio_prop_t`, etc.) uses the separated _compress and _complevel variables. Only the properties ZAP contains the combined/bit-shifted value. The combined value is split when the compression_changed_cb() callback is called, and sets both objset members (os_compress and os_complevel). The userspace tools all use the combined/bit-shifted value.
  Additional notes: zdb can now also decode the ZSTD compression header (flag -Z) and inspect the size, version and compression level saved in that header. For each record, if it is ZSTD compressed, the parameters of the decoded compression header get printed. ZSTD is included with all current tests and new tests are added as needed.
  Per-dataset feature flags now get activated when the property is set. If a compression algorithm requires a feature flag, zfs activates the feature when the property is set, rather than waiting for the first block to be born. This is currently only used by zstd but can be extended as needed.
  Portions-Sponsored-By: The FreeBSD Foundation
  Co-authored-by: Allan Jude <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Sebastian Gottschall <[email protected]>
  Co-authored-by: Kjeld Schouten-Lebbing <[email protected]>
  Co-authored-by: Michael Niewöhner <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Sebastian Gottschall <[email protected]>
  Signed-off-by: Kjeld Schouten-Lebbing <[email protected]>
  Signed-off-by: Michael Niewöhner <[email protected]>
  Closes #6247
  Closes #9024
  Closes #10277
  Closes #10278
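The split between algorithm and level described above can be sketched as plain bit packing. This is a hypothetical illustration of the layout the text describes (7 bits of algorithm, level in the upper bits); the helper names and the example value 16 are made up, not the real ZFS enums or functions.

```c
#include <stdint.h>
#include <stdio.h>

#define	COMPRESS_ALGO_BITS	7
#define	COMPRESS_ALGO_MASK	((1ULL << COMPRESS_ALGO_BITS) - 1)

/* Hypothetical packing of a combined compress= property value. */
static uint64_t
compress_prop_pack(uint64_t algo, uint64_t level)
{
	return ((level << COMPRESS_ALGO_BITS) | (algo & COMPRESS_ALGO_MASK));
}

/* Split the combined value back into algorithm and level. */
static void
compress_prop_unpack(uint64_t prop, uint64_t *algo, uint64_t *level)
{
	*algo = prop & COMPRESS_ALGO_MASK;
	*level = prop >> COMPRESS_ALGO_BITS;
}

int
main(void)
{
	uint64_t algo, level;

	/* 16 is an arbitrary example algorithm number, 19 an example level. */
	compress_prop_unpack(compress_prop_pack(16, 19), &algo, &level);
	printf("algo=%llu level=%llu\n",
	    (unsigned long long)algo, (unsigned long long)level);
	return (0);
}
```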
* Add define to enable autotrim to default to on (Matthew Macy, 2020-08-18; 1 file, -1/+7)
  In FreeBSD trim has defaulted to on for several years. In order to minimize POLA violations on import, it's important to maintain this default when importing vendored OpenZFS into FreeBSD base.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10719
* Make zc_nvlist_src_size limit tunable (Ryan Moeller, 2020-08-18; 1 file, -0/+2)
  We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication.
  Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration.
  Reviewed-by: Matt Macy <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Issue #6572
  Closes #10706
* Include scatter_chunk_waste in arc_size (Matthew Ahrens, 2020-08-17; 2 files, -0/+2)
  The ARC caches data in scatter ABD's, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up, it only accounts for the size that it requested from the ABD subsystem.
  Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`.
  If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal. However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been observed to be 10-30% of memory.
  This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently.
  Reviewed-by: Pavel Zakharov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10701
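The waste being added to arc_size is just the round-up to the page size; a rough worked sketch, assuming 4K pages and an arbitrary compressed block size:

```c
#include <stdio.h>
#include <stdint.h>

#define	PAGE_SZ		4096ULL

/* Round a size up to a whole number of pages, as a page-based ABD must. */
static uint64_t
roundup_pages(uint64_t size)
{
	return (((size + PAGE_SZ - 1) / PAGE_SZ) * PAGE_SZ);
}

int
main(void)
{
	/* e.g. an 8K block that compressed down to 2560 bytes */
	uint64_t psize = 2560;
	uint64_t cached = roundup_pages(psize);

	printf("requested %llu bytes, cached %llu bytes, waste %llu bytes\n",
	    (unsigned long long)psize,
	    (unsigned long long)cached,
	    (unsigned long long)(cached - psize));
	return (0);
}
```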
* Remove KMC_KMEM and KMC_VMEM (Matthew Ahrens, 2020-08-17; 1 file, -2/+0)
  `KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented caches are `KMC_KVMEM`.
  KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we don't use kmalloc to back the SPL caches, instead we use kvmalloc (KMC_KVMEM). The flag, module parameter, /proc entries, and associated code are removed.
  KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to vmalloc(). The flag, /proc entries, and associated code are removed.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10673
* FreeBSD: fallback to /boot/ to look for zpool.cache (Matthew Macy, 2020-08-17; 1 file, -1/+1)
  Up until now zpool.cache has always lived in /boot on FreeBSD. For the sake of compatibility, fall back to /boot if zpool.cache isn't found in /etc/zfs.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10720
* Move ZVOL_DIR back to zfs.h (Ryan Moeller, 2020-08-11; 1 file, -0/+3)
  This was previously moved because nothing else in-tree uses it, but evidently DilOS uses it out of tree.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Igor Kozhukhov <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10361
  Closes #10685
* Change the error handling for invalid property values (Allan Jude, 2020-08-01; 1 file, -0/+1)
  ZFS recv should return a useful error message when an invalid index property value is provided in the send stream properties nvlist, for example a compression= property outside of the understood range.
  Before:
  ```
  receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
  internal error: Invalid argument
  Aborted (core dumped)
  ```
  Note: the recv completes successfully, the abort() is likely just to make it easier to track the unexpected error code.
  After:
  ```
  receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
  cannot receive compression property on testpool/recv: invalid property value
  received 28.9M stream in 1 seconds (28.9M/sec)
  ```
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Closes #10631
* Revise ARC shrinker algorithm (Matthew Ahrens, 2020-07-31; 1 file, -14/+6)
  The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths:
  1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path.
  2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback.
  In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, its measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages).
  Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts:
  1. ZFS doesn't use RAM to cache data effectively.
  2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC.
  3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`.
  To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit:
  1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page.
  2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`).
  3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC.
  4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free`, is increased.
  5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the ARC to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: George Wilson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10600
* Rename refcount.h to zfs_refcount.h (Matthew Macy, 2020-07-29; 14 files, -15/+15)
  Renamed to avoid conflicting with refcount.h when a different implementation is already provided by the platform.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10620
* Introduce names for ZTHRs (Serapheim Dimitropoulos, 2020-07-29; 2 files, -4/+8)
  When debugging issues or generally analyzing the runtime of a system, it would be nice to be able to tell the different ZTHRs running by name rather than having to analyze their stack.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Co-authored-by: Ryan Moeller <[email protected]>
  Signed-off-by: Serapheim Dimitropoulos <[email protected]>
  Closes #10630
* Prefix zfs internal endian checks with _ZFS (Matthew Macy, 2020-07-28; 1 file, -1/+1)
  FreeBSD defines _BIG_ENDIAN, BIG_ENDIAN, _LITTLE_ENDIAN, and LITTLE_ENDIAN on every architecture. Trying to do cross builds whilst hiding this from ZFS has proven extremely cumbersome.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10621
* Refactor ccompile.h to not include system headers (Matthew Macy, 2020-07-25; 1 file, -0/+3)
  This is a step toward being able to vendor the OpenZFS code in FreeBSD.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10625
* Annotate unused parameters on inline definitions as such (Kyle Evans, 2020-07-23; 1 file, -1/+2)
  * libspl: umem: These are obviously and intentionally unused; annotate them as such to appease -Wunused-parameter builds that include this header.
  * sys/dmu.h: In this case, clear_on_evict_dbufp is only used for ZFS_DEBUG builds, so annotate it as __maybe_unused to appease -Wunused-parameter.
  Reviewed-By: Brian Behlendorf <[email protected]>
  Signed-off-by: Kyle Evans <[email protected]>
  Closes #10606
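A generic, self-contained illustration of the annotation pattern (the function, parameter, and debug macro names here are made up; only the __maybe_unused idea comes from the commit):

```c
#include <stdio.h>

#ifndef __maybe_unused
#define	__maybe_unused	__attribute__((unused))
#endif

/*
 * The second parameter is only consulted in debug builds; annotating it
 * keeps -Wunused-parameter quiet in release builds without #ifdef'ing
 * the prototype itself.
 */
static inline void
release_buf(void *buf, int *clear_on_evict __maybe_unused)
{
#ifdef EXAMPLE_DEBUG
	*clear_on_evict = 1;
#endif
	(void) buf;
}

int
main(void)
{
	int flag = 0;

	release_buf(NULL, &flag);
	printf("flag=%d\n", flag);
	return (0);
}
```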
* FreeBSD: Remove some code duplication in sysctl_os.c (Ryan Moeller, 2020-07-23; 1 file, -0/+1)
  Drop unnecessary redefinitions of several arcstat values. Put the missing extern declaration of arc_no_grow_shift in arc_impl.h.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10609
* Adjust ARC terminology (Matthew Ahrens, 2020-07-22; 1 file, -4/+4)
  The process of evicting data from the ARC is referred to as `arc_adjust`. This commit changes the term to `arc_evict`, which is more specific.
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10592
* FreeBSD: Add legacy arc_min and arc_max (Ryan Moeller, 2020-07-19; 1 file, -0/+2)
  These tunables were renamed from vfs.zfs.arc_min and vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max. Add legacy compat tunables for the old names.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10579
* Extend zdb to print inconsistencies in livelists and metaslabs (Matthew Ahrens, 2020-07-14; 2 files, -0/+12)
  Livelists and spacemaps are data structures that are logs of allocations and frees. Livelist entries are block pointers (blkptr_t). Spacemap entries are ranges of numbers, most often used to track allocated/freed regions of metaslabs/vdevs. These data structures can become self-inconsistent, for example if a block or range can be "double allocated" (two allocation records without an intervening free) or "double freed" (two free records without an intervening allocation).
  ZDB (as well as zfs running in the kernel) can detect these inconsistencies when loading livelists and metaslabs. However, it generally halts processing when the error is detected. When analyzing an on-disk problem, we often want to know the entire set of inconsistencies, which is not possible with the current behavior.
  This commit adds a new flag, `zdb -y`, which analyzes the livelist and metaslab data structures and displays all of their inconsistencies. Note that this is different from the leak detection performed by `zdb -b`, which checks for inconsistencies between the spacemaps and the tree of block pointers, but assumes the spacemaps are self-consistent.
  The specific checks added are:
  Verify livelists by iterating through each sublivelist and:
  - report leftover FREEs
  - report double ALLOCs and double FREEs
  - record leftover ALLOCs together with their TXG [see Cross Check]
  Verify spacemaps by iterating over each metaslab and:
  - iterate over the spacemap and then the metaslab's entries in the spacemap log, then report any double FREEs and double ALLOCs
  Verify that livelists are consistent with spacemaps. The space referenced by livelists (after using the FREEs to cancel out corresponding ALLOCs) should be allocated, according to the spacemaps.
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Sara Hartse <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  External-issue: DLPX-66031
  Closes #10515
* Remove dependency on sharetab file and refactor sharing logic (George Wilson, 2020-07-13; 1 file, -1/+0)
  == Motivation and Context
  The current implementation of 'sharenfs' and 'sharesmb' relies on the use of the sharetab file. The use of this file is os-specific and not required by Linux or FreeBSD. Currently the code must maintain updates to this file, which adds complexity and presents a significant performance impact when sharing many datasets. In addition, concurrently running 'zfs sharenfs' commands result in missing entries in the sharetab file, leading to unexpected failures.
  == Description
  This change removes the sharetab logic from the Linux and FreeBSD implementation of 'sharenfs' and 'sharesmb'. It still preserves an os-specific library which contains the logic required for sharing NFS or SMB. The following entry points exist in the vastly simplified libshare library:
  - sa_enable_share -- shares a dataset but may not commit the change
  - sa_disable_share -- unshares a dataset but may not commit the change
  - sa_is_shared -- determine if a dataset is shared
  - sa_commit_share -- notify NFS/SMB subsystem to commit the shares
  - sa_validate_shareopts -- determine if sharing options are valid
  The sa_commit_share entry point is provided as a performance enhancement and is not required. The sa_enable_share/sa_disable_share may commit the share as part of the implementation. Libshare provides a framework for both NFS and SMB but some operating systems may not fully support these protocols or all features of the protocol.
  NFS Operation: For Linux, libshare updates /etc/exports.d/zfs.exports to add and remove shares and then commits the changes by invoking 'exportfs -r'. This file is automatically read by the kernel NFS implementation, which makes for better integration with the NFS systemd service. For FreeBSD, libshare updates /etc/zfs/exports to add and remove shares and then commits the changes by sending a SIGHUP to mountd.
  SMB Operation: For Linux, libshare adds and removes files in /var/lib/samba/usershares by calling the 'net' command directly. There is no need to commit the changes. FreeBSD does not support SMB.
  == Performance Results
  To test sharing performance we created a pool with an increasing number of datasets and invoked various zfs actions that would enable and disable sharing. The performance testing was limited to NFS sharing. The following tests were performed on an 8 vCPU system with 128GB and a pool comprised of 4 50GB SSDs:
  Scale testing:
  - Share all filesystems in parallel -- zfs sharenfs=on <dataset> &
  - Unshare all filesystems in parallel -- zfs sharenfs=off <dataset> &
  Functional testing:
  - share each filesystem serially -- zfs share -a
  - unshare each filesystem serially -- zfs unshare -a
  - reset sharenfs property and unshare -- zfs inherit -r sharenfs <pool>
  For 'zfs sharenfs=on' scale testing we saw an average reduction in time of 89.43% and for 'zfs sharenfs=off' we saw an average reduction in time of 83.36%. Functional testing also shows a huge improvement:
  - zfs share -- 97.97% reduction in time
  - zfs unshare -- 96.47% reduction in time
  - zfs inherit -r sharenfs -- 99.01% reduction in time
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Bryant G. Ly <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  External-Issue: DLPX-68690
  Closes #1603
  Closes #7692
  Closes #7943
  Closes #10300
* filesystem_limit/snapshot_limit is incorrectly enforced against root (Matthew Ahrens, 2020-07-11; 5 files, -3/+8)
  The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit:
  1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly.
  2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property.
  The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process.
  This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`.
  This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #8226
  Closes #10545
* Add a "try" operation for range locksMark Johnston2020-07-061-0/+2
| | | | | | | | | | | zfs_rangelock_tryenter() bails immediately instead of waiting for the lock to become available. This will be used to resolve a deadlock in the FreeBSD page-in code. No functional change intended. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Mark Johnston <[email protected]> Closes #10519
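The "try" shape is the usual non-blocking acquire pattern; a generic standalone sketch with made-up names (the real zfs_rangelock_tryenter() operates on byte ranges of a file and reports failure instead of sleeping when the range is already locked):

```c
#include <pthread.h>
#include <stdio.h>

/*
 * Generic illustration of the "try" pattern: attempt the acquisition and
 * bail out immediately instead of sleeping if it would block.  Names here
 * are made up; they are not the zfs rangelock API.
 */
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

static int
try_page_in(void)
{
	if (pthread_mutex_trylock(&range_lock) != 0)
		return (-1);	/* would block; caller backs off and retries */

	/* ... fault the page in while the lock is held ... */

	pthread_mutex_unlock(&range_lock);
	return (0);
}

int
main(void)
{
	printf("page-in %s\n", try_page_in() == 0 ? "done" : "deferred");
	return (0);
}
```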
* Add device rebuild feature (Brian Behlendorf, 2020-07-03; 9 files, -10/+159)
  The device_rebuild feature enables sequential reconstruction when resilvering. Mirror vdevs can be rebuilt in LBA order, which may more quickly restore redundancy depending on the pool's average block size, overall fragmentation and the performance characteristics of the devices. However, block checksums cannot be verified as part of the rebuild, thus a scrub is automatically started after the sequential resilver completes.
  The new '-s' option has been added to the `zpool attach` and `zpool replace` command to request sequential reconstruction instead of healing reconstruction when resilvering.
    zpool attach -s <pool> <existing vdev> <new vdev>
    zpool replace -s <pool> <old vdev> <new vdev>
  The `zpool status` output has been updated to report the progress of sequential resilvering in the same way as healing resilvering. The one notable difference is that multiple sequential resilvers may be in progress as long as they're operating on different top-level vdevs.
  The `zpool wait -t resilver` command was extended to wait on sequential resilvers. From this perspective they are no different than healing resilvers.
  Sequential resilvers cannot be supported for RAIDZ, but are compatible with the dRAID feature being developed.
  As part of this change the resilver_restart_* tests were moved in to the functional/replacement directory. Additionally, the replacement tests were renamed and extended to verify both resilvering and rebuilding.
  Original-patch-by: Isaac Huang <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: John Poduska <[email protected]>
  Co-authored-by: Mark Maybee <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10349
* Clean up OS-specific ARC and kmem code (Matthew Ahrens, 2020-06-29; 2 files, -4/+0)
  OS-specific code (e.g. under `module/os/linux`) does not need to share its code structure with any other operating systems. In particular, the ARC and kmem code need not be similar to the code in illumos, because we won't be syncing this OS-specific code between operating systems. For example, if/when illumos support is added to the common repo, we would add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of this code.
  Therefore, we can simplify the code in the OS-specific ARC and kmem routines. These changes do not impact system behavior, they are purely code cleanup. The changes are:
  Arenas are not used on Linux or FreeBSD (they are always `NULL`), so `heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along with code that uses them.
  In `arc_available_memory()`:
  * `desfree` is unused, remove it
  * rename `freemem` to avoid conflict with pre-existing `#define`
  * remove checks related to arenas
  * use units of bytes, rather than converting from bytes to pages and then back to bytes
  `SPL_KMEM_CACHE_REAP` is unused, remove it.
  `skc_reap` is unused, remove it.
  The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove it.
  `vmem_size()` and associated type and macros are unused, remove them.
  In `arc_memory_throttle()`, use a less confusing variable name to store the result of `arc_free_memory()`.
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Pavel Zakharov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10499
* Avoid installing kernel headers on FreeBSD (Arvind Sankar, 2020-06-27; 7 files, -0/+14)
  The kernel headers are installed for DKMS on Linux, so don't install them unless we're building on Linux.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Arvind Sankar <[email protected]>
  Closes #10506
* Fixes for make dist (Arvind Sankar, 2020-06-26; 7 files, -53/+13)
  Reduce the usage of EXTRA_DIST. If files are conditionally included in _SOURCES, _HEADERS etc., automake is smart enough to dist all files that could possibly be included, but this does not apply to EXTRA_DIST, resulting in make dist depending on the configuration.
  Add some files that were missing altogether in various Makefiles.
  The changes to disted files in this commit (excluding deleted files):
  +./cmd/zed/agents/README.md
  +./etc/init.d/README.md
  +./lib/libspl/os/freebsd/getexecname.c
  +./lib/libspl/os/freebsd/gethostid.c
  +./lib/libspl/os/freebsd/getmntany.c
  +./lib/libspl/os/freebsd/mnttab.c
  -./lib/libzfs/libzfs_core.pc
  -./lib/libzfs/libzfs.pc
  +./lib/libzfs/os/freebsd/libzfs_compat.c
  +./lib/libzfs/os/freebsd/libzfs_fsshare.c
  +./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
  +./lib/libzfs/os/freebsd/libzfs_zmount.c
  +./lib/libzutil/os/freebsd/zutil_compat.c
  +./lib/libzutil/os/freebsd/zutil_device_path_os.c
  +./lib/libzutil/os/freebsd/zutil_import_os.c
  +./module/lua/README.zfs
  +./module/os/linux/spl/README.md
  +./tests/README.md
  +./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
  +./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
  +./tests/zfs-tests/tests/functional/inheritance/README.config
  +./tests/zfs-tests/tests/functional/inheritance/README.state
  +./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
  +./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Arvind Sankar <[email protected]>
  Closes #10501
* ARC shrinking blocks reads/writes (Matthew Ahrens, 2020-06-26; 1 file, -0/+1)
  ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to allow the ARC to shrink when the kernel experiences memory pressure. The ARC shrinker changes `arc_c` via a call to `arc_reduce_target_size()`. Before commit 3ec34e55271d433e3c, the ARC shrinker would also evict data from the ARC to bring `arc_size` down to the new `arc_c`. However, that commit (seemingly inadvertently) made it so that the ARC shrinker no longer evicts any data or waits for eviction to complete.
  Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often all the way to `arc_c_min`. Since it doesn't wait for the actual eviction of data from the ARC, this creates a situation where `arc_size` is more than `arc_c` for the several seconds/minutes it takes for `arc_adjust_zthr` to evict data from the ARC. During this time, arc_get_data_impl() will block, so ZFS can't process read/write requests (e.g. from iSCSI, NFS, or read/write syscalls).
  To ensure that `arc_c` doesn't shrink faster than the adjust thread can keep up, this commit makes the ARC shrinker wait for the eviction to complete, resulting in similar behavior to what we had before commit 3ec34e55271d433e3c.
  Note: commit 3ec34e55271d433e3c is `OpenZFS 9284 - arc_reclaim_thread has 2 jobs` and was integrated in December 2018, and is part of ZoL 0.8.x but not 0.7.x.
  Additionally, when the ARC size is reduced drastically, the `arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any threads that are bound to the same CPU that arc_adjust_zthr is running on will not be able to run for a long time. To ensure that CPU-bound threads can make progress, this commit changes `arc_evict_state_impl()` to make a voluntary preemption call, `cond_resched()`.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Prakash Surya <[email protected]>
  Reviewed-by: Pavel Zakharov <[email protected]>
  Reviewed-by: Tony Nguyen <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  External-issue: DLPX-70703
  Closes #10496
* Drop unnecessary srcdir paths (Arvind Sankar, 2020-06-24; 7 files, -142/+142)
  There's no need to specify the srcdir explicitly in _HEADERS and EXTRA_DIST.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Arvind Sankar <[email protected]>
  Closes #10493
* Add zfs_multihost_interval tunable handler for FreeBSD (Ryan Moeller, 2020-06-23; 1 file, -0/+1)
  This tunable required a handler to be implemented for ZFS_MODULE_PARAM_CALL. Add the handler so the tunable can be declared in common code.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10490
* Add prototypes (Arvind Sankar, 2020-06-18; 3 files, -0/+4)
  Add prototypes/move prototypes to header files.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Arvind Sankar <[email protected]>
  Closes #10470