aboutsummaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* Allow to control failfastMariusz Zaborski2022-11-103-2/+10
| | | | | | | | | | | | | | | | | | | | | Linux defaults to setting "failfast" on BIOs, so that the OS will not retry IOs that fail, and instead report the error to ZFS. In some cases, such as errors reported by the HBA driver, not the device itself, we would wish to retry rather than generating vdev errors in ZFS. This new property allows that. This introduces a per vdev option to disable the failfast option. This also introduces a global module parameter to define the failfast mask value. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Sponsored-by: Seagate Technology LLC Submitted-by: Klara, Inc. Closes #14056
* Optionally skip zil_close during zvol_create_minor_implAlan Somers2022-11-081-2/+2
| | | | | | | | | | | | | | | If there were no zil entries to replay, skip zil_close. zil_close waits for a transaction to sync. That can take several seconds, for example during pool import of a resilvering pool. Skipping zil_close can cut the time for "zpool import" from 2 hours to 45 seconds on a resilvering pool with a thousand zvols. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Sponsored-by: Axcient Closes #13999 Closes #14015
* Support idmapped mount in user namespaceyouzhongyang2022-11-083-23/+76
| | | | | | | | | | | | | | | | | | Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping infrastructure to support idmapped mounts of filesystems mounted with an idmapping". Update the OpenZFS accordingly to improve the idmapped mount support. This pull request contains the following changes: - xattr setter functions are fixed to take mnt_ns argument. Without this, cp -p would fail for an idmapped mount in a user namespace. - idmap_util is enhanced/fixed for its use in a user ns context. - One test case added to test idmapped mount in a user ns. Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14097
* freebsd: simplify MD isa_defs.hBrooks Davis2022-11-072-455/+0
| | | | | | | | | | | | | | | | Most of this file was a pile of defines, apparently from Solaris that controlled nothing in the source tree. A few things controlled the definition of unused types or macros which I have removed. Considerable further cleanup is possible including removal of architectures FreeBSD never supported. This file should likely converge with the Linux version to the extent possible. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14127
* freebsd: trim dkio.h to the minimumBrooks Davis2022-11-071-460/+0
| | | | | | | | | | Only DKIOCFLUSHWRITECACHE is required. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14127
* freebsd: add ifdefs around legacy ioctl supportBrooks Davis2022-11-071-0/+13
| | | | | | | | | | | | | | | | Require that ZFS_LEGACY_SUPPORT be defined for legacy ioctl support to be built. For now, define it in zfs_ioctl_compat.h so support is always built. This will allow systems that need never support pre-openzfs tools a mechanism to remove support at build time. This code should be removed once the need for tool compatability is gone. No functional change at this time. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14127
* freebsd: remove no-op vn_renamepath()Brooks Davis2022-11-071-1/+0
| | | | | | | | | | | | vn_renamepath() is a Solaris-ism that was defined away in the FreeBSD port. Now that the only use is in the FreeBSD zfs_vnops_os.c, drop it entierly. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14127
* freebsd: remove unused vn_rename()Brooks Davis2022-11-071-9/+0
| | | | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14127
* Make 1-bit bitfields unsignedBrooks Davis2022-11-031-5/+5
| | | | | | | | | | | | | | | | This fixes -Wsingle-bit-bitfield-constant-conversion warning from clang-16 like: lib/libzfs/libzfs_dataset.c:4529:19: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion] flags.nounmount = B_TRUE; ^ ~~~~~~ Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14125
* libuutil: deobfuscate internal pointersBrooks Davis2022-11-031-25/+6
| | | | | | | | | | | | | uu_avl and uu_list stored internal next/prev pointers and parent pointers (unused) obfuscated (byte swapped) to hide them from a long forgotten leak checker (No one at the 2022 OpenZFS developers meeting could recall the history.) This would break on CHERI systems and adds no obvious value. Rename the members, use proper types rather than uintptr_t, and eliminate the related macros. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14126
* zfs_onexit_add_cb: make action_handle point to a uintptr_tBrooks Davis2022-11-031-1/+1
| | | | | | | | | | Avoid assuming than a uint64_t can hold a pointer and reduce the number of casts in the process. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14131
* acl: use uintptr_t for ace walker cookiesBrooks Davis2022-11-031-1/+1
| | | | | | | | | | Avoid assuming that a pointer can fit in a uint64_t and use uintptr_t instead. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14131
* linux isa_defs.h: Don't define _ALIGNMENT_REQUIREDBrooks Davis2022-11-031-32/+0
| | | | | | | | | Nothing consumes this definition so stop defining it. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14128
* Improve RISC-V supportBrooks Davis2022-11-031-1/+5
| | | | | | | | | Check __riscv_xlen == 64 rather than _LP64 and define _LP64 if missing. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #14128
* Introduce kmem_scnprintf()Richard Yao2022-10-294-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `snprintf()` is meant to protect against buffer overflows, but operating on the buffer using its return value, possibly by calling it again, can cause a buffer overflow, because it will return how many characters it would have written if it had enough space even when it did not. In a number of places, we repeatedly call snprintf() by successively incrementing a buffer offset and decrementing a buffer length, by its return value. This is a potentially unsafe usage of `snprintf()` whenever the buffer length is reached. CodeQL complained about this. To fix this, we introduce `kmem_scnprintf()`, which will return 0 when the buffer is zero or the number of written characters, minus 1 to exclude the NULL character, when the buffer was too small. In all other cases, it behaves like snprintf(). The name is inspired by the Linux and XNU kernels' `scnprintf()`. The implementation was written before I thought to look at `scnprintf()` and had a good name for it, but it turned out to have identical semantics to the Linux kernel version. That lead to the name, `kmem_scnprintf()`. CodeQL only catches this issue in loops, so repeated use of snprintf() outside of a loop was not caught. As a result, a thorough audit of the codebase was done to examine all instances of `snprintf()` usage for potential problems and a few were caught. Fixes for them are included in this patch. Unfortunately, ZED is one of the places where `snprintf()` is potentially used incorrectly. Since using `kmem_scnprintf()` in it would require changing how it is linked, we modify its usage to make it safe, no matter what buffer length is used. In addition, there was a bug in the use of the return value where the NULL format character was not being written by pwrite(). That has been fixed. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14098
* debug: fix output from VERIFY0 assertionRob N ★2022-10-282-2/+2
| | | | | | | | | | The previous version reported all the right info, but the VERIFY3 name made a little more confusing when looking for the matching location in the source code. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Rob N ★ <[email protected]> Closes #14099
* zfs_rename: support RENAME_* flagsAleksa Sarai2022-10-287-3/+53
| | | | | | | | | | | | | | | | | | | | | | Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we create two new transaction types for the two flags which need transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT). RENAME_NOREPLACE does not need any ZIL support because we know that if the operation succeeded before creating the ZIL entry, there was no file to be clobbered and thus it can be treated as a regular TX_RENAME. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Snajdr <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Closes #12209 Closes #14070
* zfs_rename: restructure to have cleaner fallbacksAleksa Sarai2022-10-281-0/+1
| | | | | | | | | | | | | | | | | This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support for ZoL, but the changes here allow for far nicer fallbacks than the previous implementation (the source and target are re-linked in case of the final link failing). In addition, a small cleanup was done for the "target exists but is a different type" codepath so that it's more understandable. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Closes #12209 Closes #14070
* debug: add VERIFY_{IMPLY,EQUIV} variantsAleksa Sarai2022-10-281-8/+12
| | | | | | | | | | This allows for much cleaner VERIFY-level assertions. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Closes #14070
* Remove zpl_revalidate: fix snapshot rollbackPavel Snajdr2022-10-284-8/+26
| | | | | | | | | | | | | | | | | | | Open files, which aren't present in the snapshot, which is being roll-backed to, need to disappear from the visible VFS image of the dataset. Kernel provides d_drop function to drop invalid entry from the dcache, but inode can be referenced by dentry multiple dentries. The introduced zpl_d_drop_aliases function walks and invalidates all aliases of an inode. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pavel Snajdr <[email protected]> Closes #9600 Closes #14070
* Fix zio_flag_t print formatyouzhongyang2022-10-281-1/+1
| | | | | | | | Follow up for 4938d01d which changed zio_flag from enum to uint64_t. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14100
* Convert enum zio_flag to uint64_tRichard Yao2022-10-273-53/+57
| | | | | | | | | | | | | We ran out of space in enum zio_flag for additional flags. Rather than introduce enum zio_flag2 and then modify a bunch of functions to take a second flags variable, we expand the type to 64 bits via `typedef uint64_t zio_flag_t`. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Allan Jude <[email protected]> Co-authored-by: Richard Yao <[email protected]> Closes #14086
* FreeBSD: vn_flush_cached_data: observe vnode locking contractAndriy Gapon2022-10-261-0/+2
| | | | | | | | vm_object_page_clean() expects that the associated vnode is locked as VOP_PUTPAGES() may get called on the vnode. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #14079
* Silence objtool warnings from 55d7afa4Richard Yao2022-10-262-1/+17
| | | | | | | | | | | | The use of __noreturn__ in 55d7afa4adbb4ca569e9c4477a7d121f4dc0bfbd on spl_panic() caused objtool warnings on Linux when the kernel is built with CONFIG_STACK_VALIDATION=y. This patch works around that by restricting the application of __noreturn__ to builds for static analyzers. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14068
* Optimize microzapsAlexander Motin2022-10-202-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Microzap on-disk format does not include a hash tree, expecting one to be built in RAM during mzap_open(). The built tree is linked to DMU user buffer, freed when original DMU buffer is dropped from cache. I've found that workloads accessing many large directories and having active eviction from DMU cache spend significant amount of time building and then destroying the trees. I've also found that for each 64 byte mzap element additional 64 byte tree element is allocated, that is a waste of memory and CPU caches. Improve memory efficiency of the hash tree by switching from AVL-tree to B-tree. It allows to save 24 bytes per element just on pointers. Save 32 bits on mze_hash by storing only upper 32 bits since lower 32 bits are always zero for microzaps. Save 16 bits on mze_chunkid, since microzap can never have so many elements. Respectively with the 16 bits there can be no more than 16 bits of collision differentiators. As result, struct mzap_ent now drops from 48 (rounded to 64) to 8 bytes. Tune B-trees for small data. Reduce BTREE_CORE_ELEMS from 128 to 126 to allow struct zfs_btree_core in case of 8 byte elements to pack into 2KB instead of 4KB. Aside of the microzaps it should also help 32bit range trees. Allow custom B-tree leaf size to reduce memmove() time. Split zap_name_alloc() into zap_name_alloc() and zap_name_init_str(). It allows to not waste time allocating/freeing memory when processing multiple names in a loop during mzap_open(). Together on a pool with 10K directories of 1800 files each and DMU cache limited to 128MB this reduces time of `find . -name zzz` by 41% from 7.63s to 4.47s, and saves additional ~30% of CPU time on the DMU cache reclamation. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14039
* Add options to zfs redundant_metadata propertyAkash B2022-10-192-1/+11
| | | | | | | | | | | | | | | | | Currently, additional/extra copies are created for metadata in addition to the redundancy provided by the pool(mirror/raidz/draid), due to this 2 times more space is utilized per inode and this decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, hence can reduce the space consumed by the additional metadata copies and increase the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #13680
* Support idmapped mountyouzhongyang2022-10-198-19/+58
| | | | | | | | | | | | Adds support for idmapped mounts. Supported as of Linux 5.12 this functionality allows user and group IDs to be remapped without changing their state on disk. This can be useful for portable home directories and a variety of container related use cases. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #12923 Closes #13671
* Cleanup: 64-bit kernel module parameters should use fixed width typesRichard Yao2022-10-1311-36/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Various module parameters such as `zfs_arc_max` were originally `uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long` for Linux compatibility because Linux's kernel default module parameter implementation did not support 64-bit types on 32-bit platforms. This caused problems when porting OpenZFS to Windows because its LLP64 memory model made `unsigned long` a 32-bit type on 64-bit, which created the undesireable situation that parameters that should accept 64-bit values could not on 64-bit Windows. Upon inspection, it turns out that the Linux kernel module parameter interface is extensible, such that we are allowed to define our own types. Rather than maintaining the original type change via hacks to to continue shrinking module parameters on 32-bit Linux, we implement support for 64-bit module parameters on Linux. After doing a review of all 64-bit kernel parameters (found via the man page and also proposed changes by Andrew Innes), the kernel module parameters fell into a few groups: Parameters that were originally 64-bit on Illumos: * dbuf_cache_max_bytes * dbuf_metadata_cache_max_bytes * l2arc_feed_min_ms * l2arc_feed_secs * l2arc_headroom * l2arc_headroom_boost * l2arc_write_boost * l2arc_write_max * metaslab_aliquot * metaslab_force_ganging * zfetch_array_rd_sz * zfs_arc_max * zfs_arc_meta_limit * zfs_arc_meta_min * zfs_arc_min * zfs_async_block_max_blocks * zfs_condense_max_obsolete_bytes * zfs_condense_min_mapping_bytes * zfs_deadman_checktime_ms * zfs_deadman_synctime_ms * zfs_initialize_chunk_size * zfs_initialize_value * zfs_lua_max_instrlimit * zfs_lua_max_memlimit * zil_slog_bulk Parameters that were originally 32-bit on Illumos: * zfs_per_txg_dirty_frees_percent Parameters that were originally `ssize_t` on Illumos: * zfs_immediate_write_sz Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It has been upgraded to 64-bit. Parameters that were `long`/`unsigned long` because of Linux/FreeBSD influence: * l2arc_rebuild_blocks_min_l2size * zfs_key_max_salt_uses * zfs_max_log_walking * zfs_max_logsm_summary_length * zfs_metaslab_max_size_cache_sec * zfs_min_metaslabs_to_flush * zfs_multihost_interval * zfs_unflushed_log_block_max * zfs_unflushed_log_block_min * zfs_unflushed_log_block_pct * zfs_unflushed_max_mem_amt * zfs_unflushed_max_mem_ppm New parameters that do not exist in Illumos: * l2arc_trim_ahead * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_arc_sys_free * zfs_deadman_ziotime_ms * zfs_delete_blocks * zfs_history_output_max * zfs_livelist_max_entries * zfs_max_async_dedup_frees * zfs_max_nvlist_src_size * zfs_rebuild_max_segment * zfs_rebuild_vdev_limit * zfs_unflushed_log_txg_max * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift * zfs_vnops_read_chunk_size * zvol_max_discard_blocks Rather than clutter the lists with commentary, the module parameters that need comments are repeated below. A few parameters were defined in Linux/FreeBSD specific code, where the use of ulong/long is not an issue for portability, so we leave them alone: * zfs_delete_blocks * zfs_key_max_salt_uses * zvol_max_discard_blocks The documentation for a few parameters was found to be incorrect: * zfs_deadman_checktime_ms - incorrectly documented as int * zfs_delete_blocks - not documented as Linux only * zfs_history_output_max - incorrectly documented as int * zfs_vnops_read_chunk_size - incorrectly documented as long * zvol_max_discard_blocks - incorrectly documented as ulong The documentation for these has been fixed, alongside the changes to document the switch to fixed width types. In addition, several kernel module parameters were percentages or held ashift values, so being 64-bit never made sense for them. They have been downgraded to 32-bit: * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_per_txg_dirty_frees_percent * zfs_unflushed_log_block_pct * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift Of special note are `zfs_vdev_max_auto_ashift` and `zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`, and passed to the kernel as `ulong`. This is inherently buggy on big endian 32-bit Linux, since the values would not be written to the correct locations. 32-bit FreeBSD was unaffected because its sysctl code correctly treated this as a `uint64_t`. Lastly, a code comment suggests that `zfs_arc_sys_free` is Linux-specific, but there is nothing to indicate to me that it is Linux-specific. Nothing was done about that. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Original-patch-by: Andrew Innes <[email protected]> Original-patch-by: Jorgen Lundman <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13984 Closes #14004
* Linux: Remove ZFS_AC_KERNEL_SRC_MODULE_PARAM_CALL_CONST autotools checkRichard Yao2022-10-131-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | On older kernels, the definition for `module_param_call()` typecasts function pointers to `(void *)`, which triggers -Werror, causing the check to return false when it should return true. Fixing this breaks the build process on some older kernels because they define a `__check_old_set_param()` function in their headers that checks for a non-constified `->set()`. We workaround that through the c preprocessor by defining `__check_old_set_param(set)` to `(set)`, which prevents the build failures. However, it is now apparent that all kernels that we support have adopted the GRSecurity change, so there is no need to have an explicit autotools check for it anymore. We therefore remove the autotools check, while adding the workaround to our headers for the build time non-constified `->set()` check done by older kernel headers. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13984 Closes #14004
* Expose libzutil error info in libpc_handle_tUmer Saleem2022-10-041-4/+23
| | | | | | | | | | | | | | | | | | | | In libzutil, for zpool_search_import and zpool_find_config, we use libpc_handle_t internally, which does not maintain error code and it is not exposed in the interface. Due to this, the error information is not propagated to the caller. Instead, an error message is printed on stderr. This commit adds lpc_error field in libpc_handle_t and exposes it in the interface, which can be used by the users of libzutil to get the appropriate error information and handle it accordingly. Users of the API can also control if they want to print the error message on stderr. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #13969
* Fix double const qualifier declarationsTino Reichardt2022-09-304-7/+7
| | | | | | | | | | | | | | | | | | | | Some header files define structures like this one: typedef const struct zio_checksum_info { /* ... */ const char *ci_name; } zio_abd_checksum_func_t; So we can use `zio_abd_checksum_func_t` for const declarations now. It's not needed that we use the `const` qualifier again like this: `const zio_abd_checksum_func_t *varname;` This patch solves the double const qualifiers, which were found by smatch. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13961
* Reduce false positives from Static AnalyzersRichard Yao2022-09-3010-35/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both Clang's Static Analyzer and Synopsys' Coverity would ignore assertions. Following Clang's advice, we annotate our assertions: https://clang-analyzer.llvm.org/annotations.html#custom_assertions This makes both Clang's Static Analyzer and Coverity properly identify assertions. This change reduced Clang's reported defects from 246 to 180. It also reduced the false positives reported by Coverityi by 10, while enabling Coverity to find 9 more defects that previously were false negatives. A couple examples of this would be CID-1524417 and CID-1524423. After submitting a build to coverity with the modified assertions, CID-1524417 disappeared while the report for CID-1524423 no longer claimed that the assertion tripped. Coincidentally, it turns out that it is possible to more accurately annotate our headers than the Coverity modelling file permits in the case of format strings. Since we can do that and this patch annotates headers whenever `__coverity_panic__()` would have been used in the model file, we drop all models that use `__coverity_panic__()` from the model file. Upon seeing the success in eliminating false positives involving assertions, it occurred to me that we could also modify our headers to eliminate coverity's false positives involving byte swaps. We now have coverity specific byteswap macros, that do nothing, to disable Coverity's false positives when we do byte swaps. This allowed us to also drop the byteswap definitions from the model file. Lastly, a model file update has been done beyond the mentioned deletions: * The definitions of `umem_alloc_aligned()`, `umem_alloc()` andi `umem_zalloc()` were originally implemented in a way that was intended to inform coverity that when KM_SLEEP has been passed these functions, they do not return NULL. A small error in how this was done was found, so we correct it. * Definitions for umem_cache_alloc() and umem_cache_free() have been added. In practice, no false positives were avoided by making these changes, but in the interest of correctness from future coverity builds, we make them anyway. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13902
* zed: mark disks as REMOVED when they are removedAmeer Hamza2022-09-288-1/+65
| | | | | | | | | | | | | ZED does not take any action for disk removal events if there is no spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on removal event. This means that if you are running zed and remove a disk, it will be properly marked as REMOVED. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13797
* DMU_BACKUP_FEATURE: indicate that bit 28 and 29 are reservedChristian Schwarz2022-09-271-0/+7
| | | | | | | | | | | | | | | Bit 28 is used by an internal Nutanix feature which might be upstreamed in the future. Bit 29 is the last unused bit. It is reserved to indicate a to-be-designed extension to the stream format which will accomodate more feature flags. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Issue #13795 Closes #13796
* DMU_BACKUP_FEATURE: remove unused BLAKE3 featureChristian Schwarz2022-09-271-2/+1
| | | | | | | | | | | Commit 985c33b132 added DMU_BACKUP_FEATURE_BLAKE3 but it is not used by the code. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Issue #13795 Closes #13796
* Cleanup: Specify unsignedness on things that should not be signedRichard Yao2022-09-2710-21/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In #13871, zfs_vdev_aggregation_limit_non_rotating and zfs_vdev_aggregation_limit being signed was pointed out as a possible reason not to eliminate an unnecessary MAX(unsigned, 0) since the unsigned value was assigned from them. There is no reason for these module parameters to be signed and upon inspection, it was found that there are a number of other module parameters that are signed, but should not be, so we make them unsigned. Making them unsigned made it clear that some other variables in the code should also be unsigned, so we also make those unsigned. This prevents users from setting negative values that could potentially cause bad behaviors. It also makes the code slightly easier to understand. Mostly module parameters that deal with timeouts, limits, bitshifts and percentages are made unsigned by this. Any that are boolean are left signed, since whether booleans should be considered signed or unsigned does not matter. Making zfs_arc_lotsfree_percent unsigned caused a `zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was removed. Removing the check was also necessary to prevent a compiler error from -Werror=type-limits. Several end of line comments had to be moved to their own lines because replacing int with uint_t caused us to exceed the 80 character limit enforced by cstyle.pl. The following were kept signed because they are passed to taskq_create(), which expects signed values and modifying the OpenSolaris/Illumos DDI is out of scope of this patch: * metaslab_load_pct * zfs_sync_taskq_batch_pct * zfs_zil_clean_taskq_nthr_pct * zfs_zil_clean_taskq_minalloc * zfs_zil_clean_taskq_maxalloc * zfs_arc_prune_task_threads Also, negative values in those parameters was found to be harmless. The following were left signed because either negative values make sense, or more analysis was needed to determine whether negative values should be disallowed: * zfs_metaslab_switch_threshold * zfs_pd_bytes_max * zfs_livelist_min_percent_shared zfs_multihost_history was made static to be consistent with other parameters. A number of module parameters were marked as signed, but in reality referenced unsigned variables. upgrade_errlog_limit is one of the numerous examples. In the case of zfs_vdev_async_read_max_active, it was already uint32_t, but zdb had an extern int declaration for it. Interestingly, the documentation in zfs.4 was right for upgrade_errlog_limit despite the module parameter being wrongly marked, while the documentation for zfs_vdev_async_read_max_active (and friends) was wrong. It was also wrong for zstd_abort_size, which was unsigned, but was documented as signed. Also, the documentation in zfs.4 incorrectly described the following parameters as ulong when they were int: * zfs_arc_meta_adjust_restarts * zfs_override_estimate_recordsize They are now uint_t as of this patch and thus the man page has been updated to describe them as uint. dbuf_state_index was left alone since it does nothing and perhaps should be removed in another patch. If any module parameters were missed, they were not found by `grep -r 'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed, but only because they were in files that had hits. This patch intentionally did not attempt to address whether some of these module parameters should be elevated to 64-bit parameters, because the length of a long on 32-bit is 32-bit. Lastly, it was pointed out during review that uint_t is a better match for these variables than uint32_t because FreeBSD kernel parameter definitions are designed for uint_t, whose bit width can change in future memory models. As a result, we change the existing parameters that are uint32_t to use uint_t. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13875
* Cleanup: Switch to strlcpy from strncpyRichard Yao2022-09-272-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity found a bug in `zfs_secpolicy_create_clone()` where it is possible for us to pass an unterminated string when `zfs_get_parent()` returns an error. Upon inspection, it is clear that using `strlcpy()` would have avoided this issue. Looking at the codebase, there are a number of other uses of `strncpy()` that are unsafe and even when it is used safely, switching to `strlcpy()` would make the code more readable. Therefore, we switch all instances where we use `strncpy()` to use `strlcpy()`. Unfortunately, we do not portably have access to `strlcpy()` in tests/zfs-tests/cmd/zfs_diff-socket.c because it does not link to libspl. Modifying the appropriate Makefile.am to try to link to it resulted in an error from the naming choice used in the file. Trying to disable the check on the file did not work on FreeBSD because Clang ignores `#undef` when a definition is provided by `-Dstrncpy(...)=...`. We workaround that by explictly including the C file from libspl into the test. This makes things build correctly everywhere. We add a deprecation warning to `config/Rules.am` and suppress it on the remaining `strncpy()` usage. `strlcpy()` is not portably avaliable in tests/zfs-tests/cmd/zfs_diff-socket.c, so we use `snprintf()` there as a substitute. This patch does not tackle the related problem of `strcpy()`, which is even less safe. Thankfully, a quick inspection found that it is used far more correctly than strncpy() was used. A quick inspection did not find any problems with `strcpy()` usage outside of zhack, but it should be said that I only checked around 90% of them. Lastly, some of the fields in kstat_t varied in size by 1 depending on whether they were in userspace or in the kernel. The origin of this discrepancy appears to be 04a479f7066ccdaa23a6546955303b172f4a6909 where it was made for no apparent reason. It conflicts with the comment on KSTAT_STRLEN, so we shrink the kernel field sizes to match the userspace field sizes. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13876
* Enforce "-F" flag on resuming recv of full/newfs on existing datasetJitendra Patidar2022-09-272-0/+2
| | | | | | | | | | | | | | | | | | | | | | When receiving full/newfs on existing dataset, then it should be done with "-F" flag. Its enforced for initial receive in checks done in zfs_receive_one function of libzfs. Similarly, on resuming full/newfs recv on existing dataset, it should be done with "-F" flag. When dataset doesn't exist, then full/new recv is done on newly created dataset and it's marked INCONSISTENT. But when receiving on existing dataset, recv is first done on %recv and its marked INCONSISTENT. Existing dataset is not marked INCONSISTENT. Resume of full/newfs receive with dataset not INCONSISTENT indicates that its resuming newfs on existing dataset. So, enforce "-F" flag in this case. Also return an error from dmu_recv_resume_begin_check() in zfs kernel, when its resuming full/newfs recv without force. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Jitendra Patidar <[email protected]> Closes #13856 Closes #13857
* Fix double declaration of getauxval() for FreeBSD PPCTino Reichardt2022-09-261-1/+3
| | | | | | | | | | | The extern declaration is only for Linux, move this line into the right #ifdef section. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Co-authored-by: Martin Matuska <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13934 Closes #13936
* Dynamically size dbuf hash mutex arrayBrian Behlendorf2022-09-221-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incorrectly sizing the array of hash locks used to protect the dbuf hash table can lead to contention and reduce performance. We could unconditionally allocate a larger array for the locks but it's wasteful, particularly for a low-memory system. Instead, dynamically allocate the array of locks and scale it based on total system memory. Additionally, add a new `dbuf_mutex_cache_shift` module option which can be used to override the hash lock array size. This is disabled by default (dbuf_mutex_hash_shift=0) and can only be set at module load time. The minimum target array size is set to 8192, this matches the current constant value. Note that the count of the dbuf hash table and count of the mutex array were added to the /proc/spl/kstat/zfs/dbufstats kstat. Finally, this change removes the _KERNEL conditional checks. These were not required since for the user space build there is no difference between the kmem and vmem interfaces. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13928
* Revert "Reduce dbuf_find() lock contention"Brian Behlendorf2022-09-221-3/+4
| | | | | | | | | | This reverts commit 34dbc618f50cfcd392f90af80c140398c38cbcd1. While this change resolved the lock contention observed for certain workloads, it inadventantly reduced the maximum hash inserts/removes per second. This appears to be due to the slightly higher acquisition cost of a rwlock vs a mutex. Reviewed-by: Brian Behlendorf <[email protected]>
* Retire ZFS_TEARDOWN_TRY_ENTER_READMateusz Guzik2022-09-202-9/+0
| | | | | | | | | | There were never any users and it so happens the operation is not even supported by rrm locks -- the macros were wrong for Linux and FreeBSD when not using it's RMS locks. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #13906
* Add membar_syncMateusz Guzik2022-09-202-0/+2
| | | | | | | | | | | | | Provides the missing full barrier variant to the membar primitive set. While not used right now, this is probably going to change down the road. Name taken from Solaris, to follow the existing routines. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #13907
* FreeBSD: Cleanup zfs_readdir()Richard Yao2022-09-203-85/+0
| | | | | | | | | | | | The FreeBSD project's coverity scans found dead code in `zfs_readdir()`. Also, the comment above `zfs_readdir()` is out of date. I fixed the comment and deleted all of the dead code, plus additional dead code that was found upon review. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13924
* Cleanup: Remove unused uu_pname codeRichard Yao2022-09-191-33/+0
| | | | | | | | | Coverity caught a possible NULL pointer dereference in dead code. We can delete it all. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13900
* FreeBSD: fix static module build broken in 7bb707ffaMartin Matuška2022-09-191-0/+34
| | | | | | | | | | param_set_arc_free_target(SYSCTL_HANDLER_ARGS) and param_set_arc_no_grow_shift(SYSCTL_HANDLER_ARGS) defined in sysctl_os.c must be made available to arc_os.c. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Martin Matuska <[email protected]> Closes #13915
* Add PPC cpu feature tests for FreeBSD and LinuxTino Reichardt2022-09-165-68/+158
| | | | | | | | | | | | | | | | | | | | | | | | | Add needed cpu feature tests for powerpc architecture. Overview: zfs_altivec_available() - needed by RAID-Z zfs_vsx_available() - needed by BLAKE3 zfs_isa207_available() - needed by SHA2 Part 1 - Userspace - use getauxval() for Linux and elf_aux_info() for FreeBSD - direct including <sys/auxv.h> fails with double definitions - so we self define the needed functions and definitions Part 2 - Kernel space FreeBSD - use exported cpu_features of <powerpc/cpu.h> Part 3 - Kernel space Linux - use cpu_has_feature() function of <asm/cpufeature.h> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13725
* Fix BLAKE3 tuneable and module loading on Linux and FreeBSDTino Reichardt2022-09-162-17/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apply similar options to BLAKE3 as it is done for zfs_fletcher_4_impl. The zfs module parameter on Linux changes from icp_blake3_impl to zfs_blake3_impl. You can check and set it on Linux via sysfs like this: ``` [bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl cycle [fastest] generic sse2 sse41 avx2 [bash]# echo sse2 > /sys/module/zfs/parameters/zfs_blake3_impl [bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl cycle fastest generic [sse2] sse41 avx2 ``` The modprobe module parameters may also be used now: ``` [bash]# modprobe zfs zfs_blake3_impl=sse41 [bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl cycle fastest generic sse2 [sse41] avx2 ``` On FreeBSD the BLAKE3 implementation can be set via sysctl like this: ``` [bsd]# sysctl vfs.zfs.blake3_impl vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 avx2 [bsd]# sysctl vfs.zfs.blake3_impl=sse2 vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 avx2 \ -> cycle fastest generic [sse2] sse41 avx2 ``` This commit changes also some Blake3 internals like these: - blake3_impl_ops_t was renamed to blake3_ops_t - all functions are named blake3_impl_NAME() now Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Co-authored-by: Ryan Moeller <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13725
* Handle ECKSUM as new EZFS_CKSUM ‒ "insufficient replicas"наб2022-09-161-0/+1
| | | | | | | | | | Add a meaningful error message for ECKSUM to common error messages. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #6805 Closes #13808 Closes #13898
* zfs recv hangs if max recordsize is less than received recordsizeAmeer Hamza2022-09-162-10/+10
| | | | | | | | | | - Some optimizations for bqueue enqueue/dequeue. - Added a fix to prevent deadlock when both bqueue_enqueue_impl() and bqueue_dequeue() waits for signal to be triggered. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13855