aboutsummaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* BRT: Fix FICLONE/FICLONERANGE shortened copyTony Hutter2024-02-063-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #15728 Closes #15842
* zpool: Add slot power control, print power statusTony Hutter2024-01-292-0/+60
| | | | | | | | | | | | | | | | | | | | | Add `zpool` flags to control the slot power to drives. This assumes your SAS or NVMe enclosure supports slot power control via sysfs. The new `--power` flag is added to `zpool offline|online|clear`: zpool offline --power <pool> <device> Turn off device slot power zpool online --power <pool> <device> Turn on device slot power zpool clear --power <pool> [device] Turn on device slot power If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power' option is automatically implied for `zpool online` and `zpool clear` and does not need to be passed. zpool status also gets a --power option to print the slot power status. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mart Frauenlob <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15662
* zed: misc vdev_enc_sysfs_path fixesTony Hutter2024-01-291-0/+2
| | | | | | | | | | | | | | | | | | There have been rare cases where the VDEV_ENC_SYSFS_PATH value that zed gets passed is stale. To mitigate this, dynamically check the sysfs path at the time of zed event processing, and use the dynamic value if possible. Note that there will be other times when we can not dynamically detect the sysfs path (like if a disk disappears) and have to rely on the old value for things like turning on the fault LED. That is to say, we can't just blindly use the dynamic path in every case. Also: - Add enclosure sysfs entry when running 'zpool add' - Fix 'slot' and 'enc' zpool.d scripts for nvme Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15462
* Linux 6.8 compat: handle mnt_idmap user_namespace changeRob Norris2024-01-292-5/+61
| | | | | | | | | | | | struct mnt_idmap no longer has a struct user_namespace within it. Work around this by creating a temporary with the copy of the map we need taken from the idmap. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Youzhong Yang <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805
* Linux 6.8 compat: implement strlcpy fallbackRob Norris2024-01-291-0/+49
| | | | | | | | | | Linux has removed strlcpy in favour of strscpy. This implements a fallback implementation of strlcpy for this case. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805
* fix: Uber block label not always found for aux vdevsAmeer Hamza2024-01-291-0/+1
| | | | | | | | | | | | | | | When spare or l2cache (aux) vdev is added during pool creation, spa->spa_uberblock is not dumped until that point. Subsequently, the aux label is never synchronized after its initial creation, resulting in the uberblock label remaining undumped. The uberblock is crucial for lib_blkid in identifying the ZFS partition type. To address this issue, we now ensure sync of the uberblock label once if it's not dumped initially. Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #15737
* compact: workaround for GPL-only symbols on riscv from Linux 6.2Shengqi Chen2024-01-161-2/+13
| | | | | | | | | | | | | | | | | Since Linux 6.2, the implementation of flush_dcache_page on riscv references GPL-only symbol `PageHuge`, breaking the build of zfs. This patch uses existing mechanism to override flush_dcache_page, removing the call to `PageHuge`. According to comments in kernel, it is only used to do some check against HugeTLB pages, which only exist in userspace. ZFS uses flush_dcache_page only on kernel pages, thus this patch will not introduce any behaviour change. See also: torvalds/linux@d33deda, openzfs/zfs@589f59b Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #14974 Closes #15627
* freebsd: fix compile for spa_taskq_read/spa_taskq_write paramsRob Norris2024-01-161-0/+6
| | | | | | Missed in #15695, backporting #15675. Signed-off-by: Rob Norris <[email protected]>
* Linux 6.2 compat: add check for kernel_neon_* availabilityShengqi Chen2024-01-122-0/+12
| | | | | | | | | | | | | | | | | | This patch adds check for `kernel_neon_*` symbols on arm and arm64 platforms to address the following issues: 1. Linux 6.2+ on arm64 has exported them with `EXPORT_SYMBOL_GPL`, so license compatibility must be checked before use. 2. On both arm and arm64, the definitions of these symbols are guarded by `CONFIG_KERNEL_MODE_NEON`, but their declarations are still present. Checking in configuration phase only leads to MODPOST errors (undefined references). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #15711 Closes #14555 Closes: #15401
* Allow block cloning across encrypted datasetsoromenahar2024-01-081-0/+1
| | | | | | | | | | | | | | | | | | | | | When two datasets share the same master encryption key, it is safe to clone encrypted blocks. Currently only snapshots and clones of a dataset share with it the same encryption key. Added a test for: - Clone from encrypted sibling to encrypted sibling with non encrypted parent - Clone from encrypted parent to inherited encrypted child - Clone from child to sibling with encrypted parent - Clone from snapshot to the original datasets - Clone from foreign snapshot to a foreign dataset - Cloning from non-encrypted to encrypted datasets - Cloning from encrypted to non-encrypted datasets Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Original-patch-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Kay Pedersen <[email protected]> Closes #15544
* Linux: Reclaim unused spl_kmem_cache_reclaimAlexander Motin2024-01-081-2/+0
| | | | | | | | | It is unused for 3 years since #10576. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15507
* Update the kstat dataset_name when renaming a zvolAlan Somers2024-01-081-0/+1
| | | | | | | | | | | Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #15482 Closes #15486
* linux 6.7 compat: rework shrinker setup for heap allocationsRob Norris2023-12-211-47/+19
| | | | | | | | | 6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accomodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn
* linux 6.7 compat: use inode atime/mtime accessorsRob Norris2023-12-211-0/+20
| | | | | | | | | 6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in b37f29341 to atime and mtime as well. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn
* FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are NFS visiblermacklem2023-11-291-1/+1
| | | | | | | | | | | | | | Call vfs_exjail_clone() for mounts created under .zfs/snapshot to fill in the mnt_exjail field for the mount. If this is not done, the snapshots under .zfs/snapshot with not be accessible over NFS. This version has the argument name in vfs.h fixed to match that of the name in spl_vfs.c, although it really does not matter. External-issue: https://reviews.freebsd.org/D42672 Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rick Macklem <[email protected]> Closes #15563
* ZIL: Call brt_pending_add() replaying TX_CLONE_RANGEAlexander Motin2023-11-291-2/+1
| | | | | | | | | | | | | | | | | | zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means on actual replay we must take additional references, which would stay in BRT. Without this blocks could be freed prematurely when either original file or its clone are destroyed. I've observed BRT being emptied and the feature being deactivated after ZIL replay completion, which should not have happened. With the patch I see expected stats. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15603
* FreeBSD: Fix the build on FreeBSD 12Alan Somers2023-11-281-1/+2
| | | | | | | | | | | | | | | | It was broken for several reasons: * VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in. * The location of the zlib header moved in 13.0 and 12.1. We can drop support for building on 12.0, which is EoL. * knlist_init lost an argument in 13.0. OpenZFS change 9d0887402ba assumed 13.0 or later. * FreeBSD 13.0 added copy_file_range, and OpenZFS change 67a1b037915 assumed 13.0 or later. Sponsored-by: Axcient Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Alan Somers <[email protected]> Closes #15551
* brt: lift internal definitions into _impl headerRob Norris2023-11-282-0/+200
| | | | | | | | | | So that zdb (and others!) can get at the BRT on-disk structures. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15541
* Add a tunable to disable BRT support.Rich Ercolani2023-11-162-0/+3
| | | | | | | | | Copy the disable parameter that FreeBSD implemented, and extend it to work on Linux as well, until we're sure this is stable. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #15529
* Workaround UBSAN errors for variable arraysTony Hutter2023-11-162-3/+3
| | | | | | | | | | | | | | | This gets around UBSAN errors when using arrays at the end of structs. It converts some zero-length arrays to variable length arrays and disables UBSAN checking on certain modules. It is based off of the patch from #15460. Reviewed-by: Brian Behlendorf <[email protected]> Tested-by: Thomas Lamprecht <[email protected]> Co-authored-by: Thomas Lamprecht <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Issue #15145 Closes #15510
* Fix accounting error for pending sync IO ops in zpool iostatMigeljanImeri2023-11-161-1/+4
| | | | | | | | | | | | | Currently vdev_queue_class_length is responsible for checking how long the queue length is, however, it doesn't check the length when a list is used, rather it just returns whether it is empty or not. To fix this I added a counter variable to vdev_queue_class to keep track of the sync IO ops, and changed vdev_queue_class_length to reference this variable instead. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: MigeljanImeri <[email protected]> Closes #15478
* Unify arc_prune_async() codeAlexander Motin2023-11-083-3/+2
| | | | | | | | | | | | | | There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15456
* Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2Coleman Kane2023-11-082-0/+11
| | | | | | | | | | | | | | In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument for the request_mask was added to generic_fillattr. This is the same request_mask for statx that's present in the most recent API implemented by zpl_getattr_impl. This commit conditionally adds it to the zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...) implementation, when configure determines it's present in the kernel's generic_fillattr(...). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263
* Linux 6.6 compat: use inode_get/set_ctime*(...)Coleman Kane2023-11-081-0/+11
| | | | | | | | | | | | | | | | | | | | In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access to the i_ctime member of struct inode was removed. The new approach is to use accessor methods that exclusively handle passing the timestamp around by value. This change adds new tests for each of these functions and introduces zpl_* equivalents in include/os/linux/zfs/sys/zpl.h. In where the inode_get/set_ctime*() functions exist, these zpl_* calls will be mapped to the new functions. On older kernels, these macros just wrap direct-access calls. The code that operated on an address of ip->i_ctime to call ZFS_TIME_DECODE() now will take a local copy using zpl_inode_get_ctime(), and then pass the address of the local copy when performing the ZFS_TIME_DECODE() call, in all cases, rather than directly accessing the member. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263 Closes #15257
* Add mutex_enter_interruptible() for interruptible sleeping IOCTLsThomas Bertschinger2023-11-064-9/+17
| | | | | | | | | | | | | | | | | | | | | | Many long-running ZFS ioctls lock the spa_namespace_lock, forcing concurrent ioctls to sleep for the mutex. Previously, the only option is to call mutex_enter() which sleeps uninterruptibly. This is a usability issue for sysadmins, for example, if the admin runs `zpool status` while a slow `zpool import` is ongoing, the admin's shell will be locked in uninterruptible sleep for a long time. This patch resolves this admin usability issue by introducing mutex_enter_interruptible() which sleeps interruptibly while waiting to acquire a lock. It is implemented for both Linux and FreeBSD. The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to use this new macro so that the command can be interrupted if it is issued during a concurrent `zpool import` (or other long-running operation). Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Thomas Bertschinger <[email protected]> Closes #15360
* zvol: Remove broken blk-mq optimizationTony Hutter2023-11-061-8/+0
| | | | | | | | | | | | | | | | | This fix removes a dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". This optimization results in an unbootable system after installing an OS on a zvol with blk-mq enabled. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15351
* Properly pad struct tx_cpu to cache lineAlexander Motin2023-11-061-2/+1
| | | | | | | | | We already use ____cacheline_aligned in many places, so add one more instead of seems arbitrary char tc_pad[8]. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15402
* Fix ZED auto-replace for VDEVs using by-id pathsDon Brady2023-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The change is simple -- restore the original code so that the VDEV path is updated when using by-id paths. The more challenging part was to devise a second ZTS test, that would test auto-replace for 'by-id' and help prevent a future regression. With that new test, we can now do an A|B test with , and without, the fix to confirm that auto-replace for by-id paths works. The existing auto-replace test, functional/fault/auto_replace_001_pos, will confirm that we didn't break auto-replace for 'by-vdev' paths. In the original functional/fault/auto_replace_001_pos test, the disk wipe (using dd) was not effective in removing the partitioning since the kernel was never informed of the wipe. Added a call to wipefs(8) so that the kernel is informed and ZED will re-partition the device. Added a validation step that the re-partitioning occurred by confirming that the GPT partition UUID changes. Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15363
* Add zfs_prepare_disk script for disk firmware installTony Hutter2023-11-061-0/+9
| | | | | | | | | | Have libzfs call a special `zfs_prepare_disk` script before a disk is included into the pool. The user can edit this script to add things like a disk firmware update or a disk health check. Use of the script is totally optional. See the zfs_prepare_disk manpage for full details. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15243
* FreeBSD: Improve taskq wrapperAlexander Motin2023-11-061-9/+9
| | | | | | | | | | | | | | | | | | | - Group tqent_task and tqent_timeout_task into a union. They are never used same time. This shrinks taskq_ent_t from 192 to 160 bytes. - Remove tqent_registered. Use tqent_id != 0 instead. - Remove tqent_cancelled. Use taskqueue pending counter instead. - Change tqent_type into uint_t. We don't need to pack it any more. - Change tqent_rc into uint_t, matching refcount(9). - Take shared locks in taskq_lookup(). - Call proper taskqueue_drain_timeout() for TIMEOUT_TASK in taskq_cancel_id() and taskq_wait_id(). - Switch from CK_LIST to regular LIST. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15356
* FreeBSD: Reduce divergence from in-tree sourcesAlexander Motin2023-10-108-7/+12
| | | | | | | | | | This includes random small tweaks, primarily a build fixes, required when ZFS is built as part of FreeBSD base. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15368
* Reduce number of metaslab preload taskq threads.Alexander Motin2023-10-072-3/+2
| | | | | | | | | | | | | | | | | | Before this change ZFS created threads for 50% of CPUs for each top- level vdev. Plus it created the same number of threads for embedded log groups (that have only one metaslab and don't need any preload). As result, on system with 80 CPUs and pool of 60 vdevs this resulted in 4800 metaslab preload threads, that is absolutely insane. This patch changes the preload threads to 50% of CPUs in one taskq per pool, so on the mentioned system it will be only 40 threads. Among other things this fixes zdb on the mentioned system and pool on FreeBSD, that failed to create so many threads in one process. Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15319
* ARC: Remove b_bufcnt/b_ebufcnt from ARC headersAlexander Motin2023-10-072-13/+7
| | | | | | | | | | | | | | | | | | In most cases we do not care about exact number of buffers linked to the header, we just need to know if it is zero, non-zero or one. That can easily be checked just looking on b_buf pointer or in some cases derefencing it. b_ebufcnt is read only once, and in that case we already traverse the list as part of arc_buf_remove(), so second traverse should not be expensive. This reduces L1 ARC header size by 8 bytes and full crypto header by 16 bytes, down to 176 and 232 bytes on FreeBSD respectively. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15350
* ARC: Remove b_cv from struct l1arc_buf_hdrAlexander Motin2023-10-071-2/+0
| | | | | | | | | | | | | Earlier as part of #14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15340
* Add '-u' - nomount flag for zfs setUmer Saleem2023-10-031-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | This commit adds '-u' flag for zfs set operation. With this flag, mountpoint, sharenfs and sharesmb properties can be updated without actually mounting or sharing the dataset. Previously, if dataset was unmounted, and mountpoint property was updated, dataset was not mounted after the update. This behavior is changed in #15240. We mount the dataset whenever mountpoint property is updated, regardless if it's mounted or not. To provide the user with option to keep the dataset unmounted and still update the mountpoint without mounting the dataset, '-u' flag can be used. If any of mountpoint, sharenfs or sharesmb properties are updated with '-u' flag, the property is set to desired value but the operation to (re/un)mount and/or (re/un)share the dataset is not performed and dataset remains as it was before. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15322
* Fix invalid pointer access in trace_dbuf.hChunwei Chen2023-10-031-2/+6
| | | | | | | | | | In dnode_destroy, dn_objset is invalidated. However, it will later call into dbuf_destroy, in which DTRACE_SET_STATE will try to access spa_name via dn_objset causing illegal pointer access. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #15333
* Retire z_nr_znodesMateusz Guzik2023-09-192-2/+0
| | | | | | | | | | | | Added in ab26409db753 ("Linux 3.1 compat, super_block->s_shrink"), with the only consumer which needed the count getting retired in 066e82522101 ("Linux compat: Minimum kernel version 3.10"). The counter gets in the way of not maintaining the list to begin with. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #15274
* Linux 4.20 compat: wrapper function for iov_iter type accessColeman Kane2023-09-191-0/+6
| | | | | | | | | | | | An iov_iter_type() function to access the "type" member of the struct iov_iter was added at one point. Move the conditional logic to decide which method to use for accessing it into a macro and simplify the zpl_uio_init code. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15100
* Linux 6.4 compat: iter_iov() function now used to get old iov memberColeman Kane2023-09-191-0/+6
| | | | | | | | | | | | The iov_iter->iov member is now iov_iter->__iov and must be accessed via the accessor function iter_iov(). Create a wrapper that is conditionally compiled to use the access method appropriate for the target kernel version. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15100
* Linux 6.5 compat: blkdev changesColeman Kane2023-09-191-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | Multiple changes to the blkdev API were introduced in Linux 6.5. This includes passing (void* holder) to blkdev_put, adding a new blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type that replaces uses of fmode_t, and removing an argument from the release handler on block_device_operations that we weren't using. The open function definition has also changed to take gendisk* and blk_mode_t, so update it accordingly, too. Implement local wrappers for blkdev_get_by_path() and vdev_blkdev_put() so that the in-line calls are cleaner, and place the conditionally-compiled implementation details inside of both of these local wrappers. Both calls are exclusively used within vdev_disk.c, at this time. Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE The wrapper function is now used for testing using the appropriate method for the kernel, whether the open mode is writable or not. Emphasize fmode_t arg in zvol_release is not used Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15099
* Linux 6.5 compat: use disk_check_media_change when it existsColeman Kane2023-09-191-0/+1
| | | | | | | | | | | | | When disk_check_media_change() exists, then define zfs_check_media_change() to simply call disk_check_media_change() on the bd_disk member of its argument. Since disk_check_media_change() is newer than when revalidate_disk was present in bops, we should be able to safely do this via a macro, instead of recreating a new implementation of the inline function that forces revalidation. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15101
* Relax error reporting in zpool import and zpool splitUmer Saleem2023-09-021-0/+1
| | | | | | | | | | | | | | | | | For zpool import and zpool split, zpool_enable_datasets is called to mount and share all datasets in a pool. If there is an error while mounting or sharing any dataset in the pool, the status of import or split is reported as failure. However, the changes do show up in zpool list. This commit updates the error reporting in zpool import and zpool split path. More descriptive messages are shown to user in case there is an error during mount or share. Errors in mount or share do not effect the overall status of zpool import and zpool split. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15216
* Avoid save/restoring AMX registers to avoid a SPR erratumRich Ercolani2023-08-271-5/+14
| | | | | | | | | | | | | | | | | | Intel SPR erratum SPR4 says that if you trip into a vmexit while doing FPU save/restore, your AMX register state might misbehave... and by misbehave, I mean save all zeroes incorrectly, leading to explosions if you restore it. Since we're not using AMX for anything, the simple way to avoid this is to just not save/restore those when we do anything, since we're killing preemption of any sort across our save/restores. If we ever decide to use AMX, it's not clear that we have any way to mitigate this, on Linux...but I am not an expert. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14989 Closes #15168
* linux/spl/kmem_cache: undefine `kmem_cache_alloc` before defining itRyan Lahfa2023-08-251-0/+8
| | | | | | | | | | | | | | | | | | | | When compiling a kernel with bcachefs and zfs, the two macros will collide, making it impossible to have both filesystems. It is sufficient to just undefine the macro before calling it. On why this should be in ZFS rather than bcachefs, currently, bcachefs is not a in-tree filesystem, but, it has a reasonably high chance of getting included soon. This avoids the breakage in ZFS early, this patch may be distributed downstream in NixOS and is already used there. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Lahfa <[email protected]> Closes #15144
* Fix some typosMateusz Piotrowski2023-08-251-3/+3
| | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mateusz Piotrowski <[email protected]> Closes #15141
* ZIL: Second attempt to reduce scope of zl_issuer_lock.Alexander Motin2023-08-251-14/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch #14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue #14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15122
* ZIL: Avoid dbuf_read() before dmu_sync(). Alexander Motin2023-08-252-2/+4
| | | | | | | | | | | | | | In most cases dmu_sync() works with dirty records directly and does not need actual data. The only exception is dmu_sync_late_arrival(). To save some CPU time use dmu_buf_hold_noread*() in z*_get_data() and explicitly call dbuf_read() in dmu_sync_late_arrival(). There is also a chance that by that time TXG will already be synced and we won't have to do it at all. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15153
* Remove fastwrite mechanism.Alexander Motin2023-08-254-6/+0
| | | | | | | | | | | | | | | | | | | | | | | Fastwrite was introduced many years ago to improve ZIL writes spread between multiple top-level vdevs by tracking number of allocated but not written blocks and choosing vdev with smaller count. It suposed to reduce ZIL knowledge about allocation, but actually made ZIL to even more actively report allocation code about the allocations, complicating both ZIL and metaslabs code. On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync() was lost many years ago, that was one of the declared benefits. Plus introduction of embedded log metaslab class solved another problem with allocation rotor accounting both normal and log allocations, since in most cases those are now in different metaslab classes. After all that, I'd prefer to simplify already too complicated ZIL, ZIO and metaslab code if the benefit of complexity is not obvious. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15107
* zdb: include cloned blocks in block statisticsRob N2023-08-021-0/+1
| | | | | | | | | | | | | | | | | This gives `zdb -b` support for clone blocks. Previously, it didn't know what clones were, so would count their space allocation multiple times and then report leaked space (or, in debug, would assert trying to claim blocks a second time). This commit fixes those bugs, and reports the number of clones and the space "used" (saved) by them. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15123
* linux: implement filesystem-side copy/clone functions for EL7Rob Norris2023-07-261-0/+4
| | | | | | | | | | | | | Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050