aboutsummaryrefslogtreecommitdiffstats
path: root/module/os/linux/zfs
Commit message (Collapse)AuthorAgeFilesLines
* Adding Direct IO SupportBrian Atkinson2024-09-146-59/+592
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write request the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls the if a O_DIRECT writes that can occur to a top-level VDEV before a checksum verify is run before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issues as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup is not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module parameter to 0, it mimics as if the direct dataset property is set to disabled. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Co-authored-by: Mark Maybee <[email protected]> Co-authored-by: Matt Macy <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Closes #10018
* Fix an uninitialized data access (#16511)Alan Somers2024-09-101-1/+1
| | | | | | | | | | | | | | zfs_acl_node_alloc allocates an uninitialized data buffer, but upstack zfs_acl_chmod only partially initializes it. KMSAN reported that this memory remained uninitialized at the point when it was read by lzjb_compress, which suggests a possible kernel memory disclosure bug. The full KMSAN warning may be found in the PR. https://github.com/openzfs/zfs/pull/16511 Signed-off-by: Alan Somers <[email protected]> Sponsored by: Axcient Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Make mount.zfs(8) calling zfs_mount_at for legacy mounts as wellLow-power2024-08-231-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 329e2ffa4bca456e65c3db7f5c5c04931c551b61 has made mount.zfs(8) to call libzfs function 'zfs_mount_at', in order to propagate dataset properties into mount options. This fix however, is limited to a special use case where mount.zfs(8) is used in initrd with option '-o zfsutil'. If either initrd or the user need to use mount.zfs(8) to mount a file system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and the original issue #7947 will still happen. Since the existing code already excluded the possibility of calling 'zfs_mount_at' when it was invoked as a helper program from zfs(8), by checking 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to avoid calling 'zfs_mount_at' without '-o zfsutil'. An exception however, is when mount.zfs(8) was invoked with '-o remount' to update the mount options for an existing mount point. In this case call mount(2) directly without modifying the mount options passed from command line. Furthermore, don't run mount.zfs(8) helper for automounting snapshot. The above change to make mount.zfs(8) to call 'zfs_mount_at' apparently caused it to trigger an automount for the snapshot directory. When the helper was invoked as a result of a snapshot automount, an infinite recursion will occur. Since the need of invoking user mode mount(8) for automounting was to overcome that the 'vfs_kern_mount' being GPL-only, just run mount(8) without the mount.zfs(8) helper by adding option '-i'. Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: WHR <[email protected]> Closes #16393
* abd_os: split userspace and Linux kernel codeRob Norris2024-08-211-148/+3
| | | | | | | | | | | | | | The Linux abd_os.c serves double-duty as the userspace scatter abd implementation, by carrying an emulation of kernel scatterlists. This commit lifts common and userspace-specific parts out into a separate abd_os.c for libzpool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* abd: remove ABD_FLAG_ZEROSRob Norris2024-08-211-1/+1
| | | | | | | | | | | Nothing ever checks it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16253
* Ignore zfs_arc_shrinker_limit in direct reclaim modeshodanshok2024-08-211-3/+3
| | | | | | | | | | | | | | | | | zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse due to excessive memory reclaim. However, when the kernel is in direct reclaim mode (ie: low on memory), limiting ARC reclaim increases OOM risk. This is especially true on system without (or with inadequate) swap. This patch ignores zfs_arc_shrinker_limit when the kernel is in direct reclaim mode, avoiding most OOM. It also restores "echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop (almost) all ARC. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Adam Moss <[email protected]> Signed-off-by: Gionatan Danti <[email protected]> Closes #16313
* linux/zvol_os.c: cleanup limits for non-blk mq caseAmeer Hamza2024-08-201-5/+0
| | | | | | | | | | | Rob Noris suggested that we could clean up redundant limits for the case of non-blk mq scenario. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16462
* linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernelAmeer Hamza2024-08-201-0/+1
| | | | | | | | | | | | | | | | In kernels 6.8 and later, the zvol block device is allocated with qlimits passed during initialization. However, the zvol driver does not set `max_hw_discard_sectors`, which is necessary to properly initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting `max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve the issue. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16462
* Skip ro check for snaps when multi-mountChunwei Chen2024-08-191-1/+7
| | | | | | | | | | Skip ro check for snapshots since they are always ro regardless if ro flag is passed by mount or not. This allows multi-mounting snapshots without requiring to specify ro flag. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #16299
* linux/zvol_os: fix zvol queue limits initializationAmeer Hamza2024-08-151-3/+4
| | | | | | | | | | | | | | | | | zvol queue limits initialization depends on `zv_volblocksize`, but it is initialized later, leading to several limits being initialized with incorrect values, including `max_discard_*` limits. This also causes `blkdiscard` command to consistently fail, as `blk_ioctl_discard` reads `bdev_max_discard_sectors()` limits as 0, leading to failure. The fix is straightforward: initialize `zv->zv_volblocksize` early, before setting the queue limits. This PR should fix `zvol/zvol_misc/zvol_misc_trim` failure on recent PRs, as the test case issues `blkdiscard` for a zvol. Additionally, `zvol_misc_trim` was recently enabled in `6c7d41a`, which is why the issue wasn't identified earlier. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16454
* Linux 6.10 compat: Fix zvol NULL pointer deferenceTony Hutter2024-08-151-3/+4
| | | | | | | | zvol_alloc_non_blk_mq()->blk_queue_set_write_cache() needs the disk queue setup to prevent a NULL pointer deference. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16453
* Fix projid accounting for xattr objectsJitendra Patidar2024-08-141-8/+20
| | | | | | | | | | | | | | | | | | zpool upgraded with 'feature@project_quota' needs re-layout of SA's to fix the SA_ZPL_PROJID at SA_PROJID_OFFSET (128). Its necessary for the correct accounting of object usage against its projid. Old object (created before upgrade) when gets a projid assigned, its SA gets re-layout via sa_add_projid(). If object has xattr dir, SA of xattr dir also gets re-layout. But SA re-layout of xattr objects inside a xattr dir is not done. Fix zfs_setattr_dir() to re-layout SA's on xattr objects, when setting projid on old xattr object (created before upgrade). Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jitendra Patidar <[email protected]> Closes #16355 Closes #16356
* Linux 6.11: add compat macro for page_mapping()Rob Norris2024-08-131-0/+1
| | | | | | | | | | | Since the change to folios it has just been a wrapper anyway. Linux has removed their wrapper, so we add one. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400
* Linux 6.11: add more queue_limit fields with removed settersRob Norris2024-08-131-8/+15
| | | | | | | | | | | These fields are very old, so no detection necessary; we just move them into the limit setup functions. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400
* Linux 6.11: IO stats is now a queue feature flagRob Norris2024-08-131-4/+3
| | | | | | | | | | Apply them with with the rest of the settings. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400
* Linux 6.11: enable queue flush through queue limitsRob Norris2024-08-131-2/+10
| | | | | | | | | | | | | | | | | In 6.11 struct queue_limits gains a 'features' field, where, among other things, flush and write-cache are enabled. Detect it and use it. Along the way, the blk_queue_set_write_cache() compat wrapper gets a little cleanup. Since both flags are alway set together, its now a single bool. Also the very very ancient version that sets q->flush_flags directly couldn't actually turn it off, so I've fixed that. Not that we use it, but still. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400
* linux/zvol_os: tidy and document queue limit/config setupRob Norris2024-08-131-7/+38
| | | | | | | | | | | It gets hairier again in Linux 6.11, so I want some actual theory of operation laid out for next time. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400
* Linux: Make zfs_prune() fair on NUMA systemsAlexander Motin2024-08-081-5/+13
| | | | | | | | | | | | | | | Previous code evicted nr_to_scan items from each NUMA node. This not only multiplied the eviction by the number of nodes, but could exhaust the smaller ones, evicting inodes used by acive workload and requiring their immediate recreation. This patch spreads the requested eviction between all NUMA nodes proportionally to their evictable counts, which should be closer to expected LRU logic. See kernel's super_cache_scan() as a similar logic example. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #16397
* zvol: ensure device minors are properly cleaned upRob Norris2024-08-061-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes #14872 Closes #16364
* linux/zvol_os: fix SET_ERROR with negative return codesRob Norris2024-08-061-4/+4
| | | | | | | | | | | | | SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Closes #16364
* Linux: Report reclaimable memory to kernel as such (#16385)Alexander Motin2024-07-302-4/+5
| | | | | | | | | | | | | | | Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to mark memory allocations that can be freed via shinker calls. It should allow kernel to tune and group such allocations for lower memory fragmentation and better reclamation under pressure. This patch marks as reclaimable most of ARC memory, directly evictable via ZFS shrinker, plus also dnode/znode/sa memory, indirectly evictable via kernel's superblock shrinker. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Allan Jude <[email protected]>
* Several improvements to ARC shrinking (#16197)Alexander Motin2024-07-251-41/+47
| | | | | | | | | | | | | | | | | | | | | | | | - When receiving memory pressure signal from OS be more strict trying to free some memory. Otherwise kernel may come again and request much more. Return as result how much arc_c was actually reduced due to this request, that may be less than requested. - On Linux when receiving direct reclaim from some file system (that may be ZFS) instead of ignoring request completely, just shrink the ARC, but do not wait for eviction. Waiting there may cause deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing help. While not waiting may result in more ARC evicted later, and may be too late if OOM killer activate right now, but I hope it to be better than doing nothing at all. - On Linux set arc_no_grow before waiting for reclaim, not after, or it may grow back while we are waiting. - On Linux add new parameter zfs_arc_shrinker_seeks to balance ARC eviction cost, relative to page cache and other subsystems. - Slightly update Linux arc_set_sys_free() math for new kernels. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Use kmap_local_page instead of kmap_atomic (#16329)Jason Lee2024-07-162-8/+8
| | | | | | | Changed zfs_k(un)map_atomic to zfs_k(un)map_local Signed-off-by: Jason Lee <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Atkinson <[email protected]>
* Linux 5.16: use bdev_nr_bytes() to get device capacityRob Norris2024-07-151-5/+9
| | | | | | | | | | | This helper was introduced long ago, in 5.16. Since 6.10, bd_inode no longer exists, but the helper has been updated, so detect it and use it in all versions where it is available. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]>
* Linux 6.10: rework queue limits setupRob Norris2024-07-151-70/+116
| | | | | | | | | | | | | | | | | | | | | | | Linux has started moving to a model where instead of applying block queue limits through individual modification functions, a complete limits structure is built up and applied atomically, either when the block device or open, or some time afterwards. As of 6.10 this transition appears only partly completed. This commit matches that model within OpenZFS in a way that should work for past and future kernels. We set up a queue limits structure with any limits that have had their modification functions removed. For newer kernels that can have limits applied at block device open (HAVE_BLK_ALLOC_DISK_2ARG), we have a conversion function to turn the OpenZFS queue limits structure into Linux's queue_limits structure, which can then be passed in. For older kernels, we provide an application function that just calls the old functions for each limit in the structure. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]>
* zvol: Fix suspend lock leaks (#16270)Mark Johnston2024-07-101-0/+1
| | | | | | | | | | | | In several functions, we use a flag variable to track whether zv_suspend_lock is held. This flag was not getting reset in a particular case where we need to retry the underlying operation, resulting in a lock leak. Make sure to update the flag where necessary. Signed-off-by: Mark Johnston <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282)Tony Hutter2024-06-281-5/+97
| | | | | | | | | | | | | | | The 6.9 kernel behaves differently in how it releases block devices. In the common case it will async release the device only after the return to userspace. This is different from the 6.8 and older kernels which release the block devices synchronously. To get around this, call add_disk() from a workqueue so that the kernel uses a different codepath to release our zvols in the way we expect. This stops zfs_allow_010_pos from hanging. Fixes: #16089 Signed-off-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Rob Norris <[email protected]>
* zdb/ztest: send dbgmsg output to stderrRob Norris2024-05-141-10/+9
| | | | | | | | | | | | | And, make the output fd an arg to zfs_dbgmsg_print(). This is a change in behaviour, but keeps it consistent with where crash traces go, and it's easy to argue this is what we want anyway; this is information about the task, not the actual output of the task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* zfs_dbgmsg_print: make FreeBSD and Linux consistentRob Norris2024-05-141-1/+2
| | | | | | | | | | | | | | FreeBSD was using fprintf(), which might not be signal-safe. Meanwhile, Linux's locking did not cover the header output. This two quirks are unrelated, but both have the same response: be like the other one. So with this commit, both functions are the same except for the names of their lock and list variables. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN.chenqiuhao19972024-05-101-1/+1
| | | | | | | | | | | | In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Qiuhao Chen <[email protected]> Closes #15940
* Replace usage of schedule_timeout with schedule_timeout_interruptible (#16150)Daniel Perry2024-05-092-2/+3
| | | | | | | | | | | | | | | | This commit replaces current usages of schedule_timeout() with schedule_timeout_interruptible() in code paths that expect the running task to sleep for a short period of time. When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. By calling schedule_timeout_interruptible() to set the task state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of putting the task to sleep for the specified timeout. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Daniel Perry <[email protected]> Closes #16150
* vdev_disk: disable flushes if device does not support itRob N2024-05-021-2/+5
| | | | | | | | | | | | | If the underlying device doesn't have a write-back cache, the kernel will just return a successful response. This doesn't hurt anything, but it's extra work on the IO taskqs that are unnecessary. So, detect this when we open the device for the first time. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16148
* Fix updating the zvol_htable when renaming a zvolAlan Somers2024-04-251-1/+1
| | | | | | | | | | | | When renaming a zvol, insert it into zvol_htable using the new name, not the old name. Otherwise some operations won't work. For example, "zfs set volsize" while the zvol is open. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Alan Somers <[email protected]> Closes #16127 Closes #16128
* abd_iter_page: rework to handle multipage scatterlistsRob N2024-04-191-46/+74
| | | | | | | | | | | | | | | | | | | Previously, abd_iter_page() would assume that every scatterlist would contain a single page (compound or no), because that's all we ever create in abd_alloc_chunks(). However, scatterlists can contain multiple pages of arbitrary provenance, and if we get one of those, we'd get all the math wrong. This reworks things to handle multiple pages in a scatterlist, by properly finding the right page within it for the given offset, and understanding better where the end of the page is and not crossing it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16108
* zio: rename ZIO_TYPE_IOCTL to ZIO_TYPE_FLUSHRob Norris2024-04-112-2/+2
| | | | | | | | | | | | | The only possible ioctl is a flush, and any other kind of meta-operation introduced in the future is likely to have different semantics (much like trim did). So, lets just call it what it is. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16064
* zio: remove io_cmd and DKIOCFLUSHWRITECACHERob Norris2024-04-112-49/+34
| | | | | | | | | | | | | | There's no other options, so we can just always assume its a flush. Includes some light refactoring where a switch statement was doing control flow that no longer works. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16064
* vdev_disk: fix alignment check when buffer has non-zero starting offsetRob Norris2024-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a linear buffer spans multiple pages, and the first page has a non-zero starting offset, the checker would not include the offset, and so would think there was an alignment gap at the end of the first page, rather than at the start. That is, for a 16K buffer spread across five pages with an initial 512B offset: [.XXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXX.] It would be interpreted as: [XXXXXXX.][XXXXXXXX]... And be rejected as misaligned. Since it's already a linear ABD, the "linearising" copy would just reuse the buffer as-is, and the second check would failing, tripping the VERIFY in vdev_disk_io_rw(). This commit fixes all this by including the offset in the check for end-of-page alignment. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16076
* tests: add test for vdev_disk page alignment checkRob Norris2024-04-111-0/+6
| | | | | | | | | | | | | | | | This provides a test driver and a set of test vectors for the page alignment check callback function vdev_disk_check_pages_cb(). Because there's no good facility for exposing this function to a userspace test right now, for now I'm just duplicating the function and adding commentary to remind people to keep them in sync. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16076
* vdev_disk: ensure trim errors are returned immediatelyRob N2024-04-081-45/+86
| | | | | | | | | | | | | | | | | | | | | | | | | After 06e25f9c4, the discard issuing code was organised such that if requesting an async discard or secure erase failed before the IO was issued (that is, calling __blkdev_issue_discard() returned an error), the failed zio would never be executed, resulting in txg_sync hanging forever waiting for IO to finish. This commit fixes that by immediately executing a failed zio on error. To handle the successful synchronous op case, we fake an async op by, when not using an asynchronous submission method, queuing the successful result zio as part of the discard handler. Since it was hard to understand the differences between discard and secure erase, and sync and async, across different kernel versions, I've commented and reorganised the code a bit to try and make everything more contained and linear. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16070
* zvol_os: fix compile with blk-mq on Linux 4.xRob N2024-04-081-0/+5
| | | | | | | | | | | | | | | 99741bde5 accesses a cached blk-mq hardware context through the mq_hctx field of struct request. However, this field did not exist until 5.0. Before that, the private function blk_mq_map_queue() was used to dig it out of broader queue context. This commit detects this situation, and handles it with a poor-man's simulation of that function. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16069
* zvol_os: fix build on Linux <3.13Rob N2024-04-081-1/+2
| | | | | | | | | | | | | | | | 99741bde5 introduced zvol_num_taskqs, but put it behind the HAVE_BLK_MQ define, preventing builds on versions of Linux that don't have it (<3.13, incl EL7). Nothing about it seems dependent on blk-mq, so this just moves it out from behind that define and so fixes the build. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16062
* zvol: use multiple taskqAmeer Hamza2024-04-031-10/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, zvol uses a single taskq, resulting in throughput bottleneck under heavy load due to lock contention on the single taskq. This patch addresses the performance bottleneck under heavy load conditions by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scale dynamically based on the available CPUs in the system, as illustrated below: taskq total cpus taskqs threads threads ------- ------- ------- ------- 1 1 32 32 2 1 32 32 4 1 32 32 8 2 16 32 16 3 11 33 32 5 7 35 64 8 8 64 128 11 12 132 256 16 16 256 Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #15992
* Linux 6.9 compat: blk_alloc_disk() now takes two argsRob Norris2024-04-031-1/+22
| | | | | | | | | | | | | | | | There's an extra nullable arg for queue limits. Detect it, and set it to NULL. Similar change for blk_mq_alloc_disk(), now three args, same treatment. Error return now has error encoded in the return, so detect with IS_ERR() and explicitly NULL our own return. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033
* Linux 6.9 compat: bdev handles are now struct fileRob Norris2024-04-031-5/+19
| | | | | | | | | | | | | | bdev_open_by_path() is replaced by bdev_file_open_by_path(), which returns a plain old struct file*. Release function is gone entirely; the regular file release function fput() will take care of the bdev specifics. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033
* vdev_disk: don't touch vbio after its handed off to the kernelRob N2024-04-031-5/+6
| | | | | | | | | | | | | | | | | | | | | After IO is unplugged, it may complete immediately and vbio_completion be called on interrupt context. That may interrupt or deschedule our task. If its the last bio, the vbio will be freed. Then, we get rescheduled, and try to write to freed memory through vbio->. This patch just removes the the cleanup, and the corresponding assert. These were leftovers from a previous iteration of vbio_submit() and were always "belt and suspenders" ops anyway, never strictly required. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc Reported-by: Rich Ercolani <[email protected]> Reviewed-by: Laurențiu Nicola <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16045 Closes #16050 Closes #16049
* vdev_disk: clean up spa/bdev mode conversionRob N2024-03-291-42/+39
| | | | | | | | | | | | | | | | | | | | 43e8f6e37 introduced a subtle API misuse, in that it passed the output from vdev_bdev_mode() back into itself. Fortunately, the SPA_MODE_(READ|WRITE) bit values exactly map to the FMODE_(READ|WRITE) & BLK_OPEN_(READ|WRITE) bit values, so it didn't result in a bug, but it was hard to read and understand, so I cleaned it up. In doing so, I noticed that the only call to vdev_bdev_mode() without the "exclusive" flag set was in that misuse, and actually, we never do a non-exclusive blkdev_get_by_path(). So I've just made exclusive be always-on. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15995
* zvols: prevent overflow of minor device numbersFabian-Gruenbichler2024-03-291-0/+7
| | | | | | | | | | | | | | | | | | | | | | currently, the linux kernel allows 2^20 minor devices per major device number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol itself, the other 15 for the first partitions of that zvol. as a result, only 2^16 such blocks are available for use. there are no checks in place to avoid overflowing into the major device number when more than 2^16 zvols are allocated (with volmode=dev or default). instead of ignoring this limit, which comes with all sorts of weird knock-on effects, detect this situation and simply fail allocating the zvol block device early on. without this safeguard, the kernel will reject the attempt to create an already existing block device, but ZFS doesn't handle this error and gets confused about which zvol occupies which minor slot, potentially resulting in kernel NULL derefs and other issues later on. Reviewed-by: Tony Hutter <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Closes #16006
* abd_iter_page: don't use compound heads on Linux <4.5Rob Norris2024-03-251-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages in a compound page were refcounted separately. This means that using the head page without taking a reference to it could see it cleaned up later before we're finished with it. Specifically, bio_add_page() would take a reference, and drop its reference after the bio completion callback returns. If the zio is executed immediately from the completion callback, this is usually ok, as any data is referenced through the tail page referenced by the ABD, and so becomes "live" that way. If there's a delay in zio execution (high load, error injection), then the head page can be freed, along with any dirty flags or other indicators that the underlying memory is used. Later, when the zio completes and that memory is accessed, its either unmapped and an unhandled fault takes down the entire system, or it is mapped and we end up messing around in someone else's memory. Both of these are very bad. The solution on these older kernels is to take a reference to the head page when we use it, and release it when we're done. There's not really a sensible way under our current structure to do this; the "best" would be to keep a list of head page references in the ABD, and release them when the ABD is freed. Since this additional overhead is totally unnecessary on 4.5+, where head and tail pages share refcounts, I've opted to simply not use the compound head in ABD page iteration there. This is theoretically less efficient (though cleaning up head page references would add overhead), but its safe, and we still get the other benefits of not mapping pages before adding them to a bio and not mis-splitting pages. There doesn't appear to be an obvious symbol name or config option we can match on to discover this behaviour in configure (and the mm/page APIs have changed a lot since then anyway), so I've gone with a simple version check. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588
* vdev_disk: use bio_chain() to submit multiple BIOsRob Norris2024-03-251-151/+80
| | | | | | | | | | | | | Simplifies our code a lot, so we don't have to wait for each and reassemble them. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588
* vdev_disk: add module parameter to select BIO submission methodRob Norris2024-03-251-2/+29
| | | | | | | | | | | | | | | This makes the submission method selectable at module load time via the `zfs_vdev_disk_classic` parameter, allowing this change to be backported to 2.2 safely, and disabled in favour of the "classic" submission method if new problems come up. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588