aboutsummaryrefslogtreecommitdiffstats
path: root/module/os/linux/zfs
Commit message (Collapse)AuthorAgeFilesLines
* ZIL: Assert record sizes in different placesAlexander Motin2023-11-281-2/+7
| | | | | | | | This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15517
* ZVOL: Minor code cleanupAlexander Motin2023-11-271-1/+1
| | | | | | | | | | | | | | | - Remove zsda_tx field, it is used only once. - Remove unneeded string lengths checks, all names are terminated. - Replace few explicit MAXNAMELEN usages with sizeof(). - Change dsname from MAXNAMELEN to ZFS_MAX_DATASET_NAME_LEN, as expected by dsl_dataset_name(). Both are 256 bytes now, but it is better to be safe. This should have no functional difference. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15535
* ZIL: Refactor TX_WRITE encryption similar to TX_CLONE_RANGEAlexander Motin2023-11-271-8/+4
| | | | | | | | | | | | | | | It should be purely textual change to make the code more readable. Should cause no functional difference. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Edmund Nadolski <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513
* ZIL: Do not encrypt block pointers in lr_clone_range_tAlexander Motin2023-11-271-0/+15
| | | | | | | | | | | | | | | | | | | | In case of crash cloned blocks need to be claimed on pool import. It is only possible if they (lr_bps) and their count (lr_nbps) are not encrypted but only authenticated, similar to block pointer in lr_write_t. Few other fields can be and are still encrypted. This should fix panic on ZIL claim after crash when block cloning is actively used. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Edmund Nadolski <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513
* Add a tunable to disable BRT support.Rich Ercolani2023-11-162-0/+9
| | | | | | | | | Copy the disable parameter that FreeBSD implemented, and extend it to work on Linux as well, until we're sure this is stable. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #15529
* Linux: reject read/write mapping to immutable file only on VM_SHAREDLow-power2023-11-081-2/+2
| | | | | | | | | | Private read/write mapping can't be used to modify the mapped files, so they will remain be immutable. Private read/write mappings are usually used to load the data segment of executable files, rejecting them will rendering immutable executable files to stop working. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: WHR <[email protected]> Closes #15344
* RAID-Z expansion featureDon Brady2023-11-082-1/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks). == Initiating expansion == A new device (disk) can be attached to an existing RAIDZ vdev, by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes. The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software. == During expansion == The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device). The expansion progress can be monitored with `zpool status`. Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete). The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off. == After expansion == When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc). Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion). A RAIDZ vdev can be expanded multiple times. After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools. Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Authored-by: Matthew Ahrens <[email protected]> Contributions-by: Fedor Uporov <[email protected]> Contributions-by: Stuart Maybee <[email protected]> Contributions-by: Thorsten Behrens <[email protected]> Contributions-by: Fmstrat <[email protected]> Contributions-by: Don Brady <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15022
* Linux 6.6 compat: fix implicit conversion error with debug buildUmer Saleem2023-11-071-1/+1
| | | | | | | | | | | | With Linux v6.6.0 and GCC 12, when debug build is configured, implicit conversion error is raised while converting 'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of 'true' to fix the issue. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Snajdr <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15489
* Update the kstat dataset_name when renaming a zvolAlan Somers2023-11-071-0/+2
| | | | | | | | | | | Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #15482 Closes #15486
* zvol: Implement zvol threading as a PropertyAmeer Hamza2023-10-311-1/+8
| | | | | | | | | | | | Currently, zvol threading can be switched through the zvol_request_sync module parameter system-wide. By making it a zvol property, zvol threading can be switched per zvol. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #15409
* Unify arc_prune_async() codeAlexander Motin2023-10-302-52/+1
| | | | | | | | | | | | | | There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15456
* ZIL: Cleanup sync and commit handlingAlexander Motin2023-10-302-14/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ZVOL: - Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only one object, it makes no sense to maintain async queue and on each commit merge it into sync. Single sync queue is just cheaper, while it changes nothing until actual commit request arrives. - Remove zsd_sync_cnt and the zil_async_to_sync() calls since we are no longer switching between sync and async queues. ZFS: - Mark write transactions as sync based only on number of sync opens (z_sync_cnt). We can not randomly jump between sync and async unless we want data corruptions due to writes reordering. - When file first opened with O_SYNC (z_sync_cnt incremented to 1) call zil_async_to_sync() for it to preserve correct ordering between past and future writes. - Drop zfs_fsyncer_key logic. Looks like it was an optimization for workloads heavily intermixing async writes with tons of fsyncs. But first it was broken 8 years ago due to Linux tsd implementation not allowing data storage between syscalls, and second, I doubt it is safe to switch from async to sync so often and without calling zil_async_to_sync(). - Rename sync argument of *_log_write() into commit, now only signalling caller's intent to call zil_commit() soon after. It allows WR_COPIED optimizations without extra other meanings. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15366
* arc_default_max on Linux should match FreeBSDednadolski-ix2023-10-261-2/+8
| | | | | | | | | | | | Commits 518b487 and 23bdb07 changed the default ARC size limit on Linux systems to 1/2 of physical memory, which has become too strict for modern systems with large amounts of RAM. This patch changes the default limit to match that of FreeBSD, so ZFS may have a unified value on both platforms. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Edmund Nadolski <[email protected]> Closes #15437
* Revert "zvol: Temporally disable blk-mq"Tony Hutter2023-10-241-0/+12
| | | | | | | | | This reverts commit aefb6a2bd6c24597cde655e9ce69edd0a4c34357. aefb6a2bd temporally disabled blk-mq until we could fix a fix for Signed-off-by: Tony Hutter <[email protected]> Closes #15439
* zvol: Remove broken blk-mq optimizationTony Hutter2023-10-241-29/+0
| | | | | | | | | | | | | | | | | This fix removes a dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". This optimization results in an unbootable system after installing an OS on a zvol with blk-mq enabled. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15351
* Ensure we call fput when cloning fails due to different devices.Daniel Berlin2023-10-101-2/+6
| | | | | | | | | | | Right now, zpl_ioctl_ficlone and zpl_ioctl_ficlonerange do not call put on the src fd if the source and destination are on two different devices. This leaves the source file held open in this case. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Daniel Berlin <[email protected]> Closes #15386
* zvol: Temporally disable blk-mqTony Hutter2023-10-101-12/+0
| | | | | | | | | | | | | | | There was a report of zvol data loss (#15351) after enabling blk-mq on a zvol backed with 16k physical block sized disks. Out of an abundance of caution, do not allow the user to enable blk-mq until we can look into the issue. Note that blk-mq was not enabled by default on zvols. It was always opt-in via the zvol_use_blk_mq module parameter. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Addresses: #15351 Closes #15378
* Linux 6.6 compat: fsync_bdev() has been removed in favor of sync_blockdev()Coleman Kane2023-09-211-0/+6
| | | | | | | | | | | | In Linux commit 560e20e4bf6484a0c12f9f3c7a1aa55056948e1e, the fsync_bdev() function was removed in favor of sync_blockdev() to do (roughly) the same thing, given the same input. This change conditionally attempts to call sync_blockdev() if fsync_bdev() isn't discovered during configure. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263
* Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2Coleman Kane2023-09-213-2/+22
| | | | | | | | | | | | | | In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument for the request_mask was added to generic_fillattr. This is the same request_mask for statx that's present in the most recent API implemented by zpl_getattr_impl. This commit conditionally adds it to the zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...) implementation, when configure determines it's present in the kernel's generic_fillattr(...). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263
* Linux 6.6 compat: use inode_get/set_ctime*(...)Coleman Kane2023-09-215-15/+26
| | | | | | | | | | | | | | | | | | | | In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access to the i_ctime member of struct inode was removed. The new approach is to use accessor methods that exclusively handle passing the timestamp around by value. This change adds new tests for each of these functions and introduces zpl_* equivalents in include/os/linux/zfs/sys/zpl.h. In where the inode_get/set_ctime*() functions exist, these zpl_* calls will be mapped to the new functions. On older kernels, these macros just wrap direct-access calls. The code that operated on an address of ip->i_ctime to call ZFS_TIME_DECODE() now will take a local copy using zpl_inode_get_ctime(), and then pass the address of the local copy when performing the ZFS_TIME_DECODE() call, in all cases, rather than directly accessing the member. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263 Closes #15257
* Retire z_nr_znodesMateusz Guzik2023-09-183-7/+3
| | | | | | | | | | | | Added in ab26409db753 ("Linux 3.1 compat, super_block->s_shrink"), with the only consumer which needed the count getting retired in 066e82522101 ("Linux compat: Minimum kernel version 3.10"). The counter gets in the way of not maintaining the list to begin with. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #15274
* Intel QAT 1.7 compatibilityVolker Mauel2023-09-071-0/+2
| | | | | | | | | | Based on the intel QAT samples which are bundled in the 1.x drivers, this is the preferred approach since api version 1.6. See: https://www.intel.de/content/www/de/de/download/19734/intel-quickassist-technology-driver-for-linux-hw-version-1-x.html? Reviewed-by: Weigang Li <[email protected]> Signed-off-by: Volker Mauel <[email protected]> Closes #15190
* Selectable block allocatorsednadolski-ix2023-09-011-0/+12
| | | | | | | | | | | | | ZFS historically has had several space allocators that were dynamically selectable. While these have been retained in OpenZFS, only a single allocator has been statically compiled in. This patch compiles all allocators for OpenZFS and provides a module parameter to allow for manual selection between them. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Edmund Nadolski <[email protected]> Closes #15218
* copy_file_range: fix fallback when source create on same txgRob N2023-08-141-2/+3
| | | | | | | | | | | | | In 019dea0a5 we removed the conversion from EAGAIN->EXDEV inside zfs_clone_range(), but forgot to add a test for EAGAIN to the copy_file_range() entry points to trigger fallback to a content copy. This commit fixes that. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15170 Closes #15172
* Linux 6.5 compat: Use copy_splice_read instead of filemap_splice_readColeman Kane2023-08-081-2/+2
| | | | | | | | | | | Using the filemap_splice_read function for the splice_read handler was leading to occasional data corruption under certain circumstances. Favor using copy_splice_read instead, which does not demonstrate the same erroneous behavior under the tested failure cases. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15164
* zfs_clone_range should return a descriptive error codesoromenahar2023-08-081-2/+2
| | | | | | | | | | | | | | | Return the more descriptive error codes instead of `EXDEV` when the parameters don't match the requirements of the clone function. Updated the comments in `brt.c` accordingly. The first three errors are just invalid parameters, which zfs can not handle. The fourth error indicates that the block which should be cloned is created and cloned or modified in the same transaction group (`txg`). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Kay Pedersen <[email protected]> Closes #15148
* Linux 6.5 compat: replace generic_file_splice_read with filemap_splice_readColeman Kane2023-08-071-0/+4
| | | | | | | | | | | | | | | | The generic_file_splice_read function was removed in Linux 6.5 in favor of filemap_splice_read. Add an autoconf test for filemap_splice_read and use it if it is found as the handler for .splice_read in the file_operations struct. Additionally, ITER_PIPE was removed in 6.5. This change removes the ITER_* macros that OpenZFS doesn't use from being tested in config/kernel-vfs-iov_iter.m4. The removal of ITER_PIPE was causing the test to fail, which also affected the code responsible for setting the .splice_read handler, above. That behavior caused run-time panics on Linux 6.5. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15155
* Refactor dmu_prefetch().Alexander Motin2023-08-071-5/+2
| | | | | | | | | | | | | | | | | | - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15076
* linux/copy_file_range: properly request a fallback copy on Linux <5.3Rob N2023-08-011-0/+7
| | | | | | | | | | | | | | | | | | | | Before Linux 5.3, the filesystem's copy_file_range handler had to signal back to the kernel that we can't fulfill the request and it should fallback to a content copy. This is done by returning -EOPNOTSUPP. This commit converts the EXDEV return from zfs_clone_range to EOPNOTSUPP, to force the kernel to fallback for all the valid reasons it might be unable to clone. Without it the copy_file_range() syscall will return EXDEV to userspace, breaking its semantics. Add test for copy_file_range fallbacks. copy_file_range should always fallback to a content copy whenever ZFS can't service the request with cloning. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15131
* linux: zfs: ctldir: set [amc]time to snapshot's creation propertyнаб2023-08-011-5/+21
| | | | | | | | | | | | If looking up a snapdir inode failed, hold pool config – hold the snapshot – get its creation property – release it – release it, then use that as the [amc]time in the allocated inode. If that fails then fall back to current time. No performance impact since this is only done when allocating a new snapdir inode. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #15110 Closes #15117
* Linux 4.20 compat: wrapper function for iov_iter type accessColeman Kane2023-08-011-5/+2
| | | | | | | | | | | | An iov_iter_type() function to access the "type" member of the struct iov_iter was added at one point. Move the conditional logic to decide which method to use for accessing it into a macro and simplify the zpl_uio_init code. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15100
* Linux 6.4 compat: iter_iov() function now used to get old iov memberColeman Kane2023-08-011-5/+3
| | | | | | | | | | | | The iov_iter->iov member is now iov_iter->__iov and must be accessed via the accessor function iter_iov(). Create a wrapper that is conditionally compiled to use the access method appropriate for the target kernel version. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15100
* Linux 6.5 compat: blkdev changesColeman Kane2023-08-014-14/+83
| | | | | | | | | | | | | | | | | | | | | | | | | Multiple changes to the blkdev API were introduced in Linux 6.5. This includes passing (void* holder) to blkdev_put, adding a new blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type that replaces uses of fmode_t, and removing an argument from the release handler on block_device_operations that we weren't using. The open function definition has also changed to take gendisk* and blk_mode_t, so update it accordingly, too. Implement local wrappers for blkdev_get_by_path() and vdev_blkdev_put() so that the in-line calls are cleaner, and place the conditionally-compiled implementation details inside of both of these local wrappers. Both calls are exclusively used within vdev_disk.c, at this time. Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE The wrapper function is now used for testing using the appropriate method for the kernel, whether the open mode is writable or not. Emphasize fmode_t arg in zvol_release is not used Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15099
* linux: implement filesystem-side copy/clone functions for EL7Rob Norris2023-07-244-8/+34
| | | | | | | | | | | | | Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050
* linux: implement filesystem-side clone ioctlsRob Norris2023-07-242-0/+85
| | | | | | | | | | | | | | | | | | Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the approprate VFS op. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050
* linux: implement filesystem-side copy/clone functionsRob Norris2023-07-242-1/+195
| | | | | | | | | | | | | | | | | | | | This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050
* Fix zpl_test_super race with zfs_umountChunwei Chen2023-07-202-15/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot call zpl_enter in zpl_test_super, because zpl_test_super is under spinlock so we can't sleep, and also because zpl_test_super is called without sb->s_umount taken, so it's possible we would race with zfs_umount and call zpl_enter on freed zfsvfs. Here's an stack trace when this happens: [ 2379.114837] VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.114845] PANIC at spl-condvar.c:497:__cv_broadcast() [ 2379.114854] Kernel panic - not syncing: VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.115012] Call Trace: [ 2379.115019] dump_stack+0x74/0x96 [ 2379.115024] panic+0x114/0x2f6 [ 2379.115035] spl_panic+0xcf/0xfc [spl] [ 2379.115477] __cv_broadcast+0x68/0xa0 [spl] [ 2379.115585] rrw_exit+0xb8/0x310 [zfs] [ 2379.115696] rrm_exit+0x4a/0x80 [zfs] [ 2379.115808] zpl_test_super+0xa9/0xd0 [zfs] [ 2379.115920] sget+0xd1/0x230 [ 2379.116033] zpl_mount+0xdc/0x230 [zfs] [ 2379.116037] legacy_get_tree+0x28/0x50 [ 2379.116039] vfs_get_tree+0x27/0xc0 [ 2379.116045] path_mount+0x2fe/0xa70 [ 2379.116048] do_mount+0x80/0xa0 [ 2379.116050] __x64_sys_mount+0x8b/0xe0 [ 2379.116052] do_syscall_64+0x35/0x50 [ 2379.116054] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 2379.116057] RIP: 0033:0x7f9912e8b26a Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #15077
* Enable tuning of ZVOL open timeout valuePrakash Surya2023-06-301-1/+6
| | | | | | | | | | | The default timeout for ZVOL opens may not be sufficient for all cases, so we should enable the value to be more easily tuned to account for systems where the default value is insufficient. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #15023
* Use list_remove_head() where possible.Alexander Motin2023-06-091-3/+1
| | | | | | | | | | | ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14955
* znode: expose zfs_get_zplprop to libzpoolRob Norris2023-06-052-85/+85
| | | | | | | | | | | | | There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14642
* Linux 6.4 compat: reclaimed_slab renamed to reclaimedyouzhongyang2023-05-241-0/+4
| | | | | | | Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14891
* Linux: zfs_zaccess_trivial() should always call generic_permission()Richard Yao2023-04-201-2/+1
| | | | | | | | | | | | | | | | Building with Clang on Linux generates a warning that err could be uninitialized if mnt_ns is a NULL pointer. However, mnt_ns should never be NULL, so there is no need to put this behind an if statement. Taking it outside of the if statement means that the possibility of err being uninitialized goes from being always zero in a way that the compiler could not realize to a way that is always zero in a way that the compiler can realize. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14738
* Linux 6.3 compat: idmapped mount API changesyouzhongyang2023-04-1010-95/+161
| | | | | | | | | Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap type for mounts (struct mnt_idmap), we need to detect these changes and make zfs work with the new APIs. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14682
* Linux 6.3 compat: writepage_t first arg struct folio*youzhongyang2023-04-051-3/+25
| | | | | | | | | | The type def of writepage_t in kernel 6.3 is changed to take struct folio* as the first argument. We need to detect this change and pass correct function to write_cache_pages(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14699
* Use vmem_zalloc to silence allocation warningBrian Behlendorf2023-03-311-2/+2
| | | | | | | | | | | | | The kmem allocation in zfs_prune_aliases() will trigger a large allocation warning on systems with 64K pages. Resolve this by switching to vmem_alloc() which internally uses kvmalloc() so the right allocator will be used based on the allocation size. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8491 Closes #14694
* QAT: Fix uninitialized seed in QAT compressionnaivekun2023-03-161-1/+1
| | | | | | | | | | | | | | CpaDcRqResults have to be initialized with checksum=1 for adler32. Otherwise when error CPA_DC_OVERFLOW occurred, the next compress operation will continue on previously part-compressed data, and write invalid checksum data. When zfs decompress the compressed data, a invalid checksum will occurred and lead to #14463 Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Weigang Li <[email protected]> Reviewed-by: Chengfei Zhu <[email protected]> Signed-off-by: naivekun <[email protected]> Closes #14632 Closes #14463
* nvpair: Constify string functionsRichard Yao2023-03-141-1/+1
| | | | | | | | | | | | | | After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This lead to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14612
* Linux: Suppress clang static analyzer warning in zfs_remove()Richard Yao2023-03-081-0/+1
| | | | | | | | | | | | | | | | | Clang's static analyzer points out that if we fail to find an extended attribute directory, but somehow find it when calculating delete_now and delete_now is true, we will have a NULL pointer dereference when we try to unlink the extended attribute directory. I am not sure if this is possible, but if it is, I do not see a sane way of handling this other than rolling back the transaction and retrying. For now, let us do an VERIFY_IMPLY(). If this trips, it will stop the transaction from committing, which will prevent an attribute directory leak. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Linux: Silence static analyzer warning in crypto_create_ctx_template()Richard Yao2023-03-081-1/+1
| | | | | | | | | | | | | | | | | | A CodeChecker report from Clang's CTU analysis indicated that we were assigning uninitialized values in crypto_create_ctx_template() when we call it from zio_crypt_key_init(). This occurs because the ->cm_param and ->cm_param_len fields are uninitialized. Thankfully, the uninitialized values are only used in the skein via KCF_PROV_CREATE_CTX_TEMPLATE() -> skein_create_ctx_template() -> skein_mac_ctx_build() -> skein_get_digest_bitlen(), but that should not be called from here. We fix this to avoid a possible trap should this code change in the future. The FreeBSD version of zio_crypt_key_init() is unaffected. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575
* Linux cleanup: zvol_discard() should only call blk_queue_io_stat() onceRichard Yao2023-03-081-2/+1
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575