aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* linux 6.7 compat: rework shrinker setup for heap allocationsRob Norris2023-12-215-56/+189
| | | | | | | | | 6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accomodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn
* linux 6.7 compat: handle superblock shrinker member changeRob Norris2023-12-212-3/+42
| | | | | | | | | In 6.7 the superblock shrinker member s_shrink has changed from being an embedded struct to a pointer. Detect this, and don't take a reference if it already is one. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn
* linux 6.7 compat: use inode atime/mtime accessorsRob Norris2023-12-216-35/+148
| | | | | | | | | 6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in b37f29341 to atime and mtime as well. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn
* linux 6.7 compat: simplify current_time() checkRob Norris2023-12-211-1/+4
| | | | | | | | | | 6.7 changed the names of the time members in struct inode, so we can't assign back to it because we don't know its name. In practice this doesn't matter though - if we're missing current_time(), then we must be on <4.9, and we know our fallback will need to return timespec. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn
* Tag zfs-2.2.2zfs-2.2.2Tony Hutter2023-11-291-1/+1
| | | | | | META file and changelog updated. Signed-off-by: Tony Hutter <[email protected]>
* FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are NFS visiblermacklem2023-11-293-3/+11
| | | | | | | | | | | | | | Call vfs_exjail_clone() for mounts created under .zfs/snapshot to fill in the mnt_exjail field for the mount. If this is not done, the snapshots under .zfs/snapshot with not be accessible over NFS. This version has the argument name in vfs.h fixed to match that of the name in spl_vfs.c, although it really does not matter. External-issue: https://reviews.freebsd.org/D42672 Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rick Macklem <[email protected]> Closes #15563
* ZIL: Call brt_pending_add() replaying TX_CLONE_RANGEAlexander Motin2023-11-294-15/+12
| | | | | | | | | | | | | | | | | | zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means on actual replay we must take additional references, which would stay in BRT. Without this blocks could be freed prematurely when either original file or its clone are destroyed. I've observed BRT being emptied and the feature being deactivated after ZIL replay completion, which should not have happened. With the patch I see expected stats. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15603
* zdb: fix printf() length for uint64_t devidMartin Matuška2023-11-291-3/+3
| | | | | | | | | Bug introduced in 213d6829673. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Warner Losh <[email protected]> Signed-off-by: Martin Matuska <[email protected]> Closes #15606
* Linux 6.6 compat: fix configure error with clang (#15558)Jaron Kent-Dobias2023-11-281-1/+1
| | | | | | | | With Linux v6.6.x and clang 16, a configure step fails on a warning that later results in an error while building, due to 'ts' being uninitialized. Add a trivial initialization to silence the warning. Signed-off-by: Jaron Kent-Dobias <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* zfs-dkms: fix shell-init error messageAllKind2023-11-281-0/+1
| | | | | | | | | If all zfs dkms modules have been removed, a shell-init error message may appear, because /var/lib/dkms/zfs does no longer exist. Resolve this by leaving the directory earlier on. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mart Frauenlob <[email protected]> Closes #15576
* FreeBSD: Fix the build on FreeBSD 12Alan Somers2023-11-285-10/+25
| | | | | | | | | | | | | | | | It was broken for several reasons: * VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in. * The location of the zlib header moved in 13.0 and 12.1. We can drop support for building on 12.0, which is EoL. * knlist_init lost an argument in 13.0. OpenZFS change 9d0887402ba assumed 13.0 or later. * FreeBSD 13.0 added copy_file_range, and OpenZFS change 67a1b037915 assumed 13.0 or later. Sponsored-by: Axcient Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Alan Somers <[email protected]> Closes #15551
* dmu_buf_will_clone: fix race in transition back to NOFILLRob N2023-11-281-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15566 Closes #15526
* zdb: Fix zdb '-O|-r' options with -e/exported zpoolAkash B2023-11-281-16/+23
| | | | | | | | | | | | | | | | | | | | | | zdb with '-e' or exported zpool doesn't work along with '-O' and '-r' options as we process them before '-e' has been processed. Below errors are seen: ~> zdb -e pool-mds65/mdt65 -O oi.9/0x200000009:0x0:0x0 failed to hold dataset 'pool-mds65/mdt65': No such file or directory ~> zdb -e pool-oss0/ost0 -r file1 /tmp/filecopy1 -p. failed to hold dataset 'pool-oss0/ost0': No such file or directory zdb: internal error: No such file or directory We need to make sure to process '-O|-r' options after the '-e' option has been processed, which imports the pool to the namespace if it's not in the cachefile. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #15532
* zdb: show BRT statistics and dump its contentsRob Norris2023-11-283-4/+99
| | | | | | | | | | Same idea as the dedup stats, but for block cloning. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15541
* brt: lift internal definitions into _impl headerRob Norris2023-11-283-163/+201
| | | | | | | | | | So that zdb (and others!) can get at the BRT on-disk structures. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15541
* ZTS: Fix zfs_load-key failures on F39Tony Hutter2023-11-281-1/+4
| | | | | | | | | | | The zfs_load-key tests were failing on F39 due to their use of the deprecated ssl.wrap_socket function. This commit updates the test to instead use ssl.SSLContext() as described in https://stackoverflow.com/a/65194957. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15534 Closes #15550
* ZIL: Do not encrypt block pointers in lr_clone_range_tAlexander Motin2023-11-282-0/+28
| | | | | | | | | | | | | | | | | | | | In case of crash cloned blocks need to be claimed on pool import. It is only possible if they (lr_bps) and their count (lr_nbps) are not encrypted but only authenticated, similar to block pointer in lr_write_t. Few other fields can be and are still encrypted. This should fix panic on ZIL claim after crash when block cloning is actively used. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Edmund Nadolski <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513
* dnode_is_dirty: check dnode and its data for dirtinessRob N2023-11-281-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Over its history this the dirty dnode test has been changed between checking for a dnodes being on `os_dirty_dnodes` (`dn_dirty_link`) and `dn_dirty_record`. de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency 2531ce372 Revert "Report holes when there are only metadata changes" ec4f9b8f3 Report holes when there are only metadata changes 454365bba Fix dirty check in dmu_offset_next() 66aca2473 SEEK_HOLE should not block on txg_wait_synced() Also illumos/illumos-gate@c543ec060d illumos/illumos-gate@2bcf0248e9 It turns out both are actually required. In the case of appending data to a newly created file, the dnode proper is dirtied (at least to change the blocksize) and dirty records are added. Thus, a single logical operation is represented by separate dirty indicators, and must not be separated. The incorrect dirty check becomes a problem when the first block of a file is being appended to while another process is calling lseek to skip holes. There is a small window where the dnode part is undirtied while there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)` would not know that the file is dirty, and would go to `dnode_next_offset()`. Since the object has no data blocks yet, it returns `ESRCH`, indicating no data found, which results in `ENXIO` being returned to `lseek()`'s caller. Since coreutils 9.2, `cp` performs sparse copies by default, that is, it uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to replicate the holes in the target. When it hits the bug, its initial search for data fails, and it goes on to call `fallocate()` to create a hole over the entire destination file. This has come up more recently as users upgrade their systems, getting OpenZFS 2.2 as well as a newer coreutils. However, this problem has been reproduced against 2.1, as well as on FreeBSD 13 and 14. This change simply updates the dirty check to check both types of dirty. If there's anything dirty at all, we immediately go to the "wait for sync" stage, It doesn't really matter after that; both changes are on disk, so the dirty fields should be correct. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15571 Closes #15526
* Revert "Tune zio buffer caches and their alignments"Brian Behlendorf2023-11-281-39/+50
| | | | | | | | | | This reverts commit bd7a02c251d8c119937e847d5161b512913667e6 which can trigger an unlikely existing bio alignment issue on Linux. This change is good, but the underlying issue it exposes needs to be resolved before this can be re-applied. Signed-off-by: Brian Behlendorf <[email protected]> Issue #15533
* Tag zfs-2.2.1zfs-2.2.1Tony Hutter2023-11-201-1/+1
| | | | | | META file and changelog updated. Signed-off-by: Tony Hutter <[email protected]>
* ZTS: Fix 'could not unmount datasets' on Alma 9Tony Hutter2023-11-201-0/+6
| | | | | | | | Many tests are failing on AlmaLinux 9 because ZTS could not destroy the pool in cleanup. This was due to $PWD being set to '.' instead of the expected full path. This patch sets $PWD to the full path. Signed-off-by: Tony Hutter <[email protected]>
* zfs-2.2.1: Disable block cloning by defaultTony Hutter2023-11-162-2/+2
| | | | | | | Disable block cloning by default to mitigate possible data corruption (see #15529 and #15526). Signed-off-by: Tony Hutter <[email protected]>
* Add a tunable to disable BRT support.Rich Ercolani2023-11-1611-0/+51
| | | | | | | | | Copy the disable parameter that FreeBSD implemented, and extend it to work on Linux as well, until we're sure this is stable. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #15529
* Packaging: Auto-generate changelog during configure (#15528)Umer Saleem2023-11-163-0/+8
| | | | | | | | | Auto-generate changelog based off on @VERSION@ during configure, so that it is not needed to be update with new releases / version updates. Signed-off-by: Umer Saleem <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Linux 6.6 compat: METATony Hutter2023-11-161-1/+1
| | | | | | | | | Update the META file to reflect compatibility with the 6.6 kernel. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15520
* Workaround UBSAN errors for variable arraysTony Hutter2023-11-163-3/+7
| | | | | | | | | | | | | | | This gets around UBSAN errors when using arrays at the end of structs. It converts some zero-length arrays to variable length arrays and disables UBSAN checking on certain modules. It is based off of the patch from #15460. Reviewed-by: Brian Behlendorf <[email protected]> Tested-by: Thomas Lamprecht <[email protected]> Co-authored-by: Thomas Lamprecht <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Issue #15145 Closes #15510
* ZTS: Test for all known zpool feature setsUmer Saleem2023-11-162-4/+7
| | | | | | | | | | | | | | | | | | zpool_create_features_007_pos only tested for compat-2020 feature set. It would be useful to test for all known features sets. If any additional feature is found enabled that is not present in compatibility list or feature set, it should be caught and reported earlier. This commit also removes encryption from openzfsonosx-1.8.1 compatibility list. Encryption enables bookmark_v2, since it is a dependency of encryption, but not listed in openzfsonoxx-1.8.1 compatibility list. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15505
* Update zpool-features.7 for grub2 compatibility list updatesUmer Saleem2023-11-161-0/+9
| | | | | | | | | | This commit updates zpool-features.7 man page to add newly added zpool features to grub2 compatibility list. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15505
* Workaround to allow openzfs-zfs-dkms install on UbuntuAllKind2023-11-161-1/+0
| | | | | | | | | | | As shown in #15404#issuecomment-1765002181, Ubuntu kernel has 'Provides: zfs-dkms', which will cause uninstall of the kernel, when attempting to install openzfs-zfs-dkms. As a workaround remove the 'Conflicts: zfs-dkms' definition from the debian control file. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mart Frauenlob <[email protected]> Closes #15503
* Linux: reject read/write mapping to immutable file only on VM_SHAREDLow-power2023-11-161-2/+2
| | | | | | | | | | Private read/write mapping can't be used to modify the mapped files, so they will remain be immutable. Private read/write mappings are usually used to load the data segment of executable files, rejecting them will rendering immutable executable files to stop working. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: WHR <[email protected]> Closes #15344
* Fix accounting error for pending sync IO ops in zpool iostatMigeljanImeri2023-11-162-3/+9
| | | | | | | | | | | | | Currently vdev_queue_class_length is responsible for checking how long the queue length is, however, it doesn't check the length when a list is used, rather it just returns whether it is empty or not. To fix this I added a counter variable to vdev_queue_class to keep track of the sync IO ops, and changed vdev_queue_class_length to reference this variable instead. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: MigeljanImeri <[email protected]> Closes #15478
* Linux 6.6 compat: fix implicit conversion error with debug buildUmer Saleem2023-11-161-1/+1
| | | | | | | | | | | | With Linux v6.6.0 and GCC 12, when debug build is configured, implicit conversion error is raised while converting 'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of 'true' to fix the issue. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Snajdr <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15489
* Remove obsolete_counts from grub2 compatibility listUmer Saleem2023-11-161-1/+0
| | | | | | | | | | | | | | | | | | PR#15459 add all read-only compatible zpool features to grub2 compatibility list. 'obsolete_counts' is a read-only features that depends on 'device_removal' feature which is not read-only and is marked as ZFEATURE_FLAG_MOS. Creating a pool with grub2 compatibility enables 'device_removal' feature as well, which is not desired. This commit removes the 'obsolete_counts' feature from grub2 compatibility list, as GRUB only supports read-only compatible features. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15499
* Fix dkms installation of deb packages created with Alien.AllKind2023-11-161-3/+87
| | | | | | | | | | | | | | Alien does not honour the %posttrans hook. So move the dkms uninstall/install scripts to the %pre/%post hooks in case of package install/upgrade. In case of package removal, handle that in %preun. Add removal of all old dkms modules. Add checking for broken 'dkms status'. Handle that as good as possible and warn the user about it. Also add more verbose messages about what we are doing. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mart Frauenlob <[email protected]> Closes #15415
* Add all read-only compatible zpool features to grub2 compatibilityUmer Saleem2023-11-161-0/+10
| | | | | | | | | | | GRUB opens the boot pool in read-only mode. All read-only compatible features for zpool can be enabled and added to grub2 compatibility, as GRUB does not open the boot-pool for write. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15459
* Unify arc_prune_async() codeAlexander Motin2023-11-088-117/+87
| | | | | | | | | | | | | | There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15456
* Tune zio buffer caches and their alignmentsAlexander Motin2023-11-081-50/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should not always use PAGESIZE alignment for caches bigger than it and SPA_MINBLOCKSIZE otherwise. Doing that caches for 5, 6, 7, 10 and 14KB rounded up to 8, 12 and 16KB respectively make no sense. Instead specify as alignment the biggest power-of-2 divisor. This way 2KB and 6KB caches are both aligned to 2KB, while 4KB and 8KB are aligned to 4KB. Reduce number of caches to half-power of 2 instead of quarter-power of 2. This removes caches difficult for underlying allocators to fit into page-granular slabs, such as: 2.5, 3.5, 5, 7, 10KB, etc. Since these caches are mostly used for transient allocations like ZIOs and small DBUF cache it does not worth being too aggressive. Due to the above alignment issue some of those caches were not working properly any way. 6KB cache now finally has a chance to work right, placing 2 buffers into 3 pages, that makes sense. Remove explicit alignment in Linux user-space case. I don't think it should be needed any more with the above fixes. As result on FreeBSD instead of such numbers of pages per slab: vm.uma.zio_buf_comb_16384.keg.ppera: 4 vm.uma.zio_buf_comb_14336.keg.ppera: 4 vm.uma.zio_buf_comb_12288.keg.ppera: 3 vm.uma.zio_buf_comb_10240.keg.ppera: 3 vm.uma.zio_buf_comb_8192.keg.ppera: 2 vm.uma.zio_buf_comb_7168.keg.ppera: 2 vm.uma.zio_buf_comb_6144.keg.ppera: 2 <= Broken vm.uma.zio_buf_comb_5120.keg.ppera: 2 vm.uma.zio_buf_comb_4096.keg.ppera: 1 vm.uma.zio_buf_comb_3584.keg.ppera: 7 <= Hard to free vm.uma.zio_buf_comb_3072.keg.ppera: 3 vm.uma.zio_buf_comb_2560.keg.ppera: 2 vm.uma.zio_buf_comb_2048.keg.ppera: 1 vm.uma.zio_buf_comb_1536.keg.ppera: 2 vm.uma.zio_buf_comb_1024.keg.ppera: 1 vm.uma.zio_buf_comb_512.keg.ppera: 1 I am now getting such: vm.uma.zio_buf_comb_16384.keg.ppera: 4 vm.uma.zio_buf_comb_12288.keg.ppera: 3 vm.uma.zio_buf_comb_8192.keg.ppera: 2 vm.uma.zio_buf_comb_6144.keg.ppera: 3 <= Fixed, 2 in 3 pages vm.uma.zio_buf_comb_4096.keg.ppera: 1 vm.uma.zio_buf_comb_3072.keg.ppera: 3 vm.uma.zio_buf_comb_2048.keg.ppera: 1 vm.uma.zio_buf_comb_1536.keg.ppera: 2 vm.uma.zio_buf_comb_1024.keg.ppera: 1 vm.uma.zio_buf_comb_512.keg.ppera: 1 Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15452
* DMU: Do not pre-read holes during writeAlexander Motin2023-11-081-3/+5
| | | | | | | | | | | | | | | | | | | | dmu_tx_check_ioerr() pre-reads blocks that are going to be dirtied as part of transaction to both prefetch them and check for errors. But it makes no sense to do it for holes, since there are no disk reads to prefetch and there can be no errors. On the other side those blocks are anonymous, and they are freed immediately by the dbuf_rele() without even being put into dbuf cache, so we just burn CPU time on decompression and overheads and get absolutely no result at the end. Use of dbuf_hold_impl() with fail_sparse parameter allows to skip the extra work, and on my tests with sequential 8KB writes to empty ZVOL with 32KB blocks shows throughput increase from 1.7 to 2GB/s. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15371
* Linux 6.6 compat: fsync_bdev() has been removed in favor of sync_blockdev()Coleman Kane2023-11-083-0/+44
| | | | | | | | | | | | In Linux commit 560e20e4bf6484a0c12f9f3c7a1aa55056948e1e, the fsync_bdev() function was removed in favor of sync_blockdev() to do (roughly) the same thing, given the same input. This change conditionally attempts to call sync_blockdev() if fsync_bdev() isn't discovered during configure. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263
* Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2Coleman Kane2023-11-086-11/+63
| | | | | | | | | | | | | | In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument for the request_mask was added to generic_fillattr. This is the same request_mask for statx that's present in the most recent API implemented by zpl_getattr_impl. This commit conditionally adds it to the zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...) implementation, when configure determines it's present in the kernel's generic_fillattr(...). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263
* Linux 6.6 compat: use inode_get/set_ctime*(...)Coleman Kane2023-11-087-15/+80
| | | | | | | | | | | | | | | | | | | | In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access to the i_ctime member of struct inode was removed. The new approach is to use accessor methods that exclusively handle passing the timestamp around by value. This change adds new tests for each of these functions and introduces zpl_* equivalents in include/os/linux/zfs/sys/zpl.h. In where the inode_get/set_ctime*() functions exist, these zpl_* calls will be mapped to the new functions. On older kernels, these macros just wrap direct-access calls. The code that operated on an address of ip->i_ctime to call ZFS_TIME_DECODE() now will take a local copy using zpl_inode_get_ctime(), and then pass the address of the local copy when performing the ZFS_TIME_DECODE() call, in all cases, rather than directly accessing the member. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #15263 Closes #15257
* Read prefetched buffers from L2ARCshodanshok2023-11-061-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prefetched buffers are currently read from L2ARC if, and only if, l2arc_noprefetch is set to non-default value of 0. This means that a streaming read which can be served from L2ARC will instead engage the main pool. For example, consider what happens when a file is sequentially read: - application requests contiguous data, engaging the prefetcher; - ARC buffers are initially marked as prefetched but, as the calling application consumes data, the prefetch tag is cleared; - these "normal" buffers become eligible for L2ARC and are copied to it; - re-reading the same file will *not* engage L2ARC even if it contains the required buffers; - main pool has to suffer another sequential read load, which (due to most NCQ-enabled HDDs preferring sequential loads) can dramatically increase latency for uncached random reads. In other words, current behavior is to write data to L2ARC (wearing it) without using the very same cache when reading back the same data. This was probably useful many years ago to preserve L2ARC read bandwidth but, with current SSD speed/size/price, it is vastly sub-optimal. Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads, means that even prefetched but unused buffers will be copied into L2ARC, further increasing wear and load for potentially not-useful data. This patch enable prefetched buffer to be read from L2ARC even when l2arc_noprefetch=1 (default), increasing sequential read speed and reducing load on the main pool without polluting L2ARC with not-useful (ie: unused) prefetched data. Moreover, it clear users confusion about L2ARC size increasing but not serving any IO when doing sequential reads. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Gionatan Danti <[email protected]> Closes #15451
* Add mutex_enter_interruptible() for interruptible sleeping IOCTLsThomas Bertschinger2023-11-067-19/+38
| | | | | | | | | | | | | | | | | | | | | | Many long-running ZFS ioctls lock the spa_namespace_lock, forcing concurrent ioctls to sleep for the mutex. Previously, the only option is to call mutex_enter() which sleeps uninterruptibly. This is a usability issue for sysadmins, for example, if the admin runs `zpool status` while a slow `zpool import` is ongoing, the admin's shell will be locked in uninterruptible sleep for a long time. This patch resolves this admin usability issue by introducing mutex_enter_interruptible() which sleeps interruptibly while waiting to acquire a lock. It is implemented for both Linux and FreeBSD. The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to use this new macro so that the command can be interrupted if it is issued during a concurrent `zpool import` (or other long-running operation). Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Thomas Bertschinger <[email protected]> Closes #15360
* Revert "zvol: Temporally disable blk-mq"Tony Hutter2023-11-063-1/+70
| | | | | | | | | This reverts commit aefb6a2bd6c24597cde655e9ce69edd0a4c34357. aefb6a2bd temporally disabled blk-mq until we could fix a fix for Signed-off-by: Tony Hutter <[email protected]> Closes #15439
* zvol: Remove broken blk-mq optimizationTony Hutter2023-11-062-37/+0
| | | | | | | | | | | | | | | | | This fix removes a dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". This optimization results in an unbootable system after installing an OS on a zvol with blk-mq enabled. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15351
* "ARC prefetch metadata accesses:" appears twice in the output.ofthesun92023-11-061-1/+1
| | | | | | | | The first occurrence should be "ARC prefetch data accesses:" Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: ofthesun9 <[email protected]> Closes #15427
* Trust ARC_BUF_SHARED() moreAlexander Motin2023-11-061-17/+24
| | | | | | | | | | | | | | | | | In my understanding ARC_BUF_SHARED() and arc_buf_is_shared() should return identical results, except the second also asserts it deeper. The first is much cheaper though, saving few pointer dereferences. Replace production arc_buf_is_shared() calls with ARC_BUF_SHARED(), and call arc_buf_is_shared() in random assertions, while making it even more strict. On my tests this in half reduces arc_buf_destroy_impl() time, that noticeably reduces hash_lock congestion under heavy dbuf eviction. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15397
* Remove lock from dsl_pool_need_dirty_delay()Alexander Motin2023-11-061-7/+7
| | | | | | | | | | | | | | Torn reads/writes of dp_dirty_total are unlikely: on 64-bit systems due to register size, while on 32-bit due to memory constraints. And even if we hit some race, the code implementing the delay takes the lock any way. Removal of the poll-wide lock acquisition saves ~1% of CPU time on 8-thread 8KB write workload. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15390
* run-zts test procfs/pool_state failed with uncorrectable I/O failureVaibhavB2023-11-061-1/+5
| | | | | | | | | | | Once we trigger the zpool scrub, all zpool/zfs command gets stuck for 180 seconds. Post 180 seconds zpool/zfs commands gets start executing however few more seconds(10s) it take to update the status. hence sleeping for 200 seconds so that we get the correct status. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: vaibhav.bhanawat <[email protected]> Closes #15364
* Properly pad struct tx_cpu to cache lineAlexander Motin2023-11-061-2/+1
| | | | | | | | | We already use ____cacheline_aligned in many places, so add one more instead of seems arbitrary char tc_pad[8]. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15402