openzfs/zfs.git

	Commit message	Author	Age	Files	Lines
*	ZTS: Add LUKS sanity test	Tony Hutter	2024-10-25	1	-0/+1
	Add a LUKS sanity test to trigger: #16631 Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16681
*	Defer resilver only when progress is above a threshold	Pavel Snajdr	2024-10-04	1	-0/+1
	Restart a resilver from scratch if the one currently in progress is below a new tunable, zfs_resilver_defer_percent (defaulting to 10%). The original rationale for deferring additional resilvers when one is already in progress was to help achieve data redundancy sooner for the data that gets scanned at the end of the resilver. But when the admin wants to attach multiple disks to a single vdev, it wasn't immediately obvious that the admin is supposed to run `zpool resilver` afterwards to reset the deferred resilvers and start a new one from scratch. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pavel Snajdr <[email protected]> Closes #15810
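	The decision described in this message can be sketched as follows. This is an illustrative Python model, not the ZFS implementation: the function name is hypothetical, and treating a progress value exactly equal to the threshold as "defer" is an assumption the message does not settle.

```python
def should_defer_resilver(progress_percent: float,
                          defer_percent: float = 10.0) -> bool:
    """Return True to defer a newly requested resilver, False to restart.

    progress_percent: how far the in-progress resilver has got (0-100).
    defer_percent: models the zfs_resilver_defer_percent tunable
                   (the message gives a default of 10%).
    Assumption: progress exactly at the threshold counts as "defer".
    """
    return progress_percent >= defer_percent


# A resilver that is only 5% done is restarted from scratch so that the
# newly attached disks are covered by a single pass.
print(should_defer_resilver(5.0))
# A resilver that is 50% done keeps running; the new one is deferred.
print(should_defer_resilver(50.0))
```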
*	snapdir: add 'disabled' value to make .zfs inaccessible	Brian Behlendorf	2024-10-02	1	-1/+1
	In some environments, just making the .zfs control dir hidden from sight might not be enough. In particular, the following scenarios might warrant not allowing access at all: - old snapshots with wrong permissions/ownership - old snapshots with exploitable setuid/setgid binaries - old snapshots with sensitive contents Introducing a new 'disabled' value that not only hides the control dir, but prevents access to its contents by returning ENOENT solves all of the above. The new property value takes advantage of 'iuv' semantics ("ignore unknown value") to automatically fall back to the old default value when a pool is accessed by an older version of ZFS that doesn't yet know about 'disabled' semantics. I think that technically the zfs_dirlook change is enough to prevent access, but preventing lookups and dir entries in an already opened .zfs handle might also be a good idea to prevent races when modifying the property at runtime. Add zfs_snapshot_no_setuid parameter to control whether automatically mounted snapshots have the setuid mount option set or not. This could be considered a partial fix for one of the scenarios mentioned above. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Co-authored-by: Fabian Grünbichler <[email protected]> Closes #3963 Closes #16587
*	ZTS: Replace MD5 and SHA256 with XXH128	Tino Reichardt	2024-09-28	2	-21/+9
	For data integrity checks as done in ZTS, the verification for unintended data corruption with xxhash128 should be a lot faster and perfectly usable. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #16577
*	ZTS: use openssl for md5digest and sha256digest	Tino Reichardt	2024-09-19	1	-24/+2
	On larger files this should improve the speed. Sample values from my system: [mcmilk@xz]$ time dd if=/dev/zero bs=128k count=1k | sha256sum 254bcc3fc4f27172636df4bf32de9f107f620d559b20d760197e452b97453917 - real 0m1,050s user 0m0,985s sys 0m0,153s [mcmilk@xz]$ time dd if=/dev/zero bs=128k count=1k | openssl sha256 -r 254bcc3fc4f27172636df4bf32de9f107f620d559b20d760197e452b97453917 *stdin real 0m0,254s user 0m0,206s sys 0m0,160s I think cli_root/zdb/zdb_backup.ksh also runs on FreeBSD, and I needed to include the sysutils/coreutils package for the FreeBSD tests within the QEMU patchset. This could be reverted when this pull request gets upstream. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #16543
*	Adding Direct IO Support	Brian Atkinson	2024-09-14	3	-0/+15
	Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write requests the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that the request is block aligned and a cached copy of the buffer is in the ARC, then it will be discarded from the ARC, forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restriction is PAGE_SIZE alignment. In the event that the requested data is buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always, regardless of whether the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: - Checksum - Compression - Encryption - Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls whether a checksum verify is run on O_DIRECT writes to a top-level VDEV before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issued as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup are not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts the O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met, otherwise will redirect through the ARC. This property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. Setting this module parameter to 0 behaves as if the direct dataset property were set to disabled. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Co-authored-by: Mark Maybee <[email protected]> Co-authored-by: Matt Macy <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Closes #10018
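	The alignment and property rules in this message can be read as a small decision table. The sketch below is a simplified Python model of that table, not the ZFS code path: `choose_io_path`, `IoPath`, and all parameter names are hypothetical, user-buffer alignment is not modelled, and the default PAGE_SIZE and recordsize values are only examples.

```python
from enum import Enum


class IoPath(Enum):
    DIRECT = "direct"      # bypass the ARC
    BUFFERED = "buffered"  # go through the ARC
    EINVAL = "EINVAL"      # request is rejected


def choose_io_path(offset, size, *, is_write, o_direct=True,
                   direct="standard", dio_enabled=True, mmapped=False,
                   page_size=4096, recordsize=128 * 1024):
    """Model which path a request takes under the rules in the message."""
    # zfs_dio_enabled=0 behaves like direct=disabled: everything is buffered.
    if not dio_enabled or direct == "disabled":
        return IoPath.BUFFERED
    # direct=always treats every request as though O_DIRECT was passed.
    if not (o_direct or direct == "always"):
        return IoPath.BUFFERED
    page_aligned = offset % page_size == 0 and size % page_size == 0
    if not page_aligned:
        # direct=always never lets a request fail; standard returns EINVAL.
        return IoPath.BUFFERED if direct == "always" else IoPath.EINVAL
    # Page-aligned requests on mmap'ed files fall back to the buffered path.
    if mmapped:
        return IoPath.BUFFERED
    # Writes must additionally be recordsize aligned to go direct.
    if is_write and (offset % recordsize or size % recordsize):
        return IoPath.BUFFERED
    return IoPath.DIRECT


print(choose_io_path(0, 128 * 1024, is_write=True))   # full-record write
print(choose_io_path(4096, 4096, is_write=True))      # page- but not block-aligned
print(choose_io_path(100, 4096, is_write=False))      # not page-aligned
```

	Checking page alignment before the mmap case mirrors the statement that an unaligned request gets EINVAL whether or not the file is mmap'ed.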
*	ddt: dedup log	Rob Norris	2024-08-16	1	-0/+1
	Adds a log/journal to dedup. At the end of txg, instead of writing the entry directly to the ZAP, it is added to an in-memory tree and appended to an on-disk object. The on-disk object is only read at import, to reload the in-memory tree. Lookups first go to the log tree before going to the ZAP, so recently-used entries will remain close by in memory. This vastly reduces overhead from dedup IO, as it will not have to do so many read/update/write cycles on ZAP leaf nodes. A flushing facility is added at end of txg, to push logged entries out to the ZAP. There are actually two separate "logs" (in-memory tree and on-disk object), one active (receiving updated entries) and one flushing (writing out to disk). These are swapped (i.e. flushing begins) based on memory used by the in-memory log trees and time since we last flushed something. The flushing facility monitors the number of entries coming in and being flushed out, and calibrates itself to try to flush enough each txg to keep up with the ingest rate without competing too much with other IO. Multiple tunables are provided to control the flushing facility. All the histograms and stats are updated to accommodate the log as a separate entry store. zdb gains knowledge of how to count them and dump them. Documentation included! Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
*	Fix the names of some FreeBSD sysctls in include/tunables.cfg (#16395)	Allan Jude	2024-08-06	1	-3/+5
	Sponsored-by: Klara, Inc. Signed-off-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]>
*	ZTS: Add zfs/zpool JSON sanity tests	Tony Hutter	2024-08-06	1	-0/+1
	Run basic JSON validation tests on the new `zfs|zpool -j` output. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16217
*	ddt: dedup table quota enforcement	Allan Jude	2024-07-25	1	-0/+1
	This adds two new pool properties: - dedup_table_size, the total size of all DDTs on the pool; and - dedup_table_quota, the maximum possible size of all DDTs in the pool. When set, the quota will be enforced by checking when a new entry is about to be created. If the pool is over its dedup quota, the entry won't be created, and the corresponding write will be converted to a regular non-dedup write. Note that existing entries can be updated (i.e. their refcounts changed), as that reuses the space rather than requiring more. dedup_table_quota can be set to 'auto', which will set it based on the size of the devices backing the "dedup" allocation device. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated. Sponsored-by: iXsystems, Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Signed-off-by: Don Brady <[email protected]> Co-authored-by: Don Brady <[email protected]> Co-authored-by: Rob Wing <[email protected]> Co-authored-by: Sean Eric Fagan <[email protected]> Closes #15889
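	The quota check described in this message can be sketched roughly as below. This is a hypothetical Python model for illustration only: the function and argument names are not ZFS identifiers, `None` stands in for "quota not set", and treating a table exactly at the quota as "over" is an assumption.

```python
def dedup_write_action(ddt_size, quota, entry_exists):
    """Decide how a dedup-eligible write is handled under a DDT quota.

    ddt_size:     current total size of all DDTs (models dedup_table_size)
    quota:        limit in the same units (models dedup_table_quota),
                  or None when no quota is set
    entry_exists: True if the block already has a DDT entry
    """
    if entry_exists:
        # Updating a refcount reuses existing space, so it is always allowed.
        return "dedup"
    if quota is not None and ddt_size >= quota:
        # No room for a new entry: fall back to a regular non-dedup write.
        return "regular"
    return "dedup"


print(dedup_write_action(ddt_size=900, quota=1000, entry_exists=False))
print(dedup_write_action(ddt_size=1000, quota=1000, entry_exists=False))
print(dedup_write_action(ddt_size=1000, quota=1000, entry_exists=True))
```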
*	ZTS: handle FreeBSD version numbers correctly (#16340)	Rob Norris	2024-07-12	1	-1/+29
	FreeBSD patchlevel versions are optional and, if present, in a different location in the version string. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
*	zed: Add deadman-slot_off.sh zedlet	Brian Behlendorf	2024-05-29	1	-0/+1
	Optionally turn off a disk's enclosure slot if an I/O is hung, triggering the deadman. It's possible for outstanding I/O to a misbehaving SCSI disk to neither promptly complete nor return an error. This can occur due to retry and recovery actions taken by the SCSI layer, driver, or disk. When it occurs the pool will be unresponsive even though there may be sufficient redundancy configured to proceed without this single disk. When a hung I/O is detected by the kmods it will be posted as a deadman event. By default an I/O is considered to be hung after 5 minutes. This value can be changed with the zfs_deadman_ziotime_ms module parameter. If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set the disk's enclosure slot will be powered off, causing the outstanding I/O to fail. The ZED will then handle this like a normal disk failure. By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set. As part of this change `zfs_deadman_events_per_second` is added to control the rate limiting of deadman events independently of delay events. In practice, a single deadman event is sufficient and more aren't particularly useful. Alphabetize the zfs_deadman_* entries in zfs.4. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #16226
*	ZTS: Skip cross-fs bclone tests if FreeBSD < 14.0	Tony Hutter	2024-02-16	1	-7/+20
	Skip cross filesystem block cloning tests on FreeBSD if running less than version 14.0. Cross filesystem copy_file_range() was added in FreeBSD 14. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #15901
*	Allowing PERFPOOL to be defined by zfs-test users	Kevin Greene	2024-02-09	1	-1/+1
	Reviewed-by: John Wren Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Kevin Greene <[email protected]> Closes #15868
*	BRT: Fix FICLONE/FICLONERANGE shortened copy	Brian Behlendorf	2024-02-05	1	-0/+1
	On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges, there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out, allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide atomic semantics, since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #15728 Closes #15842
*	Fix cloning into mmaped and cached file.	Pawel Jakub Dawidek	2024-01-17	1	-0/+1
	If the destination file is mmaped and the mmaped region was already read, so that it is cached, we need to update the mmaped pages after a successful clone using update_pages(). Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Pointed out by: Ka Ho Ng <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #15772
*	ZTS: Test for clone, mmap and write for block cloning	Umer Saleem	2024-01-16	1	-0/+1
	For block cloning, if we mmap the cloned file and write from the map into the file, it triggers a panic in dbuf_redirty() on Linux. The same scenario causes data corruption on FreeBSD. Both these issues are fixed under PR#15656 and PR#15665. It would be good to add a test for this scenario in ZTS. The test program and issue were produced by @robn. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #15717
*	Block cloning tests.	Pawel Jakub Dawidek	2023-12-26	2	-2/+14
	The tests mostly focus on testing various corner cases. The tests take a long time to run, so for the common.run runfile we randomly select a hundred tests. To run all the bclone tests, the bclone.run runfile should be used. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #15631
*	ZTS: get_persistent_disk_name can return truncated names	VaibhavB	2023-11-29	1	-1/+1
	Instead of using only the 3rd element, return the entire string after the split to handle device names with dashes. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Vaibhav Bhanawat <[email protected]> Closes #15567
*	ZTS: Fix 'could not unmount datasets' on Alma 9 (#15542)	Tony Hutter	2023-11-20	1	-0/+6
	Many tests are failing on AlmaLinux 9 because ZTS could not destroy the pool in cleanup. This was due to $PWD being set to '.' instead of the expected full path. This patch sets $PWD to the full path. Signed-off-by: Tony Hutter <[email protected]> Reviewed-by: Don Brady <[email protected]>
*	Add a tunable to disable BRT support.	Rich Ercolani	2023-11-16	2	-0/+16
	Copy the disable parameter that FreeBSD implemented, and extend it to work on Linux as well, until we're sure this is stable. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #15529
*	RAID-Z expansion feature	Don Brady	2023-11-08	1	-0/+3
	This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes. The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device). The expansion progress can be monitored with `zpool status`. Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete). The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc).
Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion). A RAIDZ vdev can be expanded multiple times. After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools. Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Authored-by: Matthew Ahrens <[email protected]> Contributions-by: Fedor Uporov <[email protected]> Contributions-by: Stuart Maybee <[email protected]> Contributions-by: Thorsten Behrens <[email protected]> Contributions-by: Fmstrat <[email protected]> Contributions-by: Don Brady <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15022
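	The data-to-parity arithmetic in this message can be worked through with a few lines of Python. This is a simplified model for full-width stripes only (no padding or partial stripes), and the accounting step, where raw space is scaled by the unchanged "assumed" ratio, is an assumption about how the reported figure comes out smaller; the function names are hypothetical.

```python
from fractions import Fraction


def data_fraction(width: int, parity: int) -> Fraction:
    """Fraction of a full-width RAIDZ stripe that holds data."""
    return Fraction(width - parity, width)


old_ratio = data_fraction(5, 2)  # 3 data : 2 parity before expansion
new_ratio = data_fraction(6, 2)  # 4 data : 2 parity for new blocks

# A new block holding 1000 units of data, written at the new ratio.
data = 1000
raw_used = data / new_ratio          # raw space consumed across all disks
reported = raw_used * old_ratio      # assumed ratio is still the old one

print(old_ratio, new_ratio)
print(raw_used, reported)
```

	Under these assumptions the new block consumes 1500 raw units but is accounted as 900 rather than 1000, which matches the direction of the discrepancy the message describes.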
*	Revert "zvol: Temporally disable blk-mq"	Tony Hutter	2023-10-24	1	-1/+1
	This reverts commit aefb6a2bd6c24597cde655e9ce69edd0a4c34357. aefb6a2bd temporarily disabled blk-mq until we could find a fix for Signed-off-by: Tony Hutter <[email protected]> Closes #15439
*	Fix ZED auto-replace for VDEVs using by-id paths	Don Brady	2023-10-20	1	-4/+5
	The change is simple -- restore the original code so that the VDEV path is updated when using by-id paths. The more challenging part was to devise a second ZTS test that would test auto-replace for 'by-id' and help prevent a future regression. With that new test, we can now do an A|B test with, and without, the fix to confirm that auto-replace for by-id paths works. The existing auto-replace test, functional/fault/auto_replace_001_pos, will confirm that we didn't break auto-replace for 'by-vdev' paths. In the original functional/fault/auto_replace_001_pos test, the disk wipe (using dd) was not effective in removing the partitioning since the kernel was never informed of the wipe. Added a call to wipefs(8) so that the kernel is informed and ZED will re-partition the device. Added a validation step that the re-partitioning occurred by confirming that the GPT partition UUID changes. Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15363
*	zvol: Temporally disable blk-mq	Tony Hutter	2023-10-10	1	-1/+1
	There was a report of zvol data loss (#15351) after enabling blk-mq on a zvol backed with 16k physical block sized disks. Out of an abundance of caution, do not allow the user to enable blk-mq until we can look into the issue. Note that blk-mq was not enabled by default on zvols. It was always opt-in via the zvol_use_blk_mq module parameter. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Addresses: #15351 Closes #15378
*	zts: block cloning tests	Rob Norris	2023-07-24	1	-0/+1
	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050 Closes #405 Closes #13349
*	Shorten arcstat_quiescence sleep time	George Amanakis	2023-06-15	1	-1/+1
	With the latest L2ARC fixes, 2 seconds is too long to wait for quiescence of arcstats like l2_size. Shorten this interval to avoid having the persistent L2ARC tests in ZTS prematurely terminated. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14981
*	Teach zpool scrub to scrub only blocks in error log	George Amanakis	2023-05-18	1	-0/+18
	Added a flag '-e' in zpool scrub to scrub only blocks in the error log. A user can pause, resume and cancel the error scrub by passing additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. Error scrubbing is executed across multiple txgs to make sure pool performance is not affected. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: TulsiJain [email protected] Signed-off-by: George Amanakis <[email protected]> Closes #8995 Closes #12355
*	Debug auto_replace_001_pos failures	Brian Behlendorf	2023-05-09	1	-3/+11
	Reduced the timeout to 60 seconds which should be more than sufficient and allow the test to be marked as FAILED rather than KILLED. Also dump the pool status on cleanup. Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14829
*	Simplify and optimize random_int_between().	Pawel Jakub Dawidek	2023-05-05	1	-9/+2
	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14805
*	Ubuntu 22.04 integration: ZTS	szubersk	2022-11-18	1	-1/+1
	Add `detect_odr_violation=1` to ASAN_OPTIONS to allow both libzfs and libzpool to expose `zfeature_info_t spa_feature_table[SPA_FEATURES]` from module/zcommon/zfeature_common.c in their public ABI. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #14148
*	zfs_rename: support RENAME_* flags	Aleksa Sarai	2022-10-28	1	-0/+1
	Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we create two new transaction types for the two flags which need transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT). RENAME_NOREPLACE does not need any ZIL support because we know that if the operation succeeded before creating the ZIL entry, there was no file to be clobbered and thus it can be treated as a regular TX_RENAME. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Snajdr <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Closes #12209 Closes #14070
*	Add options to zfs redundant_metadata property	Akash B	2022-10-19	1	-1/+2
	Currently, extra copies of metadata are created in addition to the redundancy provided by the pool (mirror/raidz/draid). Because of this, 2 times more space is utilized per inode, which decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, which reduces the space consumed by the additional metadata copies and increases the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #13680
*	Support idmapped mount	youzhongyang	2022-10-19	1	-2/+4
	Adds support for idmapped mounts. Supported as of Linux 5.12, this functionality allows user and group IDs to be remapped without changing their state on disk. This can be useful for portable home directories and a variety of container related use cases. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #12923 Closes #13671
*	zed: mark disks as REMOVED when they are removed	Ameer Hamza	2022-09-28	1	-1/+1
	ZED does not take any action for disk removal events if there is no spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on removal event. This means that if you are running zed and remove a disk, it will be properly marked as REMOVED. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13797
*	ZTS: parameter expansion in zfs_unshare_006_pos	John Wren Kennedy	2022-09-15	1	-1/+1
	zfs_unshare_006 checks to see if a dataset still has an active SMB share after doing an NFS unshare -a. The test could fail because the check for the SMB share does not expect dashes in a dataset name to be converted to underscores as pathname delimiters are. Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: John Kennedy <[email protected]> Closes #13893
*	Improve too large physical ashift handling	Alexander Motin	2022-09-08	1	-0/+2
	When iterating through children physical ashifts for a vdev, prefer ones above the maximum logical ashift, which we can actually use, but within the administrator-defined maximum. When selecting the top-level vdev ashift, do not set it to the defined maximum in case the physical ashift is even higher, but just ignore it. Using the maximum does not prevent misaligned writes, but reduces space efficiency. Since ZFS tries to write data sequentially and aggregates the writes, in many cases large misaligned writes may not be as bad as the space penalty otherwise. Allow internal physical ashifts for vdevs higher than SHIFT_MAX. Maybe one day the allocator or aggregation could benefit from that. Reduce the zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB), so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB), but only if it really has to or is explicitly told to, and not as an "optimization". There are some read-intensive NVMe SSDs that report a Preferred Write Alignment of 64KB, and attempting to build a RAIDZ2 of those leads to a space inefficiency that can't be justified. Instead these changes make ZFS fall back to a logical ashift of 12 (4KB) by default and only warn the user that it may be suboptimal for performance. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #13798
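	The ashift values quoted in this message are powers of two, and the selection rule can be sketched in a few lines. This is a loose Python illustration under stated assumptions, not the vdev code: both function names are hypothetical, and the real logic handles multiple children and explicit user-set ashift, which this model ignores.

```python
def ashift_to_bytes(ashift: int) -> int:
    """An ashift of n means an allocation unit of 2**n bytes."""
    return 1 << ashift


def pick_auto_ashift(logical: int, physical: int, max_auto: int = 14) -> int:
    """Simplified model of automatic ashift selection for one vdev.

    logical:  smallest ashift the device can actually use
    physical: ashift the device reports as preferred
    max_auto: models zfs_vdev_max_auto_ashift (new default 14 per the message)
    """
    if physical > max_auto:
        # Too-large physical ashift is ignored rather than clamped to the
        # maximum, so the logical ashift is used instead.
        return logical
    return max(logical, physical)


for a in (12, 14, 16):
    print(a, ashift_to_bytes(a) // 1024, "KB")

# An NVMe SSD reporting a 64KB preferred alignment falls back to 4KB.
print(pick_auto_ashift(logical=12, physical=16))
```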
*	Add Linux posix_fadvise support	Finix1979	2022-09-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The purpose of this PR is to accept fadvise ioctl from userland to do read-ahead on demand. It could dramatically improve sequential read performance especially when primarycache is set to metadata or zfs_prefetch_disable is 1. If the file is mmapped, generic_fadvise is also called for page cache read-ahead besides dmu_prefetch. Only POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL are supported in this PR currently. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Finix Yan <[email protected]> Closes #13694
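From userland, the two supported hints can be issued roughly as shown here. This is a minimal sketch for Linux using Python's `os.posix_fadvise`; the function name `read_with_readahead_hint` and the chunk size are illustrative choices, and whether the hints help depends on the filesystem and its settings.

```python
import os


def read_with_readahead_hint(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file sequentially after advising the kernel; return bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Declare a sequential access pattern for the whole file
        # (a length of 0 means "until end of file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        # Ask for the data to be brought in ahead of the reads.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        total = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        return total
    finally:
        os.close(fd)
```

Both calls are advisory, so the read loop behaves the same whether or not the kernel acts on them.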
*	Add zilstat script to report zil kstats in a user friendly manner	Ameer Hamza	2022-09-02	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Added a python script to process both global and per dataset zil kstats and report them in a user friendly manner similar to arcstat and dbufstat. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13704
*	Add snapshots_changed as property	Umer Saleem	2022-08-02	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make dd_snap_cmtime property persistent across mount and unmount operations by storing it in ZAP and restoring the value from ZAP on hold into dd_snap_cmtime instead of updating it. Expose dd_snap_cmtime as 'snapshots_changed' property that provides a mechanism to quickly determine whether snapshot list for dataset has changed without having to mount a dataset or iterate the snapshot list. It specifies the time at which a snapshot for a dataset was last created or deleted. This allows us to be more efficient in how often we query snapshots. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #13635
*	Replace dead opensolaris.org license link	Tino Reichardt	2022-07-11	2	-2/+2
\| \| \| \| \| \| \| \| \|	The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13619
*	Add Linux namespace delegation support	Will Andrews	2022-06-10	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows ZFS datasets to be delegated to a user/mount namespace. Within that namespace, only the delegated datasets are visible. This works very similarly to Zones/Jails on other ZFS OSes.

As a user:
```
$ unshare -Um
$ zfs list
no datasets available
$ echo $$
1234
```
As root:
```
# zfs list
NAME                            ZONED  MOUNTPOINT
containers                      off    /containers
containers/host                 off    /containers/host
containers/host/child           off    /containers/host/child
containers/host/child/gchild    off    /containers/host/child/gchild
containers/unpriv               on     /unpriv
containers/unpriv/child         on     /unpriv/child
containers/unpriv/child/gchild  on     /unpriv/child/gchild
# zfs zone /proc/1234/ns/user containers/unpriv
```
Back to the user namespace:
```
$ zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
containers               129M  47.8G    24K  /containers
containers/unpriv        128M  47.8G    24K  /unpriv
containers/unpriv/child  128M  47.8G   128M  /unpriv/child
```
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Will Andrews <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Mateusz Piotrowski <[email protected]> Co-authored-by: Allan Jude <[email protected]> Co-authored-by: Mateusz Piotrowski <[email protected]> Sponsored-by: Buddy <https://buddy.works> Closes #12263
*	zvol: Support blk-mq for better performance	Tony Hutter	2022-06-09	3	-13/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes. This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the older BIO interfaces. Note that you must set the `zvol_use_blk_mq` module param to enable the blk-mq API. It is disabled by default. In addition, this commit lets the zvol blk-mq layer process whole `struct request` IOs at a time, rather than breaking them down into their individual BIOs. This reduces dbuf lock contention and overhead versus the legacy zvol submit_bio() codepath.

sequential dd to one zvol, 8k volblocksize, no O_DIRECT:
    legacy submit_bio()    292MB/s write    453MB/s read
    this commit            453MB/s write    885MB/s read

It also introduces a new `zvol_blk_mq_chunks_per_thread` module parameter. This parameter represents how many volblocksize'd chunks to process per each zvol thread. It can be used to tune your zvols for better read vs write performance (higher values favor write, lower favor read). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #13148 Issue #12483
*	Introduce BLAKE3 checksums as an OpenZFS feature	Tino Reichardt	2022-06-08	2	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit adds BLAKE3 checksums to OpenZFS, it has similar performance to Edon-R, but without the caveats around the latter. Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3 Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3 Short description of Wikipedia: BLAKE3 is a cryptographic hash function based on Bao and BLAKE2, created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real World Crypto. BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given enough input. The official Rust and C implementations are dual-licensed as public domain (CC0) and the Apache License. Along with adding the BLAKE3 hash into the OpenZFS infrastructure a new benchmarking file called chksum_bench was introduced. When read it reports the speed of the available checksum functions.

On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench

This is an example output of an i3-1005G1 test system with Debian 11:

    implementation     1k    4k   16k   64k  256k    1m    4m
    edonr-generic    1196  1602  1761  1749  1762  1759  1751
    skein-generic     546   591   608   615   619   612   616
    sha256-generic    240   300   316   314   304   285   276
    sha512-generic    353   441   467   476   472   467   426
    blake3-generic    308   313   313   313   312   313   312
    blake3-sse2       402  1289  1423  1446  1432  1458  1413
    blake3-sse41      427  1470  1625  1704  1679  1607  1629
    blake3-avx2       428  1920  3095  3343  3356  3318  3204
    blake3-avx512     473  2687  4905  5836  5844  5643  5374

Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)

    implementation     1k    4k   16k   64k  256k    1m    4m
    edonr-generic    1840  2458  2665  2719  2711  2723  2693
    skein-generic     870   966   996   992  1003  1005  1009
    sha256-generic    415   442   453   455   457   457   457
    sha512-generic    608   690   711   718   719   720   721
    blake3-generic    301   313   311   309   309   310   310
    blake3-sse2       343  1865  2124  2188  2180  2181  2186
    blake3-sse41      364  2091  2396  2509  2463  2482  2488
    blake3-avx2       365  2590  4399  4971  4915  4802  4764

Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)

    implementation     1k    4k   16k   64k  256k    1m    4m
    edonr-generic    1213  1703  1889  1918  1957  1902  1907
    skein-generic     434   492   520   522   511   525   525
    sha256-generic    167   183   187   188   188   187   188
    sha512-generic    186   216   222   221   225   224   224
    blake3-generic    153   152   154   153   151   153   153
    blake3-sse2       391  1170  1366  1406  1428  1426  1414
    blake3-sse41      352  1049  1212  1174  1262  1258  1259

Output on Debian 5.10.0-11-arm64 system: (Pi400)

    implementation     1k    4k   16k   64k  256k    1m    4m
    edonr-generic     487   603   629   639   643   641   641
    skein-generic     271   299   303   308   309   309   307
    sha256-generic    117   127   128   130   130   129   130
    sha512-generic    145   165   170   172   173   174   175
    blake3-generic     81    29    71    89    89    89    89
    blake3-sse2       112   323   368   379   380   371   374
    blake3-sse41      101   315   357   368   369   364   360

Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations

Note the PPC64 assembly requires the VSX instruction set and the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly. Reviewed-by: Felix Dörre <[email protected]> Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Co-authored-by: Rich Ercolani <[email protected]> Closes #10058 Closes #12918
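The benchmark output quoted above is a whitespace-separated table, so picking the fastest implementation for a given block size takes only a small parser. The sketch below assumes one row per line with a header row starting with `implementation`; the real kstat file may carry extra header lines, which the parser skips. `parse_chksum_bench` and `fastest` are hypothetical helper names, and the sample rows are copied from the i3-1005G1 output in the commit message.

```python
SAMPLE = """\
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
sha256-generic 240 300 316 314 304 285 276
blake3-generic 308 313 313 313 312 313 312
blake3-avx512 473 2687 4905 5836 5844 5643 5374
"""


def parse_chksum_bench(text: str) -> dict:
    """Map implementation name -> {block size label: reported speed}."""
    results = {}
    sizes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "implementation":
            sizes = fields[1:]  # column labels such as "1k", "4k", ...
            continue
        # Ignore anything before the header or with the wrong column count.
        if not sizes or len(fields) != len(sizes) + 1:
            continue
        try:
            values = [int(f) for f in fields[1:]]
        except ValueError:
            continue
        results[fields[0]] = dict(zip(sizes, values))
    return results


def fastest(results: dict, size: str) -> str:
    """Name of the implementation with the highest value in one column."""
    return max(results, key=lambda name: results[name][size])


print(fastest(parse_chksum_bench(SAMPLE), "64k"))
```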
*	tests: add zfs_unshare_008_pos checking whitespace escaping	наб	2022-05-12	1	-0/+12
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13165
*	Adding ZTS test for O_APPEND	Brian Atkinson	2022-05-11	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 63b18e4 fixed an issue in zpl_aio_write() to make sure that kiocb->ki_pos was updated correctly when opening a file with O_APPEND. Adding a test to verify O_APPEND functionality with lseek can make sure that all other distros/kernel versions also have the correct behavior. Also moved the threadappends_001_pos test into this append test directory in functional ZTS directory. This way the two append tests are together for organization purposes. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes #13424
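The behavior such a test verifies can be reproduced from userland in a few lines. This is a sketch of the POSIX semantics only, not the ZTS test itself; `append_after_seek` is an illustrative name.

```python
import os


def append_after_seek(path: str) -> int:
    """Write, seek back to the start, write again; return the final offset."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, b"first")
        # With O_APPEND the offset is moved to end-of-file before every
        # write, so this seek must not cause the next write to overwrite.
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, b"second")
        # After the write, the file position should sit at the new end.
        return os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
```

On a correct implementation the file contains both writes back to back and the returned offset equals the file size.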
*	autoconf: use include directives instead of recursing down tests (mostly)	наб	2022-05-10	1	-16/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Only down to tests/zfs-tests/tests, but pull out C programs into the main Makefile ‒ this means we get correct dependency tracking for all programs (and parallelise across them) dist diff: -zfs-2.1.99/tests/zfs-tests/tests/stress/ -zfs-2.1.99/tests/zfs-tests/tests/stress/Makefile.am -zfs-2.1.99/tests/zfs-tests/tests/stress/Makefile.in Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13316
*	autoconf: use include directives instead of recursing down cmd	наб	2022-05-10	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	No installation diff; dist lost -zfs-2.1.99/cmd/fsck_zfs/fsck.zfs, which was distributed erroneously, since it's generated. Also clean gitrev on clean. Also add -e 'any possible bashisms' to default checkbashisms flags, and fully parallelise it and shellcheck, and it works out-of-tree, too. Also align the Release in the dist META file correctly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13316
*	Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously	Shaan Nobee	2022-05-03	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g. via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases:
- No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete.
- A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shaan Nobee <[email protected]> Closes #12662 Closes #12790
*	Corrected oversight in ZERO_RANGE behavior	Rich Ercolani	2022-04-20	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do have differing semantics in some ways - in particular, one requires KEEP_SIZE, and the other does not. Also added a zero-range test to catch this, corrected a flaw that made the punch-hole test succeed vacuously, and a typo in file_write. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #13329 Closes #13338