openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	Split functional testings via github action matrix	Tino Reichardt	2023-03-15	3	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit changes the workflow of the github actions. We split the workflow into different parts: 1) build zfs modules for Ubuntu 20.04 and 22.04 (~25m) 2) 2x zloop test (~10m) + 2x sanity test (~25m) 3) functional testings in parts 1..5 (each ~1h) - these could be triggered, when sanity tests are ok - currently I just start them all in the same time 4) cleanup and create summary When everything is fine, the full run with all testings should be done in around 2 hours. The codeql.yml and checkstyle.yml are not part in this circle. The testings are also modified a bit: - report info about CPU and checksum benchmarks - reset the debugging logs for each test - when some error occurred, we call dmesg with -c to get only the log output for the last failed test - we empty also the dbgsys Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #14078
*	Improve tests and update man page for healing recv	Alek P	2023-03-15	4	-2/+197
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix the manpage. The "SYNOPSIS" section is incorrectly formatted for receive -c. I also took this opportunity to reword some parts and fix a run-on sentence in the manpage. Add large block testing for corrective recv. This adds a new test that makes sure blocks generated using zfs send -L/--large-block large-block send flag are able to be used for healing. Since with unloaded key and errlog feature enabled corruption is not shown in zpool status #13675 is fixed the zfs_receive_corrective.ksh test no longer sets -o feature@head_errlog=disabled on pool creation so that it can also test for regressions related to head_errlog feature. Note that the zfs_receive_compressed_corrective.ksh and zfs_receive_large_block_corrective.ksh tests are still creating pools with -o feature@head_errlog=disabled. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #14615
*	Remove unused Edon-R variants	Tino Reichardt	2023-03-14	1	-78/+5
\| \| \| \| \| \| \| \|	This commit removes the edonr_byteorder.h file and all unused variants of Edon-R. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13618
*	nvpair: Constify string functions	Richard Yao	2023-03-14	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This lead to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14612
*	Replace dead opensolaris.org license links	Tino Reichardt	2023-03-14	7	-7/+7
\| \| \| \| \| \| \| \| \| \| \|	The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #14625
*	Implementation of block cloning for ZFS	Pawel Jakub Dawidek	2023-03-10	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Block Cloning allows to manually clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Christian Schwarz <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #13392
*	Fix incremental receive silently failing for recursive sends	Paul Dagnelie	2023-03-10	4	-14/+79
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The problem occurs because dmu_recv_begin pulls in the payload and next header from the input stream in order to use the contents of the begin record's nvlist. However, the change to do that before the other checks in dmu_recv_begin occur caused a regression where an empty send stream in a recursive send could have its END record consumed by this, which broke the logic of recv_skip. A test is also included to protect against this case in the future. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #12661 Closes #14568
*	More adaptive ARC eviction	Alexander Motin	2023-03-08	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Traditionally ARC adaptation was limited to MRU/MFU distribution. But for years people with metadata-centric workload demanded mechanisms to also manage data/metadata distribution, that in original ZFS was just a FIFO. As result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO. This change removes most of existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and meta- data, same as the distribution between data and metadata themselves. Since most of required states separation was already done, it only required to make arcs_size state field specific per data/metadata. The adaptation logic is still based on previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes instead of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd and arc_pm, representing ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when need to evict, this moves all the logic from arc_adapt() to arc_evict(), that reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag from many places. This change also removes number of metadata specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduced single opaque knob zfs_arc_meta_balance, tuning ARC's reaction on ghost hits, allowing administrator give more or less preference to metadata without setting strict limits. Some of old code parts like arc_evict_meta() are just removed, because since introduction of ABD ARC they really make no sense: only headers referenced by small number of buffers are not evictable, and they are really not evictable no matter what this code do. Instead just call arc_prune_async() if too much metadata appear not evictable. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14359
*	Update BLAKE3 for using the new impl handling	Tino Reichardt	2023-03-02	1	-6/+12
\| \| \| \| \| \| \| \| \| \| \|	This commit changes the BLAKE3 implementation handling and also the calls to it from the ztest command. Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
*	Add generic implementation handling and SHA2 impl	Tino Reichardt	2023-03-02	1	-7/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The skeleton file module/icp/include/generic_impl.c can be used for iterating over different implementations of algorithms. It is used by SHA256, SHA512 and BLAKE3 currently. The Solaris SHA2 implementation got replaced with a version which is based on public domain code of cppcrypto v0.10. These assembly files are taken from current openssl master: - sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64) - sha512-x86_64.S: x64, AVX, AVX2 (x86_64) - sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm) - sha512-armv7.S: ARMv7, NEON (arm) - sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64) - sha512-armv8.S: ARMv7, ARMv8-CE (aarch64) - sha256-ppc.S: Generic PPC64 LE/BE (ppc64) - sha512-ppc.S: Generic PPC64 LE/BE (ppc64) - sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) - sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) Tested-by: Rich Ercolani <[email protected]> Tested-by: Sebastian Gottschall <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13741
*	zdb: add decryption support	Rob N	2023-03-02	4	-4/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The approach is straightforward: for dataset ops, if a key was offered, find the encryption root and the various encryption parameters, derive a wrapping key if necessary, and then unlock the encryption root. After that all the regular dataset ops will return unencrypted data, and that's kinda the whole thing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #11551 Closes #12707 Closes #14503
*	ZTS: Minor fixes	Brian Behlendorf	2023-02-23	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	- The migration_012_pos.ksh test case was failing because of a missing space after `log_must`. - None of the tests listed in the runfiles should include the .ksh suffix. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14515
*	Fix buffered/direct/mmap I/O race	Brian Behlendorf	2023-02-23	4	-2/+90
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13608 Closes #14498
*	EIO caused by encryption + recursive gang	Matthew Ahrens	2023-02-06	3	-1/+88
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Encrypted blocks can not have 3 DVAs, because they use the space of the 3rd DVA for the IV+salt. zio_write_gang_block() takes this into account, setting `gbh_copies` to no more than 2 in this case. Gang members BP's do not have the X (encrypted) bit set (nor do they have the DMU level and type fields set), because encryption is not handled at this level. The gang block is reassembled, and then encryption (and compression) are handled. To check if this gang block is encrypted, the code in zio_write_gang_block() checks `pio->io_bp`. This is normally fine, because the block that's being ganged is typically the encrypted BP. The problem is that if there is "recursive ganging", where a gang member is itself a gang block, then when zio_write_gang_block() is called to create a gang block for a gang member, `pio->io_bp` is the gang member's BP, which doesn't have the X bit set, so the number of DVA's is not restricted to 2. It should instead be looking at the the "gang leader", i.e. the top-level gang block, to determine how many DVA's can be used, to avoid a "NDVA's inversion" (where a child has more DVA's than its parent). gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's space: ``` DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=... [L0 ZFS plain file] fletcher4 uncompressed encrypted LE gang unique double size=100000L/100000P birth=... fill=1 cksum=... ``` leader's GBH contains a BP with gang bit set and 3 DVA's: ``` DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique double size=55600L/55600P birth=... fill=0 cksum=... DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique double size=55600L/55600P birth=... fill=0 cksum=... DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200> [L0 unallocated] fletcher4 uncompressed unencrypted LE gang unique double size=55400L/55400P birth=... fill=0 cksum=... ``` On nondebug bits, having the 3rd DVA in the gang block works for the most part, because it's true that all 3 DVA's are available in the gang member BP (in the GBH). However, for accounting purposes, gang block DVA's ASIZE include all the space allocated below them, i.e. the 512-byte gang block header (GBH) as well as the gang members below that. We see that above where the gang leader BP is 1MB logical (and after compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB) more than 1MB (0x`100400`). Since thre are 3 copies of a block below it, we increment the ATIME of the 3rd DVA of the gang leader by the space used by the 3rd DVA of the child (1 sector, in this case). But there isn't really a 3rd DVA of the parent; the salt is stored in place of the 3rd DVA's ASIZE. So when zio_write_gang_member_ready() increments the parent's BP's `DVA[2]`'s ASIZE, it's actually incrementing the parent's salt. When we later try to read the encrypted recursively-ganged block, the salt doesn't match what we used to write it, so MAC verification fails and we get an EIO. ``` zio_encrypt(): encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89 zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89 ``` This commit addresses the problem by not increasing the number of copies of the GBH beyond 2 (even for non-encrypted blocks). This simplifies the logic while maintaining the ability to traverse all metadata (including gang blocks) even if one copy is lost. (Note that 3 copies of the GBH will still be created if requested, e.g. for `copies=3` or MOS blocks.) Additionally, the code that increments the parent's DVA's ASIZE is made to check the parent DVA's NDVAS even on nondebug bits. So if there's a similar bug in the future, it will cause a panic when trying to write, rather than corrupting the parent BP and causing an error when reading. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Caused-by: #14356 Closes #14440 Closes #14413
*	Prevent error messages when running tests with no timeout	Paul Dagnelie	2023-02-02	1	-1/+1
\| \| \| \| \| \|	Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #14450
*	Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole	David Hedberg	2023-01-23	4	-1/+90
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block. If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg. To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: David Hedberg <[email protected]> Sponsored by: Findity AB Closes #11893 Closes #14358
*	Configure zed's diagnosis engine with vdev properties	rob-wing	2023-01-23	4	-1/+311
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce four new vdev properties: checksum_n checksum_t io_n io_t These properties can be used for configuring the thresholds of zed's diagnosis engine and are interpeted as <N> events in T <seconds>. When this property is set to a non-default value on a top-level vdev, those thresholds will also apply to its leaf vdevs. This behavior can be overridden by explicitly setting the property on the leaf vdev. Note that, these properties do not persist across vdev replacement. For this reason, it is advisable to set the property on the top-level vdev instead of the leaf vdev. The default values for zed's diagnosis engine (10 events, 600 seconds) remains unchanged. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Rob Wing <[email protected]> Sponsored-by: Seagate Technology LLC Closes #13805
*	Use setproctitle to report progress of zfs send	Ameer Hamza	2023-01-17	1	-16/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows parsing of zfs send progress by checking the process title. Doing so requires some changes to the send code in libzfs_sendrecv.c; primarily these changes move some of the accounting around, to allow for the code to be verbose as normal, or set the process title. Unlike BSD, setproctitle() isn't standard in Linux; thus, borrowed it from libbsd with slight modifications. Authored-by: Sean Eric Fagan <[email protected]> Co-authored-by: Ryan Moeller <[email protected]> Co-authored-by: Ameer Hamza <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14376
*	ZTS: Annotate additonal flaky test cases	Brian Behlendorf	2023-01-17	1	-3/+7
\| \| \| \| \| \| \| \| \|	Update several flaky test cases in zts-report.py.in until they can be made entirely reliable. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14392
*	Activate filesystem features only in syncing context	George Amanakis	2023-01-11	4	-1/+96
\| \| \| \| \| \| \| \| \| \| \|	When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14304 Closes #14252
*	ZTS: close in mmapwrite.c	Antonio Russo	2023-01-06	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mmapwrite is used during the ZTS to identify issues with mmap-ed files. This helper program exercises this pathway by continuously writing to a file. ee6bf97c7 modified the writing threads to terminate after a set amount of total data is written. This change allows standard program execution to reach the end of a writer thread without closing the file descriptor, introducing a resource "leak." This patch appeases resource leak analyses by close()-ing the file at the end of the thread. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #14353
*	ZTS: limit mmapwrite file size	Antonio Russo	2023-01-05	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mmapwrite spawns several threads, all of which perform writes on a file for the purpose of testing the behavior of mmap(2)-ed files. One thread performs an mmap and a write to the beginning of that region, while the others perform regular writes after lseek(2)-ing the end of the file. Because these regular writes are set in a while (1) loop, they will write an unbounded amount of data to disk. The mmap_write_001_pos test script SIGKILLs them after 30 seconds, but on fast testbeds, this may be enough time to exhaust the available space in the filesystem, leading to spurious test failures. Instead, limit the total file size by checking that the lseek return value is no greater than 250 * 1024*1024 bytes, which is less than the default minimum vdev size defined in includes/default.cfg . Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #14277 Closes #14345
*	arc_read()/arc_access() refactoring and cleanup	Alexander Motin	2022-12-22	5	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14123
*	Allow receiver to override encryption properties in case of replication	Ameer Hamza	2022-12-13	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, the receiver fails to override the encryption property for the plain replicated dataset with the error: "cannot receive incremental stream: encryption property 'encryption' cannot be set for incremental streams.". The problem is resolved by allowing the receiver to override the encryption property for plain replicated send. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14253 Closes #13533
*	Skip permission checks for extended attributes	Ameer Hamza	2022-12-12	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zfs_zaccess_trivial() calls the generic_permission() to read xattr attributes. This causes deadlock if called from zpl_xattr_set_dir() context as xattr and the dent locks are already held in this scenario. This commit skips the permissions checks for extended attributes since the Linux VFS stack already checks it before passing us the control. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14220
*	ZTS: Add missing tests to Makefile.am	Brian Behlendorf	2022-12-07	3	-8/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The send-c_zstream_recompress.ksh test case was being skipped because it was not added to the Makefile.am, and was thus left out of the package. As for the renameat2 tests these were being skipped because when the patch was rebased it was not updated to use the new Makefile layout for the tests directory. Correct this. Add missing pre/post sections to sanity.run so the pyzfs tests will run. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Damian Szuberski <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14266
*	nopwrites on dmu_sync-ed blocks can result in a panic	George Wilson	2022-12-02	2	-8/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After a device has been removed, any nopwrites for blocks on that indirect vdev should be ignored and a new block should be allocated. The original code attempted to handle this but used the wrong block pointer when checking for indirect vdevs and failed to check all DVAs. This change corrects both of these issues and modifies the test case to ensure that it properly tests nopwrites with device removal. Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #14235
*	ZTS: test reported checksum errors for ZED	Rob Wing	2022-12-02	3	-1/+127
\| \| \| \| \| \| \| \| \| \| \|	Test checksum error reporting to ZED via the call paths vdev_raidz_io_done_unrecoverable() and zio_checksum_verify(). Sponsored-by: Seagate Technology LLC Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Wing <[email protected]> Closes #14190
*	Python3: replace `distutils` with `sysconfig`	Damian Szuberski	2022-11-28	3	-23/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- `distutils` module is long time deprecated and already deleted from the CPython mainline. - To remain compatible with Debian/Ubuntu Python3 packaging style, try `distutils.sysconfig.get_python_path(0,0)` first with fallback on `sysconfig.get_path('purelib')` - pyzfs_unittest suite is run unconditionally as a part of ZTS. - Add pyzfs_unittest suite to sanity tests. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #12833 Closes #13280 Closes #14177
*	ZTS: zts-report silently ignores perf test results	John Wren Kennedy	2022-11-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	The regex used to extract test result information from a test run only matches the functional tests. Update the regex so it matches both. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: John Wren Kennedy <[email protected]> Closes #14185
*	Fix setting the large_block feature after receiving a snapshot	George Amanakis	2022-11-18	3	-1/+80
\| \| \| \| \| \| \| \| \| \| \| \| \|	We are not allowed to dirty a filesystem when done receiving a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will not be set on that filesystem since the filesystem is not on dp_dirty_datasets, and a subsequent encrypted raw send will fail. Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature needs to be activated and do so if appropriate. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #13699 Closes #13782
*	Ubuntu 22.04 integration: ZTS	szubersk	2022-11-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add `detect_odr_violation=1` to ASAN_OPTIONS to allow both libzfs and libzpool expose ``` zfeature_info_t spa_feature_table[SPA_FEATURES] ``` from module/zcommon/zfeature_common.c in public ABI. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #14148
*	Handle and detect #13709's unlock regression (#14161)	Rich Ercolani	2022-11-15	4	-1/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In #13709, as in #11294 before it, it turns out that 63a26454 still had the same failure mode as when it was first landed as d1d47691, and fails to unlock certain datasets that formerly worked. Rather than reverting it again, let's add handling to just throw out the accounting metadata that failed to unlock when that happens, as well as a test with a pre-broken pool image to ensure that we never get bitten by this again. Fixes: #13709 Signed-off-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
*	Add ability to recompress send streams with new compression algorithm	Paul Dagnelie	2022-11-10	2	-6/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As new compression algorithms are added to ZFS, it could be useful for people to recompress data with new algorithms. There is currently no mechanism to do this aside from copying the data manually into a new filesystem with the new algorithm enabled. This tool allows the transformation to happen through zfs send, allowing it to be done efficiently to remote systems and in an incremental fashion. A new zstream command is added that decompresses WRITE records and then recompresses them with a provided algorithm, and then re-emits the modified send stream. It may also be possible to re-compress embedded block pointers, but that was not attempted for the initial version. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #14106
*	ZTS: random_readwrite test doesn't run correctly	John Wren Kennedy	2022-11-10	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	This test uses fio's bssplit mechanism to choose io sizes for the test, leaving the PERF_IOSIZES variable empty. Because that variable is empty, the innermost loop in do_fio_run_impl is never executed, and as a result, this test does the setup but collects no data. Setting the variable to "bssplit" allows performance data to be gathered. Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: John Wren Kennedy <[email protected]> Closes #14163
*	Support idmapped mount in user namespace	youzhongyang	2022-11-08	5	-17/+176
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping infrastructure to support idmapped mounts of filesystems mounted with an idmapping". Update the OpenZFS accordingly to improve the idmapped mount support. This pull request contains the following changes: - xattr setter functions are fixed to take mnt_ns argument. Without this, cp -p would fail for an idmapped mount in a user namespace. - idmap_util is enhanced/fixed for its use in a user ns context. - One test case added to test idmapped mount in a user ns. Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14097
*	zed: Prevent special vdev to be replaced by hot spare	Ameer Hamza	2022-11-04	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	Special vdevs should not be replaced by a hot spare. Log vdevs already support this, extending the functionality for special vdevs. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14129
*	Deny receiving into encrypted datasets if the keys are not loaded	Attila Fülöp	2022-11-03	1	-9/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 68ddc06b611854560fefa377437eb3c9480e084b introduced support for receiving unencrypted datasets as children of encrypted ones but unfortunately got the logic upside down. This resulted in failing to deny receives of incremental sends into encrypted datasets without their keys loaded. If receiving a filesystem, the receive was done into a newly created unencrypted child dataset of the target. In case of volumes the receive made the target volume undeletable since a dataset was created below it, which we obviously can't handle. Incremental streams with embedded blocks are affected as well. We fix the broken logic to properly deny receives in such cases. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes #13598 Closes #14055 Closes #14119
*	ZTS: rsend_009_pos.ksh is destructive on zfs-on-root system	youzhongyang	2022-11-01	1	-20/+23
\| \| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14113
*	Fix oversights from 4170ae4e	Richard Yao	2022-10-31	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	4170ae4ea600fea6ac9daa8b145960c9de3915fc was intended to tackle TOCTOU race conditions reported by CodeQL, but as an oversight, a file descriptor was not closed and some comments were not updated. Interestingly, CodeQL did not complain about the file descriptor leak, so there is room for improvement in how we configure it to try to detect this issue so that we get early warning about this. In addition, an optimization opportunity was missed by mistake in lib/libshare/os/linux/smb.c, which prevented us from truly closing the TOCTOU race. This was also caught by Coverity. Reported-by: Coverity (CID 1524424) Reported-by: Coverity (CID 1526804) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14109
*	Fix TOCTOU race conditions reported by CodeQL and Coverity	Richard Yao	2022-10-29	3	-27/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CodeQL and Coverity both complained about: * lib/libshare/os/linux/smb.c * tests/zfs-tests/cmd/mmapwrite.c * twice * tests/zfs-tests/tests/functional/tmpfile/tmpfile_002_pos.c * tests/zfs-tests/tests/functional/tmpfile/tmpfile_stat_mode.c * coverity had a second complaint that CodeQL did not have * tests/zfs-tests/cmd/suid_write_to_file.c * Coverity had two complaints and CodeQL had one complaint, both differed. The CodeQL complaint is about the main point of the test, so it is not fixable without a hack involving `fork()`. The issues reported by CodeQL are fixed, with the exception of the last one, which is deemed to be a false positive that is too much trouble to wrokaround. The issues reported by Coverity were only fixed if CodeQL complained about them. There were issues reported by Coverity in a number of other files that were not reported by CodeQL, but fixing the CodeQL complaints is considered a priority since we want to integrate it into a github workflow, so the remaining Coverity complaints are left for future work. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14098
*	zfs_rename: support RENAME_* flags	Aleksa Sarai	2022-10-28	13	-2/+404
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we create two new transaction types for the two flags which need transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT). RENAME_NOREPLACE does not need any ZIL support because we know that if the operation succeeded before creating the ZIL entry, there was no file to be clobbered and thus it can be treated as a regular TX_RENAME. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Snajdr <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Closes #12209 Closes #14070
*	Fix multiplication converted to larger type	Andrew Innes	2022-10-28	1	-2/+2
\| \| \| \| \| \| \| \| \|	This fixes the instances of the "Multiplication result converted to larger type" alert that codeQL scanning found. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Andrew Innes <[email protected]> Closes #14094
*	Add delay between zpool add zvol and zpool destroy	youzhongyang	2022-10-21	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	As investigated by #14026, the zpool_add_004_pos can reliably hang if the timing is not right. This is caused by a race condition between zed doing zpool reopen (due to the zvol being added to the zpool), and the command zpool destroy. This change adds a delay between zpool add zvol and zpool destroy to avoid these issue, but does not address the underlying problem. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Issue #14026 Closes #14052
*	Silence new static analyzer defect reports from idmap_util.c	Richard Yao	2022-10-20	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	2a068a1394d179dda4becf350e3afb4e8819675e introduced 2 new defect reports from Coverity and 1 from Clang's static analyzer. Coverity complained about a potential resource leak from only calling `close(fd)` when `fd > 0` because `fd` might be `0`. This is a false positive, but rather than dismiss it as such, we can change the comparison to ensure that this never appears again from any static analyzer. Upon inspection, 6 more instances of this were found in the file, so those were changed too. Unfortunately, since the file descriptor has been put into an unsigned variable in `attr.userns_fd`, we cannot do a non-negative check on it to see if it has not been allocated, so we instead restructure the error handling to avoid the need for a check. This also means that errors had not been handled correctly here, so the static analyzer found a bug (although practically by accident). Coverity also complained about a dereference before a NULL check in `do_idmap_mount()` on `source`. Upon inspection, it appears that the pointer is never NULL, so we delete the NULL check as cleanup. Clang's static analyzer complained that the return value of `write_pid_idmaps()` can be uninitialized if we have no idmaps to write. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14061
*	Fix multiple definitions of struct mount_attr on recent glibc versions	Richard Yao	2022-10-20	1	-7/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	The ifdef used would never work because the CPP is not aware of C structure definitions. Rather than use an autotools check, we can just use a nameless structure that we typedef to mount_attr_t. This is a Linux kernel interface, which means that it is stable and this is fine to do. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14057 Closes #14058
*	Add options to zfs redundant_metadata property	Akash B	2022-10-19	2	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, additional/extra copies are created for metadata in addition to the redundancy provided by the pool(mirror/raidz/draid), due to this 2 times more space is utilized per inode and this decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, hence can reduce the space consumed by the additional metadata copies and increase the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #13680
*	Support idmapped mount	youzhongyang	2022-10-19	15	-4/+1323
\| \| \| \| \| \| \| \| \| \| \| \|	Adds support for idmapped mounts. Supported as of Linux 5.12 this functionality allows user and group IDs to be remapped without changing their state on disk. This can be useful for portable home directories and a variety of container related use cases. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #12923 Closes #13671
*	Fix declarations of non-global variables	Tino Reichardt	2022-10-18	5	-33/+33
\| \| \| \| \| \| \| \| \|	This patch inserts the `static` keyword to non-global variables, which where found by the analysis tool smatch. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13970
*	Cleanup: Address Clang's static analyzer's unused code complaints	Richard Yao	2022-10-14	5	-31/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These were categorized as the following: * Dead assignment 23 * Dead increment 4 * Dead initialization 6 * Dead nested assignment 18 Most of these are harmless, but since actual issues can hide among them, we correct them. That said, there were a few return values that were being ignored that appeared to merit some correction: * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from `destroy_batched()`. We handle it by returning -1 if there is an error. * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from `zfs_for_each()`. We handle it by doing a binary OR of the error value from the subsequent `zfs_for_each()` call to the existing value. This is how errors are mostly handled inside `zfs_for_each()`. The error value here is passed to exit from the zfs command, so doing a binary or on it is better than what we did previously. * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from `dsl_prop_get_ds()` when the property is not of type string. We return an error when it does. There is a small concern that the `zfs_get_temporary_prop()` call would handle things, but in the case that it does not, we would be pushing an uninitialized numval onto the lua stack. It is expected that `dsl_prop_get_ds()` will succeed anytime that `zfs_get_temporary_prop()` does, so that not giving it a chance to fix things is not a problem. * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used `nvlist_add_nvlist()` twice in ways in which errors are expected to be impossible, so we switch to `fnvlist_add_nvlist()`. A few notable ones did not merit use of the return value, so we suppressed it with `(void)`: * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error value from `describe_free()`. A look through the commit history revealed that this was intentional. * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the returned handle from `arc_hdr_realloc()` because it is already referenced in lists. * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly saying not to use the error from `vdev_label_init()` because whatever causes the error could be the reason why a detach is being done. Unfortunately, I am not presently able to analyze the kernel modules with Clang's static analyzer, so I could have missed some cases of this. In cases where reports were present in code that is duplicated between Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version too. After this commit is merged, regressions like dee8934 should become extremely obvious with Clang's static analyzer since a regression would appear in the results as the only instance of unused code. That assumes that Coverity does not catch the issue first. My local branch with fixes from all of my outstanding non-draft pull requests shows 118 reports from Clang's static anlayzer after this patch. That is down by 51 from 169. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Cedric Berger <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13986