openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	Fix -Wuse-after-free warning in dbuf_destroy()	Brian Behlendorf	2022-06-27	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the use of the db pointer ahead of the point where it is freed. It's only used as a tag so a dereference would never occur, but there's no reason we can't invert the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_destroy': module/zfs/dbuf.c:2953:17: error: pointer 'db' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
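The reordering described above can be illustrated with a small userspace sketch. The `holder` structure and the `holder_release()` and `obj_destroy()` functions are hypothetical names invented for this example; they are not the code in dbuf_destroy().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference holder that records the tag of the last release. */
struct holder {
	int refs;
	uintptr_t last_tag;
};

/* The tag is only recorded as a number; it is never dereferenced. */
static void
holder_release(struct holder *h, const void *tag)
{
	assert(h->refs > 0);
	h->refs--;
	h->last_tag = (uintptr_t)tag;
}

/*
 * Release the hold tagged with 'obj' first and free 'obj' afterwards.
 * The opposite order would pass an already freed pointer value to
 * holder_release(), which is the pattern -Wuse-after-free reports even
 * though no dereference takes place.
 */
static int
obj_destroy(struct holder *h, int *obj)
{
	holder_release(h, obj);	/* use the pointer value as a tag ... */
	free(obj);		/* ... and only then free the memory */
	return (h->refs);
}
```

The behavior is the same in either order; only the position of the free() call relative to the last use of the pointer value changes.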
*	Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done()	Brian Behlendorf	2022-06-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the use of the private pointer ahead of the point where it is freed. It's only used as a tag so a dereference would never occur, but there's no harm in inverting the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done': module/zfs/dbuf.c:3204:17: error: pointer 'private' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
*	Fix -Wattribute-warning in dsl layer	Brian Behlendorf	2022-06-27	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The memcpy(), memmove(), and memset() functions have been annotated to perform bounds checking when using FORTIFY_SOURCE. A warning is now generated when writing beyond the end of the specified field. Alternately, the new struct_group() macro could be used to create an anonymous union member for use by memcpy(). However, since this is the only place the macro would be helpful it's preferable to restructure the code slightly to avoid the need for additional compatibility code when the macro does not exist. https://lore.kernel.org/lkml/[email protected]/T/ Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
*	Fix -Wattribute-warning in edonr	Brian Behlendorf	2022-06-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The wrong union member was being accessed in EdonRInit resulting in a write beyond size of field compiler warning. Reference the correct member to resolve the warning. The warning was correct and in this case the mistake was harmless. In function ‘fortify_memcpy_chk’, inlined from ‘EdonRInit’ at zfs/module/icp/algs/edonr/edonr.c:494:3: ./include/linux/fortify-string.h:344:25: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
*	Fix -Wattribute-warning in zfs_log_xvattr()	Brian Behlendorf	2022-06-27	2	-35/+39
\| \| \| \| \| \| \| \| \| \| \| \| \|	Restructure the code in zfs_log_xvattr() to use a lr_attr_end structure when accessing lr_attr_t elements located after the variable sized array. This makes the code more understandable and resolves the accessing beyond the end of the field warnings. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
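A minimal sketch of the idea, assuming a record made of a header, a variable sized array, and fixed fields stored after that array. The `rec_hdr` and `rec_end` types and their fields are hypothetical stand-ins and do not reproduce the lr_attr_t or lr_attr_end layouts.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header followed by a variable number of 64-bit words. */
struct rec_hdr {
	uint64_t nwords;
	uint64_t words[];	/* flexible array member */
};

/* Hypothetical fixed-size fields stored immediately after the array. */
struct rec_end {
	uint64_t attrs;
	uint64_t crtime[2];
};

/* Total allocation size for a record holding 'nwords' array entries. */
static size_t
rec_size(uint64_t nwords)
{
	return (sizeof (struct rec_hdr) + nwords * sizeof (uint64_t) +
	    sizeof (struct rec_end));
}

/*
 * Return a typed pointer to the trailing fields, so callers name them
 * (end->attrs) instead of indexing past the end of words[].
 */
static struct rec_end *
rec_get_end(struct rec_hdr *hdr)
{
	assert(hdr != NULL);
	return ((struct rec_end *)&hdr->words[hdr->nwords]);
}
```

All members are 64-bit here so that the trailing structure stays naturally aligned for any array length; a real on-disk format has to handle alignment according to its own rules.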
*	Silence -Winfinite-recursion warning in luaD_throw()	Brian Behlendorf	2022-06-27	3	-0/+35
\| \| \| \| \| \| \| \| \| \| \| \|	This code should be kept in line with the upstream lua version as much as possible. Therefore, we simply want to silence the warning. This check was enabled by default as part of -Wall in gcc 12.1. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
*	Avoid panic with recordsize > 128k, raw sending and no large_blocks	George Amanakis	2022-06-27	6	-20/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current codebase does not support raw sending buffers with block size > 128kB when large_blocks is not active. This can happen in the codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which calls back dmu_objset_write_done()->dsl_dataset_block_born(). If dsl_dataset_sync() completes its run before dsl_dataset_block_born() is called, we will end up not activating some of the necessary flags, while having blocks based on those flags written in the filesystem. A subsequent send will then panic. Fix this by directly deciding in dmu_objset_sync() whether these flags need to be activated later by dsl_dataset_sync(). Instead of panicking due to a NULL pointer dereference in dmu_dump_write() in case of a send, print out an error message. Also during scrub verify there are no contradicting filesystem flags. Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #12275 Closes #12438
*	Avoid two 64-bit divisions per scanned block	Alexander Motin	2022-06-27	1	-4/+6
\| \| \| \| \| \| \| \|	Change math to make it like the ARC, using multiplications instead. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13591
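The message does not show the changed expressions, so the following is only a generic sketch of trading a division for multiplications in a threshold test. The function names and the percentage check are assumptions made for illustration, not the scan or ARC code.

```c
#include <assert.h>
#include <stdint.h>

/* Division form: is 'used' at least 'pct' percent of 'limit'? */
static int
over_pct_div(uint64_t used, uint64_t limit, uint64_t pct)
{
	assert(limit > 0);
	return (used * 100 / limit >= pct);
}

/*
 * Multiplication form of the same test. For integers with limit > 0,
 * floor(used * 100 / limit) >= pct holds exactly when
 * used * 100 >= limit * pct. Assumes the products do not overflow.
 */
static int
over_pct_mul(uint64_t used, uint64_t limit, uint64_t pct)
{
	return (used * 100 >= limit * pct);
}
```

Both forms give the same answer; the second avoids a 64-bit division, which is comparatively expensive on some CPUs.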
*	Several B-tree optimizations	Alexander Motin	2022-06-24	2	-363/+419
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Introduce first element offset within a leaf. It reduces the average memmove() size by ~50% when adding/removing elements. If the added/removed element is in the first half of the leaf, we may shift elements before it and adjust the bth_first instead of moving more elements after it.
- Use memcpy() instead of memmove() when we know there is no overlap.
- Switch from uint64_t to uint32_t. It does not limit anything, but 32-bit arches should appreciate it greatly in hot paths.
- Store leaf capacity in struct btree to avoid 64-bit divisions.
- Adjust zfs_btree_insert_into_leaf() to always result in balanced leaves after splitting, no matter where the new element was inserted. Not that we care about it much, but it should also allow B-trees with as few as two elements per leaf instead of 4 previously.

When scrubbing a pool of 12 SSDs storing 1.5TB of 4KB zvol blocks, this reduces the amount of time spent in memmove() inside the scan thread from 13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes. It should also reduce spacemaps load time, but I haven't measured it.

Reviewed-by: Paul Dagnelie <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #13582
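The first-element offset idea can be sketched as follows. This is a simplified, hypothetical leaf (fixed capacity, 64-bit elements, `leaf_insert()` invented here), not the zfs_btree implementation; only the role of the `first` field mirrors the bth_first offset mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	LEAF_CAP	8	/* hypothetical leaf capacity */

/* Live elements occupy elems[first .. first + count - 1]. */
struct leaf {
	uint32_t first;
	uint32_t count;
	uint64_t elems[LEAF_CAP];
};

/*
 * Insert 'val' at logical position 'idx' and return how many elements
 * had to be moved. When there is free space in front and the slot is in
 * the first half, shift the elements before the slot down by one and
 * decrement 'first'; otherwise shift the elements after the slot up.
 */
static uint32_t
leaf_insert(struct leaf *l, uint32_t idx, uint64_t val)
{
	assert(l->count < LEAF_CAP && idx <= l->count);
	uint64_t *base = &l->elems[l->first];
	int room_front = l->first > 0;
	int room_back = l->first + l->count < LEAF_CAP;
	uint32_t moved;

	if (room_front && (idx < l->count / 2 || !room_back)) {
		memmove(base - 1, base, idx * sizeof (uint64_t));
		l->first--;
		moved = idx;
	} else {
		assert(room_back);
		memmove(base + idx + 1, base + idx,
		    (l->count - idx) * sizeof (uint64_t));
		moved = l->count - idx;
	}
	l->elems[l->first + idx] = val;
	l->count++;
	return (moved);
}
```

With the live range kept away from index 0, an insert near the front moves only the few elements before the slot rather than everything after it.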
*	Add a "zstream decompress" subcommand	Alan Somers	2022-06-24	7	-5/+425
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It can be used to repair a ZFS file system corrupted by ZFS bug #12762. Use it like this:

    zfs send -c <DS> | \
        zstream decompress <OBJECT>,<OFFSET>[,<COMPRESSION_ALGO>] ... | \
        zfs recv <DST_DS>

Reviewed-by: Ahelenia Ziemiańska <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Alan Somers <[email protected]>
Sponsored-by: Axcient
Workaround for #12762
Closes #13256
*	Several sorted scrub optimizations	Alexander Motin	2022-06-24	4	-201/+154
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Reduce size and comparison complexity of q_exts_by_size B-tree. Previous code used two 64-bit divisions and many other operations to compare two B-tree elements. It created enormous overhead. This implementation moves the math to the upper level and stores the score in the B-tree elements themselves. Since all that we need to store in that B-tree is the extent score and offset, those can fit into a single 8 byte value instead of 24 bytes of q_exts_by_addr element and can be compared with a single operation.
- Better decouple secondary tree logic from main range_tree by moving rt_btree_ops and related functions into dsl_scan.c as ext_size_ops. Those functions are too small to worry about the code duplication and range_tree does not need to know details such as rt_btree_compare.
- Instead of accounting number of pending bytes per pool, which needs an atomic on a global variable per block, account the number of non-empty per-vdev queues, which changes much more rarely.
- When extent scan is interrupted by TXG end, continue it in the next TXG instead of selecting next best extent. This avoids leaving one truncated (and so likely not the best any more) extent each TXG.

On top of some other optimizations this saves about 1.5 minutes out of 10 to scrub a pool of 12 SSDs storing 1.5TB of 4KB zvol blocks.

Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #13576
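Packing a score and an offset into one 8 byte key, as the first item describes, might look like the sketch below. The 16/48 bit split and all `ext_key_*` names are assumptions chosen for the example; the message does not state how the bits are divided.

```c
#include <assert.h>
#include <stdint.h>

#define	SCORE_BITS	16	/* assumed split: 16-bit score, 48-bit offset */
#define	OFFSET_BITS	(64 - SCORE_BITS)
#define	OFFSET_MASK	((UINT64_C(1) << OFFSET_BITS) - 1)

/* Put the score in the high bits so that it dominates the ordering. */
static uint64_t
ext_key_pack(uint64_t score, uint64_t offset)
{
	assert(score < (UINT64_C(1) << SCORE_BITS));
	assert(offset <= OFFSET_MASK);
	return ((score << OFFSET_BITS) | offset);
}

static uint64_t
ext_key_score(uint64_t key)
{
	return (key >> OFFSET_BITS);
}

static uint64_t
ext_key_offset(uint64_t key)
{
	return (key & OFFSET_MASK);
}

/* One integer comparison orders by score first and by offset second. */
static int
ext_key_compare(const void *a, const void *b)
{
	uint64_t ka = *(const uint64_t *)a;
	uint64_t kb = *(const uint64_t *)b;

	return ((ka > kb) - (ka < kb));
}
```

Because the score is computed once when the key is built, the comparator itself needs no divisions at all.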
*	Use macros for quotes and such	Toomas Soome	2022-06-24	1	-40/+81
\| \| \| \| \| \| \|	Use Dq,Pq/Po/Pc macros. illumos dumpadm is now in section 8. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Toomas Soome <[email protected]> Closes #13586
*	Scrub mirror children without BPs	Brian Behlendorf	2022-06-23	7	-37/+252
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When scrubbing a raidz/draid pool, which contains a replacing or sparing mirror with multiple online children, only one child will be read. This is not normally a serious concern because the DTL records are used to determine where a good copy of the data is. As long as the data can be read from one child the mirror vdev will use it to repair gaps in any of its children. Furthermore, even if the data which was read is corrupt the raidz code will detect this and issue its own repair I/O to correct the damage in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent data corruption (say due to overwriting one child) and the scrub happens to read from a child with good data, then the other damaged mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most pronounced when using draid. This is because by default the zed will sequentially rebuild a draid pool to a distributed spare, and the distributed spare half of the mirror is always preferred since it delivers better performance. This means the damaged half of the mirror will go undetected even after scrubbing.

For system administrators this behavior is non-intuitive and in a worst case scenario could result in the only good copy of the data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing mirror children when scrubbing. When the BP isn't available for verification, then compare the data buffers from each child. They must all be identical; if not, there's silent damage and an error is returned to prompt the top-level vdev to issue a repair I/O to rewrite the data on all of the mirror children. Since we can't tell which child was wrong a checksum error is logged against the replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13555
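The buffer comparison used when no BP is available can be pictured with a short sketch. `children_identical()` is a hypothetical helper written for this example; the vdev mirror code works on its own I/O and buffer structures rather than plain pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 when every child buffer matches the first one, 0 otherwise.
 * A mismatch says that some child is damaged but not which one, which
 * is why the error is attributed to the mirror as a whole.
 */
static int
children_identical(const void *const bufs[], size_t nchildren, size_t len)
{
	assert(nchildren > 0);
	for (size_t c = 1; c < nchildren; c++) {
		if (memcmp(bufs[0], bufs[c], len) != 0)
			return (0);
	}
	return (1);
}
```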
*	Fix memory allocation issue for BLAKE3 context	Tino Reichardt	2022-06-21	4	-4/+48
\| \| \| \| \| \| \| \| \| \|	The kmem_alloc(sizeof (*ctx), KM_NOSLEEP) call on FreeBSD can't be used in this code segment. Work around this by pre-allocating a percpu context array for later use. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes #13568
*	Remove install of zfs-load-module.service for dracut	Matthew Thode	2022-06-21	1	-2/+1
\| \| \| \| \| \| \| \|	The zfs-load-module.service service is not currently provided by the OpenZFS repository so we cannot safely assume it exists. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Thode <[email protected]> Closes #13574
*	FreeBSD: Improve crypto_dispatch() handling	Alexander Motin	2022-06-17	1	-12/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handle crypto_dispatch() return values same as crp->crp_etype errors. On FreeBSD 12 many drivers returned same errors both ways, and lack of proper handling for the first ended up in assertion panic later. It was changed in FreeBSD 13, but there is no reason to not be safe. While there, skip waiting for completion, including locking and wakeup() call, for sessions on synchronous crypto drivers, such as typical aesni and software. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13563
*	expose snapshot count via stat(2) of .zfs/snapshot (#13559)	Andrew	2022-06-17	1	-0/+17
\| \| \| \| \| \| \| \| \| \| \|	Increase nlinks in stat results of .zfs/snapshot based on snapshot count. This provides a quick and efficient method for administrators to get snapshot counts without having to use libzfs or list the snapdir contents. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrew Walker <[email protected]> Closes #13559
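Reading the link count from a program only needs stat(2). The helper below is a sketch; `dir_nlink()` is a name made up here, and the message does not say how the reported value relates to the snapshot count (for example whether the directory's base links are included), so it returns the raw st_nlink.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Return st_nlink for 'path', or -1 if stat(2) fails. */
static long
dir_nlink(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return (-1);
	return ((long)st.st_nlink);
}
```

A caller would pass the path of a dataset's .zfs/snapshot directory, for instance `dir_nlink("/tank/fs/.zfs/snapshot")` with a mountpoint of its own.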
*	libzfs: Prevent overriding of error code	ixhamza	2022-06-15	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	zfs_send_cb_impl fails to report error for some flags. Use second error variable for send_conclusion_record. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13558
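The general pattern of keeping a first error from being overwritten by a later call is sketched below. Everything here is hypothetical (`step_main()`, `step_conclude()`, and the way the two results are combined); it is not the body of zfs_send_cb_impl.

```c
#include <assert.h>

/* Hypothetical steps: the first may fail, the second always succeeds. */
static int
step_main(int fail)
{
	assert(fail == 0 || fail == 1);
	return (fail ? 5 : 0);
}

static int
step_conclude(void)
{
	return (0);
}

/* Problem shape: one variable, so a later success hides the failure. */
static int
send_overwrites(int fail)
{
	int err = step_main(fail);

	err = step_conclude();
	return (err);
}

/* Second variable: the first failure is what the caller gets back. */
static int
send_preserves(int fail)
{
	int err = step_main(fail);
	int err2 = step_conclude();

	if (err == 0)
		err = err2;
	return (err);
}
```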
*	Reduce ZIO io_lock contention on sorted scrub	Alexander Motin	2022-06-15	1	-4/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During sorted scrub multiple threads (one per vdev) are issuing many ZIOs at the same time, all using the same scn->scn_zio_root ZIO as parent. This causes huge lock contention on the single global lock on that ZIO. Improve it by introducing per-queue null ZIOs, children to that one, and using them instead as proxy. For a 12 SSD pool storing 1.5TB of 4KB blocks on an 80-core system this dramatically reduces lock contention and reduces scrub time from 21 minutes down to 12.5, while actual read stages (not scan) are about 3x faster, reaching 100K blocks per second per vdev. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13553
*	Add support for ARCH=um for x86 sub-architectures	crass	2022-06-15	2	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When building modules (as well as the kernel) with ARCH=um, the options -Dsetjmp=kernel_setjmp and -Dlongjmp=kernel_longjmp are passed to the C preprocessor for C files. This causes the setjmp and longjmp used in module/lua/ldo.c to be kernel_setjmp and kernel_longjmp respectively in the object file. However, the setjmp and longjmp that is intended to be called is defined in an architecture dependent assembly file under the directory module/lua/setjmp. Since it is an assembly and not a C file, the preprocessor define is not given and the names do not change. This becomes an issue when modpost is trying to create the Module.symvers and sees no defined symbol for kernel_setjmp and kernel_longjmp. To fix this, if the macro CONFIG_UML is defined, then setjmp and longjmp macros are undefined. When building with ARCH=um for x86 sub-architectures, CONFIG_X86 is not defined. Instead, CONFIG_UML_X86 is defined. Despite this, the UML x86 sub-architecture can use the same object files as the x86 architectures because the x86 sub-architecture UML kernel is running with the same instruction set as CONFIG_X86. So the modules/Kbuild build file is updated to add the same object files that CONFIG_X86 would add when CONFIG_UML_X86 is defined. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Glenn Washburn <[email protected]> Closes #13547
*	Fix clang 13 compilation errors	Damian Szuberski	2022-06-15	2	-3/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	``` os/linux/zfs/zvol_os.c:1111:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] add_disk(zv->zv_zso->zvo_disk); ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~ zpl_xattr.c:1579:1: warning: no previous prototype for function 'zpl_posix_acl_release_impl' [-Wmissing-prototypes] ``` Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes #13551
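The first diagnostic goes away once the return value is consumed. The sketch below uses a hypothetical `register_disk()` in place of add_disk() and assumes a GCC or Clang compiler for the attribute; how the zvol code actually handles the result is not shown in the message.

```c
#include <assert.h>

/* Hypothetical function whose result the compiler insists on checking. */
__attribute__((warn_unused_result))
static int
register_disk(int ok)
{
	assert(ok == 0 || ok == 1);
	return (ok ? 0 : -12);
}

/* Store and act on the result instead of calling and discarding it. */
static int
create_volume(int ok)
{
	int error = register_disk(ok);

	if (error != 0)
		return (error);	/* the caller is expected to unwind */
	return (0);
}
```

The second diagnostic (-Wmissing-prototypes) is typically addressed by declaring the function in a header or making it static.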
*	Replace ZPROP_INVAL with ZPROP_USERPROP where it means a user property	Allan Jude	2022-06-14	12	-46/+49
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Sponsored-by: Klara Inc. Closes #12676
*	spl: Use a clearer name for the user namespace fd	Ryan Moeller	2022-06-14	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	This fd has nothing to do with cleanup, that's just the name of the field in zfs_cmd_t that was used to pass it to the kernel. Call it what it is, an fd for a user namespace. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #13554
*	libzfs: zfs_userns: Don't leak the namespace fd	Ryan Moeller	2022-06-14	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	zfs_userns opens a file descriptor for the kernel to look up a namespace, but does not close it. Close the fd when we're done with it. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #13554
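The open, use, close shape of the fix can be sketched like this. `with_fd()` and `fd_is_valid()` are hypothetical helpers for the example; zfs_userns passes the descriptor to the kernel through an ioctl instead of a callback.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Example consumer: returns 0 if 'fd' refers to an open descriptor. */
static int
fd_is_valid(int fd)
{
	return (fcntl(fd, F_GETFD) != -1 ? 0 : -1);
}

/*
 * Open 'path', hand the descriptor to 'use', and always close it before
 * returning so the descriptor does not leak on either outcome.
 */
static int
with_fd(const char *path, int (*use)(int fd))
{
	assert(path != NULL && use != NULL);
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return (-1);
	int ret = use(fd);
	(void) close(fd);
	return (ret);
}
```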
*	Add weekly and monthly systemd timers for trimming	Julian Brunner	2022-06-10	5	-0/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On machines using systemd, trim timers can be enabled on a per-pool basis. Weekly and monthly timer units are provided. Timers can be enabled as follows: systemctl enable [email protected] --now systemctl enable [email protected] --now Each timer will pull in zfs-trim@${poolname}.service, which is not schedule-specific. The manpage zpool-trim has been updated accordingly. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Julian Brunner <[email protected]> Closes #13544
*	Improve sorted scan memory accounting	Alexander Motin	2022-06-10	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \|	Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should count 2x sizeof (range_seg_gap_t) per node. And since average B-tree memory efficiency is about 75%, we should increase it to 3x. Previous code under-counted up to 30% of the memory usage. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13537
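The 3x factor follows from the two numbers in the message: two trees at about 75% efficiency need 2 / 0.75 ≈ 2.67 element sizes per segment, rounded up to 3. The sketch below reproduces that arithmetic; `struct seg` is a hypothetical stand-in whose size is not claimed to match range_seg_gap_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-segment B-tree element. */
struct seg {
	uint64_t start;
	uint64_t end;
	uint64_t fill;
};

/* ceil(ntrees / (eff_pct / 100)) using integer arithmetic only. */
static uint64_t
mem_multiplier(uint64_t ntrees, uint64_t eff_pct)
{
	assert(eff_pct > 0 && eff_pct <= 100);
	return ((ntrees * 100 + eff_pct - 1) / eff_pct);
}

/* Estimated bytes of B-tree memory for 'nsegs' queued segments. */
static uint64_t
scan_mem_estimate(uint64_t nsegs)
{
	return (nsegs * mem_multiplier(2, 75) * sizeof (struct seg));
}
```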
*	Add Linux namespace delegation support	Will Andrews	2022-06-10	33	-15/+1166
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows ZFS datasets to be delegated to a user/mount namespace. Within that namespace, only the delegated datasets are visible. Works very similarly to Zones/Jails on other ZFS OSes.

As a user:
```
$ unshare -Um
$ zfs list
no datasets available
$ echo $$
1234
```

As root:
```
# zfs list
NAME                            ZONED  MOUNTPOINT
containers                      off    /containers
containers/host                 off    /containers/host
containers/host/child           off    /containers/host/child
containers/host/child/gchild    off    /containers/host/child/gchild
containers/unpriv               on     /unpriv
containers/unpriv/child         on     /unpriv/child
containers/unpriv/child/gchild  on     /unpriv/child/gchild
# zfs zone /proc/1234/ns/user containers/unpriv
```

Back to the user namespace:
```
$ zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
containers               129M  47.8G    24K  /containers
containers/unpriv        128M  47.8G    24K  /unpriv
containers/unpriv/child  128M  47.8G   128M  /unpriv/child
```

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Will Andrews <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Sponsored-by: Buddy <https://buddy.works>
Closes #12263
*	Revert parts of 938cfeb0f27303721081223816d4f251ffeb1767	Allan Jude	2022-06-10	1	-16/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When reading and writing the UID/GID, we always want the value relative to the root user namespace; the kernel will take care of remapping this to the user namespace for us. Calling from_kuid(user_ns, uid) with an unmapped uid will return -1 as that uid is outside of the scope of that namespace, and will result in the files inside the namespace all being owned by 'nobody' and not being allowed to call chmod or chown on them. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #12263
*	AVL: Remove obsolete branching optimizations	Alexander Motin	2022-06-09	1	-20/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Modern Clang and GCC can successfully implement simple conditions without branching with math and flag operations. Use of arrays for translation no longer helps as much as it did 14+ years ago. Disassembly of the code generated by Clang 13.0.0 on FreeBSD 13.1, Clang 14.0.4 on FreeBSD 14 and GCC 10.2.1 on Debian 11 with this change still shows no branching instructions. Profiling of CPU-bound scan stage of sorted scrub shows reproducible reduction of time spent inside avl_find() from 6.52% to 4.58%. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13540
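As an illustration of replacing translation arrays with plain expressions, the sketch below maps a comparison result to a child index and a child index to a balance delta both ways. The table contents and function names are assumptions for the example and may differ from what avl.c used before or after this change.

```c
#include <assert.h>

/* Table-based translations (the older style). */
static const int cmp2child_tab[] = { 0, 0, 1 };	/* indexed by cmp + 1 */
static const int child2balance_tab[] = { -1, 1 };	/* indexed by child */

static int
child_from_cmp_table(int cmp)
{
	assert(cmp >= -1 && cmp <= 1);
	return (cmp2child_tab[cmp + 1]);
}

static int
balance_from_child_table(int child)
{
	assert(child == 0 || child == 1);
	return (child2balance_tab[child]);
}

/* Expression-based equivalents that compilers can emit without branches. */
static int
child_from_cmp_expr(int cmp)
{
	return (cmp > 0);
}

static int
balance_from_child_expr(int child)
{
	return ((child << 1) - 1);
}
```

Whether the generated code is branch-free depends on the compiler and flags, which is why the message reports checking the disassembly.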
*	libzfs: Rename msg bufs to errbuf for consistency	Ryan Moeller	2022-06-09	1	-135/+138
\| \| \| \| \| \| \| \| \| \| \| \|	`libzfs_pool.c` uses the name `msg` where everywhere else in libzfs uses `errbuf` for the error message buffer. Use the name consistent with the rest of libzfs and use ERRBUFLEN instead of 1024. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #13539
*	libzfs: Define the de facto standard errbuf size	Ryan Moeller	2022-06-09	8	-52/+52
\| \| \| \| \| \| \| \| \| \|	Every errbuf array in libzfs is 1024 chars. Define ERRBUFLEN in a shared header, and use it. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #13539
*	zvol: Support blk-mq for better performance	Tony Hutter	2022-06-09	18	-148/+1437
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes. This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the older BIO interfaces. Note that you must set the `zvol_use_blk_mq` module param to enable the blk-mq API. It is disabled by default. In addition, this commit lets the zvol blk-mq layer process whole `struct request` IOs at a time, rather than breaking them down into their individual BIOs. This reduces dbuf lock contention and overhead versus the legacy zvol submit_bio() codepath. sequential dd to one zvol, 8k volblocksize, no O_DIRECT: legacy submit_bio() 292MB/s write 453MB/s read this commit 453MB/s write 885MB/s read It also introduces a new `zvol_blk_mq_chunks_per_thread` module parameter. This parameter represents how many volblocksize'd chunks to process per each zvol thread. It can be used to tune your zvols for better read vs write performance (higher values favor write, lower favor read). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #13148 Issue #12483
*	Introduce BLAKE3 checksums as an OpenZFS feature	Tino Reichardt	2022-06-08	53	-52/+22804
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit adds BLAKE3 checksums to OpenZFS. It has similar performance to Edon-R, but without the caveats around the latter.

Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3

Short description from Wikipedia:

BLAKE3 is a cryptographic hash function based on Bao and BLAKE2, created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real World Crypto. BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given enough input. The official Rust and C implementations are dual-licensed as public domain (CC0) and the Apache License.

Along with adding the BLAKE3 hash into the OpenZFS infrastructure a new benchmarking file called chksum_bench was introduced. When read it reports the speed of the available checksum functions.

On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench

This is an example output of an i3-1005G1 test system with Debian 11:

implementation      1k      4k     16k     64k    256k      1m      4m
edonr-generic     1196    1602    1761    1749    1762    1759    1751
skein-generic      546     591     608     615     619     612     616
sha256-generic     240     300     316     314     304     285     276
sha512-generic     353     441     467     476     472     467     426
blake3-generic     308     313     313     313     312     313     312
blake3-sse2        402    1289    1423    1446    1432    1458    1413
blake3-sse41       427    1470    1625    1704    1679    1607    1629
blake3-avx2        428    1920    3095    3343    3356    3318    3204
blake3-avx512      473    2687    4905    5836    5844    5643    5374

Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)

implementation      1k      4k     16k     64k    256k      1m      4m
edonr-generic     1840    2458    2665    2719    2711    2723    2693
skein-generic      870     966     996     992    1003    1005    1009
sha256-generic     415     442     453     455     457     457     457
sha512-generic     608     690     711     718     719     720     721
blake3-generic     301     313     311     309     309     310     310
blake3-sse2        343    1865    2124    2188    2180    2181    2186
blake3-sse41       364    2091    2396    2509    2463    2482    2488
blake3-avx2        365    2590    4399    4971    4915    4802    4764

Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)

implementation      1k      4k     16k     64k    256k      1m      4m
edonr-generic     1213    1703    1889    1918    1957    1902    1907
skein-generic      434     492     520     522     511     525     525
sha256-generic     167     183     187     188     188     187     188
sha512-generic     186     216     222     221     225     224     224
blake3-generic     153     152     154     153     151     153     153
blake3-sse2        391    1170    1366    1406    1428    1426    1414
blake3-sse41       352    1049    1212    1174    1262    1258    1259

Output on Debian 5.10.0-11-arm64 system: (Pi400)

implementation      1k      4k     16k     64k    256k      1m      4m
edonr-generic      487     603     629     639     643     641     641
skein-generic      271     299     303     308     309     309     307
sha256-generic     117     127     128     130     130     129     130
sha512-generic     145     165     170     172     173     174     175
blake3-generic      81      29      71      89      89      89      89
blake3-sse2        112     323     368     379     380     371     374
blake3-sse41       101     315     357     368     369     364     360

Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations

Note the PPC64 assembly requires the VSX instruction set and the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.

Reviewed-by: Felix Dörre <[email protected]>
Reviewed-by: Ahelenia Ziemiańska <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Co-authored-by: Rich Ercolani <[email protected]>
Closes #10058
Closes #12918
*	autoconf: AC_MSG_CHECKING consistency	Brian Behlendorf	2022-06-01	9	-17/+17
\| \| \| \| \| \| \| \| \| \| \|	Make the wording more consistent for the kernel AC_MSG_CHECKING output (e.g. "checking whether ...".). Additionally, group some of the VFS interface checks with the others. No functional change. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Attila Fülöp <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13529
*	Linux 5.19 compat: asm/fpu/internal.h	Brian Behlendorf	2022-06-01	2	-2/+23
\| \| \| \| \| \| \| \| \| \| \|	As of the Linux 5.19 kernel the asm/fpu/internal.h header was entirely removed. It has been effectively empty since the 5.16 kernel and provides no required functionality. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Attila Fülöp <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13529
*	Remove wrong assertion in log spacemap	Alexander Motin	2022-06-01	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is typical, but not generally true that if log summary has more blocks it must also have unflushed metaslabs. Normally with metaslabs flushed in order it works, but there are known exceptions, such as device removal or metaslab being loaded during its flush attempt. Before 600a02b8844 if spa_flush_metaslabs() hit loading metaslab it usually stopped (unless memlimit is also exceeded), but now it may flush more metaslabs, just skipping that particular one. This increased chances of assertion to fire when the skipped metaslab is flushed on next iteration if all other metaslabs in that summary entry are already flushed out of order. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13486 Closes #13513
*	Corrected parameters for zstd early abort	Rich Ercolani	2022-05-31	1	-2/+2
\| \| \| \| \| \| \|	That'll teach me to try and recall them from the definition. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #13519
*	Fix typo in zil_commit() comment block	Allan Jude	2022-05-31	1	-1/+1
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #13518
*	Linux 5.18 compat: META	Brian Behlendorf	2022-05-31	1	-1/+1
\| \| \| \| \| \| \| \|	Update the META file to reflect compatibility with the 5.18 kernel. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13527
*	Linux 5.19 compat: zap_flags_t conflict	Brian Behlendorf	2022-05-31	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	As of the Linux 5.19 kernel an identically named zap_flags_t typedef is declared in the include/linux/mm_types.h linux header. Sadly, the inclusion of this header cannot be easily avoided. To resolve the conflict a #define is used to remap the name in the OpenZFS sources when building against the Linux kernel. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
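The renaming trick can be shown in a single translation unit. In this sketch the first typedef stands in for the kernel header's declaration, and `example_zap_flags_t` and the flag names are hypothetical; the replacement name actually used in the OpenZFS sources is not given in the message.

```c
#include <assert.h>

/* Stand-in for an external header that already declares zap_flags_t. */
typedef unsigned int zap_flags_t;

/*
 * Remap the name before our own declaration. From here on every use of
 * zap_flags_t in this file refers to example_zap_flags_t, so the two
 * typedefs no longer collide.
 */
#define	zap_flags_t	example_zap_flags_t

typedef enum example_zap_flags {
	EXAMPLE_FLAG_A = 1 << 0,	/* hypothetical flag */
	EXAMPLE_FLAG_B = 1 << 1		/* hypothetical flag */
} zap_flags_t;

/* Uses the remapped type through its original spelling. */
static int
has_flag(zap_flags_t flags, zap_flags_t f)
{
	assert(f != 0);
	return ((flags & f) != 0);
}
```

Without the #define the second typedef would be rejected as a conflicting redeclaration; with it, existing code keeps its spelling unchanged.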
*	Linux 5.19 compat: bdev_start_io_acct() / bdev_end_io_acct()	Brian Behlendorf	2022-05-31	2	-30/+62
\| \| \| \| \| \| \| \| \|	As of the Linux 5.19 kernel the disk_*_io_acct() helper functions have been replaced by the bdev_*_io_acct() functions. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
*	Linux 5.19 compat: aops->read_folio()	Brian Behlendorf	2022-05-31	3	-0/+46
\| \| \| \| \| \| \| \| \|	As of the Linux 5.19 kernel the readpage() address space operation has been replaced by read_folio(). Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
*	Linux 5.19 compat: blkdev_issue_secure_erase()	Brian Behlendorf	2022-05-31	2	-9/+81
\| \| \| \| \| \| \| \| \| \| \|	Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure erase functionality from the blkdev_issue_discard() function. blkdev_issue_secure_erase() must now be called to issue a secure erase. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
*	Linux 5.19 compat: bdev_max_secure_erase_sectors()	Brian Behlendorf	2022-05-31	3	-24/+43
\| \| \| \| \| \| \| \| \| \| \|	Linux 5.19 commit torvalds/linux@44abff2c0 removed the blk_queue_secure_erase() helper function. The preferred interface is now to use the bdev_max_secure_erase_sectors() function to check for secure erase support. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
*	Linux 5.19 compat: bdev_max_discard_sectors()	Brian Behlendorf	2022-05-31	5	-6/+51
\| \| \| \| \| \| \| \| \| \| \|	Linux 5.19 commit torvalds/linux@70200574cc removed the blk_queue_discard() helper function. The preferred interface is to now use the bdev_max_discard_sectors() function to check for discard support. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
*	Linux 5.18 compat: bio_alloc()	Brian Behlendorf	2022-05-31	1	-14/+39
\| \| \| \| \| \| \| \| \| \|	As of the Linux 5.18 kernel bio_alloc() expects a block_device struct as an argument. This removes the need for the bio_set_dev() compatibility code for 5.18 and newer kernels. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13515
*	Fix inflated quiesce time caused by lwb_tx during zil_commit()	Kevin Jin	2022-05-26	2	-21/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the current zil_commit() process, the transaction lwb_tx is assigned in zil_lwb_write_issue() and committed in zil_lwb_flush_vdevs_done(). Thus, during the lwb write-out process, the txg is held in the open or quiescing state until zil_lwb_flush_vdevs_done() is called. If the zil's zio latency is high, it will cause txg_sync_thread() to starve. The goal here is to defer waiting for zil_lwb_flush_vdevs_done() to the 'syncing' txg state, that is, to zil_sync(). This patch achieves the goal without holding the transaction. A new function, zil_lwb_flush_wait_all(), is introduced. It waits for the completion of all zil_lwb_flush_vdevs_done() calls for the given txg. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: jxdking <[email protected]> Closes #12321
*	Replace EXTRA_DIST with dist_noinst_DATA	Brian Behlendorf	2022-05-26	27	-57/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The EXTRA_DIST variable is ignored when used in the FALSE conditional of a Makefile.am. This results in the `make dist` target omitting these files from the generated tarball unless CONFIG_USER is defined. This issue can be avoided by switching to use the dist_noinst_DATA variable which is handled as expected by autoconf. This change also adds support for --with-config=dist as an alias for --with-config=srpm and updates the GitHub workflows to use it. Reviewed-by: Ahelenia Ziemiańska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13459 Closes #13505
*	Silence unused-but-set-variable warning	Ryan Moeller	2022-05-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	This was breaking the kmod port build on FreeBSD with Clang 13. Use the same trick as we do for ASSERT() to make DNODE_VERIFY() use its parameter at compile time without actually using it at run time in non-debug builds. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #13507
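The "use the parameter at compile time only" trick mentioned above can be sketched as follows. The macro name VERIFY_NOOP and the helper functions are hypothetical and chosen for illustration; the sketch only assumes that the operand of sizeof is not evaluated for non-variable-length types, which is standard C behaviour.

```c
#include <stdint.h>

/*
 * Hypothetical non-debug variant of a verification macro. The
 * argument appears inside sizeof, so the compiler sees it as used
 * (no unused-but-set-variable warning), but no code is generated
 * and the expression is never evaluated at run time.
 */
#define	VERIFY_NOOP(x)	((void) sizeof ((uintptr_t)(x)))

static int side_effects = 0;

/* Helper with a visible side effect, to show it is never called. */
static int
touch(void)
{
	side_effects++;
	return (1);
}

/* Returns the number of times touch() actually ran. */
int
demo_noop(void)
{
	int only_for_checks;

	only_for_checks = 7;		/* set, and "used" only by the macro */
	VERIFY_NOOP(only_for_checks);
	VERIFY_NOOP(touch());		/* not evaluated: operand of sizeof */
	return (side_effects);
}
```

Calling demo_noop() leaves side_effects at zero, which shows that the macro keeps the variable referenced for the compiler without doing any work at run time.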
*	More speculative prefetcher improvements	Alexander Motin	2022-05-25	5	-101/+133
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Make prefetch distance adaptive: up to 4MB the prefetch distance doubles for every hit, same as before, but after that it grows by 1/8 every time the prefetch read does not complete in time to satisfy the demand. My tests show that 4MB is sufficient for a wide NVMe pool to saturate a single reader thread at 2.5GB/s, while the new 64MB maximum allows the same thread to reach 1.5GB/s on a wide HDD pool. Further distance increases may increase speed even more, but less dramatically and with higher latency. - Allow early reuse of inactive prefetch streams: streams that never saw hits can be reused immediately if there is demand, while others can be reused after 1s of inactivity, starting with the oldest. After 2s of inactivity streams are deleted to free resources, same as before. This improves strided read performance on an HDD pool several times over in the presence of simultaneous random reads, which previously filled the zfetch_max_streams limit for seconds and so blocked most prefetch. - Always issue intermediate indirect block reads with SYNC priority. Each of those reads, if delayed for long, may delay up to 1024 other block prefetches, which may be bad for wide pools. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13452
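The adaptive distance rule from the first item can be sketched as a small pure function. This is a reading of the commit message, not the actual zfetch code: the function name, the constant names, and the demand_waited flag are all hypothetical, and only the 4MB and 64MB limits and the doubling and 1/8 growth steps come from the text above.

```c
#include <stdint.h>

/* Limits taken from the commit message; the names are hypothetical. */
#define	PREFETCH_SOFT_MAX	(4ULL << 20)	/* 4MB: doubling stops here */
#define	PREFETCH_HARD_MAX	(64ULL << 20)	/* 64MB: absolute maximum */

/*
 * Compute the next prefetch distance after a stream hit.
 * demand_waited is nonzero when the prefetch read did not complete
 * in time to satisfy the demand read (assumed to be known by the caller).
 */
uint64_t
prefetch_next_distance(uint64_t dist, int demand_waited)
{
	if (dist < PREFETCH_SOFT_MAX) {
		/* Below the soft limit: double on every hit, as before. */
		dist *= 2;
		if (dist > PREFETCH_SOFT_MAX)
			dist = PREFETCH_SOFT_MAX;
	} else if (demand_waited) {
		/* Past the soft limit: grow by 1/8 only when we were late. */
		dist += dist / 8;
		if (dist > PREFETCH_HARD_MAX)
			dist = PREFETCH_HARD_MAX;
	}
	return (dist);
}
```

Under these assumptions a stream at 1MB moves to 2MB on its next hit, a stream at 4MB stays at 4MB while prefetch keeps up, and it moves to 4.5MB the first time a demand read has to wait.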