aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
...
* Fix panic in dsl_process_sub_livelist for EINTRSerapheim Dimitropoulos2022-11-011-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | = Issue Recently we hit an assertion panic in `dsl_process_sub_livelist` while exporting the spa and interrupting `bpobj_iterate_nofree`. In that case `bpobj_iterate_nofree` stops mid-way returning an EINTR without clearing the intermediate AVL tree that keeps track of the livelist entries it has encountered so far. At that point the code has a VERIFY for the number of elements of the AVL expecting it to be zero (which is not the case for EINTR). = Fix Cleanup any intermediate state before destroying the AVL when encountering EINTR. Also added a comment documenting the scenario where the EINTR comes up. There is no need to do anything else for the calles of `dsl_process_sub_livelist` as they already handle the EINTR case. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #13939
* Bring per_txg_dirty_frees_percent back to 30Mateusz Guzik2022-11-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | The current value causes significant artificial slowdown during mass parallel file removal, which can be observed both on FreeBSD and Linux when running real workloads. Sample results from Linux doing make -j 96 clean after an allyesconfig modules build: before: 4.14s user 6.79s system 48% cpu 22.631 total after: 4.17s user 6.44s system 153% cpu 6.927 total FreeBSD results in the ticket. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #13932 Closes #13938
* Add options to zfs redundant_metadata propertyAkash B2022-11-014-7/+116
| | | | | | | | | | | | | | | | | | Currently, additional/extra copies are created for metadata in addition to the redundancy provided by the pool(mirror/raidz/draid), due to this 2 times more space is utilized per inode and this decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, hence can reduce the space consumed by the additional metadata copies and increase the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #13680
* FreeBSD: Fix a pair of bugs in zfs_fhtovp()Mark Johnston2022-10-261-1/+2
| | | | | | | | | | | | | | | | | - Add a zfs_exit() call in an error path, otherwise a lock is leaked. - Remove the fid_gen > 1 check. That appears to be Linux-specific: zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the snapshot directory is mounted. On FreeBSD it fails, making snapshot dirs inaccessible via NFS. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Andriy Gapon <[email protected]> Signed-off-by: Mark Johnston <[email protected]> Fixes: 43dbf8817808 ("FreeBSD: vfsops: use setgen for error case") Closes #14001 Closes #13974 (cherry picked from commit ed566bf1cd0bdbf85e8c63c1c119e3d2ef5db1f6)
* Fix sequential resilver drive failure race conditionsamwyc2022-10-211-1/+13
| | | | | | | | | | | | | | | This patch handles the race condition on simultaneous failure of 2 drives, which misses the vdev_rebuild_reset_wanted signal in vdev_rebuild_thread. We retry to catch this inside the vdev_rebuild_complete_sync function. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Reviewed-by: Akash B <[email protected]> Signed-off-by: Samuel Wycliffe J <[email protected]> Closes #14041 Closes #14050
* kcfpool_alloc() should have its argument list marked voidRichard Yao2022-10-121-1/+1
| | | | | | | | | | | | | | | | | | This error occurred when building on Gentoo with debugging enabled: zfs-kmod-2.1.6/work/zfs-2.1.6/module/icp/core/kcf_sched.c:1277:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] kcfpool_alloc() ^ void 1 error generated. This function is not present in master. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14023
* Fix bad free in skein codeRichard Yao2022-09-281-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | Clang's static analyzer found a bad free caused by skein_mac_atomic(). It will allocate a context on the stack and then pass it to skein_final(), which attempts to free it. Upon inspection, skein_digest_atomic() also has the same problem. These functions were created to match the OpenSolaris ICP API, so I was curious how we avoided this in other providers and looked at the SHA2 code. It appears that SHA2 has a SHA2Final() helper function that is called by the exported sha2_mac_final()/sha2_digest_final() as well as the sha2_mac_atomic() and sha2_digest_atomic() functions. The real work is done in SHA2Final() while some checks and the free are done in sha2_mac_final()/sha2_digest_final(). We fix the use after free in the skein code by taking inspiration from the SHA2 code. We introduce a skein_final_nofree() that does most of the work, and make skein_final() into a function that calls it and then frees the memory. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13954
* FreeBSD: handle V_PCATCHMateusz Guzik2022-09-281-0/+4
| | | | | | | | | See https://cgit.FreeBSD.org/src/commit/?id=a75d1ddd74312f5dd79bc1e965f7077679659f2e Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #13910
* FreeBSD: catch up to 1400068Mateusz Guzik2022-09-281-11/+30
| | | | | | Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #13909
* FreeBSD: stop passing LK_INTERLOCK to VOP_LOCKMateusz Guzik2022-09-281-1/+2
| | | | | | | | | There is an ongoing effort to eliminate this feature. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #13908
* FreeBSD: Fix integer conversion for vnlru_free{,_vfsops}()Richard Yao2022-09-281-0/+6
| | | | | | | | | | | | | | | | | When reviewing #13875, I noticed that our FreeBSD code has an issue where it converts from `int64_t` to `int` when calling `vnlru_free{,_vfsops}()`. The result is that if the int64_t is `1 << 36`, the int will be 0, since the low bits are 0. Even when some low bits are set, a value such as `((1 << 36) + 1)` would truncate to 1, which is wrong. There is protection against this on 32-bit platforms, but on 64-bit platforms, there is no check to protect us, so we add a check. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13882
* FreeBSD: Ignore symlink to i386 includesRyan Moeller2022-09-281-0/+1
| | | | | | | | | | | A symlink to i386 includes is created in the build dir on amd64 since freebsd/freebsd-src@d07600c563039f252becc29ac7d9a454b6b0600d Tell git to ignore it like the other include links. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #13719
* LUA: Fix CVE-2014-5461Richard Yao2022-09-271-1/+1
| | | | | | | | | | | | | | | | | Apply the fix from upstream. http://www.lua.org/bugs.html#5.2.2-1 https://www.opencve.io/cve/CVE-2014-5461 It should be noted that exploiting this requires the `SYS_CONFIG` privilege, and anyone with that privilege likely has other opportunities to do exploits, so it is unlikely that bad actors could exploit this unless system administrators are executing untrusted ZFS Channel Programs. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13949
* Linux: Fix uninitialized variable usage in zio_do_crypt_data()Richard Yao2022-09-271-3/+3
| | | | | | | | | | | | | | | | | Coverity complained about this. An error from `hkdf_sha512()` before uio initialization will cause pointers to uninitialized memory to be passed to `zio_crypt_destroy_uio()`. This is a regression that was introduced by cf63739191b6cac629d053930a4aea592bca3819. Interestingly, this never affected FreeBSD, since the FreeBSD version never had that patch ported. Since moving uio initialization to the top of this function would slow down the qat_crypt() path, we only move the `memset()` calls to the top of the function. This is sufficient to fix this problem. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13944
* Refactor Log Size LimitAlexander Motin2022-09-262-27/+48
| | | | | | | | | | | | | | | | Original Log Size Limit implementation blocked all writes in case of limit reached until the TXG is committed and the log is freed. It caused huge delays and following speed spikes in application writes. This implementation instead smoothly throttles writes, using exactly the same mechanism as used for dirty data. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: jxdking <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Issue #12284 Closes #13476
* Revert "Reduce dbuf_find() lock contention"Brian Behlendorf2022-09-212-15/+15
| | | | | | | | | | This reverts commit 34dbc618f50cfcd392f90af80c140398c38cbcd1. While this change resolved the lock contention observed for certain workloads, it inadventantly reduced the maximum hash inserts/removes per second. This appears to be due to the slightly higher acquisition cost of a rwlock vs a mutex. Reviewed-by: Brian Behlendorf <[email protected]>
* Add zfs_btree_verify_intensity kernel module parameterRichard Yao2022-09-211-1/+7
| | | | | | | | | | I see a few issues in the issue tracker that might be aided by being able to turn this on. We have no module parameter for it, so I would like to add one. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13874
* Fix incorrect size given to bqueue_enqueue() call in dmu_redact.c Richard Yao2022-09-211-2/+2
| | | | | | | | | | We pass sizeof (struct redact_record *) rather than sizeof (struct redact_record). Passing the pointer size is wrong. Coverity caught this in two places. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13885
* Delay ZFS_PROP_SHARESMB property to handle it for encrypted raw receiveAmeer Hamza2022-09-211-0/+15
| | | | | | | | | | | | | | For encrypted raw receive, objset creation is delayed until a call to dmu_recv_stream(). ZFS_PROP_SHARESMB property requires objset to be populated when calling zpl_earlier_version(). To correctly handle the ZFS_PROP_SHARESMB property for encrypted raw receive, this change delays setting the property. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13878
* Improve too large physical ashift handlingAlexander Motin2022-09-215-11/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When iterating through children physical ashifts for vdev, prefer ones above the maximum logical ashift, that we can actually use, but within the administrator defined maximum. When selecting top-level vdev ashift, do not set it to the defined maximum in case physical ashift is even higher, but just ignore one. Using the maximum does not prevent misaligned writes, but reduces space efficiency. Since ZFS tries to write data sequentially and aggregates the writes, in many cases large misanigned writes may be not as bad as the space penalty otherwise. Allow internal physical ashifts for vdevs higher than SHIFT_MAX. May be one day allocator or aggregation could benefit from that. Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB), so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB), but only if it really has to or explicitly told to, but not as an "optimization". There are some read-intensive NVMe SSDs that report Preferred Write Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a space inefficiency that can't be justified. Instead these changes make ZFS fall back to logical ashift of 12 (4KB) by default and only warn user that it may be suboptimal for performance. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #13798
* Add Module Parameter Regarding Log Size LimitKevin Jin2022-09-215-2/+86
| | | | | | | | | | | | | zfs_wrlog_data_max The upper limit of TX_WRITE log data. Once it is reached, write operation is blocked, until log data is cleared out after txg sync. It only counts TX_WRITE log with WR_COPIED or WR_NEED_COPY. Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: jxdking <[email protected]> Closes #12284
* Optimize txg_kick() process (#12274)Kevin Jin2022-09-212-28/+33
| | | | | | | | | | | | | | | | | Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: jxdking <[email protected]> Closes #12274
* zfs recv hangs if max recordsize is less than received recordsizeAmeer Hamza2022-09-191-10/+13
| | | | | | | | | | | - Some optimizations for bqueue enqueue/dequeue. - Added a fix to prevent deadlock when both bqueue_enqueue_impl() and bqueue_dequeue() waits for signal to be triggered. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #13855
* vdev_draid_lookup_map() should not iterate outside draid_mapsRichard Yao2022-09-151-1/+1
| | | | | | | | | | Coverity reported this as an out-of-bounds read. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #13865
* Add physical device size to SIZE column in 'zpool list -v'Akash B2022-09-151-0/+1
| | | | | | | | | | | | | | Add physical device size/capacity only for physical devices in 'zpool list -v' instead of displaying "-" in the SIZE column. This would make it easier to see the individual device capacity and to determine which spares are large enough to replace which devices. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #12561 Closes #13106
* Introduce a tunable to exclude special class buffers from L2ARCGeorge Amanakis2022-09-144-7/+112
| | | | | | | | | | | | Special allocation class or dedup vdevs may have roughly the same performance as L2ARC vdevs. Introduce a new tunable to exclude those buffers from being cacheable on L2ARC. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #11761 Closes #12285
* Apply arc_shrink_shift to ARC above arc_c_minAlexander Motin2022-09-132-5/+9
| | | | | | | | | | | | | It makes sense to free memory in smaller chunks when approaching arc_c_min to let other kernel subsystems to free more, since after that point we can't free anything. This also matches behavior on Linux, where to shrinker reported only the size above arc_c_min. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #13794
* Fix use-after-free in btree codeRichard Yao2022-09-131-2/+2
| | | | | | | | | | | Coverty static analysis found these. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #10989 Closes #13861
* Linux 5.20 compat: blk_cleanup_disk()Brian Behlendorf2022-08-091-0/+4
| | | | | | | | | As of the Linux 5.20 kernel blk_cleanup_disk() has been removed, all callers should use put_disk(). Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13728
* Linux 5.20 compat: bdevname()Brian Behlendorf2022-08-091-1/+11
| | | | | | | | | As of the Linux 5.20 kernel bdevname() has been removed, all callers should use snprintf() and the "%pg" format specifier. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13728
* Revert behavior of 59eab109 on not-LinuxRich Ercolani2022-08-021-1/+8
| | | | | | | | | | | | | | It turns out that short-circuiting the EFAULT behavior on a short read breaks things on FreeBSD. So until there's a nicer solution, let's just revert the behavior for not-Linux. Reference: https://reviews.freebsd.org/R10:70f51f0e474ffe1fb74cb427423a2fba3637544d Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12698
* Handle partial reads in zfs_readRich Ercolani2022-08-021-0/+8
| | | | | | | | | | | | | | | | | | | | Currently, dmu_read_uio_dnode can read 64K of a requested 1M in one loop, get EFAULT back from zfs_uiomove() (because the iovec only holds 64k), and return EFAULT, which turns into EAGAIN on the way out. EAGAIN gets interpreted as "I didn't read anything", the caller tries again without consuming the 64k we already read, and we're stuck. This apparently works on newer kernels because the caller which breaks on older Linux kernels by happily passing along a 1M read request and a 64k iovec just requests 64k at a time. With this, we now won't return EFAULT if we got a partial read. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12370 Closes #12509 Closes #12516
* module: lua: ldo: fix pragma nameнаб2022-07-281-1/+1
| | | | | | | | | | | | | | /home/nabijaczleweli/store/code/zfs/module/lua/ldo.c:175:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas] 175 | #pragma GCC diagnostic ignored "-Winfinite-recursion"a | ^~~~~~~~~~~~~~~~~~~~~~ Fixes: a6e8113fed8a508ffda13cf1c4d8da99a4e8133a ("Silence -Winfinite-recursion warning in luaD_throw()") Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13348
* Fix objtool: missing int3 after ret warningBrian Behlendorf2022-07-278-22/+34
| | | | | | | | | | | | | | Resolve straight-line speculation warnings reported by objtool for x86_64 assembly on Linux when CONFIG_SLS is set. See the following LWN article for the complete details. https://lwn.net/Articles/877845/ Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* ICP: Add missing stack frame info to SHA asm filesAttila Fülöp2022-07-272-0/+52
| | | | | | | | | | Since the assembly routines calculating SHA checksums don't use a standard stack layout, CFI directives are needed to unroll the stack. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes #11733
* Fix -Wuse-after-free warning in dbuf_destroy()Brian Behlendorf2022-07-271-3/+3
| | | | | | | | | | | | | | | | Move the use of the db pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no reason we can't invert the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_destroy': module/zfs/dbuf.c:2953:17: error: pointer 'db' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done()Brian Behlendorf2022-07-271-1/+2
| | | | | | | | | | | | | | | | Move the use of the private pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no harm in inverting the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done': module/zfs/dbuf.c:3204:17: error: pointer 'private' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* Fix -Wattribute-warning in dsl layerBrian Behlendorf2022-07-271-4/+2
| | | | | | | | | | | | | | | | | | | | The memcpy(), memmove(), and memset() functions have been annotated to perform bounds checking when using FORTIFY_SOURCE. A warning is now generted when writing beyond the end of the specified field. Alternately, the new struct_group() macro could be used to create an anonymous union member for use by memcpy(). However, since this is the only place the macro would be helpful it's preferable to restructure the code slights to avoid the need for additional compatibility code when the macro does not exist. https://lore.kernel.org/lkml/[email protected]/T/ Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* Fix -Wattribute-warning in edonrBrian Behlendorf2022-07-271-1/+1
| | | | | | | | | | | | | | | | | | | | The wrong union memory was being accessed in EdonRInit resulting in a write beyond size of field compiler warning. Reference the correct member to resolve the warning. The warning was correct and this in case the mistake was harmless. In function ‘fortify_memcpy_chk’, inlined from ‘EdonRInit’ at zfs/module/icp/algs/edonr/edonr.c:494:3: ./include/linux/fortify-string.h:344:25: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* Fix -Wattribute-warning in zfs_log_xvattr()Brian Behlendorf2022-07-271-34/+29
| | | | | | | | | | | | | Restructure the code in zfs_log_xvattr() to use a lr_attr_end structure when accessing lr_attr_t elements located after the variable sized array. This makes the code more understandable and resolves the accessing beyond the end of the field warnings. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* Silence -Winfinite-recursion warning in luaD_throw()Brian Behlendorf2022-07-271-0/+11
| | | | | | | | | | | | This code should be kept inline with the upstream lua version as much as possible. Therefore, we simply want to silence the warning. This check was enabled by default as part of -Wall in gcc 12.1. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #13528 Closes #13575
* libtpool: -Wno-clobberedнаб2022-07-271-1/+1
| | | | | | | | | | Also remove -Wno-unused-but-set-variable Upstream-bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61118 Reviewed-by: Alejandro Colomar <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #13110
* Remove sha1 hashing from OpenZFS, it's not used anywhere.Tino Reichardt2022-07-268-3625/+0
| | | | | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Attila Fülöp <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12895 Closes #12902 Signed-off-by: Rich Ercolani <[email protected]>
* Fix scrub resume from newly created hole.Alexander Motin2022-07-262-6/+21
| | | | | | | | | | | | | | | | It may happen that scan bookmark points to a block that was turned into a part of a big hole. In such case dsl_scan_visitbp() may skip it and dsl_scan_check_resume() will not be called for it. As result new scan suspend won't be possible until the end of the object, that may take hours if the object is a multi-terabyte ZVOL on a slow HDD pool, stretching TXG to all that time, creating all sorts of problems. This patch changes the resume condition to any greater or equal block, so even if we miss the bookmarked block, the next one we find will delete the bookmark, allowing new suspend. Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc.
* Avoid memory copy when verifying raidz/draid parityAlexander Motin2022-07-261-2/+3
| | | | | | | | | | | | | | | | | | Before this change for every valid parity column raidz_parity_verify() allocated new buffer and copied there existing data, then recalculated the parity and compared the result with the copy. This patch removes the memory copy, simply swapping original buffer pointers with newly allocated empty ones for parity recalculation and comparison. Original buffers with potentially incorrect parity data are then just freed, while new recalculated ones are used for repair. On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this change reduces memory traffic during scrub by 17% and total unhalted CPU time by 25%. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13613
* Avoid memory copies during mirror scrubAlexander Motin2022-07-261-70/+61
| | | | | | | | | | | | | | | | | | | | Issuing several scrub reads for a block we may use the parent ZIO buffer for one of child ZIOs. If that read complete successfully, then we won't need to copy the data explicitly. If block has only one copy (typical for root vdev, which is also a mirror inside), then we never need to copy -- succeed or fail as-is. Previous code also copied data from buffer of every successfully completed child ZIO, but that just does not make any sense. On healthy N-wide mirror this saves all N+1 (or even more in case of ditto blocks) memory copies for each scrubbed block, allowing CPU to focus mostly on check-summing. For other vdev types it should save one memory copy per block copy at root vdev. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13606
* Fix and disable blocks statistics during scrubAlexander Motin2022-07-262-33/+30
| | | | | | | | | | | | | | | | Block statistics calculation during scrub I/O issue in case of sorted scrub accounted ditto blocks several times. Embedded blocks on other side were not accounted at all. This change moves the accounting from issue to scan stage, that fixes both problems and also allows to avoid pool-wide locking and the lock contention it created. Since this statistics is quite specific and is not even exposed now anywhere, disable its calculation by default to not waste CPU time. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13579
* Avoid two 64-bit divisions per scanned blockAlexander Motin2022-07-261-4/+6
| | | | | | | | Change math to make it like the ARC, using multiplications instead. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13591
* Several B-tree optimizationsAlexander Motin2022-07-261-362/+411
| | | | | | | | | | | | | | | | | | | | | | | | | | - Introduce first element offset within a leaf. It allows to reduce by ~50% average memmove() size when adding/removing elements. If the added/removed element is in the first half of the leaf, we may shift elements before it and adjust the bth_first instead of moving more elements after it. - Use memcpy() instead of memmove() when we know there is no overlap. - Switch from uint64_t to uint32_t. It does not limit anything, but 32-bit arches should appreciate it greatly in hot paths. - Store leaf capacity in struct btree to avoid 64-bit divisions. - Adjust zfs_btree_insert_into_leaf() to always result in balanced leaves after splitting, no matter where the new element was inserted. Not that we care about it much, but it should also allow B-trees with as little as two elements per leaf instead of 4 previously. When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this reduces amount of time spent in memmove() inside the scan thread from 13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes. It should also reduce spacemaps load time, but I haven't measured it. Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13582
* Several sorted scrub optimizationsAlexander Motin2022-07-262-188/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Reduce size and comparison complexity of q_exts_by_size B-tree. Previous code used two 64-bit divisions and many other operations to compare two B-tree elements. It created enormous overhead. This implementation moves the math to the upper level and stores the score in the B-tree elements themselves. Since all that we need to store in that B-tree is the extent score and offset, those can fit into single 8 byte value instead of 24 bytes of q_exts_by_addr element and can be compared with single operation. - Better decouple secondary tree logic from main range_tree by moving rt_btree_ops and related functions into dsl_scan.c as ext_size_ops. Those functions are very small to worry about the code duplication and range_tree does not need to know details such as rt_btree_compare. - Instead of accounting number of pending bytes per pool, that needs atomic on global variable per block, account the number of non-empty per-vdev queues, that change much more rarely. - When extent scan is interrupted by TXG end, continue it in the next TXG instead of selecting next best extent. It allows to avoid leaving one truncated (and so likely not the best any more) extent each TXG. On top of some other optimizations this saves about 1.5 minutes out of 10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #13576