path: root/module/zfs
* Enable L2 cache of all (MRU+MFU) metadata but MFU data only
  shodanshok, 2024-08-27 (1 file changed, -3/+8)

  `l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU data and metadata. However, it can be useful to cache as much metadata as possible while, at the same time, restricting the data cache to MFU buffers only.

  This patch allows for such behavior by setting `l2arc_mfuonly` to 2 (or higher). The list of possible values is the following:
  0: cache both MRU and MFU for both data and metadata;
  1: cache only MFU for both data and metadata;
  2: cache both MRU and MFU for metadata, but only MFU for data.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Gionatan Danti <[email protected]>
  Closes #16343
  Closes #16402

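  A compilable sketch of the policy this tri-state encodes, with illustrative names (the real check lives in the L2ARC write path in arc.c, not in a standalone helper like this):

      #include <stdbool.h>

      /* Illustrative only -- not the actual arc.c code. */
      static bool
      l2arc_policy_allows(bool is_metadata, bool is_mfu, int l2arc_mfuonly)
      {
          if (l2arc_mfuonly == 0)
              return (true);      /* MRU+MFU, both data and metadata */
          if (l2arc_mfuonly == 1)
              return (is_mfu);    /* MFU only, both data and metadata */
          /* 2 or higher: all metadata, but data only from MFU */
          return (is_metadata || is_mfu);
      }
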
* Fix null ptr deref when renaming a zvol with snaps and snapdev=visible (#16316)
  Justin Gottula, 2024-08-22 (1 file changed, -0/+3)

  If a zvol is renamed, and it has one or more snapshots, and snapdev=visible is true for the zvol, then the rename causes a kernel null pointer dereference error. This has the effect (on Linux, anyway) of killing the z_zvol taskq kthread, with locks still held; which in turn causes a variety of zvol-related operations afterward to hang indefinitely (such as udev workers, among other things).

  The problem occurs because of an oversight in #15486 (e36ff84c338d2f7b15aef2538f6a9507115bbf4a). As documented in dataset_kstats_create, some datasets may not actually have kstats allocated for them; and at least at the present time, this is true for snapshots. In practical terms, this means that for snapshots, dk->dk_kstats will be NULL. The dataset_kstats_rename function introduced in the patch above does not first check whether dk->dk_kstats is NULL before proceeding, unlike e.g. the nearby dataset_kstats_update_* functions.

  In the very particular circumstance in which a zvol is renamed, AND that zvol has one or more snapshots, AND that zvol also has snapdev=visible, zvol_rename_minors_impl will loop over not just the zvol dataset itself, but each of the zvol's snapshots as well, so that their device nodes are renamed too. This results in dataset_kstats_rename being called for snapshots, where, as we've established, dk->dk_kstats is NULL.

  Fix this by simply adding a NULL check before doing anything in dataset_kstats_rename. This still allows the dataset_name kstat value for the zvol to be updated (as was the intent of the original patch), and merely blocks attempts by the code to act upon the zvol's non-kstat-having snapshots. If at some future time kstats are added for snapshots, then things should work as intended in that case as well.

  Signed-off-by: Justin Gottula <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Alan Somers <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>

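  The fix described above amounts to an early-return guard; a sketch, assuming the function's shape follows the nearby dataset_kstats_update_* helpers (see dataset_kstats.c for the real code):

      /* Sketch of the guard; not the verbatim upstream function. */
      void
      dataset_kstats_rename(dataset_kstats_t *dk, const char *name)
      {
          if (dk->dk_kstats == NULL)
              return;     /* e.g. snapshots: no kstats were allocated */

          /* ... update the dataset_name kstat value as before ... */
      }
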
* zfs: add bounds checking to zil_parse (#16308)
  c1ick, 2024-08-22 (1 file changed, -2/+19)

  Make sure log records don't stray beyond the valid memory region. There is a lack of verification of the space occupied by fixed members of lr_t in zil_parse. We can create a crafted image to trigger an out-of-bounds read by following these steps:
  1) Do some file operations and reboot to simulate abnormal exit without umount
  2) zil_chain.zc_nused: 0x1000
  3) First lr_t: lr_t.lrc_txtype: 0x0, lr_t.lrc_reclen: 0x1000-0xb8-0x1, lr_t.lrc_txg: 0x0, lr_t.lrc_seq: 0x1
  4) Update checksum in zil_chain.zc_eck

  Fix: add some checks to make sure the remaining bytes are large enough to hold a log record.

  Signed-off-by: XDTG <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>

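  A self-contained sketch of the kind of bounds check this adds; lr_hdr_t here is an illustrative stand-in for the fixed members of lr_t, not the real type:

      #include <stddef.h>
      #include <stdint.h>

      typedef struct {
          uint64_t lrc_txtype, lrc_reclen, lrc_txg, lrc_seq;
      } lr_hdr_t;  /* illustrative stand-in for lr_t */

      /* Return 1 only if a whole record fits between rec and end. */
      static int
      lr_in_bounds(const char *rec, const char *end)
      {
          const lr_hdr_t *lr = (const lr_hdr_t *)rec;

          if ((size_t)(end - rec) < sizeof (lr_hdr_t))
              return (0);  /* the fixed header itself is truncated */
          if (lr->lrc_reclen < sizeof (lr_hdr_t) ||
              lr->lrc_reclen > (size_t)(end - rec))
              return (0);  /* record strays past the log block */
          return (1);
      }
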
* Fix long_free_dirty accounting for small files (#16264)
  Chunwei Chen, 2024-07-23 (1 file changed, -0/+7)

  For files smaller than recordsize, it's most likely that they don't have L1 blocks. However, the current calculation will always return at least 1 L1 block. In this change, we check the dnode level to figure out if it has L1 blocks or not, and return 0 if it doesn't. This will reduce the chance of unnecessary throttling when deleting a large number of small files.

  Signed-off-by: Chunwei Chen <[email protected]>
  Co-authored-by: Chunwei Chen <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>

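  A sketch of the level check described above (dn_nlevels is the real dnode field; the surrounding accounting logic is simplified away):

      /* Illustrative: a single-level dnode addresses its data blocks
       * directly, so it has no L1 indirect blocks at all and should
       * not be charged a minimum of one L1 block. */
      static uint64_t
      l1_blocks_to_account(int dn_nlevels, uint64_t estimated_l1_blocks)
      {
          if (dn_nlevels <= 1)
              return (0);  /* no indirection levels exist */
          return (estimated_l1_blocks > 0 ? estimated_l1_blocks : 1);
      }
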
* vdev_open: clear async fault flag after reopen
  Rob Norris, 2024-07-17 (1 file changed, -0/+1)

  After c3f2f1aa2, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

  In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.

  The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Co-authored-by: Don Brady <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Don Brady <[email protected]>

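  The one-line shape of such a fix, as a sketch (vdev_fault_wanted is the field named above; the exact placement in the vdev_open()/vdev_reopen() path may differ):

      /* Sketch: drop any stale end-of-txg fault request on reopen;
       * a fresh probe will set it again if the device is still bad. */
      vd->vdev_fault_wanted = B_FALSE;
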
* Some improvements to metaslab eviction
  Alexander Motin, 2024-07-17 (2 files changed, -2/+8)

  - Add eviction of old metaslabs for the special and dedup metaslab classes. Those vdevs may be potentially big and fragmented with large metaslabs, while their asynchronous write pattern is not really different from the normal class. It seems an omission not to evict old metaslabs from them.
  - If we have metaslab preload enabled, which means we are not too low on memory, do not evict active metaslabs even if they have not been used for some time. Eviction of active metaslabs means we won't be able to write anything until we load them, which may take some time, the straight opposite of the metaslab preload goals. For small systems the memory saving should be less important after the recent reduction in the number of allocators, and so of open metaslabs.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16214

* Destroy ARC buffer in case of fill error
  Alexander Motin, 2024-07-17 (1 file changed, -0/+1)

  In case of error, dmu_buf_fill_done() returns the buffer back into the DB_UNCACHED state. Since during the transition from DB_UNCACHED into the DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15665
  Closes #15802
  Closes #16216

* head_errlog: fix use-after-free
  George Amanakis, 2024-07-15 (1 file changed, -2/+5)

  In the commit of the head_errlog feature we introduced a bug in dsl_dataset_promote_sync(): we may dereference origin_head and hds, both obtained through ddpa, after calling promote_sync() on ddpa.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #16272
  Closes #16273

* Fix assertion in Persistent L2ARC
  George Amanakis, 2024-05-29 (1 file changed, -1/+1)

  At the end of l2arc_evict(), fix an assertion in the case that l2ad_hand + distance == l2ad_end.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #16202
  Closes #16207

* ZAP: Fix leaf references on zap_expand_leaf() errors
  Alexander Motin, 2024-05-29 (1 file changed, -13/+14)

  Depending on the kind of error, zap_expand_leaf() may return with or without a valid leaf reference held. Make sure it returns NULL if, due to an error, it has no leaf to return. Make its callers check the returned leaf pointer, and release the leaf if it is not NULL.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #12366
  Closes #16159

* Fix ZIL clone records for legacy holes
  Alexander Motin, 2024-05-29 (1 file changed, -5/+3)

  Previous code overengineered the cloned range calculation by using BP_GET_LSIZE(). The problem is that legacy holes don't have the logical size, so the result will be wrong. But we also don't need to look at every block size, since they all must be identical.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16165

* Fix scn_queue races on very old pools
  Alexander Motin, 2024-05-29 (1 file changed, -0/+6)

  Code for pools before version 11 uses dmu_objset_find_dp() to scan for children datasets/clones. It calls the enqueue_clones_cb() and enqueue_cb() callbacks in parallel from multiple taskq threads. This ends up badly for scan_ds_queue_insert(), corrupting the scn_queue AVL-tree. Fix it by introducing a mutex to protect those two scan_ds_queue_insert() calls. All other calls are done from the sync thread and so are serialized.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16162

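  A sketch of the added serialization, assuming a hypothetical scn_queue_lock mutex (the real lock name in dsl_scan.c may differ):

      /* These callbacks run concurrently on multiple taskq threads,
       * so the AVL-tree insert must be serialized. */
      mutex_enter(&scn->scn_queue_lock);
      scan_ds_queue_insert(scn, ds->ds_object, txg);
      mutex_exit(&scn->scn_queue_lock);
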
* Slightly improve dnode hash
  Alexander Motin, 2024-05-29 (1 file changed, -3/+3)

  As I understand it, just for being less predictable, the dnode hash includes 8 bits of the objset pointer, starting at bit 6. But since objset_t is more than 1KB in size, its allocations are likely aligned to 2KB, which means the 11 lower bits provide no entropy. Just take the 8 bits starting from bit 11.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16131

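  An illustrative, compilable rendition of the bit selection (the real hash function in dnode.c mixes these pointer bits with the object number):

      #include <stdint.h>

      /* Illustrative only: 2KB-aligned objset_t allocations leave
       * pointer bits 0..10 zero, so harvest 8 bits starting at 11. */
      static uint64_t
      objset_ptr_entropy(const void *os)
      {
          return (((uintptr_t)os >> 11) & 0xff);
      }
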
* Make more taskq parameters writable
  Alexander Motin, 2024-05-29 (1 file changed, -4/+4)

  There is no reason for these module parameters to be read-only. Once modified, they simply take effect on the next pool import/creation, which is useful for testing different values.

  Reviewed-by: Rich Ercolani <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16118

* L2ARC: Cleanup buffer re-compression
  Alexander Motin, 2024-05-29 (1 file changed, -39/+20)

  When compressed ARC is disabled, we may have to re-compress when writing into L2ARC. If, in doing so, we can't fit the data into the original physical size, we should just fail immediately, since even if it might still fit into the allocation size, its checksum will never match. While there, refactor the code similarly to other compression call sites, without using abd_return_buf_copy().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16038

* Refactor dbuf_read() for safer decryption
  Alexander Motin, 2024-05-29 (1 file changed, -110/+104)

  In dbuf_read_verify_dnode_crypt():
  - We don't need the original dbuf locked there. Instead take a lock on the dnode dbuf, which is actually manipulated.
  - Block decryption for a dnode dbuf if it is currently being written. The ARC hash lock does not protect anonymous buffers, so arc_untransform() is unsafe when used on buffers being written; that may happen in the case of encrypted dnode buffers, since they are not copied by dbuf_dirty()/dbuf_hold_copy().

  In dbuf_read():
  - If the buffer is in flight, recheck its compression/encryption status after it is cached, since it may need arc_untransform().

  Tested-by: Rich Ercolani <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16104

* Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN.
  chenqiuhao1997, 2024-05-13 (5 files changed, -11/+13)

  In P2ALIGN, the result would be incorrect when align is an unsigned integer and x is larger than the max value of the type of align. In that case, -(align) would be a positive integer, which means the high bits would be zero and would finally stay zero after '&' when align is converted to a larger integer type.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Youzhong Yang <[email protected]>
  Signed-off-by: Qiuhao Chen <[email protected]>
  Closes #15940

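  A small, runnable demonstration of the failure mode (the two macro definitions follow the classic sysmacros.h forms):

      #include <stdint.h>
      #include <stdio.h>

      #define P2ALIGN(x, align)             ((x) & -(align))
      #define P2ALIGN_TYPED(x, align, type) ((type)(x) & -(type)(align))

      int
      main(void)
      {
          uint64_t x = 0x100000064ULL;  /* larger than UINT32_MAX */
          uint32_t align = 64;

          /* -(align) is computed in 32 bits, so the mask's high 32
           * bits become zero once widened, truncating x with it. */
          printf("P2ALIGN:       0x%llx\n",
              (unsigned long long)P2ALIGN(x, align));  /* 0x40: wrong */
          printf("P2ALIGN_TYPED: 0x%llx\n",
              (unsigned long long)P2ALIGN_TYPED(x, align, uint64_t));
          /* prints 0x100000040: high bits preserved */
          return (0);
      }
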
* Add prefetch property
  Brian Behlendorf, 2024-04-30 (2 files changed, -1/+25)

  ZFS prefetch is currently governed by the zfs_prefetch_disable tunable. However, this is a module-wide setting: if a specific dataset benefits from prefetch while others have issues with it, an optimal solution does not exist. This commit introduces the "prefetch" tri-state property, which enables granular control (at the dataset/volume level) of prefetching.

  This patch does not remove zfs_prefetch_disable, which remains a system-wide switch for enabling/disabling prefetch. However, to avoid duplication, it would be preferable to deprecate and then remove the module tunable.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Signed-off-by: Gionatan Danti <[email protected]>
  Co-authored-by: Gionatan Danti <[email protected]>
  Closes #15237
  Closes #15436

* vdev probe to slow disk can stall mmp write checker
  Don Brady, 2024-04-30 (8 files changed, -37/+126)

  Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Olaf Faaland <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #15839

* Extend import_progress kstat with a notes field
  Don Brady, 2024-04-29 (3 files changed, -9/+118)

  Detail the import progress of log spacemaps, as they can take a very long time. Also grab the spa_note() messages too, as they provide insight into what is happening.

  Sponsored-By: OpenDrives Inc.
  Sponsored-By: Klara Inc.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Co-authored-by: Allan Jude <[email protected]>
  Closes #15539

* Add ashift validation when adding devices to a pool
  George Wilson, 2024-04-29 (2 files changed, -4/+14)

  Currently, zpool add allows users to add top-level vdevs that have different ashifts, but doing so prevents users from being able to perform a top-level vdev removal. Oftentimes consumers may not realize that they have mismatched ashifts until the top-level removal fails.

  This feature adds ashift validation to the zpool add command and will fail the operation if the sector size of the specified vdev does not match the existing pool. This behavior can be disabled by using the -f flag. In addition, new flags have been added to provide fine-grained control to disable specific checks. These flags are:
  --allow-in-use
  --allow-ashift-mismatch
  --allow-replication-mismatch
  The force flag will disable all of these checks.

  Reviewed by: Brian Behlendorf <[email protected]>
  Reviewed by: Alexander Motin <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  Closes #15509

* Use ASSERT0P() to check that a pointer is NULL.
  Dag-Erling Smørgrav, 2024-04-29 (1 file changed, -1/+1)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Kay Pedersen <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Dag-Erling Smørgrav <[email protected]>
  Closes #15225

* GCC: Fixes for gcc 14 on Fedora 40
  Tony Hutter, 2024-04-29 (1 file changed, -2/+3)

  - Workaround dangling pointer in uu_list.c (#16124)
  - Fix calloc() transposed arguments in zpool_vdev_os.c
  - Make some temp variables unsigned to prevent triggering a '-Werror=alloc-size-larger-than' error.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tony Hutter <[email protected]>
  Closes #16124
  Closes #16125

* Fix panics when truncating/deleting files
  Pavel Snajdr, 2024-04-29 (1 file changed, -10/+8)

  There's a union in dbuf_dirty_record_t; dr_brtwrite could evaluate to B_TRUE if the dirty record is of another type than dl. Add a more explicit dr type check before trying to access dr_brtwrite.

  Fixes two similar panics:

  [ 1373.806119] VERIFY0(db->db_level) failed (0 == 1)
  [ 1373.807232] PANIC at dbuf.c:2549:dbuf_undirty()
  [ 1373.814979] dump_stack_lvl+0x71/0x90
  [ 1373.815799] spl_panic+0xd3/0x100 [spl]
  [ 1373.827709] dbuf_undirty+0x62a/0x970 [zfs]
  [ 1373.829204] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs]
  [ 1373.831010] dnode_free_range+0x532/0x1220 [zfs]
  [ 1373.833922] dmu_free_long_range+0x4e0/0x930 [zfs]
  [ 1373.835277] zfs_trunc+0x75/0x1e0 [zfs]
  [ 1373.837958] zfs_freesp+0x9b/0x470 [zfs]
  [ 1373.847236] zfs_setattr+0x161a/0x3500 [zfs]
  [ 1373.855267] zpl_setattr+0x125/0x320 [zfs]
  [ 1373.856725] notify_change+0x1ee/0x4a0
  [ 1373.859207] do_truncate+0x7f/0xd0
  [ 1373.859968] do_sys_ftruncate+0x28e/0x2e0
  [ 1373.860962] do_syscall_64+0x38/0x90
  [ 1373.861751] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

  [ 1822.381337] VERIFY0(db->db_level) failed (0 == 1)
  [ 1822.382376] PANIC at dbuf.c:2549:dbuf_undirty()
  [ 1822.389232] dump_stack_lvl+0x71/0x90
  [ 1822.389920] spl_panic+0xd3/0x100 [spl]
  [ 1822.399567] dbuf_undirty+0x62a/0x970 [zfs]
  [ 1822.400583] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs]
  [ 1822.401752] dnode_free_range+0x532/0x1220 [zfs]
  [ 1822.402841] dmu_object_free+0x74/0x120 [zfs]
  [ 1822.403869] zfs_znode_delete+0x75/0x120 [zfs]
  [ 1822.404906] zfs_rmnode+0x3f6/0x7f0 [zfs]
  [ 1822.405870] zfs_inactive+0xa3/0x610 [zfs]
  [ 1822.407803] zpl_evict_inode+0x3e/0x90 [zfs]
  [ 1822.408831] evict+0xc1/0x1c0
  [ 1822.409387] do_unlinkat+0x147/0x300
  [ 1822.410060] __x64_sys_unlinkat+0x33/0x60
  [ 1822.410802] do_syscall_64+0x38/0x90
  [ 1822.411458] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Pavel Snajdr <[email protected]>
  Closes #15983

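  The shape of the added guard, as a sketch (dr_brtwrite lives in the dt.dl arm of the dirty record union, which is only meaningful for level-0 records; the exact expression in dbuf.c may differ):

      /* Only level-0 dirty records carry the dl union member, so
       * check the level before reading dr_brtwrite from it. */
      boolean_t brtwrite = (db->db_level == 0 && dr->dt.dl.dr_brtwrite);
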
* Add slow disk diagnosis to ZED
  Don Brady, 2024-04-29 (3 files changed, -0/+60)

  Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, as is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED, which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T
  If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade().

  Sponsored-By: OpenDrives Inc.
  Sponsored-By: Klara Inc.
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Rob Wing <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #15469

* L2ARC: Relax locking during write
  Alexander Motin, 2024-04-19 (5 files changed, -98/+127)

  Previous code held the ARC state sublist lock throughout the whole L2ARC write process, which included a number of allocations and even ZIO issues. Blocked in any of those places, the code could also block ARC eviction, which could cause OOM activation or even deadlock if the system is low on memory or memory is too fragmented. Fix it by dropping the lock as soon as we see a block eligible for L2ARC writing and picking it up later using an earlier inserted marker.

  While there, also reduce the scope of the hash lock, moving ZIO allocation and other operations not requiring header access out of it. All operations requiring header access move under the hash lock, since the L2_WRITING flag does not prevent header eviction, only the transition to the arc_l2c_only state with an L1 header.

  To be able to manipulate the sublist lock and marker as needed, add a few more multilist functions and modify one.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16040

* Small fix to prefetch ranges aggregation
  Alexander Motin, 2024-04-19 (1 file changed, -2/+2)

  When, after #16022, adding a new range aggregates more than two existing ranges (which should be very rare, only if several streams overlap), we may need to zero not the last range, but an earlier one.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16072

* Remove db_state DB_NOFILL checks from syncing context
  Alexander Motin, 2024-04-19 (1 file changed, -25/+19)

  Syncing context should not depend on the current state of the dbuf, which could already have changed several times in later transaction groups, but rely solely on the dirty record for the transaction group being synced. Some of the checks seem already impossible, while for others I think we should better check for the absence of data in the specific dirty record rather than for DB_NOFILL.

  Reviewed-by: Robert Evans <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16057

* Speculative prefetch for reordered requests
  Alexander Motin, 2024-04-19 (2 files changed, -57/+240)

  Before this change the speculative prefetcher was able to detect a stream only if all of its accesses were perfectly sequential. That was easy to implement and is perfectly fine for single-threaded applications. Unfortunately, multi-threaded network servers, such as iSCSI, SMB or NFS, usually have plenty of threads and may often reorder requests, preventing successful speculation and prefetch.

  This change allows the speculative prefetcher to detect streams even if requests are reordered, by introducing a list of 9 non-contiguous ranges up to 16MB ahead of the current stream position and filling the gaps as more requests arrive. It also allows a stream to proceed even with holes up to a certain configurable threshold (25%).

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16022

* Fix read errors race after block cloning
  Alexander Motin, 2024-04-19 (1 file changed, -21/+20)

  Investigating read errors triggering the panic fixed in #16042, I've found that we have a race in the sync process between the moment the dirty record for a cloned block is removed and the moment the dbuf is destroyed. If dmu_buf_hold_array_by_dnode() takes a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in the DB_NOFILL state, but without the dirty record. Such a case is not an error, but equivalent to DB_UNCACHED, since the dbuf block pointer is already updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since another cloning may already be in progress that dropped the dbuf lock before creating a new dirty record, protected only by the range lock.

  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Robert Evans <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16052

* Improve dbuf_read() error reporting
  Alexander Motin, 2024-04-19 (1 file changed, -18/+20)

  Previous code reported non-ZIO errors only via the return value, but not via the parent ZIO. It could cause NULL-dereference panics due to dmu_buf_hold_array_by_dnode() ignoring the return value, relying solely on the parent ZIO status.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Reported by: Ameer Hamza <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16042

* BRT: Fix holes cloning.
  Alexander Motin, 2024-04-19 (1 file changed, -13/+13)

  - When reading L0 block pointers, handle buffers without them and without dirty records as holes. Those appear when the dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there are actual writes.
  - When cloning blocks, set the destination hole's logical birth time to the current TXG. Otherwise, if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible so as not to replicate its logic.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15994
  Closes #16007

* BRT: Skip getting length in brt_entry_lookup()
  Alexander Motin, 2024-04-19 (1 file changed, -16/+2)

  Unlike DDT, where ZAP values may have different lengths due to compression, all BRT entries are identical 8-byte counters. It does not make sense to first fetch the length only to assert it. zap_lookup_uint64() is specifically designed to work with counters of different sizes and should return an error if something odd is found. Calling it directly saves some measurable CPU time.

  Reviewed-by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15950

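  A sketch of the direct lookup this describes; zap_lookup_uint64() is the real DMU interface, while the brt_zap_object and offset names here are illustrative assumptions:

      uint64_t refcnt = 0;

      /* One 64-bit key (the block offset), one 8-byte counter value.
       * ZAP itself errors out if the stored entry has an odd size. */
      error = zap_lookup_uint64(os, brt_zap_object,
          &offset, 1,             /* key: one uint64_t */
          sizeof (uint64_t), 1,   /* value: one 8-byte integer */
          &refcnt);
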
* BRT: Make BRT block sizes configurable
  Alexander Motin, 2024-04-19 (1 file changed, -11/+11)

  Similar to DDT, make BRT data and indirect block sizes configurable via module parameters. I am not sure yet what would be the best values, but, similar to DDT, 4KB blocks kill all chances of compression on a vdev with ashift=12 or more, which in my tests reaches 3x. While here, fix the documentation for the respective DDT parameters.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15967

* BRT: Relax brt_pending_apply() locking
  Alexander Motin, 2024-04-19 (1 file changed, -11/+5)

  Since brt_pending_apply() is running in syncing context, no other brt_pending_tree accesses are possible for the TXG. We don't need to acquire brt_pending_lock here.

  Reviewed-by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15955

* ZAP: Massively switch to _by_dnode() interfaces
  Alexander Motin, 2024-04-19 (6 files changed, -170/+191)

  Before this change ZAP called dnode_hold() for almost every block access, which was clearly visible in the profiler under heavy load, such as BRT. This patch makes it always hold the dnode reference between zap_lockdir() and zap_unlockdir(), which avoids most of the dnode operations in between. It also adds several new _by_dnode() APIs to ZAP and uses them in the BRT code. Also add a dmu_prefetch_by_dnode() variant and use it in the ZAP code.

  After this there remains only one call to dmu_buf_dnode_enter(), which seems to be unneeded. So remove the call and the functions.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15951

* BRT: Skip duplicate BRT prefetches
  Alexander Motin, 2024-04-19 (1 file changed, -3/+3)

  If there is a pending entry for this block, then we've already issued a BRT prefetch for it within this TXG, so don't do it again. The BRT vdev lookup and the following zap_prefetch_uint64() call can be pretty expensive and should be avoided when not necessary.

  Reviewed-by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15941

* ZAP: Some cleanups/micro-optimizations
  Alexander Motin, 2024-04-19 (1 file changed, -43/+34)

  - Remove custom zap_memset(), use regular memset().
  - Use PANIC() instead of opaque cmn_err(CE_PANIC).
  - Provide entry parameter to zap_leaf_rehash_entry().
  - Reduce branching in zap_leaf_array_create() inner loop.
  - Remove signedness where it should not be.

  Should be no function changes.

  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15976

* BRT: Change brt_pending_tree sorting order
  Alexander Motin, 2024-04-19 (1 file changed, -6/+7)

  It does not look important how exactly brt_pending_tree is sorted. When cloning a large file, it is quite likely that all of its blocks have identical physical birth times, so comparing them first does not provide useful entropy, while accessing an additional cache line. In most cases the combination of vdev and offset provides a unique result, and the physical birth time comparison is not even needed. Meanwhile, when traversing the tree inside brt_pending_apply(), it can be beneficial for dbuf cache and CPU cache hits to group processing by vdev, and so by the per-VDEV BRT ZAPs.

  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15954

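  An illustrative comparator in the new order; TREE_CMP, DVA_GET_VDEV and DVA_GET_OFFSET are real OpenZFS macros, while brt_pending_entry_t's field names here are assumptions:

      static int
      brt_pending_entry_compare(const void *x1, const void *x2)
      {
          const brt_pending_entry_t *a = x1, *b = x2;
          int cmp;

          /* vdev+offset is usually unique and groups work per-vdev. */
          cmp = TREE_CMP(DVA_GET_VDEV(&a->bpe_dva),
              DVA_GET_VDEV(&b->bpe_dva));
          if (cmp == 0)
              cmp = TREE_CMP(DVA_GET_OFFSET(&a->bpe_dva),
                  DVA_GET_OFFSET(&b->bpe_dva));
          if (cmp == 0)   /* birth time only as the last tiebreaker */
              cmp = TREE_CMP(a->bpe_birth, b->bpe_birth);
          return (cmp);
      }
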
* Update resume token at object receive.
  Alexander Motin, 2024-04-19 (1 file changed, -0/+10)

  Before this change the resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curious case, where the replication source was traversed with a recursive grep, which, through enabled atime, modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write, and so without a single resume point.

  While the resume token was not designed to resume from an object, I've found that the send implementation always sends the object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it.

  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15927

* Refactor dmu_prefetch().
  Alexander Motin, 2024-04-19 (4 files changed, -49/+68)

  - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read code where len = 0 means dnode prefetch instead of indirect/data prefetch. One function doing both has no benefits, since the code paths are independent.
  - Improve dmu_prefetch() handling of long block ranges. Instead of limiting the L0 data length to prefetch to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level and, if there is more, prefetch all the rest at a higher indirection level. It should improve random access times within the prefetched range of any length, reducing the importance of the specific dmu_prefetch_max value.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15076

* ZIL: Improve next log block size prediction
  Alexander Motin, 2024-04-19 (1 file changed, -71/+196)

  Track history in the context of bursts, not individual log blocks. This allows a single large burst of many blocks not to blow away all the history, and at the same time allows optimizations covering multiple blocks in a burst and even the predicted following burst. For each burst, account its optimal block size and minimal first block size. Use those statistics from the last 8 bursts to predict the first block size of the next burst.

  Remove the predefined set of block sizes. Allocate any size we see fit, multiple of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise the space allocator any more.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15635

* ZIO: Optimize zio_flush()
  Alexander Motin, 2024-04-19 (2 files changed, -22/+16)

  - Generalize vdev_nowritecache handling by traversing the VDEV tree and skipping children ZIOs where not supported.
  - Remove the intermediate zio_null() in the case of several VDEV children.
  - Remove children handling from zio_ioctl(). There are no other use cases for this code besides DKIOCFLUSHWRITECACHED, and were there any, I doubt they would apply so straightforwardly to all VDEV children.

  Compared to the removed previous optimization, this should improve cases of redundant ZILs/SLOGs.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15515

* ZIL: Detect single-threaded workloads
  Alexander Motin, 2024-04-19 (1 file changed, -51/+40)

  ... by checking that the previous block is fully written and flushed. This allows skipping commit delays, since we can give up on aggregation in that case. It removes the zil_min_commit_timeout parameter, since for single-threaded workloads it is not needed at all, while on very fast devices even some multi-threaded workloads may get detected as single-threaded and still bypass the wait. To give multi-threaded workloads more aggregation chances, increase zfs_commit_timeout_pct from 5 to 10%, as they should suffer less from the additional latency.

  Also, single-threaded workload detection allows, in perspective, better prediction of the next block size.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Prakash Surya <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15381

* Fix corruption caused by mmap flushing problems
  Robert Evans, 2024-03-29 (1 file changed, -1/+5)

  1) Make mmap flushes synchronous. Linux may skip flushing dirty pages already in writeback unless data-integrity sync is requested.
  2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be skipped due to DMU pushing back on TX assign.
  3) Add missing mmap flush when doing block cloning.
  4) While here, pass errors from putpage to writepage/writepages.

  This change fixes corruption edge cases, but unfortunately adds synchronous ZIL flushes for dirty mmap pages to llseek and bclone operations. It may be possible to avoid these sync writes later, but that would need more tricky refactoring of the writeback code.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Robert Evans <[email protected]>
  Closes #15933
  Closes #16019

* abd: add page iterator
  Rob Norris, 2024-03-28 (1 file changed, -0/+42)

  The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds abd_iterate_page_func, which yields unmapped struct page instead.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Closes #15533
  Closes #15588
  (cherry picked from commit 390b448726c580999dd337be7a40b0e95cf1d50b)

* dmu: Allow buffer fills to fail
  Alexander Motin, 2024-02-20 (3 files changed, -22/+34)

  When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to a page fault or something else, the buffer ends up corrupted, neither keeping the old content nor getting the new one.

  On FreeBSD this is additionally complicated by page faults being blocked by the VFS layer, always returning EFAULT on an attempt to write from an mmap()'ed but not yet cached address range. Normally it is not a big problem, since after the original failure VFS will retry the write after reading the required data. The problem becomes worse in the specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, which may work by chance.

  This patch provides the fill status to dmu_buf_fill_done(), which in case of error can destroy the corrupted buffer as if no write happened. One more complication in the case of block cloning is that if an error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow, in case of error, restoring the buffer to a state after the cloning, not before it, which would happen if we just called dbuf_undirty().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #15665

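  A sketch of the failure path this enables, assuming canfail/failed flags land on dmu_buf_will_fill()/dmu_buf_fill_done() (the parameters are hedged assumptions; check dmu.h for the exact signatures):

      /* Declare that the fill may fail (e.g. a copyin page fault). */
      dmu_buf_will_fill(db, tx, B_TRUE /* canfail, assumed parameter */);
      err = uiomove(bufp, tocpy, UIO_WRITE, uio);
      /* On failure, the corrupted buffer is destroyed as if no
       * write ever happened. */
      dmu_buf_fill_done(db, tx, err != 0 /* failed, assumed parameter */);
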
* BRT: Fix slop space calculation with block cloning
  Bi11, 2024-02-12 (1 file changed, -1/+2)

  Similar to deduplication, the size of data duplicated by block cloning should not be included in the slop space calculation.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Yuxin Wang <[email protected]>
  Closes #15874

* BRT: Fix FICLONE/FICLONERANGE shortened copy
  Tony Hutter, 2024-02-06 (1 file changed, -5/+38)

  On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges, there's no way to return the length cloned to the caller, so we need to make this all or nothing.

  As part of this change, support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set, zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear, zfs_clone_range() will block and wait for the records to be written out, allowing the blocks to be cloned.

  Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide atomic semantics, since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state.

  A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly.

  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #15728
  Closes #15842

* Don't assert mg_initialized due to device addition race
  Paul Dagnelie, 2024-01-29 (1 file changed, -3/+0)

  During device removal stress tests, we noticed that we were tripping the assertion that mg_initialized was true. After investigation, it was determined that the mg in question was the embedded log metaslab group for a newly added vdev; the normal mg had been initialized (by metaslab_sync_reassess, via vdev_sync_done). However, because the spa config alloc lock is not held as writer across both calls to metaslab_sync_reassess, it is possible for an allocation to happen between the two metaslab_groups being initialized. Because the metaslab code doesn't check the group in question, just the vdev's main mg, it is possible to get past the initial check in vdev_allocatable and later fail due to the assertion.

  We simply remove the assertions. We could also consider locking the ALLOC lock around the reassess calls in vdev_sync_done, but that risks deadlocks. We could check the actual target mg in vdev_allocatable, but that risks racing with a passivation that comes in after that check but before the assertion. We still won't be able to actually allocate from the metaslab group if no metaslabs are ready, so this change shouldn't break anything.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #15818