path: root/module
Commit message (Author, Date, Files, Lines -/+)
* vdev_disk: don't touch vbio after it's handed off to the kernel (Rob Norris, 2024-04-08, 1 file, -5/+6)

After IO is unplugged, it may complete immediately and vbio_completion be called in interrupt context. That may interrupt or deschedule our task. If it's the last bio, the vbio will be freed. Then, we get rescheduled, and try to write to freed memory through vbio->.

This patch just removes the cleanup, and the corresponding assert. These were leftovers from a previous iteration of vbio_submit() and were always "belt and suspenders" ops anyway, never strictly required.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reported-by: Rich Ercolani <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
(cherry picked from commit 917ff75e9510d19968ef3cc5c80b1cd0ef48f84d)
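The rule this fix enforces, as a minimal sketch (the vbio_t field name is a hypothetical stand-in, not the actual ZFS structure layout):

    /* Take everything we need from the vbio before handoff; once
     * submit_bio() is called, vbio_completion() may run immediately
     * in interrupt context and free the vbio under us. */
    static void
    vbio_submit_sketch(vbio_t *vbio)
    {
        struct bio *bio = vbio->v_bio;  /* hypothetical field */

        vbio = NULL;        /* no vbio-> access past this point */
        submit_bio(bio);
    }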
* Fix corruption caused by mmap flushing problems (Robert Evans, 2024-03-29, 3 files, -9/+10)

1) Make mmap flushes synchronous. Linux may skip flushing dirty pages already in writeback unless data-integrity sync is requested.

2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be skipped due to DMU pushing back on TX assign.

3) Add missing mmap flush when doing block cloning.

4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds synchronous ZIL flushes for dirty mmap pages to llseek and bclone operations. It may be possible to avoid these sync writes later but would need more tricky refactoring of the writeback code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes #15933
Closes #16019
* vdev_disk: default to classic submission for 2.2.x (Rob Norris, 2024-03-28, 1 file, -3/+5)

We don't want to change to brand-new code in the middle of a stable series, but we want it available to test for people running into page splitting issues. This commit makes zfs_vdev_disk_classic=1 the default, and updates the documentation to better explain what's going on.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
* abd_iter_page: don't use compound heads on Linux <4.5 (Rob Norris, 2024-03-28, 1 file, -0/+14)

Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages in a compound page were refcounted separately. This means that using the head page without taking a reference to it could see it cleaned up later before we're finished with it. Specifically, bio_add_page() would take a reference, and drop its reference after the bio completion callback returns.

If the zio is executed immediately from the completion callback, this is usually ok, as any data is referenced through the tail page referenced by the ABD, and so becomes "live" that way. If there's a delay in zio execution (high load, error injection), then the head page can be freed, along with any dirty flags or other indicators that the underlying memory is used. Later, when the zio completes and that memory is accessed, it's either unmapped and an unhandled fault takes down the entire system, or it is mapped and we end up messing around in someone else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head page when we use it, and release it when we're done. There's not really a sensible way under our current structure to do this; the "best" would be to keep a list of head page references in the ABD, and release them when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where head and tail pages share refcounts, I've opted to simply not use the compound head in ABD page iteration there. This is theoretically less efficient (though cleaning up head page references would add overhead), but it's safe, and we still get the other benefits of not mapping pages before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we can match on to discover this behaviour in configure (and the mm/page APIs have changed a lot since then anyway), so I've gone with a simple version check.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit c6be6ce1755a3d9a3cbe70256cd8958ef83d8542)
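The version gate described boils down to something like this sketch (the helper is hypothetical; compound_head() and the version macros are standard kernel API):

    #include <linux/version.h>
    #include <linux/mm.h>

    static struct page *
    abd_page_for_bio_sketch(struct page *page)
    {
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 5, 0)
        /* Head and tail pages share a refcount here, so the head
         * can't be freed while we hold a tail reference. */
        return (compound_head(page));
    #else
        /* A separately-refcounted head may be freed under us. */
        return (page);
    #endif
    }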
* vdev_disk: use bio_chain() to submit multiple BIOs (Rob Norris, 2024-03-28, 1 file, -151/+80)

Simplifies our code a lot, so we don't have to wait for each one and reassemble them.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 72fd834c47558cb10d847948d1a4615e894c77c3)
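For reference, the general bio_chain() pattern looks like this sketch (not the vdev_disk code itself; bio_alloc() is shown in its modern four-argument form, and the pages/lens/offs arrays and write_done callback are illustrative scaffolding):

    struct bio *bio = bio_alloc(bdev, nr_vecs, REQ_OP_WRITE, GFP_NOIO);

    for (i = 0; i < npages; i++) {
        if (bio_add_page(bio, pages[i], lens[i], offs[i]) != lens[i]) {
            /* BIO full: chain it to a fresh one and send it off. */
            struct bio *next =
                bio_alloc(bdev, nr_vecs, REQ_OP_WRITE, GFP_NOIO);
            bio_chain(bio, next);
            submit_bio(bio);
            bio = next;
            i--;    /* retry this page on the new BIO */
        }
    }
    /* Only the last BIO carries a completion; bio_chain() arranges
     * for it to fire after every earlier BIO has finished. */
    bio->bi_end_io = write_done;    /* hypothetical callback */
    submit_bio(bio);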
* vdev_disk: add module parameter to select BIO submission method (Rob Norris, 2024-03-28, 1 file, -2/+29)

This makes the submission method selectable at module load time via the `zfs_vdev_disk_classic` parameter, allowing this change to be backported to 2.2 safely, and disabled in favour of the "classic" submission method if new problems come up.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit df2169d141aadc0c2cc728c5c5261d6f5c2a27f7)
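For illustration, a Linux module parameter of this shape is conventionally declared along these lines (a sketch; ZFS wraps this in its own macros, and the default shown just reflects the new method being on by default at this point in the series):

    #include <linux/module.h>

    static int zfs_vdev_disk_classic = 0;
    module_param(zfs_vdev_disk_classic, int, 0644);
    MODULE_PARM_DESC(zfs_vdev_disk_classic,
        "Use classic BIO submission method (0=new, 1=classic)");

It can then be selected at load time, eg `modprobe zfs zfs_vdev_disk_classic=1`.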
* vdev_disk: rewrite BIO filling machinery to avoid split pages (Rob Norris, 2024-03-28, 1 file, -2/+437)

This commit tackles a number of issues in the way BIOs (`struct bio`) are constructed for submission to the Linux block layer.

The kernel has a hard upper limit on the number of pages/segments that can be added to a BIO, as well as a separate limit for each device (related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD (`abd_nr_pages_off()`), and then uses that as the number of segments to put into the BIO, up to the hard upper limit. If it requires more than the limit, it will create multiple BIOs.

Leaving aside the fact that the page count method is wrong (see below), not limiting to the device segment max means that the device driver will need to split the BIO in half. This alone is not necessarily a problem, but it interacts with another issue to cause a much larger problem.

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a `struct page` pointer, and offset+len within it. `struct page` can represent a run of contiguous memory pages (known as a "compound page"). It can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO (`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never consider a page to be more than `PAGE_SIZE` (4K), even if the `struct page` is for multiple pages. In this case, it will load the same `struct page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being entirely filled much earlier than it could have been. This also further contributes to the problem caused by the incorrect segment limit calculation, as it's much easier to go past the device limit, and so require a split. Again, this is not a problem on its own.

The logic for "never submit more than `PAGE_SIZE`" is actually a little more subtle. It will actually never submit a buffer that crosses a 4K page boundary. In practice, this is fine, as most ABDs are scattered, that is a list of complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they are frequently not aligned to page boundaries. For example, a 12K allocation can span four pages, eg:

    -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |      |
    :## ######## ######## ######:     [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were filled and needed to be split, each half would still have either a start or end aligned to the logical block size of the device (assuming 4K at least).

---

In ideal circumstances, these shortcomings don't cause any particular problems. It's when they start to interact with other ZFS features that things get interesting.

Aggregation will create a "gang" ABD, which is simply a list of other ABDs. Iterating over a gang ABD is just iterating over each ABD within it in turn. Because the segments are simply loaded in order, we can end up with uneven segments either side of the "gap" between the two ABDs. For example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up either having a start or end offset that is not aligned to the logical block size, which some drivers (eg SCSI) will reject. However, this tends not to happen because the default aggregation limit usually keeps the BIO small enough to not require more than one split, and most pages are actually full 4K pages, so hitting an uneven gap is very rare anyway.

If the pool is under particular memory pressure, then an IO can be broken down into a "gang block", a 512-byte block composed of a header and up to three block pointers. Each points to a fragment of the original write, or in turn, another gang block, breaking the original data up over and over until space can be found in the pool for each of them. Each gang header is a separate 512-byte memory allocation from a slab, that needs to be written down to disk. When the gang header is added to the BIO, it's a single 512-byte segment.

Pulling all this together, consider a large aggregated write of gang blocks. This results in a BIO containing lots of 512-byte segments. Given our tendency to overfill the BIO, a split is likely, and most possible split points will yield a pair of BIOs that are misaligned. Drivers that care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk` to sort all this out.

`vdev_bio_max_segs()` now returns the ideal maximum size for the device, if available. There's also a tuneable `zfs_vdev_disk_max_segs` to override this, to assist with testing.

We scan the ABD up front to count the number of pages within it, and to confirm that if we submitted all those pages to one or more BIOs, it could be split at any point without creating a misaligned BIO. If the pages in the BIO are not usable (as in any of the above situations), the ABD is linearised, and then checked again. This is the same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size and allocator quirks.

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The idea is simply that it can hold all the state needed to create, submit and return multiple BIOs, including all the refcounts, the ABD copy if it was needed, and so on. Apart from what I hope is a clearer interface, the major difference is that because we know how many BIOs we'll need up front, we don't need the old overflow logic that would grow the BIO array, throw away all the old work and restart. We can get it right from the start.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 06a196020e6f70d2fedbd4d0d05bbe0c1ac6e4d8)
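A sketch of the device-limit query described (illustrative, not the actual vdev_disk code; BIO_MAX_VECS is the kernel-wide segment cap on newer kernels):

    #include <linux/blkdev.h>

    /* Ask the device for its own segment limit instead of always
     * filling BIOs to the kernel-wide maximum. */
    static unsigned int
    vdev_bio_max_segs_sketch(struct block_device *bdev)
    {
        unsigned int dev_segs = queue_max_segments(bdev_get_queue(bdev));

        return (min_t(unsigned int, dev_segs, BIO_MAX_VECS));
    }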
* vdev_disk: make read/write IO function configurable (Rob Norris, 2024-03-28, 1 file, -2/+21)

This is just setting up for the next couple of commits, which will add a new IO function and a parameter to select it.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit c4a13ba483f08a81aa47479d2f763a470d95b2b0)
* vdev_disk: reorganise vdev_disk_io_start (Rob Norris, 2024-03-28, 1 file, -20/+31)

Light reshuffle to make it a bit more linear to read and get rid of a bunch of args that aren't needed in all cases.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 867178ae1db28e73051c8a7ce662f2f2f81cd8e6)
* vdev_disk: rename existing functions to vdev_classic_* (Rob Norris, 2024-03-28, 2 files, -102/+118)

This is just renaming the existing functions we're about to replace and grouping them together to make the next commits easier to follow.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit f3b85d706bae82957d2e3e0ef1d53a1cfab60eb4)
* abd: add page iterator (Rob Norris, 2024-03-28, 3 files, -11/+139)

The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds abd_iterate_page_func, which yields unmapped struct page instead.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 390b448726c580999dd337be7a40b0e95cf1d50b)
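A sketch of how such a page-level iterator gets used (the callback prototype follows the description above but is an assumption, not the exact ZFS signature):

    /* Count BIO segments without ever mapping the pages. */
    static int
    count_segs_cb(struct page *page, size_t off, size_t len, void *priv)
    {
        unsigned int *segs = priv;

        (*segs)++;
        return (0);     /* nonzero would stop the iteration early */
    }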
* Linux 6.8 compat: use splice_copy_file_range() for fallback (Rob N, 2024-03-21, 1 file, -2/+14)

Linux 6.8 removes generic_copy_file_range(), which had been reduced to a simple wrapper around splice_copy_file_range(). Detect that function directly and use it if generic_ is not available.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #15930
Closes #15931
(cherry picked from commit ef08a4d4065d21414d7fedccac20da6bfda4dfd0)
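The fallback chain described looks roughly like this (the HAVE_* symbols stand in for whatever the configure checks are actually named):

    #if defined(HAVE_VFS_GENERIC_COPY_FILE_RANGE)
        ret = generic_copy_file_range(src_file, pos_in, dst_file,
            pos_out, len, flags);
    #elif defined(HAVE_VFS_SPLICE_COPY_FILE_RANGE)
        /* Linux 6.8+: the function generic_ used to wrap; note
         * that it takes no flags argument. */
        ret = splice_copy_file_range(src_file, pos_in, dst_file,
            pos_out, len);
    #endif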
* dmu: Allow buffer fills to fail (Alexander Motin, 2024-02-20, 4 files, -24/+36)

When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one.

On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in the specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance.

This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not before it, that would happen if we just call dbuf_undirty().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15665
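The resulting calling pattern, sketched from the description above (the argument lists are assumptions, and fill_buffer() is a stand-in for the faulting copy-in):

    /* Announce that the fill may fail, attempt the copy, then
     * report the outcome so a failed fill can undo the dirty
     * buffer instead of leaving it corrupted. */
    dmu_buf_will_fill(db, tx, B_TRUE);          /* canfail */
    error = fill_buffer(db->db_data, size);     /* may EFAULT */
    dmu_buf_fill_done(db, tx, error != 0);      /* failed? */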
* BRT: Fix slop space calculation with block cloning (Bi11, 2024-02-12, 1 file, -1/+2)

Similar to deduplication, the size of data duplicated by block cloning should not be included in the slop space calculation.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Yuxin Wang <[email protected]>
Closes #15874
* LUA: Backport CVE-2020-24370's patch (the-Chain-Warden-thresh, 2024-02-08, 1 file, -3/+4)

CVE-2020-24370 is a security vulnerability in lua. Although the CVE description in CVE-2020-24370 said that this CVE only affected lua 5.4.0, according to lua this CVE actually existed since lua 5.2. The root cause of this CVE is the negation overflow that occurs when you try to take the negative of 0x80000000. Thus, this CVE also exists in openzfs. Try to backport the fix to the lua in openzfs, since the original fix is for 5.4 and several functions have been changed.

https://github.com/advisories/GHSA-gfr4-c37g-mm3v
https://nvd.nist.gov/vuln/detail/CVE-2020-24370
https://www.lua.org/bugs.html#5.4.0-11
https://github.com/lua/lua/commit/a585eae6e7ada1ca9271607a4f48dfb1786

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: ChenHao Lu <[email protected]>
Closes #15847
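The root cause is easy to demonstrate in a few lines of standalone C (an illustration of the overflow, not the lua code):

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        int n = INT_MIN;    /* 0x80000000: the value from the CVE */

        /* There is no +2147483648 in a 32-bit int, so -n is
         * undefined behaviour and typically wraps back to INT_MIN. */
        printf("-n = %d\n", -n);

        /* The safe idiom negates in unsigned arithmetic instead. */
        printf("ok = %u\n", 0u - (unsigned int)n);
        return (0);
    }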
* Improve performance for zpool trim on linux (Umer Saleem, 2024-02-06, 1 file, -14/+58)

On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue trim command, which is synchronous. This commit updates vdev_disk_io_trim to use __blkdev_issue_discard, which is asynchronous. Unfortunately there isn't any asynchronous version for blkdev_issue_secure_erase, so performance of secure trim will still suffer.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes #15843
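The asynchronous pattern, sketched (the completion wiring is illustrative, and the __blkdev_issue_discard() argument list has varied across kernel versions):

    struct bio *bio = NULL;
    int error;

    /* Build the discard bio chain without waiting on it. */
    error = __blkdev_issue_discard(bdev, sector, nr_sects,
        GFP_NOFS, &bio);
    if (error == 0 && bio != NULL) {
        bio->bi_end_io = trim_completion;   /* hypothetical callback */
        bio->bi_private = zio;
        submit_bio(bio);
    }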
* BRT: Fix FICLONE/FICLONERANGE shortened copy (Tony Hutter, 2024-02-06, 4 files, -35/+65)

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state.

A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #15728
Closes #15842
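From userspace, the all-or-nothing contract looks like this (a standalone sketch):

    #include <linux/fs.h>
    #include <sys/ioctl.h>
    #include <stdio.h>

    /* Clone all of src_fd into dst_fd: it either fully succeeds
     * or fails; there is no way to report a shortened clone. */
    static int
    clone_file(int dst_fd, int src_fd)
    {
        if (ioctl(dst_fd, FICLONE, src_fd) == -1) {
            perror("FICLONE");  /* eg EXDEV, EOPNOTSUPP, EINVAL */
            return (-1);
        }
        return (0);
    }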
* Linux 6.8 compat: replace MAX_ORDER define (Rob Norris, 2024-01-29, 1 file, -9/+18)

MAX_ORDER has been renamed to MAX_PAGE_ORDER. Rather than just redefining it, instead define our own name and set it consistently from the start.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
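A sketch of that approach (the local macro name is illustrative):

    /* One local name, set once, whatever the kernel calls it. */
    #ifdef MAX_PAGE_ORDER               /* Linux 6.8+ */
    #define ABD_MAX_ORDER   (MAX_PAGE_ORDER)
    #else                               /* older kernels */
    #define ABD_MAX_ORDER   (MAX_ORDER)
    #endif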
* Linux 6.8 compat: implement strlcpy fallback (Rob Norris, 2024-01-29, 4 files, -0/+4)

Linux has removed strlcpy in favour of strscpy. This implements a fallback implementation of strlcpy for this case.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
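A minimal sketch of such a fallback, assuming a configure-detected guard named HAVE_STRLCPY (the guard name is illustrative):

    #ifndef HAVE_STRLCPY
    /* strlcpy() semantics on top of strscpy(): copy with truncation
     * and NUL termination, returning the length of the source. */
    static inline size_t
    strlcpy(char *dst, const char *src, size_t size)
    {
        strscpy(dst, src, size);
        return (strlen(src));
    }
    #endif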
* Linux 6.8 compat: update for new bdev access functions (Rob Norris, 2024-01-29, 1 file, -58/+79)

blkdev_get_by_path() and blkdev_put() have been replaced by bdev_open_by_path() and bdev_release(), which return a "handle" object with the bdev object itself inside. This adds detection for the new functions, and macros to handle the old and new forms consistently.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
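The shape of such a compat shim, sketched (the zfs_-prefixed names are illustrative, not the actual macros this commit adds):

    #ifdef HAVE_BDEV_OPEN_BY_PATH
    /* Linux 6.8+: open returns a handle wrapping the bdev. */
    typedef struct bdev_handle zfs_bdev_handle_t;
    #define zfs_bdev_open(path, mode, holder) \
        bdev_open_by_path(path, mode, holder, NULL)
    #define zfs_bdev_dev(h)     ((h)->bdev)
    #define zfs_bdev_close(h)   bdev_release(h)
    #else
    /* Older kernels: the block_device itself is the handle, via
     * blkdev_get_by_path()/blkdev_put(). */
    #endif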
* Don't assert mg_initialized due to device addition race (Paul Dagnelie, 2024-01-29, 1 file, -3/+0)

During device removal stress tests, we noticed that we were tripping the assertion that mg_initialized was true. After investigation, it was determined that the mg in question was the embedded log metaslab group for a newly added vdev; the normal mg had been initialized (by metaslab_sync_reassess, via vdev_sync_done). However, because the spa config alloc lock is not held as writer across both calls to metaslab_sync_reassess, it is possible for an allocation to happen between the two metaslab_groups being initialized. Because the metaslab code doesn't check the group in question, just the vdev's main mg, it is possible to get past the initial check in vdev_allocatable and later fail due to the assertion.

We simply remove the assertions. We could also consider locking the ALLOC lock around the reassess calls in vdev_sync_done, but that risks deadlocks. We could check the actual target mg in vdev_allocatable, but that risks racing with a passivation that comes in after that check but before the assertion. We still won't be able to actually allocate from the metaslab group if no metaslabs are ready, so this change shouldn't break anything.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #15818
* Update vdev devid and physpath if changed between imports (Ameer Hamza, 2024-01-29, 1 file, -13/+27)

If devid or physpath for a vdev changes between imports, ensure it is updated to the new value.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #15816
* linux spl: fix typo in top comment of spl-condvar.c (Tino Reichardt, 2024-01-29, 1 file, -1/+1)

Credential Implementation -> Condition Variables Implementation

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Closes #15782
* Make spl_kmem_cache size check consistent (youzhongyang, 2024-01-29, 1 file, -2/+3)

On Linux x86_64, a kmem cache can have a size of up to 4M; however, increasing spl_kmem_cache_slab_limit can lead to a crash due to the size check inconsistency.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Youzhong Yang <[email protected]>
Closes #15757
* Extend aux label to add path information (Ameer Hamza, 2024-01-29, 1 file, -23/+31)

Pool import logic uses vdev paths, so it makes sense to add path information on AUX vdev as well.

Reviewed-by: Umer Saleem <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #15737
* fix: Uber block label not always found for aux vdevs (Ameer Hamza, 2024-01-29, 1 file, -0/+31)

When a spare or l2cache (aux) vdev is added during pool creation, spa->spa_uberblock is not dumped until that point. Subsequently, the aux label is never synchronized after its initial creation, resulting in the uberblock label remaining undumped. The uberblock is crucial for lib_blkid in identifying the ZFS partition type. To address this issue, we now ensure sync of the uberblock label once if it's not dumped initially.

Reviewed-by: Umer Saleem <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #15737
* Fix a potential use-after-free in zfs_setsecattr() (Mark Johnston, 2024-01-29, 1 file, -2/+2)

In general, VOPs must not load the "z_log" field until having called zfs_enter_verify_zp().

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #15752
* Linux: Defer loading the object set in zfs_setattr() (Mark Johnston, 2024-01-29, 1 file, -1/+2)

We need to wait until after having done a zfs_enter() to load some fields from the zfsvfs structure. Otherwise a use-after-free is possible in the face of a concurrent rollback. Other functions in this file are careful to avoid this bug; I believe this is the only instance.

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #15752
* Fix file descriptor leak on pool import. (Pawel Jakub Dawidek, 2024-01-26, 1 file, -12/+51)

Descriptor leak can be easily reproduced by doing:

    # zpool import tank
    # sysctl kern.openfiles
    # zpool export tank; zpool import tank
    # sysctl kern.openfiles

We were leaking four file descriptors on every import. A similar leak most likely existed when using file-based VDEVs.

External-issue: https://reviews.freebsd.org/D43529
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #15630
* Fix cloning into mmaped and cached file. (Pawel Jakub Dawidek, 2024-01-19, 1 file, -0/+4)

If the destination file is mmaped and the mmaped region was already read, so it is cached, we need to update mmaped pages after successful clone using update_pages().

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Pointed out by: Ka Ho Ng <[email protected]>
Signed-off-by: Pawel Jakub Dawidek <[email protected]>
Closes #15772
* Autotrim High Load Average Fix (Kevin Jin, 2024-01-18, 1 file, -1/+2)

Switch from cv_wait() to cv_wait_idle() in vdev_autotrim_wait_kick(), which should mitigate the high load average while waiting.

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: jxdking <[email protected]>
Closes #15781
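The change amounts to swapping one call (a sketch; the cv and lock names are assumptions based on the function described):

    /* cv_wait_idle() sleeps like cv_wait(), but the wait is treated
     * as idle and so does not inflate the load average. */
    cv_wait_idle(&vd->vdev_autotrim_kick_cv, &vd->vdev_autotrim_lock);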
* Linux 6.7 compat: zfs_setattr fix atime update (Rob N, 2024-01-17, 1 file, -2/+1)

In db4fc559c I messed up and changed this bit of code to set the inode atime to an uninitialised value, when actually it was just supposed to load the atime from the inode to be stored in the SA. This changes it to what it should have been.

Ensure times change by the right amount: previously, we only checked if the times changed at all, which missed a bug where the atime was being set to an undefined value. Now ensure the times change by two seconds (or thereabouts), ensuring we catch cases where we set the time to something bonkers.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15762
Closes #15773
* spa: Let spa_taskq_param_get()'s addition of a newline be optional (Mark Johnston, 2024-01-16, 1 file, -6/+7)

For FreeBSD sysctls, we don't want the extra newline, since the sysctl(8) utility will format strings appropriately.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reported-by: Peter Holm <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #15719
* spa: Fix FreeBSD sysctl handlers (Mark Johnston, 2024-01-16, 1 file, -24/+6)

sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by sbuf_new_for_sysctl(). In particular, this code triggers an assertion failure in sbuf_clear().

Simplify by just using sysctl_handle_string() for both reading and setting the tunable.

Fixes: 6930ecbb7 ("spa: make read/write queues configurable")
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reported-by: Peter Holm <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #15719
* Fix livelist assertions for dedup and cloning (Alexander Motin, 2024-01-12, 1 file, -17/+14)

Two block pointers in livelist pointing to the same location may be caused not only by dedup, but also by block cloning. We should not assert D bit set in them.

Two block pointers in livelist pointing to the same location may have different logical birth time in case of dedup or cloning. We should assert identical physical birth time instead.

Assert identical physical block size between pointers in addition to checksum, since that is what checksums are calculated on.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15732
* Improve block sizes checks during cloning (Alexander Motin, 2024-01-12, 1 file, -6/+25)

- Fail if source block is smaller than destination. We can only grow blocks, not shrink them.
- Fail if we do not have full znode range lock. In that case grow is not even called. We should improve zfs_rangelock_cb() somehow to know when cloning needs to grow the block size, unlike write.
- Fail if we tried to resize, but failed. There are many reasons for it to fail that we can not predict at this level, so be ready for them. Unlike write, that may proceed after growth failure, block cloning can't and must return an error.

This fixes an assertion inside dmu_brt_clone() when it sees a different number of blocks held in destination than it got block pointers. Builds without ZFS_DEBUG returned EXDEV, so are not affected much.

Reviewed-by: Pawel Jakub Dawidek <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15724
Closes #15735
* Don't panic on unencrypted block in encrypted dataset (chrisperedun, 2024-01-08, 2 files, -4/+0)

While 763ca47 closes the situation of block cloning creating unencrypted records in encrypted datasets, existing data still causes a panic on read. Setting zfs_recover bypasses this, but at the cost of potentially ignoring more serious issues.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Peredun <[email protected]>
Closes #15677
* dbuf: Set dr_data when unoverriding after clone (Alexander Motin, 2024-01-08, 1 file, -3/+5)

Block cloning normally creates a dirty record without dr_data. But if the block is read after cloning, it is moved into DB_CACHED state and receives the data buffer. If after that we call dbuf_unoverride() to convert the dirty record into a normal write, we should give it the data buffer from the dbuf and release one.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15654
Closes #15656
* dbuf: Handle arcbuf assignment after block cloning (Alexander Motin, 2024-01-08, 1 file, -1/+11)

In some cases dbuf_assign_arcbuf() may be called on a block that was recently cloned. If it happened in the current TXG, we must undo the block cloning first, since the only dirty record per TXG can't and shouldn't mean both cloning and overwrite at the same time.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15653
* DMU: Fix lock leak on dbuf_hold() error (Alexander Motin, 2024-01-08, 1 file, -1/+1)

dmu_assign_arcbuf_by_dnode() should drop the dn_struct_rwlock lock in case dbuf_hold() failed. I don't have a reproduction for this, but it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15644
* BRT: Limit brt_vdev_dump() to only one vdev (Alexander Motin, 2024-01-08, 1 file, -47/+33)

Without this patch, on a pool of 60 vdevs with ZFS_DEBUG enabled, clone takes much more time than copy, while heavily trashing dbgmsg for no good reason, repeatedly dumping all vdevs' BRTs again and again, even unmodified ones. I am generally not sure this dumping is not excessive, but decided to keep it for now, just restricting its scope to something more reasonable.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15625
* ZIL: Remove 128K into 2x68K LWB split optimization (Alexander Motin, 2024-01-08, 1 file, -2/+0)

To improve 128KB block write performance in case of multiple VDEVs, ZIL used to split those writes into two 64KB ones. Unfortunately it was found to cause LWB buffer overflow, trying to write a maximum-sized 128KB TX_CLONE_RANGE record with 1022 block pointers into a 68KB buffer, since unlike TX_WRITE the ZIL code can't split it. This is a minimally-invasive temporary block cloning fix until the following more invasive prediction code refactoring.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15634
* Allow block cloning across encrypted datasets (oromenahar, 2024-01-08, 3 files, -18/+47)

When two datasets share the same master encryption key, it is safe to clone encrypted blocks. Currently only snapshots and clones of a dataset share with it the same encryption key.

Added a test for:
- Clone from encrypted sibling to encrypted sibling with non encrypted parent
- Clone from encrypted parent to inherited encrypted child
- Clone from child to sibling with encrypted parent
- Clone from snapshot to the original datasets
- Clone from foreign snapshot to a foreign dataset
- Cloning from non-encrypted to encrypted datasets
- Cloning from encrypted to non-encrypted datasets

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Original-patch-by: Pawel Jakub Dawidek <[email protected]>
Signed-off-by: Kay Pedersen <[email protected]>
Closes #15544
* ZIL: Do not clone blocks from the future (Alexander Motin, 2024-01-08, 2 files, -10/+43)

ZIL claim can not handle block pointers cloned from the future, since they are not yet allocated at that point. It may happen either if the block was just written when it was cloned, or if the pool was frozen or somehow else rewound on import.

Handle it from two sides: prevent cloning of blocks with physical birth time from not yet synced or frozen TXG, and abort ZIL claim if we still detect such blocks due to rewind or something else.

While there, assert that any cloned blocks we claim are really allocated by calling metaslab_check_free().

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15617
* ZIL: Remove TX_CLONE_RANGE replay for ZVOLs. (Alexander Motin, 2024-01-08, 1 file, -59/+1)

zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means we do not need to do anything (drop the references) for the not-yet-implemented TX_CLONE_RANGE replay for ZVOLs.

This is a logical follow up to #15603.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15612
* ZIO: Add overflow checks for linear buffers (Alexander Motin, 2024-01-08, 1 file, -2/+55)

Since we use a limited set of kmem caches, quite often we have unused memory after the end of the buffer. Put there up to a 512-byte canary when built with debug to detect buffer overflows at the free time.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15553
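The canary technique described, sketched (the pattern byte and size variables are illustrative, not the actual implementation):

    #define CANARY_BYTE 0xba        /* hypothetical pattern */

    /* At allocation: paint up to 512 bytes of slack after the buffer. */
    size_t slack = MIN(cache_size - buf_size, 512);
    memset((char *)buf + buf_size, CANARY_BYTE, slack);

    /* At free: any change to the slack means a buffer overflow. */
    for (size_t i = 0; i < slack; i++)
        ASSERT3U(((uint8_t *)buf)[buf_size + i], ==, CANARY_BYTE);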
* ZIL: Assert record sizes in different places (Alexander Motin, 2024-01-08, 6 files, -37/+115)

This should make sure we have the log written without overflows.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15517
* L2ARC: Restrict write size to 1/4 of the device (Alexander Motin, 2024-01-08, 1 file, -26/+4)

PR #15457 exposed weird logic in L2ARC write sizing. If it appeared bigger than device size, instead of limiting the write it reset all the system-wide tunables to their defaults. Aside from being excessive, it did not actually help with the problem, still allowing an infinite loop to happen. This patch removes the tunable-reverting logic, and instead limits L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15519
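The new cap amounts to something like this (a sketch; l2ad_start/l2ad_end bound the device's usable range in l2arc_dev_t, but treat the exact expression as an assumption):

    /* Never target more than a quarter of the cache device per pass. */
    uint64_t dev_size = dev->l2ad_end - dev->l2ad_start;
    uint64_t size = MIN(l2arc_write_max + l2arc_write_boost,
        dev_size / 4);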
* Linux: Reclaim unused spl_kmem_cache_reclaim (Alexander Motin, 2024-01-08, 1 file, -11/+0)

It has been unused for 3 years, since #10576.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15507
* FreeBSD: Optimize large kstat outputs (Alexander Motin, 2024-01-08, 1 file, -22/+16)

- Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl output.
- Use much faster sbuf_cat() instead of sbuf_printf("%s").

Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes to seconds, making dbufstat almost usable.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #15495
| | | | | | | | | | | | | - Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl output. - Use much faster sbuf_cat() instead of sbuf_printf("%s"). Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes to seconds, making dbufstat almost usable. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15495