path: root/cmd
Commit message | Author | Age | Files | Lines
* ddt: dedup log | Rob Norris | 2024-08-16 | 1 | -1/+32
Adds a log/journal to dedup. At the end of txg, instead of writing the entry directly to the ZAP, it is added to an in-memory tree and appended to an on-disk object. The on-disk object is only read at import, to reload the in-memory tree. Lookups first go to the log tree before going to the ZAP, so recently-used entries will remain close by in memory. This vastly reduces overhead from dedup IO, as it will not have to do so many read/update/write cycles on ZAP leaf nodes. A flushing facility is added at end of txg, to push logged entries out to the ZAP. There are actually two separate "logs" (in-memory tree and on-disk object), one active (receiving updated entries) and one flushing (writing out to disk). These are swapped (i.e. flushing begins) based on memory used by the in-memory log trees and time since we last flushed something. The flushing facility monitors the amount of entries coming in and being flushed out, and calibrates itself to try to flush enough each txg to keep up with the ingest rate without competing too much with other IO. Multiple tuneables are provided to control the flushing facility. All the histograms and stats are updated to accommodate the log as a separate entry store. zdb gains knowledge of how to count them and dump them. Documentation included! Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
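A minimal sketch of the lookup order described above, using made-up names and array-backed stand-ins for the in-memory log tree and the on-disk ZAP (the real code uses AVL trees and ZAP objects):
----------
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t cksum; uint64_t refcnt; } toy_entry_t;

static toy_entry_t log_tree[4];		/* stand-in for the in-memory log tree */
static toy_entry_t zap_store[8];	/* stand-in for the on-disk ZAP */

static toy_entry_t *
find_in(toy_entry_t *tab, size_t n, uint64_t cksum)
{
	for (size_t i = 0; i < n; i++)
		if (tab[i].cksum == cksum)
			return (&tab[i]);
	return (NULL);
}

static toy_entry_t *
ddt_lookup_sketch(uint64_t cksum)
{
	/* Hot path: recently-used entries are served from memory, no ZAP I/O. */
	toy_entry_t *e = find_in(log_tree, 4, cksum);
	if (e != NULL)
		return (e);
	/* Cold path: fall back to the stored table. */
	return (find_in(zap_store, 8, cksum));
}

int
main(void)
{
	log_tree[0] = (toy_entry_t){ .cksum = 0xabc, .refcnt = 3 };
	zap_store[0] = (toy_entry_t){ .cksum = 0xdef, .refcnt = 1 };
	printf("%llu %llu\n",
	    (unsigned long long)ddt_lookup_sketch(0xabc)->refcnt,
	    (unsigned long long)ddt_lookup_sketch(0xdef)->refcnt);
	return (0);
}
----------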
* ddt: cleanup the stats & histogram code | Rob Norris | 2024-08-16 | 1 | -12/+10
Both the API and the code were kinda mangled and I was really struggling to follow it. The worst offender was the old ddt_stat_add(); after fixing it up the rest of the changes are mostly knock-on effects and targets of opportunity. Note that the old ddt_stat_add() was safe against overflows - it could produce crazy numbers, but the compiler wouldn't do anything stupid. The assertions in ddt_stat_sub() go a lot of the way to protecting against this; getting in a position where overflows are a problem is definitely a programming error. Also expanding ddt_stat_add() and ddt_histogram_empty() produces less efficient assembly. I'm not bothered about this right now though; these should not be hot functions, and if they are we'll optimise them later. If we have to go back to the old form, we'll comment it like crazy. Finally, I've removed the assertion that the bucket will never be negative, as it will soon be possible to have entries with zero refcounts: an entry for a block that is no longer on the pool, but is on the log waiting to be synced out. It might be better to have a separate bucket for these, since they're still using real space on disk, but ultimately these stats are driving UI, and for now I've chosen to keep them matching how they've looked in the past, as well as match the operator's mental model - pool usage is managed elsewhere. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
* ddt: add "flat phys" featureRob Norris2024-08-161-25/+43
Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, each value of the copies= parameter). Each of these is tracked independently of the others, and has its own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is not only a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never need to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
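A toy sketch of the copies= top-up decision described above, with made-up types and names (the real logic lives in zio_ddt_write() and is considerably more involved):
----------
#include <stdio.h>

typedef struct { int ndvas; } toy_flat_phys_t;	/* single phys, up to 3 DVAs */

static void
ddt_write_sketch(toy_flat_phys_t *phys, int copies_wanted)
{
	if (phys->ndvas >= copies_wanted) {
		/* Enough saved DVAs: fill the block pointer from the entry. */
		printf("fill BP from %d saved DVAs\n", copies_wanted);
		return;
	}
	/* Otherwise issue an adjusted write for just the missing copies. */
	int extra = copies_wanted - phys->ndvas;
	printf("write %d extra copies, then record them in the entry\n", extra);
	phys->ndvas = copies_wanted;
}

int
main(void)
{
	toy_flat_phys_t phys = { .ndvas = 1 };
	ddt_write_sketch(&phys, 1);	/* enough DVAs already */
	ddt_write_sketch(&phys, 3);	/* needs two more copies */
	return (0);
}
----------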
* ddt: introduce lightweight entry | Rob Norris | 2024-08-16 | 1 | -7/+8
The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where it's difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
* ddt: rework access to phys array slots | Rob Norris | 2024-08-16 | 1 | -7/+6
| | | | | | | | | | | | | | | | | | | | | | | The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
* zdb: rework DDT block count and leak check to just count the blocks | Rob Norris | 2024-08-16 | 1 | -120/+195
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The upcoming dedup features break the long held assumption that all blocks on disk with a 'D' dedup bit will always be present in the DDT, or will have the same set of DVA allocations on disk as in the DDT. If the DDT is no longer a complete picture of all the dedup blocks that will be and should be on disk, then it does us no good to walk and prime it up front, since it won't necessarily match up with every block we'll see anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. The DDT and BRT are moved ahead of the space accounting. This will become important for the "flat" feature, which may need to count a modified version of the block. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Co-authored-by: Don Brady <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15892
* zstream: remove duplicate highbit64 definition | Rob Norris | 2024-08-09 | 1 | -9/+0
| | | | | | | | | | | When building a static build (--disable-shared), zstream fails to link because of the duplicate highbit64() in libzpool/kernel.c. Since they're identical, and the libzpool one is visible to zstream, we remove zstream's copy and just use the common one. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16426
* JSON: Fix class values for mirrored special vdevs | Tony Hutter | 2024-08-06 | 1 | -4/+33
This fixes things so mirrored special vdevs report themselves as "class=special" rather than "class=normal". This happens due to the way the vdev nvlists are constructed:
- mirrored special devices: the 'mirror' vdev has allocation bias as "special" and its leaf vdevs are "normal"
- single or RAID0 special devices: leaf vdevs have allocation bias as "special"
This commit adds in code to check if a leaf's parent is a "special" vdev to see if it should also report "special". Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16217
* JSON output support for zpool status | Umer Saleem | 2024-08-06 | 1 | -260/+1444
This commit adds support for the zpool status command to display the status of ZFS pools in JSON format using the '-j' option. Status information is collected in nvlist which is later dumped on stdout in JSON format. Existing options for zpool status work with the '-j' flag. The man page for zpool status is updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zpool list | Umer Saleem | 2024-08-06 | 1 | -82/+297
This commit adds support for the zpool list command to output the list of ZFS pools in JSON format using the '-j' option. Information about available pools is collected in nvlist which is later printed to stdout in JSON format. Existing options for the zpool list command work with the '-j' flag. The man page for zpool list is updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zpool get | Umer Saleem | 2024-08-06 | 2 | -62/+245
| | | | | | | | | | | This commit adds support for zpool get command to output the list of properties for ZFS Pools and VDEVS in JSON format using '-j' option. Man page for zpool get is updated to include '-j' option. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zpool version | Umer Saleem | 2024-08-06 | 1 | -3/+65
This commit adds support for zpool version to output in JSON format using the '-j' option. Userland and kernel module versions are collected in nvlist which is later displayed in JSON format. The man page for zpool is updated. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zfs mount | Umer Saleem | 2024-08-06 | 1 | -5/+36
| | | | | | | | | | | This commit adds support for zfs mount to display mounted file systems in JSON format using '-j' option. Data is collected in nvlist which is printed in JSON format. man page for zfs mount is updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zfs list | Umer Saleem | 2024-08-06 | 1 | -44/+135
This commit adds support for JSON output for zfs list using the '-j' option. Information is collected in nvlist which is later printed in JSON format. Existing options for zfs list also work with '-j'. Man pages are updated with relevant information. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zfs version and zfs get | Umer Saleem | 2024-08-06 | 1 | -15/+213
| | | | | | | | | | | | | | | | This commit adds support for JSON output for zfs version and zfs get commands. '-j' flag can be used to get output in JSON format. Information is collected in nvlist objects which is later printed in JSON format. Existing options that work for zfs get and zfs version also work with '-j' flag. man pages for zfs get and zfs version are updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* Once more refactor arc_summary output | Alexander Motin | 2024-08-05 | 1 | -59/+88
Before this, arc_summary was not reporting any information about evictable ARC memory. As a result I found it difficult to analyze the behavior of a dnode-heavy workload with lots of unevictable buffers. This change adds evictable sizes into the states breakdown section. While there, add/refactor sections for global memory statistics, for ARC breakdown between different structures, and for data/metadata. Add information about memory reclamation requests. While there, refactor and polish graph mode, neglected for a while. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Fix zdb_dump_block for little endian (#16310) | Chunwei Chen | 2024-07-31 | 1 | -1/+1
| | | | | | | | | The endian macros were changed but zdb_dump_block wasn't updated accordingly. Signed-off-by: Chunwei Chen <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Allan Jude <[email protected]>
* ddt: add support for prefetching tables into the ARC | Allan Jude | 2024-07-26 | 3 | -10/+124
| | | | | | | | | | | | | | | | | | | | | | This change adds a new `zpool prefetch -t ddt $pool` command which causes a pool's DDT to be loaded into the ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity. Sponsored-by: iXsystems, Inc. Sponsored-by: Catalogics, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Will Andrews <[email protected]> Signed-off-by: Fred Weigel <[email protected]> Signed-off-by: Rob Norris <[email protected]> Signed-off-by: Don Brady <[email protected]> Co-authored-by: Will Andrews <[email protected]> Co-authored-by: Don Brady <[email protected]> Closes #15890
* Fix ZDB to dump projid for projectquota enabled (#16291) | Jitendra Patidar | 2024-07-25 | 1 | -0/+14
ZDB is supposed to dump "projid" via dump_znode(), when projectquota is enabled:
----------
static void
dump_znode(objset_t *os, uint64_t object, void *data, size_t size)
{
    ...
    if (dmu_objset_projectquota_enabled(os) && (pflags & ZFS_PROJID)) {
        uint64_t projid;
        if (sa_lookup(hdl, sa_attr_table[ZPL_PROJID], &projid,
            sizeof (uint64_t)) == 0)
            (void) printf("\tprojid %llu\n", (u_longlong_t)projid);
    }
    ...
}
----------
But it's not dumping "projid", even with project quota enabled. dmu_objset_projectquota_enabled() does the following 3 checks:
----------
boolean_t
dmu_objset_projectquota_enabled(objset_t *os)
{
    return (file_cbs[os->os_phys->os_type] != NULL &&
        DMU_PROJECTUSED_DNODE(os) != NULL &&
        spa_feature_is_enabled(os->os_spa, SPA_FEATURE_PROJECT_QUOTA));
}
----------
It fails on the file_cbs[] check. file_cbs[] gets initialised via dmu_objset_register_type(), which is not done for ZDB; it's done for the kernel via zfs_init(). Register a dummy callback handle for the DMU_OST_ZFS type in the ZDB main() function so that projid is dumped when projectquota is enabled. Signed-off-by: Jitendra Patidar <[email protected]> Closes #16290 Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]>
* zil: add stats for commit failure/fallback (#16315) | Rob Norris | 2024-07-25 | 1 | -0/+3
| | | | | | | | | | | | There's no good way to tell when a ZIL commit fails and falls back to a transaction sync, other than perhaps a throughput drop. This adds counters so we can see when it happens and why. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* ddt: dedup table quota enforcement | Allan Jude | 2024-07-25 | 1 | -6/+13
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about to be created. If the pool is over its dedup quota, the entry won't be created, and the corresponding write will be converted to a regular non-dedup write. Note that existing entries can be updated (ie their refcounts changed), as that reuses the space rather than requiring more. dedup_table_quota can be set to 'auto', which will set it based on the size of the devices backing the "dedup" allocation device. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated. Sponsored-by: iXsystems, Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Signed-off-by: Don Brady <[email protected]> Co-authored-by: Don Brady <[email protected]> Co-authored-by: Rob Wing <[email protected]> Co-authored-by: Sean Eric Fagan <[email protected]> Closes #15889
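A minimal sketch of the quota check described above, with illustrative names and sizes (the actual enforcement happens inside the DDT code when a new entry would be created):
----------
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative helper: is the table at or over its configured quota? */
static bool
ddt_over_quota_sketch(uint64_t table_size, uint64_t quota)
{
	return (quota != 0 && table_size >= quota);
}

int
main(void)
{
	uint64_t dedup_table_size = 900 << 20;		/* current DDT footprint */
	uint64_t dedup_table_quota = 1ULL << 30;	/* 1 GiB cap */

	if (ddt_over_quota_sketch(dedup_table_size, dedup_table_quota))
		printf("skip new DDT entry; issue a regular non-dedup write\n");
	else
		printf("create DDT entry as usual\n");
	return (0);
}
----------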
* zdb: fix BRT dump (#16335) | Rob Norris | 2024-07-18 | 1 | -2/+7
| | | | | | | | | | | | | BRT refcounts are stored as eight uint8_ts rather than a single uint64_t. This means that za_first_integer is only the first byte, so max 256. This fixes it by doing a lookup for the whole value. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]>
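A standalone illustration of the pitfall described above: a counter persisted as eight 1-byte integers appears capped at 255 if only the first byte (what za_first_integer holds) is consulted, while reading the full width recovers the real value. This is a toy example, not the BRT/ZAP code itself:
----------
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	uint64_t refcount = 1000;
	uint8_t stored[8];
	memcpy(stored, &refcount, sizeof (stored));	/* 8 x 1-byte integers */

	uint64_t first_byte_only = stored[0];		/* truncated view */
	uint64_t full;
	memcpy(&full, stored, sizeof (full));		/* whole-value lookup */

	printf("first byte: %llu, full: %llu\n",
	    (unsigned long long)first_byte_only, (unsigned long long)full);
	return (0);
}
----------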
* zdb: dump ZAP_FLAG_UINT64_KEY ZAPs properly (#16334) | Rob Norris | 2024-07-17 | 1 | -4/+26
| | | | | | | | | | | | These are used for DDT and BRT stores. There's limited information available to produce meaningful output, but at least we can put something on screen rather than crashing. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Fix zdb "Memory fault" found on FreeBSD ZTS (#16332)Tino Reichardt2024-07-091-1/+1
Reason: nvlist_free() tries to free something which isn't allocated. Solution: initialize this variable with NULL. Closes #16311 Signed-off-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* zdb: fix FreeBSD build failure | Ameer Hamza | 2024-06-06 | 1 | -1/+1
| | | | | | | | | This fixes FreeBSD build failure with clang-18 after 23a489a got merged. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16252
* zdb: detect cachefile automatically otherwise force import | Ameer Hamza | 2024-06-03 | 1 | -5/+61
If a pool is created with the cache file located in a non-default path (/etc/default/zpool.cache), the cache file is removed, or the cachefile property is set to none, zdb fails to show the pool unless we specify the cache file or use the -e option. This PR automates this process. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Akash B <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16071
* zpool import output is not formatted properly. | Pawel Jakub Dawidek | 2024-05-29 | 1 | -131/+138
The 'zpool status' output assumes that the longest prefix is six characters long plus colon plus space, e.g. 'status: ', 'action: ' or 'config: ' (so eight in total). This works well even when we have messages that require more than one line, as '\t' is exactly eight characters, just like the longest prefix. The 'zpool import' output is a bit different, as it may display the comment pool property; then the longest prefix is 'comment: ', which is nine characters long, not eight. All the prefixes were given an extra space in front, but:
- 'status: ' did not get an extra space.
- Messages that require more than one line should use nine spaces of indentation, not eight.
- The extra space in front looks redundant if there is no comment property set on the given pool.
Fix it by adding an extra space to all prefixes, but only if the comment property is defined. Also, when we need to continue the message in a new line, use '\t ' for indentation. While here, apply small corrections to a couple of messages.
Before:
  pool: tank
  id: 7412636063178848859
  state: ONLINE
  status: Some supported features are not enabled on the pool. (Note that they may be intentionally disabled if the 'compatibility' property is set.)
  action: The pool can be imported using its name or numeric identif[...] some features will not be available without an explicit 'zp[...]
  comment: Example comment.
  config:
    bclone ONLINE
      ada0 ONLINE
After:
  pool: tank
  id: 10180960571062436759
  state: ONLINE
  status: Some supported features are not enabled on the pool. (Note that they may be intentionally disabled if the 'compatibility' property is set.)
  action: The pool can be imported using its name or numeric identifi[...] some features will not be available without an explicit 'zp[...]
  config:
    tank ONLINE
      ada3 ONLINE

  pool: dozer
  id: 11028319538368222579
  state: ONLINE
  status: Some supported features are not enabled on the pool. (Note that they may be intentionally disabled if the 'compatibility' property is set.)
  action: The pool can be imported using its name or numeric identif[...] some features will not be available without an explicit 'z[...]
  comment: Example comment.
  config:
    dozer ONLINE
      ada1 ONLINE
Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Dawidek <[email protected]> Closes #16128
* zed: Add deadman-slot_off.sh zedlet | Brian Behlendorf | 2024-05-29 | 3 | -0/+80
Optionally turn off a disk's enclosure slot if a hung I/O triggers the deadman. It's possible for outstanding I/O to a misbehaving SCSI disk to neither promptly complete nor return an error. This can occur due to retry and recovery actions taken by the SCSI layer, driver, or disk. When it occurs the pool will be unresponsive even though there may be sufficient redundancy configured to proceed without this single disk. When a hung I/O is detected by the kmods it will be posted as a deadman event. By default an I/O is considered to be hung after 5 minutes. This value can be changed with the zfs_deadman_ziotime_ms module parameter. If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set, the disk's enclosure slot will be powered off, causing the outstanding I/O to fail. The ZED will then handle this like a normal disk failure. By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set. As part of this change, `zfs_deadman_events_per_second` is added to control the rate limiting of deadman events independently of delay events. In practice, a single deadman event is sufficient and more aren't particularly useful. Alphabetize the zfs_deadman_* entries in zfs.4. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #16226
* Use memset to zero stack allocations containing unions | Rob N | 2024-05-24 | 1 | -1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | C99 6.7.8.17 says that when an undesignated initialiser is used, only the first element of a union is initialised. If the first element is not the largest within the union, how the remaining space is initialised is up to the compiler. GCC extends the initialiser to the entire union, while Clang treats the remainder as padding, and so initialises according to whatever automatic/implicit initialisation rules are currently active. When Linux is compiled with CONFIG_INIT_STACK_ALL_PATTERN, -ftrivial-auto-var-init=pattern is added to the kernel CFLAGS. This flag sets the policy for automatic/implicit initialisation of variables on the stack. Taken together, this means that when compiling under CONFIG_INIT_STACK_ALL_PATTERN on Clang, the "zero" initialiser will only zero the first element in a union, and the rest will be filled with a pattern. This is significant for aes_ctx_t, which in aes_encrypt_atomic() and aes_decrypt_atomic() is initialised to zero, but then used as a gcm_ctx_t, which is the fifth element in the union, and thus gets pattern initialisation. Later, it's assumed to be zero, resulting in a hang. As confusing and undiscoverable as it is, by the spec, we are at fault when we initialise a structure containing a union with the zero initializer. As such, this commit replaces these uses with an explicit memset(0). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16135 Closes #16206
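A small standalone example of the C99 behaviour described above. The union layout here is made up to mirror the aes_ctx_t situation (a small first member, a larger later member); it is not the actual ICP structure:
----------
#include <stdio.h>
#include <string.h>

typedef struct { int mode; unsigned char small[8]; } toy_ecb_ctx_t;
typedef struct { int mode; unsigned char big[64]; } toy_gcm_ctx_t;

typedef union {
	toy_ecb_ctx_t ecb;	/* first (smaller) member */
	toy_gcm_ctx_t gcm;	/* later (larger) member */
} toy_ctx_t;

int
main(void)
{
	/*
	 * C99 6.7.8.17: only the first member (ecb) is guaranteed to be
	 * zeroed; the bytes beyond it may be treated as padding and, under
	 * -ftrivial-auto-var-init=pattern, filled with a pattern.
	 */
	toy_ctx_t a = { 0 };

	/* The fix applied by the commit: zero the whole object explicitly. */
	toy_ctx_t b;
	memset(&b, 0, sizeof (b));

	printf("last byte: a=%u b=%u\n", a.gcm.big[63], b.gcm.big[63]);
	return (0);
}
----------
Under the pattern-init behaviour described above, only the memset() version is guaranteed to read back zero in the larger member.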
* Correct level handling in zstream recompress. | Rich Ercolani | 2024-05-16 | 1 | -1/+1
| | | | | | | | | | sscanf returns number of items parsed on success and EOF on failure. Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #16198
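A sketch of the corrected pattern: sscanf() returns the number of items successfully converted (or EOF on input failure), so a one-item parse succeeds when the return value is 1, not 0. The helper name below is illustrative, not zstream's own code:
----------
#include <stdio.h>

/* Parse a compression level from a string; return 0 on success. */
static int
parse_level(const char *arg, int *level)
{
	if (sscanf(arg, "%d", level) != 1)
		return (-1);	/* not a number */
	return (0);
}

int
main(void)
{
	int level;
	printf("%d\n", parse_level("19", &level));	/* 0, level == 19 */
	printf("%d\n", parse_level("fast", &level));	/* -1 */
	return (0);
}
----------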
* zdb/ztest: send dbgmsg output to stderr | Rob Norris | 2024-05-14 | 2 | -4/+4
| | | | | | | | | | | | | And, make the output fd an arg to zfs_dbgmsg_print(). This is a change in behaviour, but keeps it consistent with where crash traces go, and it's easy to argue this is what we want anyway; this is information about the task, not the actual output of the task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* backtrace: rework for signal safety | Rob Norris | 2024-05-14 | 2 | -2/+2
| | | | | | | | | | Mostly, try a lot harder to not allocate anything. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* libspl: lift backtrace into a separate file | Rob Norris | 2024-05-14 | 2 | -2/+4
| | | | | | | | | | | If it's going to be used directly by zdb/ztest, then it sort of doesn't make sense to carry it with the assert code. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* zdb/ztest: use libspl backtrace for crashes | Rob Norris | 2024-05-14 | 2 | -22/+2
| | | | | | | | | | We can show much nicer backtraces these days, lets use them. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* zdb: bring crash handling over from ztest | Rob Norris | 2024-05-14 | 1 | -5/+56
| | | | | | | | | | | | ztest has a very nice ability to show a backtrace when there's an unexpected crash. zdb is used often enough on corrupted data and can blow up too, so nice output is useful there too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16181
* Better control the thread pool size when mounting datasets | Alan Somers | 2024-05-14 | 3 | -8/+20
Ever since a10d50f999, ZFS has mounted file systems in parallel when importing a pool. It uses a fixed size of 512 for the thread pool. But since c183d164aa1, it has also imported pools in parallel. So the total number of threads at one time is 513 * npools + 1. That can easily exceed the system's limit on the number of threads per process, which will cause one or more pools to be unable to allocate any worker threads, forcing them to fall back to slow serial mounting. To forestall that, manage the threadpool size in /sbin/zpool, not libzfs. Use the same size (512), but divided by the number of pools. This is a backwards-incompatible change to the libzfs ABI. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Alan Somers <[email protected]> Closes #16178
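The sizing rule described above as a tiny illustrative helper (not the actual zpool/libzfs code): keep the historical total of 512 mount worker threads, but split it across the pools being imported.
----------
#include <stdio.h>

/* Divide the historical 512-thread budget across the pools being imported. */
static int
mount_threads_per_pool(int npools)
{
	int per_pool = 512 / (npools > 0 ? npools : 1);
	return (per_pool > 0 ? per_pool : 1);
}

int
main(void)
{
	for (int n = 1; n <= 8; n *= 2)
		printf("%d pools -> %d threads each\n", n,
		    mount_threads_per_pool(n));
	return (0);
}
----------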
* Add support for parallel pool exports | Don Brady | 2024-05-14 | 1 | -6/+82
Changed spa_export_common() such that it no longer holds the spa_namespace_lock for the entire duration and instead sets spa_export_thread to indicate that an export is in progress on the spa. This allows an export of a different pool to proceed in parallel while an export is still processing potentially long operations like spa_unload_log_sm_flush_all(). Calls like spa_lookup() and spa_vdev_enter() that rely on the spa_namespace_lock to serialize them against a concurrent export now wait for any in-progress export thread to complete before proceeding. The 'zpool export -a' sub-command also provides multi-threaded support, using a thread pool to submit the exports in parallel. Sponsored-By: Klara Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #16153
* Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN. | chenqiuhao1997 | 2024-05-10 | 2 | -7/+10
In P2ALIGN, the result would be incorrect when align is an unsigned integer and x is larger than the maximum value of align's type. In that case, -(align) would be a positive integer, which means the high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Youzhong Yang <[email protected]> Signed-off-by: Qiuhao Chen <[email protected]> Closes #15940
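A self-contained demonstration of the truncation described above, using simplified forms of the two macros (a sketch of the definitions in the sysmacros headers, not a copy of them):
----------
#include <stdint.h>
#include <stdio.h>

#define P2ALIGN(x, align)		((x) & -(align))
#define P2ALIGN_TYPED(x, align, type)	((type)(x) & -(type)(align))

int
main(void)
{
	uint64_t x = 0x100000010ULL;	/* larger than UINT32_MAX */
	uint32_t align = 8;		/* unsigned, narrower than x */

	/*
	 * -(align) is evaluated as a 32-bit unsigned value (0xFFFFFFF8);
	 * converting it to 64 bits zero-extends, so the high 32 bits of
	 * x are silently masked off.
	 */
	printf("P2ALIGN:       0x%llx\n",
	    (unsigned long long)P2ALIGN(x, align));		/* 0x10 */
	printf("P2ALIGN_TYPED: 0x%llx\n",
	    (unsigned long long)P2ALIGN_TYPED(x, align, uint64_t)); /* 0x100000010 */
	return (0);
}
----------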
* zdb: add missing cleanup for early return | Ameer Hamza | 2024-05-09 | 1 | -25/+53
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16152
* vdev probe to slow disk can stall mmp write checker | Don Brady | 2024-04-29 | 1 | -1/+1
| | | | | | | | | | | | | | Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #15839
* Fix arcstats for FreeBSD after zfetch support | Ameer Hamza | 2024-04-29 | 1 | -2/+8
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16141
* GCC: Fixes for gcc 14 on Fedora 40 | Tony Hutter | 2024-04-29 | 1 | -1/+1
- Workaround dangling pointer in uu_list.c (#16124)
- Fix calloc() transposed arguments in zpool_vdev_os.c
- Make some temp variables unsigned to prevent triggering a '-Werror=alloc-size-larger-than' error.
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16124 Closes #16125
* zfs get: add '-t fs' and '-t vol' options | Ryan | 2024-04-22 | 1 | -6/+16
| | | | | | | | Make `zfs get` accept `fs` for `filesystem` and `vol` for `volume`. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan <[email protected]> Closes #16117
* ztest: use ASSERT3P to compare pointers | Brooks Davis | 2024-04-22 | 1 | -1/+1
With a sufficiently modern gcc (I saw this with gcc13), gcc complains when casting pointers to an integer of a different type (even a larger one). On 32-bit systems, ASSERT3U does this by casting a 32-bit pointer to uint64_t, so use ASSERT3P, which uses uintptr_t. Fixes: 5caeef02fa53 RAID-Z expansion feature Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brooks Davis <[email protected]> Closes #16115
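An illustrative snippet of the cast distinction (not the ztest change itself): converting a pointer through uintptr_t always matches the pointer width, while a direct cast to uint64_t changes width on 32-bit targets, which is what newer GCC warns about.
----------
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int x = 42;
	int *p = &x;

	/* Matches the pointer width on every target (what ASSERT3P relies on). */
	uintptr_t u = (uintptr_t)p;

	/*
	 * A direct (uint64_t)p widens a 32-bit pointer and triggers the
	 * warning; going via uintptr_t first keeps the cast width-correct.
	 */
	uint64_t u64 = (uint64_t)(uintptr_t)p;

	printf("%ju %ju\n", (uintmax_t)u, (uintmax_t)u64);
	return (0);
}
----------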
* Add newline to two zpool messages | Seth Troisi | 2024-04-22 | 1 | -2/+2
| | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Seth Troisi <[email protected]> Closes #16113
* Parallel pool import | George Wilson | 2024-04-22 | 2 | -26/+161
| | | | | | | | | | | | | | | | | | | | | | | | This commit allow spa_load() to drop the spa_namespace_lock so that imports can happen concurrently. Prior to dropping the spa_namespace_lock, the import logic will set the spa_load_thread value to track the thread which is doing the import. Consumers of spa_lookup() retain the same behavior by blocking when either a thread is holding the spa_namespace_lock or the spa_load_thread value is set. This will ensure that critical concurrent operations cannot take place while a pool is being imported. The zpool command is also enhanced to provide multi-threaded support when invoking zpool import -a. Lastly, zinject provides a mechanism to insert artificial delays when importing a pool and new zfs tests are added to verify parallel import functionality. Contributions-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #16093
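A toy pthreads model of the pattern described above: the namespace lock is dropped for the long-running load, but a marker identifying the loading thread is left behind so lookups still wait for it to finish. Names are illustrative; the real fields are spa_load_thread and spa_namespace_lock.
----------
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ns_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ns_cv = PTHREAD_COND_INITIALIZER;
static pthread_t load_thread;
static int load_in_progress;

static void
import_begin(void)
{
	pthread_mutex_lock(&ns_lock);
	load_in_progress = 1;
	load_thread = pthread_self();
	pthread_mutex_unlock(&ns_lock);	/* long import work proceeds unlocked */
}

static void
import_end(void)
{
	pthread_mutex_lock(&ns_lock);
	load_in_progress = 0;
	pthread_cond_broadcast(&ns_cv);
	pthread_mutex_unlock(&ns_lock);
}

static void
lookup(void)
{
	pthread_mutex_lock(&ns_lock);
	/* Block while another thread is mid-import, as spa_lookup() does. */
	while (load_in_progress && !pthread_equal(load_thread, pthread_self()))
		pthread_cond_wait(&ns_cv, &ns_lock);
	pthread_mutex_unlock(&ns_lock);
}

int
main(void)
{
	import_begin();
	import_end();
	lookup();
	printf("ok\n");
	return (0);
}
----------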
* Add zfetch stats in arcstats | Ameer Hamza | 2024-04-19 | 1 | -5/+42
arc_summary also reports zfetch stats, but it's inconvenient to monitor continuously incrementing numbers. Adding them in arcstats allows us to observe streams more conveniently. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16094
* zinject: "no-op" error injectionRob N2024-04-151-3/+4
| | | | | | | | | | | | | When injected, this causes the matching IO to appear to succeed, but the actual work is never submitted to the physical device. This can be used to simulate a write-back cache servicing a write, but the backing device has failed and the cache cannot complete the operation in the background. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16085
* zio: rename ZIO_TYPE_IOCTL to ZIO_TYPE_FLUSH | Rob Norris | 2024-04-11 | 1 | -5/+5
| | | | | | | | | | | | | The only possible ioctl is a flush, and any other kind of meta-operation introduced in the future is likely to have different semantics (much like trim did). So, lets just call it what it is. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16064
* Add support for zfs mount -R <filesystem> | Umer Saleem | 2024-04-11 | 1 | -14/+61
This commit adds support for mounting a dataset along with all of its children with the '-R' flag for zfs mount. There can be scenarios where we want to mount all datasets under one hierarchy instead of mounting all datasets present on the system with the '-a' flag. The '-R' flag should work on all root and non-root datasets. Usage information and the man page have been updated for zfs mount. A test for verifying the behavior of the '-R' flag is also added. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16015