aboutsummaryrefslogtreecommitdiffstats
path: root/cmd
Commit message (Collapse)AuthorAgeFilesLines
* Allocate zap_attribute_t from kmem instead of stackSanjeev Bagewadi2024-10-012-67/+80
| | | | | | | | | | | | This patch is preparatory work for long name feature. It changes all users of zap_attribute_t to allocate it from kmem instead of stack. It also make zap_attribute_t and zap_name_t structure variable length. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #15921
* zfs_log: add flex array fields to log record structsRob Norris2024-09-272-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ZIL log record structs (lr_XX_t) are frequently allocated with extra space after the struct to carry variable-sized "payload" items. Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime bounds checking on memcpy() calls. Because these types had no indicator that they might use more space than their simple definition, __fortify_memcpy_chk will frequently complain about overruns eg: memcpy: detected field-spanning write (size 7) of single field "lr + 1" at zfs_log.c:425 (size 0) memcpy: detected field-spanning write (size 9) of single field "(char *)(lr + 1)" at zfs_log.c:593 (size 0) memcpy: detected field-spanning write (size 4) of single field "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0) memcpy: detected field-spanning write (size 7) of single field "lr + 1" at zfs_log.c:425 (size 0) memcpy: detected field-spanning write (size 9) of single field "(char *)(lr + 1)" at zfs_log.c:593 (size 0) memcpy: detected field-spanning write (size 4) of single field "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0) memcpy: detected field-spanning write (size 7) of single field "lr + 1" at zfs_log.c:425 (size 0) memcpy: detected field-spanning write (size 9) of single field "(char *)(lr + 1)" at zfs_log.c:593 (size 0) memcpy: detected field-spanning write (size 4) of single field "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0) To fix this, this commit adds flex array fields to all lr_XX_t structs that require them, and then uses those fields to access that end-of-struct area rather than more complicated casts and pointer addition. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16501 Closes #16539
* Add compatibility file for GRUB versions up to v2.06Umer Saleem2024-09-213-2/+27
| | | | | | | | | | | | | | | | | | | | | | | | | GRUB is not able to detect ZFS pool if snaphsot of top level boot pool is created. This issue is observed with GRUB versions up to v2.06 if extensible_dataset feature is enabled on ZFS boot pool. compatibility=grub2-2.06 would enable all read-only compatible zpool features except extensible_dataset and other features that depend on it. The existing grub2 compatibility file is now renamed to grub2-2.12 to reflect the appropriate grub2 version. grub2-2.12 lists all read-only features that can be enabled on boot pool for grub2 with version 2.12 onwards. A new symlink grub2 is created that currently points to the grub2-2.12 compatibility file. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #13873 Closes #15261 Closes #15909
* zio_compress: introduce max size thresholdGeorge Melikov2024-09-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Now default compression is lz4, which can stop compression process by itself on incompressible data. If there are additional size checks - we will only make our compressratio worse. New usable compression thresholds are: - less than BPE_PAYLOAD_SIZE (embedded_data feature); - at least one saved sector. Old 12.5% threshold is left to minimize affect on existing user expectations of CPU utilization. If data wasn't compressed - it will be saved as ZIO_COMPRESS_OFF, so if we really need to recompress data without ashift info and check anything - we can just compress it with zero threshold. So, we don't need a new feature flag here! Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9416
* cityhash: replace invocations with specialized versions when possibleShengqi Chen2024-09-191-2/+2
| | | | | | | | | | | So that we can get actual benefit from last commit. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes #16131 Closes #16483
* arcstat: add structural, types, states breakdownTheera K.2024-09-182-89/+210
| | | | | | | | Add ARC structural breakdown, ARC types breakdown, ARC states breakdown similar to arc_summary. Additional cleanups included. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Theera K. <[email protected]> Closes #16509
* Avoid fault diagnosis if multiple vdevs have errorsDon Brady2024-09-181-29/+72
| | | | | | | | | | | | When multiple drives are throwing errors, it is likely not a drive failing but rather a failure above the drives, like a controller. The active cases context of the drive's peers is now considered when making a diagnosis. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #16531
* Adding Direct IO SupportBrian Atkinson2024-09-142-14/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write request the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls the if a O_DIRECT writes that can occur to a top-level VDEV before a checksum verify is run before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issues as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup is not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module parameter to 0, it mimics as if the direct dataset property is set to disabled. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Co-authored-by: Mark Maybee <[email protected]> Co-authored-by: Matt Macy <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Closes #10018
* spa_prop_get: require caller to supply output nvlistRob Norris2024-09-061-2/+3
| | | | | | | | | | | | | | | | | All callers to spa_prop_get() and spa_prop_get_nvlist() supplied their own preallocated nvlist (except ztest), so we can remove the option to have them allocate one if none is supplied. This sidesteps a bug in spa_prop_get(), where the error var wasn't initialised, which could lead to the provided nvlist being freed at the end. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16505
* zpool events: expand value strings for ZIO error valuesRob Norris2024-09-051-1/+23
| | | | | | Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* Add DDT prune commandDon Brady2024-09-043-10/+162
| | | | | | | | | | | | | | | | Requires the new 'flat' physical data which has the start time for a class entry. The amount to prune can be based on a target percentage of the unique entries or based on the age (i.e., every entry older than N days). Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #16277
* build: rename FORCEDEBUG_CPPFLAGS to LIBZPOOL_CPPFLAGSRob Norris2024-08-274-5/+5
| | | | | | | | | | | | | | | This is just a very small attempt to make it more obvious that these flags aren't optional for libzpool-using programs, by not making it seem like there's an option to say "well, I don't _want_ to force debugging". Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Issue #16476 Closes #16477
* zstream: build with debug to fix stack overrunsRob Norris2024-08-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | abd_t differs in size depending on whether or not ZFS_DEBUG is set. It turns out that libzpool is built with FORCEDEBUG_CPPFLAGS, which sets -DZFS_DEBUG, and so it always has a larger abd_t with extra debug fields, regardless of whether or not --enable-debug is set. zdb, ztest and zhack are also all built with FORCEDEBUG_CPPFLAGS, so had the same idea of the size of abd_t, but zstream was not, and used the "smaller" abd_t. In practice this didn't matter because it never used abd_t directly. This changed in b4d81b1a6, zstream was switched to use stack ABDs for compression. When built with --enable-debug, zstream implicitly gets ZFS_DEBUG, and everything was fine. Productions builds without that flag ends up with the smaller abd_t, which is now mismatched with libzpool, and causes stack overruns in zstream recompress. The simplest fix for now is to compile zstream with FORCEDEBUG_CPPFLAGS like the other binaries. This commit does that. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Issue #16476 Closes #16477
* fm: pass io_flags through events & zed as uint64_tRob Norris2024-08-261-3/+12
| | | | | | | | | | | | | | | | | | | In 4938d01db (#14086) zio_flag_t was converted from an enum (generally signed 32-bit) to a uint64_t. The corresponding change wasn't made to the error reporting subsystem, limiting the error flags being delivered to zed to 32 bits. This bumps the whole pipeline to use uint64s. A tiny bit of compatibility is added for newer zed working agsinst an older kernel module, because its easy to do and misdetecting scrub/resilver errors and taking action is potentially dangerous. Making it work for new kernel modules against older zed seems to be far more invasive for far less benefit, so I have not. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16469
* zpool: Provide GUID to zpool-reguid(8) with -g (#16239)Mateusz Piotrowski2024-08-262-6/+19
| | | | | | | | | | | | | | This commit extends the zpool-reguid(8) command with a -g flag, which allows the user to specify the GUID to set. This change also adds some general tests for zpool-reguid(8). Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Mateusz Piotrowski <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Make mount.zfs(8) calling zfs_mount_at for legacy mounts as wellLow-power2024-08-231-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 329e2ffa4bca456e65c3db7f5c5c04931c551b61 has made mount.zfs(8) to call libzfs function 'zfs_mount_at', in order to propagate dataset properties into mount options. This fix however, is limited to a special use case where mount.zfs(8) is used in initrd with option '-o zfsutil'. If either initrd or the user need to use mount.zfs(8) to mount a file system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and the original issue #7947 will still happen. Since the existing code already excluded the possibility of calling 'zfs_mount_at' when it was invoked as a helper program from zfs(8), by checking 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to avoid calling 'zfs_mount_at' without '-o zfsutil'. An exception however, is when mount.zfs(8) was invoked with '-o remount' to update the mount options for an existing mount point. In this case call mount(2) directly without modifying the mount options passed from command line. Furthermore, don't run mount.zfs(8) helper for automounting snapshot. The above change to make mount.zfs(8) to call 'zfs_mount_at' apparently caused it to trigger an automount for the snapshot directory. When the helper was invoked as a result of a snapshot automount, an infinite recursion will occur. Since the need of invoking user mode mount(8) for automounting was to overcome that the 'vfs_kern_mount' being GPL-only, just run mount(8) without the mount.zfs(8) helper by adding option '-i'. Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: WHR <[email protected]> Closes #16393
* zstream recompress: fix zero recompressed buffer and outputRob Norris2024-08-221-10/+12
| | | | | | | | | | | | | | | If compression happend, any garbage past the compress size was not zeroed out. If compression didn't happen, then the payload size was still set to the rounded-up return from zio_compress_data(), which is dependent on the input, which is not necessarily the logical size. So that's all fixed too, mostly from stealing the math from zio.c. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* zstream decompress: fix decompress size and outputRob Norris2024-08-221-12/+19
| | | | | | | | | | | | | | | This was incorrectly using the compressed length for the size of the decompress buffer, and quietly doing nothing if the decompressor refused to decompress the block because there wasn't enough space. After that, it wasn't correctly rewriting the record to indicate "not compressed". So that's fixed now. Sigh. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: change zio_compress API to use ABDsRob Norris2024-08-222-18/+36
| | | | | | | | | | | | | | This commit changes the frontend zio_compress_data and zio_decompress_data APIs to take ABD points instead of buffer pointers. All callers are updated to match. Any that already have an appropriate ABD nearby now use it directly, while at the rest we create an one. Internally, the ABDs are passed through to the provider directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* compress: change compression providers API to use ABDsRob Norris2024-08-221-2/+4
| | | | | | | | | | | | | | | This commit changes the provider compress and decompress API to take ABD pointers instead of buffer pointers for both data source and destination. It then updates all providers to match. This doesn't actually change the providers to do chunked compression, just changes the API to allow such an update in the future. Helper macros are added to easily adapt the ABD functions to their buffer-based implementations. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* zstream: use zio_compress calls for compressionRob Norris2024-08-222-118/+98
| | | | | | | | | | This is updating zstream to use the zio_compress calls rather than using its own dispatch. Since that was fairly entangled, some refactoring included. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
* ddt: dedup logRob Norris2024-08-161-1/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds a log/journal to dedup. At the end of txg, instead of writing the entry directly to the ZAP, instead its adding to an in-memory tree and appended to an on-disk object. The on-disk object is only read at import, to reload the in-memory tree. Lookups first go the the log tree before going to the ZAP, so recently-used entries will remain close by in memory. This vastly reduces overhead from dedup IO, as it will not have to do so many read/update/write cycles on ZAP leaf nodes. A flushing facility is added at end of txg, to push logged entries out to the ZAP. There's actually two separate "logs" (in-memory tree and on-disk object), one active (recieving updated entries) and one flushing (writing out to disk). These are swapped (ie flushing begins) based on memory used by the in-memory log trees and time since we last flushed something. The flushing facility monitors the amount of entries coming in and being flushed out, and calibrates itself to try to flush enough each txg to keep up with the ingest rate without competing too much with other IO. Multiple tuneables are provided to control the flushing facility. All the histograms and stats are update to accomodate the log as a separate entry store. zdb gains knowledge of how to count them and dump them. Documentation included! Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
* ddt: cleanup the stats & histogram codeRob Norris2024-08-161-12/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both the API and the code were kinda mangled and I was really struggling to follow it. The worst offender was the old ddt_stat_add(); after fixing it up the rest of the changes are mostly knock-on effects and targets of opportunity. Note that the old ddt_stat_add() was safe against overflows - it could produce crazy numbers, but the compiler wouldn't do anything stupid. The assertions in ddt_stat_sub() go a lot of the way to protecting against this; getting in a position where overflows are a problem is definitely a programming error. Also expanding ddt_stat_add() and ddt_histogram_empty() produces less efficient assembly. I'm not bothered about this right now though; these should not be hot functions, and if they are we'll optimise them later. If we have to go back to the old form, we'll comment it like crazy. Finally, I've removed the assertion that the bucket will never be negative, as it will soon be possible to have entries with zero refcounts: an entry for a block that is no longer on the pool, but is on the log waiting to be synced out. It might be better to have a separate bucket for these, since they're still using real space on disk, but ultimately these stats are driving UI, and for now I've chosen to keep them matching how they've looked in the past, as well as match the operators mental model - pool usage is managed elsewhere. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895
* ddt: add "flat phys" featureRob Norris2024-08-161-25/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
* ddt: introduce lightweight entryRob Norris2024-08-161-7/+8
| | | | | | | | | | | | | | | | | | | | The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
* ddt: rework access to phys array slotsRob Norris2024-08-161-7/+6
| | | | | | | | | | | | | | | | | | | | | | | The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
* zdb: rework DDT block count and leak check to just count the blocksRob Norris2024-08-161-120/+195
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The upcoming dedup features break the long held assumption that all blocks on disk with a 'D' dedup bit will always be present in the DDT, or will have the same set of DVA allocations on disk as in the DDT. If the DDT is no longer a complete picture of all the dedup blocks that will be and should be on disk, then it does us no good to walk and prime it up front, since it won't necessarily match up with every block we'll see anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. The DDT and BRT are moved ahead of the space accounting. This will become important for the "flat" feature, which may need to count a modified version of the block. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Allan Jude <[email protected]> Co-authored-by: Don Brady <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15892
* zstream: remove duplicate highbit64 definitionRob Norris2024-08-091-9/+0
| | | | | | | | | | | When building a static build (--disable-shared), zstream fails to link because of the duplicate highbit64() in libzpool/kernel.c. Since they're identical, and the libzpool one is visible to zstream, we remove zstream's copy and just use the common one. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16426
* JSON: Fix class values for mirrored special vdevsTony Hutter2024-08-061-4/+33
| | | | | | | | | | | | | | | | | | | | | This fixes things so mirrored special vdevs report themselves as "class=special" rather than "class=normal". This happens due to the way the vdev nvlists are constructed: mirrored special devices - The 'mirror' vdev has allocation bias as "special" and it's leaf vdevs are "normal" single or RAID0 special devices - Leaf vdevs have allocation bias as "special". This commit adds in code to check if a leaf's parent is a "special" vdev to see if it should also report "special". Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #16217
* JSON output support for zpool statusUmer Saleem2024-08-061-260/+1444
| | | | | | | | | | | | | This commit adds support for zpool status command to displpay status of ZFS pools in JSON format using '-j' option. Status information is collected in nvlist which is later dumped on stdout in JSON format. Existing options for zpool status work with '-j' flag. man page for zpool status is updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zpool listUmer Saleem2024-08-061-82/+297
| | | | | | | | | | | | | | | This commit adds support for zpool list command to output the list of ZFS pools in JSON format using '-j' option.. Information about available pools is collected in nvlist which is later printed to stdout in JSON format. Existing options for zfs list command work with '-j' flag. man page for zpool list is updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zpool getUmer Saleem2024-08-062-62/+245
| | | | | | | | | | | This commit adds support for zpool get command to output the list of properties for ZFS Pools and VDEVS in JSON format using '-j' option. Man page for zpool get is updated to include '-j' option. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zpool versionUmer Saleem2024-08-061-3/+65
| | | | | | | | | | | This commit adds support for zpool version to output in JSON format using '-j' option. Userland kernel module version is collected in nvlist which is later displayed in JSON format. man page for zpool is updated. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support zfs mountUmer Saleem2024-08-061-5/+36
| | | | | | | | | | | This commit adds support for zfs mount to display mounted file systems in JSON format using '-j' option. Data is collected in nvlist which is printed in JSON format. man page for zfs mount is updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zfs listUmer Saleem2024-08-061-44/+135
| | | | | | | | | | | | This commit adds support for JSON output for zfs list using '-j' option. Information is collected in JSON format which is later printed in jSON format. Existing options for zfs list also work with '-j'. man pages are updated with relevant information. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* JSON output support for zfs version and zfs getUmer Saleem2024-08-061-15/+213
| | | | | | | | | | | | | | | | This commit adds support for JSON output for zfs version and zfs get commands. '-j' flag can be used to get output in JSON format. Information is collected in nvlist objects which is later printed in JSON format. Existing options that work for zfs get and zfs version also work with '-j' flag. man pages for zfs get and zfs version are updated accordingly. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Umer Saleem <[email protected]> Closes #16217
* Once more refactor arc_summary outputAlexander Motin2024-08-051-59/+88
| | | | | | | | | | | | | | | | | | | | | Before this arc_summary was not reporting any information about evictable ARC memory. As result I've found difficult to analyze behavior of dnode-heavy workload with lots of unevictable buffers. This change adds evictable sizes into states breakdown section. While there, add/refactor sections for global memory statistics, for ARC breakdown between different structures, for data/metadata. Add information about memory reclamation requests. While there, refactor and polish graph mode, neglected for a while. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Fix zdb_dump_block for little endian (#16310)Chunwei Chen2024-07-311-1/+1
| | | | | | | | | The endian macros were changed but zdb_dump_block wasn't updated accordingly. Signed-off-by: Chunwei Chen <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Allan Jude <[email protected]>
* ddt: add support for prefetching tables into the ARCAllan Jude2024-07-263-10/+124
| | | | | | | | | | | | | | | | | | | | | | This change adds a new `zpool prefetch -t ddt $pool` command which causes a pool's DDT to be loaded into the ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity. Sponsored-by: iXsystems, Inc. Sponsored-by: Catalogics, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Signed-off-by: Will Andrews <[email protected]> Signed-off-by: Fred Weigel <[email protected]> Signed-off-by: Rob Norris <[email protected]> Signed-off-by: Don Brady <[email protected]> Co-authored-by: Will Andrews <[email protected]> Co-authored-by: Don Brady <[email protected]> Closes #15890
* Fix ZDB to dump projid for projectquota enabled (#16291)Jitendra Patidar2024-07-251-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ZDB is supposed to dump "projid" via dump_znode(), when projectquota is enabled. ----------- static void dump_znode(objset_t *os, uint64_t object, void *data, size_t size) { ... if (dmu_objset_projectquota_enabled(os) && (pflags & ZFS_PROJID)) { uint64_t projid; if (sa_lookup(hdl, sa_attr_table[ZPL_PROJID], &projid, sizeof (uint64_t)) == 0) (void) printf("\tprojid %llu\n", (u_longlong_t)projid); } ... } ---------- But its not dumping "projid", even for project quota enabled. dmu_objset_projectquota_enabled() does following 3 checks, ---------- boolean_t dmu_objset_projectquota_enabled(objset_t *os) { return (file_cbs[os->os_phys->os_type] != NULL && DMU_PROJECTUSED_DNODE(os) != NULL && spa_feature_is_enabled(os->os_spa, SPA_FEATURE_PROJECT_QUOTA)); } ---------- It fails on file_cbs[] check. file_cbs[] gets initialised via dmu_objset_register_type(); which is not done for the ZDB, its done for the kernel via zfs_init(). Register a dummy callback handle for the DMU_OST_ZFS type in ZDB main() function to dump the projid for projectquota enabled. Signed-off-by: Jitendra Patidar <[email protected]> Closes #16290 Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Tino Reichardt <[email protected]>
* zil: add stats for commit failure/fallback (#16315)Rob Norris2024-07-251-0/+3
| | | | | | | | | | | | There's no good way to tell when a ZIL commit fails and falls back to a transaction sync, other than perhaps a throughput drop. This adds counters so we can see when it happens and why. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* ddt: dedup table quota enforcementAllan Jude2024-07-251-6/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds two new pool properties: - dedup_table_size, the total size of all DDTs on the pool; and - dedup_table_quota, the maximum possible size of all DDTs in the pool When set, quota will be enforced by checking when a new entry is about to be created. If the pool is over its dedup quota, the entry won't be created, and the corresponding write will be converted to a regular non-dedup write. Note that existing entries can be updated (ie their refcounts changed), as that reuses the space rather than requiring more. dedup_table_quota can be set to 'auto', which will set it based on the size of the devices backing the "dedup" allocation device. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated. Sponsored-by: iXsystems, Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Signed-off-by: Don Brady <[email protected]> Co-authored-by: Don Brady <[email protected]> Co-authored-by: Rob Wing <[email protected]> Co-authored-by: Sean Eric Fagan <[email protected]> Closes #15889
* zdb: fix BRT dump (#16335)Rob Norris2024-07-181-2/+7
| | | | | | | | | | | | | BRT refcounts are stored as eight uint8_ts rather than a single uint64_t. This means that za_first_integer is only the first byte, so max 256. This fixes it by doing a lookup for the whole value. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]>
* zdb: dump ZAP_FLAG_UINT64_KEY ZAPs properly (#16334)Rob Norris2024-07-171-4/+26
| | | | | | | | | | | | These are used for DDT and BRT stores. There's limited information available to produce meaningful output, but at least we can put something on screen rather than crashing. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Fix zdb "Memory fault" found on FreeBSD ZTS (#16332)Tino Reichardt2024-07-091-1/+1
| | | | | | | | | | Reason: nvlist_free() tries to free sth. which isn't allocted Solution: init this variable with NULL Closes #16311 Signed-off-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* zdb: fix FreeBSD build failure Ameer Hamza2024-06-061-1/+1
| | | | | | | | | This fixes FreeBSD build failure with clang-18 after 23a489a got merged. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16252
* zdb: detect cachefile automatically otherwise force importAmeer Hamza2024-06-031-5/+61
| | | | | | | | | | | | If a pool is created with the cache file located in a non-default path /etc/default/zpool.cache, removed, or the cachefile property is set to none, zdb fails to show the pool unless we specify the cache file or use the -e option. This PR automates this process. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Akash B <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #16071
* zpool import output is not formated properly.Pawel Jakub Dawidek2024-05-291-131/+138
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'zpool status' output assumes that the longest prefix is six character long plus colon plus space, eg. 'status: ', 'action: ' or 'config: ' (so eight in total). This works well even when we have messages that requires more than one line, as '\t' is exactly eight characters, just like the longest prefix. The 'zpool import' output is a bit different, as it may display the comment pool property, then the longest prefix is 'comment: ', which is nine characters long, not eight. All the prefixes were given an extra space in front, but: - 'status: ' did not get an extra space. - Messages that require more than one line should use nine spaces of indentation, not eight. - The extra space in front looks redundant if there is no comment property set on the given pool. Fix it by adding an extra space to all prefixes, but only if the comment property is defined. Also, when we need to continue the message in a new line, use '\t ' for indentation. While here, apply small corrections to a couple messages. Before: pool: tank id: 7412636063178848859 state: ONLINE status: Some supported features are not enabled on the pool. (Note that they may be intentionally disabled if the 'compatibility' property is set.) action: The pool can be imported using its name or numeric identif[...] some features will not be available without an explicit 'zp[...] comment: Example comment. config: bclone ONLINE ada0 ONLINE After: pool: tank id: 10180960571062436759 state: ONLINE status: Some supported features are not enabled on the pool. (Note that they may be intentionally disabled if the 'compatibility' property is set.) action: The pool can be imported using its name or numeric identifi[...] some features will not be available without an explicit 'zp[...] config: tank ONLINE ada3 ONLINE pool: dozer id: 11028319538368222579 state: ONLINE status: Some supported features are not enabled on the pool. (Note that they may be intentionally disabled if the 'compatibility' property is set.) action: The pool can be imported using its name or numeric identif[...] some features will not be available without an explicit 'z[...] comment: Example comment. config: dozer ONLINE ada1 ONLINE Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Dawidek <[email protected]> Closes #16128
* zed: Add deadman-slot_off.sh zedletBrian Behlendorf2024-05-293-0/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optionally turn off disk's enclosure slot if an I/O is hung triggering the deadman. It's possible for outstanding I/O to a misbehaving SCSI disk to neither promptly complete or return an error. This can occur due to retry and recovery actions taken by the SCSI layer, driver, or disk. When it occurs the pool will be unresponsive even though there may be sufficient redundancy configured to proceeded without this single disk. When a hung I/O is detected by the kmods it will be posted as a deadman event. By default an I/O is considered to be hung after 5 minutes. This value can be changed with the zfs_deadman_ziotime_ms module parameter. If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set the disk's enclosure slot will be powered off causing the outstanding I/O to fail. The ZED will then handle this like a normal disk failure. By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set. As part of this change `zfs_deadman_events_per_second` is added to control the ratelimitting of deadman events independantly of delay events. In practice, a single deadman event is sufficient and more aren't particularly useful. Alphabetize the zfs_deadman_* entries in zfs.4. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #16226
* Use memset to zero stack allocations containing unionsRob N2024-05-241-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | C99 6.7.8.17 says that when an undesignated initialiser is used, only the first element of a union is initialised. If the first element is not the largest within the union, how the remaining space is initialised is up to the compiler. GCC extends the initialiser to the entire union, while Clang treats the remainder as padding, and so initialises according to whatever automatic/implicit initialisation rules are currently active. When Linux is compiled with CONFIG_INIT_STACK_ALL_PATTERN, -ftrivial-auto-var-init=pattern is added to the kernel CFLAGS. This flag sets the policy for automatic/implicit initialisation of variables on the stack. Taken together, this means that when compiling under CONFIG_INIT_STACK_ALL_PATTERN on Clang, the "zero" initialiser will only zero the first element in a union, and the rest will be filled with a pattern. This is significant for aes_ctx_t, which in aes_encrypt_atomic() and aes_decrypt_atomic() is initialised to zero, but then used as a gcm_ctx_t, which is the fifth element in the union, and thus gets pattern initialisation. Later, it's assumed to be zero, resulting in a hang. As confusing and undiscoverable as it is, by the spec, we are at fault when we initialise a structure containing a union with the zero initializer. As such, this commit replaces these uses with an explicit memset(0). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16135 Closes #16206