path: root/cmd
* Remove hash_elements_max accounting from DBUF and ARC (Alexander Motin, 9 days ago; 1 file, -4/+1)

  Those values require global atomics to get the current hash_elements
  values in a few of the hottest code paths, while in all these years I
  never cared about it. If somebody wants it, it should be easy to get
  by periodic sampling, since neither ARC header nor DBUF counts change
  so fast that it would be difficult to catch. For now I've left the
  hash_elements_max kstat for ARC, since it was used/reported by
  arc_summary and removing it would break older versions, but now it
  just reports the current value.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16759
* zed: prevent automatic replacement of offline vdevs (Ameer Hamza, 13 days ago; 1 file, -2/+2)

  When an OFFLINE device is physically removed, a spare is
  automatically activated. However, this behavior differs in FreeBSD,
  where we do not transition from the OFFLINE state to REMOVED. Our
  support team has encountered cases where customers experienced
  unexpected behavior during drive replacements, with multiple spares
  activating for the same VDEV due to a single disk replacement. This
  patch ensures that a drive in an OFFLINE state remains in that state,
  preventing it from transitioning to REMOVED and being automatically
  replaced by a spare.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ameer Hamza <[email protected]>
  Closes #16751
* BRT: Rework structures and locks to be per-vdev (Alexander Motin, 13 days ago; 1 file, -21/+13)

  While the block cloning operation was made per-vdev from the
  beginning, before this change most of its data were protected by two
  pool-wide locks. This created lots of lock contention in many
  workloads. This change makes most of the block cloning data
  structures per-vdev, which allows them to be locked separately. The
  only pool-wide lock now is spa_brt_lock, which protects the array of
  per-vdev pointers and is in most cases taken as reader. This also
  splits the per-vdev locks into three different ones: bv_pending_lock
  protects the AVL tree of pending operations in open context,
  bv_mos_entries_lock protects the BRT ZAP object while it is being
  prefetched, and bv_lock protects the rest of the per-vdev context
  during the TXG commit process. There should be no functional
  difference aside from some optimizations.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored by: iXsystems, Inc.
  Closes #16740
* JSON: fix user properties output for zpool list (Umer Saleem, 2024-11-11; 1 file, -1/+6)

  This commit fixes JSON output for zpool list when user properties are
  requested with the -o flag. This case needs to be handled specially,
  since zpool_prop_to_name does not return a property name for user
  properties; instead, the name is stored in pl->pl_user_prop.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #16734
* JSON: fix user properties output for zfs list (Umer Saleem, 2024-11-07; 1 file, -1/+6)

  This commit fixes JSON output for zfs list when user properties are
  requested with the -o flag. This case needs to be handled specially,
  since zfs_prop_to_name does not return a property name for user
  properties; instead, the name is stored in pl->pl_user_prop.

  Reviewed-by: Ameer Hamza <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #16732
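  Both fixes above rely on the same selection pattern. A minimal sketch
  (prop_name_for_output() is a hypothetical helper, not the code from
  either patch; the fields are the standard libzfs zprop_list_t ones):

    #include <libzfs.h>

    static const char *
    prop_name_for_output(zprop_list_t *pl)
    {
        if (pl->pl_prop == ZPROP_USERPROP)
            return (pl->pl_user_prop);  /* user property name lives here */
        return (zpool_prop_to_name(pl->pl_prop)); /* native property */
    }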
* Verify parent_dev before calling udev_device_get_sysattr_value (Uglymotha, 2024-11-04; 1 file, -1/+2)

  Not all udev devices have parent devices, and calling the
  udev_device_get_* functions with a NULL pointer yields an assertion
  error.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Sietse <[email protected]>
  Co-authored-by: Sietse <[email protected]>
  Closes #16705
  Closes #16717
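  The shape of the guard, as a sketch against the real libudev API (the
  "scsi" subsystem arguments are illustrative):

    #include <libudev.h>

    static const char *
    lookup_parent_attr(struct udev_device *dev, const char *attr)
    {
        struct udev_device *parent =
            udev_device_get_parent_with_subsystem_devtype(dev,
            "scsi", "scsi_device");

        if (parent == NULL)
            return (NULL);  /* no parent: skip the sysattr lookup */
        return (udev_device_get_sysattr_value(parent, attr));
    }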
* zdb: add extra -T flag to show histograms of BRT refcounts (Rob Norris, 2024-11-01; 1 file, -5/+24)

  Sponsored-by: https://despairlabs.com/sponsor/
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16692
* Added output to `zpool online` and `offline` (Rich Ercolani, 2024-10-31; 1 file, -3/+15)

  I was surprised to discover today that `zpool online` and `zpool
  offline` don't print any information about why they failed in many
  cases; they just return 1 with no explanation. Let's improve that
  where we can without changing the library function.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Signed-off-by: Rich Ercolani <[email protected]>
  Closes #16244
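  A sketch of what such output can look like, assuming the usual libzfs
  calls (try_online() is a hypothetical wrapper, not the patch itself):

    static int
    try_online(libzfs_handle_t *hdl, zpool_handle_t *zhp, const char *vdev)
    {
        vdev_state_t newstate;

        if (zpool_vdev_online(zhp, vdev, 0, &newstate) != 0) {
            /* print libzfs's reason instead of silently returning 1 */
            (void) fprintf(stderr, "cannot online %s: %s\n",
                vdev, libzfs_error_description(hdl));
            return (1);
        }
        return (0);
    }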
* zdb: show bp in uberblock dump (Rob Norris, 2024-10-20; 1 file, -0/+4)

  Just another useful nugget of info in times of strife.

  Sponsored-by: https://despairlabs.com/sponsor/
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tino Reichardt <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16667
* zdb: fix printf format in dump_zap() (Martin Matuška, 2024-10-11; 1 file, -1/+1)

  When compiling zdb.c on 32-bit platforms, a format conversion error
  is reported for a printf() call in dump_zap(). Change the %l
  conversion to the macro %" PRIu64 " to match the platform size of a
  64-bit unsigned integer.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Martin Matuska <[email protected]>
  Closes #16635
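  For reference, the portable idiom (a standalone illustration, not the
  patch itself):

    #include <inttypes.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint64_t value = 1234567890123456789ULL;

        /*
         * "%lu" is only 32 bits wide on 32-bit platforms; PRIu64
         * expands to the right conversion for uint64_t everywhere.
         */
        (void) printf("value = %" PRIu64 "\n", value);
        return (0);
    }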
* zpool/zfs: allow --json wherever -j is allowed (Rob Norris, 2024-10-11; 2 files, -5/+33)

  Mostly so that when the JSON formatting options are also used, they
  all look the same. To my eye, `-j --json-flat-vdevs` suggests that
  they are different or unrelated, while `--json --json-flat-vdevs`
  invites no further questions.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Umer Saleem <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16632
* Always validate checksums for Direct I/O reads (Brian Atkinson, 2024-10-09; 1 file, -0/+6)

  This fixes an oversight in the Direct I/O PR. Nothing stops a process
  from manipulating the contents of a buffer for a Direct I/O read
  while the I/O is in flight. This can lead to checksum verify failures
  even though the disk contents are still correct, i.e. false reporting
  of checksum validation failures.

  To remedy this, all Direct I/O reads that have a checksum
  verification failure are treated as suspicious. In the event a
  checksum validation failure occurs for a Direct I/O read, the I/O
  request is reissued through the ARC. This allows for actual
  validation to happen and removes any possibility of the buffer being
  manipulated after the I/O has been issued.

  Just as with Direct I/O write checksum validation failures, Direct
  I/O read checksum validation failures are reported through zpool
  status -d in the DIO column. The zevent has also been updated to have
  both:

  1. dio_verify_wr -> checksum verification failure for writes
  2. dio_verify_rd -> checksum verification failure for reads

  This allows determining which I/O operation was the culprit for the
  checksum verification failure. All DIO errors are reported only on
  the top-level VDEV.

  Even though FreeBSD can write protect pages (stable pages), it still
  has the same issue as Linux with Direct I/O reads. This commit
  updates the following:

  1. Propagates checksum failures for reads all the way up to the
     top-level VDEV.
  2. Reports errors through zpool status -d as DIO.
  3. Has two zevents for checksum verify errors with Direct I/O, one
     for read and one for write.
  4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
     handle ABD buffer contents validation the same as Linux.
  5. Updates manipulate_user_buffer.c to also manipulate a buffer while
     a Direct I/O read is taking place.
  6. Adds a new ZTS test case, dio_read_verify, that stress tests the
     new code.
  7. Updates man pages.
  8. Adds an IMPLY statement to zio_checksum_verify() to make sure that
     Direct I/O reads are not issued as speculative.
  9. Removes self healing through mirror, raidz, and dRAID VDEVs for
     Direct I/O reads.

  This issue was first observed when installing a Windows 11 VM on a
  ZFS dataset with the dataset property direct set to always. The zpool
  devices would report checksum failures, but running a subsequent
  zpool scrub would not repair any data and report no errors.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #16598
* ztest: Fix scrub check in ztest_raidz_expand_check() (Brian Behlendorf, 2024-10-08; 1 file, -1/+18)

  The scrub code may return EBUSY under several possible scenarios,
  causing ztest to incorrectly ASSERT when verifying the result of a
  raidz expansion. Update the test case to allow EBUSY, since it does
  not indicate pool damage.

  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #16627
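  A sketch of the relaxed check (helper name mine; spa_scan() and
  VERIFY0() are the real interfaces):

    static void
    scrub_check_sketch(spa_t *spa)
    {
        int error = spa_scan(spa, POOL_SCAN_SCRUB);

        if (error == EBUSY)
            return; /* a scan is already in progress; not pool damage */
        VERIFY0(error);
    }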
* zpool/zfs: restore -V & --version options (Rob Norris, 2024-10-07; 2 files, -2/+2)

  The -j option added a round of getopt, which didn't know about the
  magic version flags. So just bypass the whole thing and go straight
  to the human output function for this special case.

  Sponsored-by: https://despairlabs.com/sponsor/
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Umer Saleem <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16615
  Closes #16617
* Update compatibility.d files (Brian Behlendorf, 2024-10-02; 2 files, -1/+49)

  Add an openzfs-2.3 compatibility file for the next release. While
  there are no compatibility differences between Linux and FreeBSD for
  2.3, symlinks for the -linux and -freebsd names are created for any
  scripts expecting that convention.

  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Tino Reichardt <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #16588
* Avoid computing strlen() inside loops (rilysh, 2024-10-02; 2 files, -8/+12)

  When compiling with -O0 (no optimizations), a strlen() call used in a
  loop condition is not evaluated once before the loop starts, so
  strlen() is called on every iteration, i.e. as many times as the
  string is long. Computing the length once before entering the loop
  avoids this. In some places, even with -O2, neither GCC nor Clang can
  recognize this pattern; that seems to happen with arrays of char
  pointers.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: rilysh <[email protected]>
  Closes #16584
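  The pattern in question, reduced to a standalone illustration:

    #include <string.h>

    /* Naive: strlen(s) is re-evaluated on every iteration at -O0. */
    static void
    scan_naive(const char *s)
    {
        for (size_t i = 0; i < strlen(s); i++) {
            /* ... work on s[i] ... */
        }
    }

    /* Fixed: the length is computed once before entering the loop. */
    static void
    scan_hoisted(const char *s)
    {
        size_t len = strlen(s);

        for (size_t i = 0; i < len; i++) {
            /* ... work on s[i] ... */
        }
    }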
* Support for longnames for files/directories (Linux part) (Sanjeev Bagewadi, 2024-10-01; 2 files, -3/+3)

  This patch adds the ability for ZFS to support file/directory names
  up to 1023 bytes. This number is chosen so we can support up to 255
  4-byte characters. This new feature is represented by the feature
  flag feature@longname.

  A new dataset property "longname" is also introduced to toggle
  longname support for each dataset individually. This property can be
  disabled even if the dataset contains longname files; in that case,
  new files cannot be created with long names, but existing longname
  files can still be looked up.

  Note that, to my knowledge, native Linux filesystems don't support
  names longer than 255 bytes, so there might be programs unable to
  work with long names.

  Note that an NFS server may need to use exportfs_get_name to
  reconnect dentries, and the buffer being passed is limited to
  NAME_MAX+1 (256). So NFS may not work when longname is enabled.

  Note that the FreeBSD VFS layer imposes a name length limit of 255,
  so even though we add code to support longnames here, it won't
  actually work there.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #15921
* Allocate zap_attribute_t from kmem instead of stack (Sanjeev Bagewadi, 2024-10-01; 2 files, -67/+80)

  This patch is preparatory work for the long name feature. It changes
  all users of zap_attribute_t to allocate it from kmem instead of the
  stack. It also makes the zap_attribute_t and zap_name_t structures
  variable-length.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #15921
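  A sketch of the resulting usage pattern, assuming the
  zap_attribute_alloc()/zap_attribute_free() helpers this work
  introduces (the cursor calls are the long-standing ZAP API):

    static void
    list_entries(objset_t *os, uint64_t zapobj)
    {
        zap_cursor_t zc;
        zap_attribute_t *za = zap_attribute_alloc(); /* was on the stack */

        for (zap_cursor_init(&zc, os, zapobj);
            zap_cursor_retrieve(&zc, za) == 0;
            zap_cursor_advance(&zc)) {
            /* ... consume za->za_name ... */
        }
        zap_cursor_fini(&zc);
        zap_attribute_free(za);
    }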
* zfs_log: add flex array fields to log record structs (Rob Norris, 2024-09-27; 2 files, -13/+17)

  ZIL log record structs (lr_XX_t) are frequently allocated with extra
  space after the struct to carry variable-sized "payload" items. Linux
  6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime
  bounds checking on memcpy() calls. Because these types had no
  indicator that they might use more space than their simple
  definition, __fortify_memcpy_chk will frequently complain about
  overruns, eg:

    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)

  To fix this, this commit adds flex array fields to all lr_XX_t
  structs that require them, and then uses those fields to access that
  end-of-struct area rather than more complicated casts and pointer
  addition.

  Sponsored-by: https://despairlabs.com/sponsor/
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16501
  Closes #16539
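  The underlying C idiom, as a reduced illustration (lr_example_t is a
  stand-in, not an actual lr_XX_t):

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t lr_foid;   /* fixed fields */
        uint64_t lr_length;
        uint8_t  lr_data[]; /* flexible array member: payload follows */
    } lr_example_t;

    static void
    fill_record(lr_example_t *lr, const char *name, size_t namelen)
    {
        lr->lr_length = namelen;
        /* FORTIFY now knows writes past the fixed fields are intended */
        memcpy(lr->lr_data, name, namelen);
    }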
* Add compatibility file for GRUB versions up to v2.06 (Umer Saleem, 2024-09-21; 3 files, -2/+27)

  GRUB is not able to detect a ZFS pool if a snapshot of the top-level
  boot pool is created. This issue is observed with GRUB versions up to
  v2.06 if the extensible_dataset feature is enabled on the ZFS boot
  pool. compatibility=grub2-2.06 enables all read-only compatible zpool
  features except extensible_dataset and other features that depend on
  it.

  The existing grub2 compatibility file is now renamed to grub2-2.12 to
  reflect the appropriate GRUB version. grub2-2.12 lists all read-only
  features that can be enabled on a boot pool for GRUB versions 2.12
  onwards. A new symlink grub2 is created that currently points to the
  grub2-2.12 compatibility file.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #13873
  Closes #15261
  Closes #15909
* zio_compress: introduce max size threshold (George Melikov, 2024-09-19; 1 file, -1/+2)

  Now the default compression is lz4, which can stop the compression
  process by itself on incompressible data. If there are additional
  size checks, we will only make our compressratio worse.

  The new usable compression thresholds are:
  - less than BPE_PAYLOAD_SIZE (embedded_data feature);
  - at least one saved sector.

  The old 12.5% threshold is left to minimize the effect on existing
  user expectations of CPU utilization.

  If data wasn't compressed, it will be saved as ZIO_COMPRESS_OFF, so
  if we really need to recompress data without ashift info and check
  anything, we can just compress it with a zero threshold. So we don't
  need a new feature flag here!

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: George Melikov <[email protected]>
  Closes #9416
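  The arithmetic, sketched: the long-standing 12.5% rule is
  "d_len = s_len - (s_len >> 3)" in zio_compress_data(); the one-sector
  expression below is a paraphrase of the new rule, not the literal
  code:

    static size_t
    max_acceptable_psize(size_t s_len, int ashift, int legacy)
    {
        if (legacy)
            return (s_len - (s_len >> 3));      /* save >= 12.5% */
        return (s_len - ((size_t)1 << ashift)); /* save >= 1 sector */
    }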
* cityhash: replace invocations with specialized versions when possible (Shengqi Chen, 2024-09-19; 1 file, -2/+2)

  So that we can get actual benefit from the last commit.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Tino Reichardt <[email protected]>
  Signed-off-by: Shengqi Chen <[email protected]>
  Closes #16131
  Closes #16483
* arcstat: add structural, types, states breakdown (Theera K., 2024-09-18; 2 files, -89/+210)

  Add ARC structural breakdown, ARC types breakdown, and ARC states
  breakdown, similar to arc_summary. Additional cleanups included.

  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Theera K. <[email protected]>
  Closes #16509
* Avoid fault diagnosis if multiple vdevs have errors (Don Brady, 2024-09-18; 1 file, -29/+72)

  When multiple drives are throwing errors, it is likely not the drives
  failing but rather a failure above the drives, like a controller. The
  context of active cases on the drive's peers is now considered when
  making a diagnosis.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #16531
* Adding Direct IO Support (Brian Atkinson, 2024-09-14; 2 files, -14/+62)

  Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

  O_DIRECT support in ZFS will always ensure there is coherency between
  buffered and O_DIRECT IO requests. This ensures that all IO requests,
  whether buffered or direct, will see the same file contents at all
  times. Just as in other filesystems, O_DIRECT does not imply O_SYNC:
  while data is written directly to VDEV disks, metadata will not be
  synced until the associated TXG is synced.

  For both O_DIRECT read and write requests, the offset and request
  size must, at a minimum, be PAGE_SIZE aligned. In the event they are
  not, EINVAL is returned unless the direct property is set to always
  (see below).

  For O_DIRECT writes: the request must also be block aligned
  (recordsize) or the write request will take the normal (buffered)
  write path. In the event the request is block aligned and a cached
  copy of the buffer exists in the ARC, it will be discarded from the
  ARC, forcing all further reads to retrieve the data from disk.

  For O_DIRECT reads: the only alignment restriction is PAGE_SIZE
  alignment. In the event the requested data is buffered (in the ARC),
  it will just be copied from the ARC into the user buffer.

  For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored
  in the event that file contents are mmap'ed. In this case, all
  requests that are at least PAGE_SIZE aligned will just fall back to
  the buffered paths. If the request is not PAGE_SIZE aligned, EINVAL
  will be returned as always, regardless of whether the file's contents
  are mmap'ed.

  Since O_DIRECT writes go through the normal ZIO pipeline, the
  following operations are supported just as with normal buffered
  writes: checksum, compression, encryption, and erasure coding.

  There is one caveat for the data integrity of O_DIRECT writes that is
  distinct for each of the OS's supported by ZFS.

  FreeBSD - FreeBSD is able to place user pages under write protection,
  so any data in the user buffers written directly down to the VDEV
  disks is guaranteed not to change. There is no concern with data
  integrity and O_DIRECT writes.

  Linux - Linux is not able to place anonymous user pages under write
  protection. Because of this, if the user decides to manipulate the
  page contents while the write operation is occurring, data integrity
  can not be guaranteed. However, there is a module parameter
  zfs_vdev_direct_write_verify that controls whether O_DIRECT writes to
  a top-level VDEV have a checksum verify run before the contents of
  the I/O buffer are committed to disk. In the event of a checksum
  verification failure the write will return EIO. The number of
  O_DIRECT write checksum verification errors can be observed by doing
  `zpool status -d`, which will list all verification errors that have
  occurred on a top-level VDEV. Along with `zpool status`, a ZED event
  will be issued as `dio_verify` when a checksum verification error
  occurs.

  ZVOLs and dedup are not currently supported with Direct I/O.

  A new dataset property `direct` has been added with the following 3
  allowable values:

    disabled - Accepts the O_DIRECT flag, but silently ignores it and
      treats the request as a buffered IO request.
    standard - Follows the alignment restrictions outlined above for
      write/read IO requests when the O_DIRECT flag is used.
    always   - Treats every write/read IO request as though it passed
      O_DIRECT, and will do O_DIRECT if the alignment restrictions are
      met; otherwise it will redirect through the ARC. This property
      will not allow a request to fail.

  There is also a module parameter zfs_dio_enabled that can be used to
  force all reads and writes through the ARC. Setting this module
  parameter to 0 behaves as if the direct dataset property were set to
  disabled.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Co-authored-by: Mark Maybee <[email protected]>
  Co-authored-by: Matt Macy <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Closes #10018
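  The alignment rules are visible from userland; a minimal illustrative
  sketch (not from the commit) of a conforming O_DIRECT write on Linux:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int
    dio_write_example(const char *path)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        void *buf;
        int fd, rc = -1;

        /* buffer, offset and length must all be PAGE_SIZE aligned */
        if (posix_memalign(&buf, pagesize, pagesize) != 0)
            return (-1);
        memset(buf, 0xab, pagesize);

        fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd >= 0) {
            if (pwrite(fd, buf, pagesize, 0) == (ssize_t)pagesize)
                rc = 0;
            (void) close(fd);
        }
        free(buf);
        return (rc);
    }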
* spa_prop_get: require caller to supply output nvlist (Rob Norris, 2024-09-06; 1 file, -2/+3)

  All callers of spa_prop_get() and spa_prop_get_nvlist() supplied
  their own preallocated nvlist (except ztest), so we can remove the
  option to have one allocated when none is supplied. This sidesteps a
  bug in spa_prop_get(), where the error variable wasn't initialised,
  which could lead to the provided nvlist being freed at the end.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16505
* zpool events: expand value strings for ZIO error values (Rob Norris, 2024-09-05; 1 file, -1/+23)

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Signed-off-by: Rob Norris <[email protected]>
* Add DDT prune command (Don Brady, 2024-09-04; 3 files, -10/+162)

  Requires the new 'flat' physical data, which has the start time for a
  class entry. The amount to prune can be based on a target percentage
  of the unique entries or based on the age (i.e., every entry older
  than N days).

  Sponsored-by: Klara, Inc.
  Sponsored-by: iXsystems, Inc.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #16277
* build: rename FORCEDEBUG_CPPFLAGS to LIBZPOOL_CPPFLAGS (Rob Norris, 2024-08-27; 4 files, -5/+5)

  This is just a very small attempt to make it more obvious that these
  flags aren't optional for libzpool-using programs, by not making it
  seem like there's an option to say "well, I don't _want_ to force
  debugging".

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Rich Ercolani <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Issue #16476
  Closes #16477
* zstream: build with debug to fix stack overruns (Rob Norris, 2024-08-27; 1 file, -0/+2)

  abd_t differs in size depending on whether or not ZFS_DEBUG is set.
  It turns out that libzpool is built with FORCEDEBUG_CPPFLAGS, which
  sets -DZFS_DEBUG, and so it always has a larger abd_t with extra
  debug fields, regardless of whether or not --enable-debug is set.

  zdb, ztest and zhack are also all built with FORCEDEBUG_CPPFLAGS, so
  they had the same idea of the size of abd_t, but zstream was not, and
  used the "smaller" abd_t. In practice this didn't matter because
  zstream never used abd_t directly.

  This changed in b4d81b1a6, when zstream was switched to use stack
  ABDs for compression. When built with --enable-debug, zstream
  implicitly gets ZFS_DEBUG, and everything was fine. Production builds
  without that flag end up with the smaller abd_t, which is now
  mismatched with libzpool and causes stack overruns in zstream
  recompress.

  The simplest fix for now is to compile zstream with
  FORCEDEBUG_CPPFLAGS like the other binaries. This commit does that.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Rich Ercolani <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Issue #16476
  Closes #16477
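  A reduced illustration of the mismatch (abd_like_t is hypothetical):
  one header, two sizes, so a binary built without -DZFS_DEBUG and a
  library built with it disagree about sizeof and field offsets:

    #include <stdint.h>

    typedef struct abd_like {
        uint64_t size;
    #ifdef ZFS_DEBUG
        struct abd_like *debug_parent; /* debug-only fields */
        uint64_t debug_magic;
    #endif
    } abd_like_t;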
* fm: pass io_flags through events & zed as uint64_t (Rob Norris, 2024-08-26; 1 file, -3/+12)

  In 4938d01db (#14086) zio_flag_t was converted from an enum
  (generally signed 32-bit) to a uint64_t. The corresponding change
  wasn't made to the error reporting subsystem, limiting the error
  flags delivered to zed to 32 bits. This bumps the whole pipeline to
  use uint64s.

  A tiny bit of compatibility is added for newer zed working against an
  older kernel module, because it's easy to do and misdetecting
  scrub/resilver errors and taking action is potentially dangerous.
  Making it work for new kernel modules against older zed seems to be
  far more invasive for far less benefit, so I have not.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16469
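  The truncation in miniature (a reduced illustration, not the event
  pipeline code):

    #include <stdint.h>

    static uint64_t
    deliver_flags(uint64_t io_flags)
    {
        uint32_t narrow = (uint32_t)io_flags; /* old: bits 32-63 lost */

        (void) narrow;
        return (io_flags); /* new: the full 64-bit word reaches zed */
    }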
* zpool: Provide GUID to zpool-reguid(8) with -g (#16239) (Mateusz Piotrowski, 2024-08-26; 2 files, -6/+19)

  This commit extends the zpool-reguid(8) command with a -g flag, which
  allows the user to specify the GUID to set. This change also adds
  some general tests for zpool-reguid(8).

  Sponsored-by: Wasabi Technology, Inc.
  Sponsored-by: Klara, Inc.
  Signed-off-by: Mateusz Piotrowski <[email protected]>
  Reviewed-by: Rob Norris <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
* Make mount.zfs(8) calling zfs_mount_at for legacy mounts as well (Low-power, 2024-08-23; 1 file, -3/+2)

  Commit 329e2ffa4bca456e65c3db7f5c5c04931c551b61 made mount.zfs(8)
  call the libzfs function 'zfs_mount_at', in order to propagate
  dataset properties into mount options. This fix, however, is limited
  to a special use case where mount.zfs(8) is used in initrd with
  option '-o zfsutil'. If either initrd or the user needs to use
  mount.zfs(8) to mount a file system with 'mountpoint' set to
  'legacy', '-o zfsutil' can't be used and the original issue #7947
  will still happen.

  Since the existing code already excluded the possibility of calling
  'zfs_mount_at' when it was invoked as a helper program from zfs(8),
  by checking the 'ZFS_MOUNT_HELPER' environment variable, it makes no
  sense to avoid calling 'zfs_mount_at' without '-o zfsutil'. An
  exception, however, is when mount.zfs(8) was invoked with
  '-o remount' to update the mount options for an existing mount
  point; in this case call mount(2) directly without modifying the
  mount options passed from the command line.

  Furthermore, don't run the mount.zfs(8) helper for automounting
  snapshots. The above change to make mount.zfs(8) call 'zfs_mount_at'
  apparently caused it to trigger an automount for the snapshot
  directory. When the helper was invoked as a result of a snapshot
  automount, an infinite recursion would occur. Since user-mode
  mount(8) was only needed for automounting to work around
  'vfs_kern_mount' being GPL-only, just run mount(8) without the
  mount.zfs(8) helper by adding option '-i'.

  Reviewed-by: Umer Saleem <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: WHR <[email protected]>
  Closes #16393
* zstream recompress: fix zero recompressed buffer and output (Rob Norris, 2024-08-22; 1 file, -10/+12)

  If compression happened, any garbage past the compressed size was not
  zeroed out. If compression didn't happen, then the payload size was
  still set to the rounded-up return from zio_compress_data(), which
  depends on the input and is not necessarily the logical size. So
  that's all fixed too, mostly by stealing the math from zio.c.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Signed-off-by: Rob Norris <[email protected]>
* zstream decompress: fix decompress size and output (Rob Norris, 2024-08-22; 1 file, -12/+19)

  This was incorrectly using the compressed length for the size of the
  decompress buffer, and quietly doing nothing if the decompressor
  refused to decompress the block because there wasn't enough space.
  After that, it wasn't correctly rewriting the record to indicate "not
  compressed". So that's fixed now. Sigh.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Signed-off-by: Rob Norris <[email protected]>
* compress: change zio_compress API to use ABDs (Rob Norris, 2024-08-22; 2 files, -18/+36)

  This commit changes the frontend zio_compress_data and
  zio_decompress_data APIs to take ABD pointers instead of buffer
  pointers. All callers are updated to match. Any that already have an
  appropriate ABD nearby now use it directly, while for the rest we
  create one. Internally, the ABDs are passed through to the provider
  directly.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Signed-off-by: Rob Norris <[email protected]>
* compress: change compression providers API to use ABDs (Rob Norris, 2024-08-22; 1 file, -2/+4)

  This commit changes the provider compress and decompress API to take
  ABD pointers instead of buffer pointers for both the data source and
  destination. It then updates all providers to match.

  This doesn't actually change the providers to do chunked compression;
  it just changes the API to allow such an update in the future. Helper
  macros are added to easily adapt the ABD functions to their
  buffer-based implementations.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Signed-off-by: Rob Norris <[email protected]>
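  A sketch of the adapter idea using the real abd_borrow_buf()/
  abd_return_buf() family; flat_compress() stands in for a provider
  that only understands linear buffers, and this is not the literal
  upstream macro:

    static size_t
    abd_compress_adapter(abd_t *src, abd_t *dst, size_t s_len,
        size_t d_len, int level)
    {
        void *s_buf = abd_borrow_buf_copy(src, s_len); /* linear view */
        void *d_buf = abd_borrow_buf(dst, d_len);
        size_t c_len = flat_compress(s_buf, d_buf, s_len, d_len, level);

        abd_return_buf(src, s_buf, s_len);      /* source unchanged */
        abd_return_buf_copy(dst, d_buf, d_len); /* copy result back */
        return (c_len);
    }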
* zstream: use zio_compress calls for compression (Rob Norris, 2024-08-22; 2 files, -118/+98)

  This updates zstream to use the zio_compress calls rather than its
  own dispatch. Since that was fairly entangled, some refactoring is
  included.

  Sponsored-by: Klara, Inc.
  Sponsored-by: Wasabi Technology, Inc.
  Signed-off-by: Rob Norris <[email protected]>
* ddt: dedup log (Rob Norris, 2024-08-16; 1 file, -1/+32)

  Adds a log/journal to dedup. At the end of a txg, instead of writing
  an entry directly to the ZAP, it is added to an in-memory tree and
  appended to an on-disk object. The on-disk object is only read at
  import, to reload the in-memory tree. Lookups first go to the log
  tree before going to the ZAP, so recently-used entries will remain
  close by in memory. This vastly reduces overhead from dedup IO, as it
  will not have to do so many read/update/write cycles on ZAP leaf
  nodes.

  A flushing facility is added at the end of txg, to push logged
  entries out to the ZAP. There are actually two separate "logs"
  (in-memory tree and on-disk object), one active (receiving updated
  entries) and one flushing (writing out to disk). These are swapped
  (i.e., flushing begins) based on memory used by the in-memory log
  trees and time since we last flushed something.

  The flushing facility monitors the amount of entries coming in and
  being flushed out, and calibrates itself to try to flush enough each
  txg to keep up with the ingest rate without competing too much with
  other IO. Multiple tuneables are provided to control the flushing
  facility.

  All the histograms and stats are updated to accommodate the log as a
  separate entry store. zdb gains knowledge of how to count them and
  dump them. Documentation included!

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Allan Jude <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: iXsystems, Inc.
  Closes #15895
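  The lookup order, as pseudocode (log_tree_find() and zap_store_find()
  are hypothetical names for the two entry stores):

    static ddt_entry_t *
    ddt_lookup_sketch(ddt_t *ddt, const ddt_key_t *ddk)
    {
        ddt_entry_t *dde;

        /* recently used entries sit in the in-memory log trees */
        if ((dde = log_tree_find(ddt, ddk)) != NULL)
            return (dde);

        /* otherwise fall back to the on-disk ZAP store */
        return (zap_store_find(ddt, ddk));
    }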
* ddt: cleanup the stats & histogram code (Rob Norris, 2024-08-16; 1 file, -12/+10)

  Both the API and the code were kinda mangled and I was really
  struggling to follow it. The worst offender was the old
  ddt_stat_add(); after fixing it up the rest of the changes are mostly
  knock-on effects and targets of opportunity.

  Note that the old ddt_stat_add() was safe against overflows - it
  could produce crazy numbers, but the compiler wouldn't do anything
  stupid. The assertions in ddt_stat_sub() go a lot of the way to
  protecting against this; getting into a position where overflows are
  a problem is definitely a programming error.

  Also, expanding ddt_stat_add() and ddt_histogram_empty() produces
  less efficient assembly. I'm not bothered about this right now
  though; these should not be hot functions, and if they are we'll
  optimise them later. If we have to go back to the old form, we'll
  comment it like crazy.

  Finally, I've removed the assertion that the bucket will never be
  negative, as it will soon be possible to have entries with zero
  refcounts: an entry for a block that is no longer on the pool, but is
  on the log waiting to be synced out. It might be better to have a
  separate bucket for these, since they're still using real space on
  disk, but ultimately these stats are driving UI, and for now I've
  chosen to keep them matching how they've looked in the past, as well
  as match the operator's mental model - pool usage is managed
  elsewhere.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: iXsystems, Inc.
  Closes #15895
* ddt: add "flat phys" featureRob Norris2024-08-161-25/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
* ddt: introduce lightweight entry (Rob Norris, 2024-08-16; 1 file, -7/+8)

  The idea here is that sometimes you need the contents of an entry
  with no intent to modify it, and/or from a place where it's difficult
  to get hold of its originating ddt_t to know how to interpret it.

  A lightweight entry contains everything you might need to "read" an
  entry - its key, type and phys contents - but none of the extras for
  modifying it or using it in a larger context. It also has the full
  complement of phys slots, so it can represent any kind of dedup entry
  without having to know the specific configuration of the table it
  came from.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: iXsystems, Inc.
  Closes #15893
* ddt: rework access to phys array slots (Rob Norris, 2024-08-16; 1 file, -7/+6)

  The "flat phys" feature will use only a single phys slot for all
  entries, which means the old "single", "double", etc. naming now
  makes no sense, and more importantly, means that choosing the right
  slot for a given block pointer will depend on how many slots are in
  use for a given DDT.

  This removes the old names and adds accessor macros to decouple
  specific phys array indexes from any particular meaning.

  (These macros look strange in isolation, mainly in the way they take
  the ddt_t* as an arg but don't use it. This is mostly a separate
  commit to introduce the concept to the reader before the "flat phys"
  commit extends it.)

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: iXsystems, Inc.
  Closes #15893
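  A hypothetical shape for such accessors (not the upstream
  definitions): taking the ddt_t lets a later "flat" table ignore p and
  always pick slot 0 without touching any call site:

    #define DDT_PHYS_IS_FLAT(ddt) \
        (((ddt)->ddt_flags & DDT_FLAG_FLAT) != 0)
    #define DDT_PHYS_SLOT(ddt, p) \
        (DDT_PHYS_IS_FLAT(ddt) ? 0 : (p))
    #define DDT_PHYS_GET(ddt, dde, p) \
        (&(dde)->dde_phys[DDT_PHYS_SLOT(ddt, p)])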
* zdb: rework DDT block count and leak check to just count the blocks (Rob Norris, 2024-08-16; 1 file, -120/+195)

  The upcoming dedup features break the long-held assumption that all
  blocks on disk with a 'D' dedup bit will always be present in the
  DDT, or will have the same set of DVA allocations on disk as in the
  DDT.

  If the DDT is no longer a complete picture of all the dedup blocks
  that will be and should be on disk, then it does us no good to walk
  and prime it up front, since it won't necessarily match up with every
  block we'll see anyway.

  Instead, we rework things here to be more like the BRT checks. When
  we see a dedup'd block, we look it up in the DDT, consume a refcount,
  and for the second-or-later instances, count them as duplicates.

  The DDT and BRT are moved ahead of the space accounting. This will
  become important for the "flat" feature, which may need to count a
  modified version of the block.

  Reviewed-by: Alexander Motin <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Allan Jude <[email protected]>
  Co-authored-by: Don Brady <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Sponsored-by: Klara, Inc.
  Sponsored-by: iXsystems, Inc.
  Closes #15892
* zstream: remove duplicate highbit64 definition (Rob Norris, 2024-08-09; 1 file, -9/+0)

  When building a static build (--disable-shared), zstream fails to
  link because of the duplicate highbit64() in libzpool/kernel.c. Since
  they're identical, and the libzpool one is visible to zstream, we
  remove zstream's copy and just use the common one.

  Sponsored-by: https://despairlabs.com/sponsor/
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rob Norris <[email protected]>
  Closes #16426
* JSON: Fix class values for mirrored special vdevs (Tony Hutter, 2024-08-06; 1 file, -4/+33)

  This fixes things so mirrored special vdevs report themselves as
  "class=special" rather than "class=normal". The discrepancy comes
  from the way the vdev nvlists are constructed:

    mirrored special devices - the 'mirror' vdev has allocation bias
    "special", while its leaf vdevs are "normal"

    single or RAID0 special devices - the leaf vdevs have allocation
    bias "special"

  This commit adds code to check whether a leaf's parent is a "special"
  vdev, in which case the leaf also reports "special".

  Reviewed-by: Ameer Hamza <[email protected]>
  Reviewed-by: Umer Saleem <[email protected]>
  Signed-off-by: Tony Hutter <[email protected]>
  Closes #16217
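  A sketch of the parent check (vdev_class() is a hypothetical helper;
  ZPOOL_CONFIG_ALLOCATION_BIAS is the real nvlist key):

    static const char *
    vdev_class(nvlist_t *nv, nvlist_t *parent)
    {
        const char *bias;

        if (nvlist_lookup_string(nv,
            ZPOOL_CONFIG_ALLOCATION_BIAS, &bias) == 0)
            return (bias); /* leaf carries its own bias */
        if (parent != NULL && nvlist_lookup_string(parent,
            ZPOOL_CONFIG_ALLOCATION_BIAS, &bias) == 0)
            return (bias); /* inherit "special" from a mirror parent */
        return ("normal");
    }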
* JSON output support for zpool status (Umer Saleem, 2024-08-06; 1 file, -260/+1444)

  This commit adds support for the zpool status command to display the
  status of ZFS pools in JSON format using the '-j' option. Status
  information is collected in an nvlist which is later dumped on stdout
  in JSON format. Existing options for zpool status work with the '-j'
  flag. The man page for zpool status is updated accordingly.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #16217
* JSON output support for zpool list (Umer Saleem, 2024-08-06; 1 file, -82/+297)

  This commit adds support for the zpool list command to output the
  list of ZFS pools in JSON format using the '-j' option. Information
  about available pools is collected in an nvlist which is later
  printed to stdout in JSON format. Existing options for zpool list
  work with the '-j' flag. The man page for zpool list is updated
  accordingly.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #16217
* JSON output support for zpool get (Umer Saleem, 2024-08-06; 2 files, -62/+245)

  This commit adds support for the zpool get command to output the list
  of properties for ZFS pools and VDEVs in JSON format using the '-j'
  option. The man page for zpool get is updated to include the '-j'
  option.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #16217
* JSON output support for zpool version (Umer Saleem, 2024-08-06; 1 file, -3/+65)

  This commit adds support for zpool version to output in JSON format
  using the '-j' option. Userland and kernel module versions are
  collected in an nvlist which is later displayed in JSON format. The
  man page for zpool is updated.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Ameer Hamza <[email protected]>
  Signed-off-by: Umer Saleem <[email protected]>
  Closes #16217