path: root/include/sys
Commit message | Author | Date | Files | Lines (-/+)

* lib/: set O_CLOEXEC on all fds | наб | 2021-04-14 | 1 | -2/+2

  As found by
      git grep -E '(open|setmntent|pipe2?)\(' |
          grep -vE '((zfs|zpool)_|fd|dl|lzc_re|pidfile_|g_)open\('

  FreeBSD's pidfile_open() says nothing about the flags of the files it
  opens, but we can't do anything about it anyway; the implementation
  does open all files with O_CLOEXEC.

  Consider this output with zpool.d/media appended with
  "pid=$$; (ls -l /proc/$pid/fd > /dev/tty)":

      $ /sbin/zpool iostat -vc media
      lrwx------ 0 -> /dev/pts/0
      l-wx------ 1 -> 'pipe:[3278500]'
      l-wx------ 2 -> /dev/null
      lrwx------ 3 -> /dev/zfs
      lr-x------ 4 -> /proc/31895/mounts
      lrwx------ 5 -> /dev/zfs
      lr-x------ 10 -> /usr/lib/zfs-linux/zpool.d/media

  vs

      $ ./zpool iostat -vc vendor,upath,iostat,media
      lrwx------ 0 -> /dev/pts/0
      l-wx------ 1 -> 'pipe:[3279887]'
      l-wx------ 2 -> /dev/null
      lr-x------ 10 -> /usr/lib/zfs-linux/zpool.d/media

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11866
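
  For readers unfamiliar with the flag, a minimal standalone sketch of
  O_CLOEXEC; this is an illustration, not code from the commit:

  ```c
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int
  main(void)
  {
  	/* O_CLOEXEC marks the fd close-on-exec, so it is not inherited
  	 * by children such as the zpool.d scripts run by
  	 * `zpool iostat -c`. */
  	int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);

  	if (fd < 0) {
  		perror("open");
  		return (1);
  	}
  	printf("fd %d will not leak into exec'd children\n", fd);
  	close(fd);
  	return (0);
  }
  ```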

* Use dsl_scan_setup_check() to set up a scrub | Brian Behlendorf | 2021-04-14 | 1 | -0/+1

  When a rebuild completes it will automatically schedule a follow-up
  scrub to verify all of the block checksums. Before setting up the
  scrub, execute the counterpart dsl_scan_setup_check() function to
  confirm the scrub can be started. Prior to this change we'd only
  check vdev_rebuild_active(), which isn't as comprehensive, and using
  the check function keeps all of this logic in one place.

  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11849

* Ratelimit deadman zevents as with delay zevents | Ryan Moeller | 2021-04-14 | 1 | -1/+2

  Just as delay zevents can flood the zevent pipe when a vdev becomes
  unresponsive, so do the deadman zevents. Ratelimit deadman zevents
  according to the same tunable as for delay zevents.

  Enable deadman tests on FreeBSD and add a test for deadman event
  ratelimiting.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11786
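
  The shape of the mechanism, as a standalone sketch: a generic
  fixed-window ratelimiter. The real ZFS ratelimiter and its tunables
  live in the kernel and differ in detail.

  ```c
  #include <stdio.h>
  #include <time.h>

  /* Allow at most `burst` events per `interval` seconds; drop the
   * rest. This is the behavior applied to delay (and now deadman)
   * zevents. */
  typedef struct {
  	unsigned int count;     /* events seen in the current window */
  	unsigned int burst;     /* events allowed per window */
  	time_t start;           /* start of the current window */
  	unsigned int interval;  /* window length, seconds */
  } ratelimit_t;

  static int
  ratelimit_allow(ratelimit_t *rl)
  {
  	time_t now = time(NULL);

  	if (now - rl->start >= rl->interval) {
  		rl->start = now;        /* open a new window */
  		rl->count = 0;
  	}
  	return (rl->count++ < rl->burst);
  }

  int
  main(void)
  {
  	ratelimit_t rl = { .burst = 5, .interval = 60,
  	    .start = time(NULL) };

  	for (int i = 0; i < 8; i++)
  		printf("event %d: %s\n", i,
  		    ratelimit_allow(&rl) ? "posted" : "dropped");
  	return (0);
  }
  ```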

* Fix various typos | Andrea Gelmini | 2021-04-07 | 3 | -3/+3

  Correct an assortment of typos throughout the code base.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11774

* Use a helper function to clarify gang block size | Matthew Ahrens | 2021-03-26 | 2 | -0/+15

  For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for
  the gang DVA including its child BPs. The space allocated at each
  DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.

  This commit makes this relationship more clear by using a helper
  function, `vdev_gang_header_asize()`, for the space allocated at the
  gang block's vdev/offset.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11744
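
  A self-contained model of the math behind the helper.
  SPA_GANGBLOCKSIZE is one 512-byte sector in ZFS; psize_to_asize()
  below is a simplified stand-in for vdev_psize_to_asize(), rounding
  up to the vdev's allocation granularity:

  ```c
  #include <stdint.h>
  #include <stdio.h>

  #define	SPA_GANGBLOCKSIZE	512ULL	/* one sector */

  /* Round a physical size up to the vdev's allocation unit
   * (1 << ashift); a sketch of what vdev_psize_to_asize() does. */
  static uint64_t
  psize_to_asize(uint64_t psize, unsigned ashift)
  {
  	uint64_t align = 1ULL << ashift;

  	return ((psize + align - 1) & ~(align - 1));
  }

  int
  main(void)
  {
  	/* On a 4K-sector vdev (ashift=12) the 512-byte gang header
  	 * still occupies a full 4K allocation. */
  	printf("gang header asize: %llu\n",
  	    (unsigned long long)psize_to_asize(SPA_GANGBLOCKSIZE, 12));
  	return (0);
  }
  ```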

* Removed duplicated includes | Andrea Gelmini | 2021-03-22 | 2 | -2/+0

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11775

* Split dmu_zfetch() speculation and execution parts | Alexander Motin | 2021-03-19 | 1 | -7/+16

  To make better predictions on parallel workloads, dmu_zfetch() should
  be called as early as possible to reduce possible request reordering.
  In particular, it should be called before
  dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep
  waiting for indirect blocks, waking up multiple threads at the same
  time on completion; that can significantly reorder the requests,
  making the stream look random. But we should not issue prefetch
  requests before the on-demand ones, since they may get to the disks
  first despite the I/O scheduler, increasing on-demand request
  latency.

  This patch splits dmu_zfetch() into two functions:
  dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed
  as early as needed. It only updates statistics and makes predictions
  without issuing any I/Os. The I/O issuance is handled by
  dmu_zfetch_run(), which can be called later when all on-demand I/Os
  are already issued. It even tracks the activity of other concurrent
  threads, issuing the prefetch only when _all_ on-demand requests are
  issued.

  For many years this was a big problem for storage servers handling
  deeper request queues from their clients: they had to either
  serialize consecutive reads to make the ZFS prefetcher usable, or
  execute the incoming requests as-is and get almost no prefetch from
  ZFS, relying only on deep enough prefetch by the clients. The
  benefits of those approaches varied, but neither was perfect. With
  this patch, deeper-queue sequential read benchmarks with
  CrystalDiskMark from Windows via iSCSI to a FreeBSD target show much
  better throughput with an almost 100% prefetcher hit rate, compared
  to almost zero before.

  While there, I also removed the per-stream zs_lock as useless, being
  completely covered by the parent zf_lock. I also reused the
  zs_blocks refcount to track zf_stream linkage of the stream, since I
  believe the previous zs_fetch == NULL check in
  dmu_zfetch_stream_done() was racy.

  Delete prefetch streams when they reach the ends of files. It saves
  up to 1KB of RAM per file, plus reduces searches through the stream
  list.

  Block data prefetch (speculation and indirect block prefetch is
  still done since they are cheaper) if all dbufs of the stream are
  already in the DMU cache. The first cache miss immediately fires all
  the prefetch that would have been done for the stream by that time.
  It saves some CPU time if the same files within DMU cache capacity
  are read over and over.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Adam Moss <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #11652
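
  A toy model of the resulting call pattern; heavily simplified, with
  invented types, and not the real signatures or stream state:

  ```c
  #include <stdbool.h>
  #include <stdio.h>

  typedef struct {
  	long next_blkid;  /* next expected block in the stream */
  	long pf_nblks;    /* how far ahead to prefetch */
  } zstream_model_t;

  /* Phase 1: update predictions only; never blocks, issues no I/O. */
  static bool
  zfetch_prepare(zstream_model_t *zs, long blkid)
  {
  	if (blkid != zs->next_blkid)
  		return (false);         /* not sequential: no prefetch */
  	zs->next_blkid = blkid + 1;
  	zs->pf_nblks = 4;               /* predict four blocks ahead */
  	return (true);
  }

  /* Phase 2: issue the predicted prefetch after on-demand I/O. */
  static void
  zfetch_run(const zstream_model_t *zs)
  {
  	printf("prefetch %ld blocks starting at %ld\n",
  	    zs->pf_nblks, zs->next_blkid);
  }

  int
  main(void)
  {
  	zstream_model_t zs = { .next_blkid = 0 };

  	for (long b = 0; b < 3; b++) {
  		bool hit = zfetch_prepare(&zs, b); /* early: predict */
  		/* ... on-demand dbuf_hold()-style reads go here ... */
  		if (hit)
  			zfetch_run(&zs);           /* late: prefetch */
  	}
  	return (0);
  }
  ```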

* Fix zfs_get_data access to files with wrong generation | Chunwei Chen | 2021-03-19 | 2 | -3/+4

  If a TX_WRITE record is created for a file, and the file is later
  deleted and a new directory is created on the same object id, it is
  possible that when zil_commit() happens, zfs_get_data() will be
  called on the new directory. This may result in a panic as it tries
  to take a range lock.

  This patch fixes the issue by recording the generation number during
  zfs_log_write(), so zfs_get_data() can check whether the object is
  still valid.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #10593
  Closes #11682
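
  The essence of the fix as a standalone model; the field and type
  names below are invented for the example:

  ```c
  #include <stdint.h>
  #include <stdio.h>

  /* Stash the object's generation in the log record, and have the
   * get-data path reject a stale object whose id was reused. */
  typedef struct { uint64_t obj_gen; } znode_model_t;
  typedef struct { uint64_t lr_gen; } log_record_model_t;

  static int
  get_data_check(const znode_model_t *zp, const log_record_model_t *lr)
  {
  	if (zp->obj_gen != lr->lr_gen) {
  		/* Object id was reused by a new object: bail out
  		 * instead of range-locking the wrong thing. */
  		return (-1);
  	}
  	return (0);
  }

  int
  main(void)
  {
  	znode_model_t reused = { .obj_gen = 7 };
  	log_record_model_t lr = { .lr_gen = 3 };

  	printf("check: %d\n", get_data_check(&reused, &lr));
  	return (0);
  }
  ```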

* Clean up RAIDZ/DRAID ereport code | Matthew Ahrens | 2021-03-19 | 3 | -12/+6

  The RAIDZ and DRAID code is responsible for reporting checksum
  errors on their child vdevs. Checksum errors represent events where
  a disk returned data or parity that should have been correct, but
  was not. In other words, these are instances of silent data
  corruption. The checksum errors show up in the vdev stats (and thus
  `zpool status`'s CKSUM column), and in the event log
  (`zpool events`).

  Note, this is in contrast with the more common "noisy" errors where
  a disk goes offline, in which case ZFS knows that the disk is bad
  and doesn't try to read it, or the device returns an error on the
  requested read or write operation.

  RAIDZ/DRAID generate checksum errors via three code paths:

  1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors
     are reported on any children whose data was not used during the
     reconstruction. This is handled in `raidz_reconstruct()`. This is
     the most common type of RAIDZ/DRAID checksum error.

  2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
     means that the data has been lost. The zio fails and an error is
     returned to the consumer (e.g. the read(2) system call). This
     would happen if, for example, three different disks in a RAIDZ2
     group are silently damaged. Since the damage is silent, it isn't
     possible to know which three disks are damaged, so a checksum
     error is reported against every child that returned data or
     parity for this read. (For DRAID, typically only one "group" of
     children is involved in each io.) This case is handled in
     `vdev_raidz_cksum_finish()`. This is the next most common type of
     RAIDZ/DRAID checksum error.

  3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like
     in case 2), but there happens to be additional copies of this
     block due to "ditto blocks" (i.e. multiple DVA's in this
     blkptr_t), and one of those copies is good, then RAIDZ/DRAID
     compares each sector of the data or parity that it retrieved with
     the good data from the other DVA, and if they differ then it
     reports a checksum error on this child. This differs from case 2
     in that the checksum error is reported on only the subset of
     children that actually have bad data or parity. This case happens
     very rarely, since normally only metadata has ditto blocks. If
     the silent damage is extensive, there will be many instances of
     case 2, and the pool will likely be unrecoverable.

  The code for handling case 3 is considerably more complicated than
  the other cases, for two reasons:

  1. It needs to run after the main raidz read logic has completed.
     The data RAIDZ read needs to be preserved until after the
     alternate DVA has been read, which necessitates refcounts and
     callbacks managed by the non-raidz-specific zio layer.

  2. It's nontrivial to map the sections of data read by RAIDZ to the
     correct data. For example, the correct data does not include the
     parity information, so the parity must be recalculated based on
     the correct data, and then compared to the parity that was read
     from the RAIDZ children.

  Due to the complexity of case 3, the rareness of hitting it, and the
  minimal benefit it provides above case 2, this commit removes the
  code for case 3. These types of errors will now be handled the same
  as case 2, i.e. the checksum error will be reported against all
  children that returned data or parity.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11735

* Remove unused rr_code | Matthew Ahrens | 2021-03-17 | 1 | -1/+0

  The `rr_code` field in `raidz_row_t` is unused. This commit removes
  the field, as well as the code that's used to set it.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11736

* FreeBSD: Clean up zfsdev_close to match Linux | Ryan Moeller | 2021-03-12 | 1 | -1/+0

  Resolve some oddities in zfsdev_close() which could result in a
  panic and were not present in the equivalent function for Linux.

  - Remove unused definition ZFS_MIN_MINOR
  - FreeBSD: Simplify zfsdev state destruction
  - Assert zs_minor is valid in zfsdev_close
  - Make locking around zfsdev state match Linux

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11720

* Suppress cppcheck invalidSyntax warnings | Brian Behlendorf | 2021-03-05 | 1 | -0/+2

  For some reason cppcheck 1.90 generates an invalidSyntax warning
  when the BF64_SET macro is used in the zstream source. The same
  warning is not reported by cppcheck 2.3, nor is there any evident
  problem with the expanded macro. This appears to be an issue with
  this version of cppcheck. This commit annotates the source to
  suppress the warning.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11700

* Add upper bound for slop space calculation | Prakash Surya | 2021-02-24 | 1 | -4/+5

  This change modifies how we determine how much slop space to use in
  the pool, such that it now has an upper limit. The default upper
  limit is 128G, but it is configurable via a tunable.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Closes #11023
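
  The bounded calculation, modeled standalone. The 1/32 shift and the
  floor mirror the long-standing spa_slop_shift behavior; exact
  tunable names and clamping details aside, treat this as a sketch:

  ```c
  #include <stdint.h>
  #include <stdio.h>

  #define	MIN(a, b)	((a) < (b) ? (a) : (b))
  #define	MAX(a, b)	((a) > (b) ? (a) : (b))

  /* Slop is 1/2^spa_slop_shift of the pool, clamped between a floor
   * and the new 128G ceiling this commit introduces. */
  static uint64_t
  slop_space(uint64_t dspace)
  {
  	uint64_t spa_slop_shift = 5;        /* 1/32 of the pool */
  	uint64_t min_slop = 128ULL << 20;   /* 128M floor */
  	uint64_t max_slop = 128ULL << 30;   /* new 128G ceiling */
  	uint64_t slop = dspace >> spa_slop_shift;

  	return (MAX(min_slop, MIN(slop, max_slop)));
  }

  int
  main(void)
  {
  	printf("1T pool -> %llu GiB slop\n",
  	    (unsigned long long)(slop_space(1ULL << 40) >> 30));
  	printf("1P pool -> %llu GiB slop (capped)\n",
  	    (unsigned long long)(slop_space(1ULL << 50) >> 30));
  	return (0);
  }
  ```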

* Cleaning up uio headers | Brian Atkinson | 2021-02-20 | 2 | -1/+21

  Make uio_impl.h the common header interface between Linux and
  FreeBSD so both OSes can share a common header file. This also helps
  reduce code duplication for zfs_uio_t on each OS.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11622

* libzpool: set_global_var: refactor to not modify 'arg' | Christian Schwarz | 2021-02-19 | 1 | -1/+1

  Also fixes a leak of the dlopen handle in the error case.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Pavel Zakharov <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #11602

* Restore FreeBSD resource usage accounting | Ryan Moeller | 2021-02-19 | 2 | -0/+38

  Add zfs_racct_* interfaces for platform-dependent read/write
  accounting.

  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11613

* Checksum errors may not be counted | Don Brady | 2021-02-19 | 1 | -1/+2

  Fix a regression seen in issue #11545 where checksum errors were not
  being counted or showing up in a zpool event.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Don Brady <[email protected]>
  Closes #11609
* Add "compatibility" property for zpool feature setsColm2021-02-172-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Property to allow sets of features to be specified; for compatibility with specific versions / releases / external systems. Influences the behavior of 'zpool upgrade' and 'zpool create'. Initial man page changes and test cases included. Brief synopsis: zpool create -o compatibility=off|legacy|file[,file...] pool vdev... compatibility = off : disable compatibility mode (enable all features) compatibility = legacy : request that no features be enabled compatibility = file[,file...] : read features from specified files. Only features present in *all* files will be enabled on the resulting pool. Filenames may be absolute, or relative to /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc checked first). Only affects zpool create, zpool upgrade and zpool status. ABI changes in libzfs: * New function "zpool_load_compat" to load and parse compat sets. * Add "zpool_compat_status_t" typedef for compatibility parse status. * Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum * Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum An initial set of base compatibility sets are included in cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is modified to install these in $pkgdatadir/compatibility.d and to create symbolic links to a reasonable set of aliases. Reviewed-by: ericloewe Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes #11468

* Make inline ABD predicates compatible with C++ | Ryan Moeller | 2021-02-15 | 1 | -3/+3

  FreeBSD's zfsd fails to build after e2af2acce3 due to strict type
  checking errors from the implicit conversion between bool and
  boolean_t in the inline predicate definitions in abd.h.

  Use conditionals to return the correct value type from these
  functions.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Igor Kozhukhov <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11592
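
  The pattern in question, reduced to a standalone example; the types
  are redefined locally here, whereas in ZFS they come from the
  standard headers:

  ```c
  #include <stdio.h>

  typedef enum { B_FALSE = 0, B_TRUE = 1 } boolean_t;

  typedef struct abd {
  	unsigned int abd_flags;
  } abd_t;

  #define	ABD_FLAG_LINEAR	0x1

  static inline boolean_t
  abd_is_linear(abd_t *abd)
  {
  	/* Before: return (abd->abd_flags & ABD_FLAG_LINEAR);
  	 * That expression is an int/bool, which C++ refuses to
  	 * implicitly convert to the enum. A conditional returns the
  	 * enum value directly and compiles in both C and C++. */
  	return ((abd->abd_flags & ABD_FLAG_LINEAR) != 0 ?
  	    B_TRUE : B_FALSE);
  }

  int
  main(void)
  {
  	abd_t a = { .abd_flags = ABD_FLAG_LINEAR };

  	printf("linear: %d\n", abd_is_linear(&a));
  	return (0);
  }
  ```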

* Rename zfs_inode_update to zfs_znode_update_vfs | khng300 | 2021-02-09 | 1 | -0/+2

  zfs_znode_update_vfs is a more platform-agnostic name than
  zfs_inode_update. Besides that, the function's prototype is moved to
  include/sys/zfs_znode.h as the function is also used in common code.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ka Ho Ng <[email protected]>
  Sponsored by: The FreeBSD Foundation
  Closes #11580

* The abd child/parent relationship does not need to be tracked | Matthew Ahrens | 2021-01-30 | 1 | -0/+2

  ABDs currently track their parent/child relationship. This applies
  to `abd_get_offset()` and `abd_borrow_buf()`. However, nothing
  depends on knowing this relationship; it's only used for consistency
  checks to verify that we are not destroying an ABD that's still in
  use. When we are creating/destroying ABDs frequently, the
  performance impact of maintaining these data structures (in
  particular the atomic increment/decrement operations) can be
  measurable.

  This commit removes this verification code on production builds, but
  keeps it when ZFS_DEBUG is set.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11535

* Parallelize vdev_validate | Alan Somers | 2021-01-26 | 1 | -0/+2

  The runtime of vdev_validate is dominated by the disk accesses in
  vdev_label_read_config. Speed it up by validating all vdevs in
  parallel using a taskq.

  Sponsored by: Axcient
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Closes #11470

* Parallelize vdev_load | Alan Somers | 2021-01-26 | 1 | -0/+1

  metaslab_init is the slowest part of importing a mature pool, and it
  must be repeated hundreds of times for each top-level vdev. But its
  speed is dominated by a few serialized disk accesses. That can lead
  to import times of > 1 hour for pools with many top-level vdevs on
  spinny disks.

  Speed up the import by using a taskqueue to parallelize vdev_load
  across all top-level vdevs.

  This also requires adding mutex protection to
  metaslab_class_t.mc_histogram. The mc_histogram fields were
  unprotected when that code was first written in "Illumos 4976-4984 -
  metaslab improvements" (OpenZFS
  f3a7f6610f2df0217ba3b99099019417a954b673). The lock wasn't added
  until 3dfb57a35e8cbaa7c424611235d669f3c575ada1, though it's unclear
  exactly which fields it's supposed to protect. In any case, it
  wasn't until vdev_load was parallelized that any code attempted
  concurrent access to those fields.

  Sponsored by: Axcient
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Closes #11470
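
  A userspace analogue of the dispatch-and-wait pattern used by both
  of these commits, with pthreads standing in for the kernel taskq;
  illustration only:

  ```c
  #include <pthread.h>
  #include <stdio.h>

  #define	NVDEVS	4

  /* One load task per top-level vdev; each performs the serialized
   * disk accesses that used to run back-to-back on one thread. */
  static void *
  vdev_load_task(void *arg)
  {
  	int vd = *(int *)arg;

  	printf("loading vdev %d\n", vd);
  	return (NULL);
  }

  int
  main(void)
  {
  	pthread_t t[NVDEVS];
  	int ids[NVDEVS];

  	for (int i = 0; i < NVDEVS; i++) {
  		ids[i] = i;
  		pthread_create(&t[i], NULL, vdev_load_task, &ids[i]);
  	}
  	for (int i = 0; i < NVDEVS; i++)
  		pthread_join(t[i], NULL); /* the taskq-wait step */
  	return (0);
  }
  ```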

* Set aside a metaslab for ZIL blocks | Matthew Ahrens | 2021-01-21 | 6 | -3/+9

  Mixing ZIL and normal allocations has several problems:

  1. The ZIL allocations are allocated, written to disk, and then a
     few seconds later freed. This leaves behind holes (free segments)
     where the ZIL blocks used to be, which increases fragmentation,
     which negatively impacts performance.

  2. When under moderate load, ZIL allocations are of 128KB. If the
     pool is fairly fragmented, there may not be many free chunks of
     that size. This causes ZFS to load more metaslabs to locate free
     segments of 128KB or more. The loading happens synchronously
     (from zil_commit()), and can take around a second even if the
     metaslab's spacemap is cached in the ARC. All concurrent
     synchronous operations on this filesystem must wait while the
     metaslab is loading. This can cause a significant performance
     impact.

  3. If the pool is very fragmented, there may be zero free chunks of
     128KB or more. In this case, the ZIL falls back to
     txg_wait_synced(), which has an enormous performance impact.

  These problems can be eliminated by using a dedicated log device
  ("slog"), even one with the same performance characteristics as the
  normal devices.

  This change sets aside one metaslab from each top-level vdev that is
  preferentially used for ZIL allocations (vdev_log_mg,
  spa_embedded_log_class). From an allocation perspective, this is
  similar to having a dedicated log device, and it eliminates the
  above-mentioned performance problems.

  Log (ZIL) blocks can be allocated from the following locations. Each
  one is tried in order until the allocation succeeds:

  1. dedicated log vdevs, aka "slog" (spa_log_class)
  2. embedded slog metaslabs (spa_embedded_log_class)
  3. other metaslabs in normal vdevs (spa_normal_class)

  The space required for the embedded slog metaslabs is usually
  between 0.5% and 1.0% of the pool, and comes out of the existing
  3.2% of "slop" space that is not available for user data.

  On an all-ssd system with 4TB storage, 87% fragmentation, 60%
  capacity, and recordsize=8k, testing shows a ~50% performance
  increase on random 8k sync writes. On even more fragmented systems
  (which hit problem #3 above and call txg_wait_synced()), the
  performance improvement can be arbitrarily large (>100x).

  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11389

* Extending FreeBSD UIO Struct | Brian Atkinson | 2021-01-20 | 6 | -17/+17

  In FreeBSD the struct uio was just a typedef to uio_t. In order to
  extend this struct, outside of the definition for the struct uio,
  the struct uio has been embedded inside of a uio_t struct.

  Also renamed all the uio_* interfaces to be zfs_uio_* to make it
  clear this is a ZFS interface.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #11438

* allow callers to allocate and provide the abd_t struct | Matthew Ahrens | 2021-01-20 | 3 | -51/+71

  The `abd_get_offset_*()` routines create an abd_t that references
  another abd_t, and doesn't allocate any pages/buffers of its own. In
  some workloads, these routines may be called frequently, to create
  many abd_t's representing small pieces of a single large abd_t. In
  particular, the upcoming RAIDZ Expansion project makes heavy use of
  these routines.

  This commit adds the ability for the caller to allocate and provide
  the abd_t struct to a variant of `abd_get_offset_*()`. This
  eliminates the cost of allocating the abd_t and performing the
  accounting associated with it (`abdstat_struct_size`). The
  RAIDZ/DRAID code uses this for the `rc_abd`, which references the
  zio's abd. The upcoming RAIDZ Expansion project will leverage this
  infrastructure to increase performance of reads post-expansion by
  around 50%.

  Additionally, some of the interfaces around creating and destroying
  abd_t's are cleaned up. Most significantly, the distinction between
  `abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
  now disposed of with `abd_free()`.

  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Issue #8853
  Closes #11439

* record ioctl elapsed time in zpool history | Matthew Ahrens | 2021-01-11 | 1 | -0/+1

  Each zfs ioctl that changes on-disk state (e.g. set property, create
  snapshot, destroy filesystem) is recorded in the zpool history, and
  is printed by `zpool history -i`. For performance diagnostic
  purposes, it would be useful to know how long each of these ioctls
  took to run. This commit adds that functionality, with a new
  `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist.

  Additionally, the time recorded in this history log is currently the
  time that the history record is written to disk. But in many cases
  (CLI args logging and ioctl logging), this happens asynchronously,
  potentially many seconds after the operation completed. This commit
  changes the timestamp to reflect when the history event was created,
  rather than when it was written to disk.

  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11440
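
  The measurement itself is simple; a portable sketch of the
  elapsed-nanoseconds bookkeeping that feeds `ZPOOL_HIST_ELAPSED_NS`,
  with clock_gettime() standing in for the kernel's gethrtime():

  ```c
  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  /* Monotonic nanoseconds, immune to wall-clock adjustments. */
  static uint64_t
  now_ns(void)
  {
  	struct timespec ts;

  	clock_gettime(CLOCK_MONOTONIC, &ts);
  	return ((uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec);
  }

  int
  main(void)
  {
  	uint64_t start = now_ns();

  	/* ... the ioctl would run here ... */

  	/* This delta is the kind of value stored in the history
  	 * nvlist under ZPOOL_HIST_ELAPSED_NS. */
  	printf("elapsed: %llu ns\n",
  	    (unsigned long long)(now_ns() - start));
  	return (0);
  }
  ```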

* implicit conversion from 'boolean_t' to 'ds_hold_flags_t' | Toomas Soome | 2020-12-27 | 1 | -0/+1

  A build error on illumos with gcc 10 revealed:

      In function 'dmu_objset_refresh_ownership':
      ../../common/fs/zfs/dmu_objset.c:857:25: error: implicit
      conversion from 'boolean_t' to 'ds_hold_flags_t'
      {aka 'enum ds_hold_flags'} [-Werror=enum-conversion]
        857 |  dsl_dataset_disown(ds, decrypt, tag);
            |                         ^~~~~~~
      cc1: all warnings being treated as errors

      libzfs_input_check.c: In function 'zfs_ioc_input_tests':
      libzfs_input_check.c:754:28: error: implicit conversion from
      'enum dmu_objset_type' to 'enum lzc_dataset_type'
      [-Werror=enum-conversion]
        754 |  err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
            |                            ^~~~~~~~~~~
      cc1: all warnings being treated as errors

  The same issue is present in openzfs, and also the same issue about
  ds_hold_flags_t, which currently defines exactly one valid value.

  Reviewed-by: Igor Kozhukhov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Toomas Soome <[email protected]>
  Closes #11406
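
  A minimal standalone reproduction of this warning class; all names
  below are local to the example:

  ```c
  typedef enum { EX_FALSE = 0, EX_TRUE = 1 } ex_boolean_t;
  typedef enum { EX_HOLD_NONE = 0, EX_HOLD_DECRYPT = 1 } ex_hold_flags_t;

  static void
  dataset_disown(ex_hold_flags_t flags)
  {
  	(void) flags;
  }

  int
  main(void)
  {
  	ex_boolean_t decrypt = EX_TRUE;

  	/* dataset_disown(decrypt); would trip
  	 * -Werror=enum-conversion under gcc 10: one enum type passed
  	 * where a different enum type is expected. Mapping the value
  	 * explicitly silences it. */
  	dataset_disown(decrypt ? EX_HOLD_DECRYPT : EX_HOLD_NONE);
  	return (0);
  }
  ```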

* Only examine best metaslabs on each vdev | Matthew Ahrens | 2020-12-16 | 1 | -0/+1

  On a system with very high fragmentation, we may need to do lots of
  gang allocations (e.g. most indirect block allocations (~50KB) may
  need to gang). Before failing a "normal" allocation and resorting to
  ganging, we try every metaslab. This has the impact of loading every
  metaslab (not a huge deal since we now typically keep all metaslabs
  loaded), and also iterating over every metaslab for every failing
  allocation. If there are many metaslabs (more than the typical ~200,
  e.g. due to vdev expansion or very large vdevs), the CPU cost of
  this iteration can be very impactful. This iteration is done with
  the mg_lock held, creating long hold times and high lock contention
  for concurrent allocations, ultimately causing long txg sync times
  and poor application performance.

  To address this, this commit changes the behavior of "normal" (not
  try_hard, not ZIL) allocations. These will now only examine the 100
  best metaslabs (as determined by their ms_weight). If none of these
  have a large enough free segment, then the allocation will fail and
  we'll fall back on ganging.

  To accomplish this, we will now (normally) gang before doing a
  `try_hard` allocation. Non-try_hard allocations will only examine
  the 100 best metaslabs of each vdev. In summary, we will first try
  normal allocation. If that fails then we will do a gang allocation.
  If that fails then we will do a "try hard" gang allocation. If that
  fails then we will have a multi-layer gang block.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11327

* Make metaslab class rotor and aliquot per-allocator. | Alexander Motin | 2020-12-15 | 1 | -21/+31

  Metaslab rotor and aliquot are used to distribute workload between
  vdevs while keeping some locality for logically adjacent blocks.
  Once multiple allocators were introduced to separate allocation of
  different objects, it does not make much sense for different
  allocators to write into different metaslabs of the same metaslab
  group (vdev) at the same time, competing for its resources. This
  change makes each allocator choose a metaslab group independently,
  colliding with others only sporadically.

  A test including simultaneous writes into 4 files with a recordsize
  of 4KB on a striped pool of 30 disks on a system with 40 logical
  cores shows a reduction of vdev queue lock contention from 54% to
  27% due to better load distribution. Unfortunately it won't help
  ZVOLs much yet, since only one dataset/ZVOL is synced at a time, and
  so for the most part only one allocator is used; but it may improve
  later.

  While there, to reduce the number of pointer dereferences, change
  per-allocator storage for metaslab classes and groups from several
  separate malloc()'s to variable length arrays at the ends of the
  original class and group structures.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Closes #11288
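
  The layout change from the last paragraph, sketched standalone with
  illustrative field names: a flexible array member at the end of the
  class struct replaces separately allocated per-allocator arrays, so
  reaching an allocator's state costs one fewer pointer dereference:

  ```c
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
  	long mca_aliquot;   /* per-allocator rotor position */
  } class_allocator_model_t;

  typedef struct {
  	int mc_allocators;
  	class_allocator_model_t mc_allocator[]; /* flexible array */
  } class_model_t;

  int
  main(void)
  {
  	int n = 4;
  	class_model_t *mc = calloc(1, sizeof (*mc) +
  	    n * sizeof (class_allocator_model_t));

  	mc->mc_allocators = n;
  	/* One allocation, contiguous with the parent struct. */
  	printf("allocator 2 aliquot: %ld\n",
  	    mc->mc_allocator[2].mca_aliquot);
  	free(mc);
  	return (0);
  }
  ```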

* spa: avoid type narrowing warning | Ryan Libby | 2020-12-15 | 1 | -1/+1

  Building the spa module for i386 caused gcc to emit
  -Wint-to-pointer-cast "cast to pointer from integer of different
  size" because spa.spa_did was uint64_t but pthread_join (via
  thread_join in spa_deactivate) takes a pointer (32-bit on i386).
  Define spa_did to be pointer-size instead. For now spa_did is in
  fact never non-zero and the thread_join could instead be ifdef'd
  out, but changing the size of spa_did may be more useful for the
  future.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Libby <[email protected]>
  Closes #11336

* dmu_zfetch: fix memory leak | Matthew Macy | 2020-12-12 | 1 | -0/+1

  The last change caused the read completion callback to not be called
  if the IO was still in progress. This change restores allocation of
  the arc buf callback, but in the callback path checks the new
  acb_nobuf field to know to skip buffer allocation.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11324

* Improve zfs receive performance with lightweight write | Matthew Ahrens | 2020-12-11 | 2 | -2/+28

  The performance of `zfs receive` can be bottlenecked on the CPU
  consumed by the `receive_writer` thread, especially when receiving
  streams with small compressed block sizes. Much of the CPU is spent
  creating and destroying dbuf's and arc buf's, one for each `WRITE`
  record in the send stream.

  This commit introduces the concept of "lightweight writes", which
  allows `zfs receive` to write to the DMU by providing an ABD, and
  instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and
  arc buf for this "dirty leaf block" are not instantiated.

  Because there is no dbuf with the dirty data, this mechanism doesn't
  support reading from "lightweight-dirty" blocks (they would see the
  on-disk state rather than the dirty data). Since the dedup-receive
  code has been removed, `zfs receive` is write-only, so this works
  fine.

  Because there are no arc bufs for the received data, the received
  data is no longer cached in the ARC.

  Testing a receive of a stream with an average compressed block size
  of 4KB, this commit improves performance by 50%, while also reducing
  CPU usage by 50% of a CPU. On a per-block basis, CPU consumed by
  receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was.

      Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
      New:      670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%

  The code is also restructured in a few ways:

  Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
  some existing code that no longer needs `DB_DNODE_ENTER()` and
  related routines. The new field is needed by the lightweight-type
  dirty record.

  To ensure that the `dr_dnode` field remains valid until the dirty
  record is freed, we have to ensure that the `dnode_move()` doesn't
  relocate the dnode_t. To do this we keep a hold on the dnode until
  its zios have completed. This is already done by the user-accounting
  code (`userquota_updates_task()`); this commit extends that so that
  it always keeps the dnode hold until zio completion (see
  `dnode_rele_task()`).

  `dn_dirty_txg` was previously zeroed when the dnode was synced. This
  was not necessary, since its meaning can be "when was this dnode
  last dirtied". This change simplifies the new `dnode_rele_task()`
  code.

  Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11105

* Implement memory and CPU hotplug | Paul Dagnelie | 2020-12-10 | 2 | -0/+3

  ZFS currently doesn't react to hotplugging cpu or memory into the
  system in any way. This patch changes that by adding logic to the
  ARC that allows the system to take advantage of new memory that is
  added for caching purposes. It also adds logic to the taskq
  infrastructure to support dynamically expanding the number of
  threads allocated to a taskq.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Matthew Ahrens <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #11212

* Decouple arc_read_done callback from arc buf instantiation | Matthew Macy | 2020-12-09 | 1 | -0/+5

  Add ARC_FLAG_NO_BUF to indicate that a buffer need not be
  instantiated. This fixes a ~20% performance regression on cached
  reads due to zfetch changes.

  Reviewed-by: Tony Nguyen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11220
  Closes #11232

* Reduce latency effects of non-interactive I/O | Alexander Motin | 2020-11-24 | 1 | -0/+3

  Investigating the influence of scrub (especially sequential) on
  random read latency, I noticed that on some HDDs a single 4KB read
  may take up to 4 seconds! Deeper investigation showed that many HDDs
  heavily prioritize sequential reads even when those are submitted
  with a queue depth of 1.

  This patch addresses the latency from two sides:
  - by using _min_active queue depths for non-interactive requests
    while the interactive request(s) are active, and for a few
    requests after;
  - by throttling it further if no interactive requests have completed
    while a configured number of non-interactive ones did.

  While there, I've also modified vdev_queue_class_to_issue() to give
  more chances to schedule at least _min_active requests to the lowest
  priorities. It should reduce starvation if several non-interactive
  processes are running at the same time as some interactive ones, and
  I think it should make it possible to set zfs_vdev_max_active as low
  as 1.

  I've benchmarked this change with 4KB random reads from a ZVOL with
  a 16KB block size on a newly written, non-fragmented pool. On a
  fragmented pool I also saw improvements, but not so dramatic. Below
  are log2 histograms of the random read latency in milliseconds for
  different devices:

  4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0
      before: 0, 0, 2, 1, 12, 21, 19, 18, 10, 15, 17, 21
      after:  0, 0, 0, 24, 101, 195, 419, 250, 47, 4, 0, 0
  , that means a maximum latency reduction from 2s to 500ms.

  4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0
      before: 0, 0, 2, 31, 38, 28, 18, 12, 17, 20, 24, 10, 3
      after:  0, 0, 55, 247, 455, 470, 412, 181, 36, 0, 0, 0, 0
  , i.e. from 4s to 250ms.

  1 SAS HDD SEAGATE ST14000NM0048
      before: 0, 0, 29, 70, 107, 45, 27, 1, 0, 0, 1, 4, 19
      after:  1, 29, 681, 1261, 676, 1633, 67, 1, 0, 0, 0, 0, 0
  , i.e. from 4s to 125ms.

  1 SAS SSD SEAGATE XS3840TE70014
      before (microseconds): 0, 0, 0, 0, 0, 0, 0, 0, 70, 18343, 82548, 618
      after:  0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844, 90

  I've also measured scrub time during the test and on idle pools. On
  an idle fragmented pool I measured scrub getting a few percent
  faster due to use of QD3 instead of QD2 before. On an idle
  non-fragmented pool I measured no difference. On a busy
  non-fragmented pool I measured a scrub time increase of about
  1.5-1.7x, while the IOPS increase reached 5-9x.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #11166

* Fix problems in zvol_set_volmode_impl | Matthew Macy | 2020-11-17 | 1 | -0/+2

  - Don't leave fstrans set when passed a snapshot
  - Don't remove minor if volmode already matches new value
  - (FreeBSD) Wait for GEOM ops to complete before trying remove (at
    create time GEOM will be "tasting" in parallel)
  - (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL
  - (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11199

* Fix 'zfs userspace' for received datasets in encrypted root | loli10K | 2020-11-16 | 1 | -1/+1

  For encrypted receives, where user accounting is initially disabled
  on creation, both 'zfs userspace' and 'zfs groupspace' fail with
  EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets
  to set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after
  a successful dmu_objset_space_upgrade().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #9501
  Closes #9596

* Assertion failure when logging large output of channel program | Matthew Ahrens | 2020-11-14 | 1 | -0/+1

  The output of ZFS channel programs is logged on-disk in the zpool
  history, and printed by `zpool history -i`. Channel programs can use
  10MB of memory by default, and up to 100MB by using the
  `zfs program -m` flag. Therefore their output can be up to some
  fraction of 100MB.

  In addition to being somewhat wasteful of the limited space reserved
  for the pool history (which for large pools is 1GB), in extreme
  cases this can result in a failure of
  `ASSERT(length <= DMU_MAX_ACCESS);` in
  `dmu_buf_hold_array_by_dnode()`.

  This commit limits the output size that will be logged to 1MB.
  Larger outputs will not be logged; instead an entry will be logged
  indicating the size of the omitted output.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11194
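
  The clamping logic, modeled standalone; the constant's name here is
  hypothetical, while the 1MB limit is from the commit:

  ```c
  #include <stddef.h>
  #include <stdio.h>

  #define	ZCP_HIST_OUTPUT_MAX	(1024 * 1024)	/* 1MB cap */

  /* Log the output if it fits; otherwise log a note about its size. */
  static void
  log_program_output(size_t length)
  {
  	if (length <= ZCP_HIST_OUTPUT_MAX)
  		printf("logging %zu bytes of output\n", length);
  	else
  		printf("output too large to log: %zu bytes\n", length);
  }

  int
  main(void)
  {
  	log_program_output(4096);
  	log_program_output(10 * 1024 * 1024);
  	return (0);
  }
  ```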

* Distributed Spare (dRAID) Feature | Brian Behlendorf | 2020-11-13 | 12 | -46/+237

  This patch adds a new top-level vdev type called dRAID, which stands
  for Distributed parity RAID. This pool configuration allows all
  dRAID vdevs to participate when rebuilding to a distributed hot
  spare device. This can substantially reduce the total time required
  to restore full parity to a pool with a failed device.

  A dRAID pool can be created using the new top-level `draid` type.
  Like `raidz`, the desired redundancy is specified after the type:
  `draid[1,2,3]`. No additional information is required to create the
  pool and reasonable default values will be chosen based on the
  number of child vdevs in the dRAID vdev.

      zpool create <pool> draid[1,2,3] <vdevs...>

  Unlike raidz, additional optional dRAID configuration values can be
  provided as part of the draid type as colon separated values. This
  allows administrators to fully specify a layout for either
  performance or capacity reasons. The supported options include:

      zpool create <pool> \
          draid[<parity>][:<data>d][:<children>c][:<spares>s] \
          <vdevs...>

      - draid[parity]       - Parity level (default 1)
      - draid[:<data>d]     - Data devices per group (default 8)
      - draid[:<children>c] - Expected number of child vdevs
      - draid[:<spares>s]   - Distributed hot spares (default 0)

  Abbreviated example `zpool status` output for a 68 disk dRAID pool
  with two distributed spares using special allocation classes:

  ```
    pool: tank
   state: ONLINE
  config:

      NAME                  STATE     READ WRITE CKSUM
      slag7                 ONLINE       0     0     0
        draid2:8d:68c:2s-0  ONLINE       0     0     0
          L0                ONLINE       0     0     0
          L1                ONLINE       0     0     0
          ...
          U25               ONLINE       0     0     0
          U26               ONLINE       0     0     0
          spare-53          ONLINE       0     0     0
            U27             ONLINE       0     0     0
            draid2-0-0      ONLINE       0     0     0
          U28               ONLINE       0     0     0
          U29               ONLINE       0     0     0
          ...
          U42               ONLINE       0     0     0
          U43               ONLINE       0     0     0
      special
        mirror-1            ONLINE       0     0     0
          L5                ONLINE       0     0     0
          U5                ONLINE       0     0     0
        mirror-2            ONLINE       0     0     0
          L6                ONLINE       0     0     0
          U6                ONLINE       0     0     0
      spares
        draid2-0-0          INUSE     currently in use
        draid2-0-1          AVAIL
  ```

  When adding test coverage for the new dRAID vdev type, the following
  options were added to the ztest command. These options are leveraged
  by zloop.sh to test a wide range of dRAID configurations.

      -K draid|raidz|random - kind of RAID to test
      -D <value>            - dRAID data drives per group
      -S <value>            - dRAID distributed hot spares
      -R <value>            - RAID parity (raidz or dRAID)

  The zpool_create, zpool_import, redundancy, replacement and fault
  test groups have all been updated to provide test coverage for the
  dRAID feature.

  Co-authored-by: Isaac Huang <[email protected]>
  Co-authored-by: Mark Maybee <[email protected]>
  Co-authored-by: Don Brady <[email protected]>
  Co-authored-by: Matthew Ahrens <[email protected]>
  Co-authored-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10102

* G/C struct znode -> z_moved | Mateusz Guzik | 2020-11-10 | 1 | -1/+0

  The field is yet another leftover from unsupported zfs_znode_move.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Mateusz Guzik <[email protected]>
  Closes #11186

* Remove redundant oid parameter to update_pages | Ryan Moeller | 2020-11-10 | 1 | -1/+1

  The oid comes from the znode we are already passing.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Macy <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11176

* Linux 5.10 compat: frame.h renamed objtool.h | Brian Behlendorf | 2020-11-02 | 1 | -0/+4

  In Linux 5.10 the linux/frame.h header was renamed to
  linux/objtool.h. Add a configure check to detect and use the
  correctly named header.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11085

* zfs_vnops: make zfs_get_data OS-independent | Christian Schwarz | 2020-11-02 | 1 | -0/+13

  Move zfs_get_data() into platform-independent code. The only
  platform-specific aspect of it is the way we release an inode
  (Linux) / vnode_t (FreeBSD). I am not aware of a platform that could
  be supported by ZFS that couldn't implement zfs_rele_async itself.
  Its sibling zvol_get_data() is already platform-independent.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #10979

* Introduce CPU_SEQID_UNSTABLE | Mateusz Guzik | 2020-11-02 | 1 | -0/+1

  Current CPU_SEQID users don't care about a possibly changing CPU ID,
  but enclose it within kpreempt disable/enable in order to fend off
  warnings from Linux's CONFIG_DEBUG_PREEMPT. There is no need to do
  it. The expected way to get the CPU ID while allowing for migration
  is to use raw_smp_processor_id.

  In order to make this future-proof, this patch keeps CPU_SEQID as is
  and introduces CPU_SEQID_UNSTABLE instead, to make it clear that
  consumers explicitly want this behavior.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Matt Macy <[email protected]>
  Signed-off-by: Mateusz Guzik <[email protected]>
  Closes #11142

* Consolidate zfs_holey and zfs_access | Matthew Macy | 2020-10-31 | 1 | -2/+5

  The zfs_holey() and zfs_access() functions can be made common to
  both FreeBSD and Linux.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11125

* zstd: track allocator statistics | Mateusz Guzik | 2020-10-30 | 1 | -2/+4

  Note that this only tracks sizes as requested by the caller. Actual
  allocated space will almost always be bigger (e.g., rounded up to
  the next power of 2 or page size). Additionally, the allocated
  buffer may be holding other areas hostage. Nonetheless, this is a
  starting point for tracking memory usage in zstd.

  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Signed-off-by: Mateusz Guzik <[email protected]>
  Closes #11129
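
  A standalone sketch of requested-size tracking of this kind, using
  C11 atomics; the real zstd kstat plumbing differs:

  ```c
  #include <stdatomic.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Counts what callers asked for, not what the allocator actually
   * reserved, exactly the caveat noted in the commit message. */
  static atomic_size_t zstd_alloc_requested;

  static void *
  tracked_alloc(size_t size)
  {
  	atomic_fetch_add(&zstd_alloc_requested, size);
  	return (malloc(size)); /* real allocation may be larger */
  }

  int
  main(void)
  {
  	void *p = tracked_alloc(1000);

  	printf("requested so far: %zu bytes\n",
  	    (size_t)atomic_load(&zstd_alloc_requested));
  	free(p);
  	return (0);
  }
  ```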

* Remove UIO_ZEROCOPY functions and structures | Matthew Macy | 2020-10-30 | 2 | -19/+0

  The original xuio zero copy functionality has always been unused on
  Linux and FreeBSD. Remove this disabled code to avoid any confusion
  and improve readability.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11124

* Update references to nonexistent man pages in code | Ryan Moeller | 2020-10-30 | 1 | -1/+1

  Refer to the correct section or alternative for FreeBSD and Linux.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11132

* Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD | Matthew Macy | 2020-10-21 | 2 | -0/+40

  The zfs_fsync, zfs_read, and zfs_write functions are almost
  identical between Linux and FreeBSD. With a little refactoring they
  can be moved to the common code, which is what this commit does.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #11078