path: root/include/sys
Commit message | Author | Age | Files | Lines
* Use more atomics in refcounts (Alexander Motin, 2021-08-17; 1 file, -4/+4)

  Use atomic_load_64() for zfs_refcount_count() to prevent torn reads on
  32-bit platforms. On 64-bit ones it should not change anything. When
  built with ZFS_DEBUG but running without tracking enabled, use atomics
  instead of mutexes, the same as for builds without ZFS_DEBUG. Since
  rc_tracked can't change live, we can check it without the lock.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12420
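  A minimal sketch of why the atomic load matters on 32-bit platforms;
  the struct below is a simplification, not the actual zfs_refcount_t
  layout:

      /* Illustrative only; assumes the SPL's <sys/atomic.h>. */
      typedef struct {
              uint64_t rc_count;      /* bumped via atomic_add_64() */
      } rc_sketch_t;

      static uint64_t
      rc_count_sketch(rc_sketch_t *rc)
      {
              /*
               * On 32-bit CPUs a plain "return (rc->rc_count);" can
               * compile into two 32-bit loads; a concurrent
               * atomic_add_64() between them returns a torn value.
               * atomic_load_64() loads the whole word atomically and
               * is just a regular load on 64-bit machines.
               */
              return (atomic_load_64(&rc->rc_count));
      }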
* Restore FreeBSD sysctl processing for arc.min and arc.max (Allan Jude, 2021-08-16; 2 files, -0/+9)

  Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
  to a disallowed value would return an error. Since the switch, it
  instead only generates WARN_IF_TUNING_IGNORED.

  Keep the ability to set the sysctls specifically to 0, even though
  that is less than the minimum, because some tests depend on this.

  Also lost was the ability to set vfs.zfs.arc_max to a value less than
  the default vfs.zfs.arc_min at boot time. Restore this as well.

  Reviewed-by: Tony Nguyen <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Allan Jude <[email protected]>
  Closes #12161
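  A hedged sketch of the restored behavior: a FreeBSD sysctl handler
  that rejects bad values with EINVAL instead of warning. MIN_ARC_MAX
  is a placeholder and the refresh call is an assumed hook, not the
  exact code:

      static int
      sysctl_arc_max_sketch(SYSCTL_HANDLER_ARGS)
      {
              uint64_t val = zfs_arc_max;
              int err;

              err = sysctl_handle_64(oidp, &val, 0, req);
              if (err != 0 || req->newptr == NULL)
                      return (err);

              /* 0 is explicitly allowed: it means "use the default". */
              if (val != 0 && val < MIN_ARC_MAX)
                      return (EINVAL);        /* reject, don't warn */

              zfs_arc_max = val;
              arc_tuning_update(B_TRUE);      /* assumed refresh hook */
              return (0);
      }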
* Run arc_evict thread at higher priority (Tony Nguyen, 2021-08-10; 1 file, -2/+3)

  Run the arc_evict thread at a higher priority, nice=0, to give it more
  CPU time, which can improve performance for workloads with high ARC
  eviction activity. On mixed read/write and sequential read workloads,
  I've seen between 10-40% better performance.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Tony Nguyen <[email protected]>
  Closes #12397
* Avoid small buffer copying on write (Alexander Motin, 2021-07-27; 2 files, -1/+1)

  It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to
  decide whether to reallocate/copy the buffer, because the answer is
  OS-specific and depends on the buffer size. Instead, use
  abd_size_alloc_linear(), moved into a public header.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Closes #12425
* Remove NOTE(CONSTCOND) and note.h (наб, 2021-07-26; 11 files, -80/+19)

  These were mostly used to annotate do {} while (0)s.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Issue #12201
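  For context, a hedged before/after of the pattern (the macro is made
  up for illustration):

      /* Before: a lint-era annotation telling old tools that the
       * constant conditional is intentional. */
      #define DO_THING(x)                     \
              do {                            \
                      use(x);                 \
              /* CONSTCOND */ } while (0)

      /* After: modern compilers don't warn about while (0), so the
       * note was pure noise. */
      #define DO_THING(x)                     \
              do {                            \
                      use(x);                 \
              } while (0)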
* Replace /*PRINTFLIKEn*/ with attribute(printf) (наб, 2021-07-26; 2 files, -3/+6)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Issue #12201
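  A hedged before/after of the conversion (the function is illustrative,
  not one of the actual declarations touched):

      /* Before: a lint-only comment the compiler never sees. */
      /*PRINTFLIKE2*/
      extern void zdbgmsg(int level, const char *fmt, ...);

      /* After: GCC/Clang now type-check the format string against
       * the variadic arguments at every call site. */
      extern void zdbgmsg(int level, const char *fmt, ...)
          __attribute__((format(printf, 2, 3)));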
* Optimize allocation throttling (Alexander Motin, 2021-07-21; 2 files, -7/+10)

  Remove mc_lock use from metaslab_class_throttle_*(). The math there is
  based on refcounts and so is atomic, so the only race possible there
  is between zfs_refcount_count() and zfs_refcount_add(). But in most
  cases metaslab_class_throttle_reserve() is called with the allocator
  lock held, which covers the race. In cases where the lock is not held,
  GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not
  use zfs_refcount_count(). And even if we assume some other
  non-existing scenario, the worst that may happen from this race is
  that a few more I/Os get to allocation earlier, which is not a
  problem.

  Move locks and data of different allocators into different cache lines
  to avoid false sharing. Group the spa_alloc_* arrays together into a
  single array of aligned struct spa_alloc: spa_allocs. Align struct
  metaslab_class_allocator.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12314
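  A hedged sketch of the per-allocator cache-line separation; the real
  definition lives in spa_impl.h and may use a platform alignment macro
  rather than the literal 64 below:

      /* One lock+tree pair per allocator, each on its own cache
       * line so allocator A's updates don't evict allocator B's. */
      typedef struct spa_alloc {
              kmutex_t        spaa_lock;
              avl_tree_t      spaa_tree;
      } __attribute__((aligned(64))) spa_alloc_t;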
* Add Module Parameter Regarding Log Size Limit (Kevin Jin, 2021-07-20; 2 files, -0/+8)

  Add a module parameter regarding the log size limit:

  zfs_wrlog_data_max
  The upper limit of TX_WRITE log data. Once it is reached, write
  operations are blocked until log data is cleared out after txg sync.
  It only counts TX_WRITE logs with WR_COPIED or WR_NEED_COPY.

  Reviewed-by: Prakash Surya <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: jxdking <[email protected]>
  Closes #12284
* Minor ARC optimizations (Alexander Motin, 2021-07-20; 2 files, -3/+10)

  Remove unneeded global, practically constant, state pointer variables
  (arc_anon, arc_mru, etc.), replacing them with macros for the real
  state variables' addresses (&ARC_anon, &ARC_mru, etc.).

  Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, so it no longer
  requires special handling in the inner loop of ARC reclamation.
  Correspondingly, change the bytes argument of arc_evict_state() from
  int64_t to uint64_t.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Closes #12348
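  The pointer-to-macro change, sketched (declarations abridged):

      /* Before: a global pointer the compiler must load at runtime. */
      extern arc_state_t *arc_mru;

      /* After: the address is a compile-time constant, saving a
       * memory access on every use. */
      extern arc_state_t ARC_mru;
      #define arc_mru (&ARC_mru)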
* A few fixes of callback typecasting (for the upcoming ClangCFI) (Alexander Lobakin, 2021-07-20; 1 file, -2/+2)

  * zio: avoid callback typecasting
  * zil: avoid zil_itxg_clean() callback typecasting
  * zpl: decouple zpl_readpage() into two separate callbacks
  * nvpair: explicitly declare callbacks for xdr_array()
  * linux/zfs_nvops: don't use external iput() as a callback
  * zcp_synctask: don't use fnvlist_free() as a callback
  * zvol: don't use ops->zv_free() as a callback for taskq_dispatch()

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Alexander Lobakin <[email protected]>
  Closes #12260
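  Why CFI cares, in a hedged sketch: Clang's control-flow integrity
  rejects indirect calls through a function type that doesn't match the
  callee, so casting a callback hides a real mismatch. Illustrated with
  fnvlist_free(), one of the cases listed above:

      /* Before: the cast silences the compiler but trips ClangCFI,
       * since fnvlist_free() takes nvlist_t *, not void *. */
      (void) taskq_dispatch(tq, (task_func_t *)fnvlist_free, nvl,
          TQ_SLEEP);

      /* After: a thin wrapper with the exact expected signature. */
      static void
      fnvlist_free_task(void *arg)
      {
              fnvlist_free(arg);
      }

      (void) taskq_dispatch(tq, fnvlist_free_task, nvl, TQ_SLEEP);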
* Introduce dsl_dir_diduse_transfer_space() (Alexander Motin, 2021-07-16; 1 file, -0/+3)

  Most of the CPU time in dsl_dir_diduse_space() and
  dsl_dir_transfer_space() is dd_lock overhead and time spent in
  dmu_buf_will_dirty(). Calling them one after another is a waste of
  time and even more contention. Doing that twice for each rewritten
  block within dbuf_write_done(), via dsl_dataset_block_kill() and
  dsl_dataset_block_born(), created one of the biggest CPU overheads in
  the case of small-block rewrites.

  dsl_dir_diduse_transfer_space() combines the functionality of these
  two functions for cases where it is needed, but without the double
  overhead, practically for the cost of dsl_dir_diduse_space() or even
  cheaper.

  While there, optimize the dsl_dir_phys() calls in
  dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang
  detects some aliasing there, repeating the dd->dd_dbuf->db_data
  dereference multiple times, increasing dd_lock scope and contention.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Author: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12300
* Fix ARC ghost states eviction accounting (Alexander Motin, 2021-07-13; 1 file, -1/+0)

  arc_evict_hdr() returns the number of evicted bytes in the scope of a
  specific state. For ghost states it does not mean the amount of really
  freed memory, but the logical buffer size. That is correct for the
  eviction process, but not for waking up threads waiting for ARC size
  reduction, as added in the "Revise ARC shrinker algorithm" commit,
  causing premature wakeups while the ARC is still overflowed, allowing
  even bigger overflow, plus processing overhead when the next
  allocation will also get blocked, probably also for too short a time.

  To fix that, make arc_evict_hdr() also return the amount of really
  freed memory, which for the ghost states is only the header, and use
  it to update arc_evict_count instead. Originally I was thinking of not
  returning it at all, since arc_get_data_impl() does not account for
  the headers, but decided that some slow allocation progress is better
  than long waits, reaching up to 100ms in my tests.

  To reduce the negative latency effects of long time periods when the
  reclaim thread can free little real memory, start the reclamation
  process earlier, before we have actually reached the overflow
  threshold where we have to throttle new allocations. We can also do it
  without taking the global arc_evict_lock, reducing the contention.

  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12279
* file reference counts can get corrupted (George Wilson, 2021-07-10; 4 files, -7/+10)

  Callers of zfs_file_get and zfs_file_put can corrupt the reference
  counts for the file structure, resulting in a panic or a soft lockup.
  When zfs send/recv runs, it adds a reference count to the open file
  and begins to send or recv the stream. If the file descriptor is
  closed, then when dmu_recv_stream() or dmu_send() returns we call
  zfs_file_put to remove the reference we placed on the file structure.
  Unfortunately, because zfs_file_put() uses the file descriptor to look
  up the file structure, it may end up finding that the file descriptor
  table no longer contains the file struct, thus leaking the file
  structure. Or it might end up finding a file descriptor for a
  different file and blindly updating its reference counts. Other
  failure modes probably exist.

  This change reworks the zfs_file_[get|put] interface to not rely on
  the file descriptor but instead pass the zfs_file_t pointer around.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Co-authored-by: Allan Jude <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  External-issue: DLPX-76119
  Closes #12299
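  The shape of the interface change, hedged (signatures paraphrased, not
  copied from zfs_file.h):

      /* Before: put() re-resolves the fd, racing with close(2). */
      zfs_file_t *fp;
      zfs_file_get(fd, &fp);
      /* ... long-running send/recv; fd may be closed meanwhile ... */
      zfs_file_put(fd);       /* fd may now name some other file! */

      /* After: the reference travels with the zfs_file_t itself, so
       * put() releases exactly the structure get() pinned. */
      zfs_file_t *fp;
      zfs_file_get(fd, &fp);
      /* ... */
      zfs_file_put(fp);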
* dprintf_dnode: strcpy -> strlcpy (Jorgen Lundman, 2021-07-07; 2 files, -2/+2)

  Missed a couple of strcpy() in an earlier commit; this is only used
  with --enable-debug.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Tony Nguyen <[email protected]>
  Signed-off-by: Jorgen Lundman <[email protected]>
  Closes #12311
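  The class of fix, in a hedged sketch (buffer and source are
  illustrative):

      char buf[32];

      /* Before: no bounds check; a long source overruns buf. */
      strcpy(buf, src);

      /* After: copies at most sizeof (buf) - 1 bytes and always
       * NUL-terminates. */
      strlcpy(buf, src, sizeof (buf));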
* FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE (Alexander Motin, 2021-07-06; 1 file, -1/+0)

  It makes no sense to set it below PAGE_SIZE, since that increases all
  overheads and makes returning memory to the OS problematic. It makes
  no sense to set it above PAGE_SIZE, since such allocations, and
  especially frees, are too expensive and cause KVA fragmentation to
  benefit from fewer chunks. After that, it makes no sense to keep more
  complicated math here.

  What may make sense, though, is just a tunable border between linear
  and scatter ABDs, previously also controlled by this tunable. Retain
  that functionality by taking the abd_scatter_min_size tunable from
  Linux, just with a different default value.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Closes #12328
* Remove avl_size field from struct avl_tree (Alexander Motin, 2021-07-01; 1 file, -1/+3)

  This field is used only by illumos mdb. On other platforms it only
  increases the struct size from 32 to 40 bytes. For struct vdev_queue,
  which includes 13 instances of avl_tree_t, that size difference
  translates directly into active cache lines. Keep the padding in
  user-space for now to not break the ABI.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12290
* Compact dbuf/buf hashes and lock arrays (Alexander Motin, 2021-07-01; 1 file, -2/+2)

  With the default dbuf cache size of 1/32 of ARC, it makes no sense to
  have a hash table of the same size (or even bigger on Linux). Reduce
  it to 1/8 of the ARC's, still leaving some slack, assuming a higher
  I/O rate via the dbuf cache than via the ARC.

  Remove the padding from the ARC hash locks array. The idea behind
  padding is to avoid false sharing between locks. That would make sense
  if there were a limited number of very busy locks. But since we have
  no limit on the number, by using the same memory for more locks we can
  achieve even lower lock contention with the same false sharing, or we
  can use less memory for the same contention level.

  Reduce the number of hash locks from 8192 to 2048. The number is still
  big enough to not cause contention, but the reduced memory size
  improves the cache hit rate for mutex_tryenter() in the ARC eviction
  thread, saving about 1% of the thread time.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12289
* Fix abd leak, kmem_free correct size of abd_t (Jorgen Lundman, 2021-07-01; 1 file, -1/+1)

  Fix a leak of abd_t that manifested mostly when using raidzN with at
  least as many columns as N (e.g. a four-disk raidz2 but not a
  three-disk raidz2). Sufficiently heavy raidz use would eventually run
  a system out of memory.

  Additionally:
  * Switch abd_cache arena to FIRSTFIT, which empirically improves
    performance.
  * Make abd_chunk_cache more performant and debuggable.
  * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap.
  * Don't try to reap non-existent qcaches in abd_cache arena.
  * KM_PUSHPAGE -> KM_SLEEP when allocating chunks from their own arena.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Jorgen Lundman <[email protected]>
  Co-authored-by: Sean Doran <[email protected]>
  Closes #12295
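  The general class of bug named in the title, hedged (the exact call
  site in the abd code differs):

      /* kmem_free() must be given the same size that was allocated;
       * a mismatch silently loses (or corrupts) allocator state. */
      abd_t *abd = kmem_alloc(sizeof (abd_t), KM_SLEEP);
      /* ... */
      /* Wrong: frees with the pointer's size, not the struct's:
       * kmem_free(abd, sizeof (abd_t *)); */
      /* Right: the size matches the allocation exactly. */
      kmem_free(abd, sizeof (abd_t));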
* Optimize txg_kick() process (#12274) (Kevin Jin, 2021-07-01; 1 file, -1/+1)

  Use dp_dirty_pertxg[] for txg_kick() instead of dp_dirty_total as in
  the original code. An extra parameter, "txg", is added to txg_kick(),
  so it knows which txg to kick. The txg_kick() call is also moved from
  dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
  know the txg number assigned for txg_kick().

  Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
  also cleaned up.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: jxdking <[email protected]>
  Closes #12274
* Remove refcount from spa_config_*() (Alexander Motin, 2021-07-01; 1 file, -2/+2)

  The only reason for spa_config_*() to use a refcount instead of a
  simple non-atomic variable (safe thanks to scl_lock) for scl_count is
  tracking, which has been hard-disabled for the last 8 years. Switching
  to a simple int scl_count reduces the lock hold time by avoiding the
  atomic, and also makes the structure fit into a single cache line,
  reducing lock contention.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12287
* FreeBSD: fix compilation of FreeBSD world after 29274c9f6 (Martin Matuška, 2021-06-25; 1 file, -1/+1)

  prng32_bounded() is available to the kernel only on FreeBSD 13+. Call
  the inline random_get_pseudo_bytes() with the correct pointer type. To
  be consistent, apply the same to Linux as well.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Martin Matuska <[email protected]>
  Closes #12282
* gcc 11 cleanup (Attila Fülöp, 2021-06-23; 2 files, -4/+4)

  Compiling with gcc 11.1.0 produces three new warnings. Change the code
  slightly to avoid them.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Attila Fülöp <[email protected]>
  Closes #12130
  Closes #12188
  Closes #12237
* Annotated dprintf as printf-like (Rich Ercolani, 2021-06-22; 1 file, -1/+1)

  ZFS loves using %llu for uint64_t, but that requires a cast to avoid
  being noisy, which is done in many, though not all, places. Also, a
  couple of places used %u for uint64_t; those were promoted to %llu.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Rich Ercolani <[email protected]>
  Closes #12233
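  The cast convention in question, hedged (the variable is made up): on
  LP64 Linux uint64_t is unsigned long, so passing it to %llu without a
  cast triggers a format warning once dprintf is annotated:

      uint64_t obj = 42;
      dprintf("object %llu\n", (u_longlong_t)obj);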
* Optimize small random numbers generation (Alexander Motin, 2021-06-22; 2 files, -1/+15)

  In all places except two, spa_get_random() is used for small values,
  and the consumers do not require well-seeded, high-quality values.
  Switch those two exceptions directly to random_get_pseudo_bytes() and
  optimize spa_get_random(), renaming it to random_in_range(), since it
  is not related to SPA or to ZFS in general.

  On FreeBSD, directly map random_in_range() to the new prng32_bounded()
  KPI added in FreeBSD 13. On Linux and in user-space, just reduce the
  type used to uint32_t to avoid the more expensive 64-bit division.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12183
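  A hedged sketch of the Linux/user-space fallback path; the real
  implementation may differ in detail, but the point is the cheap 32-bit
  modulo (the slight modulo bias is acceptable for these
  non-cryptographic consumers):

      static inline uint32_t
      random_in_range_sketch(uint32_t range)
      {
              uint32_t r;

              if (range <= 1)
                      return (0);
              (void) random_get_pseudo_bytes((uint8_t *)&r, sizeof (r));
              return (r % range);
      }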
* Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172) (Alexander Motin, 2021-06-16; 2 files, -28/+95)

  wmsum was designed exactly for cases like these, with many updates and
  rare reads. It makes it possible to completely avoid atomic operations
  on congested global variables.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12172
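  A hedged usage sketch of the wmsum API (the counter name is invented):
  the hot path pays only a per-CPU update, while the expensive
  aggregation happens on the rare read, e.g. when a kstat is fetched:

      static wmsum_t demo_hits;

      wmsum_init(&demo_hits, 0);
      wmsum_add(&demo_hits, 1);               /* hot path, very cheap */
      uint64_t v = wmsum_value(&demo_hits);   /* rare, aggregates */
      wmsum_fini(&demo_hits);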
* Re-embed multilist_t storage (Alexander Motin, 2021-06-10; 4 files, -10/+11)

  This commit partially reverts the changes to multilists in PR 7968
  (multi-threaded spa_sync()) and adds some cache line alignments to
  separate read-only multilists and heavily modified refcounts into
  different cache lines.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-by: iXsystems, Inc.
  Closes #12158
* Remove pool io kstats (#12212) (Alexander Motin, 2021-06-10; 2 files, -7/+0)

  This mostly reverts the "3537 want pool io kstats" commit of 8 years
  ago. On one side, this code, using pool-wide locks, became pretty bad
  for performance, creating significant lock contention in the I/O
  pipeline. On the other, there are now more efficient ways to obtain
  detailed statistics, while these statistics are illumos-specific and
  much less usable on Linux and FreeBSD, reported only via
  procfs/sysctls.

  This commit does not remove the KSTAT_TYPE_IO implementation, which
  may be removed later together with the already unused KSTAT_TYPE_INTR
  and KSTAT_TYPE_TIMER.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12212
* lib{efi,avl,share,tpool,zfs_core,zfsbootenv,zutil}: -fvisibility=hidden (наб, 2021-06-09; 4 files, -227/+276)

  No symbols affected in libavl.
  No symbols affected by libtpool, but pre-ANSI declarations got purged.
  No symbols affected by libzfs_core.
  No symbols affected by libzfs_bootenv.

  libefi got cleaned, gained efi_debug documentation in efi_partition.h,
  and removes one undocumented and unused symbol from libzfs_core:
    D default_vtoc_map

  libnvpair saw removal of these symbols:
    D nv_alloc_nosleep_def
    D nv_alloc_sleep
    D nv_alloc_sleep_def
    D nv_fixed_ops_def
    D nvlist_hashtable_init_size
    D nvpair_max_recursion

  libshare saw removal of these symbols from libzfs:
    T libshare_nfs_init
    T libshare_smb_init
    T register_fstype
    B smb_shares

  libzutil saw removal of these internal symbols from libzfs_core:
    T label_paths
    T slice_cache_compare
    T zpool_find_import_blkid
    T zpool_open_func
    T zutil_alloc
    T zutil_strdup

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12191
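  The pattern behind the flag, hedged (the macro name is invented; the
  actual per-library export macros differ): the libraries are now built
  with -fvisibility=hidden, so every symbol is private unless explicitly
  re-exported:

      #define LIB_EXPORT __attribute__((visibility("default")))

      LIB_EXPORT int lib_public_entry(void);  /* stays in the ABI */
      int lib_internal_helper(void);          /* now hidden */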
* libefi: remove efi_auto_sense() (наб, 2021-06-09; 1 file, -1/+0)

  It's present (but undocumented) in the illumos gate and used
  exclusively by rmformat(1) (which I recommend as a nice blast from the
  past). Also, the math assumes 512B sectors and is therefore wrong.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12191
* libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat (Alan Somers, 2021-06-08; 1 file, -0/+1)

  getfsstat(2) is used to retrieve the list of mounted file systems,
  which libzfs uses when fetching properties like mountpoint, atime,
  setuid, etc. The mode parameter may be MNT_NOWAIT, which uses
  information in the VFS's cache, or MNT_WAIT, which effectively does a
  statfs on every single mounted file system in order to fetch the most
  up-to-date information.

  As far as I can tell, the only fields that libzfs cares about are the
  filesystem's name, mountpoint, fstypename, and mount flags. Those
  things are always updated on mount and unmount, so they will always be
  accurate in the VFS's mount cache except in two circumstances:
  1) When a file system is busy unmounting.
  2) When a ZFS file system changes the value of a mount-overridable
     property like atime or setuid, but doesn't remount the file system.
     Right now that only happens when the property is changed by an
     unprivileged user who has delegated authority to change the
     property but not to mount the dataset. But perhaps libzfs could
     choose to do it for other reasons in the future.

  Switching to MNT_NOWAIT will greatly improve speed with no downside,
  as long as we explicitly update the mount cache whenever we change a
  mount-overridable property.

  For comparison, illumos gets this information using the native
  getmntany and getmntent functions, which also use cached information.
  The illumos function that would refresh the cache, resetmnttab, is
  never called by libzfs. And on GNU/Linux, getmntany and getmntent
  don't even communicate with the kernel directly. They simply parse the
  file they are given, which is usually /etc/mtab or /proc/mounts.
  Perhaps the implementation of /proc/mounts is synchronous, ala
  MNT_WAIT; I don't know.

  Sponsored-by: Axcient
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Alan Somers <[email protected]>
  Closes: #12091
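  A hedged sketch of the cheap enumeration (error handling elided):

      #include <sys/param.h>
      #include <sys/mount.h>
      #include <stdlib.h>

      /* First call sizes the buffer; second fills it from the VFS
       * cache without statfs()ing every mount (MNT_NOWAIT). */
      int n = getfsstat(NULL, 0, MNT_NOWAIT);
      struct statfs *sfs = calloc(n, sizeof (*sfs));
      n = getfsstat(sfs, n * sizeof (*sfs), MNT_NOWAIT);
      for (int i = 0; i < n; i++) {
              /* f_fstypename, f_mntonname, f_mntfromname, f_flags
               * are the fields libzfs cares about. */
      }
      free(sfs);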
* More aggsum optimizations (Alexander Motin, 2021-06-07; 1 file, -3/+4)

  - Avoid atomic_add() when updating as_lower_bound/as_upper_bound. The
    previous code was excessively strong on 64-bit systems while not
    strong enough on 32-bit ones. Instead, introduce and use real
    atomic_load() and atomic_store() operations, which are just
    assignments on 64-bit machines but use proper atomics on 32-bit ones
    to avoid torn reads/writes.

  - Reduce the number of buckets on large systems. Extra buckets do not
    improve add speed as much as they hurt reads. Unlike with wmsum, for
    aggsum reads are still important.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12145
* libzfs: convert to -fvisibility=hidden (наб, 2021-06-03; 3 files, -35/+39)

  Also mark all printf-like functions in libzfs_impl.h as printf-like,
  and add --no-show-locs to storeabi, in hopes diffs will make more
  sense in the future.

  This removes these symbols from libzfs:
    D nfs_only
    T SHA256Init
    T SHA2Final
    T SHA2Init
    T SHA2Update
    T SHA384Init
    T SHA512Init
    D share_all_proto
    D smb_only
    T zfs_is_shared_proto
    W zpool_mount_datasets
    W zpool_unmount_datasets

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12048
* libspl: staticify buf and pagesize, rename aok to libspl_assert_ok (наб, 2021-06-03; 1 file, -2/+0)

  Exporting names this short can easily cause nasty collisions with user
  code.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12050
* include: move SPA_MINBLOCKSHIFT and zio_encrypt to sys/fs/zfs.h (наб, 2021-05-29; 3 files, -38/+41)

  These are used by userspace, so should live in a public header.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #12116
* Introduce write-mostly sums (Alexander Motin, 2021-05-27; 1 file, -10/+10)

  wmsum counters are a reduced version of aggsum counters, optimized for
  write-mostly scenarios. They do not provide optimized read functions,
  but instead allow a much cheaper add function. The primary usage is
  infrequently read statistic counters, not requiring exact precision.

  The Linux implementation is directly mapped onto the percpu_counter
  KPI. The FreeBSD implementation is directly mapped onto the counter(9)
  KPI. In user-space, due to the lack of a better implementation, it is
  mapped to aggsum.

  Unfortunately, neither Linux percpu_counter nor FreeBSD counter(9)
  provides sufficient functionality to completely replace aggsum, so it
  still remains in use for several hot counters.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #12114
* Fix dRAID sequential resilver silent damage handling (Brian Behlendorf, 2021-05-20; 1 file, -1/+2)

  This change addresses two distinct scenarios which are possible when
  performing a sequential resilver to a dRAID pool with vdevs that
  contain silent unknown damage. In this circumstance the damage took
  the form of the devices being intentionally overwritten with zeros.
  However, it could also result from a device returning incorrect data
  while a sequential resilver was in progress.

  Scenario 1) A sequential resilver is performed while all of the dRAID
  vdevs are ONLINE and there is silent damage present on the vdev being
  resilvered. In this case, nothing will be repaired by
  vdev_raidz_io_done_reconstruct_known_missing() because rc->rc_error
  isn't set on any of the raid columns. To address this,
  vdev_draid_io_start_read() has been updated to always mark the
  resilvering column as ESTALE for sequential resilver I/O.

  Scenario 2) Multiple columns contain silent damage for the same block
  and a sequential resilver is performed. In this case it's impossible
  to generate the correct data from parity unless all of the damaged
  columns are being sequentially resilvered (and thus only good data is
  used to generate parity). This is as expected, and there's nothing
  which can be done about it. However, we need to be careful not to make
  the situation worse. Since we can't verify the data is actually good
  without a checksum, we must only repair the devices which are being
  sequentially resilvered. Otherwise, an incorrect repair to a device
  which previously contained good data could effectively lock in the
  damage and make reconstruction impossible. A check for this was added
  to vdev_raidz_io_done_verified() along with a new test case.

  Lastly, this change updates the redundancy_draid_spare1 and
  redundancy_draid_spare3 test cases to be more representative of normal
  dRAID replacement operation. Specifically, what we care about is that
  the scrub run after a sequential resilver does not find additional
  blocks which need repair. That would indicate the sequential resilver
  failed to rebuild a section of one of the devices. Note also that the
  tests were switched to using the verify_pool() function, which still
  checks for checksum errors.

  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #12061
* module/zfs: remove zfs_zevent_console and zfs_zevent_cols (наб, 2021-05-10; 1 file, -1/+0)

  zfs_zevent_console committed multiple printk()s per line without
  properly continuing them ‒ a single event could easily be fragmented
  across over thirty lines, making it useless for direct application.

  zfs_zevent_cols exists purely to wrap the output from
  zfs_zevent_console.

  The niche this was supposed to fill can be better served by something
  akin to the all-syslog ZEDLET.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #7082
  Closes #11996
* lib/: set O_CLOEXEC on all fds (наб, 2021-04-11; 1 file, -2/+2)

  As found by
    git grep -E '(open|setmntent|pipe2?)\(' |
        grep -vE '((zfs|zpool)_|fd|dl|lzc_re|pidfile_|g_)open\('

  FreeBSD's pidfile_open() says nothing about the flags of the files it
  opens, but we can't do anything about it anyway; the implementation
  does open all files with O_CLOEXEC.

  Consider this output with zpool.d/media appended with
  "pid=$$; (ls -l /proc/$pid/fd > /dev/tty)":

    $ /sbin/zpool iostat -vc media
    lrwx------ 0 -> /dev/pts/0
    l-wx------ 1 -> 'pipe:[3278500]'
    l-wx------ 2 -> /dev/null
    lrwx------ 3 -> /dev/zfs
    lr-x------ 4 -> /proc/31895/mounts
    lrwx------ 5 -> /dev/zfs
    lr-x------ 10 -> /usr/lib/zfs-linux/zpool.d/media

  vs

    $ ./zpool iostat -vc vendor,upath,iostat,media
    lrwx------ 0 -> /dev/pts/0
    l-wx------ 1 -> 'pipe:[3279887]'
    l-wx------ 2 -> /dev/null
    lr-x------ 10 -> /usr/lib/zfs-linux/zpool.d/media

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ahelenia Ziemiańska <[email protected]>
  Closes #11866
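  The one-line pattern applied across lib/, hedged (the path is
  illustrative):

      /* Without O_CLOEXEC the descriptor leaks into every child the
       * caller spawns, as in the zpool.d listing above. */
      int fd = open("/dev/zfs", O_RDWR | O_CLOEXEC);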
* Move zfsdev_state_{init,destroy} to common code (Ryan Moeller, 2021-04-08; 2 files, -1/+4)

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11833
* Use dsl_scan_setup_check() to setup a scrub (Brian Behlendorf, 2021-04-08; 1 file, -0/+1)

  When a rebuild completes it will automatically schedule a follow-up
  scrub to verify all of the block checksums. Before setting up the
  scrub, execute the counterpart dsl_scan_setup_check() function to
  confirm the scrub can be started. Prior to this change we'd only check
  vdev_rebuild_active(), which isn't as comprehensive, and using the
  check function keeps all of this logic in one place.

  Reviewed-by: Mark Maybee <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11849
* Ratelimit deadman zevents as with delay zevents (Ryan Moeller, 2021-04-07; 1 file, -1/+2)

  Just as delay zevents can flood the zevent pipe when a vdev becomes
  unresponsive, so do the deadman zevents. Ratelimit deadman zevents
  according to the same tunable as for delay zevents.

  Enable deadman tests on FreeBSD and add a test for deadman event
  ratelimiting.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Don Brady <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11786
* Fix various typos (Andrea Gelmini, 2021-04-02; 3 files, -3/+3)

  Correct an assortment of typos throughout the code base.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11774
* Use a helper function to clarify gang block size (Matthew Ahrens, 2021-03-26; 2 files, -0/+15)

  For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for
  the gang DVA including its children BP's. The space allocated at each
  DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.

  This commit makes this relationship more clear by using a helper
  function, `vdev_gang_header_asize()`, for the space allocated at the
  gang block's vdev/offset.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11744
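  Given the relationship stated above, the helper is presumably little
  more than a named wrapper; a hedged sketch:

      static inline uint64_t
      vdev_gang_header_asize(vdev_t *vd)
      {
              /* Space one gang header occupies at a vdev/offset. */
              return (vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE));
      }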
* Removed duplicated includes (Andrea Gelmini, 2021-03-22; 2 files, -2/+0)

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #11775
* Split dmu_zfetch() speculation and execution parts (Alexander Motin, 2021-03-19; 1 file, -7/+16)

  To make better predictions on parallel workloads, dmu_zfetch() should
  be called as early as possible to reduce possible request reordering.
  In particular, it should be called before
  dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep
  waiting for indirect blocks, waking up multiple threads at the same
  time on completion, which can significantly reorder the requests,
  making the stream look random. But we should not issue prefetch
  requests before the on-demand ones, since they may get to the disks
  first despite the I/O scheduler, increasing on-demand request latency.

  This patch splits dmu_zfetch() into two functions:
  dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed
  as early as needed. It only updates statistics and makes predictions
  without issuing any I/Os. The I/O issuance is handled by
  dmu_zfetch_run(), which can be called later, when all on-demand I/Os
  are already issued. It even tracks the activity of other concurrent
  threads, issuing the prefetch only when _all_ on-demand requests are
  issued.

  For many years it was a big problem for storage servers, handling
  deeper request queues from their clients, having to either serialize
  consequential reads to make the ZFS prefetcher usable, or execute the
  incoming requests as-is and get almost no prefetch from ZFS, relying
  only on deep enough prefetch by the clients. The benefits of those
  ways varied, but neither was perfect. With this patch, deeper-queue
  sequential read benchmarks with CrystalDiskMark from Windows via iSCSI
  to a FreeBSD target show me much better throughput with almost 100%
  prefetcher hit rate, compared to almost zero before.

  While there, I also removed the per-stream zs_lock as useless,
  completely covered by the parent zf_lock. Also I reused the zs_blocks
  refcount to track zf_stream linkage of the stream, since I believe the
  previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy.

  Delete prefetch streams when they reach the ends of files. It saves up
  to 1KB of RAM per file, plus reduces searches through the stream list.

  Block data prefetch (speculation and indirect block prefetch is still
  done since they are cheaper) if all dbufs of the stream are already in
  DMU cache. The first cache miss immediately fires all the prefetch
  that would have been done for the stream by that time. It saves some
  CPU time if the same files within DMU cache capacity are read over and
  over.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Adam Moss <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #11652
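  The resulting call pattern, hedged (argument lists paraphrased, not
  copied from dmu_zfetch.h):

      /* Predict early, before dbuf_hold() can sleep and let other
       * threads reorder the request stream... */
      zstream_t *zs = dmu_zfetch_prepare(&dn->dn_zfetch, blkid,
          nblks, B_TRUE, B_FALSE);

      /* ... hold dbufs and issue the on-demand reads here ... */

      /* ... then issue prefetch I/O only after the on-demand I/O. */
      if (zs != NULL)
              dmu_zfetch_run(zs, missed, B_FALSE);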
* Fix zfs_get_data access to files with wrong generation (Chunwei Chen, 2021-03-19; 2 files, -3/+4)

  If a TX_WRITE is created on a file, and the file is later deleted and
  a new directory is created on the same object id, it is possible that
  when zil_commit happens, zfs_get_data will be called on the new
  directory. This may result in a panic as it tries to do a range lock.

  This patch fixes the issue by recording the generation number during
  zfs_log_write, so zfs_get_data can check if the object is valid.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #10593
  Closes #11682
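  The shape of the check, hedged (names approximate the znode fields,
  not quoted from the patch):

      /* The object id matched, but if the generation doesn't, the
       * original file is gone and this object is a recycled id. */
      if (zp->z_gen != gen) {
              zfs_zrele_async(zp);
              return (SET_ERROR(ENOENT));
      }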
* Clean up RAIDZ/DRAID ereport code (Matthew Ahrens, 2021-03-19; 3 files, -12/+6)

  The RAIDZ and DRAID code is responsible for reporting checksum errors
  on their child vdevs. Checksum errors represent events where a disk
  returned data or parity that should have been correct, but was not. In
  other words, these are instances of silent data corruption. The
  checksum errors show up in the vdev stats (and thus `zpool status`'s
  CKSUM column), and in the event log (`zpool events`).

  Note, this is in contrast with the more common "noisy" errors where a
  disk goes offline, in which case ZFS knows that the disk is bad and
  doesn't try to read it, or the device returns an error on the
  requested read or write operation.

  RAIDZ/DRAID generate checksum errors via three code paths:

  1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
     reported on any children whose data was not used during the
     reconstruction. This is handled in `raidz_reconstruct()`. This is
     the most common type of RAIDZ/DRAID checksum error.

  2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
     means that the data has been lost. The zio fails and an error is
     returned to the consumer (e.g. the read(2) system call). This would
     happen if, for example, three different disks in a RAIDZ2 group are
     silently damaged. Since the damage is silent, it isn't possible to
     know which three disks are damaged, so a checksum error is reported
     against every child that returned data or parity for this read.
     (For DRAID, typically only one "group" of children is involved in
     each io.) This case is handled in `vdev_raidz_cksum_finish()`. This
     is the next most common type of RAIDZ/DRAID checksum error.

  3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
     case 2), but there happen to be additional copies of this block due
     to "ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one
     of those copies is good, then RAIDZ/DRAID compares each sector of
     the data or parity that it retrieved with the good data from the
     other DVA, and if they differ then it reports a checksum error on
     this child. This differs from case 2 in that the checksum error is
     reported on only the subset of children that actually have bad data
     or parity. This case happens very rarely, since normally only
     metadata has ditto blocks. If the silent damage is extensive, there
     will be many instances of case 2, and the pool will likely be
     unrecoverable.

  The code for handling case 3 is considerably more complicated than the
  other cases, for two reasons:

  1. It needs to run after the main raidz read logic has completed. The
     data RAIDZ read needs to be preserved until after the alternate DVA
     has been read, which necessitates refcounts and callbacks managed
     by the non-raidz-specific zio layer.

  2. It's nontrivial to map the sections of data read by RAIDZ to the
     correct data. For example, the correct data does not include the
     parity information, so the parity must be recalculated based on the
     correct data, and then compared to the parity that was read from
     the RAIDZ children.

  Due to the complexity of case 3, the rareness of hitting it, and the
  minimal benefit it provides above case 2, this commit removes the code
  for case 3. These types of errors will now be handled the same as
  case 2, i.e. the checksum error will be reported against all children
  that returned data or parity.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11735
* Remove unused rr_code (Matthew Ahrens, 2021-03-17; 1 file, -1/+0)

  The `rr_code` field in `raidz_row_t` is unused. This commit removes
  the field, as well as the code that's used to set it.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #11736
* FreeBSD: Clean up zfsdev_close to match Linux (Ryan Moeller, 2021-03-12; 1 file, -1/+0)

  Resolve some oddities in zfsdev_close() which could result in a panic
  and were not present in the equivalent function for Linux.

  - Remove unused definition ZFS_MIN_MINOR
  - FreeBSD: Simplify zfsdev state destruction
  - Assert zs_minor is valid in zfsdev_close
  - Make locking around zfsdev state match Linux

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #11720
* Suppress cppcheck invalidSyntax warnings (Brian Behlendorf, 2021-03-05; 1 file, -0/+2)

  For some reason cppcheck 1.90 generates an invalidSyntax warning when
  the BF64_SET macro is used in the zstream source. The same warning is
  not reported by cppcheck 2.3, nor is there any evident problem with
  the expanded macro. This appears to be an issue with this version of
  cppcheck. This commit annotates the source to suppress the warning.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #11700
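  cppcheck's inline suppression syntax, hedged (the expression below is
  illustrative; the real annotation sits next to the BF64_SET calls in
  zstream):

      /* cppcheck-suppress invalidSyntax */
      BF64_SET(flags, 0, 8, value);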