openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message	Author	Age	Files	Lines
*	Ratelimit deadman zevents as with delay zevents	Ryan Moeller	2021-04-07	2	-3/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Just as delay zevents can flood the zevent pipe when a vdev becomes unresponsive, so do the deadman zevents. Ratelimit deadman zevents according to the same tunable as for delay zevents. Enable deadman tests on FreeBSD and add a test for deadman event ratelimiting. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11786
*	kmem_alloc(KM_SLEEP) should use kvmalloc()	Matthew Ahrens	2021-04-06	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	`kmem_alloc(size>PAGESIZE, KM_SLEEP)` is backed by `kmalloc()`, which finds contiguous physical memory. If there isn't enough contiguous physical memory available (e.g. due to physical page fragmentation), the OOM killer will be invoked to make more memory available. This is not ideal because processes may be killed when there is still plenty of free memory (it just happens to be in individual pages, not contiguous runs of pages). We have observed this when allocating the ~13KB `zfs_cmd_t`, for example in `zfsdev_ioctl()`. This commit changes the behavior of `kmem_alloc(size>PAGESIZE, KM_SLEEP)` when there are insufficient contiguous free pages. In this case we will find individual pages and stitch them together using virtual memory. This is accomplished by using `kvmalloc()`, which implements the described behavior by trying `kmalloc(__GFP_NORETRY)` and falling back on `vmalloc()`. The behavior of `kmem_alloc(KM_NOSLEEP)` is not changed; it continues to use `kmalloc(GFP_ATOMIC \| __GFP_NORETRY)`. This is because `vmalloc()` may sleep. Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11461
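The fallback order described in this message can be sketched in userspace C. This is only an illustration under stated assumptions: every `sketch_*` name is hypothetical, `malloc()` stands in for both kernel allocators, and a flag simulates physical page fragmentation. It is not the SPL implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toggle: false simulates "no contiguous pages available". */
static bool sketch_contig_available = true;

typedef enum {
	SKETCH_FROM_NONE,
	SKETCH_FROM_CONTIG,	/* models kmalloc(__GFP_NORETRY) succeeding */
	SKETCH_FROM_VIRTUAL	/* models the vmalloc() fallback */
} sketch_source_t;

/* Stand-in for a contiguous allocation that fails fast instead of retrying. */
static void *
sketch_try_contiguous(size_t size)
{
	return (sketch_contig_available ? malloc(size) : NULL);
}

/* Stand-in for an allocation stitched together from individual pages. */
static void *
sketch_virtual(size_t size)
{
	return (malloc(size));
}

/* Models the sleeping path: try contiguous first, then fall back. */
static void *
sketch_alloc_sleep(size_t size, sketch_source_t *src)
{
	void *p = sketch_try_contiguous(size);

	if (p != NULL) {
		*src = SKETCH_FROM_CONTIG;
		return (p);
	}
	p = sketch_virtual(size);
	*src = (p != NULL) ? SKETCH_FROM_VIRTUAL : SKETCH_FROM_NONE;
	return (p);
}
```

The non-sleeping path is deliberately left out of the sketch, since the message states it keeps using the contiguous allocator only.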
*	Fix various typos	Andrea Gelmini	2021-04-02	22	-29/+29
\| \| \| \| \| \| \| \| \| \|	Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #11774
*	Avoid taking global lock to destroy zfsdev state	Ryan Moeller	2021-04-02	2	-21/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have exclusive access to our zfsdev state object in this section until it is invalidated by setting zs_minor to -1, so we can destroy the state without taking a lock if we do the invalidation last, after a memory barrier to ensure correct ordering. While here, strengthen the assertions that zs_minor is valid when we enter. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11751
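A minimal sketch of the "invalidate last" ordering, assuming C11 atomics in place of the kernel's barrier primitives. The structure and field names here are hypothetical and only mirror the idea: tear down the members first, then publish the invalid minor with release semantics so no other thread can observe the slot as free while teardown is still in progress.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-open state object. */
typedef struct {
	_Atomic int minor;	/* -1 means "slot is free for reuse" */
	void *onexit;		/* illustrative member torn down first */
	void *zevent;		/* illustrative member torn down first */
} sketch_state_t;

static void
sketch_state_destroy(sketch_state_t *zs)
{
	/* We still own the object here, so no lock is needed. */
	zs->onexit = NULL;
	zs->zevent = NULL;

	/*
	 * The release store orders the teardown above before the
	 * invalidation becomes visible to other threads.
	 */
	atomic_store_explicit(&zs->minor, -1, memory_order_release);
}
```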
*	FreeBSD: Fix stable/12 after AT_BENEATH removal	Ryan Moeller	2021-04-02	1	-3/+1
\| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11827
*	Allow pool names that look like Solaris disk names	Ryan Moeller	2021-04-01	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \|	Nothing bad happens if a prefix of your pool name matches a disk name. This is a bit of a silly restriction at this point. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11781 Closes #11813
*	Don't scale zfs_zevent_len_max by CPU count	Ryan Moeller	2021-04-01	1	-4/+1
\| \| \| \| \| \| \| \| \| \|	The lower bound for this scaling is too low and the upper bound is too high. Use a fixed default length of 512 instead, which is a reasonable value on any system. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11822
*	Atomically check and set dropped zevent count	Ryan Moeller	2021-04-01	1	-2/+1
\| \| \| \| \| \| \| \| \|	ratelimit_dropped isn't protected by a lock and is expected to be updated atomically. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11822
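One way to read and reset a lock-free counter in a single step is an atomic exchange. The sketch below uses C11 `<stdatomic.h>` and hypothetical `sketch_*` names; whether the actual change uses an exchange or a different atomic primitive is not stated in the message, so treat this as an assumption.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter of events dropped while rate limited. */
static _Atomic uint64_t sketch_dropped;

/* Called whenever an event is suppressed. */
static void
sketch_note_drop(void)
{
	atomic_fetch_add(&sketch_dropped, 1);
}

/*
 * Fetch the count and zero it in one atomic step, so a drop recorded
 * between a separate "read" and "clear" cannot be lost.
 */
static uint64_t
sketch_take_dropped(void)
{
	return (atomic_exchange(&sketch_dropped, 0));
}
```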
*	Use a helper function to clarify gang block size	Matthew Ahrens	2021-03-26	2	-11/+15
\| \| \| \| \| \| \| \| \| \| \| \| \|	For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the gang DVA including its children BP's. The space allocated at each DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`. This commit makes this relationship more clear by using a helper function, `vdev_gang_header_asize()`, for the space allocated at the gang block's vdev/offset. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11744
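The relationship between the gang header size and its allocated size can be illustrated with a simplified helper. Assumptions: the names are hypothetical, the header size is taken to be 512 bytes, and the physical-to-allocated conversion is modeled as a plain round-up to the sector size given by `ashift`. The real `vdev_psize_to_asize()` depends on the vdev type, so this is a sketch only.

```c
#include <stdint.h>

/* Assumed gang header size in bytes; stands in for SPA_GANGBLOCKSIZE. */
#define	SKETCH_GANGBLOCKSIZE	512ULL

/* Simplified model: round psize up to a multiple of (1 << ashift). */
static uint64_t
sketch_psize_to_asize(uint64_t psize, unsigned ashift)
{
	uint64_t sector = 1ULL << ashift;

	return ((psize + sector - 1) / sector * sector);
}

/* Space allocated at the gang block's own vdev/offset. */
static uint64_t
sketch_gang_header_asize(unsigned ashift)
{
	return (sketch_psize_to_asize(SKETCH_GANGBLOCKSIZE, ashift));
}
```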
*	Fix error code on __zpl_ioctl_setflags()	Luis Henriques	2021-03-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Other (all?) Linux filesystems seem to return -EPERM instead of -EACCES when trying to set FS_APPEND_FL or FS_IMMUTABLE_FL without the CAP_LINUX_IMMUTABLE capability. This was detected by the generic/545 test in the fstests suite. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Luis Henriques <[email protected]> Closes #11791
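The check described here can be sketched as a small pure function. The flag constants and function name are hypothetical stand-ins for illustration, and the capability test is reduced to a boolean argument; this is not the `__zpl_ioctl_setflags()` code itself.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits standing in for FS_IMMUTABLE_FL / FS_APPEND_FL. */
#define	SKETCH_IMMUTABLE_FL	0x10u
#define	SKETCH_APPEND_FL	0x20u

/*
 * Returns 0 if the flag change is allowed, or -EPERM (not -EACCES) if
 * it toggles a protected flag without the required capability.
 */
static int
sketch_setflags_check(unsigned old_flags, unsigned new_flags, bool capable)
{
	unsigned changed = (old_flags ^ new_flags) &
	    (SKETCH_IMMUTABLE_FL | SKETCH_APPEND_FL);

	if (changed != 0 && !capable)
		return (-EPERM);
	return (0);
}
```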
*	Removed duplicated includes	Andrea Gelmini	2021-03-22	4	-4/+0
\| \| \| \| \| \|	Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #11775
*	Split dmu_zfetch() speculation and execution parts	Alexander Motin	2021-03-19	3	-112/+178
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To make better predictions on parallel workloads, dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads at the same time on completion, which can significantly reorder the requests, making the stream look random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years this was a big problem for storage servers handling deeper request queues from their clients: they had to either serialize consecutive reads to make the ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch, deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to a FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, compared to almost zero before. While there, I also removed the per-stream zs_lock as useless, completely covered by the parent zf_lock. 
Also I reused the zs_blocks refcount to track zf_stream linkage of the stream, since I believe the previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. The first cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if the same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #11652
*	Fix zfs_get_data access to files with wrong generation	Chunwei Chen	2021-03-19	4	-3/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a TX_WRITE is created on a file, and the file is later deleted and a new directory is created on the same object id, it is possible that when zil_commit happens, zfs_get_data will be called on the new directory. This may result in a panic as it tries to take a range lock. This patch fixes this issue by recording the generation number during zfs_log_write, so zfs_get_data can check if the object is valid. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #10593 Closes #11682
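The generation check can be pictured with a reduced model: the log record remembers both the object id and the generation it was written against, and the data callback refuses to touch an object whose generation no longer matches. All type and function names below are hypothetical, and returning `ENOENT` on mismatch is an assumption made for the sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical on-disk object identity: id plus generation number. */
typedef struct {
	uint64_t obj;
	uint64_t gen;
} sketch_object_t;

/* Hypothetical write log record, stamped at log time. */
typedef struct {
	uint64_t obj;
	uint64_t gen;
} sketch_write_rec_t;

/* Stamp the record with the generation seen when the write was logged. */
static sketch_write_rec_t
sketch_log_write(const sketch_object_t *zp)
{
	sketch_write_rec_t lr = { zp->obj, zp->gen };

	return (lr);
}

/* Refuse records whose object id was reused by a newer object. */
static int
sketch_get_data(const sketch_object_t *zp, const sketch_write_rec_t *lr)
{
	if (zp->obj != lr->obj || zp->gen != lr->gen)
		return (ENOENT);
	return (0);
}
```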
*	Fix regression in POSIX mode behavior	Andrew	2021-03-19	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 235a85657 introduced a regression in evaluation of POSIX modes that require group DENY entries in the internal ZFS ACL. An example of such a POSIX mode is 007. When write_implies_delete_child is set, ACE_WRITE_DATA is added to `wanted_dirperms` prior to calling zfs_zaccess_common(). This occurs in zfs_zaccess_delete(). Unfortunately, when zfs_zaccess_aces_check hits this particular DENY ACE, zfs_groupmember() is checked to determine whether access should be denied, and since zfs_groupmember() always returns B_TRUE on Linux, this check fails, resulting ultimately in EPERM being returned. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Andrew Walker <[email protected]> Closes #11760
*	Allow setting bootfs property on pools with indirect vdevs	Martin Matuška	2021-03-19	1	-3/+1
\| \| \| \| \| \| \| \| \|	The FreeBSD boot loader relies on the bootfs property and is capable of booting from removed (indirect) vdevs. Reviewed-by: Eric van Gyzen Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Martin Matuska <[email protected]> Closes #11763
*	Removing old code for k(un)map_atomic	Brian Atkinson	2021-03-19	2	-8/+6
\| \| \| \| \| \| \| \| \| \| \| \|	It used to be required to pass an enum km_type to kmap_atomic() and kunmap_atomic(); however, this is no longer necessary and the wrappers zfs_k(un)map_atomic removed these. This is confusing in the ABD code as the struct abd_iter member iter_km no longer exists and the wrapper macros simply compile them out. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Adam Moss <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes #11768
*	Initialize metaslab range trees in metaslab_init	Serapheim Dimitropoulos	2021-03-19	1	-94/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	= Motivation We've noticed several zloop crashes within Delphix generated due to the following sequence of events: - A device gets expanded and new metaslabs are allocated for it. These metaslabs go through `metaslab_init()` but haven't gone through `metaslab_sync_done()` yet. This means that the only range tree that's actually set is the `ms_allocatable`. All the others are NULL. - A vdev_initialization is issued and `vdev_initialize_thread` starts processing one of these new metaslabs of the expanded vdev. - As part of `vdev_initialize_calculate_progress()` we call into `metaslab_load()` and `metaslab_load_impl()` which in turn tries to dereference the metaslab's trees that are still NULL and therefore we crash. The same failure can come up from the `vdev_trim` code paths. = This Patch We considered the following solutions to deal with this issue: [A] Add logic to `vdev_initialize/trim` to skip those new metaslabs. We decided against this as it would be good to avoid exposing this lower-level detail to higher-level operations. [B] Have `metaslab_load_impl()` return early for new metaslabs and thus never touch those range_trees that are NULL at that time. This seemed more of a work-around for the bug and not a clear-cut solution. [C] Refactor our logic so all metaslabs have their range_trees created at the time of their creation in `metaslab_init()`. In this patch we decided to go with [C] because: (1) It doesn't expose more metaslab details to higher level operations such as vdev initialize and trim. (2) The current behavior of creating the range trees lazily in `metaslab_sync_done()` is unnecessarily complicated. (3) Always initializing the metaslab range_trees makes other parts of the codebase cleaner. 
For example, we used to use `ms_freed` as the reference value for knowing whether all the range_trees have been initialized. Now we no longer need to do that check in most places (and in the few that we do we use the `ms_new` boolean field now which is more readable). = Side Changes Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice in `metaslab_load_impl()`. In this patch we remove the extraneous assignment. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #11737
*	Linux 5.12 update: bio_max_segs() replaces BIO_MAX_PAGES	Coleman Kane	2021-03-19	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \|	The BIO_MAX_PAGES macro is being retired in favor of a bio_max_segs() function that implements the typical MIN(x,y) logic used throughout the kernel for bounding the allocation, and also the new implementation is intended to be signed-safe (which the former was not). Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11765
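The bounding logic mentioned in this message amounts to a MIN() against a fixed cap using unsigned arithmetic. The sketch below is a userspace illustration with hypothetical names; the cap of 256 is an assumption chosen for the example and is not taken from the commit.

```c
/* Assumed cap for the sketch; stands in for the kernel's segment limit. */
#define	SKETCH_BIO_MAX_SEGS	256u

/*
 * Bound a requested segment count to the cap. Using unsigned types
 * avoids the sign problems a macro comparison with mixed types can have.
 */
static unsigned int
sketch_bio_max_segs(unsigned int nr_segs)
{
	return (nr_segs < SKETCH_BIO_MAX_SEGS ? nr_segs : SKETCH_BIO_MAX_SEGS);
}
```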
*	Linux 5.12 compat: idmapped mounts	Coleman Kane	2021-03-19	6	-13/+100
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In Linux 5.12, the filesystem API was modified to support idmapped mounts by adding a "struct user_namespace *" parameter to a number of functions and VFS handlers. This change adds the needed autoconf macros to detect the new interfaces and updates the code appropriately. This change does not add support for idmapped mounts, instead it preserves the existing behavior by passing the initial user namespace where needed. A subsequent commit will be required to add support for idmapped mounts. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11712
*	Clean up RAIDZ/DRAID ereport code	Matthew Ahrens	2021-03-19	6	-434/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The RAIDZ and DRAID code is responsible for reporting checksum errors on their child vdevs. Checksum errors represent events where a disk returned data or parity that should have been correct, but was not. In other words, these are instances of silent data corruption. The checksum errors show up in the vdev stats (and thus `zpool status`'s CKSUM column), and in the event log (`zpool events`). Note, this is in contrast with the more common "noisy" errors where a disk goes offline, in which case ZFS knows that the disk is bad and doesn't try to read it, or the device returns an error on the requested read or write operation. RAIDZ/DRAID generate checksum errors via three code paths: 1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are reported on any children whose data was not used during the reconstruction. This is handled in `raidz_reconstruct()`. This is the most common type of RAIDZ/DRAID checksum error. 2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that means that the data has been lost. The zio fails and an error is returned to the consumer (e.g. the read(2) system call). This would happen if, for example, three different disks in a RAIDZ2 group are silently damaged. Since the damage is silent, it isn't possible to know which three disks are damaged, so a checksum error is reported against every child that returned data or parity for this read. (For DRAID, typically only one "group" of children is involved in each io.) This case is handled in `vdev_raidz_cksum_finish()`. This is the next most common type of RAIDZ/DRAID checksum error. 3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in case 2), but there happens to be additional copies of this block due to "ditto blocks" (i.e. 
multiple DVA's in this blkptr_t), and one of those copies is good, then RAIDZ/DRAID compares each sector of the data or parity that it retrieved with the good data from the other DVA, and if they differ then it reports a checksum error on this child. This differs from case 2 in that the checksum error is reported on only the subset of children that actually have bad data or parity. This case happens very rarely, since normally only metadata has ditto blocks. If the silent damage is extensive, there will be many instances of case 2, and the pool will likely be unrecoverable. The code for handling case 3 is considerably more complicated than the other cases, for two reasons: 1. It needs to run after the main raidz read logic has completed. The data RAIDZ read needs to be preserved until after the alternate DVA has been read, which necessitates refcounts and callbacks managed by the non-raidz-specific zio layer. 2. It's nontrivial to map the sections of data read by RAIDZ to the correct data. For example, the correct data does not include the parity information, so the parity must be recalculated based on the correct data, and then compared to the parity that was read from the RAIDZ children. Due to the complexity of case 3, the rareness of hitting it, and the minimal benefit it provides above case 2, this commit removes the code for case 3. These types of errors will now be handled the same as case 2, i.e. the checksum error will be reported against all children that returned data or parity. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11735
*	FreeBSD: make seqc asserts conditional on replay	Mateusz Guzik	2021-03-17	1	-3/+6
\| \| \| \| \| \| \|	Avoids tripping on asserts when doing pool recovery. Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #11739
*	Remove unused rr_code	Matthew Ahrens	2021-03-17	1	-46/+23
\| \| \| \| \| \| \| \| \| \|	The `rr_code` field in `raidz_row_t` is unused. This commit removes the field, as well as the code that's used to set it. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11736
*	FreeBSD: Fix memory leaks in kstats	Ryan Moeller	2021-03-17	1	-7/+4
\| \| \| \| \| \| \| \| \| \| \|	Don't handle (incorrectly) kmem_zalloc() failure. With KM_SLEEP, it will never return NULL. Free the data allocated for non-virtual kstats when deleting the object. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11767
*	Linux: always check or verify return of igrab()	Adam D. Moss	2021-03-16	3	-3/+9
\| \| \| \| \| \| \| \| \| \| \|	zhold() wraps igrab() on Linux, and igrab() may fail when the inode is in the process of being deleted. This means zhold() must only be called when a reference exists and therefore it cannot be deleted. This is the case for all existing consumers so add a VERIFY and a comment explaining this requirement. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Adam Moss <[email protected]> Closes #11704
*	Reference_tracking_enable should be a module param	Don Brady	2021-03-16	2	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \|	To make use of zfs_refcount_held tunable it should be a module parameter in open-zfs. Also, since the macros will auto-generate OS specific tunables, removed the existing zfs_refcount_held reference in module/os/freebsd/zfs/sysctl_os.c. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #11753
*	FreeBSD: bring back possibility to rewind the checkpoint from bootloader	Mariusz Zaborski	2021-03-12	1	-1/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add parsing of the rewind options. When I was upstreaming the change [1], I omitted the part where we detect that the pool should be rewound. When the FreeBSD repo synced with OpenZFS, this part of the code was removed. [1] FreeBSD repo: 277f38abffc6a8160b5044128b5b2c620fbb970c [2] OpenZFS repo: f2c027bd6a003ec5793f8716e6189c389c60f47a External-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254152 Originally reviewed by: tsoome, allanjude Originally reviewed by: kevans (ok from high-level overview) Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mariusz Zaborski <[email protected]> Closes #11730
*	FreeBSD: Clean up zfsdev_close to match Linux	Ryan Moeller	2021-03-12	1	-10/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Resolve some oddities in zfsdev_close() which could result in a panic and were not present in the equivalent function for Linux. - Remove unused definition ZFS_MIN_MINOR - FreeBSD: Simplify zfsdev state destruction - Assert zs_minor is valid in zfsdev_close - Make locking around zfsdev state match Linux Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11720
*	Macroify teardown lock handling	Mateusz Guzik	2021-03-12	3	-30/+28
\| \| \| \| \| \| \| \| \| \| \|	This will allow platforms to implement it as they see fit, in particular in a different manner than rrm locks. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #11153
*	FreeBSD: rename teardown inactive macros to mimic rrm convention	Mateusz Guzik	2021-03-12	3	-18/+18
\| \| \| \| \| \| \| \|	Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #11153
*	FreeBSD: remove 2 assertions that teardown lock is not held	Mateusz Guzik	2021-03-12	1	-45/+0
\| \| \| \| \| \| \| \| \| \| \|	They are not very useful and hard to implement in the rms routine the code is about to start using. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #11153
*	FreeBSD: rework asserts in zfs_dd_lookup	Mateusz Guzik	2021-03-12	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \|	1. even up ifdefs 2. drop the arguably useless teardown lock asserts -- nothing else checks for it Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Closes #11153
*	FreeBSD: Fix scope of deadman tunables	Ryan Moeller	2021-03-11	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	A few deadman tunables ended up in the wrong sysctl node. Move them to vfs.zfs.deadman.* Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11715
*	zvol: call zil_replaying() during replay	Christian Schwarz	2021-03-07	3	-3/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zil_replaying(zil, tx) has the side-effect of informing the ZIL that an entry has been replayed in the (still open) tx. The ZIL uses that information to record the replay progress in the ZIL header when that tx's txg syncs. ZPL log entries are not idempotent and logically dependent and thus calling zil_replaying() is necessary for correctness. For ZVOLs the question of correctness is more nuanced: ZVOL logs only TX_WRITE and TX_TRUNCATE, both of which are idempotent. Logical dependencies between two records exist only if the write or discard request had sync semantics or if the ranges affected by the records overlap. Thus, at a first glance, it would be correct to restart replay from the beginning if we crash before replay completes. But this does not address the following scenario: Assume one log record per LWB. The chain on disk is HDR -> 1:W(1, "A") -> 2:W(1, "B") -> 3:W(2, "X") -> 4:W(3, "Z") where N:W(O, C) represents log entry number N which is a TX_WRITE of C to offset O. We replay 1, 2 and 3 in one txg, sync that txg, then crash. Bit flips corrupt 2, 3, and 4. We come up again and restart replay from the beginning because we did not call zil_replaying() during replay. We replay 1 again, then interpret 2's invalid checksum as the end of the ZIL chain and call replay done. The replayed zvol content is "AX". If we had called zil_replaying() the HDR would have pointed to 3 and our resumed replay would not have replayed anything because 3 was corrupted, resulting in zvol content "BX". If 3 logically depends on 2 then the replay corrupted the ZVOL_OBJ's contents. This patch adds the zil_replaying() calls to the replay functions. Since the callbacks in the replay function need the zilog_t* pointer so that they can call zil_replaying() we open the ZIL while replaying in zvol_create_minor(). 
We also verify that replay has been done when on-demand-opening the ZIL on the first modifying bio. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #11667
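The crash scenario in this message can be traced with a toy model of the log chain. Everything here is hypothetical scaffolding: records carry an offset, one byte of data, and a validity flag standing in for the checksum, and the "header" is just an index into the chain. The sketch only reproduces the "AX" versus "BX" outcome described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy log record: write `data` at `offset`; `valid` models the checksum. */
typedef struct {
	int offset;
	char data;
	bool valid;
} sketch_rec_t;

/*
 * Replay recs[start..n) until an invalid record ends the chain.
 * Returns the index of the first record that was not replayed.
 */
static size_t
sketch_replay(const sketch_rec_t *recs, size_t n, size_t start, char *content)
{
	size_t i = start;

	for (; i < n && recs[i].valid; i++)
		content[recs[i].offset] = recs[i].data;
	return (i);
}

/* Runs the scenario; out receives the bytes at offsets 1 and 2. */
static void
sketch_scenario(bool record_progress, char out[3])
{
	sketch_rec_t chain[4] = {
		{ 1, 'A', true }, { 1, 'B', true },
		{ 2, 'X', true }, { 3, 'Z', true }
	};
	char content[4] = { 0 };

	/* First boot: entries 1..3 replay in one txg, it syncs, we crash. */
	size_t done = sketch_replay(chain, 3, 0, content);

	/* With progress recorded the header points at entry 3, else at 1. */
	size_t hdr = record_progress ? done - 1 : 0;

	/* Bit flips corrupt entries 2, 3 and 4 before the next boot. */
	chain[1].valid = chain[2].valid = chain[3].valid = false;

	/* Second boot: resume replay from wherever the header points. */
	(void) sketch_replay(chain, 4, hdr, content);

	out[0] = content[1];
	out[1] = content[2];
	out[2] = '\0';
}
```

Calling `sketch_scenario(false, out)` models restarting from the beginning, and `sketch_scenario(true, out)` models a header that recorded replay progress.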
*	Intentionally allow ZFS_READONLY in zfs_write	Ryan Moeller	2021-03-07	2	-7/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ZFS_READONLY represents the "DOS R/O" attribute. When that flag is set, we should behave as if write access were not granted by anything in the ACL. In particular: We _must_ allow writes after opening the file r/w, then setting the DOS R/O attribute, and writing some more. (Similar to how you can write after fchmod(fd, 0444).) Restore these semantics which were lost on FreeBSD when refactoring zfs_write. To my knowledge Linux does not actually expose this flag, but we'll need it to eventually so I've added the supporting checks. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11693
*	Initialize ZIL buffers	Brian Behlendorf	2021-03-05	1	-0/+1
\| \| \| \| \| \| \| \| \|	When populating a ZIL destination buffer ensure it is always zeroed before its contents are constructed. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11687
*	Fix abd_get_offset_struct() may allocate new abd	Jorgen Lundman	2021-03-05	1	-1/+5
\| \| \| \| \| \| \| \| \|	Even when supplied with an abd to abd_get_offset_struct(), the call to abd_get_offset_impl() can allocate a different abd. Ensure to call abd_fini_struct() on the abd that is not used. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #11683
*	FreeBSD module --enable-debug --enable-invariants	Ryan Moeller	2021-03-05	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	Wire up the --enable-debug flag for configure to the FreeBSD module build. Add --enable-invariants. The running FreeBSD kernel config is used to detect whether to enable INVARIANTS if not explicitly specified with --enable-invariants or --disable-invariants. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11678
*	linux: zvol: avoid heap allocation for zvol_request_sync=1	Christian Schwarz	2021-03-03	1	-29/+64
\| \| \| \| \| \| \| \| \| \| \| \|	The spl_kmem_alloc showed up in some flamegraphs in a single-threaded 4k sync write workload at 85k IOPS on an Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz. Certainly not a huge win but I believe the change is clean and easy to maintain down the road. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #11666
*	Add "zstd-fast" to help options for "compression" property	Jake Howard	2021-03-03	1	-1/+1
\| \| \| \| \| \| \|	This value does work as expected, and is documented in the manpage. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jake Howard <[email protected]> Closes #11670
*	Cancel TRIM / initialize on FAULTED non-writeable vdevs	nssrikanth	2021-03-02	2	-6/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a device which is actively trimming or initializing becomes FAULTED, and therefore no longer writable, cancel the active TRIM or initialization. When the device is merely taken offline with `zpool offline` then stop the operation but do not cancel it. When the device is brought back online the operation will be resumed if possible. Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]> Co-authored-by: Vipin Kumar Verma <[email protected]> Signed-off-by: Srikanth N S <[email protected]> Closes #11588
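The policy in this message distinguishes two outcomes for an in-progress TRIM or initialize operation. A minimal decision-table sketch follows; the enum and function names are hypothetical and the vdev states are reduced to three cases for illustration.

```c
/* Reduced, hypothetical set of device states. */
typedef enum {
	SKETCH_VDEV_HEALTHY,
	SKETCH_VDEV_OFFLINE,	/* taken offline by the administrator */
	SKETCH_VDEV_FAULTED	/* no longer writable */
} sketch_vdev_state_t;

/* What to do with an active TRIM or initialize operation. */
typedef enum {
	SKETCH_KEEP_RUNNING,
	SKETCH_STOP_RESUMABLE,	/* resumes when the device comes back */
	SKETCH_CANCEL		/* operation is abandoned */
} sketch_action_t;

static sketch_action_t
sketch_active_op_action(sketch_vdev_state_t state)
{
	switch (state) {
	case SKETCH_VDEV_FAULTED:
		return (SKETCH_CANCEL);
	case SKETCH_VDEV_OFFLINE:
		return (SKETCH_STOP_RESUMABLE);
	default:
		return (SKETCH_KEEP_RUNNING);
	}
}
```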
*	Fix assert in FreeBSD-specific dmu_read_pages	Andriy Gapon	2021-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	The function has three similar pieces of code: for read-behind pages, requested pages and read-ahead pages. All three pieces had an assert to ensure that the page is not mapped. Later the assert was relaxed to require that the page is not mapped for writing. But that was done in two places out of three. This change fixes the third piece, read-ahead. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #11654
*	Add missing checks for unsupported features	Martin Matuška	2021-02-27	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	After 35ec517 it has become possible to import ZFS pools with an active org.illumos:edonr feature on FreeBSD, leading to a panic. In addition, "zpool status" reported all pools without edonr as upgradable and "zpool upgrade -v" reported edonr in the list of upgradable features. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Martin Matuska <[email protected]> Closes #11653
*	Linux 5.12 compat: bio->bi_disk member moved	Coleman Kane	2021-02-24	2	-0/+8
\| \| \| \| \| \| \| \| \| \|	The struct bio member bi_disk was moved underneath a new member named bi_bdev. So all attempts to reference bio->bi_disk need to now become bio->bi_bdev->bd_disk. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Coleman Kane <[email protected]> Closes #11639
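A compatibility accessor for this kind of member move is commonly written as a macro selected at configure time. The sketch below uses mock structures and a hypothetical `SKETCH_HAVE_BIO_BI_DISK` switch; none of these names come from the kernel or from the OpenZFS autoconf checks.

```c
/* Mock types standing in for the kernel structures. */
struct sketch_gendisk {
	int id;
};

struct sketch_block_device {
	struct sketch_gendisk *bd_disk;
};

#ifdef SKETCH_HAVE_BIO_BI_DISK
/* Older layout: the disk hangs directly off the bio. */
struct sketch_bio {
	struct sketch_gendisk *bi_disk;
};
#define	SKETCH_BIO_DISK(bio)	((bio)->bi_disk)
#else
/* Newer layout: the disk is reached through the block device. */
struct sketch_bio {
	struct sketch_block_device *bi_bdev;
};
#define	SKETCH_BIO_DISK(bio)	((bio)->bi_bdev->bd_disk)
#endif
```

Callers then use `SKETCH_BIO_DISK(bio)` everywhere and never touch either member directly, which keeps the version check in one place.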
*	Fix vdev_rebuild_thread deadlock	Brian Behlendorf	2021-02-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	The metaslab_disable() call may block waiting for a txg sync. Therefore it's important that vdev_rebuild_thread release the SCL_CONFIG read lock it is holding before this call. Failure to do so can result in the txg_sync thread getting blocked waiting for this lock which results in a deadlock. Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Srikanth N S <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11647
*	Fix overly broad locking in spa_vdev_config_exit()	Brian Behlendorf	2021-02-24	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Calling vdev_free() only requires that we acquire the spa config SCL_STATE_ALL locks, not the SCL_ALL locks. In particular, we need to avoid taking the SCL_CONFIG lock (included in SCL_ALL) as a writer since this can lead to a deadlock. The txg_sync_thread() may block in spa_txg_history_init_io() when taking the SCL_CONFIG lock as a reader when it detects there's a pending writer. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11585
*	Linux: increase max nvlist_src size	Brian Behlendorf	2021-02-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	On Linux increase the maximum allowed size of the src nvlist which can be passed to the /dev/zfs ioctl. Originally, this was set to a maximum of KMALLOC_MAX_SIZE (4M) because it was kmalloc'd. Since that time it's been converted to a vmalloc so that's no longer a hard limit, and it's desirable for `zfs send/recv` to allow larger nvlists so more snapshots can be sent at once. Signed-off-by: Brian Behlendorf <[email protected]> Closes #6572 Closes #11638
*	Add upper bound for slop space calculation	Prakash Surya	2021-02-24	1	-10/+17
\| \| \| \| \| \| \| \| \| \|	This change modifies the behavior of how we determine how much slop space to use in the pool, such that now it has an upper limit. The default upper limit is 128G, but is configurable via a tunable. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #11023
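The effect of an upper bound on slop space can be shown with a small clamp calculation. Only the 128G default upper limit comes from the message; the shift of 5 (1/32 of the pool) and the 128 MiB floor are assumptions made for this sketch, and all names are hypothetical.

```c
#include <stdint.h>

#define	SKETCH_SLOP_SHIFT	5			/* assumed: 1/32 of pool */
#define	SKETCH_MIN_SLOP		(128ULL << 20)		/* assumed: 128 MiB */
#define	SKETCH_MAX_SLOP		(128ULL << 30)		/* 128 GiB upper limit */

static uint64_t
sketch_min(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

static uint64_t
sketch_max(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/* Slop is a fraction of the pool, capped above and floored below. */
static uint64_t
sketch_slop_space(uint64_t space)
{
	uint64_t slop = sketch_min(space >> SKETCH_SLOP_SHIFT, SKETCH_MAX_SLOP);

	/* Never reserve more than half of a very small pool. */
	return (sketch_max(slop, sketch_min(space >> 1, SKETCH_MIN_SLOP)));
}
```

Under these assumptions a 1 TiB pool reserves 32 GiB, while a 1 PiB pool is held at the 128 GiB cap instead of reserving 32 TiB.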
*	Wrap bare EINVAL returns with SET_ERROR	Ryan Moeller	2021-02-24	1	-2/+2
\| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11636
*	vdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL	fbynite	2021-02-20	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This prevents a panic after a SLOG add/removal on the root pool followed by a zpool scrub. When a SLOG is removed, a hole takes its place - the vdev_ops for a hole is vdev_hole_ops, which defines the handler functions of vdev_op_hold and vdev_op_rele as NULL. This bug has been reported in illumos and FreeBSD, a different trigger in the FreeBSD report though. Credit for this patch goes to Patrick Mooney <[email protected]> Obtained from: illumos-gate commit: c65bd18728f34725 External-issue: https://www.illumos.org/issues/12981 External-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252396 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Wing <[email protected]> Closes #11623
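The guard described here is the usual pattern for optional entries in an operations table: check the function pointer before calling through it. The sketch below uses hypothetical types and a counter in place of real hold and release work.

```c
#include <stddef.h>

typedef struct sketch_vdev sketch_vdev_t;

/* Operations table where hold/rele are optional and may be NULL. */
typedef struct {
	void (*op_hold)(sketch_vdev_t *);
	void (*op_rele)(sketch_vdev_t *);
} sketch_vdev_ops_t;

struct sketch_vdev {
	const sketch_vdev_ops_t *ops;
	int holds;
};

static void
sketch_disk_hold(sketch_vdev_t *vd)
{
	vd->holds++;
}

static void
sketch_disk_rele(sketch_vdev_t *vd)
{
	vd->holds--;
}

/* A disk-like vdev implements both handlers; a hole implements neither. */
static const sketch_vdev_ops_t sketch_disk_ops = {
	sketch_disk_hold, sketch_disk_rele
};
static const sketch_vdev_ops_t sketch_hole_ops = { NULL, NULL };

static void
sketch_vdev_hold(sketch_vdev_t *vd)
{
	if (vd->ops->op_hold != NULL)
		vd->ops->op_hold(vd);
}

static void
sketch_vdev_rele(sketch_vdev_t *vd)
{
	if (vd->ops->op_rele != NULL)
		vd->ops->op_rele(vd);
}
```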
*	Cleaning up uio headers	Brian Atkinson	2021-02-20	4	-4/+11
\| \| \| \| \| \| \| \| \|	Making uio_impl.h the common header interface between Linux and FreeBSD so both OS's can share a common header file. This also helps reduce code duplication for zfs_uio_t for each OS. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes #11622