openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	zfs-allow.8: mention 'bookmark' permission	Lauri Tirkkonen	2021-05-20	1	-0/+1
\| \| \| \| \| \|	Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Lauri Tirkkonen <[email protected]> Closes #12064
*	Scale worker threads and taskqs with number of CPUs	Alexander Motin	2021-05-14	1	-4/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While use of dynamic taskqs allows to reduce number of idle threads, hardcoded 8 taskqs of each kind is a big overkill for small systems, complicating CPU scheduling, increasing I/O reorder, etc, while providing no real locking benefits, just not needed there. On another side, 12*8 worker threads per kind are able to overload almost any system nowadays. For example, pool of several fast SSDs with SHA256 checksum makes system barely responsive during scrub, or with dedup enabled barely responsive during large file deletion. To address both problems this patch introduces ZTI_SCALE macro, alike to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs, to be used in places where lock scalability is needed, while request ordering is not so much. The code is made to create new taskq for ~6 worker threads (less for small systems, but more for very large) up to 80% of CPU cores (previous 75% was not good for rounding down). Both number of threads and threads per taskq are now tunable in case somebody really wants to use all of system power for ZFS. While obviously some benchmarks show small peak performance reduction (not so big really, especially on systems with SMT, where use of the second threads does not give as much performance as the first ones), they also show dramatic latency reduction and much more smooth user- space operation in case of high CPU usage by ZFS. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #11966
*	FreeBSD: Implement xattr=sa	Ryan Moeller	2021-05-13	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	FreeBSD historically has not cared about the xattr property; it was always treated as xattr=on. With xattr=on, xattrs are stored as files in a hidden xattr directory. With xattr=sa, xattrs are stored as system attributes and get cached in nvlists during xattr operations. This makes SA xattrs simpler and more efficient to manipulate. FreeBSD needs to implement the SA xattr operations for feature parity with Linux and to ensure that SA xattrs are accessible when migrated or replicated from Linux. Following the example set by Linux, refactor our existing extattr vnops to split off the parts handling dir style xattrs, and add the corresponding SA handling parts. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11997
*	Widen mancheck target to all pages, fix them	наб	2021-05-12	5	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mandoc: ./man/man8/zfs-mount-generator.8.in:188:2: ERROR: skipping end of block that is not open: RE mandoc: ./man/man8/zfs_ids_to_path.8:38:2: ERROR: skipping unknown macro: .LP mandoc: ./man/man8/zfs_ids_to_path.8:48:2: ERROR: inserting missing end of block: Sh breaks Bl mandoc: ./man/man8/zfs-wait.8:69:2: ERROR: skipping end of block that is not open: El mandoc: ./man/man8/zfs-program.8:460:2: ERROR: inserting missing end of block: It breaks Bd mandoc: ./man/man8/zfs-mount-generator.8:188:2: ERROR: skipping end of block that is not open: RE mandoc: ./man/man8/zstream.8:43:2: ERROR: skipping unknown macro: .LP mandoc: ./man/man8/zstream.8:107:2: ERROR: inserting missing end of block: Sh breaks Bl mandoc: ./man/man8/zstream.8:107:2: ERROR: inserting missing end of block: Sh breaks Bl make: *** [Makefile:1529: mancheck] Error 1 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Issue #12017
*	libzfs: add keylocation=https://, backed by fetch(3) or libcurl	наб	2021-05-12	1	-2/+17
\| \| \| \| \| \| \| \| \| \| \|	Add support for http and https to the keylocation properly to allow encryption keys to be fetched from the specified URL. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Issue #9543 Closes #9947 Closes #11956
*	Remove unimplemented virus scanning hooks	Ryan Moeller	2021-05-10	1	-1/+1
\| \| \| \| \| \| \|	Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11972
*	module/zfs: remove zfs_zevent_console and zfs_zevent_cols	наб	2021-05-10	1	-22/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zfs_zevent_console committed multiple printk()s per line without properly continuing them ‒ a single event could easily be fragmented across over thirty lines, making it useless for direct application zfs_zevent_cols exists purely to wrap the output from zfs_zevent_console The niche this was supposed to fill can be better served by something akin to the all-syslog ZEDLET Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #7082 Closes #11996
*	Replace ZoL with OpenZFS where applicable	наб	2021-05-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Afterward, git grep ZoL matches: * README.md: * [ZoL Site](https://zfsonlinux.org) - Correct * etc/default/zfs.in:# ZoL userland configuration. - Changing this would induce a needless upgrade-check, if the user has modified the configuration; this can be updated the next time the defaults change * module/zfs/dmu_send.c: * ZoL < 0.7 does not handle [...] - Before 0.7 is ZoL, so fair enough Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Issue #11956
*	Updated zfs_dbgmsg_enable documentation to be more accurate	Rich Ercolani	2021-05-05	1	-3/+6
\| \| \| \| \| \| \| \| \| \|	Changed the default specified for zfs_dbgmsg_enable, added clarification of interaction with zfs_flags. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #11984 Closes #11986
*	Fixed incorrect man page reference in zfsprops(8)	Daniel Stevenson	2021-04-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	The special_small_blocks section directed readers to zpool(8) for documentation on special allocation classes, while they are actually documented in zpoolconcepts(8). Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Daniel Stevenson <[email protected]> Closes #11918
*	zfs-send(8): Restore sorting of flags	Ryan Moeller	2021-04-15	1	-9/+9
\| \| \| \| \| \| \| \| \|	Before #11710 the flags in zfs-send(8) were sorted. Restore order and bump the date. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11905
*	ZFS traverse_visitbp optimization to limit prefetch	Jitendra Patidar	2021-04-15	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Traversal code, traverse_visitbp() does visit blocks recursively. Indirect (Non L0) Block of size 128k could contain, 1024 block pointers of 128 bytes. In case of full traverse OR incremental traverse, where all blocks were modified, it could traverse large number of blocks pointed by indirect. Traversal code does issue prefetch of blocks traversed below indirect. This could result into large number of async reads queued on vdev queue. So, account for prefetch issued for blocks pointed by indirect and limit max prefetch in one go. Module Param: zfs_traverse_indirect_prefetch_limit: Limit of prefetch while traversing an indirect block. Local counters: prefetched: Local counter to account for number prefetch done. pidx: Index for which next prefetch to be issued. ptidx: Index at which next prefetch to be triggered. Keep "ptidx" somewhere in the middle of blocks prefetched, so that blocks prefetch read gets the enough time window before their demand read is issued. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Jitendra Patidar <[email protected]> Closes #11802 Closes #11803
*	Improvements to the 'compatibility' property	Colm	2021-04-12	2	-7/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Several improvements to the operation of the 'compatibility' property: 1) Improved handling of unrecognized features: Change the way unrecognized features in compatibility files are handled. * invalid features in files under /usr/share/zfs/compatibility.d only get a warning (as these may refer to future features not yet in the library), * invalid features in files under /etc/zfs/compatibility.d get an error (as these are presumed to refer to the current system). 2) Improved error reporting from zpool_load_compat. Note: slight ABI change to zpool_load_compat for better error reporting. 3) compatibility=legacy inhibits all 'zpool upgrade' operations. 4) Detect when features are enabled outside current compatibility set * zpool set compatibility=foo <-- print a warning * zpool set feature@xxx=enabled <-- error * zpool status <-- indicate this state Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes #11861
*	ZTS: fix removal_condense_export test case	Brian Behlendorf	2021-04-11	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \|	It's been observed in the CI that the required 25% of obsolete bytes in the mapping can be to high a threshold for this test resulting in condensing never being triggered and a test failure. To prevent these failures make the existing zfs_condense_indirect_obsolete_pct tuning available so the obsolete percentage can be reduced from 25% to 5% during this test. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11869
*	zfprops(8): fix spacing in jailed= arguments	наб	2021-04-11	1	-1/+1
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11866
*	zfs-[un]jail(8): fix "zfs-jail [un]jail" leftovers	наб	2021-04-11	1	-2/+2
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11866
*	Allow zfs to send replication streams with missing snapshots	pablofsf	2021-04-11	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \|	A tentative implementation and discussion was done in #5285. According to it a send --skip-missing\|-s flag has been added. In a replication stream, when there are snapshots missing in the hierarchy, if -s is provided print a warning and ignore dataset (and its children) instead of throwing an error Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pablo Correa Gómez <[email protected]> Closes #11710
*	Ratelimit deadman zevents as with delay zevents	Ryan Moeller	2021-04-07	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Just as delay zevents can flood the zevent pipe when a vdev becomes unresponsive, so do the deadman zevents. Ratelimit deadman zevents according to the same tunable as for delay zevents. Enable deadman tests on FreeBSD and add a test for deadman event ratelimiting. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11786
*	zed.8: the Diagnosis Engine is implemented	наб	2021-04-07	1	-2/+0
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11834
*	zed: merge all _NOT_IMPLEMENTED_ events	наб	2021-04-07	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	These events should currently never be generated. Also untag _zed_event_add_nvpair() from merge with zpool_do_events_nvprint() ‒ they serve different purposes (machine, usually script vs human consumption) and format the output differently as it stands Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11834
*	zed.8: don't pretend an unprivileged user could change the script owner	наб	2021-04-07	1	-9/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	And add a note on /why/ ZEDLETs need to be owned by root Quoth chown(2), Linux man-pages project: Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. Quoth chown(2), FreeBSD: [EPERM] The operation would change the ownership, but the effective user ID is not the super-user. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11834
*	zed: purge all mentions of a configuration file	наб	2021-04-07	1	-8/+0
\| \| \| \| \| \| \| \| \| \| \| \|	There simply isn't a need for one, since the flags the daemon takes are all short (mostly just toggles) and administrative in nature, and are therefore better served by the age-old tradition of sourcing an environment file and preparing the cmdline in the init-specific handler itself, if needed at all Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11834
*	zpool-features.5: remove "booting not possible with this feature"s	наб	2021-04-06	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \|	The exact limitations on what features are supported when booting vary considerably depending on the environment. In order to minimize confusion avoid categorical statements which assume GRUB2 is being used. The supported GRUB2 features are covered earlier in this man page for easy reference. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11842
*	man: fix wrong .Xr macros usages	George Melikov	2021-04-06	7	-10/+10
\| \| \| \| \| \| \| \|	In addition, html doc will have working hyperlinks. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #11845
*	Fix various typos	Andrea Gelmini	2021-04-02	3	-6/+6
\| \| \| \| \| \| \| \| \| \|	Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #11774
*	zed: allow limiting concurrent jobs	наб	2021-04-02	1	-0/+6
\| \| \| \| \| \| \| \| \|	200ms time-out is relatively long, but if we already hit the cap, then we'll likely be able to spawn multiple new jobs when we wake up Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11807
*	zed: use separate reaper thread and collect ZEDLETs asynchronously	наб	2021-04-02	1	-6/+0
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11807
*	Don't scale zfs_zevent_len_max by CPU count	Ryan Moeller	2021-04-01	1	-5/+4
\| \| \| \| \| \| \| \| \| \|	The lower bound for this scaling to too low and the upper bound is too high. Use a fixed default length of 512 instead, which is a reasonable value on any system. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11822
*	fsck.zfs: implement 4/8 exit codes as suggested in manpage	наб	2021-03-31	1	-11/+15
\| \| \| \| \| \| \| \| \| \| \| \| \|	Update the fsck.zfs helper to bubble up some already-known-about errors if they are detected in the pool. health=degraded => 4/"Filesystem errors left uncorrected" health=faulted && dataset in /etc/fstab => 8/"Operational error" pool not found => 8/"Operational error" everything else => 0 Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11806
*	zed: reap child after killing on time-out	наб	2021-03-26	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	When a child process is killed waitpid() must be called on the pid the reap the zombie process. Update BUGS section to reflect reality by replacing "zedlets aren't time limited with "zedlets can be interrupted". Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11769 Closes #11798
*	Fix typo in zgenhostid.8	Ryan Moeller	2021-03-19	1	-2/+2
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11770
*	Clean up RAIDZ/DRAID ereport code	Matthew Ahrens	2021-03-19	1	-0/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The RAIDZ and DRAID code is responsible for reporting checksum errors on their child vdevs. Checksum errors represent events where a disk returned data or parity that should have been correct, but was not. In other words, these are instances of silent data corruption. The checksum errors show up in the vdev stats (and thus `zpool status`'s CKSUM column), and in the event log (`zpool events`). Note, this is in contrast with the more common "noisy" errors where a disk goes offline, in which case ZFS knows that the disk is bad and doesn't try to read it, or the device returns an error on the requested read or write operation. RAIDZ/DRAID generate checksum errors via three code paths: 1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are reported on any children whose data was not used during the reconstruction. This is handled in `raidz_reconstruct()`. This is the most common type of RAIDZ/DRAID checksum error. 2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that means that the data has been lost. The zio fails and an error is returned to the consumer (e.g. the read(2) system call). This would happen if, for example, three different disks in a RAIDZ2 group are silently damaged. Since the damage is silent, it isn't possible to know which three disks are damaged, so a checksum error is reported against every child that returned data or parity for this read. (For DRAID, typically only one "group" of children is involved in each io.) This case is handled in `vdev_raidz_cksum_finish()`. This is the next most common type of RAIDZ/DRAID checksum error. 3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in case 2), but there happens to be additional copies of this block due to "ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those copies is good, then RAIDZ/DRAID compares each sector of the data or parity that it retrieved with the good data from the other DVA, and if they differ then it reports a checksum error on this child. This differs from case 2 in that the checksum error is reported on only the subset of children that actually have bad data or parity. This case happens very rarely, since normally only metadata has ditto blocks. If the silent damage is extensive, there will be many instances of case 2, and the pool will likely be unrecoverable. The code for handling case 3 is considerably more complicated than the other cases, for two reasons: 1. It needs to run after the main raidz read logic has completed. The data RAIDZ read needs to be preserved until after the alternate DVA has been read, which necessitates refcounts and callbacks managed by the non-raidz-specific zio layer. 2. It's nontrivial to map the sections of data read by RAIDZ to the correct data. For example, the correct data does not include the parity information, so the parity must be recalculated based on the correct data, and then compared to the parity that was read from the RAIDZ children. Due to the complexity of case 3, the rareness of hitting it, and the minimal benefit it provides above case 2, this commit removes the code for case 3. These types of errors will now be handled the same as case 2, i.e. the checksum error will be reported against all children that returned data or parity. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11735
*	Hold and release permissions exist	gldisater	2021-03-16	1	-0/+3
\| \| \| \| \| \| \| \|	The man page was missing these two permissions. Add the missing permissions to the man page. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jeremy Faulkner <[email protected]> Closes #11727
*	Reference_tracking_enable should be a module param	Don Brady	2021-03-16	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \|	To make use of zfs_refcount_held tunable it should be a module parameter in open-zfs. Also, since the macros will auto-generate OS specific tunables, removed the existing zfs_refcount_held reference in module/os/freebsd/zfs/sysctl_os.c. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #11753
*	Fix whitespace introduced in ecc277cff	Martin Matuška	2021-03-11	2	-9/+9
\| \| \| \| \| \| \| \| \|	The manual page change in ecc277c has introduced whitespace on line ends. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Martin Matuska <[email protected]> Closes #11722
*	Clarify compressed zfs send/recv behavior	manfromafar	2021-03-07	2	-1/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Docs for send and receive do not explain behavior when sending a compressed stream then receiving on a host that overrides compression with -o compress=value. The data from the send stream is written as it was from the send is the compressed form but the compression algorithm set on the receiver is the overridden version which causes some confusion as to what algorithm was actually used. Updated man docs to clarify behavior Reviewed-by: Brian Behlendorf <[email protected]> Reviewed By: Allan Jude <[email protected]> Signed-off-by: manfromafar <[email protected]> Closes #11690
*	Linux: increase max nvlist_src size	Brian Behlendorf	2021-02-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	On Linux increase the maximum allowed size of the src nvlist which can be passed to the /dev/zfs ioctl. Originally, this was set to a maximum of KMALLOC_MAX_SIZE (4M) because it was kmalloc'd. Since that time it's been converted to a vmalloc so that's no longer a hard limit, and it's desirable for `zfs send/recv` to allow larger nvlists so more snapshots can be sent at once. Signed-off-by: Brian Behlendorf <[email protected]> Closes #6572 Closes #11638
*	Add "compatibility" property for zpool feature sets	Colm	2021-02-17	4	-6/+104
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Property to allow sets of features to be specified; for compatibility with specific versions / releases / external systems. Influences the behavior of 'zpool upgrade' and 'zpool create'. Initial man page changes and test cases included. Brief synopsis: zpool create -o compatibility=off\|legacy\|file[,file...] pool vdev... compatibility = off : disable compatibility mode (enable all features) compatibility = legacy : request that no features be enabled compatibility = file[,file...] : read features from specified files. Only features present in all files will be enabled on the resulting pool. Filenames may be absolute, or relative to /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc checked first). Only affects zpool create, zpool upgrade and zpool status. ABI changes in libzfs: * New function "zpool_load_compat" to load and parse compat sets. * Add "zpool_compat_status_t" typedef for compatibility parse status. * Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum * Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum An initial set of base compatibility sets are included in cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is modified to install these in $pkgdatadir/compatibility.d and to create symbolic links to a reasonable set of aliases. Reviewed-by: ericloewe Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes #11468
*	zfs-list.8: clarify listing snapshots	Brian Behlendorf	2021-02-04	1	-3/+8
\| \| \| \| \| \| \| \| \| \| \|	Clarify how to include snapshots in the `zpool list` output by referencing the full name of the `listsnapshots` pool property, and the `zpool list -t snapshot` option. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #11562 Closes #11565
*	Add zdb -r <dataset> <object-id \| file> <output>	Allan Jude	2021-01-27	1	-1/+17
\| \| \| \| \| \| \| \| \| \| \| \| \|	While you can use zdb -R poolname vdev:offset:[<lsize>/]<psize>[:flags] to extract individual DVAs from a vdev, it would be handy for be able copy an entire file out of the pool. Given a file or object number, add support to copy the contents to a file. Useful for debugging and recovery. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #11027
*	Fix a man page link in zfs-program.8	Alan Somers	2021-01-26	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	zfs-program.8 has an orphan link, fix it. https://svnweb.freebsd.org/base?view=revision&revision=360080 Obtained from: FreeBSD Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alan Somers <[email protected]> Closes #11529
*	zfsprops.8: fix mispluralisation in "Default values is"	наб	2021-01-24	1	-1/+1
\| \| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11509
*	Set aside a metaslab for ZIL blocks	Matthew Ahrens	2021-01-21	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Mixing ZIL and normal allocations has several problems: 1. The ZIL allocations are allocated, written to disk, and then a few seconds later freed. This leaves behind holes (free segments) where the ZIL blocks used to be, which increases fragmentation, which negatively impacts performance. 2. When under moderate load, ZIL allocations are of 128KB. If the pool is fairly fragmented, there may not be many free chunks of that size. This causes ZFS to load more metaslabs to locate free segments of 128KB or more. The loading happens synchronously (from zil_commit()), and can take around a second even if the metaslab's spacemap is cached in the ARC. All concurrent synchronous operations on this filesystem must wait while the metaslab is loading. This can cause a significant performance impact. 3. If the pool is very fragmented, there may be zero free chunks of 128KB or more. In this case, the ZIL falls back to txg_wait_synced(), which has an enormous performance impact. These problems can be eliminated by using a dedicated log device ("slog"), even one with the same performance characteristics as the normal devices. This change sets aside one metaslab from each top-level vdev that is preferentially used for ZIL allocations (vdev_log_mg, spa_embedded_log_class). From an allocation perspective, this is similar to having a dedicated log device, and it eliminates the above-mentioned performance problems. Log (ZIL) blocks can be allocated from the following locations. Each one is tried in order until the allocation succeeds: 1. dedicated log vdevs, aka "slog" (spa_log_class) 2. embedded slog metaslabs (spa_embedded_log_class) 3. other metaslabs in normal vdevs (spa_normal_class) The space required for the embedded slog metaslabs is usually between 0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop" space that is not available for user data. On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity, and recordsize=8k, testing shows a ~50% performance increase on random 8k sync writes. On even more fragmented systems (which hit problem #3 above and call txg_wait_synced()), the performance improvement can be arbitrarily large (>100x). Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11389
*	Only examine best metaslabs on each vdev	Matthew Ahrens	2020-12-16	1	-0/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On a system with very high fragmentation, we may need to do lots of gang allocations (e.g. most indirect block allocations (~50KB) may need to gang). Before failing a "normal" allocation and resorting to ganging, we try every metaslab. This has the impact of loading every metaslab (not a huge deal since we now typically keep all metaslabs loaded), and also iterating over every metaslab for every failing allocation. If there are many metaslabs (more than the typical ~200, e.g. due to vdev expansion or very large vdevs), the CPU cost of this iteration can be very impactful. This iteration is done with the mg_lock held, creating long hold times and high lock contention for concurrent allocations, ultimately causing long txg sync times and poor application performance. To address this, this commit changes the behavior of "normal" (not try_hard, not ZIL) allocations. These will now only examine the 100 best metaslabs (as determined by their ms_weight). If none of these have a large enough free segment, then the allocation will fail and we'll fall back on ganging. To accomplish this, we will now (normally) gang before doing a `try_hard` allocation. Non-try_hard allocations will only examine the 100 best metaslabs of each vdev. In summary, we will first try normal allocation. If that fails then we will do a gang allocation. If that fails then we will do a "try hard" gang allocation. If that fails then we will have a multi-layer gang block. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11327
*	Add -u option to 'zfs create'	Ryan Moeller	2020-12-04	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add -u option to 'zfs create' that prevents file system from being automatically mounted. This is similar to the 'zfs receive -u'. Authored by: pjd <[email protected]> FreeBSD-commit: freebsd/freebsd@35c58230e292775a694d189ff2b0bea2dcf6947d Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> Ported-by: Ryan Moeller <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11254
*	Fix trivial typo in zfs-diff.8	melak	2020-12-03	1	-1/+1
\| \| \| \| \| \| \|	Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tamas TEVESZ <[email protected]> Closes #11268 Closes #11272
*	Reduce latency effects of non-interactive I/O	Alexander Motin	2020-11-24	1	-2/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Investigating influence of scrub (especially sequential) on random read latency I've noticed that on some HDDs single 4KB read may take up to 4 seconds! Deeper investigation shown that many HDDs heavily prioritize sequential reads even when those are submitted with queue depth of 1. This patch addresses the latency from two sides: - by using _min_active queue depths for non-interactive requests while the interactive request(s) are active and few requests after; - by throttling it further if no interactive requests has completed while configured amount of non-interactive did. While there, I've also modified vdev_queue_class_to_issue() to give more chances to schedule at least _min_active requests to the lowest priorities. It should reduce starvation if several non-interactive processes are running same time with some interactive and I think should make possible setting of zfs_vdev_max_active to as low as 1. I've benchmarked this change with 4KB random reads from ZVOL with 16KB block size on newly written non-fragmented pool. On fragmented pool I also saw improvements, but not so dramatic. Below are log2 histograms of the random read latency in milliseconds for different devices: 4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before: 0, 0, 2, 1, 12, 21, 19, 18, 10, 15, 17, 21 after: 0, 0, 0, 24, 101, 195, 419, 250, 47, 4, 0, 0 , that means maximum latency reduction from 2s to 500ms. 4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before: 0, 0, 2, 31, 38, 28, 18, 12, 17, 20, 24, 10, 3 after: 0, 0, 55, 247, 455, 470, 412, 181, 36, 0, 0, 0, 0 , i.e. from 4s to 250ms. 1 SAS HDD SEAGATE ST14000NM0048 before: 0, 0, 29, 70, 107, 45, 27, 1, 0, 0, 1, 4, 19 after: 1, 29, 681, 1261, 676, 1633, 67, 1, 0, 0, 0, 0, 0 , i.e. from 4s to 125ms. 1 SAS SSD SEAGATE XS3840TE70014 before (microseconds): 0, 0, 0, 0, 0, 0, 0, 0, 70, 18343, 82548, 618 after: 0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844, 90 I've also measured scrub time during the test and on idle pools. On idle fragmented pool I've measured scrub getting few percent faster due to use of QD3 instead of QD2 before. On idle non-fragmented pool I've measured no difference. On busy non-fragmented pool I've measured scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #11166
*	zpool(8): fix pool-wi[sd]e typo	наб	2020-11-16	1	-1/+1
\| \| \| \| \| \|	Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11202
*	zgenhostid: accept hostid arguments equal to zero.	Érico Rolim	2020-11-14	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A common usage pattern for zgenhostid, including in the ZFS dracut module, is running it as: zgenhostid $(hostid) However, zgenhostid only accepted hostid arguments greater than 0, which meant that, when the output of hostid(1) was "00000000", zgenhostid would error out, even though 0 is a possible return value for the gethostid(3) function used by hostid(1): - On current musl libc, gethostid(3) is a stub that always returns 0. - On glibc, gethostid(3) will return 0 if /etc/hostid exists but is smaller than 4 bytes. In these cases, it makes more sense for zgenhostid to treat a value of 0 as other parts of the zfs codebase do, meaning that a hostid value couldn't be determined; therefore, it should attempt to generate a random value to write into /etc/hostid. The manpage and usage output have been updated to reflect this. Whitespace has also been fixed in the usage output. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Georgy Yakovlev <[email protected]> Reviewed-by: Andrew J. Hesford <[email protected]> Signed-off-by: Érico Rolim <[email protected]> Closes #11174 Closes #11189
*	Assertion failure when logging large output of channel program	Matthew Ahrens	2020-11-14	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The output of ZFS channel programs is logged on-disk in the zpool history, and printed by `zpool history -i`. Channel programs can use 10MB of memory by default, and up to 100MB by using the `zfs program -m` flag. Therefore their output can be up to some fraction of 100MB. In addition to being somewhat wasteful of the limited space reserved for the pool history (which for large pools is 1GB), in extreme cases this can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in `dmu_buf_hold_array_by_dnode()`. This commit limits the output size that will be logged to 1MB. Larger outputs will not be logged, instead a entry will be logged indicating the size of the omitted output. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #11194