Commit message  Author  Age  Files  Lines
* Replace strchrnul() with strrchr()Jorgen Lundman2021-09-141-1/+3
This one could have gone either way: add strchrnul() to the macOS/Windows SPL, or return to the "classic" usage with strrchr(). Since the newer strchrnul() approach is used only once, this commit takes the second option.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Jorgen Lundman <[email protected]>
Closes #12312
* FreeBSD: Use unmapped I/O for scattered/gang ABD buffersAlexander Motin2021-09-141-10/+113
Many FreeBSD disk drivers support an "unmapped" I/O mode, in which the data buffer is represented not by a virtually contiguous KVA-mapped address range but by a list of physical memory pages. It was originally designed for I/O from buffers without a KVA mapping (unmapped). But taking virtual addresses out of the equation also lets us operate on non-contiguous data buffers, with one condition: all buffer discontinuities must be aligned to memory page borders.

When doing I/O to a capable GEOM device, this patch traverses non-linear ABD buffers and validates the chunk borders. If the condition is met, it supplies GEOM with the list of original physical memory pages instead of copying the data into a temporary contiguous buffer. On capable hardware, on pools with ashift=12 and the default ABD chunk size of 4KB, it should handle all I/O without additional memory copying.

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #12320
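The alignment condition described in this message can be sketched as a small check. This is only an illustration, not the code from the patch: `struct sketch_chunk`, `SKETCH_PAGE_SIZE`, and both function names are hypothetical, and a fixed 4096-byte page is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Hypothetical description of one chunk of a non-linear buffer. */
struct sketch_chunk {
	uintptr_t addr;	/* start address of the chunk */
	size_t len;	/* length of the chunk in bytes */
};

static bool
sketch_page_aligned(uintptr_t v)
{
	return ((v % SKETCH_PAGE_SIZE) == 0);
}

/*
 * Return true if every discontinuity between consecutive chunks falls on
 * a page border. Chunks that happen to be contiguous have no
 * discontinuity and are accepted as they are.
 */
static bool
sketch_can_use_unmapped(const struct sketch_chunk *c, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		uintptr_t end = c[i].addr + c[i].len;
		uintptr_t next = c[i + 1].addr;

		if (end == next)
			continue;
		if (!sketch_page_aligned(end) || !sketch_page_aligned(next))
			return (false);
	}
	return (true);
}
```

A buffer whose first chunk starts mid-page still passes, because only the borders between chunks are checked; a chunk that ends mid-page before a gap does not.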
* FreeBSD: Hardcode abd_chunk_size to PAGE_SIZEAlexander Motin2021-09-142-79/+51
It makes no sense to set it below PAGE_SIZE, since that increases all overheads and makes returning memory to the OS problematic.

It makes no sense to set it above PAGE_SIZE, since such allocations, and especially frees, are too expensive, and cause too much KVA fragmentation, to benefit from having fewer chunks.

After that it makes no sense to keep the more complicated math here.

What may make sense, though, is a tunable border between linear and scatter ABDs, which was previously also controlled by this tunable. Retain that functionality by taking the abd_scatter_min_size tunable from Linux, just with a different default value.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #12328
* Move gethrtime() calls out of vdev queue lockAlexander Motin2021-09-141-6/+5
This dramatically reduces the lock contention on systems with slower (non-TSC) timecounters. With TSC the difference is minimal, but since this lock is pretty congested, any improvement counts. Plus I don't see any reason to do it under the lock other than the latency of the lock itself, which this change actually reduces.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #12281
* Use substantially more robust program exit status logic in zvol_idJustin Gottula2021-09-141-23/+27
Currently, there are several places in zvol_id where the program logic returns particular errno values, or even particular ioctl return values, as the program exit status, rather than a straightforward system of explicit zero on success and explicit nonzero value(s) on failure.

This is problematic for multiple reasons. One particularly interesting problem that can arise is that if any of these values happens to have all 8 least significant bits unset (i.e., it is a positive or negative multiple of 256), then although the C program sees a nonzero int value (presumed to be a failure exit status), the actual exit status as seen by the system is only the bottom 8 bits of that integer: zero.

This can happen in practice, and I have encountered it myself. In a particularly weird situation, the zvol_open code in the zfs kernel module was behaving in such a manner that it caused the open() syscall to fail and errno to be set to a kernel-private value (ERESTARTSYS, which happens to be defined as 512). It turns out that 512 is evenly divisible by 256; or, in other words, its least significant 8 bits are all zero. So even though zvol_id believed it was returning a nonzero (failure) exit status of 512, the system reduced that value modulo 256, so the actual exit status visible to other programs was 0! This actually-zero (non-failure) exit status caused problems: udev believed that the program was operating successfully, when in fact it was attempting to indicate failure via a nonzero exit status integer. Combined with another problem, this led to the creation of nonsense symlinks for zvol dev nodes by udev.

Let's get rid of all this problematic logic, and simply return EXIT_SUCCESS (0) if everything went fine, and EXIT_FAILURE (1) if anything went wrong.

Additionally, let's clarify some of the variable names (error is similar to errno, etc.) and clean up the overall program flow a bit.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Signed-off-by: Justin Gottula <[email protected]>
Closes #12302
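The truncation described in this message is easy to reproduce in isolation. The sketch below is not taken from zvol_id; both function names are hypothetical. It models what a parent process sees of a normal exit status (the low 8 bits) and the kind of mapping the commit switches to.

```c
#include <stdlib.h>

/* The low 8 bits are all a parent process sees of a normal exit status. */
static int
sketch_visible_exit_status(int status)
{
	return ((int)((unsigned int)status & 0xffu));
}

/* Hypothetical helper: collapse any internal error code to a safe status. */
static int
sketch_safe_exit_status(int error)
{
	return (error == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
}
```

With these definitions, an internal error code of 512 is visible as 0 when returned directly, but stays nonzero once it has been passed through `sketch_safe_exit_status()`.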
* Print zvol_id error messages to stderr rather than stdoutJustin Gottula2021-09-141-4/+4
The zvol_id program is invoked by udev, via a PROGRAM key in the 60-zvol.rules.in rule file, to determine the "pretty" /dev/zvol/* symlink paths that should be generated for each opaquely named /dev/zd* dev node.

The udev rule uses the PROGRAM key, followed by a SYMLINK+= assignment containing the %c substitution, to collect the program's stdout and then "paste" it directly into the name of the symlink(s) to be created.

Unfortunately, as currently written, zvol_id outputs both its intended output (a single string representing the symlink path that should be created to refer to the name of the dataset whose /dev/zd* path is given) AND its error messages (if any) to stdout.

When processing PROGRAM keys (and others, such as IMPORT{program}), udev uses only the data written to stdout for functional purposes. Any data written to stderr is used solely for the purposes of logging (if udev's log_level is set to debug).

The unintended consequence of this is as follows: if zvol_id encounters an error condition; and then udev fails to halt processing of the current rule (either because zvol_id didn't return a nonzero exit status, or because the PROGRAM key in the rule wasn't written properly to result in a "non-match" condition that would stop the current rule on a nonzero exit); then udev will create a space-delimited list of symlink names derived directly from the words of the error message string!

I've observed this exact behavior on my own system, in a situation where the open() syscall on /dev/zd* dev nodes was failing sporadically (for reasons that aren't especially relevant here).

Because the open() call failed, zvol_id printed "Unable to open device file: /dev/zd736\n" to stdout and then exited.

The udev rule finished with SYMLINK+="zvol/%c %c". Assuming a volume name like pool/foo/bar, this would ordinarily expand to

  SYMLINK+="zvol/pool/foo/bar pool/foo/bar"

and would cause symlinks to be created like this:

  /dev/zvol/pool/foo/bar -> /dev/zd736
  /dev/pool/foo/bar -> /dev/zd736

But because of the combination of error messages being printed to stdout, and the udev syntax freely accepting a space-delimited sequence of names in this context, the error message string "Unable to open device file: /dev/zd736\n" in reality expanded to

  SYMLINK+="zvol/Unable to open device file: /dev/zd736"

which caused the following symlinks to actually be created:

  /dev/zvol/Unable -> /dev/zd736
  /dev/to -> /dev/zd736
  /dev/open -> /dev/zd736
  /dev/device -> /dev/zd736
  /dev/file: -> /dev/zd736
  /dev//dev/zd736 -> /dev/zd736

(And, because multiple zvols had open() syscall errors, multiple zvols attempted to claim several of those symlink names, resulting in numerous udev errors and timeouts and general chaos.)

This commit rectifies all this silliness by simply printing error messages to stderr, as Dennis Ritchie originally intended.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Signed-off-by: Justin Gottula <[email protected]>
Closes #12302
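To see why the error message turned into six symlinks, it helps to split the expanded SYMLINK value on spaces. The helper below is a hypothetical sketch that only counts space-separated names; it makes no claim about how udev itself tokenizes the value.

```c
#include <stddef.h>
#include <string.h>

/* Count the space-separated names in an expanded SYMLINK value. */
static size_t
sketch_count_names(const char *value)
{
	size_t count = 0;
	const char *p = value;

	for (;;) {
		p += strspn(p, " ");	/* skip separators */
		if (*p == '\0')
			break;
		p += strcspn(p, " ");	/* consume one name */
		count++;
	}
	return (count);
}
```

Applied to the two expansions quoted in the message, the intended value contains two names and the error-message value contains six, one per symlink listed above.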
* Udev rules: use match (==) rather than assign (=) for PROGRAMJustin Gottula2021-09-141-1/+1
Assignment syntax (=) can be used for the PROGRAM key. But the PROGRAM key is really a match key, not an assign key.

The internal logic used by udev to decide whether a PROGRAM key "matched" or not (which determines whether the remainder of the rule is evaluated) depends on whether the operator was OP_MATCH (==) or OP_NOMATCH (!=). [1]

The man page claims that '"=", ":=", and "+=" have the same effect as "=="' for PROGRAM keys. And, after a brief perusal, the udev source code does seem to confirm that operators other than OP_MATCH (==) or OP_NOMATCH (!=) are implicitly converted to OP_MATCH (==). [2]

But it's not entirely clear that this is definitely the case: anecdotal testing seems to indicate that when OP_ASSIGN (=) is used, the program's exit status is disregarded and the remainder of the rule is processed regardless of whether it was, in fact, a successful exit.

The bottom line here is that, if zvol_id hits some snag and returns a nonzero exit status, then we almost certainly do NOT want to continue on with the rule and use whatever the stdout contents may have been to mindlessly create /dev/zvol/* symlinks.

Therefore, let's be extra-sure and use the match (==) operator explicitly, to eliminate any possibility that udev might do the wrong thing, and ensure that a nonzero exit status will definitely short-circuit the rest of the rule, bypassing the SYMLINK+= assignments.

[1] udev, file src/udev/udev-rules.c, func udev_rule_apply_token_to_event,
    switch case TK_M_PROGRAM if r != 0 (nonzero exit status):
      return token->op == OP_NOMATCH;
    switch case TK_M_PROGRAM if r == 0 (zero exit status):
      return token->op == OP_MATCH;
    func retval 0 => key is considered to have matched
    func retval 1 => key is considered to have NOT matched

[2] udev, file src/udev/udev-rules.c, func parse_token,
    at func start:
      bool is_match = IN_SET(op, OP_MATCH, OP_NOMATCH);
    in else-if case streq(key, "PROGRAM"):
      if (!is_match) op = OP_MATCH;

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Pavel Zakharov <[email protected]>
Signed-off-by: Justin Gottula <[email protected]>
Closes #12302
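The two footnotes can be combined into a small model of the intended behaviour. This is a simplified sketch, not udev code: the `sketch_` enum and functions are hypothetical stand-ins, and the second function returns true when evaluation of the rule should continue.

```c
#include <stdbool.h>

/* Simplified stand-ins for the operator types named in the footnotes. */
enum sketch_op { SKETCH_OP_MATCH, SKETCH_OP_NOMATCH, SKETCH_OP_ASSIGN };

/* Footnote [2]: a non-match operator on PROGRAM is converted to OP_MATCH. */
static enum sketch_op
sketch_normalize_program_op(enum sketch_op op)
{
	bool is_match = (op == SKETCH_OP_MATCH || op == SKETCH_OP_NOMATCH);

	return (is_match ? op : SKETCH_OP_MATCH);
}

/* Footnote [1]: does the rule continue, given the program's exit status? */
static bool
sketch_program_key_continues(enum sketch_op op, int exit_status)
{
	if (exit_status != 0)
		return (op == SKETCH_OP_NOMATCH);
	return (op == SKETCH_OP_MATCH);
}
```

Under this model, `=` and `==` behave identically and a nonzero exit status stops the rule. The commit uses `==` explicitly because the anecdotal testing described above suggests real behaviour may not follow the model.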
* Udev rules: replace deprecated $tempnode with $devnodeJustin Gottula2021-09-141-1/+1
| | | | | | | | | | | | | | The $tempnode substitution is so old that it's not even mentioned in the man page anymore. It is still technically supported by udev, but with plenty of "deprecated" comments surrounding it. The preferred modern equivalent of $tempnode is $devnode (or alternatively, %N). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Justin Gottula <[email protected]> Closes #12302
* Udev rules: use non-ancient comma syntaxJustin Gottula2021-09-141-1/+1
| | | | | | | | | | | This file is old as dirt. It's entirely possible that commas were optional in udev back at that time. But they're definitely supposed to be there nowadays. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Justin Gottula <[email protected]> Closes #12302
* Compact dbuf/buf hashes and lock arraysAlexander Motin2021-09-143-24/+11
With the default dbuf cache size of 1/32 of ARC, it makes no sense to have a hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming a higher I/O rate via the dbuf cache than via ARC.

Remove padding from the ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would make sense if there were a limited number of very busy locks. But since we have no limit on the number, by using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level.

Reduce the number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but the reduced memory size improves the cache hit rate for mutex_tryenter() in the ARC eviction thread, saving about 1% of the thread time.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #12289
* Fix abd leak, kmem_free correct size of abd_tJorgen Lundman2021-09-144-6/+10
Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory.

Additionally:

  * Switch abd_cache arena to FIRSTFIT, which empirically improves performance.
  * Make abd_chunk_cache more performant and debuggable.
  * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap.
  * Don't try to reap non-existent qcaches in abd_cache arena.
  * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Jorgen Lundman <[email protected]>
Co-authored-by: Sean Doran <[email protected]>
Closes #12295
* Upstream: dmu_zfetch_stream_fini leaks refcountJorgen Lundman2021-09-141-0/+2
| | | | | | | | | | | dmu_zfetch_stream_fini() is missing calls to destroy the refcounts, leaking them and the mutex inside. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #12294
* ZED: Match added disk by pool/vdev GUID if found (#12217)Ryan Moeller2021-09-146-10/+137
| | | | | | | | | | This enables ZED to auto-online vdevs that are not wholedisk managed by ZFS. Signed-off-by: Ryan Moeller <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Optimize small random numbers generationAlexander Motin2021-09-1416-45/+83
| | | | | | | | | | | | | | | | | | In all places except two spa_get_random() is used for small values, and the consumers do not require well seeded high quality values. Switch those two exceptions directly to random_get_pseudo_bytes() and optimize spa_get_random(), renaming it to random_in_range(), since it is not related to SPA or ZFS in general. On FreeBSD directly map random_in_range() to new prng32_bounded() KPI added in FreeBSD 13. On Linux and in user-space just reduce the type used to uint32_t to avoid more expensive 64bit division. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #12183
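As a rough illustration of the Linux and user-space variant described here, a bounded helper can stay entirely in 32-bit arithmetic. Everything below is an assumption made for the sketch: the xorshift32 generator stands in for the real pseudo-random source, the names are hypothetical, and the plain modulo has a small bias that this message considers acceptable for small, low-quality values.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in PRNG (xorshift32); NOT the generator used by ZFS. */
static uint32_t sketch_state = 2463534242u;

static uint32_t
sketch_next_u32(void)
{
	uint32_t x = sketch_state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	sketch_state = x;
	return (x);
}

/* Return a value in [0, range) using only 32-bit division. */
static uint32_t
sketch_random_in_range(uint32_t range)
{
	assert(range != 0);
	if (range == 1)
		return (0);
	return (sketch_next_u32() % range);
}
```

Callers that need well-seeded, high-quality values should not use a helper like this, which matches the two exceptions called out above.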
* FreeBSD: Implement xattr=saRyan Moeller2021-09-143-147/+395
| | | | | | | | | | | | | | | | | | | | FreeBSD historically has not cared about the xattr property; it was always treated as xattr=on. With xattr=on, xattrs are stored as files in a hidden xattr directory. With xattr=sa, xattrs are stored as system attributes and get cached in nvlists during xattr operations. This makes SA xattrs simpler and more efficient to manipulate. FreeBSD needs to implement the SA xattr operations for feature parity with Linux and to ensure that SA xattrs are accessible when migrated or replicated from Linux. Following the example set by Linux, refactor our existing extattr vnops to split off the parts handling dir style xattrs, and add the corresponding SA handling parts. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11997
* FreeBSD: Clean up ASSERT/VERIFY use in moduleRyan Moeller2021-09-1423-236/+231
| | | | | | | | | | | | | | | Convert use of ASSERT() to ASSERT0(), ASSERT3U(), ASSERT3S(), ASSERT3P(), and likewise for VERIFY(). In some cases it ended up making more sense to change the code, such as VERIFY on nvlist operations that I have converted to use fnvlist instead. In one place I changed an internal struct member from int to boolean_t to match its use. Some asserts that combined multiple checks with && in a single assert have been split to separate asserts, to make it apparent which check fails. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #11971
* Tag 2.1.0zfs-2.1.0b_zfs-2.1.0Brian Behlendorf2021-07-021-1/+1
| | | | Signed-off-by: Brian Behlendorf <[email protected]>
* Tag 2.1.0-rc8zfs-2.1.0-rc8Brian Behlendorf2021-06-291-1/+1
| | | | Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 5.13 compat: METABrian Behlendorf2021-06-291-1/+1
| | | | | | | | Increase the Linux-Maximum version in the META file to 5.13. All of the required compatibility patches have been merged and the 5.13 kernel has been officially released. Signed-off-by: Brian Behlendorf <[email protected]>
* zed: fix sending emails (#12292)Laurențiu Nicola2021-06-291-1/+1
| | | | | | | | Commit 6fc3099 broke the quoting when invoking the mail program, revert that change. Signed-off-by: Laurențiu Nicola <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
* Avoid 64bit division in multilist index functionsAlexander Motin2021-06-294-6/+21
The number of sublists in a multilist is relatively small. We don't need 64 bits to calculate an index. 32 bits is sufficient and makes the code more efficient.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #12288
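The idea can be shown with two versions of a hypothetical index function; the names and signatures below are illustrative and are not the multilist API. The second version truncates the hash to 32 bits before the modulo, so no 64-bit division is needed.

```c
#include <stdint.h>

/* 64-bit modulo: comparatively expensive, especially on 32-bit CPUs. */
static unsigned int
sketch_index_64(uint64_t hash, uint64_t num_sublists)
{
	return ((unsigned int)(hash % num_sublists));
}

/* 32-bit modulo: truncate the hash first, then divide. */
static unsigned int
sketch_index_32(uint64_t hash, unsigned int num_sublists)
{
	return ((unsigned int)((uint32_t)hash % num_sublists));
}
```

The two functions can return different indices for the same hash. That is fine for this purpose, since the index only needs to be stable and below the number of sublists.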
* Fix plymouth passphrase prompt with dracutMichal Vasilek2021-06-291-2/+2
| | | | | | | | | | | | | | plymouth --command splits the command on spaces which means that zfs-load-key was getting the filesystem name enclosed in single quotes (since 13c59bb76) and failing. This commit fixes it by piping the password directly to the command similar to how it's done in other scripts (initramfs, dracut without plymouth). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Michal Vasilek <[email protected]> Related-to: #9193 Related-to: #9202 Closes #12147
* Fix build with KASANRich Ercolani2021-06-291-0/+19
| | | | | | | | | | The stock zstd code expects some helpers from ASAN if present. This works fine in userland, but in kernel, KASAN also gets detected, and lacks those helpers. So let's make some empty substitutes for that case. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12232
* Help compiler optimize out abd_verify()Alexander Motin2021-06-291-2/+2
While abd_verify() does nothing when built without debug, the compiler can't optimize it out by itself due to calls to external list_*() and abd_verify_scatter(). This commit makes it explicit.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Adam Moss <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #12280
* Update cache file when setting compatibility propertyBrian Behlendorf2021-06-244-6/+107
| | | | | | | | | | | | | | | | | | | | Unlike most other properties the 'compatibility' property is stored in the pool config object and not the DMU_OT_POOL_PROPS object. This had the advantage that the compatibility information is available without needing to fully import the pool (it can be read with zdb). However, this means we need to make sure to update both the copy of the config in the MOS and the cache file. This wasn't being done. This commit adds a call to spa_async_request() to ensure the copy of the config in the cache file gets updated as well as the one stored in the pool. This same change is made for the 'comment' property which suffers from the same inconsistency. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Colm Buckley <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #12261 Closes #12276
* Fix flag copying in resume casePaul Dagnelie2021-06-241-0/+4
| | | | | | | | | A couple flags weren't being copied in the case where we're doing size estimation on a resume. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes: #12266
* zfs_metaslab_mem_limit should be 25 instead of 75jumbi772021-06-241-1/+1
| | | | | | | | | | | According to current zfs man page zfs_metaslab_mem_limit should be 25 instead of 75. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Signed-off-by: [email protected] Closes #12273
* Stop using "zstreamdump" in tests/Rich Ercolani2021-06-246-21/+20
| | | | | | | | | | zstreamdump was replaced with "zstream dump"; let's stop using the old name, compat symlink or no. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12277
* Update libera webchat client URLJonathon2021-06-241-1/+1
| | | | | | | | Libera have made a webchat client available. This change builds on #12127. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jonathon Fernyhough <[email protected]> Closes #12251
* gcc 11 cleanupAttila Fülöp2021-06-244-13/+27
| | | | | | | | | | | Compiling with gcc 11.1.0 produces three new warnings. Change the code slightly to avoid them. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes #12130 Closes #12188 Closes #12237
* ZTS: Add known exceptionsBrian Behlendorf2021-06-241-0/+3
The receive-o-x_props_override test case reliably fails on the FreeBSD main builders (but not on Linux). Until the root cause is understood, add this test to the FreeBSD exception list.

On Linux the alloc_class_012_pos test case may occasionally fail. This is a known false positive which has also been added to the Linux exception list until the test can be made entirely reliable.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #12272
* Annotated dprintf as printf-likeRich Ercolani2021-06-2434-166/+256
| | | | | | | | | | ZFS loves using %llu for uint64_t, but that requires a cast to not be noisy - which is even done in many, though not all, places. Also a couple places used %u for uint64_t, which were promoted to %llu. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12233
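A minimal sketch of what "printf-like" annotation buys, assuming a GCC- or Clang-compatible compiler: the `format` attribute lets the compiler check arguments against the format string, and the cast keeps `%llu` correct on platforms where `uint64_t` is not `unsigned long long`. The function names are hypothetical and this is not the dprintf declaration from the tree.

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical logger; the attribute makes the compiler check arguments. */
static int sketch_logf(char *buf, size_t len, const char *fmt, ...)
    __attribute__((format(printf, 3, 4)));

static int
sketch_logf(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return (n);
}

/* uint64_t is not always unsigned long long, so cast it for %llu. */
static int
sketch_log_size(char *buf, size_t len, uint64_t size)
{
	return (sketch_logf(buf, len, "size=%llu", (unsigned long long)size));
}
```

Without the cast, a compiler with format checking enabled would warn on platforms where `uint64_t` is `unsigned long`, which is the noise the message refers to.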
* Revert Consolidate arc_buf allocation checksAntonio Russo2021-06-241-44/+77
| | | | | | | | | | | | | | | | | | This reverts commit 13fac09868b4e4e08cc3ef7b937ac277c1c407b1. Per the discussion in #11531, the reverted commit---which intended only to be a cleanup commit---introduced a subtle, unintended change in behavior. Care was taken to partially revert and then reapply 10b3c7f5e4 which would otherwise have caused a conflict. These changes were squashed in to this commit. Reviewed-by: Brian Behlendorf <[email protected]> Suggested-by: @chrisrd Suggested-by: [email protected] Signed-off-by: Antonio Russo <[email protected]> Closes #11531 Closes #12227
* Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172)Alexander Motin2021-06-248-216/+778
wmsum was designed exactly for cases like these, with many updates and rare reads. It makes it possible to avoid atomic operations on congested global variables entirely.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #12172
* libspl: implement atomics in terms of atomicsнаб2021-06-218-1697/+84
This replaces the generic libspl atomic.c atomics implementation with one based on builtin gcc atomics. This functionality was added as an experimental feature in gcc 4.4. Since even CentOS 7 ships with gcc 4.8 as its default compiler today, we can make this the default.

Furthermore, the builtin atomics are as good as or better than our hand-rolled implementation, so it's reasonable to drop that custom code.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes #11904
Closes #12252
Closes #12244
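For illustration, Solaris-style atomic helpers can be written as thin wrappers over compiler builtins. The sketch below uses GCC's `__atomic` builtins; which builtin family the commit actually uses is an assumption here, and the `sketch_` names are hypothetical rather than libspl's.

```c
#include <stdint.h>

/* Add a signed delta to a 64-bit counter. */
static void
sketch_atomic_add_64(volatile uint64_t *target, int64_t delta)
{
	(void) __atomic_add_fetch(target, (uint64_t)delta, __ATOMIC_SEQ_CST);
}

/* Increment and return the new value ("_nv"). */
static uint64_t
sketch_atomic_inc_64_nv(volatile uint64_t *target)
{
	return (__atomic_add_fetch(target, 1, __ATOMIC_SEQ_CST));
}

/* Compare-and-swap; returns the value that was in *target beforehand. */
static uint64_t
sketch_atomic_cas_64(volatile uint64_t *target, uint64_t expected,
    uint64_t newval)
{
	/* On failure, `expected` is overwritten with the value seen. */
	(void) __atomic_compare_exchange_n(target, &expected, newval, 0,
	    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return (expected);
}
```

Each wrapper is a single builtin call, which is why a change like this can remove far more lines than it adds.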
* Avoid deadlock when removing L2ARC devices under I/OGeorge Amanakis2021-06-172-14/+6
If we have I/O in flight and try to remove an L2ARC device, a deadlock might occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for SCL_VDEV to be dropped while holding the hash_lock. However, spa_l2cache_load() holds SCL_ALL and waits for the hash_lock in l2arc_evict().

Fix this by moving zfs_blkptr_verify() to the top of arc_read(), before the hash_lock is taken. Verify the block pointer and return a checksum error if damaged rather than halting the system, by using BLK_VERIFY_LOG instead of BLK_VERIFY_HALT.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
Closes #12054
* systemd: import: expand $ZPOOL_IMPORT_OPTS correctlyнаб2021-06-152-2/+2
Turns out $ZPOOL_IMPORT_OPTS expands in a shell-like fashion, yielding

  'import' '-aN' '-o' 'cachefile=none'

for an unset variable, and

  'import' '-aN' '-o' 'cachefile=none' 'word1' 'word2'

for a white-spaced one, but ${ZPOOL_IMPORT_OPTS} expands like "${Z_I_O}" would in a shell, yielding

  'import' '-aN' '-o' 'cachefile=none' ''

(empty) and

  'import' '-aN' '-o' 'cachefile=none' 'word1 word2'

(spaced)

Fixes eec5ba113e1d285d445333079a3e8184872ad00a "dracut: 90zfs: respect zfs_force=1 on systemd systems"

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Closes: #12231
* vdev_draid_min_asize() ignores reserved spaceMatthew Ahrens2021-06-151-1/+2
vdev_draid_min_asize() returns the minimum size of a child vdev. This is used when determining if a disk is big enough to replace a child. It's also used by zdb to determine how big of a child to make to test replacement.

vdev_draid_min_asize() says that the child’s asize has to be at least 1/Nth of the entire draid’s asize, which is the same logic as raidz. However, this contradicts the code in vdev_draid_open(), which calculates the draid’s asize based on a reduced child size:

  An additional 32MB of scratch space is reserved at the end of each child for use by the dRAID expansion feature

So the problem is that you can replace a draid disk with one that’s vdev_draid_min_asize(), but it actually needs to be larger to accommodate the additional 32MB. The replacement is allowed and everything works at first (since the reserved space is at the end, and we don’t try to use it yet), but when you try to close and reopen the pool, vdev_draid_open() calculates a smaller asize for the draid, because of the smaller leaf, which is not allowed.

I think the confusion is that vdev_draid_min_asize() is correctly returning the amount of required *allocatable* space in a leaf, but the actual *size* of the leaf needs to be at least 32MB more than that. ztest_vdev_attach_detach() assumes that it can attach that size of device, and it actually can (the kernel/libzpool accepts it), but it then later causes zdb to not be able to open the pool.

This commit changes vdev_draid_min_asize() to return the required size of the leaf, not the size that draid will make available to the metaslab allocator.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11459
Closes #12221
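The distinction between allocatable space and leaf size comes down to one addition. The sketch below is not the vdev_draid.c code: the names are hypothetical, and rounding the 1/Nth share up is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RESERVE (32ULL << 20)	/* the 32MB reserve named above */

/* Allocatable space each child must provide: 1/Nth of the total. */
static uint64_t
sketch_min_allocatable(uint64_t draid_min_asize, uint64_t children)
{
	assert(children != 0);
	return ((draid_min_asize + children - 1) / children);
}

/* Size the leaf itself must have: allocatable space plus the reserve. */
static uint64_t
sketch_min_leaf_size(uint64_t draid_min_asize, uint64_t children)
{
	return (sketch_min_allocatable(draid_min_asize, children) +
	    SKETCH_RESERVE);
}
```

A replacement disk sized by the first function alone would be accepted at first and then fail on reopen, which is the failure mode described in this message. Sizing by the second function avoids it.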
* Do not hash unlinked inodesPaul Zuchowski2021-06-151-4/+11
| | | | | | | | | | | | | | | | In zfs_znode_alloc we always hash inodes. If the znode is unlinked, we do not need to hash it. This fixes the problem where zfs_suspend_fs is doing zrele (iput) in an async fashion, and zfs_resume_fs unlinked drain processing will try to hash an inode that could still be hashed, resulting in a panic. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alan Somers <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #9741 Closes #11223 Closes #11648 Closes #12210
* Added uncompress requirementRich Ercolani2021-06-153-0/+19
| | | | | | | | | | | | Having an old enough version of "file" and no "uncompress" program installed can cause rpmbuild as root to crash and mangle rpmdb. So let's add a build dependency for RPM-based systems. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes: #12071 Closes: #12168
* ZTS: Add zfs_clone_livelist_dedup.ksh to Makefile.amBrian Behlendorf2021-06-151-0/+1
| | | | | | | | | | Commit 86b5f4c12 added a new zfs_clone_livelist_dedup.ksh test case but didn't include it in the Makefile.am. This results in the test not being included in the dist tarball so it's never run by the CI. Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes: #12224
* Tag 2.1.0-rc7zfs-2.1.0-rc7Brian Behlendorf2021-06-101-1/+1
| | | | Signed-off-by: Brian Behlendorf <[email protected]>
* Re-embed multilist_t storageAlexander Motin2021-06-1012-104/+99
| | | | | | | | | | | | | This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcount's to different cache lines. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-by: iXsystems, Inc. Closes #12158
* dracut: 90zfs: respect zfs_force=1 on systemd systemsнаб2021-06-106-22/+30
| | | | | | | | | | On systemd systems provide an environment generator in order to respect the zfs_force=1 kernel command line option. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #11403 Closes #12195
* Remove pool io kstatsAlexander Motin2021-06-1010-276/+0
This mostly reverts the "3537 want pool io kstats" commit of 8 years ago.

On the one hand, this code, which uses pool-wide locks, became pretty bad for performance, creating significant lock contention in the I/O pipeline. On the other, there are now more efficient ways to obtain detailed statistics, while these statistics are illumos-specific and much less usable on Linux and FreeBSD, where they are reported only via procfs/sysctls.

This commit does not remove the KSTAT_TYPE_IO implementation, which may be removed later together with the already unused KSTAT_TYPE_INTR and KSTAT_TYPE_TIMER.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #12212
* Added error for writing to /dev/ on LinuxRich Ercolani2021-06-102-2/+37
| | | | | | | | | | | | Starting in Linux 5.10, trying to write to /dev/{null,zero} errors out. Prefer to inform people when this happens rather than hoping they guess what's wrong. Reviewed-by: Antonio Russo <[email protected]> Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes: #11991
* libzfs: format safetyнаб2021-06-106-65/+52
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12116
* zgenhostid.8: revisitнаб2021-06-101-27/+33
| | | | | | | Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12212
* Consistentify miscellaneous style on remaining manpagesнаб2021-06-108-64/+65
| | | | | | | | | Most notably this fixes the vdev_id(8) non-.Xrs in vdev_id.conf.5 Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12212
* Move properties, parameters, events, and concepts around manual sectionsнаб2021-06-1050-582/+543
The pages moved as follows:

  zpool-features.{5 => 7}
  spl{-module-parameters.5 => .4}
  zfs{-module-parameters.5 => .4}
  zfs-events.5 => into zpool-events.8
  zfsconcepts.{8 => 7}
  zfsprops.{8 => 7}
  zpoolconcepts.{8 => 7}
  zpoolprops.{8 => 7}

Reviewed-by: Richard Laager <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
Co-authored-by: Daniel Ebdrup Jensen <[email protected]>
Closes #12149
Closes #12212