aboutsummaryrefslogtreecommitdiffstats
path: root/cmd
Commit message (Collapse)AuthorAgeFilesLines
* Add FreeBSD code to arc_summary and arcstatRyan Moeller2019-11-303-6/+77
| | | | | | | | Adding the FreeBSD code allows arc_summary and arcstat to be used on FreeBSD. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9641
* Add display of checksums to zdb -RPaul Zuchowski2019-11-271-2/+58
| | | | | | | | | | | | | | | The function zdb_read_block (zdb -R) was always intended to have a :c flag which would read the DVA and length supplied by the user, and display the checksum. Since we don't know which checksum goes with the data, we should calculate and display them all. For each checksum in the table, read in the data at the supplied DVA:length, calculate the checksum, and display it. Update the man page and create a zfs test for the new feature. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kjeld Schouten <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #9607
* Add zfs_file_* interface, remove vnodesMatthew Macy2019-11-214-10/+11
| | | | | | | | | | | | | | | | | | Provide a common zfs_file_* interface which can be implemented on all platforms to perform normal file access from either the kernel module or the libzpool library. This allows all non-portable vnode_t usage in the common code to be replaced by the new portable zfs_file_t. The associated vnode and kobj compatibility functions, types, and macros have been removed from the SPL. Moving forward, vnodes should only be used in platform specific code when provided by the native operating system. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9556
* Remove requirement for -d 1 for zfs list and zfs get with bookmarksInsanePrawn2019-11-181-11/+12
| | | | | | | | | | | | | | df58307 removed the need to specify -d 1 when zfs list and zfs get are called with -t snapshot on a datset. This commit extends the same behaviour to -t bookmark. This commit also introduces the 'snap' shorthand for snapshots from zfs list to zfs get. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Kjeld Schouten <[email protected]> Signed-off-by: InsanePrawn <[email protected]> Closes #9589
* Isolate code specific to Linux in cmd/Ryan Moeller2019-11-114-99/+103
| | | | | | | | | | | | | Use sys.platform to choose the correct implementation of functions and values of variables for the platform being run on. Reword some comments to avoid describing implementation details in the wrong places. Reviewed-by: Kjeld Schouten <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9561
* zvol_wait should ignore redacted zvolsPavel Zakharov2019-11-061-2/+6
| | | | | | | | | | zvol_wait waits for zvol links to be created under /dev/zvol for each zvol. Links are not created for redacted zvols so we should ignore those. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pavel Zakharov <[email protected]> Closes #9545
* Prefix struct rangelockMatthew Macy2019-11-011-2/+2
| | | | | | | | | | A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. This change is a follow up to 2cc479d0. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9534
* Remove ECKSUM alias in zinjectMatthew Macy2019-11-011-2/+0
| | | | | | | | The custom ECKSUM errno is defined as appropriate by the platform specific os/linux/spl/sys/errno.h header. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9537
* Add wrapper for Linux BLKFLSBUF ioctlMatthew Macy2019-10-282-2/+2
| | | | | | | | | | FreeBSD has no analog. Buffered block devices were removed a decade plus ago. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9508
* Fix zpool history unbounded memory usageChunwei Chen2019-10-281-16/+28
| | | | | | | | | | | | In original implementation, zpool history will read the whole history before printing anything, causing memory usage goes unbounded. We fix this by breaking it into read-print iterations. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #9516
* Use zfs_ioctl() in zinject.cMatthew Macy2019-10-251-5/+5
| | | | | | | | | | | | Consistently use the `zfs_ioctl()` wrapper since `ioctl()` cannot be called directly due to differing semantics between platforms. Follow up PR to #9492. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9507
* Remove gratuitous Linux only include in ztest & zdbMatthew Macy2019-10-192-2/+0
| | | | | | | | We don't need to include stdio_ext.h Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9483
* Detect if sed supports --in-placeRyan Moeller2019-10-162-2/+2
| | | | | | | | | | | | | | Not all versions of sed have the --in-place flag. Detect support for the flag during ./configure and provide a fallback mechanism for those systems where sed's behavior differs. The autoconf variable ${ac_inplace} can be used to choose the correct flags for editing a file in place with sed. Replace violating usages in Makefile.am with ${ac_inplace}. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9463
* Fix strdup conflict on other platformsMatthew Macy2019-10-101-1/+1
| | | | | | | | | | | | | | | | In the FreeBSD kernel the strdup signature is: ``` char *strdup(const char *__restrict, struct malloc_type *); ``` It's unfortunate that the developers have chosen to change the signature of libc functions - but it's what I have to deal with. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9433
* Reduce loaded range tree memory usagePaul Dagnelie2019-10-091-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements a new tree structure for ZFS, and uses it to store range trees more efficiently. The new structure is approximately a B-tree, though there are some small differences from the usual characterizations. The tree has core nodes and leaf nodes; each contain data elements, which the elements in the core nodes acting as separators between its children. The difference between core and leaf nodes is that the core nodes have an array of children, while leaf nodes don't. Every node in the tree may be only partially full; in most cases, they are all at least 50% full (in terms of element count) except for the root node, which can be less full. Underfull nodes will steal from their neighbors or merge to remain full enough, while overfull nodes will split in two. The data elements are contained in tree-controlled buffers; they are copied into these on insertion, and overwritten on deletion. This means that the elements are not independently allocated, which reduces overhead, but also means they can't be shared between trees (and also that pointers to them are only valid until a side-effectful tree operation occurs). The overhead varies based on how dense the tree is, but is usually on the order of about 50% of the element size; the per-node overheads are very small, and so don't make a significant difference. The trees can accept arbitrary records; they accept a size and a comparator to allow them to be used for a variety of purposes. The new trees replace the AVL trees used in the range trees today. Currently, the range_seg_t structure contains three 8 byte integers of payload and two 24 byte avl_tree_node_ts to handle its storage in both an offset-sorted tree and a size-sorted tree (total size: 64 bytes). In the new model, the range seg structures are usually two 4 byte integers, but a separate one needs to exist for the size-sorted and offset-sorted tree. Between the raw size, the 50% overhead, and the double storage, the new btrees are expected to use 8*1.5*2 = 24 bytes per record, or 33.3% as much memory as the AVL trees (this is for the purposes of storing metaslab range trees; for other purposes, like scrubs, they use ~50% as much memory). We reduced the size of the payload in the range segments by teaching range trees about starting offsets and shifts; since metaslabs have a fixed starting offset, and they all operate in terms of disk sectors, we can store the ranges using 4-byte integers as long as the size of the metaslab divided by the sector size is less than 2^32. For 512-byte sectors, this is a 2^41 (or 2TB) metaslab, which with the default settings corresponds to a 256PB disk. 4k sector disks can handle metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not anticipate disks of this size in the near future, there should be almost no cases where metaslabs need 64-byte integers to store their ranges. We do still have the capability to store 64-byte integer ranges to account for cases where we are storing per-vdev (or per-dnode) trees, which could reasonably go above the limits discussed. We also do not store fill information in the compact version of the node, since it is only used for sorted scrub. We also optimized the metaslab loading process in various other ways to offset some inefficiencies in the btree model. While individual operations (find, insert, remove_from) are faster for the btree than they are for the avl tree, remove usually requires a find operation, while in the AVL tree model the element itself suffices. Some clever changes actually caused an overall speedup in metaslab loading; we use approximately 40% less cpu to load metaslabs in our tests on Illumos. Another memory and performance optimization was achieved by changing what is stored in the size-sorted trees. When a disk is heavily fragmented, the df algorithm used by default in ZFS will almost always find a number of small regions in its initial cursor-based search; it will usually only fall back to the size-sorted tree to find larger regions. If we increase the size of the cursor-based search slightly, and don't store segments that are smaller than a tunable size floor in the size-sorted tree, we can further cut memory usage down to below 20% of what the AVL trees store. This also results in further reductions in CPU time spent loading metaslabs. The 16KiB size floor was chosen because it results in substantial memory usage reduction while not usually resulting in situations where we can't find an appropriate chunk with the cursor and are forced to use an oversized chunk from the size-sorted tree. In addition, even if we do have to use an oversized chunk from the size-sorted tree, the chunk would be too small to use for ZIL allocations, so it isn't as big of a loss as it might otherwise be. And often, more small allocations will follow the initial one, and the cursor search will now find the remainder of the chunk we didn't use all of and use it for subsequent allocations. Practical testing has shown little or no change in fragmentation as a result of this change. If the size-sorted tree becomes empty while the offset sorted one still has entries, it will load all the entries from the offset sorted tree and disregard the size floor until it is unloaded again. This operation occurs rarely with the default setting, only on incredibly thoroughly fragmented pools. There are some other small changes to zdb to teach it to handle btrees, but nothing major. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed by: Sebastien Roy [email protected] Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9181
* OpenZFS restructuring - libzutilMatthew Macy2019-10-031-5/+5
| | | | | | | | Factor Linux specific functionality out of libzutil. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9356
* OpenZFS restructuring - libsplMatthew Macy2019-10-0216-97/+5
| | | | | | | | | Factor Linux specific pieces out of libspl. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9336
* OpenZFS restructuring - zpoolMatthew Macy2019-09-306-352/+428
| | | | | | | | | | | Factor Linux specific functions out of the zpool command. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9333
* Adding slack notifierBen McGough2019-09-262-0/+86
| | | | | | | | | | Allow ZED notification via slack incoming webhook. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Ben McGough <[email protected]> Closes #9076 Closes #9350
* Use signed types to prevent subtraction overflowRyan Moeller2019-09-221-4/+4
| | | | | | | | | | | | | | | | | | The difference between the sizes could be positive or negative. Leaving the types as unsigned means the result overflows when the difference is negative and removing the labs() means we'll have introduced a bug. The subtraction results in the correct value when the unsigned integer is interpreted as a signed integer by labs(). Clang doesn't see that we're doing a subtraction and abusing the types. It sees the result of the subtraction, an unsigned value, being passed to an absolute value function and emits a warning which we treat as an error. Reviewed by: Youzhong Yang <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9355
* Refactor libzfs_error_init newlinesRyan Moeller2019-09-184-4/+4
| | | | | | | | | Move the trailing newlines from the error message strings to the format strings to more closely match the other error messages. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9330
* Add subcommand to wait for background zfs activity to completeJohn Gallagher2019-09-131-42/+517
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #9162
* Canonicalize Python shebangsRyan Moeller2019-09-126-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | /usr/bin/env python3 is the suggested[1] shebang for Python in general (likewise for python2) and is conventional across platforms. This eases development on systems where python is not installed in /usr/bin (FreeBSD for example) and makes it possible to develop in virtual environments (venv) for isolating dependencies. Many packaging guidelines discourage the use of /usr/bin/env, but since this is the canonical way of writing shebangs in the Python community, many packaging scripts are already equipped to handle substituting the appropriate absolute path to python automatically. Some RPM package builders lacking brp-mangle-shebangs need a small fallback mechanism in the package spec to stamp the appropriate shebang on installed Python scripts. [1]: https://docs.python.org/3/using/unix.html?#miscellaneous Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9314
* Add/generalize abstractions in arc_summary3Ryan Moeller2019-09-101-50/+62
| | | | | | | | | | | | | Code for interfacing with procfs for kstats and tunables is Linux- specific. A more generic interface can be used for the abstractions of loading kstats and various tunable parameters, allowing other platforms to implement the functions cleanly. In a similar vein, determining the ZFS/SPL version can be abstracted away in order for other platforms to provide their own implementations of this function. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9279
* Add/generalize abstraction in arc_summary2Ryan Moeller2019-09-101-29/+25
| | | | | | | | | | | | A more generic interface can be used for the abstraction of loading kstats, allowing other platforms to implement the function cleanly. In a similar vein, loading tunables can be abstracted away in order for other platforms to provide their own implementations of this function. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9277
* Fix zpool subcommands error message with some unsupported optionsloli10K2019-09-041-4/+2
| | | | | | | | | | | | | | | | | Both 'detach' and 'online' zpool subcommands, when provided with an unsupported option, forget to print it in the error message: # zpool online -t rpool vda3 invalid option '' usage: online [-e] <pool> <device> ... This changes fixes the error message in order to include the actual option that is not supported. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9270
* zvol_wait script should ignore partially received zvolsPavel Zakharov2019-09-031-2/+21
| | | | | | | | | Partially received zvols won't have links in /dev/zvol. Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pavel Zakharov <[email protected]> Closes #9260
* Fix typos in cmd/Andrea Gelmini2019-08-3015-26/+26
| | | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #9234
* Fix automake program name transformationsRyan Moeller2019-08-201-2/+8
| | | | | | | | | Automake can perform program name transformations at install time. However, arc_summary has its own name transformation taking place, which interferes with the automake transforms. The automake transforms must be taken into account in order to resolve the conflict. Signed-off-by: Ryan Moeller <[email protected]>
* spa_load_verify() may consume too much memoryGeorge Wilson2019-08-131-7/+8
| | | | | | | | | | | | | | | | | | | | When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue too many I/Os that it consumes all the memory of the system resulting in an OOM message or a hang. To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: George Wilson [email protected] External-issue: DLPX-65237 External-issue: DLPX-65238 Closes #9146
* Metaslab max_size should be persisted while unloadedPaul Dagnelie2019-08-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we unload metaslabs today in ZFS, the cached max_size value is discarded. We instead use the histogram to determine whether or not we think we can satisfy an allocation from the metaslab. This can result in situations where, if we're doing I/Os of a size not aligned to a histogram bucket, a metaslab is loaded even though it cannot satisfy the allocation we think it can. For example, a metaslab with 16 entries in the 16k-32k bucket may have entirely 16kB entries. If we try to allocate a 24kB buffer, we will load that metaslab because we think it should be able to handle the allocation. Doing so is expensive in CPU time, disk reads, and average IO latency. This is exacerbated if the write being attempted is a sync write. This change makes ZFS cache the max_size after the metaslab is unloaded. If we ever get a free (or a coalesced group of frees) larger than the max_size, we will update it. Otherwise, we leave it as is. When attempting to allocate, we use the max_size as a lower bound, and respect it unless we are in try_hard. However, we do age the max_size out at some point, since we expect the actual max_size to increase as we do more frees. A more sophisticated algorithm here might be helpful, but this works reasonably well. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9055
* zed crashes when devid not presentMatthew Ahrens2019-07-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The gs_devid is NULL, but the nvl has a "devid" entry. zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is present in nvl, but then later it and zfs_agent_iter_vdev() assume that DEV_IDENTIFIER is present and thus gs_devid is set. Typically this is not a problem because usually either all vdevs have devid's, or none of them do. Since zfs_agent_iter_vdev() first checks if the vdev has devid before dereferencing gs_devid, the problem isn't typically encountered. However, if some vdevs have devid's and some do not, then the problem is easily reproduced. This can happen if the pool has been moved from a system that has devid's to one that does not. The fix is for zfs_agent_iter_vdev() to only try to match the devid's if both nvl and gsp have devid's present. Reviewed-by: Prashanth Sreenivasa <[email protected]> Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-65090 Closes #9054 Closes #9060
* Fast Clone DeletionSara Hartse2019-07-261-63/+267
| | | | | | | | | | | | | | | | | | | | | Deleting a clone requires finding blocks are clone-only, not shared with the snapshot. This was done by traversing the entire block tree which results in a large performance penalty for sparsely written clones. This is new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it’s time to delete, the clone-specific blocks are already at hand. We see performance improvements because now deletion work is proportional to the number of clone-modified blocks, not the size of the original dataset. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Sara Hartse <[email protected]> Closes #8416
* Race condition between spa async threads and exportSerapheim Dimitropoulos2019-07-181-1/+17
| | | | | | | | | | | | | | | | | | In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue #9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9015 Closes #9044
* zdb: don't print log spacemap stats in pools without the featureSerapheim Dimitropoulos2019-07-181-0/+6
| | | | | | | | | | | | | | | | | | | | Creating a pool with not features enabled and running `zdb -mmmmmm on` it before the patch: ``` Log Space Maps in Pool: Log Space Map Obsolete Entry Statistics: 0 valid entries out of 0 - txg 0 0 valid entries out of 0 - total ``` After this patch the above output goes away. Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9048
* New service that waits on zvol links to be createdPavel Zakharov2019-07-173-1/+95
| | | | | | | | | | | | | | | | The zfs-volume-wait.service scans existing zvols and waits for their links under /dev to be created. Any service that depends on zvol links to be there should add a dependency on zfs-volumes.target. By default, this target is not enabled. Reviewed-by: Fabian Grünbichler <[email protected]> Reviewed-by: Antonio Russo <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: John Gallagher <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pavel Zakharov <[email protected]> Closes #8975
* Add zfs create dryrunMike Gerdts2019-07-161-22/+92
| | | | | | | | | | | | | | | | | | | Adds the ability to sanity check zfs create arguments and to see the value of any additional properties that will local to the dataset. For example, automation that may need to adjust quota on a parent filesystem before creating a volume may call `zfs create -nP -V <size> <volume>` to obtain the value of refreservation. This adds the following options to zfs create: - -n dry-run (no-op) - -v verbose - -P parseable (implies verbose) Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Jerry Jelinek <[email protected]> Signed-off-by: Mike Gerdts <[email protected]> Closes #8974
* Log Spacemap ProjectSerapheim Dimitropoulos2019-07-162-56/+370
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent db587941c5ff6dea01932bb78f70db63cf7f38ba Update vdev_is_spacemap_addressable() for new spacemap encoding 419ba5914552c6185afbe1dd17b3ed4b0d526547 Simplify spa_sync by breaking it up to smaller functions 8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834 Factor metaslab_load_wait() in metaslab_load() b194fab0fb6caad18711abccaff3c69ad8b3f6d3 Rename range_tree_verify to range_tree_verify_not_present df72b8bebe0ebac0b20e0750984bad182cb6564a Change target size of metaslabs from 256GB to 16GB c853f382db731e15a87512f4ef1101d14d778a55 zdb -L should skip leak detection altogether 21e7cf5da89f55ce98ec1115726b150e19eefe89 vs_alloc can underflow in L2ARC vdevs 7558997d2f808368867ca7e5234e5793446e8f3f Simplify log vdev removal code 6c926f426a26ffb6d7d8e563e33fc176164175cb Get rid of space_map_update() for ms_synced_length 425d3237ee88abc53d8522a7139c926d278b4b7f Introduce auxiliary metaslab histograms 928e8ad47d3478a3d5d01f0dd6ae74a9371af65e Error path in metaslab_load_impl() forgets to drop ms_sync_lock 8eef997679ba54547f7d361553d21b3291f41ae7 = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #8442
* Enable zfs-mount-generator by defaultAntonio Russo2019-07-151-0/+1
| | | | | | | | | Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Fabian Grünbichler <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #8750 Closes #8848
* systemd encryption key supportAntonio Russo2019-07-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | Modify zfs-mount-generator to produce a dependency on new zfs-import-key-*.service units, dynamically created at boot to call zfs load-key for the encryption root, before attempting to mount any encrypted datasets. These units are created by zfs-mount-generator, and RequiresMountsFor on the keyfile, if present, or call systemd-ask-password if a passphrase is requested. This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and @rlaager, as well an adaptation of @rlaager's script to retry on incorrect password entry. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Fabian Grünbichler <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #8750 Closes #8848
* Linux 5.0 compat: SIMD compatibilityBrian Behlendorf2019-07-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0 and newer kernels. This is accomplished by leveraging the fact that by definition dedicated kernel threads never need to concern themselves with saving and restoring the user FPU state. Therefore, they may use the FPU as long as we can guarantee user tasks always restore their FPU state before context switching back to user space. For the 5.0 and 5.1 kernels disabling preemption and local interrupts is sufficient to allow the FPU to be used. All non-kernel threads will restore the preserved user FPU state. For 5.2 and latter kernels the user FPU state restoration will be skipped if the kernel determines the registers have not changed. Therefore, for these kernels we need to perform the additional step of saving and restoring the FPU registers. Invalidating the per-cpu global tracking the FPU state would force a restore but that functionality is private to the core x86 FPU implementation and unavailable. In practice, restricting SIMD to kernel threads is not a major restriction for ZFS. The vast majority of SIMD operations are already performed by the IO pipeline. The remaining cases are relatively infrequent and can be handled by the generic code without significant impact. The two most noteworthy cases are: 1) Decrypting the wrapping key for an encrypted dataset, i.e. `zfs load-key`. All other encryption and decryption operations will use the SIMD optimized implementations. 2) Generating the payload checksums for a `zfs send` stream. In order to avoid making any changes to the higher layers of ZFS all of the `*_get_ops()` functions were updated to take in to consideration the calling context. This allows for the fastest implementation to be used as appropriate (see kfpu_allowed()). The only other notable instance of SIMD operations being used outside a kernel thread was at module load time. This code was moved in to a taskq in order to accommodate the new kernel thread restriction. Finally, a few other modifications were made in order to further harden this code and facilitate testing. They include updating each implementations operations structure to be declared as a constant. And allowing "cycle" to be set when selecting the preferred ops in the kernel as well as user space. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8754 Closes #8793 Closes #8965
* zfs send does not handle invalid input gracefullyloli10K2019-07-081-1/+5
| | | | | | | | | | | Due to some changes introduced in 30af21b 'zfs send' can crash when provided with invalid inputs: this change attempts to add more checks to the affected code paths. Reviewed-by: Attila Fülöp <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9001
* Fix zfs "redact" misc issuesloli10K2019-07-051-9/+9
| | | | | | | | | | | * zfs redact error messages do not end with newline character * 30af21b0 inadvertently removed some ZFS_PROP comments * man/zfs: zfs redact <redaction_snapshot> is not optional Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8988
* OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocksMike Gerdts2019-07-051-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a volume is created in a pool with raidz vdevs and volblocksize != 128k, the volume can reference more space than is reserved with the automatically calculated refreservation. There are two deficiencies in vol_volsize_to_reservation that contribute to this: 1) Skip blocks may be added to keep each allocation a multiple of parity + 1. This is the dominating factor when volblocksize is close to 2^ashift. 2) raidz deflation for 128 KB blocks is different for most other block sizes. See "The theory of raidz space accounting" comment in libzfs_dataset.c for a full explanation. Authored by: Mike Gerdts <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Sanjay Nadkarni <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Kody Kantor <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Mike Gerdts <[email protected]> Porting Notes: * ZTS: wait for zvols to exist before writing * ZTS: use log_must_busy with {zpool|zfs} destroy OpenZFS-issue: https://www.illumos.org/issues/9318 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0 Closes #8973
* Add 'zfs umount -u' for encrypted datasetsTom Caputi2019-06-281-5/+8
| | | | | | | | | | | This patch adds the ability for the user to unload keys for datasets as they are being unmounted. This is analogous to 'zfs mount -l'. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes: #8917 Closes: #8952
* zdb -vvvvv on ztest pool dies with "out of memory"Paul Dagnelie2019-06-251-6/+20
| | | | | | | | | | | | | | | ztest creates some extremely large files as part of its operation. When zdb tries to dump a large enough file, it can run out of memory or spend an extremely long time attempting to print millions or billions of uint64_ts. We cap the amount of data from a uint64 object that we are willing to read and print. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> External-issue: DLPX-53814 Closes #8947
* Redacted Send/Receive causes zdb to dump coreloli10K2019-06-241-1/+1
| | | | | | | | | When used with verbosity >= 4 zdb fails an assertion in dump_bookmarks() because it expects snprintf() to retun 0 on success. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8948
* Remove code for zfs remapMatthew Ahrens2019-06-241-78/+0
| | | | | | | | | | | | | | | | The "zfs remap" command was disabled by 6e91a72fe3ff8bb282490773bd687632f3e8c79d, because it has little utility and introduced some tricky bugs. This commit removes the code for it, the associated ZFS_IOC_REMAP ioctl, and tests. Note that the ioctl and property will remain, but have no functionality. This allows older software to fail gracefully if it attempts to use these, and avoids a backwards incompatibility that would be introduced if we renumbered the later ioctls/props. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8944
* Fix out-of-tree build failuresBrian Behlendorf2019-06-242-55/+59
| | | | | | | | | | | | | | | | | | | | Resolve the incorrect use of srcdir and builddir references for various files in the build system. These have crept in over time and went unnoticed because when building in the top level directory srcdir and builddir are identical. With this change it's again possible to build in a subdirectory. $ mkdir obj $ cd obj $ ../configure $ make Reviewed-by: loli10K <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8921 Closes #8943
* Let zfs mount all tolerate in-progress mountsDon Brady2019-06-221-1/+18
| | | | | | | | | | | | | | | | | | | The zfs-mount service can unexpectedly fail to start when zfs encounters a mount that is in progress. This service uses zfs mount -a, which has a window between the time it checks if the dataset was mounted and when the actual mount (via mount.zfs binary) occurs. The reason for the racing mounts is that both zfs-mount.target and zfs-share.target are allowed to execute concurrently after the import. This is more of an issue with the relatively recent addition of parallel mounting, and we should consider serializing the mount and share targets. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #8881