summaryrefslogtreecommitdiffstats
path: root/lib
Commit message (Collapse)AuthorAgeFilesLines
* Remove unused headers from uu_string.cMatthew Macy2019-10-251-2/+0
| | | | | | | | The malloc.h include is gratuitous and runs in to the following error on FreeBSD: Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9509
* Move platform dependent errno aliasesMatthew Macy2019-10-253-1/+12
| | | | | | | | | EBADE, EBADR, and ENOANO do not exist on FreeBSD The libspl errno.h is similarly platform dependent. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9498
* Remove sdt.hMatthew Macy2019-10-251-0/+23
| | | | | | | | | It's mostly a noop on ZoL and it conflicts with platforms that support dtrace. Remove this header to resolve the conflict. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9497
* Fix incremental recursive encrypted receiveTom Caputi2019-10-241-5/+17
| | | | | | | | | | | | | | | | | | | | Currently, incremental recursive encrypted receives fail to work for any snapshot after the first. The reason for this is because the check in zfs_setup_cmdline_props() did not properly realize that when the user attempts to use '-x encryption' in this situation, they are not really overriding the existing encryption property and instead are attempting to prevent it from changing. This resulted in an error message stating: "encryption property 'encryption' cannot be set or excluded for raw or incremental streams". This problem is fixed by updating the logic to expect this use case. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9494
* Linux 4.14, 4.19, 5.0+ compat: SIMD save/restoreBrian Behlendorf2019-10-241-1/+2
| | | | | | | | | | | | | | | | | | | | Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #9346 Closes #9403
* Use zfs_ioctl with zfs_cmd_t in libzfsMatthew Macy2019-10-237-27/+27
| | | | | | | | | Consistently use the `zfs_ioctl()` wrapper since `ioctl()` cannot be called directly due to differing semantics between platforms. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9492
* Use platform independent error code for libzfs_run_process_implMatthew Macy2019-10-231-1/+1
| | | | | | Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9493
* Use correct format string when printing int8Matthew Macy2019-10-201-1/+1
| | | | | | | | Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9486
* OpenZFS restructuring - ARC memory pressureMatthew Macy2019-10-181-0/+1
| | | | | | | | | | | Factor Linux specific memory pressure handling out of ARC. Each platform will have different available interfaces for managing memory pressure. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9472
* Move zio_delay_interrupt to platform codeMatthew Macy2019-10-131-0/+1
| | | | | | | | FreeBSD has its own implementation as do other platforms. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9439
* Modify sharenfs=on default behaviorBrian Behlendorf2019-10-131-2/+3
| | | | | | | | | | | | While it may sometimes be convenient to export an NFS filesystem with no_root_squash it should not be the default behavior. Align the default behavior with the Linux NFS server defaults. To restore the previous behavior use 'zfs set sharenfs="no_root_squash,..."'. Reviewed-by: loli10K <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9397 Closes #9425
* Implement ZPOOL_IMPORT_UDEV_TIMEOUT_MSRichard Yao2019-10-111-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 0.7.0, zpool import would unconditionally block on udev for 30 seconds. This introduced a regression in initramfs environments that lack udev (particularly mdev based environments), yet use a zfs userland tools intended for the system that had been built against udev. Gentoo's genkernel is the main example, although custom user initramfs environments would be similarly impacted unless special builds of the ZFS userland utilities were done for them. Such environments already have their own mechanisms for blocking until device nodes are ready (such as genkernel's scandelay parameter), so it is unnecessary for zpool import to block on a non-existent udev until a timeout is reached inside of them. Rather than trying to intelligently determine whether udev is available on the system to avoid unnecessarily blocking in such environments, it seems best to just allow the environment to override the timeout. I propose that we add an environment variable called ZPOOL_IMPORT_UDEV_TIMEOUT_MS. Setting it to 0 would restore the 0.6.x behavior that was more desirable in mdev based initramfs environments. This allows the system user land utilities to be reused when building mdev-based initramfs archives. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Georgy Yakovlev <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #9436
* Reduce loaded range tree memory usagePaul Dagnelie2019-10-095-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements a new tree structure for ZFS, and uses it to store range trees more efficiently. The new structure is approximately a B-tree, though there are some small differences from the usual characterizations. The tree has core nodes and leaf nodes; each contain data elements, which the elements in the core nodes acting as separators between its children. The difference between core and leaf nodes is that the core nodes have an array of children, while leaf nodes don't. Every node in the tree may be only partially full; in most cases, they are all at least 50% full (in terms of element count) except for the root node, which can be less full. Underfull nodes will steal from their neighbors or merge to remain full enough, while overfull nodes will split in two. The data elements are contained in tree-controlled buffers; they are copied into these on insertion, and overwritten on deletion. This means that the elements are not independently allocated, which reduces overhead, but also means they can't be shared between trees (and also that pointers to them are only valid until a side-effectful tree operation occurs). The overhead varies based on how dense the tree is, but is usually on the order of about 50% of the element size; the per-node overheads are very small, and so don't make a significant difference. The trees can accept arbitrary records; they accept a size and a comparator to allow them to be used for a variety of purposes. The new trees replace the AVL trees used in the range trees today. Currently, the range_seg_t structure contains three 8 byte integers of payload and two 24 byte avl_tree_node_ts to handle its storage in both an offset-sorted tree and a size-sorted tree (total size: 64 bytes). In the new model, the range seg structures are usually two 4 byte integers, but a separate one needs to exist for the size-sorted and offset-sorted tree. Between the raw size, the 50% overhead, and the double storage, the new btrees are expected to use 8*1.5*2 = 24 bytes per record, or 33.3% as much memory as the AVL trees (this is for the purposes of storing metaslab range trees; for other purposes, like scrubs, they use ~50% as much memory). We reduced the size of the payload in the range segments by teaching range trees about starting offsets and shifts; since metaslabs have a fixed starting offset, and they all operate in terms of disk sectors, we can store the ranges using 4-byte integers as long as the size of the metaslab divided by the sector size is less than 2^32. For 512-byte sectors, this is a 2^41 (or 2TB) metaslab, which with the default settings corresponds to a 256PB disk. 4k sector disks can handle metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not anticipate disks of this size in the near future, there should be almost no cases where metaslabs need 64-byte integers to store their ranges. We do still have the capability to store 64-byte integer ranges to account for cases where we are storing per-vdev (or per-dnode) trees, which could reasonably go above the limits discussed. We also do not store fill information in the compact version of the node, since it is only used for sorted scrub. We also optimized the metaslab loading process in various other ways to offset some inefficiencies in the btree model. While individual operations (find, insert, remove_from) are faster for the btree than they are for the avl tree, remove usually requires a find operation, while in the AVL tree model the element itself suffices. Some clever changes actually caused an overall speedup in metaslab loading; we use approximately 40% less cpu to load metaslabs in our tests on Illumos. Another memory and performance optimization was achieved by changing what is stored in the size-sorted trees. When a disk is heavily fragmented, the df algorithm used by default in ZFS will almost always find a number of small regions in its initial cursor-based search; it will usually only fall back to the size-sorted tree to find larger regions. If we increase the size of the cursor-based search slightly, and don't store segments that are smaller than a tunable size floor in the size-sorted tree, we can further cut memory usage down to below 20% of what the AVL trees store. This also results in further reductions in CPU time spent loading metaslabs. The 16KiB size floor was chosen because it results in substantial memory usage reduction while not usually resulting in situations where we can't find an appropriate chunk with the cursor and are forced to use an oversized chunk from the size-sorted tree. In addition, even if we do have to use an oversized chunk from the size-sorted tree, the chunk would be too small to use for ZIL allocations, so it isn't as big of a loss as it might otherwise be. And often, more small allocations will follow the initial one, and the cursor search will now find the remainder of the chunk we didn't use all of and use it for subsequent allocations. Practical testing has shown little or no change in fragmentation as a result of this change. If the size-sorted tree becomes empty while the offset sorted one still has entries, it will load all the entries from the offset sorted tree and disregard the size floor until it is unloaded again. This operation occurs rarely with the default setting, only on incredibly thoroughly fragmented pools. There are some other small changes to zdb to teach it to handle btrees, but nothing major. Reviewed-by: George Wilson <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed by: Sebastien Roy [email protected] Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9181
* OpenZFS restructuring - libzfsMatthew Macy2019-10-0310-797/+980
| | | | | | | | | Factor Linux specific functionality out of libzfs. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Closes #9377
* OpenZFS restructuring - libzutilMatthew Macy2019-10-036-1321/+1501
| | | | | | | | Factor Linux specific functionality out of libzutil. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9356
* OpenZFS restructuring - libsplMatthew Macy2019-10-0236-307/+103
| | | | | | | | | Factor Linux specific pieces out of libspl. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9336
* OpenZFS restructuring - zpoolMatthew Macy2019-09-301-0/+15
| | | | | | | | | | | Factor Linux specific functions out of the zpool command. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9333
* Fix encryption hierarchy issues with zfs recv -dTom Caputi2019-09-251-44/+40
| | | | | | | | | | | | | | | | | | | | | Currently, the recv_fix_encryption_hierarchy() function accepts 'destsnap' as one of its parameters. Originally, this was intended to be the top-level dataset of a receive (whether or not the receive was recursive). Unfortunately, this parameter actually is simply the input that is passed in from the command line. When the user specifies 'zfs recv -d', this string is actually only the name of the receiving pool since the rest of the name is derived from the send stream. This causes the function to fail, leaving some datasets with an invalid encryption hierarchy. This patch resolves this problem by passing in the top_zfs variable instead. In order to make this work, this patch also includes some changes that ensure the value is always present when we need it. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9273 Closes #9309
* Refactor libzfs_error_init newlinesRyan Moeller2019-09-181-5/+5
| | | | | | | | | Move the trailing newlines from the error message strings to the format strings to more closely match the other error messages. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9330
* Add subcommand to wait for background zfs activity to completeJohn Gallagher2019-09-132-20/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #9162
* Enable compiler to typecheck loggingMatthew Macy2019-09-121-0/+4
| | | | | | | | | | Annotate spa logging declarations with printflike Workaround gcc bug (non disable-able warning) by replacing "" with " " Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9316
* OpenZFS restructuring - move linux tracing code to platform directoriesMatthew Macy2019-09-112-0/+2
| | | | | | | | | | | Move Linux specific tracing headers and source to platform directories and update the build system. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9290
* OpenZFS restructuring - move platform specific sourcesMatthew Macy2019-09-061-0/+1
| | | | | | | | | | | | Move platform specific Linux source under module/os/linux/ and update the build system accordingly. Additional code restructuring will follow to make the common code fully portable. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Closes #9206
* Fix noop receive of raw send streamTom Caputi2019-09-051-6/+33
| | | | | | | | | | | | | | | | | | | | Currently, the noop receive code fails to work with raw send streams and resuming send streams. This happens because zfs_receive_impl() reads the DRR_BEGIN payload without reading the payload itself. Normally, the kernel expects to read this itself, but in this case the recv_skip() code runs instead and it is not prepared to handle the stream being left at any place other than the beginning of a record. This patch resolves this issue by manually reading the DRR_BEGIN payload in the dry-run case. This patch also includes a number of small fixups in this code path. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9221 Closes #9173
* OpenZFS restructuring - move platform specific headersMatthew Macy2019-09-052-0/+449
| | | | | | | | | | | | | | | | | Move platform specific Linux headers under include/os/linux/. Update the build system accordingly to detect the platform. This lays some of the initial groundwork to supporting building for other platforms. As part of this change it was necessary to create both a user and kernel space sys/simd.h header which can be included in either context. No functional change, the source has been refactored and the relevant #include's updated. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Matthew Macy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9198
* Always refuse receving non-resume stream when resume state existsAndriy Gapon2019-09-031-4/+11
| | | | | | | | | | | | | | This fixes a hole in the situation where the resume state is left from receiving a new dataset and, so, the state is set on the dataset itself (as opposed to %recv child). Additionally, distinguish incremental and resume streams in error messages. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andriy Gapon <[email protected]> Closes #9252
* Fix typos in lib/Andrea Gelmini2019-09-0217-26/+26
| | | | | | | Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Andrea Gelmini <[email protected]> Closes #9237
* zfs_handle used after being closed/freed in change_one callbackPavel Zakharov2019-08-281-12/+18
| | | | | | | | | | | | | | | | | | | | This is a typical case of use after free. We would call zfs_close(zhp) which would free the handle, and then call zfs_iter_children() on that handle later. This change ensures that the zfs_handle is only closed when we are ready to return. Running `zfs inherit -r sharenfs pool` was failing with an error code without any error messages. After some debugging I've pinpointed the issue to be memory corruption, which would cause zfs to try to issue an ioctl to the wrong device and receive ENOTTY. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Pavel Zakharov <[email protected]> Issue #7967 Closes #9165
* Fix device expansion when VM is powered offPrakash Surya2019-08-131-25/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running on an ESXi based VM, I've found that "zpool online -e" will not expand the zpool, if the disk was expanded in ESXi while the VM was powered off. For example, take the following scenario: 1. VM running on top of VMware ESXi 2. ZFS pool created with a given device "sda" of size 8GB 3. VM powered off 4. Device "sda" size expanded to 16GB 5. VM powered on 6. "zpool online -e" used on device "sda" In this situation, after (2) the zpool will be roughly 8GB in size. After (6), the expectation is the zpool's size will expand to roughly 16GB in size; i.e. expand to the new size of the "sda" device. Unfortunately, I've seen that after (6), the zpool size does not change. What's happening is after (5), the EFI label of the "sda" device will be such that fields "efi_last_u_lba", "efi_last_lba", and "efi_altern_lba" all reflect the new size of the disk; i.e. "33554398", "33554431", and "33554431" respectively. Thus, the check that we perform in "efi_use_whole_disk": if ((efi_label->efi_altern_lba == 1) || (efi_label->efi_altern_lba >= efi_label->efi_last_lba)) { This will return true, and then we return from the function without having expanded the size of the zpool/device. In contrast, if we remove steps (3) and (5) in the sequence above, i.e. the device is expanded while the VM is powered on, things change. In that case, the fields "efi_last_u_lba" and "efi_altern_lba" do not change (i.e. they still reflect the old 8GB device size), but the "efi_last_lba" field does change (i.e. it now reflects the new 16GB device size). Thus, when we evaluate the same conditional in "efi_use_whole_disk", it'll return false, so the zpool is expanded. Taking all of this into account, this PR updates "efi_use_whole_disk" to properly expand the zpool when the underlying disk is expanded while the VM is powered off. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes #9111
* Increase default zcmd allocation to 256KMichael Niewöhner2019-07-301-1/+1
| | | | | | | | | | | | | | | | | When creating hundreds of clones (for example using containers with LXD) cloning slows down as the number of clones increases over time. The reason for this is that the fetching of the clone information using a small zcmd buffer requires two ioctl calls, one to determine the size and a second to return the data. However, this requires gathering the data twice, once to determine the size and again to populate the zcmd buffer to return it to userspace. These are expensive ioctl() calls, so instead, make the default buffer size much larger: 256K. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #9084
* Race condition between spa async threads and exportSerapheim Dimitropoulos2019-07-181-0/+5
| | | | | | | | | | | | | | | | | | In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue #9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9015 Closes #9044
* Log Spacemap ProjectSerapheim Dimitropoulos2019-07-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent db587941c5ff6dea01932bb78f70db63cf7f38ba Update vdev_is_spacemap_addressable() for new spacemap encoding 419ba5914552c6185afbe1dd17b3ed4b0d526547 Simplify spa_sync by breaking it up to smaller functions 8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834 Factor metaslab_load_wait() in metaslab_load() b194fab0fb6caad18711abccaff3c69ad8b3f6d3 Rename range_tree_verify to range_tree_verify_not_present df72b8bebe0ebac0b20e0750984bad182cb6564a Change target size of metaslabs from 256GB to 16GB c853f382db731e15a87512f4ef1101d14d778a55 zdb -L should skip leak detection altogether 21e7cf5da89f55ce98ec1115726b150e19eefe89 vs_alloc can underflow in L2ARC vdevs 7558997d2f808368867ca7e5234e5793446e8f3f Simplify log vdev removal code 6c926f426a26ffb6d7d8e563e33fc176164175cb Get rid of space_map_update() for ms_synced_length 425d3237ee88abc53d8522a7139c926d278b4b7f Introduce auxiliary metaslab histograms 928e8ad47d3478a3d5d01f0dd6ae74a9371af65e Error path in metaslab_load_impl() forgets to drop ms_sync_lock 8eef997679ba54547f7d361553d21b3291f41ae7 = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #8442
* Fix race in parallel mount's thread dispatching algorithmTomohiro Kusumi2019-07-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Strategy of parallel mount is as follows. 1) Initial thread dispatching is to select sets of mount points that don't have dependencies on other sets, hence threads can/should run lock-less and shouldn't race with other threads for other sets. Each thread dispatched corresponds to top level directory which may or may not have datasets to be mounted on sub directories. 2) Subsequent recursive thread dispatching for each thread from 1) is to mount datasets for each set of mount points. The mount points within each set have dependencies (i.e. child directories), so child directories are processed only after parent directory completes. The problem is that the initial thread dispatching in zfs_foreach_mountpoint() can be multi-threaded when it needs to be single-threaded, and this puts threads under race condition. This race appeared as mount/unmount issues on ZoL for ZoL having different timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8). `zfs unmount -a` which expects proper mount order can't unmount if the mounts were reordered by the race condition. There are currently two known patterns of input list `handles` in `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition. 1) #8833 case where input is `/a /a /a/b` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list with two same top level directories. There is a race between two POSIX threads A and B, * ThreadA for "/a" for test1 and "/a/b" * ThreadB for "/a" for test0/a and in case of #8833, ThreadA won the race. Two threads were created because "/a" wasn't considered as `"/a" contains "/a"`. 2) #8450 case where input is `/ /var/data /var/data/test` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list containing "/". There is a race between two POSIX threads A and B, * ThreadA for "/" and "/var/data/test" * ThreadB for "/var/data" and in case of #8450, ThreadA won the race. Two threads were created because "/var/data" wasn't considered as `"/" contains "/var/data"`. In other words, if there is (at least one) "/" in the input list, the initial thread dispatching must be single-threaded since every directory is a child of "/", meaning they all directly or indirectly depend on "/". In both cases, the first non_descendant_idx() call fails to correctly determine "path1-contains-path2", and as a result the initial thread dispatching creates another thread when it needs to be single-threaded. Fix a conditional in libzfs_path_contains() to consider above two. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8450 Closes #8833 Closes #8878
* zfs send does not handle invalid input gracefullyloli10K2019-07-081-0/+6
| | | | | | | | | | | Due to some changes introduced in 30af21b 'zfs send' can crash when provided with invalid inputs: this change attempts to add more checks to the affected code paths. Reviewed-by: Attila Fülöp <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9001
* OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocksMike Gerdts2019-07-051-8/+181
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a volume is created in a pool with raidz vdevs and volblocksize != 128k, the volume can reference more space than is reserved with the automatically calculated refreservation. There are two deficiencies in vol_volsize_to_reservation that contribute to this: 1) Skip blocks may be added to keep each allocation a multiple of parity + 1. This is the dominating factor when volblocksize is close to 2^ashift. 2) raidz deflation for 128 KB blocks is different for most other block sizes. See "The theory of raidz space accounting" comment in libzfs_dataset.c for a full explanation. Authored by: Mike Gerdts <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Sanjay Nadkarni <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Kody Kantor <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Mike Gerdts <[email protected]> Porting Notes: * ZTS: wait for zvols to exist before writing * ZTS: use log_must_busy with {zpool|zfs} destroy OpenZFS-issue: https://www.illumos.org/issues/9318 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0 Closes #8973
* Fix error text for EINVAL in zfs_receive_one()Tom Caputi2019-07-021-2/+3
| | | | | | | | | | This small patch fixes the EINVAL case for zfs_receive_one(). A missing 'else' has been added to the two possible cases, which will ensure the intended error message is printed. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8977
* Add 'zfs umount -u' for encrypted datasetsTom Caputi2019-06-281-1/+27
| | | | | | | | | | | This patch adds the ability for the user to unload keys for datasets as they are being unmounted. This is analogous to 'zfs mount -l'. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes: #8917 Closes: #8952
* Remove code for zfs remapMatthew Ahrens2019-06-242-40/+0
| | | | | | | | | | | | | | | | The "zfs remap" command was disabled by 6e91a72fe3ff8bb282490773bd687632f3e8c79d, because it has little utility and introduced some tricky bugs. This commit removes the code for it, the associated ZFS_IOC_REMAP ioctl, and tests. Note that the ioctl and property will remain, but have no functionality. This allows older software to fail gracefully if it attempts to use these, and avoids a backwards incompatibility that would be introduced if we renumbered the later ioctls/props. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8944
* Fix error message on promoting encrypted datasetTom Caputi2019-06-241-0/+10
| | | | | | | | | This patch corrects the error message reported when attempting to promote a dataset outside of its encryption root. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8905 Closes #8935
* OpenZFS 9425 - channel programs can be interruptedDon Brady2019-06-221-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem Statement ================= ZFS Channel program scripts currently require a timeout, so that hung or long-running scripts return a timeout error instead of causing ZFS to get wedged. This limit can currently be set up to 100 million Lua instructions. Even with a limit in place, it would be desirable to have a sys admin (support engineer) be able to cancel a script that is taking a long time. Proposed Solution ================= Make it possible to abort a channel program by sending an interrupt signal.In the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to catch the signal. Once a signal is encountered, the dsl_sync_task function can install a Lua hook that will get called before the Lua interpreter executes a new line of code. The dsl_sync_task can resume with a standard txg_wait_sync call and wait for the txg to complete. Meanwhile, the hook will abort the script and indicate that the channel program was canceled. The kernel returns a EINTR to indicate that the channel program run was canceled. Porting notes: Added missing return value from cv_wait_sig() Authored by: Don Brady <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Don Brady <[email protected]> Signed-off-by: Don Brady <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9425 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926 Closes #8904
* Add libnvpair to libzfs pkg-configHarry Mallon2019-06-221-1/+1
| | | | | | | | Functions such as `fnvlist_lookup_nvlist` need libnvpair to be linked. Default pkg-config file did not contain it. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Harry Mallon <[email protected]> Closes #8919
* Allow unencrypted children of encrypted datasetsTom Caputi2019-06-203-79/+23
| | | | | | | | | | | | | | | | | | When encryption was first added to ZFS, we made a decision to prevent users from creating unencrypted children of encrypted datasets. The idea was to prevent users from inadvertently leaving some of their data unencrypted. However, since the release of 0.8.0, some legitimate reasons have been brought up for this behavior to be allowed. This patch simply removes this limitation from all code paths that had checks for it and updates the tests accordingly. Reviewed-by: Jason King <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8737 Closes #8870
* Remove dedupditto functionalityMatthew Ahrens2019-06-191-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If dedup is in use, the `dedupditto` property can be set, causing ZFS to keep an extra copy of data that is referenced many times (>100x). The idea was that this data is more important than other data and thus we want to be really sure that it is not lost if the disk experiences a small amount of random corruption. ZFS (and system administrators) rely on the pool-level redundancy to protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin doesn't have control over what data will be offered extra redundancy by dedupditto, this extra redundancy is not very useful. The bulk of the data is still vulnerable to loss based on the pool-level redundancy. For example, if particle strikes corrupt 0.1% of blocks, you will either be saved by mirror/raidz, or you will be sad. This is true even if dedupditto saved another 0.01% of blocks from being corrupted. Therefore, the dedupditto functionality is rarely enabled (i.e. the property is rarely set), and it fulfills its promise of increased redundancy even more rarely. Additionally, this feature does not work as advertised (on existing releases), because scrub/resilver did not repair the extra (dedupditto) copy (see https://github.com/zfsonlinux/zfs/pull/8270). In summary, this seldom-used feature doesn't work, and even if it did it wouldn't provide useful data protection. It has a non-trivial maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270). We should remove the dedupditto functionality. For backwards compatibility with the existing CLI, "zpool set dedupditto" will still "succeed" (exit code zero), but won't have any effect. For backwards compatibility with existing pools that had dedupditto enabled at some point, the code will still be able to understand dedupditto blocks and free them when appropriate. However, ZFS won't write any new dedupditto blocks. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Issue #8270 Closes #8310
* Use ZFS_DEV macro instead of literalsTomohiro Kusumi2019-06-192-4/+4
| | | | | | | | | The rest of the code/comments use ZFS_DEV, so sync with that. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8912
* Implement Redacted Send/ReceivePaul Dagnelie2019-06-198-179/+949
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Prashanth Sreenivasa <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Chris Williamson <[email protected]> Reviewed-by: Pavel Zhakarov <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #7958
* Restrict filesystem creation if name referred either '.' or '..'Tulsi Jain2019-06-131-0/+10
| | | | | | | | | | | This change restricts filesystem creation if the given name contains either '.' or '..' Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: TulsiJain <[email protected]> Closes #8842 Closes #8564
* Refactor parent dataset handling in libzfs zfs_rename()Tomohiro Kusumi2019-05-281-9/+4
| | | | | | | | | | | For recursive renaming, simplify the code by moving `zhrp` and `parentname` to inner scope. `zhrp` is only used to test existence of a parent dataset for recursive dataset dir scan since ba6a24026c. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8815
* zfs: don't pretty-print objsetid propertyloli10K2019-05-241-1/+4
| | | | | | | | | | The objsetid property, while being stored as a number, is a dataset identifier and should not be pretty-printed. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8784
* Fix wrong assertion in libzfs diff error handlingRyan Moeller2019-05-191-1/+1
| | | | | | | | | | In compare(), all error cases set the error code to EPIPE, so when an error is set, the correct assertion to make is that the error is EPIPE, not EINVAL. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #8743
* Fix send/recv lost spill blockBrian Behlendorf2019-05-071-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes in to account the DRR_OBJECT record and the existing object properties. This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept. What's needed to resolve this is an additional flag in the DRR_OBJECT which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set then the object references a spill block and it must be kept. Either it is update to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed. Since previous versions of ZFS do not understand this new flag additional DRR_SPILL records will be inserted in to the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks which can be identified by checking for the DRR_SPILL_UNMODIFIED flag. The small downside to this approach is that is may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS spill blocks may still be lost. OpenZFS-issue: https://www.illumos.org/issues/9952 FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277 Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8668