path: root/man
* Add notice that forcefully unmount is not supported on Linux (Mariusz Zaborski, 2020-02-18; 2 files changed, -2/+4)
  The Linux VFS will never allow a filesystem which is in use to be unmounted. This behavior differs from other platforms like FreeBSD which allow a filesystem to be force unmounted. This will result in errors being returned to applications actively using the filesystem.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Mariusz Zaborski <[email protected]>
  Closes #10013
* Prefer org.openzfs for features and properties (Richard Laager, 2020-02-18; 1 file changed, -1/+1)
  Moving forward, we wish to use org.openzfs (no dash) rather than org.open-zfs or org.zfsonlinux for feature GUIDs and property names. The existing feature GUIDs cannot be changed.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Richard Laager <[email protected]>
  Closes #10003
* Systemd mount generator: Generate noauto units; add control properties (InsanePrawn, 2020-02-14; 2 files changed, -14/+180)
  This commit refactors the systemd mount generators and makes the following major changes:
  - The generator now generates units for datasets marked canmount=noauto, too. These units are NOT WantedBy local-fs.target. If there are multiple noauto datasets for a path, no noauto unit will be created. Datasets with canmount=on are prioritized.
  - Introduces handling of new user properties which are now included in the zfs-list.cache files:
    - org.openzfs.systemd:requires: List of units to require for this mount unit
    - org.openzfs.systemd:requires-mounts-for: List of mounts to require by this mount unit
    - org.openzfs.systemd:before: List of units to order after this mount unit
    - org.openzfs.systemd:after: List of units to order before this mount unit
    - org.openzfs.systemd:wanted-by: List of units to add a Wants dependency on this mount unit to
    - org.openzfs.systemd:required-by: List of units to add a Requires dependency on this mount unit to
    - org.openzfs.systemd:nofail: Toggles between a wants and a requires dependency
    - org.openzfs.systemd:ignore: Do not generate a mount unit for this dataset
    Consult the updated man page for detailed documentation.
  - Restructures and extends the zfs-mount-generator(8) man page with the above properties, information on unit ordering and a license header.
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: Antonio Russo <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: InsanePrawn <[email protected]>
  Closes #9649
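  As a minimal illustration of how these properties are meant to be used (the dataset names and mount path below are hypothetical), they are ordinary user properties set with zfs set and picked up the next time the zfs-list.cache files are refreshed:
    # require another mount before this dataset is mounted
    zfs set org.openzfs.systemd:requires-mounts-for=/srv/keys rpool/var/secrets
    # skip unit generation entirely for a scratch dataset
    zfs set org.openzfs.systemd:ignore=on rpool/scratch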
* Support setting user properties in a channel program (Jason King, 2020-02-14; 1 file changed, -2/+23)
  This adds support for setting user properties in a zfs channel program by adding 'zfs.sync.set_prop' and 'zfs.check.set_prop' to the ZFS LUA API.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Co-authored-by: Sara Hartse <[email protected]>
  Contributions-by: Jason King <[email protected]>
  Signed-off-by: Sara Hartse <[email protected]>
  Signed-off-by: Jason King <[email protected]>
  Closes #9950
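  A short sketch of how this could be invoked; the dataset, the user property name, and the (dataset, property, value) argument order shown are illustrative assumptions, not taken from this commit:
    cat > setprop.zcp <<'EOF'
    -- set a user property from syncing context
    zfs.sync.set_prop("tank/data", "com.example:tier", "hot")
    EOF
    zfs program tank setprop.zcp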
* Remove limit on number of async zio_frees of non-dedup blocks (Matthew Ahrens, 2020-02-14; 1 file changed, -0/+11)
  The module parameter zfs_async_block_max_blocks limits the number of blocks that can be freed by the background freeing of filesystems and snapshots (from "zfs destroy"), in one TXG. This is useful when freeing dedup blocks, because each zio_free() of a dedup block can require an i/o to read the relevant part of the dedup table (DDT), and will also dirty that block. zfs_async_block_max_blocks is set to 100,000 by default.
  For the more typical case where dedup is not used, this can have a negative performance impact on the rate of background freeing (from "zfs destroy"). For example, with recordsize=8k, and TXG's syncing once every 5 seconds, we can free only 160MB of data per second, which may be much less than the rate we can write data.
  This change increases zfs_async_block_max_blocks to be unlimited by default. To address the dedup freeing issue, a new tunable is introduced, zfs_max_async_dedup_frees, which limits the number of zio_free()'s of dedup blocks done by background destroys, per txg. The default is 100,000 frees (same as the old zfs_async_block_max_blocks default).
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10000
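  Like other ZFS module parameters, the new tunable can be adjusted at runtime; the value below is purely illustrative:
    echo 50000 > /sys/module/zfs/parameters/zfs_max_async_dedup_frees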
* zcp: add zfs.sync.bookmark (Christian Schwarz, 2020-02-11; 1 file changed, -0/+16)
  Add support for bookmark creation and cloning.
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #9571
* Implement bookmark copying (Christian Schwarz, 2020-02-11; 2 files changed, -3/+9)
  This feature allows copying existing bookmarks using
    zfs bookmark fs#target fs#newbookmark
  There are some niche use cases for such functionality, e.g. when using bookmarks as markers for replication progress.
  Copying redaction bookmarks produces a normal bookmark that cannot be used for redacted send (we are not duplicating the redaction object).
  ZCP support for bookmarking (both creation and copying) will be implemented in a separate patch based on this work.
  Overview:
  - Terminology:
    - source = existing snapshot or bookmark
    - new/bmark = new bookmark
  - Implement bookmark copying in `dsl_bookmark.c`
    - create new bookmark node
    - copy source's `zbn_phys` to new's `zbn_phys`
    - zero-out redaction object id in copy
  - Extend existing bookmark ioctl nvlist schema to accept bookmarks as sources
    - => `dsl_bookmark_create_nvl_validate` is authoritative
    - use `dsl_dataset_is_before` check for both snapshot and bookmark sources
  - Adjust CLI
    - refactor shortname expansion logic in `zfs_do_bookmark`
  - Update man pages
    - warn about redaction bookmark handling
  - Add test cases
    - CLI
    - pyzfs libzfs_core bindings
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #9571
* Fix zdb -R with 'b' flag (Paul Zuchowski, 2020-02-10; 1 file changed, -1/+1)
  zdb -R :b fails due to the indirect block being compressed, and the 'b' and 'd' flag not working in tandem when specified. Fix the flag parsing code and create a zfs test for zdb -R block display. Also fix the zio flags where the dotted notation for the vdev portion of DVA (i.e. 0.0:offset:length) fails.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Zuchowski <[email protected]>
  Closes #9640
  Closes #9729
* ICP: Improve AES-GCM performance (Attila Fülöp, 2020-02-10; 1 file changed, -1/+1)
  Currently SIMD accelerated AES-GCM performance is limited by two factors:
  a. The need to disable preemption and interrupts and save the FPU state before using it and to do the reverse when done. Due to the way the code is organized (see (b) below) we have to pay this price twice for each 16 byte GCM block processed.
  b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations.
  To solve (a) we do crypto processing in larger chunks while owning the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to 32x reduction of the time spent in the en/decryption routines, leading up to approximately 12x throughput increase for large (128 KiB) blocks.
  Lastly, this commit changes the default encryption algorithm from AES-CCM to AES-GCM when setting the `encryption=on` property.
  Reviewed-By: Brian Behlendorf <[email protected]>
  Reviewed-By: Jason King <[email protected]>
  Reviewed-By: Tom Caputi <[email protected]>
  Reviewed-By: Richard Laager <[email protected]>
  Signed-off-by: Attila Fülöp <[email protected]>
  Closes #9749
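  A rough sketch of the knobs this exposes; the module parameter path below follows the usual ICP module layout and is an assumption, as is the value, and the dataset name is hypothetical:
    # enlarge the chunk processed per FPU acquisition (default 32 KiB)
    echo 65536 > /sys/module/icp/parameters/icp_gcm_avx_chunk_size
    # encryption=on now selects aes-256-gcm for newly created datasets
    zfs create -o encryption=on -o keyformat=passphrase tank/secure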
* Restore aclmode and remove acltype on FreeBSD (Ryan Moeller, 2020-02-04; 1 file changed, -1/+51)
  This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE, matching what is done in the NFSv4 ACLs PR (#9709). On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide ZFS_PROP_ACLMODE. The tests already assume this arrangement.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #9913
* zdb: add support for object ranges for zdb -d (Ned Bass, 2020-01-24; 1 file changed, -4/+41)
  Allow a range of object identifiers to be dumped with -d. This may be useful when dumping a large dataset and you want to break it up into multiple phases, or to resume where a previous scan left off. Object type selection flags are supported to reduce the performance overhead of verbosely dumping unwanted objects, and to reduce the amount of post-processing work needed to filter out unwanted objects from zdb output.
  This change extends existing syntax in a backward-compatible way. That is, the base case of a range is to specify a single object identifier to dump. Ranges and object identifiers can be intermixed as command line parameters.
  Usage synopsis: object ranges take the form <start>:<end>[:<flags>]
    start   Starting object number
    end     Ending object number, or -1 for no upper bound
    flags   Optional flags to select object types:
      A   All objects (this is the default)
      d   ZFS directories
      f   ZFS files
      m   SPA space maps
      z   ZAPs
      -   Negate effect of next flag
  Examples:
    # Dump all file objects
    zdb -dd tank/fish 0:-1:f
    # Dump all file and directory objects
    zdb -dd tank/fish 0:-1:fd
    # Dump all types except file and directory objects
    zdb -dd tank/fish 0:-1:A-f-d
    # Dump object IDs in a specific range
    zdb -dd tank/fish 1000:2000
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Paul Zuchowski <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Closes #9832
* Add AltiVec RAID-Z (Romain Dolbeau, 2020-01-23; 1 file changed, -0/+1)
  Implements the RAID-Z function using AltiVec SIMD. This is basically the NEON code translated to AltiVec. Note that the 'fletcher' algorithm requires 64-bit operations, and the initial implementations of AltiVec (PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only have up to 32-bit operations, so no 'fletcher'.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Romain Dolbeau <[email protected]>
  Closes #9539
* Support inheriting properties in channel programs (Jason King, 2020-01-22; 1 file changed, -1/+23)
  This adds support in channel programs to inherit properties analogous to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit` functions to the ZFS LUA API.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Jason King <[email protected]>
  Closes #9738
* zdb -d should accept the numeric objset id (Paul Zuchowski, 2020-01-16; 1 file changed, -2/+3)
  As an alternative to the dataset name, zdb now allows the decimal or hexadecimal objset ID to be specified. When permanent errors are reported as 2 hexadecimal numbers (objset ID : object ID) in zpool status, you can now use 'zdb <pool>[/objset ID] object' to determine the names of the objset and object which have the error.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Paul Zuchowski <[email protected]>
  Closes #9733
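  For example (the pool name and IDs below are hypothetical), if zpool status -v lists a permanent error as 0x1d5:0x4821, the corresponding names can be looked up with:
    zdb -d tank/0x1d5 0x4821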
* Document zfs change-key caveats (Richard Laager, 2020-01-14; 1 file changed, -2/+25)
  As discussed on the 2019-01-07 OpenZFS Leadership Meeting, we need to be clear about the limitations of `zfs change-key`. Changing the user key does not change the master key, nor does it currently overwrite the old wrapped master key on disk.
  Reviewed-by: Tom Caputi <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Garrett Fields <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Signed-off-by: Richard Laager <[email protected]>
  Closes #9819
* Change http://zfsonlinux.org links to https://zfsonlinux.org (Brian Behlendorf, 2020-01-13; 1 file changed, -1/+1)
  Update the project website links contained in the repository to reference the secure https://zfsonlinux.org address.
  Reviewed-By: Richard Laager <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Garrett Fields <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #9837
* Add 'zfs send --saved' flag (Tom Caputi, 2020-01-10; 1 file changed, -0/+21)
  This commit adds the --saved (-S) flag to the 'zfs send' command. This flag allows a user to send a partially received dataset, which can be useful when migrating a backup server to new hardware. This flag is compatible with resumable receives, so even if the saved send is interrupted, it can be resumed. The flag does not require any user / kernel ABI changes or any new feature flags in the send stream format.
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Alek Pinchuk <[email protected]>
  Reviewed-by: Paul Zuchowski <[email protected]>
  Reviewed-by: Christian Schwarz <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tom Caputi <[email protected]>
  Closes #9007
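  A usage sketch with hypothetical host and dataset names: the partially received state on an old backup server can be forwarded to its replacement with
    zfs send --saved backup/data | ssh newhost zfs receive -s backup/data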
* zfs-module-parameters(5): add all missing items (DeHackEd, 2020-01-07; 1 file changed, -7/+172)
  I ran a report against the output of `modinfo zfs.ko`. This commit adds everything missing and corrects a few renamed module parameters. Specifically:
  * zfs_checksums_per_second renamed in ad796b8a3
  * vdev_ms_count_limit renamed in c853f382d
  Also fixes some variable type inconsistencies (unsigned int => uint).
  Reviewed-by: George Amanakis <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: DHE <[email protected]>
  Closes #9809
* Fix ZPOOL_VDEV_NAME_PATH option description (Brian Behlendorf, 2020-01-06; 1 file changed, -1/+1)
  The corresponding zpool status option is -P and not -p. Update this description to reference the correct option.
  Reviewed-by: loli10K <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #9803
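  For reference, either form below prints full device paths for vdevs (the pool name is hypothetical):
    zpool status -P tank
    ZPOOL_VDEV_NAME_PATH=1 zpool status tank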
* Colorize zpool status output (Tony Hutter, 2019-12-19; 1 file changed, -0/+6)
  If the ZFS_COLOR env variable is set, then use ANSI color output in zpool status:
  - Column headers are bold
  - Degraded or offline pools/vdevs are yellow
  - Non-zero error counters and faulted vdevs/pools are red
  - The 'status:' and 'action:' sections are yellow if they're displaying a warning.
  This also includes a new 'faketty' function in libtest.shlib that is compatible with FreeBSD (code provided by @freqlabs).
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Tony Hutter <[email protected]>
  Closes #9340
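  Enabling the colored output is just a matter of setting the environment variable (the pool name is hypothetical):
    ZFS_COLOR=1 zpool status tank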
* Add FreeBSD jail support hooks (Matthew Macy, 2019-12-11; 5 files changed, -1/+152)
  Add the 'zfs jail/unjail' subcommands along with the relevant documentation from FreeBSD. This feature is not supported on Linux and still requires the matching kernel ioctls which will be included when the FreeBSD platform code is integrated.
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #9686
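  On FreeBSD the intended usage looks roughly like this (the jail ID and dataset are hypothetical, and the dataset is assumed to have jailed=on set):
    zfs set jailed=on tank/jails/web
    zfs jail 23 tank/jails/web
    zfs unjail 23 tank/jails/web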
* zio_decompress_data always ASSERTs successful decompression (Paul Zuchowski, 2019-12-10; 1 file changed, -4/+6)
  This interferes with zdb_read_block trying all the decompression algorithms when the 'd' flag is specified, as some are expected to fail. Also control the output when guessing algorithms, try the more common compression types first, allow specifying lsize/psize, and fix an uninitialized variable.
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Zuchowski <[email protected]>
  Closes #9612
  Closes #9630
* Disable EDONR on FreeBSD (Matthew Macy, 2019-12-05; 2 files changed, -0/+5)
  FreeBSD uses its own crypto framework in-kernel which, at this time, has no EDONR implementation.
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #9664
* Increase allowed 'special_small_blocks' maximum value (Brian Behlendorf, 2019-12-03; 1 file changed, -1/+1)
  There may be circumstances where it's desirable that all blocks in a specified dataset be stored on the special device. Relax the artificial 128K limit and allow the special_small_blocks property to be set up to 1M. When blocks >1MB have been enabled via the zfs_max_recordsize module option, this limit is increased accordingly.
  Reviewed-by: Don Brady <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #9131
  Closes #9355
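  As an illustration with a hypothetical dataset, a dataset using the default 128K recordsize can now be directed entirely to the special vdev:
    zfs set special_small_blocks=1M tank/db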
* Remove zfs_vdev_elevator module option (Brian Behlendorf, 2019-11-27; 1 file changed, -14/+0)
  As described in commit f81d5ef6 the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule.
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: loli10K <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #8664
  Closes #9417
  Closes #9609
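  A sketch of the suggested udev-rule replacement; the rule file name, device match, and scheduler choice are examples only, not part of this commit:
    # /etc/udev/rules.d/66-io-scheduler.rules
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"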
* Add display of checksums to zdb -R (Paul Zuchowski, 2019-11-27; 1 file changed, -0/+2)
  The function zdb_read_block (zdb -R) was always intended to have a :c flag which would read the DVA and length supplied by the user, and display the checksum. Since we don't know which checksum goes with the data, we should calculate and display them all. For each checksum in the table, read in the data at the supplied DVA:length, calculate the checksum, and display it. Update the man page and create a zfs test for the new feature.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Signed-off-by: Paul Zuchowski <[email protected]>
  Closes #9607
* Change zed.service to zfs-zed.service in man page (Kjeld Schouten, 2019-11-13; 1 file changed, -2/+2)
  zed.service does not exist; replaced it with the correct service name in the man page.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Kjeld Schouten-Lebbing <[email protected]>
  Closes #9581
* Reorganize zpool(8) man page into sections (Ross Williams, 2019-11-13; 37 files changed, -2446/+3959)
  - Moved subcommand topics into individual manpages. Reordered and grouped the list of subcommands by topic. Moved the concepts overview to `zpoolconcepts.8` and the long list of available pool properties to `zpoolprops.8`.
  - Internal cross-references copied from `zpool.8` needed to be converted to `.Xr` external references to the new subcommand manual pages.
  - Moved `autotrim` into lexical order; it had been tacked onto the end of a list, and now it is in alphabetical order.
  - Clarified the attach/detach description. The description was too specific to command syntax; the overview now clarifies the reason for attaching or detaching a device.
  - Clarified the replace description so it doesn't refer to subcommand arguments.
  - Clarified the split command description, saying what split actually does and why you'd want to do it.
  - Clarified the description of upgrade, and simplified the zpool.8 wording of the zpool-upgrade(8) description.
  - Clarified the description of import, detailing what zpool-import(8) actually does.
  - Added appropriate SEE ALSO sections: the divided zpool subcommand manual pages need their own SEE ALSO sections. Also modified fsck.zfs.8 to point directly to zfs-scrub.8 and zed.8.in to include a direct reference to zfs-events.8.
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ross Williams <[email protected]>
  Closes #9564
* Reorganize zfs(8) man page into sections (Ross Williams, 2019-11-12; 36 files changed, -4801/+5972)
  - Most subcommands got their own manpages (e.g. create). Some related commands were grouped into a single manpage with symlinks created (e.g. set, get, and inherit). I did this when topics were either too short to warrant their own file or so interrelated that a user would want to refer between commands in the same file.
  - Corrected .Sx internal references to .Xr cross refs; lots of .Sx references from when the text was all in zfs.8 needed to be changed to .Xr zfs-$SUBCOMMAND 8 cross references.
  - Divided the subcommand list in zfs(8) into sections of related functionality. This required writing new descriptions for some commands.
  - Preserved ".Os Linux": `.Os` macro parsing behavior differs between mandoc from the "BSD" mandoc package (available on Ubuntu) and man from Ubuntu's man-db package, which calls groff to format the manpages. Groff handles the `.Os` macro differently and wrongly, defaulting it to "BSD" in `/usr/share/groff/*/tmac/mdoc/doc-common` instead of getting the default from `uname`. A future set of changes will introduce build-time preprocessing of manpages for platform-specific documentation and can insert the correct operating system name.
  - Added SEE ALSO sections: the newly-divided zfs-*.8 subcommand man pages needed their own SEE ALSO sections pointing to related subcommands and, in some cases, documentation from other packages (e.g. zfs-share.8).
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Kjeld Schouten <[email protected]>
  Reviewed-by: Sean Eric Fagan <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ross Williams <[email protected]>
  Closes #9559
* Add AVX512BW variant of fletcher (Romain Dolbeau, 2019-10-30; 1 file changed, -1/+1)
  It is much faster than AVX512F when byteswapping on Skylake-SP and newer, as we can do the byteswap in a single vshufb instead of many instructions.
  Reviewed by: Gvozden Neskovic <[email protected]>
  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Romain Dolbeau <[email protected]>
  Closes #9517
* Modify sharenfs=on default behavior (Brian Behlendorf, 2019-10-13; 1 file changed, -1/+1)
  While it may sometimes be convenient to export an NFS filesystem with no_root_squash it should not be the default behavior. Align the default behavior with the Linux NFS server defaults. To restore the previous behavior use 'zfs set sharenfs="no_root_squash,..."'.
  Reviewed-by: loli10K <[email protected]>
  Reviewed-by: Richard Laager <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #9397
  Closes #9425
* Implement ZPOOL_IMPORT_UDEV_TIMEOUT_MS (Richard Yao, 2019-10-11; 1 file changed, -0/+6)
  Since 0.7.0, zpool import would unconditionally block on udev for 30 seconds. This introduced a regression in initramfs environments that lack udev (particularly mdev based environments), yet use ZFS userland tools intended for the full system, which had been built against udev. Gentoo's genkernel is the main example, although custom user initramfs environments would be similarly impacted unless special builds of the ZFS userland utilities were done for them. Such environments already have their own mechanisms for blocking until device nodes are ready (such as genkernel's scandelay parameter), so it is unnecessary for zpool import to block on a non-existent udev until a timeout is reached inside of them.
  Rather than trying to intelligently determine whether udev is available on the system to avoid unnecessarily blocking in such environments, it seems best to just allow the environment to override the timeout. I propose that we add an environment variable called ZPOOL_IMPORT_UDEV_TIMEOUT_MS. Setting it to 0 would restore the 0.6.x behavior that was more desirable in mdev based initramfs environments. This allows the system userland utilities to be reused when building mdev-based initramfs archives.
  Reviewed-by: Igor Kozhukhov <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Georgy Yakovlev <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes #9436
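  In an mdev-based initramfs the override would look something like this (the pool name is hypothetical):
    ZPOOL_IMPORT_UDEV_TIMEOUT_MS=0 zpool import -N rpool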
* Update `zfs program` command usage (Brian Behlendorf, 2019-10-10; 1 file changed, -2/+4)
  Update the zfs(8) man page to clearly describe that arguments for channel programs are to be listed after the -- sentinel which terminates argument processing. This behavior is supported by getopt on Linux, FreeBSD, and Illumos according to each platform's respective man pages.
    zfs program [-jn] [-t instruction-limit] [-m memory-limit] pool script [--] arg1 ...
  Reviewed-by: Clint Armstrong <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: loli10K <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #9056
  Closes #9428
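  A concrete invocation following that synopsis (the script name and its arguments are hypothetical); everything after -- is passed to the script rather than parsed as zfs options:
    zfs program -n tank cleanup.zcp -- -v tank/tmp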
* Add warning for zfs_vdev_elevator option removal (Brian Behlendorf, 2019-09-25; 1 file changed, -2/+4)
  Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibility code to set the elevator via a usermodehelper. While well intentioned this introduced a bug which could cause a system hang, that issue was subsequently fixed by commit 2a0d4188.
  In order to avoid future bugs in this area, and to simplify the code, this functionality is being deprecated. A console warning has been added to notify any existing consumers and the documentation updated accordingly. This option will remain for the lifetime of the 0.8.x series for compatibility but is planned to be phased out of master.
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: loli10K <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #8664
  Closes #9317
* Add subcommand to wait for background zfs activity to complete (John Gallagher, 2019-09-13; 2 files changed, -9/+104)
  Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient.
  This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following:
  - Scrubs or resilvers to complete
  - Devices to be initialized
  - Devices to be replaced
  - Devices to be removed
  - Checkpoints to be discarded
  - Background freeing to complete
  For example, a scrub that is in progress could be waited for by running
    zpool wait -t scrub <pool>
  This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous.
  This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communication primarily for the sake of portability.
  Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting:
  - Added ZoL-style ioctl input declaration.
  - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support.
  - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded.
  - Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
  - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS.
  - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
  - Added support for a non-integral interval argument to zpool wait.
  Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete.
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: John Kennedy <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: John Gallagher <[email protected]>
  Closes #9162
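  Two illustrative invocations with a hypothetical pool name:
    # block until the in-progress scrub finishes
    zpool wait -t scrub tank
    # start a scrub and return only once it has completed
    zpool scrub -w tank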
* Fix typos in man/ (Andrea Gelmini, 2019-08-30; 8 files changed, -16/+16)
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Andrea Gelmini <[email protected]>
  Closes #9233
* Keep more metaslabs loaded (Paul Dagnelie, 2019-08-29; 1 file changed, -1/+29)
  With the other metaslab changes loaded onto a system, we can significantly reduce the memory usage of each loaded metaslab and unload them on demand if there is memory pressure. However, none of those changes actually result in us keeping more metaslabs loaded. If we don't keep more metaslabs loaded, we will still have to wait for demand-loading to finish when no loaded metaslab can satisfy our allocation, which can cause ZIL performance issues. In addition, performance is traditionally measured by IOs per unit time, while unloading is currently done on a txg-count basis. Txgs can take a widely varying range of times, from tenths of a second to several seconds. This can result in confusing, hard to predict behavior.
  This change simply adds a time-based component to metaslab unloading. A metaslab will remain loaded for one minute and 8 txgs (by default) after it was last used, unless it is evicted due to memory pressure.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  External-issue: DLPX-65016
  External-issue: DLPX-65047
  Closes #9197
* Cap metaslab memory usage (Paul Dagnelie, 2019-08-16; 1 file changed, -0/+15)
  On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations.
  First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees.
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Sebastien Roy <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #9128
* spa_load_verify() may consume too much memory (George Wilson, 2019-08-13; 1 file changed, -4/+4)
  When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue so many I/Os that it consumes all the memory of the system, resulting in an OOM message or a hang.
  To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import.
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Signed-off-by: George Wilson <[email protected]>
  External-issue: DLPX-65237
  External-issue: DLPX-65238
  Closes #9146
* Introduce getting holds and listing bookmarks through ZCP (Serapheim Dimitropoulos, 2019-08-12; 1 file changed, -1/+27)
  Consumers of ZFS Channel Programs can now list bookmarks, and get holds from datasets. A minor refactoring was also applied to distinguish between user and system properties in ZCP.
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Ported-by: Serapheim Dimitropoulos <[email protected]>
  Signed-off-by: Dan Kimmel <[email protected]>
  OpenZFS-issue: https://illumos.org/issues/8862
  Closes #7902
* Metaslab max_size should be persisted while unloaded (Paul Dagnelie, 2019-08-05; 1 file changed, -0/+16)
  When we unload metaslabs today in ZFS, the cached max_size value is discarded. We instead use the histogram to determine whether or not we think we can satisfy an allocation from the metaslab. This can result in situations where, if we're doing I/Os of a size not aligned to a histogram bucket, a metaslab is loaded even though it cannot satisfy the allocation we think it can. For example, a metaslab with 16 entries in the 16k-32k bucket may have entirely 16kB entries. If we try to allocate a 24kB buffer, we will load that metaslab because we think it should be able to handle the allocation. Doing so is expensive in CPU time, disk reads, and average IO latency. This is exacerbated if the write being attempted is a sync write.
  This change makes ZFS cache the max_size after the metaslab is unloaded. If we ever get a free (or a coalesced group of frees) larger than the max_size, we will update it. Otherwise, we leave it as is. When attempting to allocate, we use the max_size as a lower bound, and respect it unless we are in try_hard. However, we do age the max_size out at some point, since we expect the actual max_size to increase as we do more frees. A more sophisticated algorithm here might be helpful, but this works reasonably well.
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #9055
* List log_spacemap feature in zpool-features.5 manual (Serapheim Dimitropoulos, 2019-07-31; 1 file changed, -0/+22)
  Update zpool-features.5 manpage to describe the log_spacemap feature.
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Pavel Zakharov <[email protected]>
  Signed-off-by: Serapheim Dimitropoulos <[email protected]>
  Closes #9096
* Fast Clone Deletion (Sara Hartse, 2019-07-26; 2 files changed, -1/+112)
  Deleting a clone requires finding blocks that are clone-only, not shared with the snapshot. This was done by traversing the entire block tree, which results in a large performance penalty for sparsely written clones. This new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it's time to delete, the clone-specific blocks are already at hand. We see performance improvements because now deletion work is proportional to the number of clone-modified blocks, not the size of the original dataset.
  Reviewed-by: Sean Eric Fagan <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Signed-off-by: Sara Hartse <[email protected]>
  Closes #8416
* New service that waits on zvol links to be created (Pavel Zakharov, 2019-07-17; 2 files changed, -1/+22)
  The zfs-volume-wait.service scans existing zvols and waits for their links under /dev to be created. Any service that depends on zvol links to be there should add a dependency on zfs-volumes.target. By default, this target is not enabled.
  Reviewed-by: Fabian Grünbichler <[email protected]>
  Reviewed-by: Antonio Russo <[email protected]>
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: loli10K <[email protected]>
  Reviewed-by: John Gallagher <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Pavel Zakharov <[email protected]>
  Closes #8975
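  A rough sketch of opting in; the consuming unit below is hypothetical and simply orders itself after the target:
    systemctl enable zfs-volumes.target
    # in the consuming service's [Unit] section:
    #   Requires=zfs-volumes.target
    #   After=zfs-volumes.target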
* Add zfs create dryrun (Mike Gerdts, 2019-07-16; 1 file changed, -5/+93)
  Adds the ability to sanity check zfs create arguments and to see the value of any additional properties that will be local to the dataset. For example, automation that may need to adjust quota on a parent filesystem before creating a volume may call `zfs create -nP -V <size> <volume>` to obtain the value of refreservation.
  This adds the following options to zfs create:
  - -n  dry-run (no-op)
  - -v  verbose
  - -P  parseable (implies verbose)
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Jerry Jelinek <[email protected]>
  Signed-off-by: Mike Gerdts <[email protected]>
  Closes #8974
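  For example (the names and size are illustrative), the refreservation a new volume would receive can be checked without actually creating it:
    zfs create -nP -V 100G tank/vm/disk0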
* Log Spacemap Project (Serapheim Dimitropoulos, 2019-07-16; 2 files changed, -1/+149)
  = Motivation
  At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spent on I/Os that update on-disk space accounting metadata. Specifically, we've seen cases where 20% to 40% of sync time is spent after sync pass 1 and ~30% of the I/Os on the system are spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG.
  We could try and decrease the number of metaslabs so we have fewer I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, its range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size.
  = About this patch
  This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references section below and in the code (see Big Theory Statement in spa_log_spacemap.c).
  Even though the change is fairly constrained within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted, if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things, the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool.
  = Testing
  The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems.
  = Performance Analysis (Linux Specific)
  All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run.
  After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits.
  Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only do we spend less time on metadata but we also iterate fewer times to convergence in spa_sync() dirtying objects. [related graphs: stock - sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png, lsm - sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
  Finally, the improvement in IOPS that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval.
    sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
    sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
  = Porting to Other Platforms
  For people that want to port this commit to other platforms, below is a list of ZoL commits that this patch depends on:
    Make zdb results for checkpoint tests consistent - db587941c5ff6dea01932bb78f70db63cf7f38ba
    Update vdev_is_spacemap_addressable() for new spacemap encoding - 419ba5914552c6185afbe1dd17b3ed4b0d526547
    Simplify spa_sync by breaking it up to smaller functions - 8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
    Factor metaslab_load_wait() in metaslab_load() - b194fab0fb6caad18711abccaff3c69ad8b3f6d3
    Rename range_tree_verify to range_tree_verify_not_present - df72b8bebe0ebac0b20e0750984bad182cb6564a
    Change target size of metaslabs from 256GB to 16GB - c853f382db731e15a87512f4ef1101d14d778a55
    zdb -L should skip leak detection altogether - 21e7cf5da89f55ce98ec1115726b150e19eefe89
    vs_alloc can underflow in L2ARC vdevs - 7558997d2f808368867ca7e5234e5793446e8f3f
    Simplify log vdev removal code - 6c926f426a26ffb6d7d8e563e33fc176164175cb
    Get rid of space_map_update() for ms_synced_length - 425d3237ee88abc53d8522a7139c926d278b4b7f
    Introduce auxiliary metaslab histograms - 928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
    Error path in metaslab_load_impl() forgets to drop ms_sync_lock - 8eef997679ba54547f7d361553d21b3291f41ae7
  = References
  Background, Motivation, and Internals of the Feature
  - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ
  - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
  Flushing Algorithm Internals & Performance Results (Illumos Specific)
  - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/
  - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw
  - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
  Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320, DLPX-63385
  Reviewed-by: Sean Eric Fagan <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Serapheim Dimitropoulos <[email protected]>
  Closes #8442
* systemd encryption key support (Antonio Russo, 2019-07-15; 1 file changed, -1/+1)
  Modify zfs-mount-generator to produce a dependency on new zfs-import-key-*.service units, dynamically created at boot to call zfs load-key for the encryption root, before attempting to mount any encrypted datasets. These units are created by zfs-mount-generator and declare RequiresMountsFor on the keyfile, if present, or call systemd-ask-password if a passphrase is requested.
  This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and @rlaager, as well as an adaptation of @rlaager's script to retry on incorrect password entry.
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: Fabian Grünbichler <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Antonio Russo <[email protected]>
  Closes #8750
  Closes #8848
* Fix zfs "redact" misc issuesloli10K2019-07-051-2/+2
| | | | | | | | | | | * zfs redact error messages do not end with newline character * 30af21b0 inadvertently removed some ZFS_PROP comments * man/zfs: zfs redact <redaction_snapshot> is not optional Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8988
* Fix typo in zpool-features.5, section bookmark_written (Matthew Ahrens, 2019-07-03; 1 file changed, -1/+1)
  The full property name includes "delphix", not "delphxi".
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Igor Kozhukhov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #8985
* Fix zfs(8) mandoc errors (loli10K, 2019-07-02; 1 file changed, -1/+0)
  This was accidentally introduced in 765d1f06:
    mandoc: ./man/man8/zfs.8: ERROR: skipping item outside list: It Ar filesystem Ns | Ns Ar mountpoint
    mandoc: ./man/man8/zfs.8: ERROR: skipping item outside list: It Xo
    mandoc: ./man/man8/zfs.8: ERROR: skipping end of block that is not open: Xc
    mandoc: ./man/man8/zfs.8: ERROR: skipping item outside list: It Xo
    mandoc: ./man/man8/zfs.8: ERROR: skipping end of block that is not open: Xc
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: loli10K <[email protected]>
  Closes #8980