aboutsummaryrefslogtreecommitdiffstats
path: root/include/libzfs.h
Commit message (Collapse)AuthorAgeFilesLines
* Add 'zfs send --saved' flagTom Caputi2020-01-101-0/+4
| | | | | | | | | | | | | | | | | | This commit adds the --saved (-S) to the 'zfs send' command. This flag allows a user to send a partially received dataset, which can be useful when migrating a backup server to new hardware. This flag is compatible with resumable receives, so even if the saved send is interrupted, it can be resumed. The flag does not require any user / kernel ABI changes or any new feature flags in the send stream format. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Paul Zuchowski <[email protected]> Reviewed-by: Christian Schwarz <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9007
* Remove parameter names from libzfs.h signaturesNick Black2020-01-081-7/+7
| | | | | | | | | | | | | | | Most of libzfs.h doesn't provide names for the parameters in its signatures. These few functions included them. That wouldn't be a problem, per se, but the 'lines' parameter conflicts with the 'lines' #define from terminfo's term.h, present for at least a decade. This makes it difficult to compile code making use of both ZFS and terminfo. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Kjeld Schouten <[email protected]> Signed-off-by: Nick Black <[email protected]> Closes #9821
* Fix zpool history unbounded memory usageChunwei Chen2019-10-281-1/+2
| | | | | | | | | | | | In original implementation, zpool history will read the whole history before printing anything, causing memory usage goes unbounded. We fix this by breaking it into read-print iterations. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #9516
* Fix encryption hierarchy issues with zfs recv -dTom Caputi2019-09-251-0/+3
| | | | | | | | | | | | | | | | | | | | | Currently, the recv_fix_encryption_hierarchy() function accepts 'destsnap' as one of its parameters. Originally, this was intended to be the top-level dataset of a receive (whether or not the receive was recursive). Unfortunately, this parameter actually is simply the input that is passed in from the command line. When the user specifies 'zfs recv -d', this string is actually only the name of the receiving pool since the rest of the name is derived from the send stream. This causes the function to fail, leaving some datasets with an invalid encryption hierarchy. This patch resolves this problem by passing in the top_zfs variable instead. In order to make this work, this patch also includes some changes that ensure the value is always present when we need it. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9273 Closes #9309
* Add subcommand to wait for background zfs activity to completeJohn Gallagher2019-09-131-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #9162
* Race condition between spa async threads and exportSerapheim Dimitropoulos2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue #9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #9015 Closes #9044
* OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocksMike Gerdts2019-07-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a volume is created in a pool with raidz vdevs and volblocksize != 128k, the volume can reference more space than is reserved with the automatically calculated refreservation. There are two deficiencies in vol_volsize_to_reservation that contribute to this: 1) Skip blocks may be added to keep each allocation a multiple of parity + 1. This is the dominating factor when volblocksize is close to 2^ashift. 2) raidz deflation for 128 KB blocks is different for most other block sizes. See "The theory of raidz space accounting" comment in libzfs_dataset.c for a full explanation. Authored by: Mike Gerdts <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Sanjay Nadkarni <[email protected]> Reviewed by: Jerry Jelinek <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Kody Kantor <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Mike Gerdts <[email protected]> Porting Notes: * ZTS: wait for zvols to exist before writing * ZTS: use log_must_busy with {zpool|zfs} destroy OpenZFS-issue: https://www.illumos.org/issues/9318 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0 Closes #8973
* Remove code for zfs remapMatthew Ahrens2019-06-241-2/+0
| | | | | | | | | | | | | | | | The "zfs remap" command was disabled by 6e91a72fe3ff8bb282490773bd687632f3e8c79d, because it has little utility and introduced some tricky bugs. This commit removes the code for it, the associated ZFS_IOC_REMAP ioctl, and tests. Note that the ioctl and property will remain, but have no functionality. This allows older software to fail gracefully if it attempts to use these, and avoids a backwards incompatibility that would be introduced if we renumbered the later ioctls/props. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8944
* Implement Redacted Send/ReceivePaul Dagnelie2019-06-191-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Prashanth Sreenivasa <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Chris Williamson <[email protected]> Reviewed-by: Pavel Zhakarov <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #7958
* Add feature check for 'zpool resilver' commandTom Caputi2019-05-021-0/+1
| | | | | | | | | | | | The 'zpool resilver' command requires that the resilver_defer feature is active on the pool. Unfortunately, the check for this was left out of the original patch. This commit simply corrects this so that the command properly returns an error in this case. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8700
* Add option [-V|--version] to emit version stringTerraTech2019-04-161-0/+7
| | | | | | | | | | | | | | | | | | | | | Add the 'zfs version' and 'zpool version' subcommands to display the version of the user space utilities and loaded zfs kernel module. For example: $ zfs version zfs-0.8.0-rc3_169_g67e0366b88 zfs-kmod-0.8.0-rc3_169_g67e0366b88 The '-V' and '--version' aliases were added to support the common convention of using 'zfs --version` to obtain the version information. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: TerraTech <[email protected]> Closes #2501 Closes #8567
* Add TRIM supportBrian Behlendorf2019-03-291-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Contributions-by: Saso Kiselkov <[email protected]> Contributions-by: Tim Chase <[email protected]> Contributions-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8419 Closes #598
* Avoid retrieving unused snapshot propsAlek P2019-03-121-3/+5
| | | | | | | | | | | | | | This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #8077
* zfs should optionally send holdsPaul Zuchowski2019-02-151-0/+9
| | | | | | | | | | | | | Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds. Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #7513
* ZVOLs should not be allowed to have childrenloli10K2019-02-081-0/+1
| | | | | | | | | | | | | | | zfs create, receive and rename can bypass this hierarchy rule. Update both userland and kernel module to prevent this issue and use pyzfs unit tests to exercise the ioctls directly. Note: this commit slightly changes zfs_ioc_create() ABI. This allow to differentiate a generic error (EINVAL) from the specific case where we tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT). Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: loli10K <[email protected]>
* OpenZFS 9102 - zfs should be able to initialize storage devicesGeorge Wilson2019-01-071-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PROBLEM ======== The first access to a block incurs a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are "thick provisioned", where supported by the platform (VMware). This can create a large delay in getting a new virtual machines up and running (or adding storage to an existing Engine). If the thick provision step is omitted, write performance will be suboptimal until all blocks on the LUN have been written. SOLUTION ========= This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty. When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design: - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...] - start, suspend, or cancel initialization - Creates new open-context thread for each vdev - Thread iterates through all metaslabs in this vdev - Each metaslab: - select a metaslab - load the metaslab - mark the metaslab as being zeroed - walk all free ranges within that metaslab and translate them to ranges on the leaf vdev - issue a "zeroing" I/O on the leaf vdev that corresponds to a free range on the metaslab we're working on - continue until all free ranges for this metaslab have been "zeroed" - reset/unmark the metaslab being zeroed - if more metaslabs exist, then repeat above tasks. - if no more metaslabs, then we're done. - progress for the initialization is stored on-disk in the vdev’s leaf zap object. The following information is stored: - the last offset that has been initialized - the state of the initialization process (i.e. active, suspended, or canceled) - the start time for the initialization - progress is reported via the zpool status command and shows information for each of the vdevs that are initializing Porting notes: - Added zfs_initialize_value module parameter to set the pattern written by "zpool initialize". - Added zfs_vdev_{initializing,removal}_{min,max}_active module options. Authored by: George Wilson <[email protected]> Reviewed by: John Wren Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: loli10K <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Signed-off-by: Tim Chase <[email protected]> Ported-by: Tim Chase <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9102 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb Closes #8230
* Add missing MMP status code to libzfs_statusbunder20152019-01-031-0/+2
| | | | | | | | | | | | When MMP was merged the status codes in libzfs_status were not updated to add the status code for ZPOOL_STATUS_IO_FAILURE_MMP. This commit corrects this and adds comments to help keep track of which code is used for which status. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: bunder2015 <[email protected]> Closes #8148 Closes #8222
* OpenZFS 8115 - parallel zfs mountSebastien Roy2018-11-151-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | Porting Notes: * Use thread pools (tpool) API instead of introducing taskq interfaces to libzfs. * Use pthread_mutext for locks as mutex_t isn't available. * Ignore alternative libshare initialization since OpenZFS-7955 is not present on zfsonlinux. Authored by: Sebastien Roy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Authored by: Brian Behlendorf <[email protected]> Approved by: Matt Ahrens <[email protected]> Ported-by: Don Brady <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/8115 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3f0e2b569 Closes #8092
* Add libzutil for libzfs or libzpool consumersDon Brady2018-11-051-102/+0
| | | | | | | | | | | Adds a libzutil for utility functions that are common to libzfs and libzpool consumers (most of what was in libzfs_import.c). This removes the need for utilities to link against both libzpool and libzfs. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #8050
* changelist should be able to iter on mountsAlek P2018-10-021-1/+2
| | | | | | | | | | | | | Modified changelist_gather()ing for the mountpoint property. Now instead of iterating on all dataset descendants, we read /proc/self/mounts and iterate on the mounted descendant datasets only. Switched changelist implementation from a uu_list_* to uu_avl_* in order to reduce changlist code-path's worst case time complexity. Reviewed by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #7967
* Add basic zfs ioc input nvpair validationDon Brady2018-09-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want newer versions of libzfs_core to run against an existing zfs kernel module (i.e. a deferred reboot or module reload after an update). Programmatically document, via a zfs_ioc_key_t, the valid arguments for the ioc commands that rely on nvpair input arguments (i.e. non legacy commands from libzfs_core). Automatically verify the expected pairs before dispatching a command. This initial phase focuses on the non-legacy ioctls. A follow-on change can address the legacy ioctl input from the zfs_cmd_t. The zfs_ioc_key_t for zfs_keys_channel_program looks like: static const zfs_ioc_key_t zfs_keys_channel_program[] = { {"program", DATA_TYPE_STRING, 0}, {"arg", DATA_TYPE_UNKNOWN, 0}, {"sync", DATA_TYPE_BOOLEAN_VALUE, ZK_OPTIONAL}, {"instrlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, {"memlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, }; Introduce four input errors to identify specific input failures (in addition to generic argument value errors like EINVAL, ERANGE, EBADF, and E2BIG). ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7780
* Added encryption support for zfs recv -o / -xTom Caputi2018-08-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | One small integration that was absent from b52563 was support for zfs recv -o / -x with regards to encryption parameters. The main use cases of this are as follows: * Receiving an unencrypted stream as encrypted without needing to create a "dummy" encrypted parent so that encryption can be inheritted. * Allowing users to change their keylocation on receive, so long as the receiving dataset is an encryption root. * Allowing users to explicitly exclude or override the encryption property from an unencrypted properties stream, allowing it to be received as encrypted. * Receiving a recursive heirarchy of unencrypted datasets, encrypting the top-level one and forcing all children to inherit the encryption. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7650
* OpenZFS 9166 - zfs storage pool checkpointSerapheim Dimitropoulos2018-06-261-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Details about the motivation of this feature and its usage can be found in this blogpost: https://sdimitro.github.io/post/zpool-checkpoint/ A lightning talk of this feature can be found here: https://www.youtube.com/watch?v=fPQA8K40jAM Implementation details can be found in big block comment of spa_checkpoint.c Side-changes that are relevant to this commit but not explained elsewhere: * renames members of "struct metaslab trees to be shorter without losing meaning * space_map_{alloc,truncate}() accept a block size as a parameter. The reason is that in the current state all space maps that we allocate through the DMU use a global tunable (space_map_blksz) which defauls to 4KB. This is ok for metaslab space maps in terms of bandwirdth since they are scattered all over the disk. But for other space maps this default is probably not what we want. Examples are device removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this review. Both of these have a 1:1 relationship with each vdev and could benefit from a bigger block size. Porting notes: * The part of dsl_scan_sync() which handles async destroys has been moved into the new dsl_process_async_destroys() function. * Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write to block device backed pools. * ZTS: * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg". * Don't use large dd block sizes on /dev/urandom under Linux in checkpoint_capacity. * Adopt Delphix-OS's setting of 4 (spa_asize_inflation = SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed its attempts to fill the pool * Create the base and nested pools with sync=disabled to speed up the "setup" phase. * Clear labels in test pool between checkpoint tests to avoid duplicate pool issues. * The import_rewind_device_replaced test has been marked as "known to fail" for the reasons listed in its DISCLAIMER. * New module parameters: zfs_spa_discard_memory_limit, zfs_remove_max_bytes_pause (not documented - debugging only) vdev_max_ms_count (formerly metaslabs_per_vdev) vdev_min_ms_count Authored by: Serapheim Dimitropoulos <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-issue: https://illumos.org/issues/9166 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8 Closes #7570
* Add pool state /proc entry, "SUSPENDED" poolsTony Hutter2018-06-061-0/+2
| | | | | | | | | | | | | | | | | | | 1. Add a proc entry to display the pool's state: $ cat /proc/spl/kstat/zfs/tank/state ONLINE This is done without using the spa config locks, so it will never hang. 2. Fix 'zpool status' and 'zpool list -o health' output to print "SUSPENDED" instead of "ONLINE" for suspended pools. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Richard Elling <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7331 Closes #7563
* OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_tPavel Zakharov2018-06-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | We want to be able to pass various settings during import/open of a pool, which are not only related to rewind. Instead of adding a new policy and duplicate a bunch of code, we should just rename rewind_policy to a more generic term like load_policy. For instance, we'd like to set spa->spa_import_flags from the nvlist, rather from a flags parameter passed to spa_import as in some cases we want those flags not only for the import case, but also for the open case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would allow zfs to open a pool when logs are missing. Authored by: Pavel Zakharov <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://illumos.org/issues/9235 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44 Closes #7532
* OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recoveryPavel Zakharov2018-05-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: 7638 Refactor spa_load_impl into several functions 8961 SPA load/import should tell us why it failed 7277 zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Porting notes (ZTS): * Fix 'make dist' target in zpool_import * The maximum path length allowed by tar is 99 characters. Several of the new test cases exceeded this limit resulting in them not being included in the tarball. Shorten the names slightly. * Set/get tunables using accessor functions. * Get last synced txg via the "zfs_txg_history" mechanism. * Clear zinject handlers in cleanup for import_cache_device_replaced and import_rewind_device_replaced in order that the zpool can be exported if there is an error. * Increase FILESIZE to 8G in zfs-test.sh to allow for a larger ext4 file system to be created on ZFS_DISK2. Also, there's no need to partition ZFS_DISK2 at all. The partitioning had already been disabled for multipath devices. Among other things, the partitioning steals some space from the ext4 file system, makes it difficult to accurately calculate the paramters to parted and can make some of the tests fail. * Increase FS_SIZE and FILE_SIZE in the zpool_import test configuration now that FILESIZE is larger. * Write more data in order that device evacuation take lonnger in a couple tests. * Use mkdir -p to avoid errors when the directory already exists. * Remove use of sudo in import_rewind_config_changed. Authored by: Pavel Zakharov <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Andrew Stormont <[email protected]> Approved by: Hans Rosenfeld <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-issue: https://illumos.org/issues/9075 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123 Closes #7459
* OpenZFS 7614, 9064 - zfs device evacuation/removalMatthew Ahrens2018-04-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <[email protected]> Reviewed-by: Alex Reece <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed by: Richard Laager <[email protected]> Reviewed by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900
* Report pool suspended due to MMPOlaf Faaland2018-03-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | When the pool is suspended, record whether it was due to an I/O error or due to MMP writes failing to succeed within the required time. Change spa_suspended from uint8_t to zio_suspend_reason_t to store the reason. When userspace queries pool status via spa_tryimport(), report the reason the pool was suspended in a new key, ZPOOL_CONFIG_SUSPENDED_REASON. In libzfs, when interpreting the returned config nvlist, report suspension due to MMP with a new pool status enum value, ZPOOL_STATUS_IO_FAILURE_MMP. In status_callback(), which generates and emits the message when 'zpool status' is executed, add a case to print an appropriate message for the new pool status enum value. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7296
* Change functions which return literals to return `const char*`Tomohiro Kusumi2018-03-091-1/+1
| | | | | | | | | | | | | | get_format_prompt_string() and zpool_state_to_name() return a string literal which is read-only, thus they should return `const char*`. zpool_get_prop_string() returns a non-const string after successful nv-lookup, and returns a string literal otherwise. Since this function is designed to be used for read-only purpose, the return type should also be `const char*`. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7285
* Want 'zfs send -b'LOLi2018-02-211-0/+3
| | | | | | | | | | | | | | | This change implements 'zfs send -b' which can be used to send only received property values whether or not they are overridden by local settings. This can be very useful during "restore" operations from a backup pool because it allows to send only the property values originally sent from the backup source, even though they were later modified on the destination either by a 'zfs set' operation, explicit 'zfs inherit' or overridden during the receive process via 'zfs receive -o|-x'. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7156
* Added no_scrub_restart flag to zpool reopenArkadiusz Bubała2017-10-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | | Added -n flag to zpool reopen that allows a running scrub operation to continue if there is a device with Dirty Time Log. By default if a component device has a DTL and zpool reopen is executed all running scan operations will be restarted. Added functional tests for `zpool reopen` Tests covers following scenarios: * `zpool reopen` without arguments, * `zpool reopen` with pool name as argument, * `zpool reopen` while scrubbing, * `zpool reopen -n` while scrubbing, * `zpool reopen -n` while resilvering, * `zpool reopen` with bad arguments. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Arkadiusz Bubała <[email protected]> Closes #6076 Closes #6746
* Add convenience 'zfs_get' functionsJohn2017-10-191-0/+1
| | | | | | | | Add get functions to match existing ones. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Ramsden <[email protected]> Closes #6308
* Remove FRU and LIBTOPO SupportDavid Quigley2017-09-181-11/+0
| | | | | | | | | FRU and LIBTOPO support are illumos only features that will not be ported to Linux and make the code more complicated than necessary. This commit makes way for further cleanups of the zed/FMA code. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: David Quigley <[email protected]> Closes #6641
* Add -vnP support to 'zfs send' for bookmarksLOLi2017-09-081-1/+1
| | | | | | | | | This leverages the functionality introduced in cf7684b to expose verbose, dry-run and parsable 'zfs send' options for bookmarks. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #3666 Closes #6601
* Native Encryption for ZFS on LinuxTom Caputi2017-08-141-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change incorporates three major pieces: The first change is a keystore that manages wrapping and encryption keys for encrypted datasets. These commands mostly involve manipulating the new DSL Crypto Key ZAP Objects that live in the MOS. Each encrypted dataset has its own DSL Crypto Key that is protected with a user's key. This level of indirection allows users to change their keys without re-encrypting their entire datasets. The change implements the new subcommands "zfs load-key", "zfs unload-key" and "zfs change-key" which allow the user to manage their encryption keys and settings. In addition, several new flags and properties have been added to allow dataset creation and to make mounting and unmounting more convenient. The second piece of this patch provides the ability to encrypt, decyrpt, and authenticate protected datasets. Each object set maintains a Merkel tree of Message Authentication Codes that protect the lower layers, similarly to how checksums are maintained. This part impacts the zio layer, which handles the actual encryption and generation of MACs, as well as the ARC and DMU, which need to be able to handle encrypted buffers and protected data. The last addition is the ability to do raw, encrypted sends and receives. The idea here is to send raw encrypted and compressed data and receive it exactly as is on a backup system. This means that the dataset on the receiving system is protected using the same user key that is in use on the sending side. By doing so, datasets can be efficiently backed up to an untrusted system without fear of data being compromised. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #494 Closes #5769
* Add libtpool (thread pools)Brian Behlendorf2017-08-091-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OpenZFS provides a library called tpool which implements thread pools for user space applications. Porting this library means the zpool utility no longer needs to borrow the kernel mutex and taskq interfaces from libzpool. This code was updated to use the tpool library which behaves in a very similar fashion. Porting libtpool was relatively straight forward and minimal modifications were needed. The core changes were: * Fully convert the library to use pthreads. * Updated signal handling. * lmalloc/lfree converted to calloc/free * Implemented portable pthread_attr_clone() function. Finally, update the build system such that libzpool.so is no longer linked in to zfs(8), zpool(8), etc. All that is required is libzfs to which the zcommon soures were added (which is the way it always should have been). Removing the libzpool dependency resulted in several build issues which needed to be resolved. * Moved zfeature support to module/zcommon/zfeature_common.c * Moved ratelimiting to to module/zfs/zfs_ratelimit.c * Moved get_system_hostid() to lib/libspl/gethostid.c * Removed use of cmn_err() in zcommon source * Removed dprintf_setup() call from zpool_main.c and zfs_main.c * Removed highbit() and lowbit() * Removed unnecessary library dependencies from Makefiles * Removed fletcher-4 kstat in user space * Added sha2 support explicitly to libzfs * Added highbit64() and lowbit64() to zpool_util.c Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6442
* Multi-modifier protection (MMP)Olaf Faaland2017-07-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add multihost=on|off pool property to control MMP. When enabled a new thread writes uberblocks to the last slot in each label, at a set frequency, to indicate to other hosts the pool is actively imported. These uberblocks are the last synced uberblock with an updated timestamp. Property defaults to off. During tryimport, find the "best" uberblock (newest txg and timestamp) repeatedly, checking for change in the found uberblock. Include the results of the activity test in the config returned by tryimport. These results are reported to user in "zpool import". Allow the user to control the period between MMP writes, and the duration of the activity test on import, via a new module parameter zfs_multihost_interval. The period is specified in milliseconds. The activity test duration is calculated from this value, and from the mmp_delay in the "best" uberblock found initially. Add a kstat interface to export statistics about Multiple Modifier Protection (MMP) updates. Include the last synced txg number, the timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV label that received the last MMP update, and the VDEV path. Abbreviated output below. $ cat /proc/spl/kstat/zfs/mypool/multihost 31 0 0x01 10 880 105092382393521 105144180101111 txg timestamp mmp_delay vdev_guid vdev_label vdev_path 20468 261337 250274925 68396651780 3 /dev/sda 20468 261339 252023374 6267402363293 1 /dev/sdc 20468 261340 252000858 6698080955233 1 /dev/sdx 20468 261341 251980635 783892869810 2 /dev/sdy 20468 261342 253385953 8923255792467 3 /dev/sdd 20468 261344 253336622 042125143176 0 /dev/sdab 20468 261345 253310522 1200778101278 2 /dev/sde 20468 261346 253286429 0950576198362 2 /dev/sdt 20468 261347 253261545 96209817917 3 /dev/sds 20468 261349 253238188 8555725937673 3 /dev/sdb Add a new tunable zfs_multihost_history to specify the number of MMP updates to store history for. By default it is set to zero meaning that no MMP statistics are stored. When using ztest to generate activity, for automated tests of the MMP function, some test functions interfere with the test. For example, the pool is exported to run zdb and then imported again. Add a new ztest function, "-M", to alter ztest behavior to prevent this. Add new tests to verify the new functionality. Tests provided by Giuseppe Di Natale. Reviewed by: Matthew Ahrens <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Ned Bass <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #745 Closes #6279
* Implemented zpool scrub pause/resumeAlek P2017-07-061-1/+2
| | | | | | | | | | | | | | | | | | Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Thomas Caputi <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #6167
* Dashes for zero latency values in zpool iostat -pTony Hutter2017-06-221-1/+11
| | | | | | | | | | | | This prints dashes instead of zeros for zero latency values in 'zpool iostat -p'. You'll get zero latencies reported when the disk is idle, but technically a zero latency is invalid, since you can't measure the latency of doing nothing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6210
* Implemented zpool sync commandAlek P2017-05-191-0/+3
| | | | | | | | | | | This addition will enable us to sync an open TXG to the main pool on demand. The functionality is similar to 'sync(2)' but 'zpool sync' will return when data has hit the main storage instead of potentially just the ZIL as is the case with the 'sync(2)' cmd. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #6122
* Force fault a vdev with 'zpool offline -f'Tony Hutter2017-05-191-0/+1
| | | | | | | | | | | | | This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #6094
* Add zfs_nicebytes() to print human-readable sizesLOLi2017-05-021-2/+4
| | | | | | | | | | | | | | | | * Add zfs_nicebytes() to print human-readable sizes Some 'zfs', 'zpool' and 'zdb' output strings can be confusing to the user when no units are specified. This add a new zfs_nicenum_format "ZFS_NICENUM_BYTES" used to print bytes in their human-readable form. Additionally, update some test cases to use machine-parsable 'zfs get'. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #2414 Closes #3185 Closes #3594 Closes #6032
* Prebaked scripts for zpool status/iostat -cTony Hutter2017-04-211-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates the "zpool status/iostat -c" commands to only run "pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever you install to). The scripts can only be run from -c as an unprivileged user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is set by root). This was done to encourage scripts to be written is such a way that normal users can use them, and to be cautious. If your script needs to run a privileged command, consider adding the appropriate line in /etc/sudoers. See zpool(8) for an example of how to do this. The patch also allows the scripts to output custom column names. If the script outputs a line like: name=value then "name" is used for the column name, and "value" is its value. Multiple columns can be specified by outputting multiple lines. Column names and values can have spaces. If the value is empty, a dash (-) is printed instead. After all the "name=value" lines are read (if any), zpool will take the next the next line of output (if any) and print it without a column header. After that, no more lines will be processed. This can be useful for printing errors. Lastly, this patch also disables the -c option with the latency and request size histograms, since it produced awkward output and made the code harder to maintain. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #5852
* OpenZFS 6865 - want zfs-tests cases for zpool labelclear commandYuri Pankov2017-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: John Kennedy <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: loli10K <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: - Updated 'zpool labelclear' and 'zdb -l' such that they attempt to find a vdev given solely its short name. This behavior is consistent with the upstream OpenZFS code and the test cases depend on it. The actual implementation differs slightly due to device naming conventions on Linux. - auto_online_001_pos, auto_replace_001_pos and add-o_ashift test cases updated to expect failure when no label exists. - read_efi_label() and zpool_label_disk_check() are read-only operations and should use O_RDONLY at open time to enforce this. - zpool_label_disk() and zpool_relabel_disk() write the partition information using O_DIRECT an fsync() and page cache invalidation to ensure a consistent view of the device. - dump_label() in zdb should invalidate the page cache in order to get the authoritative label from disk. OpenZFS-issue: https://www.illumos.org/issues/6865 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95076c Closes #5981
* OpenZFS 4521 - zfstest is trying to execute evil "zfs unmount -a"Giuseppe Di Natale2017-02-031-1/+3
| | | | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: John Kennedy <[email protected]> Approved by: Robert Mustacchi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> Porting Notes: - Correctly set __ZFS_POOL_RESTRICT in inherit_001_pos OpenZFS-issue: https://www.illumos.org/issues/4521 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8808ac5 Closes #5674
* Add -c to zpool iostat & status to run commandTony Hutter2016-11-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a command (-c) option to zpool status and zpool iostat. The -c option allows you to run an arbitrary command on each vdev and display the first line of output in zpool status/iostat. The environment vars VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path" before running the command. For device mapper, multipath, or partitioned vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk. This can be useful if the command you're running requires a /dev/sd* device. The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device instead of using libdevmapper. This not only removes the libdevmapper requirement at build time, but also allows you to resolve device mapper devices without being root. This means that UDEV_UPATH get set correctly when running zpool status/iostat as an unprivileged user. Example: $ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH' NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 mpatha ONLINE 0 0 0 I am /dev/mapper/mpatha, /dev/sdc sdb ONLINE 0 0 0 I am /dev/sdb1, /dev/sdb Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #5368
* Allow zfs unshare <protocol> -aLOLi2016-11-291-0/+1
| | | | | | | | | | | | | | | | | | Allow `zfs unshare <protocol> -a` command to share or unshare all datasets of a given protocol, nfs or smb. Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases. To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some function wrappers were added around them. Finally, fix and issue in smb_is_share_active() that would leave SMB shares exported when invoking 'zfs unshare -a' Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #3238 Closes #5367
* Fix 'zpool import' detection issuesBrian Behlendorf2016-11-071-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch addresses multiple 'zpool import' block device indentification problems which are most likely to occur on a system configured to use blkid, by_vdev paths, multipath and failover. The symptom most commonly observed is the import uses different path names to import the pool than would normally be expected. * When using blkid to identify vdevs the listed devices may be added to the cache in any order. In order to apply the preferred search order heuristic a zfs_path_order() function was added to calculate the order given full path names. * Since it's possible to have multiple block devices with different vdev guids which refer to the same ZPOOL_CONFIG_PATH the slice cache must be indexed by guid and name. By avoiding collisions the preferred ordering can be maintaining even when multiple block devices claim the same ZPOOL_CONFIG_PATH. The preferred sorting by partition was never benefitial for a Linux system and was removed as part of this change. * When adding entries to the blkid cache avl_find/avl_insert are used instead of avl_add because collisions are possible and must be handled gracefully. * For pools using multipath devices there are, at a minimum, three devices where a vdev label may be read. They are the dm-* device and each underlying /dev/sd* device. Due to the way the block cache is implemented each of these devices may have a different cached copy of the vdev label. This can result in "ghost pools" which appear to persist even after a 'zpool labelclear' has been done to the dm-* device. In order to prevent this the vdev label is read with O_DIRECT in order to bypass any caching to get the on-disk version. * When opening a block device verify that vdev guid read from the disk matches the expected vdev guid. This allows for bad labels to be filtered out. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5359
* Turn on/off enclosure slot fault LED even when disk isn't presentTony Hutter2016-10-241-2/+3
| | | | | | | | | | | | | | | | | Previously when a drive faulted, the statechange-led.sh script would lookup the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and turn it on. During testing we noticed that if you pulled out a drive, or if the drive was so badly broken that it no longer appeared to Linux, that the /sys/block/sd* path would be removed, and the script could not lookup the LED entry. To fix this, this patch looks up the disks's more persistent "/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then passes that path to the statechange-led script to use, rather than having the script look it up on the fly. This allows the script to turn on/off the slot LEDs even when the drive is missing. Closes #5309 Closes #2375
* Multipath autoreplace, control enclosure LEDs, event rate limitingTony Hutter2016-10-191-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | 1. Enable multipath autoreplace support for FMA. This extends FMA autoreplace to work with multipath disks. This requires libdevmapper to be installed at build time. 2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must be supported by the Linux SES driver for this to work. The enclosure LED scripts work for multipath devices as well. The scripts will clear the LED when the fault is cleared. 3. Rate limit ZIO delay and checksum events so as not to flood ZED ZIO delay and checksum events are rate limited to 5/sec in the zfs module. Reviewed-by: Richard Laager <[email protected]> Reviewed by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #2449 Closes #3017 Closes #5159