aboutsummaryrefslogtreecommitdiffstats
path: root/module/zfs/dsl_deadlist.c
Commit message (Collapse)AuthorAgeFilesLines
* Implement Redacted Send/ReceivePaul Dagnelie2019-06-191-31/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Prashanth Sreenivasa <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Chris Williamson <[email protected]> Reviewed-by: Pavel Zhakarov <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #7958
* OpenZFS 7614, 9064 - zfs device evacuation/removalMatthew Ahrens2018-04-141-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <[email protected]> Reviewed-by: Alex Reece <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed by: Richard Laager <[email protected]> Reviewed by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported-by: Tim Chase <[email protected]> Signed-off-by: Tim Chase <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900
* OpenZFS 5428 - provide fts(), reallocarray(), and strtonum()Yuri Pankov2017-07-081-5/+3
| | | | | | | | | | | | | | | | Authored by: Yuri Pankov <[email protected]> Reviewed by: Robert Mustacchi <[email protected]> Approved by: Joshua M. Clulow <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Porting Notes: * All hunks unrelated to ZFS were dropped. OpenZFS-issue: https://www.illumos.org/issues/5428 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130 Closes #6326
* Add missing *_destroy/*_fini callsGvozden Neskovic2017-05-041-1/+1
| | | | | | | | | The proposed debugging enhancements in zfsonlinux/spl#587 identified the following missing *_destroy/*_fini calls. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Closes #5428
* panic in bpobj_space(): null pointer dereferenceMatthew Ahrens2017-02-091-9/+23
| | | | | | | | | | | | | | | | | | | | | | | | | This is a race condition in the deadlist code. A thread executing an administrative command that uses dsl_deadlist_space_range() holds the lock of the whole deadlist_t to protect the access of all its entries that the deadlist contains in an avl tree. Sync threads trying to insert a new entry in the deadlist (through dsl_deadlist_insert() -> dle_enqueue()) do not hold the deadlist lock at that moment. If the dle_bpobj is the empty bpobj (our sentinel value), we close and reopen it. Between these two operations, it is possible for the dsl_deadlist_space_range() thread to dereference that bpobj which is NULL during that window. Threads should hold the a deadlist's dl_lock when they manipulate its internal data so scenarios like the one above are avoided. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Dan Kimmel <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #5762
* Performance optimization of AVL tree comparator functionsGvozden Neskovic2016-08-311-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Richard Elling <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5033
* Handle damaged blk_birth in dsl_deadlist_insert()Brian Behlendorf2015-12-151-0/+8
| | | | | | | | | | | | | | | | | If a bit were cleared in `bp->blk_birth` such that the txg birth was now lower than any other txg_birth in the deadlist, then there will be no entry before this in the tree. This should be impossible but regardless error handling code has been added for this case. By default this is left as a fatal case and the blk_birth is logged. However, setting `zfs_recover=1` will cause the bp to be placed at the start of the deadlist even though it contains an invalid blk_birth. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4086 Closes #4089
* Illumos 5027 - zfs large block supportMatthew Ahrens2015-05-111-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5027 zfs large block support Reviewed by: Alek Pinchuk <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Josef 'Jeff' Sipek <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <[email protected]> Closes #354
* Illumos 5056 - ZFS deadlock on db_mtx and dn_holdsJustin T. Gibbs2015-04-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 5056 ZFS deadlock on db_mtx and dn_holds Author: Justin Gibbs <[email protected]> Reviewed by: Will Andrews <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5056 https://github.com/illumos/illumos-gate/commit/bc9014e Porting Notes: sa_handle_get_from_db(): - the original patch includes an otherwise unmentioned fix for a possible usage of an uninitialised variable dmu_objset_open_impl(): - Under Illumos list_link_init() is the same as filling a list_node_t with NULLs, so they don't notice if they miss doing list_link_init() on a zero'd containing structure (e.g. allocated with kmem_zalloc as here). Under Linux, not so much: an uninitialised list_node_t goes "Boom!" some time later when it's used or destroyed. dmu_objset_evict_dbufs(): - reduce stack usage using kmem_alloc() Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFSJustin T. Gibbs2015-04-281-2/+3
| | | | | | | | | | | | | | | | 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS Author: Justin T. Gibbs <[email protected]> Reviewed by: Andriy Gapon <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Will Andrews <[email protected]> Approved by: Dan McDonald <[email protected]> References: https://www.illumos.org/issues/5314 https://github.com/illumos/illumos-gate/commit/c137962 Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Change KM_PUSHPAGE -> KM_SLEEPBrian Behlendorf2015-01-161-2/+2
| | | | | | | | | | | | | | | By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. For others fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc() which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers which allows us to dip in to reserved memory. This is again the same as upstream. Signed-off-by: Brian Behlendorf <[email protected]>
* Illumos #3104: eliminate empty bpobjsMatthew Ahrens2013-01-081-9/+45
| | | | | | | | | | | | | | | | 3104 eliminate empty bpobjs Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Garrett D'Amore <[email protected]> Approved by: Eric Schrock <[email protected]> References: illumos/illumos-gate@f17457368189aa911f774c38c1f21875a568bdca illumos changeset: 13782:8f78aae28a63 https://www.illumos.org/issues/3104 Ported-by: Brian Behlendorf <[email protected]>
* Switch KM_SLEEP to KM_PUSHPAGERichard Yao2012-08-271-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: 21ade34 Disable direct reclaim for z_wr_* threads cfc9a5c Fix zpl_writepage() deadlock eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #726
* Illumos #1644, #1645, #1646, #1647, #1708Matthew Ahrens2012-07-311-3/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1644 add ZFS "clones" property 1645 add ZFS "written" and "written@..." properties 1646 "zfs send" should estimate size of stream 1647 "zfs destroy" should determine space reclaimed by destroying multiple snapshots 1708 adjust size of zpool history data References: https://www.illumos.org/issues/1644 https://www.illumos.org/issues/1645 https://www.illumos.org/issues/1646 https://www.illumos.org/issues/1647 https://www.illumos.org/issues/1708 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. This change has reordered all of the new ioctl()s to the end of the list. This should help minimize this issue in the future. Reviewed by: Richard Lowe <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Albert Lee <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported by: Martin Matuska <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #826 Closes #664
* Update core ZFS code from build 121 to build 141.Brian Behlendorf2010-05-281-0/+474