| author | Matthew Ahrens <[email protected]> | 2018-02-26 15:33:55 -0800 |
|---|---|---|
| committer | Brian Behlendorf <[email protected]> | 2018-05-24 10:18:07 -0700 |
| commit | 0dc2f70c5cece6ef2474e14552111ae098d9f5b4 (patch) | |
| tree | 8414edcb42c28aecbc4e9422eb02d15d7e98d035 /module/zfs/vdev_label.c | |
| parent | ba863d0be4cbfbea938b10e49fb6ff459ac9ec20 (diff) | |
OpenZFS 9486 - reduce memory used by device removal on fragmented pools
Device removal allocates a new location for each allocated segment on
the disk that's being removed. Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location. When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table. In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.
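For orientation, here is a simplified picture of what each mapping entry has to hold. This is an illustrative struct only; the actual on-disk entry (vdev_indirect_mapping_entry_phys_t) encodes the same information more compactly using DVAs.

```c
#include <stdint.h>

/*
 * Illustrative shape of one entry in the removal mapping table: the old
 * location and length on the removed device, mapped to a new location.
 * A large number of such entries is what drives the ~1GB-per-1TB worst
 * case described above.
 */
typedef struct mapping_entry {
	uint64_t me_old_offset;	/* offset on the device being removed */
	uint64_t me_size;	/* length of the remapped segment */
	uint64_t me_new_offset;	/* offset at the new location */
} mapping_entry_t;
```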
We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries. For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts). In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use). We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.
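A rough sketch of that span-merging step follows. It is illustrative only: the extent type, the merge_spans() helper, and the max_span threshold are hypothetical stand-ins, not the actual OpenZFS removal code.

```c
#include <stddef.h>
#include <stdint.h>

/* One allocated extent on the device being removed (illustrative type). */
typedef struct extent {
	uint64_t start;	/* offset on the removing device */
	uint64_t size;	/* allocated length in bytes */
} extent_t;

/*
 * Merge extents separated by free gaps smaller than max_span into single
 * spans.  Each output span becomes one mapping entry; the free gaps
 * swallowed by a span are what later get marked free at the new location
 * and obsolete in the mapping.  Assumes in[] is sorted by offset and
 * non-overlapping.
 */
static size_t
merge_spans(const extent_t *in, size_t n, uint64_t max_span, extent_t *out)
{
	size_t nout = 0;

	for (size_t i = 0; i < n; i++) {
		if (nout > 0) {
			extent_t *prev = &out[nout - 1];
			uint64_t gap = in[i].start -
			    (prev->start + prev->size);

			if (gap < max_span) {
				/* Absorb the gap and this extent. */
				prev->size += gap + in[i].size;
				continue;
			}
		}
		out[nout++] = in[i];	/* Start a new span. */
	}
	return (nout);
}
```

With the 4K/1K/4K example above and a max_span larger than 1K, merge_spans() produces a single 9K span, i.e. one mapping entry where previously there would have been two.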
This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible. But it results in a drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.
In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.
Porting notes:
* Add the following as module parameters (see the sketch after this list):
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes
* Document the following module parameters:
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes
    * zfs_condense_min_mapping_bytes
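A minimal sketch of how the two added tunables might be exposed as Linux module parameters. The variable types, default values, permission bits, and description strings below are assumptions for illustration, not copied from the port.

```c
#include <linux/module.h>

/* Tunables named in the porting notes; types and defaults assumed here. */
int zfs_condense_indirect_vdevs_enable = 1;
unsigned long zfs_condense_max_obsolete_bytes = 1024 * 1024 * 1024;

module_param(zfs_condense_indirect_vdevs_enable, int, 0644);
MODULE_PARM_DESC(zfs_condense_indirect_vdevs_enable,
	"Enable condensing of indirect (removed) vdev mappings");

module_param(zfs_condense_max_obsolete_bytes, ulong, 0644);
MODULE_PARM_DESC(zfs_condense_max_obsolete_bytes,
	"Obsolete-bytes threshold used when deciding to condense a mapping");
```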
Authored by: Matthew Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Ported-by: Tim Chase <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes #7536
Diffstat (limited to 'module/zfs/vdev_label.c')
-rw-r--r-- | module/zfs/vdev_label.c | 53 |
1 file changed, 32 insertions, 21 deletions
diff --git a/module/zfs/vdev_label.c b/module/zfs/vdev_label.c
index e5ba7cdc2..7ea8da1e6 100644
--- a/module/zfs/vdev_label.c
+++ b/module/zfs/vdev_label.c
@@ -33,15 +33,15 @@
  * 1. Uniquely identify this device as part of a ZFS pool and confirm its
  *    identity within the pool.
  *
- * 2. Verify that all the devices given in a configuration are present
+ * 2. Verify that all the devices given in a configuration are present
  *    within the pool.
  *
- * 3. Determine the uberblock for the pool.
+ * 3. Determine the uberblock for the pool.
  *
- * 4. In case of an import operation, determine the configuration of the
+ * 4. In case of an import operation, determine the configuration of the
  *    toplevel vdev of which it is a part.
  *
- * 5. If an import operation cannot find all the devices in the pool,
+ * 5. If an import operation cannot find all the devices in the pool,
  *    provide enough information to the administrator to determine which
  *    devices are missing.
  *
@@ -77,9 +77,9 @@
  * In order to identify which labels are valid, the labels are written in the
  * following manner:
  *
- * 1. For each vdev, update 'L1' to the new label
- * 2. Update the uberblock
- * 3. For each vdev, update 'L2' to the new label
+ * 1. For each vdev, update 'L1' to the new label
+ * 2. Update the uberblock
+ * 3. For each vdev, update 'L2' to the new label
  *
  * Given arbitrary failure, we can determine the correct label to use based on
  * the transaction group. If we fail after updating L1 but before updating the
@@ -117,19 +117,19 @@
  *
  * The nvlist describing the pool and vdev contains the following elements:
  *
- * version      ZFS on-disk version
- * name         Pool name
- * state        Pool state
- * txg          Transaction group in which this label was written
- * pool_guid    Unique identifier for this pool
- * vdev_tree    An nvlist describing vdev tree.
+ * version      ZFS on-disk version
+ * name         Pool name
+ * state        Pool state
+ * txg          Transaction group in which this label was written
+ * pool_guid    Unique identifier for this pool
+ * vdev_tree    An nvlist describing vdev tree.
  * features_for_read
  *              An nvlist of the features necessary for reading the MOS.
  *
  * Each leaf device label also contains the following:
  *
- * top_guid     Unique ID for top-level vdev in which this is contained
- * guid         Unique ID for the leaf vdev
+ * top_guid     Unique ID for top-level vdev in which this is contained
+ * guid         Unique ID for the leaf vdev
  *
  * The 'vs' configuration follows the format described in 'spa_config.c'.
  */
@@ -515,22 +515,33 @@ vdev_config_generate(spa_t *spa, vdev_t *vd, boolean_t getstats,
 			 * histograms.
 			 */
 			uint64_t seg_count = 0;
+			uint64_t to_alloc = vd->vdev_stat.vs_alloc;

 			/*
 			 * There are the same number of allocated segments
 			 * as free segments, so we will have at least one
-			 * entry per free segment.
+			 * entry per free segment.  However, small free
+			 * segments (smaller than vdev_removal_max_span)
+			 * will be combined with adjacent allocated segments
+			 * as a single mapping.
 			 */
 			for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) {
-				seg_count += vd->vdev_mg->mg_histogram[i];
+				if (1ULL << (i + 1) < vdev_removal_max_span) {
+					to_alloc +=
+					    vd->vdev_mg->mg_histogram[i] <<
+					    (i + 1);
+				} else {
+					seg_count +=
+					    vd->vdev_mg->mg_histogram[i];
+				}
 			}

 			/*
-			 * The maximum length of a mapping is SPA_MAXBLOCKSIZE,
-			 * so we need at least one entry per SPA_MAXBLOCKSIZE
-			 * of allocated data.
+			 * The maximum length of a mapping is
+			 * zfs_remove_max_segment, so we need at least one entry
+			 * per zfs_remove_max_segment of allocated data.
 			 */
-			seg_count += vd->vdev_stat.vs_alloc / SPA_MAXBLOCKSIZE;
+			seg_count += to_alloc / zfs_remove_max_segment;

 			fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_SIZE,
 			    seg_count *
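The last hunk above changes how the expected mapping size is estimated. Below is a standalone sketch of that estimate, with hypothetical constants and a hypothetical free-segment histogram standing in for the vdev_removal_max_span and zfs_remove_max_segment tunables and the metaslab-group histogram used in the real code.

```c
#include <stdint.h>

/* Bucket i counts free segments of roughly 2^(i+1) bytes (illustrative). */
#define FREE_HISTOGRAM_BUCKETS	48

/* Illustrative defaults; the real values come from module tunables. */
static uint64_t removal_max_span = 32ULL * 1024;		/* 32 KiB */
static uint64_t remove_max_segment = 16ULL * 1024 * 1024;	/* 16 MiB */

/*
 * Estimate the number of mapping entries needed to remove a device,
 * given its allocated byte count and a histogram of free-segment sizes.
 */
static uint64_t
estimate_mapping_entries(uint64_t alloc_bytes,
    const uint64_t free_histogram[FREE_HISTOGRAM_BUCKETS])
{
	uint64_t seg_count = 0;
	uint64_t to_alloc = alloc_bytes;

	for (int i = 0; i < FREE_HISTOGRAM_BUCKETS; i++) {
		if ((1ULL << (i + 1)) < removal_max_span) {
			/*
			 * Small free segments are merged into adjacent
			 * allocated spans, so count their bytes as data
			 * to be copied rather than as extra entries.
			 */
			to_alloc += free_histogram[i] << (i + 1);
		} else {
			/* Each large free segment ends a span. */
			seg_count += free_histogram[i];
		}
	}

	/* Each mapping entry covers at most remove_max_segment bytes. */
	seg_count += to_alloc / remove_max_segment;

	return (seg_count);
}
```

As in the hunk, this only sizes the estimate reported via ZPOOL_CONFIG_INDIRECT_SIZE; it does not build the mapping itself.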