Allocate disk space fairly in the presence of vdevs of unequal size.

The metaslab allocator device selection algorithm contains a bias
mechanism whose goal is to achieve roughly equal disk space usage across
all top-level vdevs. It seems that the initial rationale for this code
was to allow newly added (empty) vdevs to "come up to speed" faster, so
that the pool quickly converges to a steady state where all vdevs are
equally utilized.

While the code seems to work reasonably well for this use case, there is
another scenario in which this algorithm fails miserably: the case where
top-level vdevs don't have the same sizes (capacities). ZFS allows this,
and it is a good feature to have, so that users who simply want to build
a pool with the disks they happen to have lying around can do so even if
the disks have heterogeneous sizes.

Here's a script that simulates a pool with two vdevs, with one 4X larger
than the other:

    dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728
    dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912
    zpool create testspace /tmp/d1 /tmp/d2
    dd if=/dev/zero of=/testspace/foobar bs=1M count=256
    zpool iostat -v testspace

Before this commit, the script would output the following:

                capacity
    pool        alloc   free
    ----------  -----  -----
    testspace    252M   375M
      /tmp/d1    104M  18.5M
      /tmp/d2    148M   356M
    ----------  -----  -----

This demonstrates that the current code handles this situation very
poorly: d1 shows 85% usage despite the pool itself being only 40% full.
d1 is nearly full at this point, and is slowing down the entire pool
due to saturation, fragmentation and the like.

In contrast, here's the result with the code in this commit:

                capacity
    pool        alloc   free
    ----------  -----  -----
    testspace    252M   375M
      /tmp/d1   56.7M  66.3M
      /tmp/d2    195M   309M
    ----------  -----  -----

This looks much better. d1 is 46% used, which is close to the overall
pool utilization (40%).

The code still doesn't result in perfectly balanced allocation, probably
because of the way mg_bias is applied, which does not guarantee perfect
accuracy, but this is still much better than before.

Signed-off-by: Etienne Dechamps <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3389
author: Etienne Dechamps <[email protected]> 2015-05-10 15:39:18 +0100
committer: Brian Behlendorf <[email protected]> 2015-06-22 14:18:29 -0700
commit: bb3250d07ec818587333d7c26116314b3dc8a684 (patch)
tree: 515f1d27d8eacba73b0e9f87f3671cdf9fe5be71 /module/zfs/metaslab.c
parent: 218b4e0a7608f7ef37ec72042a68c45e539a5d1c (diff)
1 files changed, 25 insertions, 11 deletions
diff --git a/module/zfs/metaslab.c b/module/zfs/metaslab.c
index 7ff1a4f5a..86bf3c197 100644
--- a/module/zfs/metaslab.c
+++ b/module/zfs/metaslab.c
@@ -2335,28 +2335,42 @@ top:
 			 * figure out whether the corresponding vdev is
 			 * over- or under-used relative to the pool,
 			 * and set an allocation bias to even it out.
+			 *
+			 * Bias is also used to compensate for unequally
+			 * sized vdevs so that space is allocated fairly.
 			 */
 			if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
 				vdev_stat_t *vs = &vd->vdev_stat;
-				int64_t vu, cu;
-
-				vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
-				cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
+				int64_t vs_free = vs->vs_space - vs->vs_alloc;
+				int64_t mc_free = mc->mc_space - mc->mc_alloc;
+				int64_t ratio;
 
 				/*
 				 * Calculate how much more or less we should
 				 * try to allocate from this device during
 				 * this iteration around the rotor.
-				 * For example, if a device is 80% full
-				 * and the pool is 20% full then we should
-				 * reduce allocations by 60% on this device.
 				 *
-				 * mg_bias = (20 - 80) * 512K / 100 = -307K
+				 * This basically introduces a zero-centered
+				 * bias towards the devices with the most
+				 * free space, while compensating for vdev
+				 * size differences.
+				 *
+				 * Examples:
+				 *  vdev V1 = 16M/128M
+				 *  vdev V2 = 16M/128M
+				 *  ratio(V1) = 100% ratio(V2) = 100%
+				 *
+				 *  vdev V1 = 16M/128M
+				 *  vdev V2 = 64M/128M
+				 *  ratio(V1) = 127% ratio(V2) =  72%
 				 *
-				 * This reduces allocations by 307K for this
-				 * iteration.
+				 *  vdev V1 = 16M/128M
+				 *  vdev V2 = 64M/512M
+				 *  ratio(V1) =  40% ratio(V2) = 160%
 				 */
-				mg->mg_bias = ((cu - vu) *
+				ratio = (vs_free * mc->mc_alloc_groups * 100) /
+				    (mc_free + 1);
+				mg->mg_bias = ((ratio - 100) *
 				    (int64_t)mg->mg_aliquot) / 100;
 			} else if (!metaslab_bias_enabled) {
 				mg->mg_bias = 0;
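
The arithmetic in the hunk above can be tried outside the kernel. The following is a minimal user-space sketch, not the ZFS implementation: the function names (sketch_ratio, sketch_bias_new, sketch_bias_old) and their flat integer parameters are hypothetical stand-ins for the vs_space/vs_alloc, mc_space/mc_alloc, mc_alloc_groups and mg_aliquot fields read by the patch. Only the formulas are taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * New formula from the patch: the vdev's share of the pool's free space,
 * scaled by the number of allocation groups, as a percentage.
 * 100 means "exactly a fair share". All arguments are in bytes except
 * groups. The "+ 1" is the patch's guard against division by zero.
 */
int64_t
sketch_ratio(int64_t vdev_space, int64_t vdev_alloc,
    int64_t pool_space, int64_t pool_alloc, int64_t groups)
{
	assert(vdev_alloc <= vdev_space && pool_alloc <= pool_space);
	int64_t vs_free = vdev_space - vdev_alloc;
	int64_t mc_free = pool_space - pool_alloc;

	return ((vs_free * groups * 100) / (mc_free + 1));
}

/* New bias: zero-centered around a ratio of 100%. */
int64_t
sketch_bias_new(int64_t ratio, int64_t aliquot)
{
	return (((ratio - 100) * aliquot) / 100);
}

/*
 * Old bias (the removed lines): difference between the pool's and the
 * vdev's percentage utilization, applied to the aliquot.
 */
int64_t
sketch_bias_old(int64_t vdev_space, int64_t vdev_alloc,
    int64_t pool_space, int64_t pool_alloc, int64_t aliquot)
{
	assert(vdev_alloc <= vdev_space && pool_alloc <= pool_space);
	int64_t vu = (vdev_alloc * 100) / (vdev_space + 1);
	int64_t cu = (pool_alloc * 100) / (pool_space + 1);

	return (((cu - vu) * aliquot) / 100);
}
```

For the second example in the comment (V1 = 16M/128M, V2 = 64M/128M, two groups), sketch_ratio yields 127 and 72, matching the comment. For the first and third examples the "+ 1" in the divisor makes integer division truncate one below the rounded figures in the comment (99 instead of 100, 39 and 159 instead of 40 and 160), which has a negligible effect on the resulting bias.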