Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold

Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically in a hypothetical pool with two vdevs, what can happen is the following: Vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth utilizing one vdev at a time with IOPS taking a hit. Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases but we don't remove (at least for now) just in case we hit any weird behavior in the future. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #8859
author: Serapheim Dimitropoulos <[email protected]> 2019-06-06 13:08:41 -0700
committer: Brian Behlendorf <[email protected]> 2019-06-06 13:08:41 -0700
commit: cb020f0d86f6a1902f49c6e7d1f934850188acc2 (patch)
tree: 1537cedbae662465bd6d798e31e26e7bab39c930 /module
parent: 627f5117a3ec4b435e6ee85026a1da671de44c7b (diff)
1 files changed, 20 insertions, 5 deletions
diff --git a/module/zfs/metaslab.c b/module/zfs/metaslab.c
index ec89810b4..d1d5a243f 100644
--- a/module/zfs/metaslab.c
+++ b/module/zfs/metaslab.c
@@ -103,12 +103,27 @@ int zfs_mg_noalloc_threshold = 0;
 
 /*
  * Metaslab groups are considered eligible for allocations if their
- * fragmenation metric (measured as a percentage) is less than or equal to
- * zfs_mg_fragmentation_threshold. If a metaslab group exceeds this threshold
- * then it will be skipped unless all metaslab groups within the metaslab
- * class have also crossed this threshold.
+ * fragmenation metric (measured as a percentage) is less than or
+ * equal to zfs_mg_fragmentation_threshold. If a metaslab group
+ * exceeds this threshold then it will be skipped unless all metaslab
+ * groups within the metaslab class have also crossed this threshold.
+ *
+ * This tunable was introduced to avoid edge cases where we continue
+ * allocating from very fragmented disks in our pool while other, less
+ * fragmented disks, exists. On the other hand, if all disks in the
+ * pool are uniformly approaching the threshold, the threshold can
+ * be a speed bump in performance, where we keep switching the disks
+ * that we allocate from (e.g. we allocate some segments from disk A
+ * making it bypassing the threshold while freeing segments from disk
+ * B getting its fragmentation below the threshold).
+ *
+ * Empirically, we've seen that our vdev selection for allocations is
+ * good enough that fragmentation increases uniformly across all vdevs
+ * the majority of the time. Thus we set the threshold percentage high
+ * enough to avoid hitting the speed bump on pools that are being pushed
+ * to the edge.
  */
-int zfs_mg_fragmentation_threshold = 85;
+int zfs_mg_fragmentation_threshold = 95;
 
 /*
  * Allow metaslabs to keep their active state as long as their fragmentation
author	Serapheim Dimitropoulos <[email protected]>	2019-06-06 13:08:41 -0700
committer	Brian Behlendorf <[email protected]>	2019-06-06 13:08:41 -0700
commit	cb020f0d86f6a1902f49c6e7d1f934850188acc2 (patch)
tree	1537cedbae662465bd6d798e31e26e7bab39c930 /module
parent	627f5117a3ec4b435e6ee85026a1da671de44c7b (diff)