| author | Matthew Ahrens <[email protected]> | 2020-12-16 14:40:05 -0800 |
| committer | GitHub <[email protected]> | 2020-12-16 14:40:05 -0800 |
| commit | be5c6d96530e19efde7c0af771f9ddb0073ef751 (patch) | |
| tree | 86ec18586ac7464f1154ba881cc0d89ff9270b79 /module/zfs/zio.c | |
| parent | f8020c936356b887ab1e03eba1a723f9dfda6eea (diff) | |
Only examine best metaslabs on each vdev
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab. This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be
substantial. This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.
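To make the cost pattern concrete, here is a minimal, self-contained C sketch of the pre-patch behavior. It is not the OpenZFS implementation: metaslab_t, metaslab_group_t, ms_max_free, mg_list, and find_metaslab_old() are simplified stand-ins invented for illustration; only the shape of the problem is preserved, namely that a request no metaslab can satisfy walks the group's entire metaslab list while holding the group lock.

```c
/*
 * Simplified illustration (not the OpenZFS code) of the pre-patch cost:
 * an allocation that nothing can satisfy walks every metaslab in the
 * group while holding the group lock, so a vdev with many metaslabs
 * pays an O(metaslab count) scan per failed attempt and serializes
 * concurrent allocators on the same mutex.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct metaslab {
	uint64_t ms_weight;	/* sort key; used in the follow-up sketch */
	uint64_t ms_max_free;	/* hypothetical: largest free segment */
	struct metaslab *ms_next;
} metaslab_t;

typedef struct metaslab_group {
	pthread_mutex_t mg_lock;	/* per-group lock, held for the scan */
	metaslab_t *mg_list;		/* hypothetical list of all metaslabs */
} metaslab_group_t;

/* Old behavior: try every metaslab before giving up and ganging. */
static metaslab_t *
find_metaslab_old(metaslab_group_t *mg, uint64_t asize)
{
	metaslab_t *found = NULL;

	pthread_mutex_lock(&mg->mg_lock);
	for (metaslab_t *ms = mg->mg_list; ms != NULL; ms = ms->ms_next) {
		if (ms->ms_max_free >= asize) {
			found = ms;
			break;
		}
	}
	/* The full walk happens exactly when no metaslab fits the request. */
	pthread_mutex_unlock(&mg->mg_lock);
	return (found);
}
```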
To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations. These will now only examine the 100
best metaslabs (as determined by their ms_weight). If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.
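Below is a sketch of the new policy, reusing the hypothetical types and includes from the sketch above; NORMAL_ALLOC_MAX_METASLABS and find_metaslab_limited() are illustrative names, not identifiers from the actual patch. Assuming the list is kept sorted by ms_weight with the best metaslab first, a normal (non-try_hard) search gives up after the first 100 entries so the caller can gang instead of scanning the remainder under the lock.

```c
/* Hypothetical knob mirroring the commit's "100 best metaslabs" limit. */
#define	NORMAL_ALLOC_MAX_METASLABS	100

/*
 * New policy (sketch): with mg_list kept sorted by ms_weight, best
 * first, a normal allocation examines at most the top 100 metaslabs
 * and then bails out so the caller can gang; a try_hard pass still
 * scans everything.
 */
static metaslab_t *
find_metaslab_limited(metaslab_group_t *mg, uint64_t asize, bool try_hard)
{
	metaslab_t *found = NULL;
	int examined = 0;

	pthread_mutex_lock(&mg->mg_lock);
	for (metaslab_t *ms = mg->mg_list; ms != NULL; ms = ms->ms_next) {
		if (!try_hard && examined++ >= NORMAL_ALLOC_MAX_METASLABS)
			break;	/* give up early; caller falls back to ganging */
		if (ms->ms_max_free >= asize) {
			found = ms;
			break;
		}
	}
	pthread_mutex_unlock(&mg->mg_lock);
	return (found);
}
```

The bound turns the worst-case lock hold time from O(number of metaslabs) into a constant, at the price of occasionally ganging a write that some lower-weight metaslab could still have satisfied.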
To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation. Non-try_hard allocations will only examine the
100 best metaslabs of each vdev. In summary, we will first try normal
allocation. If that fails then we will do a gang allocation. If that
fails then we will do a "try hard" gang allocation. If that fails then
we will have a multi-layer gang block.
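The resulting fallback ladder, as a compact C sketch. The helper functions are hypothetical stand-ins for the real paths through zio_dva_allocate() and the gang-block machinery; only the order of attempts is taken from the commit message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real allocation paths. */
int normal_alloc(uint64_t size);		/* best-metaslabs-only search */
int gang_alloc(uint64_t size, bool try_hard);	/* gang header + smaller members */
int multilayer_gang_alloc(uint64_t size);	/* gang blocks whose members gang again */

/*
 * Order of attempts after this commit:
 *   1. normal allocation (limited metaslab search)
 *   2. gang allocation
 *   3. "try hard" gang allocation
 *   4. multi-layer gang block
 */
int
allocate_with_fallback(uint64_t size)
{
	if (normal_alloc(size) == 0)
		return (0);
	if (gang_alloc(size, false) == 0)
		return (0);
	if (gang_alloc(size, true) == 0)
		return (0);
	return (multilayer_gang_alloc(size));
}
```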
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11327
Diffstat (limited to 'module/zfs/zio.c')
-rw-r--r-- | module/zfs/zio.c | 15 |
1 file changed, 7 insertions, 8 deletions
diff --git a/module/zfs/zio.c b/module/zfs/zio.c
index ba438353a..3c2b731f7 100644
--- a/module/zfs/zio.c
+++ b/module/zfs/zio.c
@@ -3585,17 +3585,16 @@ zio_alloc_zil(spa_t *spa, objset_t *os, uint64_t txg, blkptr_t *new_bp,
 	 * of, so we just hash the objset ID to pick the allocator to get
 	 * some parallelism.
 	 */
-	error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp, 1,
-	    txg, NULL, METASLAB_FASTWRITE, &io_alloc_list, NULL,
-	    cityhash4(0, 0, 0, os->os_dsl_dataset->ds_object) %
-	    spa->spa_alloc_count);
+	int flags = METASLAB_FASTWRITE | METASLAB_ZIL;
+	int allocator = cityhash4(0, 0, 0, os->os_dsl_dataset->ds_object) %
+	    spa->spa_alloc_count;
+	error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp,
+	    1, txg, NULL, flags, &io_alloc_list, NULL, allocator);
 	if (error == 0) {
 		*slog = TRUE;
 	} else {
-		error = metaslab_alloc(spa, spa_normal_class(spa), size,
-		    new_bp, 1, txg, NULL, METASLAB_FASTWRITE,
-		    &io_alloc_list, NULL, cityhash4(0, 0, 0,
-		    os->os_dsl_dataset->ds_object) % spa->spa_alloc_count);
+		error = metaslab_alloc(spa, spa_normal_class(spa), size, new_bp,
+		    1, txg, NULL, flags, &io_alloc_list, NULL, allocator);
 		if (error == 0)
 			*slog = FALSE;
 	}
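The zio.c change itself is mechanical: the METASLAB_FASTWRITE flag and the cityhash4()-derived allocator index, previously spelled out separately for the log-class attempt and the normal-class fallback, are hoisted into shared flags and allocator locals, and METASLAB_ZIL is added to the flags. Given that the commit message exempts ZIL allocations (along with try_hard) from the new best-100-metaslabs limit, the new flag presumably lets the metaslab layer recognize these allocations and keep searching every metaslab for them.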