| author | Matthew Ahrens <[email protected]> | 2020-12-16 14:40:05 -0800 |
| committer | GitHub <[email protected]> | 2020-12-16 14:40:05 -0800 |
| commit | be5c6d96530e19efde7c0af771f9ddb0073ef751 (patch) | |
| tree | 86ec18586ac7464f1154ba881cc0d89ff9270b79 /module/zfs/zio.c | |
| parent | f8020c936356b887ab1e03eba1a723f9dfda6eea (diff) | |
Only examine best metaslabs on each vdev
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab. This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be
substantial. This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.
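To make the cost pattern concrete, here is a minimal, self-contained C sketch of the pre-patch behavior. It is not the OpenZFS implementation: metaslab_t, metaslab_group_t, ms_max_free, mg_list, and find_metaslab_old() are simplified stand-ins invented for illustration; only the shape of the problem is preserved, namely that a request no metaslab can satisfy walks the group's entire metaslab list while holding the group lock.

```c
/*
 * Simplified illustration (not the OpenZFS code) of the pre-patch cost:
 * an allocation that nothing can satisfy walks every metaslab in the
 * group while holding the group lock, so a vdev with many metaslabs
 * pays an O(metaslab count) scan per failed attempt and serializes
 * concurrent allocators on the same mutex.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct metaslab {
	uint64_t ms_weight;	/* sort key; used in the follow-up sketch */
	uint64_t ms_max_free;	/* hypothetical: largest free segment */
	struct metaslab *ms_next;
} metaslab_t;

typedef struct metaslab_group {
	pthread_mutex_t mg_lock;	/* per-group lock, held for the scan */
	metaslab_t *mg_list;		/* hypothetical list of all metaslabs */
} metaslab_group_t;

/* Old behavior: try every metaslab before giving up and ganging. */
static metaslab_t *
find_metaslab_old(metaslab_group_t *mg, uint64_t asize)
{
	metaslab_t *found = NULL;

	pthread_mutex_lock(&mg->mg_lock);
	for (metaslab_t *ms = mg->mg_list; ms != NULL; ms = ms->ms_next) {
		if (ms->ms_max_free >= asize) {
			found = ms;
			break;
		}
	}
	/* The full walk happens exactly when no metaslab fits the request. */
	pthread_mutex_unlock(&mg->mg_lock);
	return (found);
}
```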
To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations. These will now only examine the 100
best metaslabs (as determined by their ms_weight). If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.
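Below is a sketch of the new policy, reusing the hypothetical types and includes from the sketch above; NORMAL_ALLOC_MAX_METASLABS and find_metaslab_limited() are illustrative names, not identifiers from the actual patch. Assuming the list is kept sorted by ms_weight with the best metaslab first, a normal (non-try_hard) search gives up after the first 100 entries so the caller can gang instead of scanning the remainder under the lock.

```c
/* Hypothetical knob mirroring the commit's "100 best metaslabs" limit. */
#define	NORMAL_ALLOC_MAX_METASLABS	100

/*
 * New policy (sketch): with mg_list kept sorted by ms_weight, best
 * first, a normal allocation examines at most the top 100 metaslabs
 * and then bails out so the caller can gang; a try_hard pass still
 * scans everything.
 */
static metaslab_t *
find_metaslab_limited(metaslab_group_t *mg, uint64_t asize, bool try_hard)
{
	metaslab_t *found = NULL;
	int examined = 0;

	pthread_mutex_lock(&mg->mg_lock);
	for (metaslab_t *ms = mg->mg_list; ms != NULL; ms = ms->ms_next) {
		if (!try_hard && examined++ >= NORMAL_ALLOC_MAX_METASLABS)
			break;	/* give up early; caller falls back to ganging */
		if (ms->ms_max_free >= asize) {
			found = ms;
			break;
		}
	}
	pthread_mutex_unlock(&mg->mg_lock);
	return (found);
}
```

The bound turns the worst-case lock hold time from O(number of metaslabs) into a constant, at the price of occasionally ganging a write that some lower-weight metaslab could still have satisfied.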
To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation. Non-try_hard allocations will only examine the
100 best metaslabs of each vdev. In summary, we will first try normal
allocation. If that fails then we will do a gang allocation. If that
fails then we will do a "try hard" gang allocation. If that fails then
we will have a multi-layer gang block.
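The resulting fallback ladder, as a compact C sketch. The helper functions are hypothetical stand-ins for the real paths through zio_dva_allocate() and the gang-block machinery; only the order of attempts is taken from the commit message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real allocation paths. */
int normal_alloc(uint64_t size);		/* best-metaslabs-only search */
int gang_alloc(uint64_t size, bool try_hard);	/* gang header + smaller members */
int multilayer_gang_alloc(uint64_t size);	/* gang blocks whose members gang again */

/*
 * Order of attempts after this commit:
 *   1. normal allocation (limited metaslab search)
 *   2. gang allocation
 *   3. "try hard" gang allocation
 *   4. multi-layer gang block
 */
int
allocate_with_fallback(uint64_t size)
{
	if (normal_alloc(size) == 0)
		return (0);
	if (gang_alloc(size, false) == 0)
		return (0);
	if (gang_alloc(size, true) == 0)
		return (0);
	return (multilayer_gang_alloc(size));
}
```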
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11327
Diffstat (limited to 'module/zfs/zio.c')
-rw-r--r-- | module/zfs/zio.c | 15 |
1 file changed, 7 insertions, 8 deletions
diff --git a/module/zfs/zio.c b/module/zfs/zio.c
index ba438353a..3c2b731f7 100644
--- a/module/zfs/zio.c
+++ b/module/zfs/zio.c
@@ -3585,17 +3585,16 @@ zio_alloc_zil(spa_t *spa, objset_t *os, uint64_t txg, blkptr_t *new_bp,
 	 * of, so we just hash the objset ID to pick the allocator to get
 	 * some parallelism.
 	 */
-	error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp, 1,
-	    txg, NULL, METASLAB_FASTWRITE, &io_alloc_list, NULL,
-	    cityhash4(0, 0, 0, os->os_dsl_dataset->ds_object) %
-	    spa->spa_alloc_count);
+	int flags = METASLAB_FASTWRITE | METASLAB_ZIL;
+	int allocator = cityhash4(0, 0, 0, os->os_dsl_dataset->ds_object) %
+	    spa->spa_alloc_count;
+	error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp,
+	    1, txg, NULL, flags, &io_alloc_list, NULL, allocator);
 	if (error == 0) {
 		*slog = TRUE;
 	} else {
-		error = metaslab_alloc(spa, spa_normal_class(spa), size,
-		    new_bp, 1, txg, NULL, METASLAB_FASTWRITE,
-		    &io_alloc_list, NULL, cityhash4(0, 0, 0,
-		    os->os_dsl_dataset->ds_object) % spa->spa_alloc_count);
+		error = metaslab_alloc(spa, spa_normal_class(spa), size, new_bp,
+		    1, txg, NULL, flags, &io_alloc_list, NULL, allocator);
 		if (error == 0)
 			*slog = FALSE;
 	}
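The zio.c change itself is mechanical: the METASLAB_FASTWRITE flag and the cityhash4()-derived allocator index, previously spelled out separately for the log-class attempt and the normal-class fallback, are hoisted into shared flags and allocator locals, and METASLAB_ZIL is added to the flags. Given that the commit message exempts ZIL allocations (along with try_hard) from the new best-100-metaslabs limit, the new flag presumably lets the metaslab layer recognize these allocations and keep searching every metaslab for them.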