OpenZFS 7090 - zfs should throttle allocations

OpenZFS 7090 - zfs should throttle allocations Authored by: George Wilson <[email protected]> Reviewed by: Alex Reece <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Approved by: Matthew Ahrens <[email protected]> Ported-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. OpenZFS-issue: https://www.illumos.org/issues/7090 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7 Closes #5258 Porting Notes: - Maintained minimal stack in zio_done - Preserve linux-specific io sizes in zio_write_compress - Added module params and documentation - Updated to use optimize AVL cmp macros
author: Don Brady <[email protected]> 2016-10-13 18:59:18 -0600
committer: Brian Behlendorf <[email protected]> 2016-10-13 17:59:18 -0700
commit: 3dfb57a35e8cbaa7c424611235d669f3c575ada1 (patch)
tree: d0958fdc57be43a540bba035580f0d8b39f1a99c /include/sys/metaslab_impl.h
parent: a85a90557dfc70e09475c156a376f6923a6c89f0 (diff)
1 files changed, 62 insertions, 1 deletions
diff --git a/include/sys/metaslab_impl.h b/include/sys/metaslab_impl.h
index 27a53b515..1c8993aca 100644
--- a/include/sys/metaslab_impl.h
+++ b/include/sys/metaslab_impl.h
@@ -24,7 +24,7 @@
  */
 
 /*
- * Copyright (c) 2011, 2014 by Delphix. All rights reserved.
+ * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
  */
 
 #ifndef _SYS_METASLAB_IMPL_H
@@ -59,11 +59,42 @@ extern "C" {
  * to use a block allocator that best suits that class.
  */
 struct metaslab_class {
+	kmutex_t		mc_lock;
 	spa_t			*mc_spa;
 	metaslab_group_t	*mc_rotor;
 	metaslab_ops_t		*mc_ops;
 	uint64_t		mc_aliquot;
+
+	/*
+	 * Track the number of metaslab groups that have been initialized
+	 * and can accept allocations. An initialized metaslab group is
+	 * one has been completely added to the config (i.e. we have
+	 * updated the MOS config and the space has been added to the pool).
+	 */
+	uint64_t		mc_groups;
+
+	/*
+	 * Toggle to enable/disable the allocation throttle.
+	 */
+	boolean_t		mc_alloc_throttle_enabled;
+
+	/*
+	 * The allocation throttle works on a reservation system. Whenever
+	 * an asynchronous zio wants to perform an allocation it must
+	 * first reserve the number of blocks that it wants to allocate.
+	 * If there aren't sufficient slots available for the pending zio
+	 * then that I/O is throttled until more slots free up. The current
+	 * number of reserved allocations is maintained by the mc_alloc_slots
+	 * refcount. The mc_alloc_max_slots value determines the maximum
+	 * number of allocations that the system allows. Gang blocks are
+	 * allowed to reserve slots even if we've reached the maximum
+	 * number of allocations allowed.
+	 */
+	uint64_t		mc_alloc_max_slots;
+	refcount_t		mc_alloc_slots;
+
 	uint64_t		mc_alloc_groups; /* # of allocatable groups */
+
 	uint64_t		mc_alloc;	/* total allocated space */
 	uint64_t		mc_deferred;	/* total deferred frees */
 	uint64_t		mc_space;	/* total space (alloc + free) */
@@ -85,6 +116,15 @@ struct metaslab_group {
 	avl_tree_t		mg_metaslab_tree;
 	uint64_t		mg_aliquot;
 	boolean_t		mg_allocatable;		/* can we allocate? */
+
+	/*
+	 * A metaslab group is considered to be initialized only after
+	 * we have updated the MOS config and added the space to the pool.
+	 * We only allow allocation attempts to a metaslab group if it
+	 * has been initialized.
+	 */
+	boolean_t		mg_initialized;
+
 	uint64_t		mg_free_capacity;	/* percentage free */
 	int64_t			mg_bias;
 	int64_t			mg_activation_count;
@@ -93,6 +133,27 @@ struct metaslab_group {
 	taskq_t			*mg_taskq;
 	metaslab_group_t	*mg_prev;
 	metaslab_group_t	*mg_next;
+
+	/*
+	 * Each metaslab group can handle mg_max_alloc_queue_depth allocations
+	 * which are tracked by mg_alloc_queue_depth. It's possible for a
+	 * metaslab group to handle more allocations than its max. This
+	 * can occur when gang blocks are required or when other groups
+	 * are unable to handle their share of allocations.
+	 */
+	uint64_t		mg_max_alloc_queue_depth;
+	refcount_t		mg_alloc_queue_depth;
+
+	/*
+	 * A metalab group that can no longer allocate the minimum block
+	 * size will set mg_no_free_space. Once a metaslab group is out
+	 * of space then its share of work must be distributed to other
+	 * groups.
+	 */
+	boolean_t		mg_no_free_space;
+
+	uint64_t		mg_allocations;
+	uint64_t		mg_failed_allocations;
 	uint64_t		mg_fragmentation;
 	uint64_t		mg_histogram[RANGE_TREE_HISTOGRAM_SIZE];
 };
author	Don Brady <[email protected]>	2016-10-13 18:59:18 -0600
committer	Brian Behlendorf <[email protected]>	2016-10-13 17:59:18 -0700
commit	3dfb57a35e8cbaa7c424611235d669f3c575ada1 (patch)
tree	d0958fdc57be43a540bba035580f0d8b39f1a99c /include/sys/metaslab_impl.h
parent	a85a90557dfc70e09475c156a376f6923a6c89f0 (diff)