Emergency slab objects

This patch is designed to resolve a deadlock which can occur with __vmalloc() based slabs. The issue is that the Linux kernel does not honor the flags passed to __vmalloc(). This makes it unsafe to use in a writeback context. Unfortunately, this is a use case ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches are available which resolve the issue.

https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that using __vmalloc() in the context of writeback should never be done. Their solution was for us to rewrite parts of ZFS to accommodate the Linux VM.

While that is probably the right long term solution, and it is something we want to pursue, it is not a trivial task and will likely destabilize the existing code. This work has been planned for the 0.7.0 release, but in the meantime we want to improve the SPL slab implementation to accommodate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously in the context of a work queue. This doesn't prevent the worker thread from deadlocking. However, the caller can now safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the caller will be woken up when the new slab is available. The objects will then get cached in the per-cpu magazines and everything will proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described above, or is just very slow, then the callers on the wait queues will time out. When this rare situation occurs they will attempt to kmalloc() a single minimally sized object using the GFP_NOIO flags. This allocation will not deadlock because kmalloc() will honor the passed flags, so the caller will be able to make forward progress.

As long as forward progress can be maintained, the critical thread will make progress even if the worker thread is deadlocked. This will eventually allow the deadlocked worker thread to complete and normal operation will resume.

These emergency allocations will likely be slow since they require contiguous pages. However, their use should be rare so the impact is expected to be minimal. If that turns out not to be the case in practice, further optimizations are possible.

One additional concern is whether these emergency objects are long lived. Right now they are simply tracked on a list which must be walked when an object is freed. If they accumulate on a system and the list grows, freeing objects will become more expensive. This could be handled relatively easily by using a hash instead of a list, but that optimization (if needed) is left for a follow up patch.

Additionally, these emergency objects could be repacked into existing slabs as objects are freed if the kmem_cache_set_move() functionality were implemented. See issue https://github.com/zfsonlinux/spl/issues/26 for full details. This work would also help reduce ZFS's memory fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the end. The 'emerg' column reports the current number of these emergency objects in use for the cache, and the following 'max' column shows the historical worst case. These values should give us a good idea of how often these objects are needed. Based on these values under real use cases we can tune the default behavior.

Lastly, as a side benefit, using a single work queue for the slab allocations should reduce CPU contention on the global virtual address space lock. This should manifest itself as reduced CPU usage for the system.

Signed-off-by: Brian Behlendorf <[email protected]>
author: Brian Behlendorf <[email protected]> 2012-08-07 16:59:50 -0700
committer: Brian Behlendorf <[email protected]> 2012-08-27 12:00:42 -0700
commit: e2dcc6e2b818d6f996203c9f1b4dfaf449554790 (patch)
tree: 281003950af3b956f37458e5ee5d2f811a9f023f /include/sys
parent: 587045a6386d1e7a6256eb6ef825e046beabd9a3 (diff)
1 files changed, 17 insertions, 0 deletions
diff --git a/include/sys/kmem.h b/include/sys/kmem.h
index 344e2716b..aaff6d046 100644
--- a/include/sys/kmem.h
+++ b/include/sys/kmem.h
@@ -291,6 +291,7 @@ enum {
 	KMC_BIT_KMEM		= 5,	/* Use kmem cache */
 	KMC_BIT_VMEM		= 6,	/* Use vmem cache */
 	KMC_BIT_OFFSLAB		= 7,	/* Objects not on slab */
+	KMC_BIT_GROWING         = 15,   /* Growing in progress */
 	KMC_BIT_REAPING		= 16,	/* Reaping in progress */
 	KMC_BIT_DESTROY		= 17,	/* Destroy in progress */
 	KMC_BIT_TOTAL		= 18,	/* Proc handler helper bit */
@@ -315,6 +316,7 @@ typedef enum kmem_cbrc {
 #define KMC_KMEM		(1 << KMC_BIT_KMEM)
 #define KMC_VMEM		(1 << KMC_BIT_VMEM)
 #define KMC_OFFSLAB		(1 << KMC_BIT_OFFSLAB)
+#define KMC_GROWING		(1 << KMC_BIT_GROWING)
 #define KMC_REAPING		(1 << KMC_BIT_REAPING)
 #define KMC_DESTROY		(1 << KMC_BIT_DESTROY)
 #define KMC_TOTAL		(1 << KMC_BIT_TOTAL)
@@ -374,6 +376,17 @@ typedef struct spl_kmem_slab {
 	uint32_t		sks_ref;	/* Ref count used objects */
 } spl_kmem_slab_t;
 
+typedef struct spl_kmem_alloc {
+	struct spl_kmem_cache	*ska_cache;	/* Owned by cache */
+	int			ska_flags;	/* Allocation flags */
+	struct delayed_work	ska_work;	/* Allocation work */
+} spl_kmem_alloc_t;
+
+typedef struct spl_kmem_emergency {
+	void			*ske_obj;	/* Buffer address */
+	struct list_head	ske_list;	/* Emergency list linkage */
+} spl_kmem_emergency_t;
+
 typedef struct spl_kmem_cache {
 	uint32_t		skc_magic;	/* Sanity magic */
 	uint32_t		skc_name_size;	/* Name length */
@@ -398,7 +411,9 @@ typedef struct spl_kmem_cache {
 	struct list_head	skc_list;	/* List of caches linkage */
 	struct list_head	skc_complete_list;/* Completely alloc'ed */
 	struct list_head	skc_partial_list; /* Partially alloc'ed */
+	struct list_head	skc_emergency_list; /* Min sized objects */
 	spinlock_t		skc_lock;	/* Cache lock */
+	wait_queue_head_t	skc_waitq;	/* Allocation waiters */
 	uint64_t		skc_slab_fail;	/* Slab alloc failures */
 	uint64_t		skc_slab_create;/* Slab creates */
 	uint64_t		skc_slab_destroy;/* Slab destroys */
@@ -408,6 +423,8 @@ typedef struct spl_kmem_cache {
 	uint64_t		skc_obj_total;	/* Obj total current */
 	uint64_t		skc_obj_alloc;	/* Obj alloc current */
 	uint64_t		skc_obj_max;	/* Obj max historic */
+	uint64_t		skc_obj_emergency; /* Obj emergency current */
+	uint64_t		skc_obj_emergency_max; /* Obj emergency max */
 } spl_kmem_cache_t;
 #define kmem_cache_t		spl_kmem_cache_t
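
The bookkeeping implied by the new skc_emergency_list, skc_obj_emergency and skc_obj_emergency_max fields can be sketched in plain userspace C. This is a minimal illustration and not the SPL implementation: the struct and function names are hypothetical, a singly linked list replaces struct list_head, malloc() replaces kmalloc(), and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a tracked emergency object. */
struct emergency_obj {
	void                 *obj;   /* buffer handed to the caller */
	struct emergency_obj *next;  /* list linkage */
};

/* Hypothetical stand-in for the relevant cache fields. */
struct cache {
	size_t                obj_size;
	struct emergency_obj *emergency_list;
	uint64_t              obj_emergency;      /* current count */
	uint64_t              obj_emergency_max;  /* historic maximum */
};

/* Allocate one minimally sized object and track it on the list. */
void *emergency_alloc(struct cache *c)
{
	struct emergency_obj *e = malloc(sizeof (*e));

	if (e == NULL)
		return NULL;

	e->obj = malloc(c->obj_size);
	if (e->obj == NULL) {
		free(e);
		return NULL;
	}

	e->next = c->emergency_list;
	c->emergency_list = e;

	c->obj_emergency++;
	if (c->obj_emergency > c->obj_emergency_max)
		c->obj_emergency_max = c->obj_emergency;

	return e->obj;
}

/*
 * Walk the list looking for obj. Returns 1 if it was an emergency
 * object (and frees it), 0 if the caller should treat it as a normal
 * slab object. The walk is O(n) in the number of emergency objects.
 */
int emergency_free(struct cache *c, void *obj)
{
	struct emergency_obj **pp;

	for (pp = &c->emergency_list; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->obj == obj) {
			struct emergency_obj *e = *pp;

			*pp = e->next;
			free(e->obj);
			free(e);
			assert(c->obj_emergency > 0);
			c->obj_emergency--;
			return 1;
		}
	}

	return 0;
}
```

The linear walk in emergency_free() is the cost the commit message warns about if emergency objects accumulate; replacing the list with a hash keyed on the buffer address would make the lookup roughly constant time.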