path: root/module/spl/spl-mutex.c
author     Brian Behlendorf <[email protected]>  2009-09-25 14:47:01 -0700
committer  Brian Behlendorf <[email protected]>  2009-09-25 14:47:01 -0700
commit     4d54fdee1d774ddaef381893434a3721067e2c56 (patch)
tree       7139adfd73794aec7103361539b30903a6500572 /module/spl/spl-mutex.c
parent     d28db80fd0fd4fd63aec09037c44408e51a222d6 (diff)
Reimplement mutexes for Linux lock profiling/analysis
For a generic explanation of why mutexes needed to be reimplemented to work with the kernel lock profiling see commits: e811949a57044d60d12953c5c3b808a79a7d36ef and d28db80fd0fd4fd63aec09037c44408e51a222d6

The specific changes made to the mutex implementation are as follows. The Linux mutex structure is now directly embedded in the kmutex_t. This allows a kmutex_t to be directly cast to a mutex struct and passed directly to the Linux primitive. Just like with the rwlocks it is critical that these functions be implemented as #defines to ensure the location information is preserved. The preprocessor can then do a direct replacement of the Solaris primitive with the Linux primitive.

Just as with the rwlocks we need to track the lock owner. Here things get a little more interesting because, depending on your kernel version and how you've built your kernel, Linux may already do this for you. If you're running a 2.6.29 or newer kernel on an SMP system the lock owner will be tracked. This was added to Linux to support adaptive mutexes, more on that shortly. Alternately, your kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES in the kernel build. If neither of the above is true for your kernel, the kmutex_t type will include its own owner field and track the lock owner to ensure correct behavior. This is all handled by a new autoconf check called SPL_AC_MUTEX_OWNER.

Concerning adaptive mutexes, these are a very recent development and they did not make it into either the latest FC11 or SLES11 kernels. Ideally, I'd love to see this kernel change appear in one of these distros because it does help performance. From Linux kernel commit 0d66bf6d3514b35eb6897629059443132992dbd7: "Testing with Ingo's test-mutex application... gave a 345% boost for VFS scalability on my testbox". However, if you don't want to backport this change yourself you can still simply export the task_curr() symbol. The kmutex_t implementation will use this symbol when it's available to provide its own adaptive mutexes.

Finally, DEBUG_MUTEX support was removed, including the proc handlers. This was done because, now that we are cleanly integrated with the kernel lock profiling, all of this information and much more is available in debug kernel builds. This code was now redundant.

Updated mutexes validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
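The header-side changes that implement this (the kmutex_t layout and the #define wrappers in the SPL's sys/mutex.h) are not part of this diff. The following is only a rough sketch of the approach described above; the field and macro names (m_mutex, m_owner, MUTEX()) are hypothetical, not the actual header contents:

    /*
     * Sketch only: a Linux mutex embedded first in the kmutex_t so the type
     * can be handed directly to the Linux primitives, plus an owner field
     * reserved when the kernel does not track one itself.
     */
    #include <linux/mutex.h>
    #include <linux/sched.h>

    typedef struct {
            struct mutex            m_mutex;    /* embedded Linux mutex */
    #ifndef HAVE_MUTEX_OWNER
            struct task_struct      *m_owner;   /* owner tracked by the SPL */
    #endif
    } kmutex_t;

    #define MUTEX(mp)               (&(mp)->m_mutex)

    /*
     * Wrappers are #defines rather than functions so the file/line recorded
     * by the kernel lock profiling is the caller's, not a wrapper's.  With
     * HAVE_MUTEX_OWNER the kernel records the owner itself and mutex_owner()
     * would read the kernel's field instead of m_owner.
     */
    #ifndef HAVE_MUTEX_OWNER
    #define mutex_owner(mp)         ((mp)->m_owner)
    #define mutex_enter(mp)                                                 \
    do {                                                                    \
            mutex_lock(MUTEX(mp));                                          \
            (mp)->m_owner = current;                                        \
    } while (0)
    #define mutex_exit(mp)                                                  \
    do {                                                                    \
            (mp)->m_owner = NULL;                                           \
            mutex_unlock(MUTEX(mp));                                        \
    } while (0)
    #endif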
Diffstat (limited to 'module/spl/spl-mutex.c')
-rw-r--r--  module/spl/spl-mutex.c  295
1 file changed, 32 insertions, 263 deletions
diff --git a/module/spl/spl-mutex.c b/module/spl/spl-mutex.c
index f0389f5d1..0af74571d 100644
--- a/module/spl/spl-mutex.c
+++ b/module/spl/spl-mutex.c
@@ -1,7 +1,7 @@
/*
* This file is part of the SPL: Solaris Porting Layer.
*
- * Copyright (c) 2008 Lawrence Livermore National Security, LLC.
+ * Copyright (c) 2009 Lawrence Livermore National Security, LLC.
* Produced at Lawrence Livermore National Laboratory
* Written by:
* Brian Behlendorf <[email protected]>,
@@ -32,277 +32,46 @@
#define DEBUG_SUBSYSTEM S_MUTEX
-/* Mutex implementation based on those found in Solaris. This means
- * they the MUTEX_DEFAULT type is an adaptive mutex. When calling
- * mutex_enter() your process will spin waiting for the lock if it's
- * likely the lock will be free'd shortly. If it looks like the
- * lock will be held for a longer time we schedule and sleep waiting
- * for it. This determination is made by checking if the holder of
- * the lock is currently running on cpu or sleeping waiting to be
- * scheduled. If the holder is currently running it's likely the
- * lock will be shortly dropped.
+/*
+ * While a standard mutex implementation has been available in the kernel
+ * for quite some time, it was not until the 2.6.29 and later kernels that
+ * adaptive mutexes were embraced and integrated with the scheduler. This
+ * brought a significant performance improvement, but just as importantly
+ * it added a lock owner to the generic mutex outside CONFIG_DEBUG_MUTEXES
+ * builds. This is critical for correctly supporting the mutex_owner()
+ * Solaris primitive. When the owner is available we use a pure Linux
+ * mutex implementation. When the owner is not available we still use
+ * Linux mutexes as a base but also reserve space for an owner field right
+ * after the mutex structure.
*
- * XXX: This is basically a rough implementation to see if this
- * helps our performance. If it does a more careful implementation
- * should be done, perhaps in assembly.
+ * In the case where HAVE_MUTEX_OWNER is not defined your code may
+ * still be able to leverage adaptive mutexes. As long as the task_curr()
+ * symbol is exported this code will provide a poor man's adaptive mutex
+ * implementation. However, this is not required and if the symbol is
+ * unavailable we provide a standard mutex.
*/
-/* 0: Never spin when trying to aquire lock
- * -1: Spin until aquired or holder yeilds without dropping lock
+#ifndef HAVE_MUTEX_OWNER
+#ifdef HAVE_TASK_CURR
+/*
+ * mutex_spin_max = { 0, -1, 1-MAX_INT }
+ * 0: Never spin when trying to acquire lock
+ * -1: Spin until acquired or holder yields without dropping lock
* 1-MAX_INT: Spin for N attempts before sleeping for lock
*/
int mutex_spin_max = 0;
-
-#ifdef DEBUG_MUTEX
-int mutex_stats[MUTEX_STATS_SIZE] = { 0 };
-spinlock_t mutex_stats_lock;
-struct list_head mutex_stats_list;
-#endif
-
-int
-__spl_mutex_init(kmutex_t *mp, char *name, int type, void *ibc)
-{
- int flags = KM_SLEEP;
-
- ASSERT(mp);
- ASSERT(name);
- ASSERT(ibc == NULL);
-
- mp->km_name = NULL;
- mp->km_name_size = strlen(name) + 1;
-
- switch (type) {
- case MUTEX_DEFAULT:
- mp->km_type = MUTEX_ADAPTIVE;
- break;
- case MUTEX_SPIN:
- case MUTEX_ADAPTIVE:
- mp->km_type = type;
- break;
- default:
- SBUG();
- }
-
- /* We may be called when there is a non-zero preempt_count or
- * interrupts are disabled is which case we must not sleep.
- */
- if (current_thread_info()->preempt_count || irqs_disabled())
- flags = KM_NOSLEEP;
-
- /* Semaphore kmem_alloc'ed to keep struct size down (<64b) */
- mp->km_sem = kmem_alloc(sizeof(struct semaphore), flags);
- if (mp->km_sem == NULL)
- return -ENOMEM;
-
- mp->km_name = kmem_alloc(mp->km_name_size, flags);
- if (mp->km_name == NULL) {
- kmem_free(mp->km_sem, sizeof(struct semaphore));
- return -ENOMEM;
- }
-
- sema_init(mp->km_sem, 1);
- strncpy(mp->km_name, name, mp->km_name_size);
-
-#ifdef DEBUG_MUTEX
- mp->km_stats = kmem_zalloc(sizeof(int) * MUTEX_STATS_SIZE, flags);
- if (mp->km_stats == NULL) {
- kmem_free(mp->km_name, mp->km_name_size);
- kmem_free(mp->km_sem, sizeof(struct semaphore));
- return -ENOMEM;
- }
-
- /* XXX - This appears to be a much more contended lock than I
- * would have expected. To run with this debugging enabled and
- * get reasonable performance we may need to be more clever and
- * do something like hash the mutex ptr on to one of several
- * lists to ease this single point of contention.
- */
- spin_lock(&mutex_stats_lock);
- list_add_tail(&mp->km_list, &mutex_stats_list);
- spin_unlock(&mutex_stats_lock);
-#endif
- mp->km_magic = KM_MAGIC;
- mp->km_owner = NULL;
-
- return 0;
-}
-EXPORT_SYMBOL(__spl_mutex_init);
-
-void
-__spl_mutex_destroy(kmutex_t *mp)
-{
- ASSERT(mp);
- ASSERT(mp->km_magic == KM_MAGIC);
-
-#ifdef DEBUG_MUTEX
- spin_lock(&mutex_stats_lock);
- list_del_init(&mp->km_list);
- spin_unlock(&mutex_stats_lock);
-
- kmem_free(mp->km_stats, sizeof(int) * MUTEX_STATS_SIZE);
-#endif
- kmem_free(mp->km_name, mp->km_name_size);
- kmem_free(mp->km_sem, sizeof(struct semaphore));
-
- memset(mp, KM_POISON, sizeof(*mp));
-}
-EXPORT_SYMBOL(__spl_mutex_destroy);
-
-/* Return 1 if we acquired the mutex, else zero. */
-int
-__mutex_tryenter(kmutex_t *mp)
-{
- int rc;
- ENTRY;
-
- ASSERT(mp);
- ASSERT(mp->km_magic == KM_MAGIC);
- MUTEX_STAT_INC(mutex_stats, MUTEX_TRYENTER_TOTAL);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_TRYENTER_TOTAL);
-
- rc = down_trylock(mp->km_sem);
- if (rc == 0) {
- ASSERT(mp->km_owner == NULL);
- mp->km_owner = current;
- MUTEX_STAT_INC(mutex_stats, MUTEX_TRYENTER_NOT_HELD);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_TRYENTER_NOT_HELD);
- }
-
- RETURN(!rc);
-}
-EXPORT_SYMBOL(__mutex_tryenter);
-
-#ifndef HAVE_TASK_CURR
-#define task_curr(owner) 0
-#endif
-
-
-static void
-mutex_enter_adaptive(kmutex_t *mp)
-{
- struct task_struct *owner;
- int count = 0;
-
- /* Lock is not held so we expect to aquire the lock */
- if ((owner = mp->km_owner) == NULL) {
- down(mp->km_sem);
- MUTEX_STAT_INC(mutex_stats, MUTEX_ENTER_NOT_HELD);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_ENTER_NOT_HELD);
- } else {
- /* The lock is held by a currently running task which
- * we expect will drop the lock before leaving the
- * head of the runqueue. So the ideal thing to do
- * is spin until we aquire the lock and avoid a
- * context switch. However it is also possible the
- * task holding the lock yields the processor with
- * out dropping lock. In which case, we know it's
- * going to be a while so we stop spinning and go
- * to sleep waiting for the lock to be available.
- * This should strike the optimum balance between
- * spinning and sleeping waiting for a lock.
- */
- while (task_curr(owner) && (count <= mutex_spin_max)) {
- if (down_trylock(mp->km_sem) == 0) {
- MUTEX_STAT_INC(mutex_stats, MUTEX_ENTER_SPIN);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_ENTER_SPIN);
- GOTO(out, count);
- }
- count++;
- }
-
- /* The lock is held by a sleeping task so it's going to
- * cost us minimally one context switch. We might as
- * well sleep and yield the processor to other tasks.
- */
- down(mp->km_sem);
- MUTEX_STAT_INC(mutex_stats, MUTEX_ENTER_SLEEP);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_ENTER_SLEEP);
- }
-out:
- MUTEX_STAT_INC(mutex_stats, MUTEX_ENTER_TOTAL);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_ENTER_TOTAL);
-}
-
-void
-__mutex_enter(kmutex_t *mp)
-{
- ENTRY;
- ASSERT(mp);
- ASSERT(mp->km_magic == KM_MAGIC);
-
- switch (mp->km_type) {
- case MUTEX_SPIN:
- while (down_trylock(mp->km_sem));
- MUTEX_STAT_INC(mutex_stats, MUTEX_ENTER_SPIN);
- MUTEX_STAT_INC(mp->km_stats, MUTEX_ENTER_SPIN);
- break;
- case MUTEX_ADAPTIVE:
- mutex_enter_adaptive(mp);
- break;
- }
-
- ASSERT(mp->km_owner == NULL);
- mp->km_owner = current;
-
- EXIT;
-}
-EXPORT_SYMBOL(__mutex_enter);
-
-void
-__mutex_exit(kmutex_t *mp)
-{
- ENTRY;
- ASSERT(mp);
- ASSERT(mp->km_magic == KM_MAGIC);
- ASSERT(mp->km_owner == current);
- mp->km_owner = NULL;
- up(mp->km_sem);
- EXIT;
-}
-EXPORT_SYMBOL(__mutex_exit);
-
-/* Return 1 if mutex is held by current process, else zero. */
-int
-__mutex_owned(kmutex_t *mp)
-{
- ENTRY;
- ASSERT(mp);
- ASSERT(mp->km_magic == KM_MAGIC);
- RETURN(mp->km_owner == current);
-}
-EXPORT_SYMBOL(__mutex_owned);
-
-/* Return owner if mutex is owned, else NULL. */
-kthread_t *
-__spl_mutex_owner(kmutex_t *mp)
-{
- ENTRY;
- ASSERT(mp);
- ASSERT(mp->km_magic == KM_MAGIC);
- RETURN(mp->km_owner);
-}
-EXPORT_SYMBOL(__spl_mutex_owner);
+module_param(mutex_spin_max, int, 0644);
+MODULE_PARM_DESC(mutex_spin_max, "Spin a maximum of N times to acquire lock");
int
-spl_mutex_init(void)
+spl_mutex_spin_max(void)
{
- ENTRY;
-#ifdef DEBUG_MUTEX
- spin_lock_init(&mutex_stats_lock);
- INIT_LIST_HEAD(&mutex_stats_list);
-#endif
- RETURN(0);
+ return mutex_spin_max;
}
+EXPORT_SYMBOL(spl_mutex_spin_max);
-void
-spl_mutex_fini(void)
-{
- ENTRY;
-#ifdef DEBUG_MUTEX
- ASSERT(list_empty(&mutex_stats_list));
-#endif
- EXIT;
-}
+#endif /* HAVE_TASK_CURR */
+#endif /* !HAVE_MUTEX_OWNER */
-module_param(mutex_spin_max, int, 0644);
-MODULE_PARM_DESC(mutex_spin_max, "Spin a maximum of N times to aquire lock");
+int spl_mutex_init(void) { return 0; }
+void spl_mutex_fini(void) { }
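The adaptive spinning described in the comment above is driven by the mutex_spin_max tunable (read through the exported spl_mutex_spin_max()) and lives in the header-side mutex_enter() wrapper rather than in this file. A minimal sketch of that poor man's adaptive path, reusing the hypothetical m_owner/MUTEX() names from the earlier sketch and assuming task_curr() is exported:

    /* Sketch only: spin while the holder is on a CPU, then fall back to sleeping. */
    static inline void
    kmutex_enter_adaptive(kmutex_t *mp)
    {
            struct task_struct *owner = mp->m_owner;
            int i = 0;

            /* Holder is running, so the lock is likely to be dropped shortly:
             * spin forever (-1), never (0), or for at most N attempts. */
            while (owner && task_curr(owner) &&
                   (spl_mutex_spin_max() == -1 || i++ < spl_mutex_spin_max())) {
                    if (mutex_trylock(MUTEX(mp))) {
                            mp->m_owner = current;
                            return;
                    }
            }

            /* Holder is off-CPU or the spin budget is spent: block normally. */
            mutex_lock(MUTEX(mp));
            mp->m_owner = current;
    }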