Commit message    Author    Age    Files    Lines
* kmem-cache: Use taskqs for ageing    Brian Behlendorf    2012-12-12    2    -43/+52
|   Shift the cache and magazine ageing functionality over to the new delayed taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires.
|   However, the delayed taskq interface does not allow us to schedule a task for a specific cpu, so the ageing code was slightly reworked. The magazine ageing delay has been directly linked to the cache ageing function. The spl_cache_age() function invokes on_each_cpu() in order to run spl_magazine_age() on each cpu. It then blocks waiting for them to complete and promptly reclaims any free slabs.
|   While restructuring the code wasn't the primary goal, I think the new code is far more understandable and maintainable. It should also help minimize magazine thrashing because free slabs are immediately released after the magazine is aged.
|   Signed-off-by: Brian Behlendorf <[email protected]>
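As a rough illustration of the ageing pattern described above, the sketch below uses the kernel's on_each_cpu() helper; everything except on_each_cpu() itself (the cache type and the my_* function names) is a hypothetical stand-in for the real spl_cache_age()/spl_magazine_age() code.

    #include <linux/smp.h>          /* on_each_cpu() */

    struct my_cache;                /* hypothetical stand-in for the real cache type */

    /* Runs on every cpu via IPI, so it must not sleep. */
    static void my_magazine_age(void *data)
    {
            struct my_cache *skc = data;    /* per-cache context */

            /* ... age this cpu's magazine for 'skc' ... */
            (void) skc;
    }

    static void my_cache_age(struct my_cache *skc)
    {
            /* wait=1: block until every cpu has finished ageing its magazine. */
            on_each_cpu(my_magazine_age, skc, 1);

            /* ... now promptly reclaim any free slabs ... */
    }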
* kmem-cache: spl_kmem_cache_create() may always sleep    Brian Behlendorf    2012-12-12    1    -11/+8
|   When this code was originally written I went overboard and allowed for the possibility of creating a cache in an atomic context. In practice there are no callers which ever do this. This makes sense since a cache is by design a long lived data structure.
|   To prevent abuse of this function going forward I'm removing the code which is supposed to handle an atomic context. All allocators have been updated to use KM_SLEEP and the might_sleep() debug macro has been added to immediately detect atomic callers.
|   Signed-off-by: Brian Behlendorf <[email protected]>
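A minimal sketch of the guard described above; the function name and argument list are simplified placeholders rather than the real spl_kmem_cache_create() prototype, and only might_sleep(), KM_SLEEP and kmem_zalloc() are taken from the actual interfaces.

    #include <linux/kernel.h>       /* might_sleep() */
    #include <sys/kmem.h>           /* KM_SLEEP, kmem_zalloc() */

    static void *
    my_cache_create(size_t size)
    {
            /* Trips a debug warning immediately if a caller is atomic. */
            might_sleep();

            /* All internal allocations may now unconditionally sleep. */
            return (kmem_zalloc(size, KM_SLEEP));
    }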
* splat taskq:front: Reduce stack frame    Brian Behlendorf    2012-12-12    1    -1/+6
|   The slightly increased size of the taskq_ent_t when debugging is enabled has pushed the taskq:front splat test over the frame size limit. To resolve this, dynamically allocate the taskq_ent_t structures so they are part of the heap instead of the stack.
|   In function 'splat_taskq_test6_impl'
|   error: the frame size of 1648 bytes is larger than 1024 bytes
|   Signed-off-by: Brian Behlendorf <[email protected]>
* splat taskq:order: Reduce stack frame    Brian Behlendorf    2012-12-12    1    -1/+6
|   The slightly increased size of the taskq_ent_t when debugging is enabled has pushed the taskq:order splat test over the frame size limit. To resolve this, dynamically allocate the taskq_ent_t structures so they are part of the heap instead of the stack.
|   In function 'splat_taskq_test5_impl'
|   error: the frame size of 1680 bytes is larger than 1024 bytes
|   Signed-off-by: Brian Behlendorf <[email protected]>
* splat taskq:cancel: Add test case    Brian Behlendorf    2012-12-12    1    -0/+183
|   Add a test case for taskq_cancel_id() to verify it is working properly. Just like taskq:delay we start by dispatching 100 tasks. However, this time 1/3 of the tasks use taskq_dispatch() and will be run immediately, and 2/3 use taskq_dispatch_delay(). The idea is to create a busy taskq with a mix of active, pending, and delayed tasks.
|   After all the items have been successfully dispatched the test begins randomly canceling known task ids. It will do this for 5 seconds, randomly canceling a task id and then sleeping for a few milliseconds. The task being canceled may have already run, may still be on the pending list, or may currently be executing on a worker thread. The idea is to ensure we catch any subtle race conditions.
|   Once all the non-canceled tasks have completed we cross check the number of tasks which ran with the number of tasks which were successfully canceled. Additionally, we verify that taskq_cancel_id() never blocks longer than needed. This time is bounded by the longest run time of the task which was dispatched.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* splat taskq:delay: Add test case    Brian Behlendorf    2012-12-12    1    -11/+113
|   Add a test case for taskq_dispatch_delay() to verify it is working properly. The test dispatches 100 tasks to a taskq with random expiration times spread over 5 seconds. As each task expires and gets executed by a worker thread it verifies that it was run at the correct time. Once all the delayed tasks have been executed we double check that all the dispatched tasks were successful.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* taskq delay/cancel functionality    Brian Behlendorf    2012-12-12    3    -119/+389
|   Add the ability to dispatch a delayed task to a taskq. The desired behavior is for the task to be queued but not executed by a worker thread until the expiration time is reached. To achieve this two new functions were added (a usage sketch follows this entry).
|   * taskq_dispatch_delay() - This function behaves exactly like taskq_dispatch(); however, it takes a third 'expire_time' argument. The caller should pass the desired time the task should be executed as an absolute value in jiffies. The task is guaranteed not to run before this time, but it may run slightly later if all the worker threads are busy.
|   * taskq_cancel_id() - Given a task id, attempt to cancel the task before it gets executed. This is primarily useful for canceling delayed tasks but can be used to cancel any previously dispatched task. There are three possible return values.
|     0      - The task was found and canceled before it was executed.
|     ENOENT - The task was not found; either it already ran or an invalid task id was supplied by the caller.
|     EBUSY  - The task is currently executing and may not be canceled. This function will block until the task has been completed.
|   * taskq_wait_all() - The taskq_wait_id() function was renamed taskq_wait_all() to more clearly reflect its actual behavior. It is currently only used by the splat taskq regression tests.
|   * taskq_wait_id() - Historically, the only difference between this function and taskq_wait() was that you passed the task id. In both functions you would block until ALL lower task ids had executed. This was semantically correct but could be very slow, particularly if there were delayed tasks submitted. To better accommodate the delayed tasks this function was reimplemented. It will now only block until the passed task id has been completed.
|   This is actually a fairly low risk change for a few reasons.
|   * Only new ZFS callers will make use of the new interfaces and very little common code was changed to support the new functions.
|   * The existing taskq_wait() implementation was not changed, just slightly refactored.
|   * Since the newly optimized taskq_wait_id() implementation was never used by ZFS we can't accidentally introduce a new bug there.
|   NOTE: This functionality does not exist in the Illumos taskqs.
|   Signed-off-by: Brian Behlendorf <[email protected]>
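The usage sketch referenced above shows how the two new interfaces fit together. The prototypes follow the description in the commit message, but the exact argument lists in sys/taskq.h may differ; 'tq', 'arg', and my_func() are assumed to exist.

    #include <sys/taskq.h>

    static void my_func(void *arg) { /* ... task body ... */ }

    static void
    my_example(taskq_t *tq, void *arg)
    {
            /* Queue my_func(arg), but do not run it for roughly 5 seconds.
             * The expiration time is passed as an absolute value in jiffies. */
            taskqid_t id = taskq_dispatch_delay(tq, my_func, arg, TQ_SLEEP,
                ddi_get_lbolt() + 5 * HZ);

            /* Attempt to cancel the task before it runs. */
            switch (taskq_cancel_id(tq, id)) {
            case 0:         /* found and canceled before execution */
                    break;
            case ENOENT:    /* already ran, or invalid task id */
                    break;
            case EBUSY:     /* was executing; the call blocked until it completed */
                    break;
            }
    }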
* taskq style, remove #define wrappers    Brian Behlendorf    2012-12-12    2    -46/+36
|   When the taskq implementation was originally written I wrapped all the API functions in #define's. This was done as a preventative measure to ensure that a taskq symbol never conflicted with an existing kernel symbol. However, in practice the taskq symbols never conflicted. The only major conflicts occurred with the kmem cache API. Since this added layer of obfuscation never bought us anything for the taskqs I'm removing it.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* taskq style, convert spaces to soft tabs    Brian Behlendorf    2012-12-12    2    -217/+218
|   Update the taskq implementation to conform with the style used throughout the rest of the code. There are no functional changes in this commit.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* splat linux:shrinker: Fix fail-safe    Steven Johnson    2012-12-12    1    -4/+21
|   Ensure the fail-safe is reset between successive tests.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* splat linux:shrinker: Fix race condition    Steven Johnson    2012-12-12    1    -1/+27
|   Ensure the test thread blocks until the shrinker has completed its work. This is done by putting the test thread to sleep and waking it each time the shrinker callback runs. Once the shrinker size drops to zero, or we time out, the test is allowed to proceed.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #96
|   Closes #125
|   Closes #182
* splat command verbose behavior    Brian Behlendorf    2012-12-11    1    -1/+2
|   The splat command takes a verbose option which, when set, prints the internal debug log for every test. This is helpful when tracking down a common failure, but for a rare failure the volume of log data is distracting. Therefore, the verbose option has been adjusted to allow only printing the debug log on failure. The legacy behavior is still available by specifying the verbose option twice. For example:
|   $ splat -t all:all      # Never print the debug log
|   $ splat -v -t all:all   # Only print debug log on failure
|   $ splat -vv -t all:all  # Always print the debug log
|   Signed-off-by: Brian Behlendorf <[email protected]>
* splat taskq:front: Fix race    Steven Johnson    2012-12-05    1    -30/+32
|   The taskq:front test has a race condition where tasks 4 and 8 race to complete, due to an incorrectly calculated set of delay "factors" (T). If task 4 wins and actually finishes first, the verification of the order of completion will fail.
|   The delays calculated to order task completion do not take into account the terminal line in the table, and so are all off by a factor of 1. This causes all the tasks in all queues to finish sooner than expected, and the accumulated error is the root cause of tasks 4 and 8 racing to complete first. Before the change the "actual" table looks like I commented in #130.
|   I changed:
|   * the table in the comment to correctly reflect the test and the factor timings needed.
|   * the individual task delay factors of T so that only 1 task will complete every 2T (on average).
|   * 1T was reduced from 100ms to 50ms. This halves the duration of the test and makes any remaining raciness more likely to cause failures, but it did not cause the test to fail.
|   * simplified the delay factor logic by using a table look-up instead of a switch.
|   * added a "task started" message so that with -v it is possible to see the order tasks are started.
|   * moved the "task completed" message inside the spinlock so that with -v the message truly reflects the absolute order of completion as guaranteed by the spinlock.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #130
* Handle errors from spl_kern_path_locked()    Brian Behlendorf    2012-12-03    1    -0/+2
|   When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added by commit bcb1589 an entirely new vn_remove() implementation was added. That function did not properly handle an error from spl_kern_path_locked(), which would result in a panic. This patch addresses the issue by returning the error to the caller.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #187
* Linux compat 3.7, kernel_thread()    Brian Behlendorf    2012-12-03    2    -66/+61
|   The preferred kernel interface for creating threads has been kthread_create() for a long time now. However, several of the SPLAT tests still use the legacy kernel_thread() function which has finally been dropped (mostly). Update the condvar and rwlock SPLAT tests to use the modern interface. Frankly this is something we should have done a long time ago.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #194
* Verify --with-linux source directory exists    Brian Behlendorf    2012-11-29    1    -5/+8
|   Previously this check was only performed when ./configure was attempting to autodetect your kernel source directory. But we should also handle the case where --with-linux was provided and is obviously wrong. This way we catch the error before invoking make and compiling the source with incorrect autoconf results.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #162
* Disable FS reclaim when allocating new slabs    Brian Behlendorf    2012-11-27    1    -1/+1
|   Allowing the spl_cache_grow_work() function to reclaim inodes allows for two unlikely deadlocks. Therefore, we clear __GFP_FS for these allocations (see the sketch after this entry). The two deadlocks are:
|   * While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function calls kmem_cache_alloc() which happens to need to allocate a new slab. To allocate the new slab we enter FS level reclaim and attempt to evict several inodes. To evict these inodes we need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it just happens that obj1 and obj2 use the same hashed lock.
|   * Similar to the first case, however instead of getting blocked on the hash lock we block in txg_wait_open() which is waiting for the next txg, which isn't coming because the txg_sync thread is blocked in kmem_cache_alloc().
|   Note this isn't a 100% fix because vmalloc() won't strictly honor __GFP_FS. However, in practice this is sufficient because several very unlikely things must all occur concurrently.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Issue zfsonlinux/zfs#1101
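The sketch referenced above shows the essential change: masking __GFP_FS out of the flags used when growing a slab. The function and its arguments are illustrative, not the actual spl_cache_grow_work() code.

    #include <linux/gfp.h>
    #include <linux/slab.h>

    /* Allocate backing memory for a new slab without entering FS reclaim. */
    static void *
    my_slab_alloc(size_t size, gfp_t flags)
    {
            /* Clearing __GFP_FS prevents inode eviction, and therefore the
             * two lock-inversion deadlocks described above, while still
             * permitting ordinary page reclaim. */
            return (kmalloc(size, flags & ~__GFP_FS));
    }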
* SPL 0.6.0-rc12    Brian Behlendorf    2012-11-13    1    -1/+1
|
* Merge branch 'kmem-cache-optimization'    Brian Behlendorf    2012-11-08    3    -50/+131
|\  This branch contains kmem cache optimizations designed to resolve the lockups reported in zfsonlinux/zfs#922. The lockups were largely the result of spin lock contention in the slab under low memory conditions. Fundamentally, these changes are all designed to minimize that contention through a variety of methods.
| |   * Improved vmem cached deadlock detection
| |   * Track emergency objects in rbtree
| |   * Optimize spl_kmem_cache_free()
| |   * Never spin in kmem_cache_alloc()
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| |   zfsonlinux/zfs#922
| * Never spin in kmem_cache_alloc()    Brian Behlendorf    2012-11-06    1    -5/+17
| |   If we are reaping from the cache and a concurrent allocation occurs then the caller must block until the reaping is complete. This is signaled by the clearing of the KMC_BIT_REAPING bit.
| |   Otherwise the caller will be in a tight loop which takes and releases the skc->skc_cache lock. When there are multiple concurrent callers the system will thrash on the lock and appear to lock up.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * Optimize spl_kmem_cache_free()    Brian Behlendorf    2012-11-06    1    -4/+5
| |   Because only virtual slabs may have emergency objects, and those objects are guaranteed to have physical addresses, it can be easily determined whether the passed object is a virtual slab object or an emergency object. This allows us to completely optimize the emergency object free case out of the common free path.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
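Because vmalloc()'d addresses are distinguishable from kmalloc()'d (physical) ones, the fast-path check can be as simple as the sketch below; the helper name is invented and the real test in spl_kmem_cache_free() may be arranged differently.

    #include <linux/mm.h>           /* is_vmalloc_addr() */

    /* Only objects from a vmalloc-backed cache that are NOT themselves
     * vmalloc addresses can be emergency (kmalloc'd) objects. */
    static int
    my_is_emergency_obj(int cache_is_vmem, const void *obj)
    {
            return (cache_is_vmem && !is_vmalloc_addr(obj));
    }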
| * Track emergency object in rbtree    Brian Behlendorf    2012-11-06    2    -28/+75
| |   In the initial implementation emergency objects were tracked on a per-cache list. The assumption was that under normal operation we would never allocate more than a handful of these objects, so the cost of walking the list during free was expected to be negligible.
| |   However, real world usage has shown that emergency objects tend to be allocated in batches. A deadlock will be detected and several thousand emergency objects will be allocated before the original blocked slab allocation can complete.
| |   Therefore the original list has been replaced by a red black tree which is sorted by the memory address of each allocated object. This bounds the worst case insertion and removal time to O(log n), which minimizes contention on the associated spin lock.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
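A sketch of the address-keyed red-black tree insert described above, using the kernel's generic rbtree API; the emergency_obj structure is illustrative and the SPL's actual per-object bookkeeping differs in its details.

    #include <linux/rbtree.h>

    struct emergency_obj {
            struct rb_node  eo_node;
            void            *eo_ptr;        /* address handed out to the caller */
    };

    /* Insert 'new' into 'root' sorted by object address: O(log n). */
    static void
    emergency_insert(struct rb_root *root, struct emergency_obj *new)
    {
            struct rb_node **p = &root->rb_node, *parent = NULL;

            while (*p) {
                    struct emergency_obj *e;

                    e = rb_entry(*p, struct emergency_obj, eo_node);
                    parent = *p;
                    p = (new->eo_ptr < e->eo_ptr) ?
                        &(*p)->rb_left : &(*p)->rb_right;
            }

            rb_link_node(&new->eo_node, parent, p);
            rb_insert_color(&new->eo_node, root);
    }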
| * Improved vmem cached deadlock detection    Brian Behlendorf    2012-11-06    3    -13/+34
|/    The entire goal of performing the slab allocations asynchronously is to be able to detect when a vmalloc() deadlocks. In this case, and only this case, do we want to start allocating emergency objects. The trick here is to minimize false positives because the overhead of tracking emergency objects is far higher than normal slab objects.
|     With that goal in mind the code was reworked to be less sensitive to slow allocations by increasing the wait time. Once a cache is marked deadlocked all subsequent allocations which can not be satisfied with existing cache objects will immediately allocate new emergency objects. This behavior persists until the asynchronous allocation completes and clears the deadlocked flag.
|     The result of these tweaks is that far fewer emergency objects get created, which is important because this minimizes the cost of releasing them later in kmem_cache_free().
|     Signed-off-by: Brian Behlendorf <[email protected]>
* Merge branch 'splat'    Brian Behlendorf    2012-11-06    22    -81/+93
|\  Additional debugging, some cleanup, and an assortment of fixes to the SPLAT tests and infrastructure. Full details in the individual patches.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * splat kmem:slab_overcommit: Disabled    Brian Behlendorf    2012-11-06    1    -8/+8
| |   Disable this test because it may result in an OOM event on the system which can result in the test infrastructure being killed.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * splat atomic:64-bit: Create thread outside spin lock    Brian Behlendorf    2012-11-06    1    -7/+7
| |   The Fedora 3.6 debug kernel identified the following issue where we create a thread under a spin lock. This isn't safe because sleeping could result in a deadlock. Therefore the lock is changed to a mutex so it's safe to sleep.
| |   BUG: sleeping function called from invalid context at mm/slub.c:930
| |   in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat
| |   1 lock held by splat/10583:
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * splat: Fix log buffer locking    Brian Behlendorf    2012-11-06    2    -14/+15
| |   The Fedora 3.6 debug kernel identified the following issue where we call copy_to_user() under a spin lock. This used to be safe in older kernels but no longer appears to be true, so the spin lock was changed to a mutex. None of this code is performance critical so allowing the process to sleep is harmless.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * splat: Cleanup headers    Brian Behlendorf    2012-11-06    20    -43/+37
| |   Restructure the SPLAT headers such that each test only includes the minimal set of headers it requires.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * Condition variable reference counts    Brian Behlendorf    2012-11-06    2    -9/+26
|/    Reference count every entry and exit from the condition variable functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast(). This allows us to safely block in cv_destroy() until all consumers have been scheduled and are no longer accessing the condition variable memory.
|     In addition poison the magic value at the start of cv_destroy() to ensure there are never any new callers after cv_destroy() is called. The consumer is responsible for ensuring this never occurs.
|     Signed-off-by: Brian Behlendorf <[email protected]>
* Merge remote branch 'eris/stats'    Brian Behlendorf    2012-11-06    2    -7/+70
|\  Bring in support for the new KSTAT_TYPE_TXG type. This allows for additional visibility into the txg handling.
| |   Signed-off-by: Brian Behlendorf <[email protected]>
| * Add KSTAT_TYPE_TXG type    Brian Behlendorf    2012-11-02    2    -1/+61
| |   Add a new kstat type for tracking useful statistics about a TXG. The new KSTAT_TYPE_TXG type can be used to track the following statistics per-txg.
| |   txg          - Unique txg number
| |   state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
| |   birth        - Creation time
| |   nread        - Bytes read
| |   nwritten     - Bytes written
| |   reads        - IOPs read
| |   writes       - IOPs write
| |   open_time    - Length in nanoseconds the txg was open
| |   quiesce_time - Length in nanoseconds the txg was quiescing
| |   sync_time    - Length in nanoseconds the txg was syncing
| |   Signed-off-by: Brian Behlendorf <[email protected]>
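For illustration, the per-txg record behind the new type would look roughly like the structure below; the field names mirror the list above, but the actual kstat_txg_t definition in the SPL headers may differ in types and layout.

    /* Illustrative per-txg kstat record; not the authoritative definition. */
    typedef struct my_kstat_txg {
            u_longlong_t    txg;            /* unique txg number */
            char            state;          /* O/Q/S/C */
            hrtime_t        birth;          /* creation time */
            u_longlong_t    nread;          /* bytes read */
            u_longlong_t    nwritten;       /* bytes written */
            uint_t          reads;          /* IOPs read */
            uint_t          writes;         /* IOPs write */
            hrtime_t        open_time;      /* ns the txg was open */
            hrtime_t        quiesce_time;   /* ns the txg was quiescing */
            hrtime_t        sync_time;      /* ns the txg was syncing */
    } my_kstat_txg_t;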
| * Make kstat.ks_update() callback atomic    Brian Behlendorf    2012-10-23    2    -6/+9
|/    Move the kstat ks_update() callback under the ks_lock. This enables dynamically sized kstats without modification to the kstat API.
|     * Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
|     * Register a ->ks_update() callback which does:
|       o Frees any existing ks_data buffer.
|       o Sets ks_data_size to the kstat array size.
|       o Sets ks_data to an allocated buffer of size ks_data_size.
|       o Populates the array of buffers with the required data.
|     The buffer allocated in the ks_update() callback is guaranteed to remain allocated and valid while the proc sequence handler iterates over the buffer. The lock will not be dropped until the kstat_seq_stop() function is run, making it safe for concurrent access. To allow the ks_update() callback to perform memory allocations the lock was changed to a mutex.
|     Signed-off-by: Brian Behlendorf <[email protected]>
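A hedged sketch of a ->ks_update() callback following the steps above. my_entry_t, my_count() and my_fill() are invented helpers; the kstat_t fields and the KSTAT_READ/KSTAT_WRITE convention come from the Solaris-style kstat API the SPL provides.

    #include <sys/kstat.h>
    #include <sys/kmem.h>

    typedef struct my_entry { uint64_t value; } my_entry_t;        /* invented payload */
    static int my_count(void) { return (16); }                     /* invented helpers */
    static void my_fill(my_entry_t *buf, int n) { (void) buf; (void) n; }

    static int
    my_ks_update(kstat_t *ksp, int rw)
    {
            int n;

            if (rw == KSTAT_WRITE)
                    return (EACCES);

            /* Free any existing ks_data buffer. */
            if (ksp->ks_data)
                    kmem_free(ksp->ks_data, ksp->ks_data_size);

            /* Size and allocate a fresh buffer, then populate it.  Because
             * ks_lock is now a mutex, a sleeping allocation is permitted. */
            n = my_count();
            ksp->ks_ndata = n;
            ksp->ks_data_size = n * sizeof (my_entry_t);
            ksp->ks_data = kmem_alloc(ksp->ks_data_size, KM_SLEEP);
            my_fill(ksp->ks_data, n);

            return (0);
    }

In use, the callback would be assigned to ksp->ks_update after kstat_create() and before kstat_install(), with KSTAT_FLAG_VIRTUAL set so the framework does not try to manage ks_data itself.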
* Linux 3.7 compat, __clear_close_on_exec() removed    Brian Behlendorf    2012-10-18    3    -197/+0
|   Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec() function out of include/linux/fdtable.h and into fs/file.c, making it unavailable to the SPL.
|   Now as it turns out we only used this function to tear down some test infrastructure for the vn_getf()/vn_releasef() SPLAT regression tests. Rather than implement even more autoconf compatibility code to handle this we just remove the test case. This also allows us to drop three existing autoconf tests.
|   This does mean the SPLAT tests will no longer verify these functions, but historically they have never been a problem. And if we feel we absolutely need this test coverage I'm sure a more portable version of the test case could be added.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #183
* Linux 3.6 compat, kern_path_locked() added    Yuxuan Shui    2012-10-14    3    -0/+158
|   The kern_path_parent() function was removed from Linux 3.6 because it was observed that all the callers just want the parent dentry. The simpler kern_path_locked() function replaces kern_path_parent() and does the lookup while holding the ->i_mutex lock.
|   This is good news for the vn implementation because it removes the need for us to handle the locking. However, it makes it harder to implement a single readable vn_remove()/vn_rename() function, which is usually what we prefer.
|   Therefore, we implement a new version of vn_remove()/vn_rename() for Linux 3.6 and newer kernels. This allows us to leave the existing working implementation untouched, and to add a simpler version for newer kernels.
|   Long term I would very much like to see all of the vn code removed since what this code enables is generally frowned upon in the kernel. But that can't happen until we either abandon the zpool.cache file or implement alternate infrastructure to update it correctly in user space.
|   Signed-off-by: Yuxuan Shui <[email protected]>
|   Signed-off-by: Richard Yao <[email protected]>
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #154
* Switch KM_SLEEP to KM_PUSHPAGE    Massimo Maggi    2012-10-11    1    -3/+3
|   In this particular instance the allocation occurred in the context of sys_msync()->...->zpl_putpage() where we must be careful not to initiate additional I/O.
|   Signed-off-by: Massimo Maggi <[email protected]>
|   Signed-off-by: Brian Behlendorf <[email protected]>
* Add interface for file hole punching.    Etienne Dechamps    2012-10-04    3    -0/+81
|   This adds an interface to "punch holes" (deallocate space) in VFS files. The interface is identical to the Solaris VOP_SPACE interface. This interface is necessary for TRIM support on file vdevs.
|   This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which was introduced in 2.6.38. For a brief time before 2.6.38 this was done using the truncate_range inode operation, which was quickly deprecated. This patch only supports FALLOC_FL_PUNCH_HOLE.
|   This adds support for the truncate_range() inode operation to VOP_SPACE() for file hole punching. This API is deprecated and removed in 3.5, so it's only useful for old kernels.
|   On tmpfs, the truncate_range() inode operation translates to shmem_truncate_range(). Unfortunately, this function expects the end offset to be inclusive and aligned to the end of a page. If it is not, the kernel will stop with a BUG_ON(). This patch fixes the issue by adapting to the constraints set forth by shmem_truncate_range().
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #168
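On 2.6.38 and newer kernels the operation described above boils down to calling the file's fallocate method with FALLOC_FL_PUNCH_HOLE, which must be paired with FALLOC_FL_KEEP_SIZE. The sketch below is illustrative only; the actual VOP_SPACE()/vn_space() wrapper and the pre-2.6.38 truncate_range() fallback are more involved.

    #include <linux/falloc.h>
    #include <linux/fs.h>

    /* Deallocate 'len' bytes at offset 'off' in an open file. */
    static long
    my_punch_hole(struct file *fp, loff_t off, loff_t len)
    {
            int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;

            if (!fp->f_op || !fp->f_op->fallocate)
                    return (-EOPNOTSUPP);

            return (fp->f_op->fallocate(fp, mode, off, len));
    }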
* SPL 0.6.0-rc11    Brian Behlendorf    2012-09-18    1    -1/+1
|
* Switch KM_SLEEP to KM_PUSHPAGE    Brian Behlendorf    2012-09-12    1    -2/+2
|   Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim.
|   This change was originally part of cd5ca4b but was reverted by 330fe01. It always should have had its own commit for exactly this reason.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* Remove TQ_SLEEP -> KM_SLEEP mapping    Brian Behlendorf    2012-09-12    2    -16/+19
|   When the taskq code was originally written it seemed like a good idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this assumed that the TQ_* flags would never conflict with any of the Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit cd5ca4b this invariant was accidentally broken.
|   Therefore, to support TQ_PUSHPAGE, which is needed for Linux, and to prevent any further confusion I have removed this direct mapping. The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE flags are no longer defined in terms of their KM_* counterparts. Instead a simple mapping function is introduced to convert TQ_* -> KM_* where needed (see the sketch after this entry).
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Issue #171
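The replacement helper referenced above is conceptually a three-way translation; the sketch below is a guess at its shape, and the real function in the taskq code may differ in name and in how it treats flag combinations.

    #include <sys/kmem.h>
    #include <sys/taskq.h>

    /* Translate dispatcher TQ_* flags into allocator KM_* flags. */
    static int
    my_tqflags_to_kmflags(uint_t tqflags)
    {
            if (tqflags & TQ_NOSLEEP)
                    return (KM_NOSLEEP);

            if (tqflags & TQ_PUSHPAGE)
                    return (KM_PUSHPAGE);

            return (KM_SLEEP);
    }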
* Revert "Switch KM_SLEEP to KM_PUSHPAGE"Brian Behlendorf2012-09-123-6/+7
| | | | | | | | This reverts commit cd5ca4b2f86a606aa6ed68341a3672fdde1c9856 due to conflicts in the higher TQ_ bits which caused incorrect behavior. Signed-off-by: Brian Behlendorf <[email protected]>
* Remove autotools products    Chris Dunlop    2012-09-11    2    -314/+1
|   spl_config.h.in is a generated file: remove and .gitignore it.
|   Signed-off-by: Chris Dunlop <[email protected]>
|   Signed-off-by: Brian Behlendorf <[email protected]>
* Debug cv_destroy() with mutex held    Brian Behlendorf    2012-09-10    1    -4/+5
|   There still appears to be a race in the condition variables where ->cv_mutex is set after we are woken from the cv_destroy wait queue. This might be possible when cv_destroy() is called immediately after cv_broadcast(). We had some troubles with this previously but there may still be a small race, see commit d599e4f.
|   The following patch closes one small race and improves the ASSERTs such that they log the offending value.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   zfsonlinux/zfs#943
* Set KMC_NOEMERGENCY for zlib workspaces    Brian Behlendorf    2012-09-07    1    -2/+4
|   The workspace required by zlib to perform compression is roughly 512KB (order-7). These allocations are so large that we should never attempt to directly kmalloc an emergency object for them.
|   It is far preferable to asynchronously vmalloc an additional slab in case it's needed. Then simply block waiting for an existing object to be released or for the new slab to be allocated. This can be accomplished by disabling emergency slab objects by passing the KMC_NOEMERGENCY flag at slab creation time.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   zfsonlinux/zfs#917
* Add KMC_NOEMERGENCY slab flag    Brian Behlendorf    2012-09-07    2    -2/+9
|   Provide a flag to disable the use of emergency objects for a specific kmem cache. There may be instances where under no circumstances should you kmalloc() an emergency object. For example, when your cache contains very large objects (>128k).
|   Signed-off-by: Brian Behlendorf <[email protected]>
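For illustration, a consumer such as the zlib workspace cache from the entry above would pass the flag at creation time. The cache name below is invented, and the nine-argument Solaris-style kmem_cache_create() prototype is how the SPL wrapper is commonly documented, though the exact signature should be checked against sys/kmem.h.

    #include <sys/kmem.h>

    static kmem_cache_t *my_workspace_cache;

    static void
    my_cache_init(void)
    {
            /* Very large (512K) objects: vmalloc-backed slab, and never
             * fall back to kmalloc'd emergency objects. */
            my_workspace_cache = kmem_cache_create("my_workspace_cache",
                512 * 1024, 0, NULL, NULL, NULL, NULL, NULL,
                KMC_VMEM | KMC_NOEMERGENCY);
    }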
* Add DKIOCTRIM for TRIM support.    Etienne Dechamps    2012-09-02    1    -0/+1
|   See dechamps/zfs@cc6cd40ad71e1e611591929ad08184516357eaf5 for details. This harmless addition was merged to simplify testing the ZFS TRIM support patches.
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #167
* Suppress task_hash_table_init() large allocation warning    Brian Behlendorf    2012-08-30    1    -1/+2
|   When various kernel debugging options are enabled this allocation may be larger than usual, as shown by the following warning. It is in no way harmful so we suppress the warning.
|   SPL: large kmem_alloc(40960, 0x80d0) at tsd_hash_table_init:358 (76495/76495)
|   Signed-off-by: Brian Behlendorf <[email protected]>
|   Closes #93
* Enhance SPLAT kmem:slab_overcommit test    Brian Behlendorf    2012-08-30    1    -193/+208
|   After the emergency slab objects were merged I started observing timeout failures in the kmem:slab_overcommit test. These were due to the inefficient way the slab_overcommit reclaim function was implemented, and due to the additional cost of potentially allocating tens of thousands of emergency objects and tracking them on a single list.
|   This patch addresses the first concern by enhancing the test case to track all of the allocated objects in a linked list. This allows for a cleaner version of the reclaim function which simply releases SPLAT_KMEM_OBJ_RECLAIM objects. Since this touches some common code, all the tests which share these data structures were also updated.
|   After making these changes slab_overcommit is reliably passing. However, there is certainly additional cleanup which could be done here.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* Switch KM_SLEEP to KM_PUSHPAGE    Brian Behlendorf    2012-08-27    3    -7/+6
|   Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim.
|   Signed-off-by: Brian Behlendorf <[email protected]>
* Mutex ASSERT on self deadlock    Brian Behlendorf    2012-08-27    1    -2/+7
|   Generate an assertion if we're going to deadlock the system by attempting to acquire a mutex the process is already holding. There are currently no known instances of this under normal operation, but it _might_ be possible when using a ZVOL as a swap device. I want to ensure we catch this immediately if it were to occur.
|   Signed-off-by: Brian Behlendorf <[email protected]>
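The check amounts to asserting that the current task is not already the owner before taking the lock. The sketch below assumes the SPL's existing mutex_owner() and ASSERT3P() helpers; the real change lives in the mutex_enter() path itself rather than in a separate wrapper.

    #include <sys/mutex.h>
    #include <sys/debug.h>

    static inline void
    my_mutex_enter(kmutex_t *mp)
    {
            /* Catch self-deadlock: the caller must not already hold 'mp'. */
            ASSERT3P(mutex_owner(mp), !=, current);
            mutex_enter(mp);
    }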
* Add PF_NOFS debugging flag    Brian Behlendorf    2012-08-27    1    -0/+49
|   PF_NOFS is a per-process debug flag which is set in current->flags to detect when a process is performing an unsafe allocation. All tasks with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because if they enter direct reclaim and initiate I/O they may deadlock.
|   When debugging is disabled, any incorrect usage will be detected and a call stack with a warning will be printed to the console. The flags will then be automatically corrected to allow for safe execution. If debugging is enabled this will be treated as a fatal condition.
|   To avoid any risk of conflicting with the existing PF_ flags, the PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit. Only when CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused, will the PF_NOFS bit be valid. Happily, most existing distributions ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.
|   Signed-off-by: Brian Behlendorf <[email protected]>
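A rough sketch of the detection logic described above, with invented names; the real check in the SPL allocator paths may warn differently and choose a different corrected flag.

    #include <linux/sched.h>        /* current, PF_* flags */
    #include <linux/kernel.h>       /* dump_stack() */
    #include <sys/kmem.h>           /* KM_* flags, PF_NOFS */

    static int
    my_sanitize_kmflags(int kmflags)
    {
            /* A PF_NOFS task asking for a KM_SLEEP allocation could enter
             * direct reclaim, initiate I/O and deadlock. */
            if ((current->flags & PF_NOFS) &&
                !(kmflags & (KM_NOSLEEP | KM_PUSHPAGE))) {
                    dump_stack();               /* warn with a call stack */
                    kmflags |= KM_PUSHPAGE;     /* correct to a safe flag */
            }

            return (kmflags);
    }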