path: root/module/spl
Commit message (Author, Age, Files, Lines)
* Add callbacks for displaying KSTAT_TYPE_RAW kstats (Prakash Surya, 2013-10-16, 1 file, -14/+105)
  The current implementation for displaying kstats of type KSTAT_TYPE_RAW is rather crude. This patch attempts to enhance this handling by allowing a kstat user to register formatting callbacks which can optionally be used. The callbacks allow the user to implement functions for interpreting their data and transposing it into a character buffer. This buffer, containing a string representation of the raw data, is then displayed through the current /proc textual interface. Additionally, the kstats are made writable because it is now possible to provide a useful handler via the existing ks_update() interface.
  Signed-off-by: Prakash Surya <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #296
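  For illustration, a minimal sketch of what such a raw formatting callback might look like. The example_stats_t type and example_kstat_data() name are hypothetical; only the idea of rendering a raw record into a character buffer for the /proc interface is taken from the commit message.

    #include <linux/kernel.h>

    typedef struct example_stats {
            unsigned long long reads;
            unsigned long long writes;
    } example_stats_t;

    /* Format one raw record as text; the kstat code displays the buffer. */
    static int
    example_kstat_data(char *buf, size_t size, void *data)
    {
            example_stats_t *es = data;

            return (snprintf(buf, size, "reads %llu writes %llu\n",
                es->reads, es->writes));
    }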
* Consistently use local_irq_disable/local_irq_enable (Brian Behlendorf, 2013-10-09, 1 file, -3/+2)
  It was observed that spl_kmem_cache_alloc() uses local_irq_save() and saves the interrupt state in a local variable. This would normally be fine except that spl_kmem_cache_alloc() calls spl_cache_refill(), which re-enables interrupts. It is then possible that, while interrupts are enabled, the process is rescheduled to a different cpu before interrupts are disabled again. This could result in restoring the saved interrupt state from one cpu on another. The consequences of this aren't perfectly clear, but it is clearly a bug with the potential to cause issues. The code has been updated to just use local_irq_enable() and local_irq_disable() to avoid this.
  Signed-off-by: Brian Behlendorf <[email protected]>
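  A condensed sketch of the hazard and the fix, not the actual spl_kmem_cache_alloc() code; the function names are placeholders.

    #include <linux/irqflags.h>

    /*
     * Buggy pattern: interrupt flags captured on one CPU can be restored
     * on a different CPU if the task migrates while a callee has briefly
     * re-enabled interrupts.
     */
    static void
    buggy_pattern(void)
    {
            unsigned long flags;

            local_irq_save(flags);     /* flags reflect CPU A */
            /* callee re-enables interrupts, task may migrate to CPU B */
            local_irq_restore(flags);  /* CPU A's state applied on CPU B */
    }

    /* Fixed pattern: no saved per-CPU state to misapply. */
    static void
    fixed_pattern(void)
    {
            local_irq_disable();
            /* ... critical section ... */
            local_irq_enable();
    }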
* Replace current_kernel_time() with getnstimeofday() (Richard Yao, 2013-10-09, 1 file, -1/+3)
  current_kernel_time() is used by the SPLAT, but it is not meant for performance measurement. We modify the SPLAT to use getnstimeofday(), which is equivalent to the gethrestime() function on Solaris. Additionally, we update gethrestime() to invoke getnstimeofday().
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #279
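  A minimal sketch of a Solaris-style gethrestime() built on getnstimeofday(); the example_ name is a placeholder and the SPL's timestruc_t is assumed here to alias struct timespec.

    #include <linux/time.h>

    /* Return the current wall-clock time with nanosecond resolution. */
    void
    example_gethrestime(struct timespec *ts)
    {
            getnstimeofday(ts);
    }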
* Linux 3.8 compat: Use kuid_t/kgid_t when required (Richard Yao, 2013-08-09, 2 files, -14/+19)
  When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled, uid_t/gid_t are replaced by kuid_t/kgid_t, which are structures instead of integral types. This causes any code that uses an integral type to fail to build. The User Namespace functionality introduced in Linux 3.8 requires CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any kernel that supported it. We resolve this by converting between the new kuid_t/kgid_t structures and the original uid_t/gid_t types.
  Original-patch-by: DHE
  Rewrite-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #260
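  A sketch of the kind of conversion involved on kernels that provide linux/uidgid.h; KUIDT_INIT() and __kuid_val() are the stock kernel helpers, while the example_ wrappers are hypothetical and not necessarily how the SPL spells them.

    #include <linux/uidgid.h>

    /* Integral uid_t -> structural kuid_t */
    static inline kuid_t
    example_uid_to_kuid(uid_t uid)
    {
            return (KUIDT_INIT(uid));
    }

    /* Structural kuid_t -> integral uid_t */
    static inline uid_t
    example_kuid_to_uid(kuid_t kuid)
    {
            return (__kuid_val(kuid));
    }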
* PaX/GrSecurity Linux 3.8.y compat: Use __no_const on struct ctl_table (Richard Yao, 2013-08-08, 1 file, -5/+12)
  The PaX team started constifying `struct ctl_table` as of their Linux 3.8.0 patchset. This led to zfsonlinux/spl#225 and Gentoo bug #463012. While investigating our options, I learned that there is a preprocessor directive called CONSTIFY_PLUGIN that we can use to detect the presence of the PaX changes and adjust the code accordingly. The PaX team had suggested adopting ctl_table_no_const, but supporting older kernels required declaring that whenever CONSTIFY_PLUGIN was set. Future compiler changes could potentially cause that to break in the presence of -Werror, so instead we define our own spl_ctl_table typedef and use that. This should be compatible with all PaX kernels. This introduces a Linux kernel version number check to prevent a build failure on versions of the PaX GCC plugin that existed for kernels before Linux 3.8.0. Affected versions of the PaX plugin will trigger a compiler error when they see a no_const cast on a non-constified structure. Ordinarily, we would need an autotools check to catch that. However, it is safe to do a kernel version check instead of an autotools check in this specific instance because the affected versions of the PaX GCC plugin only exist for Linux kernels before 3.8.0 and the constification of `struct ctl_table` by the PaX developers only occurs in Linux 3.8.0 and later.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #225
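  A sketch of the typedef approach described above; the 3.8.0 cutoff is taken from the commit message and __no_const is provided by the PaX constify plugin.

    #include <linux/sysctl.h>
    #include <linux/version.h>

    /*
     * Apply the PaX no_const attribute only when the constify plugin is
     * active and the kernel is new enough that struct ctl_table is
     * actually constified.
     */
    #if defined(CONSTIFY_PLUGIN) && LINUX_VERSION_CODE >= KERNEL_VERSION(3, 8, 0)
    typedef struct ctl_table __no_const spl_ctl_table;
    #else
    typedef struct ctl_table spl_ctl_table;
    #endif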
* Fix race in spl_kmem_cache_reap_now() (Richard Yao, 2013-08-08, 1 file, -5/+5)
  The current code contains a race condition that triggers when bit 2 in spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is invoked, and another thread is concurrently accessing its magazine. spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on each magazine in the same thread when bit 2 in spl.spl_kmem_cache_expire is set. This is unsafe because there is one magazine per CPU and the magazines are lockless, so it is impossible to guarantee that another CPU is not using its magazine when this function is called. The solution is to only touch the local CPU's magazine and leave other CPUs' magazines to other CPUs.
  Reported-by: DHE
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #274
* Linux 3.11 compat: Replace num_physpages with totalram_pages (Richard Yao, 2013-08-08, 1 file, -2/+3)
  num_physpages was removed by torvalds/linux@cfa11e08ed39eb28a9eff9a907b20913020c69b5, so let's replace it with totalram_pages. This is a bug fix as much as it is a compatibility fix because num_physpages did not reflect the number of pages actually available to the kernel: http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html
  Also, there are known issues with memory calculations when ZFS is in a Xen dom0. There is a chance that using totalram_pages could resolve them. This conjecture is untested at the time of writing.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #273
* Fix KMC_OFFSLAB type caches (Brian Behlendorf, 2013-07-30, 1 file, -1/+2)
  Because spl_slab_size() was always returning -ENOSPC for caches of type KMC_OFFSLAB, the cache could never be created. Additionally, the slab size is rounded up to a page, which is what kv_alloc() expects. The kv_alloc() code will minimally allocate a page; in the KMC_OFFSLAB case this could be reduced. The basic regression tests kmem:slab_small, kmem:slab_large, and kmem:slab_align were updated to test KMC_OFFSLAB.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ying Zhu <[email protected]>
  Closes #266
* Return -1 for generic kmem cache shrinker (Brian Behlendorf, 2013-07-30, 1 file, -1/+9)
  It has been observed that it's possible to get in a state where shrink_slabs() will spin repeatedly invoking the generic kmem cache shrinker. It fails to detect that it's not making forward progress reclaiming from the cache and doesn't give up. To ensure this never occurs we unconditionally return -1 after reclaiming what we can.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes zfsonlinux/zfs#1276
  Closes zfsonlinux/zfs#1598
  Closes zfsonlinux/zfs#1432
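  A sketch of a shrinker callback following that convention, written against the shrink_control-era interface used by kernels of this vintage; the example_ names are placeholders, not the SPL's own.

    #include <linux/shrinker.h>

    /* Hypothetical helper that walks registered caches and frees idle objects. */
    static void
    example_reap_caches(unsigned long count)
    {
            /* reclaim logic omitted */
    }

    /*
     * Reclaim what we can, then report -1 so the kernel never spins
     * waiting for further progress from this shrinker.
     */
    static int
    example_kmem_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
    {
            if (sc->nr_to_scan)
                    example_reap_caches(sc->nr_to_scan);

            return (-1);
    }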
* Modify gethrestime to use current_kernel_time() (James H, 2013-07-15, 1 file, -4/+3)
  This allows us to get nanosecond resolution. It also means we use the same time source as utimensat(now) etc.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #255
* Fix bogus kmem leak warning (Brian Behlendorf, 2013-07-10, 1 file, -4/+5)
  Commit 5c7a036 correctly relocated the creation of a taskq and the registration of the kmem_cache_shrinker to after the initialization of the kmem tracking code. However, the cleanup of these structures was not done before the leak checks in spl_kmem_fini(). This resulted in an incorrect 'kmem leaked' warning even though there was no actual leak.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes zfsonlinux/zfs#1569
* Fix --enable-debug-kmem-tracking option (Brian Behlendorf, 2013-07-09, 1 file, -9/+9)
  This code has gotten somewhat stale and no longer builds cleanly against modern kernels. The two issues addressed here are as follows:
  * The hlist_*_rcu interfaces in the kernel have been relatively unstable. Since this isn't performance critical code, just use the long standing hlist_* variants.
  * In older kernels the hash_ptr() function takes a 'void *' but in newer kernels it expects a 'const void *'. To silence the compiler warnings about this, explicitly cast it to a 'void *'. The memset function is a similar case, but it always expects a 'void *'.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #256
* Linux 3.10 compat: Do not rely on struct proc_dir_entry definition (Richard Yao, 2013-07-08, 2 files, -85/+87)
  Linux kernel commit torvalds/linux#59d8053f moved the definition of struct proc_dir_entry from include/linux/proc_fs.h to the private header fs/proc/internal.h. The SPL relied on that definition to map Solaris' kstat to entries in /proc/spl/kstat. Since the proc_dir_entry structure is now private, the only safe thing to do is wrap the opaque proc handle with our own structure. This actually ends up simplifying the code, which is good because it moves us away from depending on implementation details of /proc.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #257
* Linux 3.10 compat: replace PDE()->data with PDE_DATA() (Yuxuan Shui, 2013-07-08, 1 file, -1/+4)
  Linux kernel commit torvalds/linux@d9dda78b replaced direct use of PDE()->data with the PDE_DATA() helper. To handle this, detect the preferred interface and define a PDE_DATA() wrapper for consistency.
  Signed-off-by: Yuxuan Shui <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #257
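  A sketch of the compatibility wrapper being described. The HAVE_PDE_DATA guard stands in for an autoconf-detected result and may not be the exact name the SPL uses.

    #include <linux/proc_fs.h>

    /*
     * Linux 3.10 provides PDE_DATA() and hides struct proc_dir_entry.
     * On older kernels define PDE_DATA() ourselves so callers can use
     * a single spelling everywhere.
     */
    #ifndef HAVE_PDE_DATA          /* assumed autoconf-style feature test */
    #define PDE_DATA(inode) (PDE(inode)->data)
    #endif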
* Fix --enable-debug-kmem-tracking option (Tim Chase, 2013-06-18, 1 file, -7/+8)
  Re-order initialization in spl_kmem_init to allow kmem tracing to work. The spl_kmem_init function calls taskq_create prior to initializing the tracking (calling spl_kmem_init_tracking). Since taskq_create uses kmem_alloc, NULL dereferences occur because the global kmem_list hasn't had its next & prev pointers initialized yet. This commit moves the calls to spl_kmem_init_tracking earlier in the spl_kmem_init function so that the subsequent kmem_alloc calls (by taskq_create) work properly.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #243
* Fix taskq_wait_id() (Brian Behlendorf, 2013-05-03, 1 file, -26/+14)
  The existing taskq_wait_id() function can incorrectly block indefinitely. Reimplement it more simply using wait_event() in a similar fashion to taskq_wait_all(). This flaw was uncovered in the context of moving vn_rdwr() to a taskq. Previously taskq_wait_id() had no consumers outside the SPLAT task framework, which is why the issue went unnoticed.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Do not call cond_resched() in spl_slab_reclaim() (Richard Yao, 2013-03-21, 1 file, -3/+0)
  Calling cond_resched() after each object is freed and then after each slab is freed can cause slabs of objects to live for excessive periods of time following reclamation. This interferes with the kernel's own memory management when called from kswapd and can cause direct reclaim to occur in response to memory pressure that should have been resolved.
  Signed-off-by: Richard Yao <[email protected]>
* Linux 3.9 compat: Switch to hlist_for_each{,_rcu} (Richard Yao, 2013-03-14, 2 files, -2/+4)
  torvalds/linux@b67bfe0d42cac56c512dd5da4b1b347a23f4b70a changed hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle this by switching to hlist_for_each{,_rcu}, which works across all supported kernels.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
* Drop support for 3 argument version of set_fs_pwd (Richard Yao, 2013-03-14, 1 file, -41/+2)
  This was a suggestion that Brian Behlendorf made when reviewing an early pull request for Linux 3.9 support. This commit was made intentionally easy to revert should we ever have a reason to reintroduce support for older kernels.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.9 compat: set_fs_root takes const struct path * (Richard Yao, 2013-03-14, 1 file, -0/+4)
  torvalds/linux@dcf787f39162ce32ca325b3e784aba2d2444619a enforces const-correctness in passing struct path *.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.9 compat: vfs_getattr takes two arguments (Richard Yao, 2013-03-14, 1 file, -2/+15)
  The function prototype of vfs_getattr previously took struct vfsmount * and struct dentry * as arguments. These would always be defined together in a struct path *. torvalds/linux@3dadecce20603aa380023c65e6f55f108fd5e952 modified vfs_getattr to take a struct path * argument instead.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
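  A sketch of a compatibility wrapper for the two prototypes. HAVE_2ARGS_VFS_GETATTR stands in for an autoconf-style feature test and the example_ name is a placeholder.

    #include <linux/fs.h>
    #include <linux/path.h>
    #include <linux/stat.h>

    /* Call vfs_getattr() with whichever prototype the running kernel has. */
    static int
    example_getattr(struct path *path, struct kstat *stat)
    {
    #ifdef HAVE_2ARGS_VFS_GETATTR
            return (vfs_getattr(path, stat));                     /* >= 3.9 */
    #else
            return (vfs_getattr(path->mnt, path->dentry, stat));  /* <  3.9 */
    #endif
    }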
* Linux 3.9 compat: Do not depend on f_vfsmnt (Richard Yao, 2013-03-14, 1 file, -3/+3)
  torvalds/linux@182be684784334598eee1d90274e7f7aa0063616 removed the preprocessor definition for f_vfsmnt. The ability to access the mountpoint via ->f_path.mnt has been stable for a long time, so we switch to that.
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
* Refresh links to web site (Ned Bass, 2013-03-04, 19 files, -19/+19)
  Update links to refer to the official ZFS on Linux website instead of @behlendorf's personal fork on github.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Disable automatic log dumping (Brian Behlendorf, 2013-02-05, 1 file, -2/+3)
  Long ago, infrastructure was added to the SPL to keep an internal debug log of the last few seconds of activity. This was helpful during early development, but these days it is no longer needed. I haven't had to resort to this debug buffer to resolve an issue for several years now. Today better, more generic tools like systemtap and ftrace have evolved to the point where they can be used for this purpose. Along with the stack trace dumped to the system console, and in rare cases a crash dump, we almost always have the debug we need. Therefore, I'm disabling the code which automatically dumps this log to disk during an assertion, except for the case where spl_debug_panic_on_bug is set (disabled by default). This should be viewed as a first step towards either: a) retiring this infrastructure and complexity entirely, or b) integrating this logging more properly with ftrace. As part of this change I'm also removing from the packages the undocumented spl utility which is used to decode the binary logs.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Fix tsd_get/set() race with tsd_exit/destroy() (Brian Behlendorf, 2013-01-31, 1 file, -1/+32)
  The tsd_exit() and tsd_destroy() functions remove entries from hash bins without taking the hash bin lock. They do take the table lock, but tsd_get() and tsd_set() only take the hash bin lock to allow for maximum concurrency. The result is that while tsd_get() and tsd_set() are traversing the hash bin list, it can be modified by another thread which happens to hash to the same value. To avoid this, add the needed locking to tsd_exit() and tsd_destroy().
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #174
* Add spl_kmem_cache_expire module option (Brian Behlendorf, 2013-01-28, 1 file, -4/+32)
  Cache aging was implemented because it was part of the default Solaris kmem_cache behavior. The idea is that per-cpu objects which haven't been accessed in several seconds should be returned to the cache. On the other hand, Linux slabs never move objects back to the slabs unless there is memory pressure on the system. This behavior is now configurable through the 'spl_kmem_cache_expire' module option. The value is a bit mask with the following meaning:
    0x1 - Solaris style cache aging eviction is enabled.
    0x2 - Linux style low memory eviction is enabled.
  Both methods may be safely enabled simultaneously, but by default both are disabled. It has never been clear if the kmem cache aging (which has been around from day one) actually does any good. It has, however, been the source of numerous bugs, so I wouldn't mind retiring it entirely.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes zfsonlinux/zfs#1227
  Closes #210
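  A sketch of how such a bit-mask module option is typically wired up. The default of zero (both methods disabled) follows the text above, but the KMC_EXPIRE_* names and the accessor are illustrative assumptions.

    #include <linux/module.h>

    #define KMC_EXPIRE_AGE  0x1     /* Solaris style cache aging eviction */
    #define KMC_EXPIRE_MEM  0x2     /* Linux style low memory eviction */

    static unsigned int spl_kmem_cache_expire = 0;  /* both disabled by default */
    module_param(spl_kmem_cache_expire, uint, 0644);
    MODULE_PARM_DESC(spl_kmem_cache_expire,
        "By age (0x1) or low memory (0x2), default: off");

    /* Consumers test the relevant bit before acting, for example: */
    static inline int
    example_expire_by_age(void)
    {
            return (spl_kmem_cache_expire & KMC_EXPIRE_AGE);
    }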
* Remove spl_invalidate_inodes() (Brian Behlendorf, 2013-01-17, 1 file, -14/+0)
  This functionality is no longer required by ZFS, see commit zfsonlinux/zfs@7b3e34ba5a7ee8d0fda44d214f6f11eb16cdb26f. Since there are no other consumers, and because it adds additional autoconf complexity which must be maintained, the spl_invalidate_inodes() function has been removed.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue zfsonlinux/zfs#795
* kmem-cache: Fix slab ageing soft lockup (Brian Behlendorf, 2013-01-14, 1 file, -43/+51)
  Commit a10287e00d13c4c4dbbff14f42b00b03da363fcb slightly reworked the slab ageing code such that it is no longer dependent on the Linux delayed work queue interfaces. This was good for portability and performance, but it requires us to use the on_each_cpu() function to execute the spl_magazine_age() function. That means the function is now executing in interrupt context, whereas before it was scheduled in normal process context. And that means we need to be slightly more careful about the locking in the interrupt handler. With the reworked code it's possible that we'll be holding the skc->skc_lock and be interrupted to handle the spl_magazine_age() IRQ. This will result in a deadlock and soft lockup errors unless we're careful to detect the contention and avoid taking the lock in the interrupt handler. So that's what this patch does. Alternately (and slightly more conventionally) we could have used spin_lock_irqsave() to prevent this race entirely, but I'd prefer to avoid disabling interrupts as much as possible due to performance concerns. There is absolutely no penalty for us not aging objects out of the magazine due to contention.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Closes zfsonlinux/zfs#1193
* call_usermodehelper() should wait for process (Ned Bass, 2013-01-09, 1 file, -2/+2)
  As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for process to complete) to UMH_WAIT_EXEC (wait for the exec, but not the process). A number of call sites used the number 1 instead of the constant name, so the behavior was not as expected on kernels with this change.
  Signed-off-by: Brian Behlendorf <[email protected]>
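  A sketch showing the constant used by name rather than as a literal; the command and environment strings are placeholders.

    #include <linux/kmod.h>

    static int
    example_run_helper(void)
    {
            char *argv[] = { "/bin/true", NULL };   /* placeholder command */
            char *envp[] = { "HOME=/",
                "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

            /* UMH_WAIT_PROC: wait for the helper process to exit, not just exec. */
            return (call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC));
    }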
* RHEL 6.4 compat, fallocate() (Brian Behlendorf, 2013-01-08, 1 file, -7/+9)
  In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was introduced after the fallocate() function was moved from the inode_operations to the file_operations structure. Therefore, the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined it was safe to use f_ops->fallocate(). Unfortunately, the RHEL 6.4 kernel has only backported the FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change. To address this compatibility issue, the spl_filp_fallocate() helper function was added to properly detect which interface is available.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Add cv_wait_io() to account I/O time (Matt Johnston, 2013-01-07, 1 file, -4/+14)
  Under Linux, when a task is waiting on I/O it should call the io_schedule() function for proper accounting. The Solaris cv_wait() function provides no way to specify what the cv is waiting on, therefore cv_wait_io() is introduced.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #206
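  A condensed illustration of the distinction, not the SPL's condvar code: the only essential difference is which scheduler entry point is called, which controls whether the sleep is accounted as iowait.

    #include <linux/sched.h>
    #include <linux/wait.h>

    /* Sleep until *done is set; account the sleep as I/O wait when io != 0. */
    static void
    example_wait(wait_queue_head_t *wq, int *done, int io)
    {
            DEFINE_WAIT(wait);

            for (;;) {
                    prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
                    if (*done)
                            break;
                    if (io)
                            io_schedule();   /* charged to iowait */
                    else
                            schedule();
            }
            finish_wait(wq, &wait);
    }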
* Fix spl_kmem_init_kallsyms_lookup() panic (Brian Behlendorf, 2012-12-19, 2 files, -0/+21)
  Due to I/O buffering the helper may return successfully before the proc handler has a chance to execute. To catch this case, wait up to 1 second to verify spl_kallsyms_lookup_name_fn was updated to a non-SYMBOL_POISON value.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes zfsonlinux/zfs#699
  Closes zfsonlinux/zfs#859
* kmem-cache: Use a taskq for async allocations (Brian Behlendorf, 2012-12-12, 1 file, -4/+4)
  Shift the asynchronous allocations over to use the taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. This code never actually used the delay functionality; it was just done this way to leverage the existing compatibility code. All that is required is a thread context to perform the allocation in. The only thing clever in this change is that we take advantage of the preallocated task queue entries to avoid a memory allocation.
  Signed-off-by: Brian Behlendorf <[email protected]>
* kmem-cache: Use taskqs for ageing (Brian Behlendorf, 2012-12-12, 1 file, -41/+50)
  Shift the cache and magazine ageing functionality over to the new delayed taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. However, the delayed taskq interface does not allow us to schedule a task for a specific cpu, so the ageing code was slightly reworked. The magazine ageing delay has been directly linked to the cache ageing function. The spl_cache_age() function invokes on_each_cpu() in order to run spl_magazine_age() on each cpu. It then blocks waiting for them to complete and promptly reclaims any free slabs. While restructuring the code wasn't the primary goal, I think the new code is far more understandable and maintainable. It also should help minimize magazine thrashing because free slabs are immediately released after the magazine is aged.
  Signed-off-by: Brian Behlendorf <[email protected]>
* kmem-cache: spl_kmem_cache_create() may always sleep (Brian Behlendorf, 2012-12-12, 1 file, -11/+8)
  When this code was originally written I went overboard and allowed for the possibility of creating a cache in an atomic context. In practice there are no callers which ever do this. This makes sense since a cache is by design a long lived data structure. To prevent abuse of this function going forward I'm removing the code which is supposed to handle an atomic context. All allocators have been updated to use KM_SLEEP and the might_sleep() debug macro has been added to immediately detect atomic callers.
  Signed-off-by: Brian Behlendorf <[email protected]>
* taskq delay/cancel functionality (Brian Behlendorf, 2012-12-12, 1 file, -96/+356)
  Add the ability to dispatch a delayed task to a taskq. The desired behavior is for the task to be queued but not executed by a worker thread until the expiration time is reached. To achieve this two new functions were added.
  * taskq_dispatch_delay() - This function behaves exactly like taskq_dispatch(), however it takes a third 'expire_time' argument. The caller should pass the desired time the task should be executed as an absolute value in jiffies. The task is guaranteed not to run before this time; it may run slightly later if all the worker threads are busy.
  * taskq_cancel_id() - Given a task id, attempt to cancel the task before it gets executed. This is primarily useful for canceling delay tasks but can be used for canceling any previously dispatched task. There are three possible return values: 0 - The task was found and canceled before it was executed. ENOENT - The task was not found; either it was already run or an invalid task id was supplied by the caller. EBUSY - The task is currently executing and may not be canceled; this function will block until the task has been completed.
  * taskq_wait_all() - The taskq_wait_id() function was renamed taskq_wait_all() to more clearly reflect its actual behavior. It is only currently used by the splat taskq regression tests.
  * taskq_wait_id() - Historically, the only difference between this function and taskq_wait() was that you passed the task id. In both functions you would block until ALL lower task ids had executed. This was semantically correct but could be very slow, particularly if there were delay tasks submitted. To better accommodate the delay tasks this function was reimplemented. It will now only block until the passed task id has been completed.
  This is actually a fairly low risk change for a few reasons:
  * Only new ZFS callers will make use of the new interfaces and very little common code was changed to support the new functions.
  * The existing taskq_wait() implementation was not changed, just slightly refactored.
  * The newly optimized taskq_wait_id() implementation was never used by ZFS so we can't accidentally introduce a new bug there.
  NOTE: This functionality does not exist in the Illumos taskqs.
  Signed-off-by: Brian Behlendorf <[email protected]>
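  A sketch of how the new interfaces might be used from SPL/ZFS-style code; the task function, taskq handle, and delay value are placeholders, and the SPL header names are assumed.

    #include <sys/taskq.h>
    #include <sys/timer.h>   /* assumed home of ddi_get_lbolt() */

    static void
    example_task(void *arg)
    {
            /* deferred work goes here */
    }

    static void
    example_dispatch_and_cancel(taskq_t *tq, void *arg)
    {
            taskqid_t id;

            /* Run example_task() no sooner than five seconds from now;
             * the expire time is an absolute value in jiffies. */
            id = taskq_dispatch_delay(tq, example_task, arg, TQ_SLEEP,
                ddi_get_lbolt() + 5 * HZ);

            /* 0: canceled before running, ENOENT: already ran or bad id,
             * EBUSY: currently running (blocks until it completes). */
            (void) taskq_cancel_id(tq, id);
    }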
* taskq style, remove #define wrappers (Brian Behlendorf, 2012-12-12, 1 file, -21/+21)
  When the taskq implementation was originally written I wrapped all the API functions in #define's. This was done as a preventative measure to ensure that a taskq symbol never conflicted with an existing kernel symbol. However, in practice the taskq symbols never conflicted. The only major conflicts occurred with the kmem cache API. Since this added layer of obfuscation never bought us anything for the taskqs, I'm removing it.
  Signed-off-by: Brian Behlendorf <[email protected]>
* taskq style, convert spaces to soft tabs (Brian Behlendorf, 2012-12-12, 1 file, -158/+157)
  Update the taskq implementation to conform with the style used throughout the rest of the code. There are no functional changes in this commit.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Handle errors from spl_kern_path_locked() (Brian Behlendorf, 2012-12-03, 1 file, -0/+2)
  When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added by commit bcb1589, an entirely new vn_remove() implementation was added. That function did not properly handle an error from spl_kern_path_locked(), which would result in a panic. This patch addresses the issue by returning the error to the caller.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #187
* Disable FS reclaim when allocating new slabs (Brian Behlendorf, 2012-11-27, 1 file, -1/+1)
  Allowing the spl_cache_grow_work() function to reclaim inodes allows for two unlikely deadlocks. Therefore, we clear __GFP_FS for these allocations. The two deadlocks are:
  * While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function calls kmem_cache_alloc() which happens to need to allocate a new slab. To allocate the new slab we enter FS level reclaim and attempt to evict several inodes. To evict these inodes we need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it just happens that obj1 and obj2 use the same hashed lock.
  * Similar to the first case, however instead of getting blocked on the hash lock we block in txg_wait_open(), which is waiting for the next txg which isn't coming because the txg_sync thread is blocked in kmem_cache_alloc().
  Note this isn't a 100% fix because vmalloc() won't strictly honor __GFP_FS. However, in practice this is sufficient because several very unlikely things must all occur concurrently.
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue zfsonlinux/zfs#1101
* Never spin in kmem_cache_alloc() (Brian Behlendorf, 2012-11-06, 1 file, -5/+17)
  If we are reaping from the cache and a concurrent allocation occurs then the caller must block until the reaping is complete. This is signaled by the clearing of the KMC_BIT_REAPING bit. Otherwise the caller will be in a tight loop which takes and releases the skc->skc_cache lock. When there are multiple concurrent callers the system will thrash on the lock and appear to lock up.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Optimize spl_kmem_cache_free() (Brian Behlendorf, 2012-11-06, 1 file, -4/+5)
  Because only virtual slabs may have emergency objects, and these objects are guaranteed to have physical addresses, it can be easily determined if the passed object is a virtual slab object or an emergency object. This allows us to completely optimize the emergency object free case out of the common free path.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Track emergency object in rbtree (Brian Behlendorf, 2012-11-06, 1 file, -26/+72)
  In the initial implementation emergency objects were tracked on a per-cache list. The assumption was that under normal operation we would never allocate more than a handful of these objects, so the cost of walking the list during free was expected to be negligible. However, real world usage has shown that emergency objects tend to be allocated in batches. A deadlock will be detected and several thousand emergency objects will be allocated before the original blocked slab allocation can complete. Therefore, the original list has been replaced by a red black tree which is sorted by the memory address of each allocated object. This bounds the worst case insertion and removal time to O(log n), which minimizes contention on the associated spin lock.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Improved vmem cached deadlock detection (Brian Behlendorf, 2012-11-06, 2 files, -13/+31)
  The entire goal of performing the slab allocations asynchronously is to be able to detect when a vmalloc() deadlocks. In this case, and only this case, do we want to start allocating emergency objects. The trick here is to minimize false positives because the overhead of tracking emergency objects is far higher than for normal slab objects. With that goal in mind, the code was reworked to be less sensitive to slow allocations by increasing the wait time. Once a cache is marked deadlocked, all subsequent allocations which can not be satisfied with existing cache objects will immediately allocate new emergency objects. This behavior persists until the asynchronous allocation completes and clears the deadlocked flag. The result of these tweaks is that far fewer emergency objects get created, which is important because this minimizes the cost of releasing them later in kmem_cache_free().
  Signed-off-by: Brian Behlendorf <[email protected]>
* Condition variable reference counts (Brian Behlendorf, 2012-11-06, 1 file, -8/+24)
  Reference count every entry and exit from the condition variable functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast(). This allows us to safely block in cv_destroy() until all consumers have been scheduled and are no longer accessing the condition variable memory. In addition, poison the magic value at the start of cv_destroy() to ensure there are never any new callers after cv_destroy() is called. The consumer is responsible for ensuring this never occurs.
  Signed-off-by: Brian Behlendorf <[email protected]>
* Add KSTAT_TYPE_TXG type (Brian Behlendorf, 2012-11-02, 1 file, -0/+39)
  Add a new kstat type for tracking useful statistics about a TXG. The new KSTAT_TYPE_TXG type can be used to track the following statistics per-txg:
    txg          - Unique txg number
    state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
    birth        - Creation time
    nread        - Bytes read
    nwritten     - Bytes written
    reads        - IOPs read
    writes       - IOPs write
    open_time    - Length in nanoseconds the txg was open
    quiesce_time - Length in nanoseconds the txg was quiescing
    sync_time    - Length in nanoseconds the txg was syncing
  Signed-off-by: Brian Behlendorf <[email protected]>
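  The field list above suggests a per-txg record along these lines; this is a reconstruction from the commit message using SPL-provided Solaris types, not the verbatim structure, and the example_ name is a placeholder.

    #include <sys/types.h>

    typedef struct example_kstat_txg {
            u_longlong_t    txg;            /* unique txg number */
            char            state;          /* O / Q / S / C */
            hrtime_t        birth;          /* creation time */
            u_longlong_t    nread;          /* bytes read */
            u_longlong_t    nwritten;       /* bytes written */
            uint_t          reads;          /* read IOPs */
            uint_t          writes;         /* write IOPs */
            hrtime_t        open_time;      /* ns the txg was open */
            hrtime_t        quiesce_time;   /* ns the txg was quiescing */
            hrtime_t        sync_time;      /* ns the txg was syncing */
    } example_kstat_txg_t;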
* Make kstat.ks_update() callback atomic (Brian Behlendorf, 2012-10-23, 1 file, -5/+7)
  Move the kstat ks_update() callback under the ks_lock. This enables dynamically sized kstats without modification to the kstat API.
  * Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
  * Register a ->ks_update() callback which does:
    o Frees any existing ks_data buffer.
    o Sets ks_data_size to the kstat array size.
    o Sets ks_data to an allocated buffer of size ks_data_size.
    o Populates the array of buffers with the required data.
  The buffer allocated in the ks_update() callback is guaranteed to remain allocated and valid while the proc sequence handler iterates over the buffer. The lock will not be dropped until the kstat_seq_stop() function is run, making it safe for concurrent access. To allow the ks_update() callback to perform memory allocations the lock was changed to a mutex.
  Signed-off-by: Brian Behlendorf <[email protected]>
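  An illustrative ks_update() callback for a dynamically sized virtual kstat, following the recipe above; example_record_t, example_count(), and example_fill() are placeholders for whatever produces the data.

    #include <sys/kstat.h>
    #include <sys/kmem.h>

    typedef struct example_record { uint64_t value; } example_record_t;
    static int example_count(void);
    static void example_fill(example_record_t *recs, int n);

    static int
    example_kstat_update(kstat_t *ksp, int rw)
    {
            int n;

            if (rw == KSTAT_WRITE)
                    return (EACCES);

            n = example_count();

            /* Free any existing buffer, then allocate one sized for today. */
            if (ksp->ks_data != NULL)
                    kmem_free(ksp->ks_data, ksp->ks_data_size);

            ksp->ks_data_size = n * sizeof (example_record_t);
            ksp->ks_data = kmem_alloc(ksp->ks_data_size, KM_SLEEP);

            example_fill(ksp->ks_data, n);

            return (0);
    }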
* Linux 3.6 compat, kern_path_locked() added (Yuxuan Shui, 2012-10-14, 1 file, -0/+136)
  The kern_path_parent() function was removed from Linux 3.6 because it was observed that all the callers just want the parent dentry. The simpler kern_path_locked() function replaces kern_path_parent() and does the lookup while holding the ->i_mutex lock. This is good news for the vn implementation because it removes the need for us to handle the locking. However, it makes it harder to implement a single readable vn_remove()/vn_rename() function, which is usually what we prefer. Therefore, we implement a new version of vn_remove()/vn_rename() for Linux 3.6 and newer kernels. This allows us to leave the existing working implementation untouched, and to add a simpler version for newer kernels. Long term I would very much like to see all of the vn code removed, since what this code enables is generally frowned upon in the kernel. But that can't happen until we either abandon the zpool.cache file or implement alternate infrastructure to update it correctly in user space.
  Signed-off-by: Yuxuan Shui <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #154
* Switch KM_SLEEP to KM_PUSHPAGE (Massimo Maggi, 2012-10-11, 1 file, -3/+3)
  In this particular instance the allocation occurred in the context of sys_msync()->...->zpl_putpage(), where we must be careful not to initiate additional I/O.
  Signed-off-by: Massimo Maggi <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
* Add interface for file hole punching (Etienne Dechamps, 2012-10-04, 1 file, -0/+53)
  This adds an interface to "punch holes" (deallocate space) in VFS files. The interface is identical to the Solaris VOP_SPACE interface. This interface is necessary for TRIM support on file vdevs. This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which was introduced in 2.6.38. For a brief time before 2.6.38 this was done using the truncate_range inode operation, which was quickly deprecated. This patch only supports FALLOC_FL_PUNCH_HOLE. This also adds support for the truncate_range() inode operation to VOP_SPACE() for file hole punching. This API is deprecated and removed in 3.5, so it's only useful for old kernels. On tmpfs, the truncate_range() inode operation translates to shmem_truncate_range(). Unfortunately, this function expects the end offset to be inclusive and aligned to the end of a page. If it is not, the kernel will stop with a BUG_ON(). This patch fixes the issue by adapting to the constraints set forth by shmem_truncate_range().
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #168
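  A sketch of hole punching from kernel context via the file_operations fallocate handler on kernels that provide FALLOC_FL_PUNCH_HOLE; the example_ name is a placeholder and error handling is trimmed.

    #include <linux/falloc.h>
    #include <linux/fs.h>

    /* Deallocate [offset, offset + len) in an open file without changing its size. */
    static long
    example_punch_hole(struct file *fp, loff_t offset, loff_t len)
    {
            int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;

            if (fp->f_op->fallocate == NULL)
                    return (-EOPNOTSUPP);

            return (fp->f_op->fallocate(fp, mode, offset, len));
    }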