path: root/module/spl
Commit message | Author | Age | Files | Lines
...
* Linux 4.13 compat: wait queues | Brian Behlendorf | 2017-07-23 | 1 | -2/+12
  Commit torvalds/linux@ac6424b9
  - Renamed struct wait_queue -> struct wait_queue_entry.

  Commit torvalds/linux@2055da97
  - Renamed wait_queue_head::task_list -> wait_queue_head::head
  - Renamed wait_queue_entry::task_list -> wait_queue_entry::entry

  Reviewed-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #629
* Don't cache the system hostid | Brian Behlendorf | 2017-07-13 | 2 | -46/+30
  Historically the SPL cached the system hostid the first time it was accessed. This was done to speed up subsequent accesses. But in practice the system hostid is rarely accessed, and it's inconvenient that it doesn't promptly detect /etc/hostid configuration changes.

  Therefore, zone_get_hostid() has been updated to always refresh the system hostid reported.

  Reviewed-by: Olaf Faaland <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #626
* Fix cv_timedwait timeout | Brian Behlendorf | 2017-05-25 | 1 | -18/+12
  Perform the check for an already-past expiration time before updating cvp->cv_mutex with the provided mutex. This check only depends on local state. Doing it first ensures that cvp->cv_mutex will not be updated in the timeout case or if it's ever called with an expire_time <= now.

  Reviewed-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #616
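The ordering described in this commit can be sketched in user-space C. This is an illustration only, not the SPL code: the `sketch_` function, the simplified `kmutex_t` and `kcondvar_t` stand-ins, and passing `now` as a parameter are assumptions made for the example, and the return values follow the usual cv_timedwait() convention of -1 on timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the SPL types. */
typedef struct kmutex { int locked; } kmutex_t;
typedef struct kcondvar { kmutex_t *cv_mutex; } kcondvar_t;

/*
 * Returns -1 on timeout and 1 otherwise. The blocking wait itself is
 * omitted from this sketch.
 */
static int
sketch_cv_timedwait(kcondvar_t *cvp, kmutex_t *mp, long expire_time, long now)
{
	assert(cvp != NULL && mp != NULL);

	/* Depends only on local state, so check it before touching cvp. */
	if (expire_time <= now)
		return (-1);

	/* Only a caller that will actually wait records its mutex. */
	if (cvp->cv_mutex == NULL)
		cvp->cv_mutex = mp;

	/* ... wait until signalled or until expire_time ... */
	return (1);
}
```

With this ordering a call whose deadline has already passed returns without leaving any trace in the condition variable.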
* Linux 4.12 compat: PF_FSTRANS was removed | Chunwei Chen | 2017-05-09 | 1 | -6/+6
  Change SPL_FSTRANS to optionally contain PF_FSTRANS. Also, add __spl_pf_fstrans_check for the checks specific to PF_FSTRANS.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #614
* Linux 4.11 compat: remove stub for __put_task_struct | Olaf Faaland | 2017-03-20 | 1 | -16/+0
  Before kernel 2.6.29 credentials were embedded in task_structs, and zfs had cases where one thread would need to refer to the credential of another thread, forcing it to take a hold on the foreign thread's task_struct to ensure it was not freed.

  Since 2.6.29, the credential has been moved out of the task_struct into a cred_t.

  In addition, the mainline kernel originally did not export __put_task_struct() but the RHEL5 kernel did, according to zfsonlinux/spl@e811949a570. As of 2.6.39 the mainline kernel exports it.

  There is no longer zfs code that takes or releases holds on a task_struct, and so there is no longer any reference to __put_task_struct().

  This affects the Linux 4.11 kernel because the prototype for __put_task_struct() is in a new include file (linux/sched/task.h) and so the config check failed to detect the exported symbol.

  Removing the unnecessary stub and corresponding config check. This works on kernels since the oldest one currently supported, 2.6.32 as shipped with CentOS/RHEL.

  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Olaf Faaland <[email protected]>
  Closes #608
* Linux 4.11 compat: vfs_getattr() takes 4 args | Olaf Faaland | 2017-03-20 | 1 | -3/+10
  There are changes to vfs_getattr() in torvalds/linux@a528d35. The new interface is:

  int vfs_getattr(const struct path *path, struct kstat *stat,
      u32 request_mask, unsigned int query_flags)

  The request_mask argument indicates which field(s) the caller intends to use. Fields the caller does not specify via request_mask may be set in the returned struct anyway, but their values may be approximate.

  The query_flags argument indicates whether the filesystem must update the attributes from the backing store.

  This patch uses the query_flags which result in vfs_getattr behaving the same as it did with the 2-argument version which the kernel provided before Linux 4.11.

  Members blksize and blocks are now always the same size regardless of arch. They match the size of the equivalent members in vnode_t.

  The configure checks are modified to ensure that the appropriate vfs_getattr() interface is used.

  A more complete fix, removing the ZFS dependency on vfs_getattr() entirely, is deferred as it is a much larger project.

  Reviewed-by: Chunwei Chen <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Olaf Faaland <[email protected]>
  Closes #608
* Linux 4.11 compat: set_task_state() removed | Olaf Faaland | 2017-02-23 | 1 | -2/+2
  Replace uses of set_task_state(current, STATE) with set_current_state(STATE).

  In Linux 4.11, torvalds/linux@642fa44, set_task_state() is removed.

  All spl uses are of the form set_task_state(current, STATE). set_current_state(STATE) is equivalent and has been available since Linux 2.2.26.

  Furthermore, set_current_state(STATE) is already used in about 15 locations within spl. This change should have no impact other than removing an unnecessary dependency.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Olaf Faaland <[email protected]>
  Closes #603
* Use kernel slab for vn_cache and vn_file_cache | Chunwei Chen | 2017-01-31 | 1 | -2/+2
  Resolve a false positive in the kmemleak checker by shifting to the kernel slab. It shows up because vn_file_cache is using KMC_KMEM which is directly allocated using __get_free_pages, which is not automatically tracked by kmemleak.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #599
* Reimplement rt_mutex_owner to fix build with DEBUG & PREEMPT_RT_FULL | clefru | 2017-01-19 | 1 | -1/+5
  rt_mutex_owner is internal to kernel/locking/rtmutex_common.h and inaccessible for SPL via the public kernel headers.

  The way of accessing the owner has been stable since at least 3.13 ([1], [2]), which is masking the lowest bit in the owner pointer in rt_mutex. We do the same.

  [1] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=3.13#L99
  [2] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=4.9#L78

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Clemens Fruhwirth <[email protected]>
  Closes #593
* Remove identical if statements in module/spl/spl-vnode.c | George Melikov | 2017-01-19 | 1 | -3/+0
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Melikov <[email protected]>
  Closes #594
* Add support for recent kmem_cache_create_usercopy | Kevin Tanguy | 2017-01-17 | 1 | -2/+11
  The SLAB_USERCOPY flag was used to tell PAX not to kill copies from kernel to userland. With a recent grsecurity patchset and CONFIG_GRKERNSEC_HIDESYM, which enables CONFIG_PAX_USERCOPY, zfs would panic.

  Handle the newer API while keeping the old one functional.

  Tested-by: RageLtMan <rageltman@sempervictus>
  Reviewed-by: spendergrsec <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Kevin Tanguy <[email protected]>
  Closes #595
* Update struct member initializers to C89 | RageLtMan | 2017-01-13 | 1 | -5/+5
  When building SPL within the kernel tree, C99 initializers cause build failures and need to be converted to C89 as kernel CFLAGS specify -std=gnu89.

  This fix was provided by @behlendorf in the #595 discussion notes and manually implemented in the current master revision.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: RageLtMan <rageltman@sempervictus>
  Closes #597
* Add support for rw semaphore under PREEMPT_RT_FULL | Clemens Fruhwirth | 2016-12-19 | 1 | -1/+31
  The main complication from the RT patch set is that the RW semaphore locks change such that read locks on an rwsem can be taken only by a single thread. All other threads are locked out. This single thread can take a read lock multiple times though. The underlying implementation changes to a mutex with an additional read_depth count.

  The implementation can be best understood by inspecting the RT patch. rwsem_rt.h and rt.c give the best insight into how RT rwsem works. My implementation for rwsem_tryupgrade is basically an inversion of rt_downgrade_write found in rt.c. Please see the comments in the code.

  Unfortunately, I have to drop SPLAT rwlock test4 completely as this test tries to take multiple locks from different threads, which RT rwsems do not support. Otherwise SPLAT, zconfig.sh, zpios-sanity.sh and zfs-tests.sh pass on my Debian-testing VM with the kernel linux-image-4.8.0-1-rt-amd64.

  Tested-by: kernelOfTruth <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Clemens Fruhwirth <[email protected]>
  Closes zfsonlinux/zfs#5491
  Closes #589
  Closes #308
* Add system_delay_taskq for long delay | Chunwei Chen | 2016-12-08 | 1 | -0/+14
  Add a dedicated system_delay_taskq for long delays like spa_deadman and zpl_posix_acl_free. This will allow us to use system_taskq by dispatching multiple tasks and then calling taskq_wait_outstanding.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #588
* Limit number of tasks shown in taskq proc | Chunwei Chen | 2016-12-01 | 1 | -6/+13
  Limit the number of tasks shown to prevent holding tq_lock for too long.

  Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq would easily cause a lockup. While that bug has been fixed, it's probably still a good idea to do this just in case task lists grow too large.

  Reviewed-by: Tim Chase <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #586
* Add TASKQID_INVALID and TASKQID_INITIAL macros | Ubuntu | 2016-11-02 | 1 | -9/+9
  Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the taskq implementation and test cases to use them. This is solely for the purposes of readability and introduces no functional change.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Fix vmem_size() | Ubuntu | 2016-11-02 | 1 | -4/+30
  Add a minimal implementation of vmem_size() which accounts for the virtual memory usage of the SPL's kmem cache. This functionality is only useful on 32-bit systems with a small virtual address space.

  The following assumptions are made:

  1) The major SPL consumer of virtual memory is the kmem cache.
  2) Memory allocated with vmem_alloc() is short lived and can be ignored.
  3) Allow a 4MB floor as a generous pad given normal consumption.
  4) The spl_kmem_cache_sem only contends with cache create/destroy.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 4.9 compat: group_info changes | Chunwei Chen | 2016-10-20 | 1 | -0/+10
  In Linux 4.9, torvalds/linux@81243ea, group_info changed from a 2D array accessed via ->blocks to a 1D array accessed via ->gid. We change the spl cred functions accordingly.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #581
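The difference between the two layouts can be sketched as follows. Everything here is a simplified model for illustration: `sketch_gid_t`, the accessor names and the block size of 4 are assumptions, and only the 2D-versus-1D indexing comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define	SKETCH_NGROUPS_PER_BLOCK	4	/* hypothetical; the kernel value is larger */

typedef unsigned int sketch_gid_t;

/* Older layout: the i-th group lives in blocks[i / N][i % N]. */
static sketch_gid_t
sketch_group_at_2d(sketch_gid_t **blocks, size_t i)
{
	assert(blocks != NULL);
	return (blocks[i / SKETCH_NGROUPS_PER_BLOCK]
	    [i % SKETCH_NGROUPS_PER_BLOCK]);
}

/* Newer layout: one flat array indexed directly. */
static sketch_gid_t
sketch_group_at_1d(const sketch_gid_t *gid, size_t i)
{
	assert(gid != NULL);
	return (gid[i]);
}
```

A compat layer can hide the choice between the two accessors behind a configure-time macro so that callers index groups the same way on either kernel.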
* Fix crgetgroups out-of-bound and misc cred fix | Chunwei Chen | 2016-10-20 | 1 | -15/+16
  init_groups has 0 nblocks, therefore calling the current crgetgroups with init_groups would result in out-of-bound access. We fix this by returning NULL when nblocks is 0.

  Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return blocks[0].

  Also, remove all get_group_info calls. The cred already holds a reference on the group_info, and the cred is not mutable. So there's no reason to hold an extra reference if we hold the cred.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #556
* Fix out-of-bound in per_cpu in spl_random_init | tuxoko | 2016-10-07 | 1 | -1/+1
  When iterating per_cpu values, we need to use for_each_possible_cpu. While NR_CPUS indicates the number of CPUs supported by the kernel, it might not initialize all of them if the kernel decides it's not possible to use them.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #578
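As a rough user-space model of the distinction, the sketch below walks only the CPUs marked in a "possible" bitmask instead of every index below a compile-time maximum. The mask representation, `SKETCH_NR_CPUS` and the seeding values are assumptions for the example; in the kernel the iteration is done with for_each_possible_cpu.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_NR_CPUS	8	/* hypothetical compile-time maximum */

/*
 * Seed only the slots whose CPU is marked possible and return how many
 * slots were touched. Slots for impossible CPUs are left alone, which
 * models per-CPU storage that was never set up for them.
 */
static int
sketch_seed_possible(uint32_t possible_mask, uint64_t *seeds)
{
	int cpu, n = 0;

	assert(seeds != NULL);
	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!(possible_mask & (1u << cpu)))
			continue;
		seeds[cpu] = (uint64_t)cpu + 1;
		n++;
	}
	return (n);
}
```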
* Fix p0 initializer | Brian Behlendorf | 2016-10-04 | 1 | -1/+2
  Due to changes in the task_struct the following warning occurs when initializing the global p0. Since this structure only exists for its address to be taken, initialize it in a manner which isn't sensitive to internal changes to the structure.

  module/spl/spl-generic.c:58:1: error: missing braces around initializer [-Werror=missing-braces]

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #576
* Increase spl_kmem_alloc_warn limit | Brian Behlendorf | 2016-09-16 | 1 | -2/+2
  In order to support ABD with large blocks the spl_kmem_alloc_warn limit needs to be increased to 64K. A 16M block requires that pointers be stored for 4096 4K-pages on an x86_64 system. Each of these pointers is 8 bytes, requiring an allocation of 8*4096=32,768 bytes. The addition of a small header to this structure pushes the allocation over the default 32K warning threshold.

  In addition, fix a small bug where MAX was used instead of MIN when setting the default. This ensures a reasonable limit is still set on systems with page sizes larger than 4K.

  Reviewed-by: David Quigley <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #571
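The arithmetic in this commit can be checked with a short sketch. The helper names are hypothetical, and the `16 * page_size` term in the default is an assumption chosen so that 4K pages yield 64K; the commit only states that the limit becomes 64K and that MIN replaces MAX.

```c
#include <assert.h>
#include <stddef.h>

#define	SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))
#define	SKETCH_MAX(a, b)	((a) > (b) ? (a) : (b))

/* Bytes needed to store one pointer per page of a block. */
static size_t
sketch_page_ptr_bytes(size_t block_size, size_t page_size, size_t ptr_size)
{
	assert(page_size != 0);
	return ((block_size / page_size) * ptr_size);
}

/* Hypothetical default: MIN keeps the limit at 64K on large-page systems. */
static size_t
sketch_default_warn(size_t page_size)
{
	return (SKETCH_MIN((size_t)16 * page_size, (size_t)64 * 1024));
}
```

For a 16M block with 4K pages and 8-byte pointers the first helper yields 32,768 bytes, so any header pushes the allocation past a 32K threshold. With 64K pages, MIN still gives 64K, whereas MAX over the same two terms would give 1M.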
* Fix: handle NULL case in spl_kmem_free_track() | GeLiXin | 2016-08-19 | 1 | -0/+4
  When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all the buffers allocated by kmem_alloc() and kmem_zalloc(). If a NULL pointer, which has no tracking info in SPL, is passed to spl_kmem_free_track(), we just ignore it.

  Signed-off-by: GeLiXin <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue zfsonlinux/zfs#4967
  Closes #567
* Linux 4.8 compat: rw_semaphore atomic_long_t count | Brian Behlendorf | 2016-07-29 | 1 | -1/+10
  For non-rwsem-spinlocks the "count" member was changed from a "long" to "atomic_long_t" type. A configure check has been added to detect this change along with new versions of the _rwsem_tryupgrade() function and RWSEM_COUNT() macro.

  See https://github.com/torvalds/linux/commit/8ee62b18 for complete details.

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #563
* Improve spl slab cache alloc | Jinshan Xiong | 2016-06-01 | 1 | -8/+35
  The policy is to try to allocate with KM_NOSLEEP, which will lead to memory allocation with GFP_ATOMIC, and if it fails, it will launch a taskq to expand slab space.

  This way it should be able to get better NUMA memory locality and reduce the overhead of context switches.

  Signed-off-by: Jinshan Xiong <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #551
* Implement a proper rw_tryupgrade | Chunwei Chen | 2016-05-31 | 1 | -0/+41
  Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and then does rw_enter(RW_READER) if it fails. This violates the assumption that rw_tryupgrade should be atomic and could cause extra contention or even lock inversion.

  In this patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we use cmpxchg on rwsem->count to change the value from single reader to single writer.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes zfsonlinux/zfs#4692
  Closes #554
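The cmpxchg approach for the normal rwsem case can be modelled with C11 atomics. This is a sketch under stated assumptions: the count encodings `SKETCH_SINGLE_READER` and `SKETCH_SINGLE_WRITER` are placeholders, since the real bias values are kernel-internal, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder encodings; the kernel's actual bias values differ. */
#define	SKETCH_SINGLE_READER	1L
#define	SKETCH_SINGLE_WRITER	(-1L)

typedef struct { atomic_long count; } sketch_rwsem_t;

/*
 * Atomically swap "exactly one reader" for "exactly one writer".
 * Fails, leaving count untouched, if any other holder is present.
 */
static bool
sketch_rwsem_tryupgrade(sketch_rwsem_t *sem)
{
	long expected = SKETCH_SINGLE_READER;

	assert(sem != NULL);
	return (atomic_compare_exchange_strong(&sem->count, &expected,
	    SKETCH_SINGLE_WRITER));
}
```

Because the check and the update happen in one compare-and-exchange, there is no window in which the lock is unheld, which is the property the earlier drop-and-reacquire version lacked.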
* Fix taskq_wait_outstanding re-evaluate tq_next_id | Chunwei Chen | 2016-05-24 | 1 | -2/+2
  wait_event is a macro, so the current implementation will cause re-evaluation of tq_next_id every time it wakes up. This would cause taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq).

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue #553
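The re-evaluation problem can be reproduced in user space with a macro that, like wait_event, expands its condition into a loop. The sketch below is a simplified model, not the SPL code: the two-field taskq, `sketch_tick()` (one task finishes and one is dispatched per wakeup) and the wakeup limit are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
	long tq_next_id;	/* id the next dispatched task will get */
	long tq_lowest_id;	/* lowest id still outstanding */
} sketch_taskq_t;

/* One simulated wakeup: the oldest task finishes, a new one is dispatched. */
static void
sketch_tick(sketch_taskq_t *tq)
{
	tq->tq_lowest_id++;
	tq->tq_next_id++;
}

/* Stand-in for wait_event(): `cond` is evaluated again after every wakeup. */
#define	SKETCH_WAIT_EVENT(tq, cond, limit, wakeups) do {	\
	(wakeups) = 0;						\
	while (!(cond) && (wakeups) < (limit)) {		\
		sketch_tick(tq);				\
		(wakeups)++;					\
	}							\
} while (0)

/* Buggy form: tq_next_id is re-read on every evaluation. */
static int
sketch_wait_buggy(sketch_taskq_t *tq, int limit)
{
	int wakeups;

	assert(tq != NULL);
	SKETCH_WAIT_EVENT(tq, tq->tq_lowest_id > tq->tq_next_id - 1,
	    limit, wakeups);
	return (wakeups);
}

/* Fixed form: snapshot the id once, then wait on the snapshot. */
static int
sketch_wait_fixed(sketch_taskq_t *tq, int limit)
{
	int wakeups;
	long id;

	assert(tq != NULL);
	id = tq->tq_next_id - 1;
	SKETCH_WAIT_EVENT(tq, tq->tq_lowest_id > id, limit, wakeups);
	return (wakeups);
}
```

Starting with three outstanding tasks and a steady stream of new dispatches, the buggy form never sees its condition become true and runs to the wakeup limit, while the fixed form finishes once the three tasks that existed at call time are done.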
* Fix race between taskq_destroy and dynamic spawning thread | Chunwei Chen | 2016-05-24 | 1 | -5/+25
  While taskq_destroy waits for dynamic_taskq to finish its tasks, this does not imply that the thread being spawned is up and running. This can cause the taskq to be freed before the thread can exit.

  We fix this by using tq_nspawn to indicate how many threads are being spawned before they are inserted into the thread list, and by having taskq_destroy wait for it to drop to zero.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue #553
  Closes #550
* Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires | Chunwei Chen | 2016-05-24 | 1 | -3/+2
  In 39cd90e, I mistakenly disabled the ability to use an absolute expire time in cv_timedwait_hires. I'm not quite sure why I did that, so let's restore it.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue #553
* Linux 4.7 compat: inode_lock() and friends | Chunwei Chen | 2016-05-20 | 1 | -1/+2
  Linux 4.7 changes i_mutex to i_rwsem, and we should use inode_lock and inode_lock_shared to take the exclusive and shared lock respectively.

  We use spl_inode_lock{,_shared}() to hide the difference. Note that on older kernels you'll always take an exclusive lock.

  We also add all the other inode_lock friends. Nested users should now explicitly call spl_inode_lock_nested with the correct subclass.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue zfsonlinux/zfs#4665
  Closes #549
* Add cv_timedwait_sig_hires to allow interruptible sleep | Chunwei Chen | 2016-05-12 | 1 | -10/+30
  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #548
* Clear PF_FSTRANS over spl_filp_fallocate() | Tim Chase | 2016-04-26 | 1 | -0/+15
  The problem described in 2a5d574 also applies to XFS's file or inode fallocate method. Both paths may trigger writeback and expose this issue, see the full stack below.

  When layered on XFS a warning will be emitted under CentOS7 when entering either the file or inode fallocate method with PF_FSTRANS already set. To avoid triggering this error PF_FSTRANS is cleared and then reset in vn_space().

  WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0
  Call Trace:
   [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0
   [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20
   [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
   [<ffffffff81173ed7>] __writepage+0x17/0x40
   [<ffffffff81176f81>] write_cache_pages+0x251/0x530
   [<ffffffff811772b1>] generic_writepages+0x51/0x80
   [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs]
   [<ffffffff81177300>] do_writepages+0x20/0x30
   [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100
   [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0
   [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs]
   [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs]
   [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl]
   [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs]
   [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs]
   [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs]
   [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl]
   [<ffffffff810c1777>] kthread+0xd7/0xf0
   [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Nikolay Borisov <[email protected]>
  Closes zfsonlinux/zfs#4529
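The clear-and-reset pattern described for vn_space() can be sketched as below. The flag value, the global standing in for the task's flags word, and the `sketch_` function names are all hypothetical; only the save, clear, call, restore sequence reflects the commit message.

```c
#include <assert.h>

#define	SKETCH_PF_FSTRANS	0x1u	/* hypothetical bit value */

static unsigned int sketch_task_flags;		/* stand-in for the task flags */
static unsigned int sketch_seen_in_callee;	/* what the callee observed */

/* Stand-in for the lower filesystem's fallocate method. */
static int
sketch_fallocate(void)
{
	sketch_seen_in_callee = sketch_task_flags;
	return (0);
}

static int
sketch_vn_space(void)
{
	unsigned int saved = sketch_task_flags & SKETCH_PF_FSTRANS;
	int error;

	sketch_task_flags &= ~SKETCH_PF_FSTRANS;	/* clear before calling down */
	error = sketch_fallocate();
	sketch_task_flags |= saved;			/* restore only if it was set */

	assert((sketch_task_flags & SKETCH_PF_FSTRANS) == saved);
	return (error);
}
```

Restoring only the saved bit, rather than setting the flag unconditionally, keeps the caller's state unchanged whether or not the flag was set on entry.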
* Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq | Tim Chase | 2016-03-17 | 1 | -4/+18
  When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another thread to be spawned. This will cause TQ_NOQUEUE to behave similarly as it does with non-dynamic taskqs.

  Add support for TQ_NOQUEUE to taskq_dispatch_ent().

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #530
* Add rw_tryupgrade() | Brian Behlendorf | 2016-03-10 | 1 | -60/+0
  This implementation of rw_tryupgrade() behaves slightly differently from its counterparts on other platforms. It drops the RW_READER lock and then acquires the RW_WRITER lock leaving a small window where no lock is held. On other platforms the lock is never released during the upgrade process. This is necessary under Linux because the kernel does not provide an upgrade function.

  There are currently no callers in the ZFS code where this change in behavior is a problem. In fact, in most cases the code is already written such that if the upgrade fails the RW_READER lock is dropped and the caller blocks waiting to acquire the lock as RW_WRITER.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Matthew Thode <[email protected]>
  Closes zfsonlinux/zfs#4388
  Closes #534
* random_get_pseudo_bytes() need not provide cryptographic strength entropy | Richard Yao | 2016-02-17 | 1 | -0/+148
  Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG.

  The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia.

  http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

  Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast:

  http://xorshift.di.unimi.it/#speed

  The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom.

  Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly.

  We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids:

  1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future.

  2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective.

  3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about 40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`.

  4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. There is simply too much code to read and given the drawbacks versus this per-CPU PRNG, there is no point in being certain.

  5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`.

  6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG:

  https://en.wikipedia.org/wiki/Seqlock

  The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e. `NR_CPUS == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPUS == 1`). In either case, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #372
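For readers unfamiliar with the generator family named in this commit, here is a minimal user-space sketch of an xorshift128+ step and a byte-filling wrapper. It is not the SPL implementation: the `sketch_` names are hypothetical, the shift constants 23, 18 and 5 are one published xorshift128+ parameter set assumed here (the commit message does not say which constants it uses), and per-CPU placement of the state is only hinted at by the struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128 bits of generator state; the commit keeps one such seed per CPU. */
typedef struct { uint64_t s[2]; } sketch_xs128p_t;

/* One xorshift128+ step (assumed shift constants: 23, 18, 5). */
static uint64_t
sketch_xs128p_next(sketch_xs128p_t *st)
{
	uint64_t s1 = st->s[0];
	const uint64_t s0 = st->s[1];

	assert(s0 != 0 || s1 != 0);	/* the all-zero state never changes */
	st->s[0] = s0;
	s1 ^= s1 << 23;
	st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
	return (st->s[1] + s0);
}

/* Fill a buffer eight bytes at a time, least significant byte first. */
static void
sketch_get_pseudo_bytes(sketch_xs128p_t *st, uint8_t *buf, size_t len)
{
	while (len > 0) {
		uint64_t r = sketch_xs128p_next(st);
		size_t i, n = len < sizeof (r) ? len : sizeof (r);

		for (i = 0; i < n; i++)
			buf[i] = (uint8_t)(r >> (8 * i));
		buf += n;
		len -= n;
	}
}
```

Each state is 16 bytes, so 4096 of them occupy 65,536 bytes, which matches the 64KB upper bound quoted in the commit message. Because a step touches only its own state, giving every CPU a private state removes the need for locks or memory barriers on the hot path.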
* Allow kicking a taskq to spawn more threads | Chunwei Chen | 2016-02-05 | 1 | -0/+60
  This patch adds a module parameter spl_taskq_kick. When a non-zero value is written to it, it will scan all the taskqs; if a taskq contains a task pending for more than 5 seconds, it will be forced to spawn a new thread. This is used as an emergency recovery from deadlock, not a general solution.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #529
* Remove RLIM64_INFINITY assert in vn_rdwr() | Brian Behlendorf | 2016-01-23 | 1 | -1/+0
  Previous commit be29e6a updated kobj_read_file() so it no longer unconditionally passes RLIM64_INFINITY. The vn_rdwr() function needs to be updated accordingly.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #513
* kobj_read_file: Return -1 on vn_rdwr() error | Richard Yao | 2016-01-23 | 1 | -3/+8
  I noticed that the SPL implementation of kobj_read_file is not correct after comparing it with the userland implementation of kobj_read_file() in zfsonlinux/zfs#4104.

  Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr implementation did not support it anyway, so there is no difference.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #513
* Use tsd to store tq for taskq_member | Chunwei Chen | 2016-01-20 | 3 | -56/+61
  To prevent taskq_member from holding tq_lock and doing a linear search, and thus causing contention, we store the taskq pointer to which the thread belongs in tsd. This way taskq_member will not need to touch tq_lock, and tsd has a per-slot spinlock, so the contention should be reduced greatly.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #500
  Closes #504
  Closes #505
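The idea of replacing a locked list walk with a per-thread pointer can be sketched with C11 thread-local storage. This is a user-space analogy, not the SPL tsd API: `_Thread_local` stands in for the tsd slot, the function names are hypothetical, and the membership test is simplified to the calling thread only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_taskq { const char *name; } sketch_taskq_t;

/* Stand-in for the tsd slot: one pointer per thread, no shared lock. */
static _Thread_local sketch_taskq_t *sketch_tsd_taskq;

/* Called once by a worker thread when it starts serving tq. */
static void
sketch_taskq_thread_enter(sketch_taskq_t *tq)
{
	assert(tq != NULL);
	sketch_tsd_taskq = tq;
}

/* O(1) membership test: no tq_lock, no walk of the thread list. */
static int
sketch_taskq_member(const sketch_taskq_t *tq)
{
	return (sketch_tsd_taskq == tq);
}
```

A thread that never entered a taskq sees a NULL slot and is reported as a non-member of every queue.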
* Don't hold mutex until release cv in cv_wait | Chunwei Chen | 2016-01-12 | 1 | -15/+40
  If a thread is holding the mutex when doing cv_destroy, it might end up waiting on a thread in cv_wait. The waiter would wake up trying to acquire the same mutex and cause a deadlock.

  We solve this by moving the mutex_enter to the bottom of cv_wait, so that the waiter will release the cv first, allowing cv_destroy to succeed and have a chance to free the mutex.

  This would create a race condition on the cv_mutex. We use xchg to set and check it to ensure we won't be harmed by the race. As a result, the cv_mutex debugging becomes best-effort.

  Also, the change reveals a race, which was unlikely before, where we call mutex_destroy while test threads are still holding the mutex. We use kthread_stop to make sure the threads have exited before mutex_destroy.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Issue zfsonlinux/zfs#4166
  Issue zfsonlinux/zfs#4106
* Use spl_fstrans_mark instead of memalloc_noio_save | Chunwei Chen | 2015-12-18 | 4 | -37/+5
  For earlier versions of the kernel with memalloc_noio_save, it only turns off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This would cause threads to direct reclaim into ZFS and cause deadlock.

  Instead, we should stick to using spl_fstrans_mark. Since we would explicitly turn off both __GFP_IO and __GFP_FS before allocation, it will work on every version of the kernel.

  This impacts kernel versions 3.9-3.17, see upstream kernel commit torvalds/linux@934f307 for reference.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #515
  Issue zfsonlinux/zfs#4111
* Provide kstat for taskqs | Tim Chase | 2015-12-16 | 2 | -0/+269
  This patch provides 2 new kstats to display task queues:

  /proc/spl/taskqs-all - Display all task queues
  /proc/spl/taskqs - Display only "active" task queues

  A task queue is considered to be "active" if it currently has active (running) threads or if any of its pending, priority, delay or waitq lists are not empty.

  If the task queue has running threads, displays each thread function's address (symbolically, if possible) and its argument.

  If the task queue has a non-empty list of pending, priority or delayed task queue entries (taskq_ent_t), displays each entry's thread function address and argument.

  If the task queue has any waiters, displays each waiting task's pid.

  Note: This patch also updates some comments in taskq.h which referred to "taskq_t" when they should have referred to "taskq_ent_t".

  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #491
* Fix cstyle issues in spl-taskq.c and taskq.h | Brian Behlendorf | 2015-12-11 | 1 | -34/+41
  This patch only addresses the issues identified by the style checker. It contains no functional changes.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Don't use tq->tq_lock_flagsChunwei Chen2015-12-111-61/+62
| | | | | | | | | | The flags argument in spin_lock_irqsave is modified outside of spin_lock context. We cannot use a shared variable like tq->tq_lock_flags for them. This patch removes it and uses a local variable for the flags. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #506
* Subclass tq_lock to eliminate a lockdep warningOlaf Faaland2015-12-111-21/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | When taskq_dispatch() calls taskq_thread_spawn() to create a new thread for a taskq, linux lockdep warns of possible recursive locking. This is a false positive. One such call chain is as follows, when a taskq needs more threads: taskq_dispatch->taskq_thread_spawn->taskq_dispatch The initial taskq_dispatch() holds tq_lock on the taskq that needed more worker threads. The later call into taskq_dispatch() takes dynamic_taskq->tq_lock. Without subclassing, lockdep believes these could potentially be the same lock and complains. A similar case occurs when taskq_dispatch() then calls task_alloc(). This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one of two new lock subclasses: subclass taskq TQ_LOCK_DYNAMIC dynamic_taskq TQ_LOCK_GENERAL any other Signed-off-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #480
* Revert "Make taskq_member() use ->journal_info"Brian Behlendorf2015-12-081-3/+34
| | | | | | | | This reverts commit a430c11f0b1ef16ca5edf3059e4082709277376c. Using journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425! Signed-off-by: Brian Behlendorf <[email protected]> Issue #500
* Make taskq_member() use ->journal_infoRichard Yao2015-12-081-34/+3
| | | | | | | | | | | | | | | | | | | | | | The ->journal_info pointer in the task_struct is reserved for use by filesystems and because the kernel can have multiple file systems on the same stack due to direct reclaim, each filesystem that touches ->journal_info in a callback function will save the value at the start of its frame and restore it at the end of its frame. This allows us to safely use ->journal_info to store a pointer to the taskq's struct in taskq threads so that ZFS code paths can detect the presence of a taskq. This could break if the ZFS code were to use taskq_member from the context of direct reclaim. However, there are no such uses of it, so this is safe. This replaces an O(N) list traversal under a spinlock with an O(1) unlocked pointer comparison. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: tuxoko <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #500
* Fix race between getf() and areleasef()Richard Yao2015-12-031-0/+13
| | | | | | | | | | | | | | | | | | If a vnode is released asynchronously through areleasef(), it is possible for the user process to reuse the file descriptor before areleasef is called. When this happens, getf() will return a stale reference, any operations in the kernel on that file descriptor will fail (as it is closed) and the operations meant for that fd will never occur from userspace's perspective. We correct this by detecting this condition in getf(), doing a putf on the old file handle, updating the file descriptor and proceeding as if everything was fine. When the areleasef() is done, it will harmlessly decrement the reference counter on the Illumos file handle. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #492
* spl-kmem-cache: include linux/prefetch.h for prefetchw()Dimitri John Ledkov2015-12-021-0/+1
| | | | | | | | This is needed for architectures that do not have a builtin prefetchw() Signed-off-by: Dimitri John Ledkov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #502
* Remove superfluous `newline` characterloli10K2015-11-131-1/+1
| | | | | | | | | Remove superfluous `newline` character from spl_kmem_cache_magazine_size module parameter description. Signed-off-by: loli10K <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #499