path: root/module/spl/spl-kmem.c
Commit message · Author · Age · Files · Lines
* OpenZFS restructuring - move platform specific sources
  Matthew Macy · 2019-09-06 · 1 file · -556/+0

  Move platform specific Linux source under module/os/linux/ and update
  the build system accordingly. Additional code restructuring will
  follow to make the common code fully portable.

  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Igor Kozhukhov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Macy <[email protected]>
  Closes #9206
* Add missing __GFP_HIGHMEM flag to vmalloc
  Michael Niewöhner · 2019-07-17 · 1 file · -1/+2

  Make use of the __GFP_HIGHMEM flag in vmem_alloc, which is required
  for some 32-bit systems to make use of the full available memory.
  While kernel versions >=4.12-rc1 add this flag implicitly, older
  kernels do not.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Sebastian Gottschall <[email protected]>
  Signed-off-by: Michael Niewöhner <[email protected]>
  Closes #9031
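  A minimal sketch of the idea, assuming a wrapper like the SPL's
  spl_vmalloc() as the call site (not the verbatim patch):

      /*
       * Sketch only: pre-4.12 kernels do not add __GFP_HIGHMEM inside
       * vmalloc(), so pass it explicitly to let 32-bit systems place
       * these pages in highmem.
       */
      static void *
      spl_vmalloc(unsigned long size, gfp_t lflags)
      {
              return (__vmalloc(size, lflags | __GFP_HIGHMEM, PAGE_KERNEL));
      }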
* Update build system and packaging
  Brian Behlendorf · 2018-05-29 · 1 file · -16/+4

  Minimal changes required to integrate the SPL sources into the ZFS
  repository build infrastructure and packaging.

  Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated. They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

  Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to
    build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

  Reviewed-by: Tony Hutter <[email protected]>
  Reviewed-by: Olaf Faaland <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Pavel Zakharov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  TEST_ZIMPORT_SKIP="yes"
  Closes #7556
* Fix more cstyle warnings
  Brian Behlendorf · 2018-02-24 · 1 file · -1/+4

  This patch contains no functional changes. It is solely intended to
  resolve cstyle warnings in order to facilitate moving the spl source
  code into the zfs repository.

  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #687
* Fix cstyle warnings
  Brian Behlendorf · 2018-02-07 · 1 file · -1/+1

  This patch contains no functional changes. It is solely intended to
  resolve cstyle warnings in order to facilitate moving the spl source
  code into the zfs repository.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #681
* make module/spl/spl-kmem.c non-executable (again)
  Fabian-Gruenbichler · 2017-08-10 · 1 file · -0/+0

  The executable bit was probably set accidentally in commit
  aeb9baa618beea1458ab3ab22cbc0f39213da6cf ("Fix: handle NULL case in
  spl_kmem_free_track()").

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Giuseppe Di Natale <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Gvozden Neskovic <[email protected]>
  Signed-off-by: Fabian Grünbichler <[email protected]>
  Closes #644
* Increase spl_kmem_alloc_warn limit
  Brian Behlendorf · 2016-09-16 · 1 file · -2/+2

  In order to support ABD with large blocks the spl_kmem_alloc_warn
  limit needs to be increased to 64K. A 16M block requires that
  pointers be stored for 4096 4K-pages on an x86_64 system. Each of
  these pointers is 8 bytes, requiring an allocation of
  8*4096=32,768 bytes. The addition of a small header to this
  structure pushes the allocation over the default 32K warning
  threshold.

  In addition, fix a small bug where MAX was used instead of MIN when
  setting the default. This ensures a reasonable limit is still set
  on systems with page sizes larger than 4K.

  Reviewed-by: David Quigley <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #571
* Fix: handle NULL case in spl_kmem_free_track()
  GeLiXin · 2016-08-19 · 1 file · -0/+4

  When DEBUG_KMEM_TRACKING is enabled in SPL, we keep track of all the
  buffers allocated by kmem_alloc() and kmem_zalloc(). If a NULL
  pointer, which carries no tracking info, is passed to
  spl_kmem_free_track(), just ignore it.

  Signed-off-by: GeLiXin <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue zfsonlinux/zfs#4967
  Closes #567
* Use spl_fstrans_mark instead of memalloc_noio_save
  Chunwei Chen · 2015-12-18 · 1 file · -2/+2

  On earlier kernel versions memalloc_noio_save only turns off
  __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
  would cause threads to direct reclaim into ZFS and deadlock.

  Instead, we should stick to using spl_fstrans_mark. Since it
  explicitly turns off both __GFP_IO and __GFP_FS before allocation,
  it will work on every version of the kernel.

  This impacts kernel versions 3.9-3.17, see upstream kernel commit
  torvalds/linux@934f307 for reference.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #515
  Issue zfsonlinux/zfs#4111
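  The core of the approach can be sketched as follows; the flag
  plumbing is simplified, but spl_fstrans_check() is the SPL helper
  that reports whether the current thread was marked:

      /*
       * Simplified sketch: with the thread marked by spl_fstrans_mark(),
       * strip both __GFP_IO and __GFP_FS from the allocation mask.
       * GFP_KERNEL minus those two bits is effectively GFP_NOIO, so
       * direct reclaim can never re-enter the filesystem.
       */
      static gfp_t
      spl_gfp_flags(gfp_t lflags)
      {
              if (spl_fstrans_check())
                      lflags &= ~(__GFP_IO | __GFP_FS);
              return (lflags);
      }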
* Optimize vmem_alloc() retry path
  Brian Behlendorf · 2015-02-02 · 1 file · -1/+11

  For performance reasons the reworked kmem code maps vmem_alloc() to
  kmalloc_node() for allocations less than spl_kmem_alloc_max. This
  allows for more concurrency in the system and less contention on
  the virtual address space. Generally, this is a good thing.

  However, in the case when kmalloc_node() fails it makes little
  sense to retry it using kmalloc_node() again. It will likely fail
  in exactly the same way. A smarter strategy is to abandon this
  optimization and retry using spl_vmalloc(), which is very likely to
  succeed.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ned Bass <[email protected]>
  Closes #428
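  In outline, with the control flow simplified and the function name
  hypothetical, the retry path looks like this:

      /*
       * Sketch of the retry strategy: try the fast kmalloc_node() path
       * for small requests, but on failure fall back to spl_vmalloc()
       * instead of retrying kmalloc_node(), which would almost
       * certainly fail the same way.
       */
      static void *
      spl_vmem_alloc_sketch(size_t size, gfp_t lflags)
      {
              void *ptr = NULL;

              if (size <= spl_kmem_alloc_max)
                      ptr = kmalloc_node(size, lflags | __GFP_NOWARN,
                          NUMA_NO_NODE);
              if (ptr == NULL)
                      ptr = spl_vmalloc(size, lflags);

              return (ptr);
      }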
* Fix GFP_KERNEL allocations flags
  Brian Behlendorf · 2015-01-21 · 1 file · -2/+2

  The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and
  vn_openat() functions should all use the kmem_flags_convert()
  function to generate the GFP_* flags. This ensures that they can be
  safely called in any context and the correct flags will be used.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #426
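  A minimal sketch of what a kmem-to-GFP translation like
  kmem_flags_convert() does; the real table is more complete and the
  exact flag choices below are illustrative, not authoritative:

      static gfp_t
      kmem_flags_convert_sketch(int kmflags)
      {
              gfp_t lflags = __GFP_NOWARN;

              if (kmflags & KM_NOSLEEP)
                      lflags |= GFP_ATOMIC | __GFP_NORETRY; /* no blocking */
              else
                      lflags |= GFP_KERNEL;                 /* may sleep */

              if (kmflags & KM_PUSHPAGE)
                      lflags |= __GFP_HIGH;  /* may dip into reserves */

              return (lflags);
      }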
* Add hooks for disabling direct reclaim
  Richard Yao · 2015-01-16 · 1 file · -1/+1

  The port of XFS to Linux introduced a thread-specific PF_FSTRANS
  bit that is used to mark contexts which are processing
  transactions. When set, allocations in this context can dip into
  kernel memory reserves to avoid deadlocks during writeback. Linux
  3.9 provided the additional PF_MEMALLOC_NOIO for disabling __GFP_IO
  in page allocations, which XFS began using in 3.15.

  This patch implements hooks for marking transactions via
  PF_FSTRANS. When an allocation is performed in the context of
  PF_FSTRANS, any KM_SLEEP allocation is transparently converted to a
  GFP_NOIO allocation.

  Additionally, when using a Linux 3.9 or newer kernel, it will set
  PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout()
  on any KM_PUSHPAGE or KM_NOSLEEP allocation. This effectively
  allows the spl_vmalloc() helper function to be used safely in a
  thread which is responsible for IO.

  Signed-off-by: Brian Behlendorf <[email protected]>
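  Usage follows a mark/unmark bracket; the cookie type and helpers are
  the SPL's, while the surrounding code is illustrative:

      /* Mark this thread as being inside a filesystem transaction. */
      fstrans_cookie_t cookie = spl_fstrans_mark();

      /* Under the mark, KM_SLEEP is transparently issued as GFP_NOIO,
       * so this allocation cannot recurse into ZFS via direct reclaim. */
      void *buf = kmem_alloc(size, KM_SLEEP);

      /* ... perform the transaction ... */

      kmem_free(buf, size);
      spl_fstrans_unmark(cookie);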
* Refactor generic memory allocation interfaces
  Brian Behlendorf · 2015-01-16 · 1 file · -155/+244

  This patch achieves the following goals:

  1. It replaces the preprocessor kmem flag to gfp flag mapping with
     proper translation logic. This eliminates the potential for
     surprises that were previously possible where kmem flags were
     mapped to gfp flags.

  2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
     sized less than or equal to the newly-added spl_kmem_alloc_max
     parameter. This ensures that small allocations will not contend
     on a single global lock, large allocations can still be handled,
     and potentially limited virtual address space will not be
     squandered. This behavior is entirely different than under
     Illumos due to different memory management strategies employed
     by the respective kernels. However, this functionally provides
     the semantics required.

  3. The --disable-debug-kmem, --enable-debug-kmem (default), and
     --enable-debug-kmem-tracking allocators have been unified into a
     single spl_kmem_alloc_impl() allocation function. This was done
     to simplify the code and make it more maintainable.

  4. Improve portability by exposing an implementation of the memory
     allocation functions that can be safely used in the same way
     they are used on Illumos. Specifically, callers may safely use
     KM_SLEEP in contexts which perform filesystem IO. This allows us
     to eliminate an entire class of Linux-specific changes which
     were previously required to avoid deadlocking the system.

  This change will be largely transparent to existing callers but
  there are a few caveats:

  1. Because the headers were refactored and extraneous includes
     removed, callers may find they need to explicitly add additional
     #includes. In particular, kmem_cache.h must now be explicitly
     included to access the SPL's kmem cache implementation. This
     behavior is different from Illumos but it was done to avoid
     always masking the Linux slab functions when kmem.h is included.

  2. Callers, like Lustre, which made assumptions about the
     definitions of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need
     to be updated. Other callers such as ZFS which did not will not
     require changes.

  3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It
     retains its original meaning of allowing allocations to access
     reserved memory. KM_PUSHPAGE callers can be converted back to
     KM_SLEEP.

  4. The KM_NODEBUG flag has been retired and the default warning
     threshold increased to 32k.

  5. The kmem_virt() function has been removed. For callers which
     need to distinguish between a physical and virtual address use
     is_vmalloc_addr().

  Signed-off-by: Brian Behlendorf <[email protected]>
* Fix kmem cstyle issues
  Brian Behlendorf · 2015-01-16 · 1 file · -49/+55

  Address all cstyle issues in the kmem, vmem, and kmem_cache source
  and headers. This was done to make it easier to review subsequent
  changes which will rework the kmem/vmem implementation.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Refactor existing code
  Brian Behlendorf · 2015-01-16 · 1 file · -1790/+2

  This change introduces no functional changes to the memory
  management interfaces. It only restructures the existing code by
  separating the kmem, vmem, and kmem cache implementations into
  separate source and header files.

  Splitting this functionality into separate files required the
  addition of spl_vmem_{init,fini}() and spl_kmem_cache_{init,fini}()
  functions. Additionally, several minor changes to the #include's
  were required to accommodate the removal of extraneous headers from
  kmem.h. But again, while large this patch introduces no functional
  changes.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Remove compat includes from sys/types.h
  Ned Bass · 2014-11-19 · 1 file · -0/+2

  Don't include the compatibility code in linux/*_compat.h in the
  public header sys/types.h. This causes problems when an external
  code base includes the ZFS headers and has its own conflicting
  compatibility code. Lustre, in particular, defined SHRINK_STOP for
  compatibility with pre-3.12 kernels in a way that conflicted with
  the SPL's definition. Because Lustre ZFS OSD includes ZFS headers
  it fails to build due to a '"SHRINK_STOP" redefined' compiler
  warning. To avoid such conflicts only include the compat headers
  from .c files or private headers.

  Also, for consistency, include sys/*.h before linux/*.h then sort
  by header name.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #411
* Retire legacy debugging infrastructure
  Brian Behlendorf · 2014-11-19 · 1 file · -205/+97

  When the SPL was originally written Linux tracepoints were still in
  their infancy. Therefore, an entire debugging subsystem was added
  to facilitate tracing which served us well for many years.

  Now that Linux tracepoints have matured they provide all the
  functionality of the previous tracing subsystem. Rather than
  maintain parallel functionality it makes sense to fully adopt
  tracepoints. Therefore, this patch retires the legacy debugging
  infrastructure.

  See zfsonlinux/zfs@bc9f413 for the tracepoint changes.

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #408
* kmem_cache: Call constructor/destructor on each alloc/free
  Richard Yao · 2014-10-28 · 1 file · -21/+16

  This has a few benefits. First, it fixes a regression that "Rework
  generic memory allocation interfaces" appears to have triggered in
  splat's slab_reap and slab_age tests. Second, it makes porting code
  from Illumos to ZFSOnLinux easier. Third, it has the side effect of
  making reclaim from slab caches that specify reclaim functions an
  order of magnitude faster. The splat slab_reap test usually took 30
  to 40 seconds. With this change, it takes 3 to 4.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #369
* Linux 3.12 compat: shrinker semantics
  Tim Chase · 2014-10-28 · 1 file · -19/+34

  The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
  replacing the @shrink callback with the pair of @count_objects and
  @scan_objects. It also requires @scan_objects to return the number
  of objects actually freed, whereas the previous @shrink callback
  returned the number of remaining freeable objects.

  This patch adds support for the new @scan_objects return value
  semantics and updates the splat shrinker test case appropriately.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Closes #403
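  For reference, the split API has this shape on 3.12+ kernels; the
  callback bodies below are hypothetical stand-ins for the SPL's
  actual cache accounting:

      static unsigned long
      spl_kmem_cache_shrinker_count(struct shrinker *shrink,
          struct shrink_control *sc)
      {
              /* Report how many objects could be freed right now. */
              return (spl_kmem_cache_reclaimable());   /* hypothetical */
      }

      static unsigned long
      spl_kmem_cache_shrinker_scan(struct shrinker *shrink,
          struct shrink_control *sc)
      {
              /* Free up to sc->nr_to_scan objects and return the
               * number actually freed (or SHRINK_STOP to abort). */
              return (spl_kmem_cache_reap(sc->nr_to_scan)); /* hypothetical */
      }

      static struct shrinker spl_kmem_cache_shrinker = {
              .count_objects  = spl_kmem_cache_shrinker_count,
              .scan_objects   = spl_kmem_cache_shrinker_scan,
              .seeks          = DEFAULT_SEEKS,
      };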
* Remove kvasprintf() wrapper
  Brian Behlendorf · 2014-10-17 · 1 file · -23/+0

  The kvasprintf() function has been available since Linux 2.6.22.
  There is no longer a need to maintain this compatibility code.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Remove shrink_{i,d}node_cache() wrappers
  Brian Behlendorf · 2014-10-17 · 1 file · -31/+0

  This is optional functionality which may or may not be useful to
  ZFS when using older kernels. It is never a hard requirement.
  Therefore this functionality is being removed from the SPL and a
  simpler slimmed down version will be added to ZFS.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Remove global memory variables
  Brian Behlendorf · 2014-10-17 · 1 file · -220/+0

  Platforms such as Illumos and FreeBSD have historically provided
  global variables which summarize the memory state of a system.
  Linux on the other hand doesn't expose any of this information to
  kernel modules and uses entirely different mechanisms for memory
  management.

  In order to simplify the original ZFS port to Linux these global
  variables were emulated by the SPL for the benefit of ZFS. As ZoL
  has matured over the years it has moved steadily away from these
  interfaces and now no longer depends on them at all.

  Therefore, this patch completely removes the global variables
  availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
  and swapfs_reserve. This greatly simplifies the memory management
  code and eliminates a common area of confusion.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Remove get_vmalloc_info() wrapper
  Brian Behlendorf · 2014-10-17 · 1 file · -27/+4

  The get_vmalloc_info() function was used to back the vmem_size()
  function. This was always problematic and resulted in brittle code
  because the kernel never provided a clean interface for modules.

  However, it turns out that the only caller of this function in ZFS
  uses it to determine the total virtual address space size. This can
  be determined easily without get_vmalloc_info() so vmem_size() has
  been updated to take this approach which allows us to shed the
  get_vmalloc_info() dependency.

  Signed-off-by: Brian Behlendorf <[email protected]>
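  The total virtual address space can be computed directly from the
  architecture constants; a sketch of the simplified vmem_size(),
  assuming only the total arena size is ever requested:

      size_t
      vmem_size_sketch(vmem_t *vmp, int typemask)
      {
              /* VMALLOC_START/VMALLOC_END bound the kernel's vmalloc
               * arena, so their difference is the total VA space. */
              return ((size_t)(VMALLOC_END - VMALLOC_START));
      }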
* Remove on_each_cpu() wrapper
  Brian Behlendorf · 2014-10-17 · 1 file · -1/+1

  The on_each_cpu() function has been available since Linux 2.6.27.
  There is no longer a need to maintain this compatibility code.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Map highbit64() to fls64()
  Brian Behlendorf · 2014-10-17 · 1 file · -1/+1

  The fls64() function has been available since Linux 2.6.16 and it
  should be used to implement highbit64(). This allows us to provide
  an optimized implementation and simplify the code.

  Signed-off-by: Brian Behlendorf <[email protected]>
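  The mapping is direct, since fls64() already returns the 1-based
  index of the highest set bit (and 0 for an input of 0), which
  matches the Illumos highbit64() contract:

      #include <linux/bitops.h>

      static inline int
      highbit64(uint64_t i)
      {
              return (fls64(i));
      }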
* Remove sysctl_vfs_cache_pressure assumption
  Brian Behlendorf · 2014-10-17 · 1 file · -1/+1

  The generic SPL cache shrinkers make the assumption that the caches
  only contain VFS cache data and therefore should be scaled based on
  vfs_cache_pressure. This is not strictly true and it should not be
  assumed.

  Removing this tuning should not have any impact on the stock
  behavior because vfs_cache_pressure=100 by default. This means that
  no scaling will take place.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Linux 3.16 compat: smp_mb__after_clear_bit()
  Turbo Fredriksson · 2014-09-22 · 1 file · -1/+1

  The smp_mb__{before,after}_clear_bit functions have been renamed
  smp_mb__{before,after}_atomic. Rather than adding a compatibility
  function to handle this the code has been updated to use smp_wmb().

  This has the advantage of being a stable functionally equivalent
  interface. On many architectures smp_mb__after_clear_bit() expands
  to smp_wmb(). Others might be able to do something slightly more
  efficient but this will be safe and correct on all of them.

  Signed-off-by: Turbo Fredriksson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #386
* Linux 3.17 compat: remove wait_on_bit action function
  Ned Bass · 2014-08-11 · 1 file · -9/+2

  Linux kernel 3.17 removes the action function argument from
  wait_on_bit(). Add autoconf test and compatibility macro to support
  the new interface.

  The former "wait_on_bit" interface required an 'action' function to
  be provided which does the actual waiting. There were over 20 such
  functions in the kernel, many of them identical, though most cases
  can be satisfied by one of just two functions: one which uses
  io_schedule() and one which just uses schedule(). This API change
  was made to consolidate all of those redundant wait functions.

  References: torvalds/linux@7431620

  Signed-off-by: Ned Bass <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #378
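  A compatibility macro along these lines bridges the two signatures;
  the macro, guard, and action-function names here are hypothetical,
  not necessarily the ones the patch used:

      /* Configure-time check sets HAVE_WAIT_ON_BIT_ACTION when the
       * kernel's wait_on_bit() still takes an action function. */
      #ifdef HAVE_WAIT_ON_BIT_ACTION
      /* spl_bit_wait(): trivial schedule()-based action fn, omitted */
      #define spl_wait_on_bit(word, bit, mode) \
              wait_on_bit(word, bit, spl_bit_wait, mode)
      #else
      #define spl_wait_on_bit(word, bit, mode) \
              wait_on_bit(word, bit, mode)
      #endif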
* Set spl_kmem_cache_slab_limit=16384 to default
  Brian Behlendorf · 2014-08-08 · 1 file · -0/+10

  For small objects the Linux slab allocator should be used to make
  the most efficient use of the memory. However, large objects are
  not supported by the Linux slab and therefore the SPL
  implementation is preferred. A cutoff of 16K was determined to be
  optimal for architectures using 4K pages.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: DHE <[email protected]>
  Issue #356
  Closes #379
* Set spl_kmem_cache_reclaim=0 to default
  Brian Behlendorf · 2014-08-08 · 1 file · -5/+6

  Reinstate the correct default behavior of returning the number of
  objects in the cache for reclaim. This behavior was disabled in
  recent releases due to occasional reports of spinning in
  shrink_slabs(). Those issues have been resolved and can no longer
  be reproduced. See commit 376dc35.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: DHE <[email protected]>
  Issue #358
  Closes #379
* Rate limit debugging stack traces
  Brian Behlendorf · 2014-07-22 · 1 file · -1/+1

  There have been issues in the past where excessive debug logging to
  the console has resulted in significant performance impacts. In the
  vast majority of these cases only a few stack traces are required
  to diagnose the issue. Therefore, stack traces dumped to the
  console will now be limited to 5 every 60s.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Closes #374
* Add spl_kmem_cache_reclaim module option
  Brian Behlendorf · 2014-05-22 · 1 file · -9/+20

  The correct behavior for all registered shrinkers is to return the
  number of objects in their cache. In theory this allows the Linux
  VM to balance memory reclaim across all registered caches.

  In commit b9b3715 this behavior was disabled in favor of returning
  -1 which notifies the VM that no additional objects are available
  for reclaim. This was done as a workaround to resolve thrashing in
  shrink_slabs() which could occur when memory was low and numerous
  cores were in reclaim. Unfortunately, this has been observed to
  increase the likelihood of OOM events when SPL slab consumers are
  responsible for consuming the majority of memory.

  Therefore, this patch makes this behavior tunable. Setting the
  spl_kmem_cache_reclaim module option to 0x1 will result in the
  shrinker only being called once. This is the default behavior.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Closes #358
* Add KMC_SLAB cache type
  Brian Behlendorf · 2014-05-22 · 1 file · -22/+152

  For small objects the Linux slab allocator has several advantages
  over its counterpart in the SPL. These include:

  1) It is more memory-efficient and packs objects more tightly.
  2) It is continually tuned to maximize performance.

  Therefore it makes sense to layer the SPL's slab allocator on top
  of the Linux slab allocator. This allows us to leverage the
  advantages above while preserving the Illumos semantics we depend
  on. However, there are some things we need to be careful of:

  1) The Linux slab allocator was never designed to work well with
     large objects. Because the SPL slab must still handle this use
     case a cut off limit was added to transition from Linux slab
     backed objects to kmem or vmem backed slabs.

     spl_kmem_cache_slab_limit - Objects less than or equal to this
     size in bytes will be backed by the Linux slab. By default this
     value is zero which disables the Linux slab functionality.
     Reasonable values for this cut off limit are in the range of
     4096-16384 bytes.

     spl_kmem_cache_kmem_limit - Objects less than or equal to this
     size in bytes will be backed by a kmem slab. Objects over this
     size will be vmem backed instead. This value defaults to 1/8 of
     a page, or 512 bytes on an x86_64 architecture.

  2) Be aware that using the Linux slab may inadvertently introduce
     new deadlocks. Care has been taken previously to ensure that
     all allocations which occur in the write path use GFP_NOIO.
     However, there may be internal allocations performed in the
     Linux slab which do not honor these flags. If this is the case
     a deadlock may occur.

  The path forward is definitely to start relying on the Linux slab.
  But for that to happen we need to start building confidence that
  there aren't any unexpected surprises lurking for us. And ideally
  we need to move completely away from using the SPL's slab for large
  memory allocations. This patch is a first step.

  NOTES:
  1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
     backed caches but it is not supported for kmem/vmem backed
     caches.
  2) Regardless of the spl_kmem_cache_*_limit settings a cache may
     be explicitly set to a given type by passing the KMC_KMEM,
     KMC_VMEM, or KMC_SLAB flags during cache creation.
  3) The constructors, destructors, and reclaim callbacks are all
     functional and will be called regardless of the cache type.
  4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
     the issues involved in presenting correct object accounting.
     Instead they will appear in /proc/slabinfo under the same names.
  5) Several kmem SPLAT tests needed to be fixed because they relied
     incorrectly on internal kmem slab accounting. With the updated
     test cases all the SPLAT tests pass as expected.
  6) An autoconf test was added to ensure that the __GFP_COMP flag
     was correctly added to the default flags used when allocating a
     slab. This is required to ensure all pages in higher order slabs
     are properly refcounted, see ae16ed9.
  7) When using the SLUB allocator there is no need to attempt to set
     the __GFP_COMP flag. This has been the default behavior for the
     SLUB since Linux 2.6.25.
  8) When using the SLUB it may be desirable to set the slub_nomerge
     kernel parameter to prevent caches from being merged.

  Original-patch-by: DHE <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Signed-off-by: Tim Chase <[email protected]>
  Signed-off-by: DHE <[email protected]>
  Signed-off-by: Chunwei Chen <[email protected]>
  Closes #356
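  The cutoff decision at cache-creation time reduces to a simple
  comparison; a simplified sketch (the real logic also honors
  explicit KMC_* flags from the caller):

      /* Pick a backing store for the new cache based on object size. */
      if (size <= spl_kmem_cache_slab_limit)
              skc->skc_flags |= KMC_SLAB;   /* Linux slab backed */
      else if (size <= spl_kmem_cache_kmem_limit)
              skc->skc_flags |= KMC_KMEM;   /* kmalloc backed SPL slab */
      else
              skc->skc_flags |= KMC_VMEM;   /* vmalloc backed SPL slab */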
* Fix crash when using ZFS on Ceph rbd
  Chunwei Chen · 2014-04-25 · 1 file · -1/+2

  When using __get_free_pages to get high order memory, only the
  first page's _count will be set to 1, the others will be 0. When an
  internal page gets passed into rbd, it will eventually go into
  tcp_sendpage. There, it will be handled with get_page and put_page,
  and get freed erroneously when _count jumps back to 0.

  The solution to this problem is to use compound pages. All pages in
  a high order compound page share a single _count, so get_page and
  put_page in tcp_sendpage will not cause _count to jump to 0.

  Signed-off-by: Chunwei Chen <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #251
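  The fix amounts to requesting a compound page for high-order
  allocations, e.g.:

      /* __GFP_COMP makes the head page own a single shared _count for
       * the whole high-order allocation, keeping tcp_sendpage()'s
       * get_page()/put_page() pairs balanced on tail pages. */
      ptr = (void *)__get_free_pages(lflags | __GFP_COMP, get_order(size));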
* Change spl_kmem_cache_expire default setting to 2
  Richard Yao · 2014-04-14 · 1 file · -3/+4

  This behavior is more consistent with the way memory reclaim is
  expected to work under Linux.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #349
* Expose max/min objs per slab and max slab size
  Andrey Vesnovaty · 2014-04-14 · 1 file · -6/+19

  By default the maximal number of objects in a slab can't exceed
  (16*2 - 1) and the slab size can't exceed 32M. Today's high end
  servers with a couple hundred gigabytes of RAM available for ARC
  may run into trouble with virtual memory because of the restriction
  mentioned above.

  Problem:
  Reasons for the very high number of virtual memory allocations:
  * Real slab size very small relative to the size of the entire RAM
  * Slabs allocated on virtual memory and fill entire ARC

  The result is a very high number of allocated virtual memory ranges
  (hundreds of ranges). When the virtual memory subsystem manages a
  high number of ranges its performance becomes so poor that it
  freezes from time to time.

  Solution:
  The number of objects per slab should be increased taking into
  account the maximal slab size, which can also be increased if
  needed.

  Signed-off-by: Andrey Vesnovaty <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #337
* Consistently use local_irq_disable/local_irq_enable
  Brian Behlendorf · 2013-10-09 · 1 file · -3/+2

  It was observed that spl_kmem_cache_alloc() uses local_irq_save()
  and saves the interrupt state in a local variable. This would
  normally be fine except that spl_kmem_cache_alloc() calls
  spl_cache_refill() which re-enables interrupts. It is then possible
  that while interrupts are enabled the process is rescheduled to a
  different cpu before being disabled again. This could result in us
  restoring the saved interrupt state from one cpu on another.

  What the consequences of this are aren't perfectly clear, but this
  is clearly a bug and it has the potential to cause issues. The code
  has been updated to just use local_irq_enable() and
  local_irq_disable() to avoid this.

  Signed-off-by: Brian Behlendorf <[email protected]>
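  Illustratively, the hazard and the fix look like this (simplified;
  the refill path is where interrupts get re-enabled):

      /* Buggy pattern: 'flags' captured on CPU A may be restored on
       * CPU B if the task migrates while spl_cache_refill() briefly
       * re-enables interrupts.
       *
       *     unsigned long flags;
       *     local_irq_save(flags);
       *     ...
       *     local_irq_restore(flags);
       *
       * Fixed pattern: no saved state to leak across CPUs. */
      local_irq_disable();
      /* ... allocate from this CPU's magazine, possibly refilling ... */
      local_irq_enable();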
* Fix race in spl_kmem_cache_reap_now()
  Richard Yao · 2013-08-08 · 1 file · -5/+5

  The current code contains a race condition that triggers when bit 2
  in spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is
  invoked and another thread is concurrently accessing its magazine.

  spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on
  each magazine in the same thread when bit 2 in
  spl.spl_kmem_cache_expire is set. This is unsafe because there is
  one magazine per CPU and the magazines are lockless, so it is
  impossible to guarantee that another CPU is not using its magazine
  when this function is called.

  The solution is to only touch the local CPU's magazine and leave
  other CPUs' magazines to other CPUs.

  Reported-by: DHE
  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #274
* Fix KMC_OFFSLAB type caches
  Brian Behlendorf · 2013-07-30 · 1 file · -1/+2

  Because spl_slab_size() was always returning -ENOSPC for caches of
  type KMC_OFFSLAB the cache could never be created. Additionally the
  slab size is rounded up to a page which is what kv_alloc() expects.
  The kv_alloc() code will minimally allocate a page; in the
  KMC_OFFSLAB case this could be reduced.

  The basic regression tests kmem:slab_small, kmem:slab_large, and
  kmem:slab_align were updated to test KMC_OFFSLAB.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ying Zhu <[email protected]>
  Closes #266
* Return -1 for generic kmem cache shrinker
  Brian Behlendorf · 2013-07-30 · 1 file · -1/+9

  It has been observed that it's possible to get in a state where
  shrink_slabs() will spin repeatedly invoking the generic kmem cache
  shrinker. It fails to detect that it's not making forward progress
  reclaiming from the cache and doesn't give up. To ensure this never
  occurs we unconditionally return -1 after reclaiming what we can.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Richard Yao <[email protected]>
  Closes zfsonlinux/zfs#1276
  Closes zfsonlinux/zfs#1598
  Closes zfsonlinux/zfs#1432
* Fix bogus kmem leak warning
  Brian Behlendorf · 2013-07-10 · 1 file · -4/+5

  Commit 5c7a036 correctly relocated the creation of a taskq and the
  registration of the kmem_cache_shrinker after the initialization of
  the kmem tracking code. However, the cleanup of these structures
  was not done before the leak checks in spl_kmem_fini(). This
  resulted in an incorrect 'kmem leaked' warning even though there
  was no actual leak.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes zfsonlinux/zfs#1569
* Fix --enable-debug-kmem-tracking option
  Brian Behlendorf · 2013-07-09 · 1 file · -9/+9

  This code has gotten somewhat stale and no longer builds cleanly
  against modern kernels. The two issues addressed here are as
  follows:

  * The hlist_*_rcu interfaces in the kernel have been relatively
    unstable. Since this isn't performance critical code just use the
    long standing hlist_* variants.

  * In older kernels the hash_ptr() function takes a 'void *' but in
    newer kernels it expects a 'const void *'. To silence the
    compiler warnings about this explicitly cast it to a 'void *'.
    The memset function is a similar case but it always expects a
    'void *'.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #256
* Fix --enable-debug-kmem-tracking option
  Tim Chase · 2013-06-18 · 1 file · -7/+8

  Re-order initialization in spl_kmem_init to allow for kmem tracing
  to work. The spl_kmem_init function calls taskq_create prior to
  initializing the tracking (calling spl_kmem_init_tracking). Since
  taskq_create uses kmem_alloc, NULL dereferences occur because the
  global kmem_list hasn't had its next & prev pointers initialized
  yet.

  This commit moves the calls to spl_kmem_init_tracking earlier in
  the spl_kmem_init function in order that the subsequent kmem_alloc
  calls (by taskq_create) work properly.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #243
* Do not call cond_resched() in spl_slab_reclaim()
  Richard Yao · 2013-03-21 · 1 file · -3/+0

  Calling cond_resched() after each object is freed and then after
  each slab is freed can cause slabs of objects to live for excessive
  periods of time following reclamation. This interferes with the
  kernel's own memory management when called from kswapd and can
  cause direct reclaim to occur in response to memory pressure that
  should have been resolved.

  Signed-off-by: Richard Yao <[email protected]>
* Linux 3.9 compat: Switch to hlist_for_each{,_rcu}
  Richard Yao · 2013-03-14 · 1 file · -1/+2

  torvalds/linux@b67bfe0d42cac56c512dd5da4b1b347a23f4b70a changed
  hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We
  handle this by switching to hlist_for_each{,_rcu}, which works
  across all supported kernels.

  Signed-off-by: Richard Yao <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
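  The portable form walks the raw nodes and recovers the entry by
  hand; a sketch against the SPL's kmem-tracking list, with field
  names that are illustrative and may differ from the source:

      struct hlist_node *node;
      struct kmem_debug *kd = NULL;

      /* Works on both pre- and post-3.9 kernels, unlike the
       * hlist_for_each_entry() macro whose arity changed. */
      hlist_for_each(node, head) {
              kd = hlist_entry(node, struct kmem_debug, kd_hlist);
              if (kd->kd_addr == addr)
                      break;
      }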
* Refresh links to web site
  Ned Bass · 2013-03-04 · 1 file · -1/+1

  Update links to refer to the official ZFS on Linux website instead
  of @behlendorf's personal fork on github.

  Signed-off-by: Brian Behlendorf <[email protected]>
* Add spl_kmem_cache_expire module option
  Brian Behlendorf · 2013-01-28 · 1 file · -4/+32

  Cache aging was implemented because it was part of the default
  Solaris kmem_cache behavior. The idea is that per-cpu objects which
  haven't been accessed in several seconds should be returned to the
  cache. On the other hand Linux slabs never move objects back to the
  slabs unless there is memory pressure on the system.

  This behavior is now configurable through the
  'spl_kmem_cache_expire' module option. The value is a bit mask with
  the following meaning.

    0x1 - Solaris style cache aging eviction is enabled.
    0x2 - Linux style low memory eviction is enabled.

  Both methods may be safely enabled simultaneously, but by default
  both are disabled. It has never been clear if the kmem cache aging
  (which has been around from day one) actually does any good. It has
  however been the source of numerous bugs so I wouldn't mind
  retiring it entirely.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes zfsonlinux/zfs#1227
  Closes #210
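  In code, the option is consumed as a plain bit mask; a simplified
  sketch using the SPL's KMC_EXPIRE_* bit names (the arming helper
  below is hypothetical):

      #define KMC_EXPIRE_AGE  0x1  /* Solaris-style cache aging */
      #define KMC_EXPIRE_MEM  0x2  /* Linux-style low-memory eviction */

      unsigned int spl_kmem_cache_expire = 0;  /* both off by default */

      /* e.g. arm the per-cache aging timer only when bit 0x1 is set */
      if (spl_kmem_cache_expire & KMC_EXPIRE_AGE)
              spl_magazine_age_arm(skc);       /* hypothetical helper */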
* Remove spl_invalidate_inodes()
  Brian Behlendorf · 2013-01-17 · 1 file · -14/+0

  This functionality is no longer required by ZFS, see commit
  zfsonlinux/zfs@7b3e34ba5a7ee8d0fda44d214f6f11eb16cdb26f. Since
  there are no other consumers, and because it adds additional
  autoconf complexity which must be maintained, the
  spl_invalidate_inodes() function has been removed.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue zfsonlinux/zfs#795
* kmem-cache: Fix slab ageing soft lockup
  Brian Behlendorf · 2013-01-14 · 1 file · -43/+51

  Commit a10287e00d13c4c4dbbff14f42b00b03da363fcb slightly reworked
  the slab ageing code such that it is no longer dependent on the
  Linux delayed work queue interfaces. This was good for portability
  and performance, but it requires us to use the on_each_cpu()
  function to execute the spl_magazine_age() function. That means
  that the function is now executing in interrupt context whereas
  before it was scheduled in normal process context. And that means
  we need to be slightly more careful about the locking in the
  interrupt handler.

  With the reworked code it's possible that we'll be holding the
  skc->skc_lock and be interrupted to handle the spl_magazine_age()
  IRQ. This will result in a deadlock and soft lockup errors unless
  we're careful to detect the contention and avoid taking the lock in
  the interrupt handler. So that's what this patch does.

  Alternately (and slightly more conventionally) we could have used
  spin_lock_irqsave() to prevent this race entirely, but I'd prefer
  to avoid disabling interrupts as much as possible due to
  performance concerns. There is absolutely no penalty for us not
  aging objects out of the magazine due to contention.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Prakash Surya <[email protected]>
  Closes zfsonlinux/zfs#1193
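  The contention avoidance described above boils down to a trylock in
  the IRQ-context path; an illustrative fragment, not the verbatim
  patch:

      /* Running in interrupt context via on_each_cpu(): if this CPU
       * already holds skc->skc_lock we would deadlock, so skip this
       * ageing pass instead of spinning. */
      if (!spin_trylock(&skc->skc_lock))
              return;

      /* ... return idle per-CPU objects to the slab ... */

      spin_unlock(&skc->skc_lock);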
* kmem-cache: Use a taskq for async allocations
  Brian Behlendorf · 2012-12-12 · 1 file · -4/+4

  Shift the asynchronous allocations over to use the taskq
  interfaces. This allows us to abandon the kernel's delayed work
  queue interface and all the compatibility code it requires.

  This code never actually used the delay functionality; it was just
  done this way to leverage the existing compatibility code. All that
  is required is a thread context to perform the allocation in. The
  only thing clever in this change is that we take advantage of the
  preallocated task queue entries to avoid a memory allocation.

  Signed-off-by: Brian Behlendorf <[email protected]>