aboutsummaryrefslogtreecommitdiffstats
path: root/config/kernel.m4
Commit message (Collapse)AuthorAgeFilesLines
* Remove zfs_vdev_elevator module optionBrian Behlendorf2019-11-271-2/+0
| | | | | | | | | | | | | As described in commit f81d5ef6 the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #8664 Closes #9417 Closes #9609
* Make use of kvmalloc if available and fix vmem_alloc implementationMichael Niewöhner2019-11-131-0/+2
| | | | | | | | | | | | | | | | | This patch implements use of kvmalloc for GFP_KERNEL allocations, which may increase performance if the allocator is able to allocate physical memory, if kvmalloc is available as a public kernel interface (since v4.12). Otherwise it will simply fall back to virtual memory (vmalloc). Also fix vmem_alloc implementation which can lead to slow allocations since the first attempt with kmalloc does not make use of the noretry flag but tells the linux kernel to retry several times before it fails. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Sebastian Gottschall <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes #9034
* Linux compat: Minimum kernel version 3.10Brian Behlendorf2019-11-121-58/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Increase the minimum supported kernel version from 2.6.32 to 3.10. This removes support for the following Linux enterprise distributions. Distribution | Kernel | End of Life ---------------- | ------ | ------------- Ubuntu 12.04 LTS | 3.2 | Apr 28, 2017 SLES 11 | 3.0 | Mar 32, 2019 RHEL / CentOS 6 | 2.6.32 | Nov 30, 2020 The following changes were made as part of removing support. * Updated `configure` to enforce a minimum kernel version as specified in the META file (Linux-Minimum: 3.10). configure: error: *** Cannot build against kernel version 2.6.32. *** The minimum supported kernel version is 3.10. * Removed all `configure` kABI checks and matching C code for interfaces which solely predate the Linux 3.10 kernel. * Updated all `configure` kABI checks to fail when an interface is missing which was in the 3.10 kernel up to the latest 5.1 kernel. Removed the HAVE_* preprocessor defines for these checks and updated the code to unconditionally use the verified interface. * Inverted the detection logic in several kABI checks to match the new interface as it appears in 3.10 and newer and not the legacy interface. * Consolidated the following checks in to individual files. Due the large number of changes in the checks it made sense to handle this now. It would be desirable to group other related checks in the same fashion, but this as left as future work. - config/kernel-blkdev.m4 - Block device kABI checks - config/kernel-blk-queue.m4 - Block queue kABI checks - config/kernel-bio.m4 - Bio interface kABI checks * Removed the kABI checks for sops->nr_cached_objects() and sops->free_cached_objects(). These interfaces are currently unused. Signed-off-by: Brian Behlendorf <[email protected]> Closes #9566
* Perform KABI checks in parallelBrian Behlendorf2019-10-011-290/+493
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reduce the time required for ./configure to perform the needed KABI checks by allowing kbuild to compile multiple test cases in parallel. This was accomplished by splitting each test's source code from the logic handling whether that code could be compiled or not. By introducing this split it's possible to minimize the number of times kbuild needs to be invoked. As importantly, it means all of the tests can be built in parallel. This does require a little extra care since we expect some tests to fail, so the --keep-going (-k) option must be provided otherwise some tests may not get compiled. Furthermore, since a failure during the kbuild modpost phase will result in an early exit; the final linking phase is limited to tests which passed the initial compilation and produced an object file. Once everything has been built the configure script proceeds as previously. The only significant difference is that it now merely needs to test for the existence of a .ko file to determine the result of a given test. This vastly speeds up the entire process. New test cases should use ZFS_LINUX_TEST_SRC to declare their test source code and ZFS_LINUX_TEST_RESULT to check the result. All of the existing kernel-*.m4 files have been updated accordingly, see config/kernel-current-time.m4 for a basic example. The legacy ZFS_LINUX_TRY_COMPILE macro has been kept to handle special cases but it's use is not encouraged. master (secs) patched (secs) ------------- ---------------- autogen.sh 61 68 configure 137 24 (~17% of current run time) make -j $(nproc) 44 44 make rpms 287 150 Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8547 Closes #9132 Closes #9341
* Allow TRIM_UNUSED_KSYM when build as a builtin-moduleTorsten Wörtwein2019-06-041-2/+3
| | | | | | | | If ZFS is built with enable_linux_builtin, it seems to be possible to compile the kernel with TRIM_UNUSED_KSYM. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Torsten Wörtwein <[email protected]> Closes #8820
* Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod)Tomohiro Kusumi2019-05-291-1/+0
| | | | | | | | | | | | | | | | | | Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading zfs.ko. The rest of initialization functions being called here after cwd set to / don't depend on cwd of the process except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8826
* Linux 5.2 compat: Remove config/kernel-set-fs-pwd.m4Tomohiro Kusumi2019-05-251-1/+0
| | | | | | | | | | | | | | | | This failed on 5.2-rc1 with "error: unknown" message, for set_fs_pwd() not being visible in both const and non-const tests. This is caused by torvalds/linux@83da1bed86. It's configurable, but we would want to be able to compile with default kbuild setting. set_fs_pwd() has never been exported with exception of some distro kernels, and set_fs_pwd() wasn't used in ZoL to begin with. The test result was used for a spl function vn_set_fs_pwd(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8777
* Linux 2.6.39 compat: Test if kstrtoul() existsTomohiro Kusumi2019-05-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | kstrtoul() exists only after torvalds/linux@33ee3b2e2eb9 in 2.6.39. Use strict_strtoul() if kstrtoul() doesn't exist. Note that strict_strtoul() has existed as an alias for kstrtoul() for a while, but removed in torvalds/linux@3db2e9cdc085. It looks like RHEL6 (2.6.32 based) has backported kstrtoul(), and this caused build CI to pass compilation test. It should fail on vanilla < 2.6.39 kernels or distro kernels without backport as reported in #8760. -- # grep "kstrtoul(" /lib/modules/2.6.32-754.12.1.el6.x86_64/build/ \ include/linux/kernel.h >/dev/null # echo $? 0 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8760 Closes #8761
* kernel timer API reworkRafael Kitover2019-05-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In `config/kernel-timer.m4` refactor slightly to check more generally for the new `timer_setup()` APIs, but also check the callback signature because some kernels (notably 4.14) have the new `timer_setup()` API but use the old callback signature. Also add a check for a `flags` member in `struct timer_list`, which was added in 4.1-rc8. Add compatibility shims to `include/spl/sys/timer.h` to allow using the new timer APIs with the only two caveats being that the callback argument type must be declared as `spl_timer_list_t` and an explicit assignment is required to get the timer variable for the `timer_of()` macro. So the callback would look like this: ```c __cv_wakeup(spl_timer_list_t t) { struct timer_list *tmr = (struct timer_list *)t; struct thing *parent = from_timer(parent, tmr, parent_timer_field); ... /* do stuff with parent */ ``` Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the new timer APIs instead of conditional code. Reviewed-by: Tomohiro Kusumi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rafael Kitover <[email protected]> Closes #8647
* Linux 5.0 compat: Use totalhigh_pages()Tomohiro Kusumi2019-05-041-0/+1
| | | | | | | | | | | | | Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839 "mm: convert totalram_pages and totalhigh_pages variables to atomic" replaced `totalhigh_pages` with an inline function `totalhigh_pages()`. This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages` on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8677 Closes #8701
* Add TRIM supportBrian Behlendorf2019-03-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Contributions-by: Saso Kiselkov <[email protected]> Contributions-by: Tim Chase <[email protected]> Contributions-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8419 Closes #598
* Linux 5.0 compat: Use totalram_pages()Tony Hutter2019-01-281-0/+1
| | | | | | | | | | | | | totalram_pages() was converted to an atomic variable in 5.0: https://patchwork.kernel.org/patch/10652795/ Its value should now be read though the totalram_pages() helper function. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #8263
* Linux 5.0 compat: access_ok() drops 'type' parameterTony Hutter2019-01-281-0/+1
| | | | | | | | access_ok no longer needs a 'type' parameter in the 5.0 kernel. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #8261
* Linux 4.18 compat: Use ktime_get_coarse_real_ts64()Tony Hutter2019-01-281-0/+1
| | | | | | | | | Newer kernels remove current_kernel_time64(). Use ktime_get_coarse_real_ts64() in its place. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #8258
* Use autoconf variable for C preprocessorBen Wolsieffer2018-12-051-1/+1
| | | | | | | | | | This fixes the build when cross-compiling, where the preprocessor might be prefixed. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Ben Wolsieffer <[email protected]> Closes #8180
* Fix statfs(2) for 32-bit user spaceBrian Behlendorf2018-09-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | When handling a 32-bit statfs() system call the returned fields, although 64-bit in the kernel, must be limited to 32-bits or an EOVERFLOW error will be returned. This is less of an issue for block counts since the default reported block size in 128KiB. But since it is possible to set a smaller block size, these values will be scaled as needed to fit in a 32-bit unsigned long. Unlike most other filesystems the total possible file counts are more likely to overflow because they are calculated based on the available free space in the pool. In order to prevent this the reported value must be capped at 2^32-1. This is only for statfs(2) reporting, there are no changes to the internal ZFS limits. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #7927 Closes #7122 Closes #7937
* Direct IO supportBrian Behlendorf2018-08-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Direct IO via the O_DIRECT flag was originally introduced in XFS by IRIX for database workloads. Its purpose was to allow the database to bypass the page and buffer caches to prevent unnecessary IO operations (e.g. readahead) while preventing contention for system memory between the database and kernel caches. On Illumos, there is a library function called directio(3C) that allows user space to provide a hint to the file system that Direct IO is useful, but the file system is free to ignore it. The semantics are also entirely a file system decision. Those that do not implement it return ENOTTY. Since the semantics were never defined in any standard, O_DIRECT is implemented such that it conforms to the behavior described in the Linux open(2) man page as follows. 1. Minimize cache effects of the I/O. By design the ARC is already scan-resistant which helps mitigate the need for special O_DIRECT handling. Data which is only accessed once will be the first to be evicted from the cache. This behavior is in consistent with Illumos and FreeBSD. Future performance work may wish to investigate the benefits of immediately evicting data from the cache which has been read or written with the O_DIRECT flag. Functionally this behavior is very similar to applying the 'primarycache=metadata' property per open file. 2. O_DIRECT _MAY_ impose restrictions on IO alignment and length. No additional alignment or length restrictions are imposed. 3. O_DIRECT _MAY_ perform unbuffered IO operations directly between user memory and block device. No unbuffered IO operations are currently supported. In order to support features such as transparent compression, encryption, and checksumming a copy must be made to transform the data. 4. O_DIRECT _MAY_ imply O_DSYNC (XFS). O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide O_DSYNC to request synchronous semantics. 5. O_DIRECT _MAY_ disable file locking that serializes IO operations. Applications should avoid mixing O_DIRECT and normal IO or mmap(2) IO to the same file. This is particularly true for overlapping regions. All I/O in ZFS is locked for correctness and this locking is not disabled by O_DIRECT. However, concurrently mixing O_DIRECT, mmap(2), and normal I/O on the same file is not recommended. This change is implemented by layering the aops->direct_IO operations on the existing AIO operations. Code already existed in ZFS on Linux for bypassing the page cache when O_DIRECT is specified. References: * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html * https://blogs.oracle.com/roch/entry/zfs_and_directio * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics * https://illumos.org/man/3c/directio Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #224 Closes #7823
* Add support for autoexpand propertyBrian Behlendorf2018-07-231-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While the autoexpand property may seem like a small feature it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that with a few modifications for Linux it can be supported. Auto-expand works as follows; when a block device is modified (re-sized, closed after being open r/w, etc) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid. From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size. Finally, if the property autoexpand=on a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size which allows the new capacity to be used. In order to make all of the above possible the following changes were required: * Updated the zpool_expand_001_pos and zpool_expand_003_pos tests. These tests now create a pool which is layered on a loopback, scsi_debug, and file vdev. This allows for testing of non- partitioned block device (loopback), a partition block device (scsi_debug), and a file which does not receive udev change events. This provided for better test coverage, and by removing the layering on ZFS volumes there issues surrounding layering one pool on another are avoided. * zpool_find_vdev_by_physpath() updated to accept a vdev guid. This allows for matching by guid rather than path which is a more reliable way for the ZED to reference a vdev. * Fixed zfs_zevent_wait() signal handling which could result in the ZED spinning when a signal was not handled. * Removed vdev_disk_rrpart() functionality which can be abandoned in favor of kernel provided blkdev_reread_part() function. * Added a rwlock which is held as a writer while a disk is being reopened. This is important to prevent errors from occurring for any configuration related IOs which bypass the SCL_ZIO lock. The zpool_reopen_007_pos.ksh test case was added to verify IO error are never observed when reopening. This is not expected to impact IO performance. Additional fixes which aren't critical but were discovered and resolved in the course of developing this functionality. * Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for ZFS volumes. This is as good as a unique physical path, while the volumes are not used in the test cases anymore for other reasons this improvement was included. Reviewed by: Richard Elling <[email protected]> Signed-off-by: Sara Hartse <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #120 Closes #2437 Closes #5771 Closes #7366 Closes #7582 Closes #7629
* Linux 4.18 compat: inode timespec -> timespec64Brian Behlendorf2018-06-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime, and i_ctime members form timespec's to timespec64's to make them 2038 safe. As part of this change the current_time() function was also updated to return the timespec64 type. Resolve this issue by introducing a new inode_timespec_t type which is defined to match the timespec type used by the inode. It should be used when working with inode timestamps to ensure matching types. The timestruc_t type under Illumos was used in a similar fashion but was specified to always be a timespec_t. Rather than incorrectly define this type all timespec_t types have been replaced by the new inode_timespec_t type. Finally, the kernel and user space 'sys/time.h' headers were aligned with each other. They define as appropriate for the context several constants as macros and include static inline implementation of gethrestime(), gethrestime_sec(), and gethrtime(). Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7643
* Linux compat 4.18: check_disk_size_change()Brian Behlendorf2018-06-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added support for the bops->check_events() interface which was added in the 2.6.38 kernel to replace bops->media_changed(). Fully implementing this functionality allows the volume resize code to rely on revalidate_disk(), which is the preferred mechanism, and removes the need to use check_disk_size_change(). In order for bops->check_events() to lookup the zvol_state_t stored in the disk->private_data the zvol_state_lock needs to be held. Since the check events interface may poll the mutex has been converted to a rwlock for better concurrently. The rwlock need only be taken as a writer in the zvol_free() path when disk->private_data is set to NULL. The configure checks for the block_device_operations structure were consolidated in a single kernel-block-device-operations.m4 file. The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks and assoicated dead code was removed. This interface was added to the 2.6.28 kernel which predates the oldest supported 2.6.32 kernel and will therefore always be available. Updated maximum Linux version in META file. The 4.17 kernel was released on 2018-06-03 and ZoL is compatible with the finalized kernel. Reviewed-by: Boris Protopopov <[email protected]> Reviewed-by: Sara Hartse <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7611
* Update build system and packagingBrian Behlendorf2018-05-291-199/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_*. * Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7556
* Allow mounting datasets more than onceSeth Forshee2018-04-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently mounting an already mounted zfs dataset results in an error, whereas it is typically allowed with other filesystems. This causes some bad interactions with mount namespaces. Take this sequence for example: - Create a dataset - Create a snapshot of the dataset - Create a clone of the snapshot - Create a new mount namespace - Rename the original dataset The rename results in unmounting and remounting the clone in the original mount namespace, however the remount fails because the dataset is still mounted in the new mount namespace. (Note that this means the mount in the new mount namespace is never being unmounted, so perhaps the unmount/remount of the clone isn't actually necessary.) The problem here is a result of the way mounting is implemented in the kernel module. Since it is not mounting block devices it uses mount_nodev() instead of the usual mount_bdev(). However, mount_nodev() is written for filesystems for which each mount is a new instance (i.e. a new super block), and zfs should be able to detect when a mount request can be satisfied using an existing super block. Change zpl_mount() to call sget() directly with it's own test callback. Passing the objset_t object as the fs data allows checking if a superblock already exists for the dataset, and in that case we just need to return a new reference for the sb's root dentry. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Signed-off-by: Seth Forshee <[email protected]> Closes #5796 Closes #7207
* Linux compat 4.16: blk_queue_flag_{set,clear}Giuseppe Di Natale2018-04-101-0/+2
| | | | | | | | | | queue_flag_{set,clear}_unlocked are now private interfaces in the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14). Use blk_queue_flag_{set,clear} interfaces which were introduced as of https://github.com/torvalds/linux/commit/8814ce8. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7410
* Add kernel module auto-loadingBrian Behlendorf2018-03-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically a dynamic misc minor number was registered for the /dev/zfs device in order to prevent minor number collisions. This was fine but it prevented us from being able to use the kernel module auto-loaded which requires a known reserved value. Resolve this issue by adding a configure test to find an available misc minor number which can then be used in MODULE_ALIAS_MISCDEV at build time. By adding this alias the zfs kmod is added to the list of known static-nodes and the systemd-tmpfiles-setup-dev service will create a /dev/zfs character device at boot time. This in turn allows us to update the 90-zfs.rules file to make it aware this is a static node. The upshot of this is that whenever a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be automatic loaded. This even works for unprivileged users so there is no longer a need to manually load the modules at boot time. As an additional bonus the zed now no longer needs to start after the zfs-import.service since it will trigger the module load. In the unlikely event the minor number we selected conflicts with another out of tree unregistered minor number the code falls back to dynamically allocating it. In this case the modules again must be manually loaded. Note that due to the change in the method of registering the minor number the zimport.sh test case may incorrectly fail when the static node for the installed packages is created instead of the dynamic one. This issue will only transiently impact zimport.sh for this single commit when we transition and are mixing and matching methods. Reviewed-by: Fabian Grünbichler <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7287
* Take user namespaces into account in policy checksWolfgang Bumiller2018-03-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Change file related checks to use user namespaces and make sure involved uids/gids are mappable in the current namespace. Note that checks without file ownership information will still not take user namespaces into account, as some of these should be handled via 'zfs allow' (otherwise root in a user namespace could issue commands such as `zpool export`). This also adds an initial user namespace regression test for the setgid bit loss, with a user_ns_exec helper usable in further tests. Additionally, configure checks for the required user namespace related features are added for: * ns_capable * kuid/kgid_has_mapping() * user_ns in cred_t Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Wolfgang Bumiller <[email protected]> Closes #6800 Closes #7270
* Linux 4.16 compat: get_disk_and_module()Giuseppe Di Natale2018-03-051-0/+1
| | | | | | | | | | As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk() is now get_disk_and_module(). Add a configure check to determine if we need to use get_disk_and_module(). Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7264
* Fix free memory calculation on v3.14+chrisrd2018-02-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide infrastructure to auto-configure to enum and API changes in the global page stats used for our free memory calculations. arc_free_memory has been broken since an API change in Linux v3.14: 2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node 2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node vmstats These commits moved some of global_page_state() into global_node_page_state(). The API change was particularly egregious as, instead of breaking the old code, it silently did the wrong thing and we continued using global_page_state() where we should have been using global_node_page_state(), thus indexing into the wrong array via NR_SLAB_RECLAIMABLE et al. There have been further API changes along the way: 2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to node counters 2017-09-06 v4.14 c41f012a mm: rename global_page_state to global_zone_page_state ...and various (incomplete, as it turns out) attempts to accomodate these changes in ZoL: 2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats 2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc 2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc The config infrastructure provided here resolves these issues going back to the original API change in v3.14 and is robust against further Linux changes in this area. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #7170
* Linux 4.16 compat: use correct *_dec_and_test()Tony Hutter2018-02-221-0/+1
| | | | | | | | | | Use refcount_dec_and_test() on 4.16+ kernels, atomic_dec_and_test() on older kernels. https://lwn.net/Articles/714974/ Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes: #7179 Closes: #7211
* Fix config issues: frame size and headerschrisrd2018-02-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. With various (debug and/or tracing?) kernel options enabled it's possible for 'struct inode' and 'struct super_block' to exceed the default frame size, leaving errors like this in config.log: build/conftest.c:116:1: error: the frame size of 1048 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Fix this by removing the frame size warning for config checks 2. Without the correct headers included, it's possible for declarations to be missed, leaving errors like this in the config.log: build/conftest.c:131:14: error: ‘struct nameidata’ declared inside parameter list [-Werror] Fix this by adding appropriate headers. Note: Both these issues can result in silent config failures because the compile failure is taken to mean "this option is not supported by this kernel" rather than "there's something wrong with the config test". This can lead to something merely annoying (compile failures) to something potentially serious (miscompiled or misused kernel primitives or functions). E.g. the fixes included here resulted in these additional defines in zfs_config.h with linux v4.14.19: Also, drive-by whitespace fixes in config/* files which don't mention "GNU" (those ones look to be imported from elsewhere so leave them alone). Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #7169
* Linux 4.16 compat: inode_set_iversion()Brian Behlendorf2018-02-081-0/+1
| | | | | | | | | | | | | | | A new interface was added to manipulate the version field of an inode. Add a inode_set_iversion() wrapper for older kernels and use the new interface when available. The i_version field was dropped from the trace point due to the switch to an atomic64_t i_version type. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7148
* Support -fsanitize=address with --enable-asanBrian Behlendorf2018-01-101-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When --enable-asan is provided to configure then build all user space components with fsanitize=address. For kernel support use the Linux KASAN feature instead. https://github.com/google/sanitizers/wiki/AddressSanitizer When using gcc version 4.8 any test case which intentionally generates a core dump will fail when using --enable-asan. The default behavior is to disable core dumps and only newer versions allow this behavior to be controled at run time with the ASAN_OPTIONS environment variable. Additionally, this patch includes some build system cleanup. * Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS, and AM_LDFLAGS. Any additional flags should be added on a per-Makefile basic. The --enable-debug and --enable-asan options apply to all user space binaries and libraries. * Compiler checks consolidated in always-compiler-options.m4 and renamed for consistency. * -fstack-check compiler flag was removed, this functionality is provided by asan when configured with --enable-asan. * Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and DEBUG_LDFLAGS. * Moved default kernel build flags in to module/Makefile.in and split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These flags are set with the standard ccflags-y kbuild mechanism. * -Wframe-larger-than checks applied only to binaries or libraries which include source files which are built in both user space and kernel space. This restriction is relaxed for user space only utilities. * -Wno-unused-but-set-variable applied only to libzfs and libzpool. The remaining warnings are the result of an ASSERT using a variable when is always declared. * -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped because they are Solaris specific and thus not needed. * Ensure $GDB is defined as gdb by default in zloop.sh. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7027
* Support integration with new QAT productswli52017-10-201-2/+2
| | | | | | | | | | | | Support integration with new QAT products: Intel(R) C62x Chipset, or Atom(R) C3000 Processor Product Family SoC: 1. Detect new file name in auto-conf. 2. Change MAX_INSTANCES to 48. 3. Change "num_inst" to U16 to clean a build warning. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Weigang Li <[email protected]> Closes #6767
* Linux 3.14 compat: IO acct, global_page_state, etcGiuseppe Di Natale2017-09-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | generic_start_io_acct/generic_end_io_acct in the master branch of the linux kernel requires that the request_queue be provided. Move the logic from freemem in the spl to arc_free_memory in arc.c. Do this so we can take advantage of global_page_state interface checks in zfs. Upstream kernel replaced struct block_device with struct gendisk in struct bio. Determine if the function bio_set_dev exists during configure and have zfs use that if it exists. bio_set_dev https://github.com/torvalds/linux/commit/74d4699 global_node_page_state https://github.com/torvalds/linux/commit/75ef718 io acct https://github.com/torvalds/linux/commit/d62e26b Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #6635
* Linux 4.8+ compatibility fix for vm statsdbavatar2017-08-241-0/+1
| | | | | | | | | | | | vm_node_stat must be used instead of vm_zone_stat. Unfortunately the old code still compiles potentially leading to silent failure of arc_evictable_memory() AKAMAI: CR 3816601: Regression in zfs dropcache test Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Closes #6528
* Linux 4.13 compat: bio->bi_status and blk_status_tBrian Behlendorf2017-07-231-0/+1
| | | | | | | | | Commit torvalds/linux@4e4cbee9. The bio->bi_error field was replaced with bio->bi_status which is an enum that describes all possible error types. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6351
* config: allow --with-linux without --with-linux-objChunwei Chen2017-05-251-1/+2
| | | | | | | | | | Don't use `uname -r` to determine kernel build directory when the user specified kernel source with --with-linux. Otherwise, the user is forced to use --with-linux-obj even if they are the same directory, which is very counterintuitive. Signed-off-by: Chunwei Chen <[email protected]> Requires-spl: refs/pull/617/head
* Linux 4.12 compat: CURRENT_TIME removedBrian Behlendorf2017-05-101-0/+1
| | | | | | | | Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6114
* Enable Linux read-ahead for a single page on ZVOLsRichard Yao2017-05-041-0/+1
| | | | | | | | | | | | | | | | | | | | Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #5902
* Linux 4.12 compat: super_setup_bdi_name()Brian Behlendorf2017-05-021-1/+1
| | | | | | | | | | All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6089
* GZIP compression offloading with QAT acceleratorwli52017-03-221-0/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implement the hardware accelerator method in GZIP compression in ZFS. When the ZFS pool is enabled GZIP compression, the compression API will be automatically transferred to the hardware accelerator to free up CPU resource and speed up the compression time. * To enable Intel QAT hardware acceleration in ZOL you need to have QAT hardware and the driver installed: * QAT hardware DH8950: http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950 * QAT driver: https://01.org/intel-quickassist-technology * Start QAT driver in your system: service qat_service start * Enable QAT in ZFS, e.g.: ./configure --with-qat=<qat-driver-path>/QAT1.6 make * Set GZIP compression in ZFS dataset: zfs set compression = gzip <dataset> * Get QAT hardware statistics by: cat /proc/spl/kstat/zfs/qat * To disable QAT in ZFS: insmod zfs.ko zfs_qat_disable=1 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jinshan Xiong <[email protected]> Signed-off-by: Weigang Li <[email protected]> Closes #5846
* Linux 4.11 compat: iops.getattr and friendsOlaf Faaland2017-03-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In torvalds/linux@a528d35, there are changes to the getattr family of functions, struct kstat, and the interface of inode_operations .getattr. The inode_operations .getattr and simple_getattr() interface changed to: int (*getattr) (const struct path *, struct dentry *, struct kstat *, u32 request_mask, unsigned int query_flags) The request_mask argument indicates which field(s) the caller intends to use. Fields the caller has not specified via request_mask may be set in the returned struct anyway, but their values may be approximate. The query_flags argument indicates whether the filesystem must update the attributes from the backing store. Currently both fields are ignored. It is possible that getattr-related functions within zfs could be optimized based on the request_mask. struct kstat includes new fields: u32 result_mask; /* What fields the user got */ u64 attributes; /* See STATX_ATTR_* flags */ struct timespec btime; /* File creation time */ Fields attribute and btime are cleared; the result_mask reflects this. These appear to be optional based on simple_getattr() and vfs_getattr() within the kernel, which take the same approach. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #5875
* [icp] fpu and asm cleanup for linuxGvozden Neskovic2017-03-071-0/+1
| | | | | | | | | | | | | | Properly annotate functions and data section so that objtool does not complain when CONFIG_STACK_VALIDATION and CONFIG_FRAME_POINTER are enabled. Pass KERNELCPPFLAGS to assembler. Use kfpu_begin()/kfpu_end() to protect SIMD regions in Linux kernel. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Closes #5872 Closes #5041
* Allow c99 code to compileMatthew Ahrens2017-02-081-0/+1
| | | | | | | | | | | | Add the appropriate compiler flags to accept c99 code. This will help to minimize differences with upstream, and aid porting changes. One change was necessary in zvol.c because the DEFINE_IDA() macro does not work with the new compiler flags. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #5756
* Add -Wno-declaration-after-statement to KERNELCPPFLAGSBrian Behlendorf2017-01-281-0/+1
| | | | | | | | | | | | | | | | Disable the warnings regarding ISO C90 forbidding mixed declarations and code. While this functionality was introduced as part of C99 gcc does allow this in C90 mode as an extension. https://gcc.gnu.org/onlinedocs/gcc/Mixed-Declarations.html#Mixed-Declarations Allowing this usage helps minimize the changes required when porting patches from OpenZFS. The downside here is that this functionality is specific to gcc. Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5686
* Retire .write/.read file operationsChunwei Chen2017-01-271-0/+1
| | | | | | | | | | | | | | | The .write/.read file operations callbacks can be retired since support for .read_iter/.write_iter and .aio_read/.aio_write has been added. The vfs_write()/vfs_read() entry functions will select the correct interface for the kernel. This is desirable because all VFS write/read operations now rely on common code. This change also add the generic write checks to make sure that ulimits are enforced correctly on write. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5587 Closes #5673
* 4.10 compat - BIO flag changes and othersTim Chase2016-12-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [bio] The req_op enum was changed to req_opf. Update the "Linux 4.8 API" autotools checks to use an int to determine whether the various REQ_OP values are defined. This should work properly on kernels >= 4.8. [bio] bio_set_op_attrs() is now an inline function and can't be detected with #ifdef. Add a configure check to determine whether bio_set_op_attrs() is defined. Move the local definition of it from vdev_disk.c to blkdev_compat.h for consistency with other related compability shims. [bio] The read/write flags and their modifiers, including WRITE_FLUSH, WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h. Add the new bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA and set the flags appropriately for each supported kernel version. [vfs] The generic_readlink() function has been made static. If .readlink in inode_operations is NULL, generic_readlink() is used. [zol typo] Completely unrelated to 4.10 compat, fix a typo in the check for REQ_OP_SECURE_ERASE so that the proper macro is defined: s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/ Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #5499
* Use inode_set_flags when availableChunwei Chen2016-12-161-0/+1
| | | | Signed-off-by: Chunwei Chen <[email protected]>
* Kernel 4.9 compat: file_operations->aio_fsync removalDeHackEd2016-11-151-0/+1
| | | | | | | Linux kernel commit 723c038475b78 removed this field. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #5393
* Linux 3.14 compat: assign inode->set_acltuxoko2016-11-091-0/+1
| | | | | | | | | | | | | Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come from setxattr, which will handle by the acl xattr_handler, and we already handles that well. However, nfsd will directly calls inode->set_acl or return error if it doesn't exists. Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Massimo Maggi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5371 Closes #5375
* Use set_cached_acl and forget_cached_acl when possibleChunwei Chen2016-11-071-1/+2
| | | | | | | | | Originally, these two function are inline, so their usability is tied to posix_acl_release. However, since Linux 3.14, they became EXPORT_SYMBOL, so we can always use them. In this patch, we create an independent test for these two functions so we can use them when possible. Signed-off-by: Chunwei Chen <[email protected]>