summaryrefslogtreecommitdiffstats
path: root/config/kernel.m4
Commit message (Collapse)AuthorAgeFilesLines
* Fix statfs(2) for 32-bit user spaceBrian Behlendorf2018-09-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | When handling a 32-bit statfs() system call the returned fields, although 64-bit in the kernel, must be limited to 32-bits or an EOVERFLOW error will be returned. This is less of an issue for block counts since the default reported block size in 128KiB. But since it is possible to set a smaller block size, these values will be scaled as needed to fit in a 32-bit unsigned long. Unlike most other filesystems the total possible file counts are more likely to overflow because they are calculated based on the available free space in the pool. In order to prevent this the reported value must be capped at 2^32-1. This is only for statfs(2) reporting, there are no changes to the internal ZFS limits. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #7927 Closes #7122 Closes #7937
* Direct IO supportBrian Behlendorf2018-08-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Direct IO via the O_DIRECT flag was originally introduced in XFS by IRIX for database workloads. Its purpose was to allow the database to bypass the page and buffer caches to prevent unnecessary IO operations (e.g. readahead) while preventing contention for system memory between the database and kernel caches. On Illumos, there is a library function called directio(3C) that allows user space to provide a hint to the file system that Direct IO is useful, but the file system is free to ignore it. The semantics are also entirely a file system decision. Those that do not implement it return ENOTTY. Since the semantics were never defined in any standard, O_DIRECT is implemented such that it conforms to the behavior described in the Linux open(2) man page as follows. 1. Minimize cache effects of the I/O. By design the ARC is already scan-resistant which helps mitigate the need for special O_DIRECT handling. Data which is only accessed once will be the first to be evicted from the cache. This behavior is in consistent with Illumos and FreeBSD. Future performance work may wish to investigate the benefits of immediately evicting data from the cache which has been read or written with the O_DIRECT flag. Functionally this behavior is very similar to applying the 'primarycache=metadata' property per open file. 2. O_DIRECT _MAY_ impose restrictions on IO alignment and length. No additional alignment or length restrictions are imposed. 3. O_DIRECT _MAY_ perform unbuffered IO operations directly between user memory and block device. No unbuffered IO operations are currently supported. In order to support features such as transparent compression, encryption, and checksumming a copy must be made to transform the data. 4. O_DIRECT _MAY_ imply O_DSYNC (XFS). O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide O_DSYNC to request synchronous semantics. 5. O_DIRECT _MAY_ disable file locking that serializes IO operations. Applications should avoid mixing O_DIRECT and normal IO or mmap(2) IO to the same file. This is particularly true for overlapping regions. All I/O in ZFS is locked for correctness and this locking is not disabled by O_DIRECT. However, concurrently mixing O_DIRECT, mmap(2), and normal I/O on the same file is not recommended. This change is implemented by layering the aops->direct_IO operations on the existing AIO operations. Code already existed in ZFS on Linux for bypassing the page cache when O_DIRECT is specified. References: * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html * https://blogs.oracle.com/roch/entry/zfs_and_directio * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics * https://illumos.org/man/3c/directio Reviewed-by: Richard Elling <[email protected]> Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #224 Closes #7823
* Add support for autoexpand propertyBrian Behlendorf2018-07-231-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While the autoexpand property may seem like a small feature it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that with a few modifications for Linux it can be supported. Auto-expand works as follows; when a block device is modified (re-sized, closed after being open r/w, etc) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid. From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size. Finally, if the property autoexpand=on a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size which allows the new capacity to be used. In order to make all of the above possible the following changes were required: * Updated the zpool_expand_001_pos and zpool_expand_003_pos tests. These tests now create a pool which is layered on a loopback, scsi_debug, and file vdev. This allows for testing of non- partitioned block device (loopback), a partition block device (scsi_debug), and a file which does not receive udev change events. This provided for better test coverage, and by removing the layering on ZFS volumes there issues surrounding layering one pool on another are avoided. * zpool_find_vdev_by_physpath() updated to accept a vdev guid. This allows for matching by guid rather than path which is a more reliable way for the ZED to reference a vdev. * Fixed zfs_zevent_wait() signal handling which could result in the ZED spinning when a signal was not handled. * Removed vdev_disk_rrpart() functionality which can be abandoned in favor of kernel provided blkdev_reread_part() function. * Added a rwlock which is held as a writer while a disk is being reopened. This is important to prevent errors from occurring for any configuration related IOs which bypass the SCL_ZIO lock. The zpool_reopen_007_pos.ksh test case was added to verify IO error are never observed when reopening. This is not expected to impact IO performance. Additional fixes which aren't critical but were discovered and resolved in the course of developing this functionality. * Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for ZFS volumes. This is as good as a unique physical path, while the volumes are not used in the test cases anymore for other reasons this improvement was included. Reviewed by: Richard Elling <[email protected]> Signed-off-by: Sara Hartse <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #120 Closes #2437 Closes #5771 Closes #7366 Closes #7582 Closes #7629
* Linux 4.18 compat: inode timespec -> timespec64Brian Behlendorf2018-06-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime, and i_ctime members form timespec's to timespec64's to make them 2038 safe. As part of this change the current_time() function was also updated to return the timespec64 type. Resolve this issue by introducing a new inode_timespec_t type which is defined to match the timespec type used by the inode. It should be used when working with inode timestamps to ensure matching types. The timestruc_t type under Illumos was used in a similar fashion but was specified to always be a timespec_t. Rather than incorrectly define this type all timespec_t types have been replaced by the new inode_timespec_t type. Finally, the kernel and user space 'sys/time.h' headers were aligned with each other. They define as appropriate for the context several constants as macros and include static inline implementation of gethrestime(), gethrestime_sec(), and gethrtime(). Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7643
* Linux compat 4.18: check_disk_size_change()Brian Behlendorf2018-06-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added support for the bops->check_events() interface which was added in the 2.6.38 kernel to replace bops->media_changed(). Fully implementing this functionality allows the volume resize code to rely on revalidate_disk(), which is the preferred mechanism, and removes the need to use check_disk_size_change(). In order for bops->check_events() to lookup the zvol_state_t stored in the disk->private_data the zvol_state_lock needs to be held. Since the check events interface may poll the mutex has been converted to a rwlock for better concurrently. The rwlock need only be taken as a writer in the zvol_free() path when disk->private_data is set to NULL. The configure checks for the block_device_operations structure were consolidated in a single kernel-block-device-operations.m4 file. The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks and assoicated dead code was removed. This interface was added to the 2.6.28 kernel which predates the oldest supported 2.6.32 kernel and will therefore always be available. Updated maximum Linux version in META file. The 4.17 kernel was released on 2018-06-03 and ZoL is compatible with the finalized kernel. Reviewed-by: Boris Protopopov <[email protected]> Reviewed-by: Sara Hartse <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7611
* Update build system and packagingBrian Behlendorf2018-05-291-199/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_*. * Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7556
* Allow mounting datasets more than onceSeth Forshee2018-04-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently mounting an already mounted zfs dataset results in an error, whereas it is typically allowed with other filesystems. This causes some bad interactions with mount namespaces. Take this sequence for example: - Create a dataset - Create a snapshot of the dataset - Create a clone of the snapshot - Create a new mount namespace - Rename the original dataset The rename results in unmounting and remounting the clone in the original mount namespace, however the remount fails because the dataset is still mounted in the new mount namespace. (Note that this means the mount in the new mount namespace is never being unmounted, so perhaps the unmount/remount of the clone isn't actually necessary.) The problem here is a result of the way mounting is implemented in the kernel module. Since it is not mounting block devices it uses mount_nodev() instead of the usual mount_bdev(). However, mount_nodev() is written for filesystems for which each mount is a new instance (i.e. a new super block), and zfs should be able to detect when a mount request can be satisfied using an existing super block. Change zpl_mount() to call sget() directly with it's own test callback. Passing the objset_t object as the fs data allows checking if a superblock already exists for the dataset, and in that case we just need to return a new reference for the sb's root dentry. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Signed-off-by: Seth Forshee <[email protected]> Closes #5796 Closes #7207
* Linux compat 4.16: blk_queue_flag_{set,clear}Giuseppe Di Natale2018-04-101-0/+2
| | | | | | | | | | queue_flag_{set,clear}_unlocked are now private interfaces in the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14). Use blk_queue_flag_{set,clear} interfaces which were introduced as of https://github.com/torvalds/linux/commit/8814ce8. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7410
* Add kernel module auto-loadingBrian Behlendorf2018-03-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically a dynamic misc minor number was registered for the /dev/zfs device in order to prevent minor number collisions. This was fine but it prevented us from being able to use the kernel module auto-loaded which requires a known reserved value. Resolve this issue by adding a configure test to find an available misc minor number which can then be used in MODULE_ALIAS_MISCDEV at build time. By adding this alias the zfs kmod is added to the list of known static-nodes and the systemd-tmpfiles-setup-dev service will create a /dev/zfs character device at boot time. This in turn allows us to update the 90-zfs.rules file to make it aware this is a static node. The upshot of this is that whenever a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be automatic loaded. This even works for unprivileged users so there is no longer a need to manually load the modules at boot time. As an additional bonus the zed now no longer needs to start after the zfs-import.service since it will trigger the module load. In the unlikely event the minor number we selected conflicts with another out of tree unregistered minor number the code falls back to dynamically allocating it. In this case the modules again must be manually loaded. Note that due to the change in the method of registering the minor number the zimport.sh test case may incorrectly fail when the static node for the installed packages is created instead of the dynamic one. This issue will only transiently impact zimport.sh for this single commit when we transition and are mixing and matching methods. Reviewed-by: Fabian Grünbichler <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7287
* Take user namespaces into account in policy checksWolfgang Bumiller2018-03-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Change file related checks to use user namespaces and make sure involved uids/gids are mappable in the current namespace. Note that checks without file ownership information will still not take user namespaces into account, as some of these should be handled via 'zfs allow' (otherwise root in a user namespace could issue commands such as `zpool export`). This also adds an initial user namespace regression test for the setgid bit loss, with a user_ns_exec helper usable in further tests. Additionally, configure checks for the required user namespace related features are added for: * ns_capable * kuid/kgid_has_mapping() * user_ns in cred_t Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Wolfgang Bumiller <[email protected]> Closes #6800 Closes #7270
* Linux 4.16 compat: get_disk_and_module()Giuseppe Di Natale2018-03-051-0/+1
| | | | | | | | | | As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk() is now get_disk_and_module(). Add a configure check to determine if we need to use get_disk_and_module(). Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7264
* Fix free memory calculation on v3.14+chrisrd2018-02-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide infrastructure to auto-configure to enum and API changes in the global page stats used for our free memory calculations. arc_free_memory has been broken since an API change in Linux v3.14: 2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node 2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node vmstats These commits moved some of global_page_state() into global_node_page_state(). The API change was particularly egregious as, instead of breaking the old code, it silently did the wrong thing and we continued using global_page_state() where we should have been using global_node_page_state(), thus indexing into the wrong array via NR_SLAB_RECLAIMABLE et al. There have been further API changes along the way: 2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to node counters 2017-09-06 v4.14 c41f012a mm: rename global_page_state to global_zone_page_state ...and various (incomplete, as it turns out) attempts to accomodate these changes in ZoL: 2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats 2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc 2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc The config infrastructure provided here resolves these issues going back to the original API change in v3.14 and is robust against further Linux changes in this area. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #7170
* Linux 4.16 compat: use correct *_dec_and_test()Tony Hutter2018-02-221-0/+1
| | | | | | | | | | Use refcount_dec_and_test() on 4.16+ kernels, atomic_dec_and_test() on older kernels. https://lwn.net/Articles/714974/ Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes: #7179 Closes: #7211
* Fix config issues: frame size and headerschrisrd2018-02-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. With various (debug and/or tracing?) kernel options enabled it's possible for 'struct inode' and 'struct super_block' to exceed the default frame size, leaving errors like this in config.log: build/conftest.c:116:1: error: the frame size of 1048 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Fix this by removing the frame size warning for config checks 2. Without the correct headers included, it's possible for declarations to be missed, leaving errors like this in the config.log: build/conftest.c:131:14: error: ‘struct nameidata’ declared inside parameter list [-Werror] Fix this by adding appropriate headers. Note: Both these issues can result in silent config failures because the compile failure is taken to mean "this option is not supported by this kernel" rather than "there's something wrong with the config test". This can lead to something merely annoying (compile failures) to something potentially serious (miscompiled or misused kernel primitives or functions). E.g. the fixes included here resulted in these additional defines in zfs_config.h with linux v4.14.19: Also, drive-by whitespace fixes in config/* files which don't mention "GNU" (those ones look to be imported from elsewhere so leave them alone). Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #7169
* Linux 4.16 compat: inode_set_iversion()Brian Behlendorf2018-02-081-0/+1
| | | | | | | | | | | | | | | A new interface was added to manipulate the version field of an inode. Add a inode_set_iversion() wrapper for older kernels and use the new interface when available. The i_version field was dropped from the trace point due to the switch to an atomic64_t i_version type. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7148
* Support -fsanitize=address with --enable-asanBrian Behlendorf2018-01-101-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When --enable-asan is provided to configure then build all user space components with fsanitize=address. For kernel support use the Linux KASAN feature instead. https://github.com/google/sanitizers/wiki/AddressSanitizer When using gcc version 4.8 any test case which intentionally generates a core dump will fail when using --enable-asan. The default behavior is to disable core dumps and only newer versions allow this behavior to be controled at run time with the ASAN_OPTIONS environment variable. Additionally, this patch includes some build system cleanup. * Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS, and AM_LDFLAGS. Any additional flags should be added on a per-Makefile basic. The --enable-debug and --enable-asan options apply to all user space binaries and libraries. * Compiler checks consolidated in always-compiler-options.m4 and renamed for consistency. * -fstack-check compiler flag was removed, this functionality is provided by asan when configured with --enable-asan. * Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and DEBUG_LDFLAGS. * Moved default kernel build flags in to module/Makefile.in and split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These flags are set with the standard ccflags-y kbuild mechanism. * -Wframe-larger-than checks applied only to binaries or libraries which include source files which are built in both user space and kernel space. This restriction is relaxed for user space only utilities. * -Wno-unused-but-set-variable applied only to libzfs and libzpool. The remaining warnings are the result of an ASSERT using a variable when is always declared. * -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped because they are Solaris specific and thus not needed. * Ensure $GDB is defined as gdb by default in zloop.sh. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7027
* Support integration with new QAT productswli52017-10-201-2/+2
| | | | | | | | | | | | Support integration with new QAT products: Intel(R) C62x Chipset, or Atom(R) C3000 Processor Product Family SoC: 1. Detect new file name in auto-conf. 2. Change MAX_INSTANCES to 48. 3. Change "num_inst" to U16 to clean a build warning. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Weigang Li <[email protected]> Closes #6767
* Linux 3.14 compat: IO acct, global_page_state, etcGiuseppe Di Natale2017-09-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | generic_start_io_acct/generic_end_io_acct in the master branch of the linux kernel requires that the request_queue be provided. Move the logic from freemem in the spl to arc_free_memory in arc.c. Do this so we can take advantage of global_page_state interface checks in zfs. Upstream kernel replaced struct block_device with struct gendisk in struct bio. Determine if the function bio_set_dev exists during configure and have zfs use that if it exists. bio_set_dev https://github.com/torvalds/linux/commit/74d4699 global_node_page_state https://github.com/torvalds/linux/commit/75ef718 io acct https://github.com/torvalds/linux/commit/d62e26b Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #6635
* Linux 4.8+ compatibility fix for vm statsdbavatar2017-08-241-0/+1
| | | | | | | | | | | | vm_node_stat must be used instead of vm_zone_stat. Unfortunately the old code still compiles potentially leading to silent failure of arc_evictable_memory() AKAMAI: CR 3816601: Regression in zfs dropcache test Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Debabrata Banerjee <[email protected]> Closes #6528
* Linux 4.13 compat: bio->bi_status and blk_status_tBrian Behlendorf2017-07-231-0/+1
| | | | | | | | | Commit torvalds/linux@4e4cbee9. The bio->bi_error field was replaced with bio->bi_status which is an enum that describes all possible error types. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6351
* config: allow --with-linux without --with-linux-objChunwei Chen2017-05-251-1/+2
| | | | | | | | | | Don't use `uname -r` to determine kernel build directory when the user specified kernel source with --with-linux. Otherwise, the user is forced to use --with-linux-obj even if they are the same directory, which is very counterintuitive. Signed-off-by: Chunwei Chen <[email protected]> Requires-spl: refs/pull/617/head
* Linux 4.12 compat: CURRENT_TIME removedBrian Behlendorf2017-05-101-0/+1
| | | | | | | | Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6114
* Enable Linux read-ahead for a single page on ZVOLsRichard Yao2017-05-041-0/+1
| | | | | | | | | | | | | | | | | | | | Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Issue #5902
* Linux 4.12 compat: super_setup_bdi_name()Brian Behlendorf2017-05-021-1/+1
| | | | | | | | | | All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI. Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #6089
* GZIP compression offloading with QAT acceleratorwli52017-03-221-0/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implement the hardware accelerator method in GZIP compression in ZFS. When the ZFS pool is enabled GZIP compression, the compression API will be automatically transferred to the hardware accelerator to free up CPU resource and speed up the compression time. * To enable Intel QAT hardware acceleration in ZOL you need to have QAT hardware and the driver installed: * QAT hardware DH8950: http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950 * QAT driver: https://01.org/intel-quickassist-technology * Start QAT driver in your system: service qat_service start * Enable QAT in ZFS, e.g.: ./configure --with-qat=<qat-driver-path>/QAT1.6 make * Set GZIP compression in ZFS dataset: zfs set compression = gzip <dataset> * Get QAT hardware statistics by: cat /proc/spl/kstat/zfs/qat * To disable QAT in ZFS: insmod zfs.ko zfs_qat_disable=1 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jinshan Xiong <[email protected]> Signed-off-by: Weigang Li <[email protected]> Closes #5846
* Linux 4.11 compat: iops.getattr and friendsOlaf Faaland2017-03-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In torvalds/linux@a528d35, there are changes to the getattr family of functions, struct kstat, and the interface of inode_operations .getattr. The inode_operations .getattr and simple_getattr() interface changed to: int (*getattr) (const struct path *, struct dentry *, struct kstat *, u32 request_mask, unsigned int query_flags) The request_mask argument indicates which field(s) the caller intends to use. Fields the caller has not specified via request_mask may be set in the returned struct anyway, but their values may be approximate. The query_flags argument indicates whether the filesystem must update the attributes from the backing store. Currently both fields are ignored. It is possible that getattr-related functions within zfs could be optimized based on the request_mask. struct kstat includes new fields: u32 result_mask; /* What fields the user got */ u64 attributes; /* See STATX_ATTR_* flags */ struct timespec btime; /* File creation time */ Fields attribute and btime are cleared; the result_mask reflects this. These appear to be optional based on simple_getattr() and vfs_getattr() within the kernel, which take the same approach. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #5875
* [icp] fpu and asm cleanup for linuxGvozden Neskovic2017-03-071-0/+1
| | | | | | | | | | | | | | Properly annotate functions and data section so that objtool does not complain when CONFIG_STACK_VALIDATION and CONFIG_FRAME_POINTER are enabled. Pass KERNELCPPFLAGS to assembler. Use kfpu_begin()/kfpu_end() to protect SIMD regions in Linux kernel. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Closes #5872 Closes #5041
* Allow c99 code to compileMatthew Ahrens2017-02-081-0/+1
| | | | | | | | | | | | Add the appropriate compiler flags to accept c99 code. This will help to minimize differences with upstream, and aid porting changes. One change was necessary in zvol.c because the DEFINE_IDA() macro does not work with the new compiler flags. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #5756
* Add -Wno-declaration-after-statement to KERNELCPPFLAGSBrian Behlendorf2017-01-281-0/+1
| | | | | | | | | | | | | | | | Disable the warnings regarding ISO C90 forbidding mixed declarations and code. While this functionality was introduced as part of C99 gcc does allow this in C90 mode as an extension. https://gcc.gnu.org/onlinedocs/gcc/Mixed-Declarations.html#Mixed-Declarations Allowing this usage helps minimize the changes required when porting patches from OpenZFS. The downside here is that this functionality is specific to gcc. Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5686
* Retire .write/.read file operationsChunwei Chen2017-01-271-0/+1
| | | | | | | | | | | | | | | The .write/.read file operations callbacks can be retired since support for .read_iter/.write_iter and .aio_read/.aio_write has been added. The vfs_write()/vfs_read() entry functions will select the correct interface for the kernel. This is desirable because all VFS write/read operations now rely on common code. This change also add the generic write checks to make sure that ulimits are enforced correctly on write. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5587 Closes #5673
* 4.10 compat - BIO flag changes and othersTim Chase2016-12-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [bio] The req_op enum was changed to req_opf. Update the "Linux 4.8 API" autotools checks to use an int to determine whether the various REQ_OP values are defined. This should work properly on kernels >= 4.8. [bio] bio_set_op_attrs() is now an inline function and can't be detected with #ifdef. Add a configure check to determine whether bio_set_op_attrs() is defined. Move the local definition of it from vdev_disk.c to blkdev_compat.h for consistency with other related compability shims. [bio] The read/write flags and their modifiers, including WRITE_FLUSH, WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h. Add the new bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA and set the flags appropriately for each supported kernel version. [vfs] The generic_readlink() function has been made static. If .readlink in inode_operations is NULL, generic_readlink() is used. [zol typo] Completely unrelated to 4.10 compat, fix a typo in the check for REQ_OP_SECURE_ERASE so that the proper macro is defined: s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/ Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #5499
* Use inode_set_flags when availableChunwei Chen2016-12-161-0/+1
| | | | Signed-off-by: Chunwei Chen <[email protected]>
* Kernel 4.9 compat: file_operations->aio_fsync removalDeHackEd2016-11-151-0/+1
| | | | | | | Linux kernel commit 723c038475b78 removed this field. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #5393
* Linux 3.14 compat: assign inode->set_acltuxoko2016-11-091-0/+1
| | | | | | | | | | | | | Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come from setxattr, which will handle by the acl xattr_handler, and we already handles that well. However, nfsd will directly calls inode->set_acl or return error if it doesn't exists. Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Massimo Maggi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #5371 Closes #5375
* Use set_cached_acl and forget_cached_acl when possibleChunwei Chen2016-11-071-1/+2
| | | | | | | | | Originally, these two function are inline, so their usability is tied to posix_acl_release. However, since Linux 3.14, they became EXPORT_SYMBOL, so we can always use them. In this patch, we create an independent test for these two functions so we can use them when possible. Signed-off-by: Chunwei Chen <[email protected]>
* Add support for O_TMPFILEChunwei Chen2016-11-041-0/+1
| | | | | | | | | | | | | | | Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on supported filesystem. It's basically doing open(2) and unlink(2) atomically. The filesystem support is added through i_op->tmpfile. We basically copy the create operation except we get rid of the link and name related stuff and add the new node to unlinked set. We also add support for linkat(2) to link tmpfile. However, since all previous file operation will skip ZIL, we force a txg_wait_synced to make sure we are sync safe. Signed-off-by: Chunwei Chen <[email protected]>
* Linux 4.9 compat: inode_change_ok() renamed setattr_prepare()Brian Behlendorf2016-10-201-0/+1
| | | | | | | | | | | | In torvalds/linux@31051c8 the inode_change_ok() function was renamed setattr_prepare() and updated to take a dentry ratheri than an inode. Update the code to call the setattr_prepare() and add a wrapper function which call inode_change_ok() for older kernels. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Requires-spl: refs/pull/581/head
* Linux 4.9 compat: remove iops->{set,get,remove}xattrChunwei Chen2016-10-201-0/+1
| | | | | | | | In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and generic_{set,get,remove}xattr are removed. xattr operations will directly go through sb->s_xattr. Signed-off-by: Chunwei Chen <[email protected]>
* Linux 4.9 compat: iops->rename() wants flagsChunwei Chen2016-10-201-0/+1
| | | | | | | In Linux 4.9, torvalds/linux@2773bf0, iops->rename() and iops->rename2() are merged together into iops->rename(), it now wants flags. Signed-off-by: Chunwei Chen <[email protected]>
* Explicit block device plugging when submitting multiple BIOsIsaac Huang2016-09-291-0/+1
| | | | | | | | | Without plugging, the default 'noop' scheduler will not merge the BIOs which are part of a large ZIO. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Isaac Huang <[email protected]> Closes #5181
* Linux compat: Grsecurity kernelGvozden Neskovic2016-08-221-0/+1
| | | | | | | | | | | API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Jason Zaman <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4997 Closes #5001
* Use file_dentry and file_inode wrappersChen Haiquan2016-08-111-0/+1
| | | | | | | | | | | | Fix bugs due to kernel change in torvalds/linux@4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay"). This problem crashes system when use zfs as a layer of overlayfs. Signed-off-by: Chen Haiquan <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4914 Closes #4935
* Linux 4.8 compat: Fix removal of bio->bi_rw memberBrian Behlendorf2016-08-111-0/+4
| | | | | | | | | | | | | | | | | | All users of bio->bi_rw have been replaced with compatibility wrappers. This allows the kernel specific logic to be abstracted away, and for each of the supported cases to be documented with the wrapper. The updated interfaces are as follows: * void blk_queue_set_write_cache(struct request_queue *, bool, bool) * boolean_t bio_is_flush(struct bio *) * boolean_t bio_is_fua(struct bio *) * boolean_t bio_is_discard(struct bio *) * boolean_t bio_is_secure_erase(struct bio *) * VDEV_WRITE_FLUSH_FUA Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4951
* Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointerChunwei Chen2016-08-091-0/+1
| | | | | | | | | | | | | | | | | Starting from Linux 4.7, get_acl will set acl cache pointer to temporary sentinel value before calling i_op->get_acl. Therefore we can't compare against ACL_NOT_CACHED and return. Since from Linux 3.14, get_acl already check the cache for us, so we disable this in zpl_get_acl. Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4944 Closes #4946
* Linux 4.8 compat: posix_acl_valid()Brian Behlendorf2016-08-081-0/+1
| | | | | | | | | | | | | | The posix_acl_valid() function has been updated to require a user namespace. Filesystem callers should normally provide the user_ns from the super block associcated with the ACL; the zpl_posix_acl_valid() wrapper has been added for this purpose. See https://github.com/torvalds/linux/commit/0d4d717f for complete details. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4922
* Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHINGBrian Behlendorf2016-08-081-2/+0
| | | | | | | | | | Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING configure checks, all supported kernel provide this functionality. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4922
* Linux 4.8 compat: new s_user_ns member of struct super_blockNikolay Borisov2016-08-081-0/+1
| | | | | | | | | | | | | Kernel 4.8 paved the way to enabling mounting a file system inside a non-init user namespace. To facilitate this a s_user_ns member was added holding the userns in which the filesystem's instance was mounted. This enables doing the uid/gid translation relative to this particular username space and not the default init_user_ns. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4928
* Linux 4.8 compat: submit_bio()Brian Behlendorf2016-07-291-0/+1
| | | | | | | | | | | | | The rw argument has been removed from submit_bio/submit_bio_wait. Callers are now expected to set bio->bi_rw instead of passing it in. See https://github.com/torvalds/linux/commit/4e49ea4a for complete details. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4892 Issue #4899
* Fix sync behavior for disk vdevsTim Chase2016-07-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to b39c22b, which was first generally available in the 0.6.5 release as b39c22b, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af. The original rationale for introducing synchronous operations in b39c22b was to hurry certains requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and ise used for the same purpose. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4858
* Check whether the kernel supports i_uid/gid_read/write helpersNikolay Borisov2016-07-251-0/+1
| | | | | | | | | | | | | Since the concept of a kuid and the need to translate from it to ordinary integer type was added in kernel version 3.5 implement necessary plumbing to be able to detect this condition during compile time. If the kernel doesn't support the kuid then just fall back to directly accessing the respective struct inode's members Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227