summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* Persistent L2ARCGeorge Amanakis2020-04-104-18/+297
| | | | | | | | | | | | | | | | | | | | | This commit makes the L2ARC persistent across reboots. We implement a light-weight persistent L2ARC metadata structure that allows L2ARC contents to be recovered after a reboot. This significantly eases the impact a reboot has on read performance on systems with large caches. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Co-authored-by: Saso Kiselkov <[email protected]> Co-authored-by: Jorgen Lundman <[email protected]> Co-authored-by: George Amanakis <[email protected]> Ported-by: Yuxuan Shui <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #925 Closes #1823 Closes #2672 Closes #3744 Closes #9582
* Don't ignore zfs_arc_max below allmem/32Ryan Moeller2020-04-091-1/+1
| | | | | | | | | | | Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower than the default allmem/32 zfs_arc_max can also be set lower. Add warning messages when tunables are being ignored. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #10157 Closes #10158
* Add separate field for indicating that spa is in middle of splitMatthew Macy2020-04-091-0/+1
| | | | | | | | | | | By default it's not possible to open a device already owned by an active vdev. It's necessary to make an exception to this for vdev split. The FreeBSD platform code will make an exception if spa_is splitting is set to to true. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #10178
* Linux 5.7 compat: blk_alloc_queue()Brian Behlendorf2020-04-091-0/+14
| | | | | | | | | | | | | | | | | | Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified the blk_alloc_queue() interface by updating it to take the request queue as an argument. Add a wrapper function which accepts the new arguments and internally uses the available interfaces. Other minor changes include increasing the Linux-Maximum to 5.6 now that 5.6 has been released. It was not bumped to 5.7 because this release has not yet been finalized and is still subject to change. Added local 'struct zvol_state_os *zso' variable to zvol_alloc. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #10181 Closes #10187
* Finish refactoring for ZFS_MODULE_PARAM_CALLRyan Moeller2020-04-074-7/+10
| | | | | | | | | | | | | | | Linux and FreeBSD have different parameters for tunable proc handler. This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL macro. To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS. With the declarations wired up we discovered an incorrect scope prefix for spa_slop_shift, so this is now fixed. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #10179
* Add 'zfs wait' commandPaul Dagnelie2020-04-014-0/+24
| | | | | | | | | | | | | | | | | | | | | | Add a mechanism to wait for delete queue to drain. When doing redacted send/recv, many workflows involve deleting files that contain sensitive data. Because of the way zfs handles file deletions, snapshots taken quickly after a rm operation can sometimes still contain the file in question, especially if the file is very large. This can result in issues for redacted send/recv users who expect the deleted files to be redacted in the send streams, and not appear in their clones. This change duplicates much of the zpool wait related logic into a zfs wait command, which can be used to wait until the internal deleteq has been drained. Additional wait activities may be added in the future. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #9707
* Let default arc_c_max be platform dependentRyan Moeller2020-03-271-0/+1
| | | | | | | | | | | | Linux changed the default max ARC size to 1/2 of physical memory to deal with shortcomings of the Linux SLUB allocator. Other platforms do not require the same logic. Implement an arc_default_max() function to determine a default max ARC size in platform code. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #10155
* Compile cityhash code into libzfsMatthew Ahrens2020-03-273-1/+1
| | | | | | | | | Make the cityhash code compile into libzfs, in preparation for the new "zstream" command. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #10152
* Deprecate deduplicated send streamsMatthew Ahrens2020-03-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it. Dedup send requires a nontrivial expenditure of memory and CPU to operate, especially if the dataset(s) being sent is (are) not already using a dedup-strength checksum. Dedup send adds developer burden. It expands the test matrix when developing new features, causing bugs in released code, and delaying development efforts by forcing more testing to be done. As a result, we are deprecating the use of `zfs send -D` and receiving of such streams. This change adds a warning to the man page, and also prints the warning whenever dedup send or receive are used. In a future release, we plan to: 1. remove the kernel code for generating deduplicated streams 2. make `zfs send -D` generate regular, non-deduplicated streams 3. remove the kernel code for receiving deduplicated streams 4. make `zfs receive` of deduplicated streams process them in userland to "re-duplicate" them, so that they can still be received. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #7887 Closes #10117
* Separate warning for incomplete and corrupt streamsPaul Dagnelie2020-03-171-0/+1
| | | | | | | | | | This change adds a separate return code to zfs_ioc_recv that is used for incomplete streams, in addition to the existing return code for streams that contain corruption. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #10122
* Add option for forcible unmounting dataset while receiving snapshot.Mariusz Zaborski2020-03-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Currently when the dataset is in use we can't receive snapshots. zfs send test/1@asd | zfs recv -FM test/2 cannot unmount '/test/2': Device busy This commits add option 'M' which attempts to forcibly unmount the dataset. Thanks to this we can enforce receiving snapshots in a single step. Note that this functionality is not supported on Linux because the VFS will prevent active mounted filesystems from being unmounted, even with the force option. This is the intended VFS behavior. Test cases were added to verify the expected behavior based on the platform. Discussed-with: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Allan Jude <[email protected]> External-issue: https://reviews.freebsd.org/D22306 Closes #9904
* dmu_objset_from_ds must be called with dp_config_rwlock heldMatthew Ahrens2020-03-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The normal lock order is that the dp_config_rwlock must be held before the ds_opening_lock. For example, dmu_objset_hold() does this. However, dmu_objset_open_impl() is called with the ds_opening_lock held, and if the dp_config_rwlock is not already held, it will attempt to acquire it. This may lead to deadlock, since the lock order is reversed. Looking at all the callers of dmu_objset_open_impl() (which is principally the callers of dmu_objset_from_ds()), almost all callers already have the dp_config_rwlock. However, there are a few places in the send and receive code paths that do not. For example: dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream, receive_write_byref, redact_traverse_thread. This commit resolves the problem by requiring all callers ot dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the code has been restructured such that we call dmu_objset_from_ds() earlier on in the send and receive processes, when we already have the dp_config_rwlock, and save the objset_t until we need it in the middle of the send or receive (similar to what we already do with the dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in many new places. I also cleaned up code in dmu_redact_snap() and send_traverse_thread(). Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Zuchowski <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #9662 Closes #10115
* Prevent race condition in dnode_dest (#10101)John Poduska2020-03-122-0/+2
| | | | | | | | | | | | | | | | | | | | | | | dnode_special_close() waits for the refcount of dn_holds to go to zero without holding the dn_mtx. dnode_rele_and_unlock() does the final remove to dn_holds with dn_mtx being held: refs = zfs_refcount_remove(&dn->dn_holds, tag); mutex_exit(&dn->dn_mtx); So, there is a race condition after the remove until dn_mtx is dropped. During that time, dnode_destroy() can get called, which ends up in dnode_dest() calling mutex_destroy() and a panic since the lock is still held. This change adds a condvar to wait for the final dnode_rele_and_unlock() to release the dn_mtx before calling dnode_destroy(). Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: John Poduska <[email protected]> Closes #7814 Closes #10101
* Improve zfs send performance by bypassing the ARCMatthew Ahrens2020-03-102-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing a zfs send on a dataset with small recordsize (e.g. 8K), performance is dominated by the per-block overheads. This is especially true with `zfs send --compressed`, which further reduces the amount of data sent, for the same number of blocks. Several threads are involved, but the limiting factor is the `send_prefetch` thread, which is 100% on CPU. The main job of the `send_prefetch` thread is to issue zio's for the data that will be needed by the main thread. It does this by calling `arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating an arc_hdr, which takes around 14% of one CPU. It also induces later costs by other threads: * Since the data was only prefetched, dmu_send()->dmu_dump_write() will need to call arc_read() again to get the data. This will have to look up the arc_hdr in the hash table and copy the data from the scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU. * dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one CPU. * arc_adjust() will need to evict this arc_hdr, taking about 50% of one CPU. All of these costs can be avoided by bypassing the ARC if the data is not already cached. This commit changes `zfs send` to check for the data in the ARC, and if it is not found then we directly call `zio_read()`, reading the data into a linear ABD which is used by dmu_dump_write() directly. The performance improvement is best expressed in terms of how many blocks can be processed by `zfs send` in one second. This change increases the metric by 50%, from ~100,000 to ~150,000. When the amount of data per block is small (e.g. 2KB), there is a corresponding reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes to 58 minutes in this test case). In addition to improving the performance of `zfs send`, this change makes `zfs send` not pollute the ARC cache. In most cases the data will not be reused, so this allows us to keep caching useful data in the MRU (hit-once) part of the ARC. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #10067
* Add trim support to zpool waitBrian Behlendorf2020-03-042-0/+4
| | | | | | | | | | | | Manual trims fall into the category of long-running pool activities which people might want to wait synchronously for. This change adds support to 'zpool wait' for waiting for manual trim operations to complete. It also adds a '-w' flag to 'zpool trim' which can be used to turn 'zpool trim' into a synchronous operation. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #10071
* Improve performance of zio_taskq_memberMatthew Ahrens2020-03-032-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | __zio_execute() calls zio_taskq_member() to determine if we are running in a zio interrupt taskq, in which case we may need to switch to processing this zio in a zio issue taskq. The call to zio_taskq_member() can become a performance bottleneck when we are processing a high rate of zio's. zio_taskq_member() calls taskq_member() on each of the zio interrupt taskqs, of which there are 21. This is slow because each call to taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively slow. This commit improves the performance of zio_taskq_member() by having it cache the value of tsd_get(taskq_tsd), reducing the number of those calls to 1/21th of the current behavior. In a test case running `zfs send -c >/dev/null` of a filesystem with small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of one CPU, and with this change it is reduced to 1.3%. Overall time to perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000 blocks/sec). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #10070
* Make spa_history_zone platform-dependent in kernelRyan Moeller2020-03-021-0/+1
| | | | | | | | | | | This function should only return "linux" on Linux. Move the kernel part of the function out of common code. Fix the tests for FreeBSD. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #10079
* Re-share zfsdev_getminor and zfs_onexit_fd_holdMatthew Macy2020-02-281-0/+1
| | | | | | | | | | By adding a zfs_file_private accessor to the common interfaces and some extensions to FreeBSD platform code it is now possible to share the implementations for the aforementioned functions. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #10073
* Linux 5.6 compat: time_tBrian Behlendorf2020-02-272-3/+3
| | | | | | | | | | | | | | | | | | | As part of the Linux kernel's y2038 changes the time_t type has been fully retired. Callers are now required to use the time64_t type. Rather than move to the new type, I've removed the few remaining places where a time_t is used in the kernel code. They've been replaced with a uint64_t which is already how ZFS internally handled these values. Going forward we should work towards updating the remaining user space time_t consumers to the 64-bit interfaces. Reviewed-by: Matthew Macy <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #10052 Closes #10064
* Linux 5.6 compat: ktime_get_raw_ts64()Brian Behlendorf2020-02-271-0/+5
| | | | | | | | | | | The getrawmonotonic() and getrawmonotonic64() interfaces have been fully retired. Update gethrtime() to use the replacement interface ktime_get_raw_ts64() which was introduced in the 4.18 kernel. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #10052 Closes #10064
* Refactor dnode dirty context from dbuf_dirtyMatthew Macy2020-02-261-1/+2
| | | | | | | | | | | * Add dedicated donde_set_dirtyctx routine. * Add empty dirty record on destroy assertion. * Make much more extensive use of the SET_ERROR macro. Reviewed-by: Will Andrews <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9924
* Remove zfs_getattr and convoff dead codeDirkjan Bussink2020-02-242-2/+1
| | | | | | | | | | | | The `convoff` function is called only in one code path in `zfs_space`. Each caller of `zfs_space` is called with a `flock64_t` that has `l_whence` set to `SEEK_SET`. This means that `convoff` always results in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and `int whence` is `SEEK_SET` as well. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Dirkjan Bussink <[email protected]> Closes #10006
* Enable zpool events tunables and tests on FreeBSDRyan Moeller2020-02-181-0/+1
| | | | | | | | | | | | | | We have have made the necessary changes in our module code to expose zevents through both devd and the zpool events ioctl. Now the tunables can be exposed and zpool events tests can be enabled on both platforms. A few minor tweaks to the tests were needed to accommodate the way wc formats output on FreeBSD. zed remains to be ported. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #10008
* Factor out some dbuf subroutines and add state change tracingMatthew Macy2020-02-181-0/+17
| | | | | | | | | | | | Create dedicated dbuf_read_hole and dbuf_read_bonus. Additionally, add a dtrace probe to allow state change tracing. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Will Andrews <[email protected]> Reviewed by: Brad Lewis <[email protected]> Authored-by: Will Andrews <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9923
* Prefer org.openzfs for features and propertiesRichard Laager2020-02-181-2/+2
| | | | | | | | | | Moving forward, we wish to use org.openzfs (no dash) rather than org.open-zfs or org.zfsonlinux for feature GUIDs and property names. The existing feature GUIDs cannot be changed. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes #10003
* Support setting user properties in a channel programJason King2020-02-142-0/+45
| | | | | | | | | | | | | This adds support for setting user properties in a zfs channel program by adding 'zfs.sync.set_prop' and 'zfs.check.set_prop' to the ZFS LUA API. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Co-authored-by: Sara Hartse <[email protected]> Contributions-by: Jason King <[email protected]> Signed-off-by: Sara Hartse <[email protected]> Signed-off-by: Jason King <[email protected]> Closes #9950
* Remove limit on number of async zio_frees of non-dedup blocksMatthew Ahrens2020-02-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The module parameter zfs_async_block_max_blocks limits the number of blocks that can be freed by the background freeing of filesystems and snapshots (from "zfs destroy"), in one TXG. This is useful when freeing dedup blocks, becuase each zio_free() of a dedup block can require an i/o to read the relevant part of the dedup table (DDT), and will also dirty that block. zfs_async_block_max_blocks is set to 100,000 by default. For the more typical case where dedup is not used, this can have a negative performance impact on the rate of background freeing (from "zfs destroy"). For example, with recordsize=8k, and TXG's syncing once every 5 seconds, we can free only 160MB of data per second, which may be much less than the rate we can write data. This change increases zfs_async_block_max_blocks to be unlimited by default. To address the dedup freeing issue, a new tunable is introduced, zfs_max_async_dedup_frees, which limits the number of zio_free()'s of dedup blocks done by background destroys, per txg. The default is 100,000 free's (same as the old zfs_async_block_max_blocks default). Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #10000
* Remove duplicate dbufs accountingAlexander Motin2020-02-131-0/+7
| | | | | | | | | | | | | | Since AVL already has embedded element counter, use dn_dbufs_count only for dbufs not counted there (bonus buffers) and just add them. This removes two atomics per dbuf life cycle. According to profiler it reduces time spent by dbuf_destroy() inside bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #9949
* zcp: add zfs.sync.bookmarkChristian Schwarz2020-02-111-0/+17
| | | | | | | | | | Add support for bookmark creation and cloning. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #9571
* Implement bookmark copyingChristian Schwarz2020-02-113-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This feature allows copying existing bookmarks using zfs bookmark fs#target fs#newbookmark There are some niche use cases for such functionality, e.g. when using bookmarks as markers for replication progress. Copying redaction bookmarks produces a normal bookmark that cannot be used for redacted send (we are not duplicating the redaction object). ZCP support for bookmarking (both creation and copying) will be implemented in a separate patch based on this work. Overview: - Terminology: - source = existing snapshot or bookmark - new/bmark = new bookmark - Implement bookmark copying in `dsl_bookmark.c` - create new bookmark node - copy source's `zbn_phys` to new's `zbn_phys` - zero-out redaction object id in copy - Extend existing bookmark ioctl nvlist schema to accept bookmarks as sources - => `dsl_bookmark_create_nvl_validate` is authoritative - use `dsl_dataset_is_before` check for both snapshot and bookmark sources - Adjust CLI - refactor shortname expansion logic in `zfs_do_bookmark` - Update man pages - warn about redaction bookmark handling - Add test cases - CLI - pyyzfs libzfs_core bindings Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Christian Schwarz <[email protected]> Closes #9571
* Fix zdb -R with 'b' flagPaul Zuchowski2020-02-101-0/+9
| | | | | | | | | | | | | zdb -R :b fails due to the indirect block being compressed, and the 'b' and 'd' flag not working in tandem when specified. Fix the flag parsing code and create a zfs test for zdb -R block display. Also fix the zio flags where the dotted notation for the vdev portion of DVA (i.e. 0.0:offset:length) fails. Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #9640 Closes #9729
* Share some code for spa deadman tunablesRyan Moeller2020-02-101-0/+2
| | | | | | | | | | | | | We need to do the same thing to update all spas on any OS for these tunables, so let's share the code. While here let's match the types of the literals initializing the variables with the type of the variable. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9964
* ICP: Improve AES-GCM performanceAttila Fülöp2020-02-102-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently SIMD accelerated AES-GCM performance is limited by two factors: a. The need to disable preemption and interrupts and save the FPU state before using it and to do the reverse when done. Due to the way the code is organized (see (b) below) we have to pay this price twice for each 16 byte GCM block processed. b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations. To solve (a) we do crypto processing in larger chunks while owning the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to 32x reduction of the time spend in the en/decryption routines, leading up to approximately 12x throughput increase for large (128 KiB) blocks. Lastly, this commit changes the default encryption algorithm from AES-CCM to AES-GCM when setting the `encryption=on` property. Reviewed-By: Brian Behlendorf <[email protected]> Reviewed-By: Jason King <[email protected]> Reviewed-By: Tom Caputi <[email protected]> Reviewed-By: Richard Laager <[email protected]> Signed-off-by: Attila Fülöp <[email protected]> Closes #9749
* Linux 5.6 compat: timestamp_truncate()Brian Behlendorf2020-02-071-6/+7
| | | | | | | | | | | | The timestamp_truncate() function was added, it replaces the existing timespec64_trunc() function. This change renames our wrapper function to be consistent with the upstream name and updates the compatibility code for older kernels accordingly. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9956 Closes #9961
* Linux 5.6 compat: struct proc_opsBrian Behlendorf2020-02-071-1/+7
| | | | | | | | | | | | | | The proc_ops structure was introduced to replace the use of of the file_operations structure when registering proc handlers. This change creates a new kstat_proc_op_t typedef for compatibility which can be used to pass around the correct structure. This change additionally adds the 'const' keyword to all of the existing proc operations structures. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9961
* Reduce number of atomic_add() calls in aggsumAlexander Motin2020-02-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to re-borrow after the flush. But since asc_borrowed and asc_delta are accessed only while holding asc_lock, it makes no any sense to modify as_lower_bound and as_upper_bound in multiple steps. Instead of that the new code uses only 2 atomics in all the cases, one per as_*_bound variable. I think even that is overkill, simple atomic store and load could be used here, since all modifications are done under the as_lock, but there are no such primitives in ZFS code now. While there, make borrow code consider previous borrow value, so that on mixed request patterns reduce chance of needing to borrow again if much larger request follows tiny one that needed borrow. Also reduce as_numbuckets from uint64_t to u_int. It makes no sense to use so large division operation on every aggsum_add(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #9930
* Few microoptimizations to dbuf layerAlexander Motin2020-02-051-6/+7
| | | | | | | | | | | | | | | | | | | Move db_link into the same cache line as db_blkid and db_level. It allows significantly reduce avl_add() time in dbuf_create() on systems with large RAM and huge number of dbufs per dnode. Avoid few accesses to dbuf_caches[].size, which is highly congested under high IOPS and never stays in cache for a long time. Use local value we are receiving from zfs_refcount_add_many() any way. Remove cache_size_bytes_max bump from dbuf_evict_one(). I don't see a point to do it on dbuf eviction after we done it on insertion in dbuf_rele_and_unlock(). Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored-By: iXsystems, Inc. Closes #9931
* Convert dbuf dirty record record list to a list_tMatthew Macy2020-02-051-4/+27
| | | | | | | | | Additionally pull in state machine comments about upcoming async cow work. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9902
* Restore aclmode and remove acltype on FreeBSDRyan Moeller2020-02-041-1/+1
| | | | | | | | | | | | | | This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE, matching what is done in the NFSv4 ACLs PR (#9709). On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide ZFS_PROP_ACLMODE. The tests already assume this arrangement. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #9913
* async zvol minor node creation interferes with receiveMatthew Ahrens2020-02-032-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we finish a zfs receive, dmu_recv_end_sync() calls zvol_create_minors(async=TRUE). This kicks off some other threads that create the minor device nodes (in /dev/zvol/poolname/...). These async threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which both call dmu_objset_own(), which puts a "long hold" on the dataset. Since the zvol minor node creation is asynchronous, this can happen after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have completed. After the first receive ioctl has completed, userland may attempt to do another receive into the same dataset (e.g. the next incremental stream). This second receive and the asynchronous minor node creation can interfere with one another in several different ways, because they both require exclusive access to the dataset: 1. When the second receive is finishing up, dmu_recv_end_check() does dsl_dataset_handoff_check(), which can fail with EBUSY if the async minor node creation already has a "long hold" on this dataset. This causes the 2nd receive to fail. 2. The async udev rule can fail if zvol_id and/or systemd-udevd try to open the device while the the second receive's async attempt at minor node creation owns the dataset (via zvol_prefetch_minors_impl). This causes the minor node (/dev/zd*) to exist, but the udev-generated /dev/zvol/... to not exist. 3. The async minor node creation can silently fail with EBUSY if the first receive's zvol_create_minor() trys to own the dataset while the second receive's zvol_prefetch_minors_impl already owns the dataset. To address these problems, this change synchronously creates the minor node. To avoid the lock ordering problems that the asynchrony was introduced to fix (see #3681), we create the minor nodes from open context, with no locks held, rather than from syncing contex as was originally done. Implementation notes: We generally do not need to traverse children or prefetch anything (e.g. when running the recv, snapshot, create, or clone subcommands of zfs). We only need recursion when importing/opening a pool and when loading encryption keys. The existing recursive, asynchronous, prefetching code is preserved for use in these cases. Channel programs may need to create zvol minor nodes, when creating a snapshot of a zvol with the snapdev property set. We figure out what snapshots are created when running the LUA program in syncing context. In this case we need to remember what snapshots were created, and then try to create their minor nodes from open context, after the LUA code has completed. There are additional zvol use cases that asynchronously own the dataset, which can cause similar problems. E.g. changing the volmode or snapdev properties. These are less problematic because they are not recursive and don't touch datasets that are not involved in the operation, there is still potential for interference with subsequent operations. In the future, these cases should be similarly converted to create the zvol minor node synchronously from open context. The async tasks of removing and renaming minors do not own the objset, so they do not have this problem. However, it may make sense to also convert these operations to happen synchronously from open context, in the future. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-65948 Closes #7863 Closes #9885
* Add AltiVec RAID-ZRomain Dolbeau2020-01-234-1/+117
| | | | | | | | | | | | | Implements the RAID-Z function using AltiVec SIMD. This is basically the NEON code translated to AltiVec. Note that the 'fletcher' algorithm requires 64-bits operations, and the initial implementations of AltiVec (PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to 32-bits operations, so no 'fletcher'. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Romain Dolbeau <[email protected]> Closes #9539
* Support inheriting properties in channel programsJason King2020-01-221-0/+9
| | | | | | | | | This adds support in channel programs to inherit properties analogous to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit` functions to the ZFS LUA API. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jason King <[email protected]> Closes #9738
* Fix errata #4 handling for resuming streamsTom Caputi2020-01-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the handling for errata #4 has two issues which allow the checks for this issue to be bypassed using resumable sends. The first issue is that drc->drc_fromsnapobj is not set in the resuming code as it is in the non-resuming code. This causes dsl_crypto_recv_key_check() to skip its checks for the from_ivset_guid. The second issue is that resumable sends do not clean up their on-disk state if they fail the checks in dmu_recv_stream() that happen before any data is received. As a result of these two bugs, a user can attempt a resumable send of a dataset without a from_ivset_guid. This will fail the initial dmu_recv_stream() checks, leaving a valid resume state. The send can then be resumed, which skips those checks, allowing the receive to be completed. This commit fixes these issues by setting drc->drc_fromsnapobj in the resuming receive path and by ensuring that resumablereceives are properly cleaned up if they fail the initial dmu_recv_stream() checks. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9818 Closes #9829
* libzfs: add zfs_mount_at() functionKyle Evans2020-01-141-0/+1
| | | | | | | | | | | | | | zfs_mount_at() mounts a dataset at an arbitrary mountpoint rather than at the configured mountpoint. This may be used by consumers that wish to temporarily expose a dataset at another mountpoint without altering dataset/pool properties. This will be used by FreeBSD's libbe be_mount(), which mounts a boot environment at an arbitrary mountpoint. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Kyle Evans <[email protected]> Closes #9833
* Change http://zfsonlinux.org links to https://zfsonlinux.orgBrian Behlendorf2020-01-131-1/+1
| | | | | | | | | | | Update the project website links contained in to repository to reference the secure https://zfsonlinux.org address. Reviewed-By: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Garrett Fields <[email protected]> Reviewed-by: Kjeld Schouten <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #9837
* Add 'zfs send --saved' flagTom Caputi2020-01-103-4/+11
| | | | | | | | | | | | | | | | | | This commit adds the --saved (-S) to the 'zfs send' command. This flag allows a user to send a partially received dataset, which can be useful when migrating a backup server to new hardware. This flag is compatible with resumable receives, so even if the saved send is interrupted, it can be resumed. The flag does not require any user / kernel ABI changes or any new feature flags in the send stream format. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Paul Zuchowski <[email protected]> Reviewed-by: Christian Schwarz <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #9007
* Remove parameter names from libzfs.h signaturesNick Black2020-01-081-7/+7
| | | | | | | | | | | | | | | Most of libzfs.h doesn't provide names for the parameters in its signatures. These few functions included them. That wouldn't be a problem, per se, but the 'lines' parameter conflicts with the 'lines' #define from terminfo's term.h, present for at least a decade. This makes it difficult to compile code making use of both ZFS and terminfo. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Kjeld Schouten <[email protected]> Signed-off-by: Nick Black <[email protected]> Closes #9821
* Fix "zpool add -n" for dedup, special and log devicesloli10K2020-01-061-1/+1
| | | | | | | | | | | | | | | | | | | | For dedup, special and log devices "zpool add -n" does not print correctly their vdev type: ~# zpool add -n pool dedup /tmp/dedup special /tmp/special log /tmp/log would update 'pool' to the following configuration: pool /tmp/normal /tmp/dedup /tmp/special /tmp/log This could lead storage administrators to modify their ZFS pools to unexpected and unintended vdev configurations. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #9783 Closes #9390
* Colorize zpool status outputTony Hutter2019-12-191-0/+12
| | | | | | | | | | | | | | | | | | | If the ZFS_COLOR env variable is set, then use ANSI color output in zpool status: - Column headers are bold - Degraded or offline pools/vdevs are yellow - Non-zero error counters and faulted vdevs/pools are red - The 'status:' and 'action:' sections are yellow if they're displaying a warning. This also includes a new 'faketty' function in libtest.shlib that is compatible with FreeBSD (code provided by @freqlabs). Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #9340
* Make zfs_replay.c work on FreeBSDMatthew Macy2019-12-131-0/+2
| | | | | | | | | | | | | FreeBSD's vfs currently doesn't permit file systems to do their own locking. To avoid having to have duplicate zfs functions with and without locking add locking here. With luck these changes can be removed in the future. Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matt Macy <[email protected]> Closes #9715