summaryrefslogtreecommitdiffstats
path: root/include/sys
Commit message (Collapse)AuthorAgeFilesLines
* Add support for boot environment data to be stored in the labelPaul Dagnelie2020-05-073-7/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modern bootloaders leverage data stored in the root filesystem to enable some of their powerful features. GRUB specifically has a grubenv file which can store large amounts of configuration data that can be read and written at boot time and during normal operation. This allows sysadmins to configure useful features like automated failover after failed boot attempts. Unfortunately, due to the Copy-on-Write nature of ZFS, the standard behavior of these tools cannot handle writing to ZFS files safely at boot time. We need an alternative way to store data that allows the bootloader to make changes to the data. This work is very similar to work that was done on Illumos to enable similar functionality in the FreeBSD bootloader. This patch is different in that the data being stored is a raw grubenv file; this file can store arbitrary variables and values, and the scripting provided by grub is powerful enough that special structures are not required to implement advanced behavior. We repurpose the second padding area in each label to store the grubenv file, protected by an embedded checksum. We add two ioctls to get and set this data, and libzfs_core and libzfs functions to access them more easily. There are no direct command line interfaces to these functions; these will be added directly to the bootloader utilities. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10009
* Remove deduplicated send/receive codeMatthew Ahrens2020-04-234-21/+11
| | | | | | | | | | | | | | | | | | | | | | | | | Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such streams) are deprecated. Deduplicated send streams can be received by first converting them to non-deduplicated with the `zstream redup` command. This commit removes the code for sending and receiving deduplicated send streams. `zfs send -D` will now print a warning, ignore the `-D` flag, and generate a regular (non-deduplicated) send stream. `zfs receive` of a deduplicated send stream will print an error message and fail. The resulting code simplification (especially in the kernel's support for receiving dedup streams) should help enable future performance enhancements. Several new tests are added which leverage `zstream redup`. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #7887 Issue #10117 Issue #10156 Closes #10212
* Use a struct to organize metaslab-group-allocator fieldsMatthew Ahrens2020-04-221-4/+11
| | | | | | | | | | | | | | | | | | | | Each metaslab group (of which there is one per top-level vdev) has several (4, by default) "metaslab group allocators". Each "allocator" has its own metaslab that it prefers to allocate from (the "primary" allocator), and each can perform allocations concurrently with the other allocators. In addition to the primary metaslab, there are several other fields that need to be tracked separately for each allocator. These are currently stored as several arrays in the metaslab_group_t, each array indexed by allocator number. This change organizes all the metaslab-group-allocator-specific fields into a new struct, metaslab_group_allocator_t. The metaslab_group_t now needs only one array indexed by the allocator number - which contains the metaslab_group_allocator_t's. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10213
* Persistent L2ARCGeorge Amanakis2020-04-104-18/+297
| | | | | | | | | | | | | | | | | | | | | This commit makes the L2ARC persistent across reboots. We implement a light-weight persistent L2ARC metadata structure that allows L2ARC contents to be recovered after a reboot. This significantly eases the impact a reboot has on read performance on systems with large caches. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Saso Kiselkov <skiselkov@gmail.com> Co-authored-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: George Amanakis <gamanakis@gmail.com> Ported-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #925 Closes #1823 Closes #2672 Closes #3744 Closes #9582
* Don't ignore zfs_arc_max below allmem/32Ryan Moeller2020-04-091-1/+1
| | | | | | | | | | | Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower than the default allmem/32 zfs_arc_max can also be set lower. Add warning messages when tunables are being ignored. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10157 Closes #10158
* Add separate field for indicating that spa is in middle of splitMatthew Macy2020-04-091-0/+1
| | | | | | | | | | | By default it's not possible to open a device already owned by an active vdev. It's necessary to make an exception to this for vdev split. The FreeBSD platform code will make an exception if spa_is splitting is set to to true. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10178
* Finish refactoring for ZFS_MODULE_PARAM_CALLRyan Moeller2020-04-073-6/+7
| | | | | | | | | | | | | | | Linux and FreeBSD have different parameters for tunable proc handler. This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL macro. To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS. With the declarations wired up we discovered an incorrect scope prefix for spa_slop_shift, so this is now fixed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10179
* Add 'zfs wait' commandPaul Dagnelie2020-04-012-0/+20
| | | | | | | | | | | | | | | | | | | | | | Add a mechanism to wait for delete queue to drain. When doing redacted send/recv, many workflows involve deleting files that contain sensitive data. Because of the way zfs handles file deletions, snapshots taken quickly after a rm operation can sometimes still contain the file in question, especially if the file is very large. This can result in issues for redacted send/recv users who expect the deleted files to be redacted in the send streams, and not appear in their clones. This change duplicates much of the zpool wait related logic into a zfs wait command, which can be used to wait until the internal deleteq has been drained. Additional wait activities may be added in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9707
* Let default arc_c_max be platform dependentRyan Moeller2020-03-271-0/+1
| | | | | | | | | | | | Linux changed the default max ARC size to 1/2 of physical memory to deal with shortcomings of the Linux SLUB allocator. Other platforms do not require the same logic. Implement an arc_default_max() function to determine a default max ARC size in platform code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10155
* Compile cityhash code into libzfsMatthew Ahrens2020-03-272-42/+0
| | | | | | | | | Make the cityhash code compile into libzfs, in preparation for the new "zstream" command. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10152
* Separate warning for incomplete and corrupt streamsPaul Dagnelie2020-03-171-0/+1
| | | | | | | | | | This change adds a separate return code to zfs_ioc_recv that is used for incomplete streams, in addition to the existing return code for streams that contain corruption. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10122
* dmu_objset_from_ds must be called with dp_config_rwlock heldMatthew Ahrens2020-03-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The normal lock order is that the dp_config_rwlock must be held before the ds_opening_lock. For example, dmu_objset_hold() does this. However, dmu_objset_open_impl() is called with the ds_opening_lock held, and if the dp_config_rwlock is not already held, it will attempt to acquire it. This may lead to deadlock, since the lock order is reversed. Looking at all the callers of dmu_objset_open_impl() (which is principally the callers of dmu_objset_from_ds()), almost all callers already have the dp_config_rwlock. However, there are a few places in the send and receive code paths that do not. For example: dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream, receive_write_byref, redact_traverse_thread. This commit resolves the problem by requiring all callers ot dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the code has been restructured such that we call dmu_objset_from_ds() earlier on in the send and receive processes, when we already have the dp_config_rwlock, and save the objset_t until we need it in the middle of the send or receive (similar to what we already do with the dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in many new places. I also cleaned up code in dmu_redact_snap() and send_traverse_thread(). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #9662 Closes #10115
* Prevent race condition in dnode_dest (#10101)John Poduska2020-03-122-0/+2
| | | | | | | | | | | | | | | | | | | | | | | dnode_special_close() waits for the refcount of dn_holds to go to zero without holding the dn_mtx. dnode_rele_and_unlock() does the final remove to dn_holds with dn_mtx being held: refs = zfs_refcount_remove(&dn->dn_holds, tag); mutex_exit(&dn->dn_mtx); So, there is a race condition after the remove until dn_mtx is dropped. During that time, dnode_destroy() can get called, which ends up in dnode_dest() calling mutex_destroy() and a panic since the lock is still held. This change adds a condvar to wait for the final dnode_rele_and_unlock() to release the dn_mtx before calling dnode_destroy(). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #7814 Closes #10101
* Improve zfs send performance by bypassing the ARCMatthew Ahrens2020-03-102-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing a zfs send on a dataset with small recordsize (e.g. 8K), performance is dominated by the per-block overheads. This is especially true with `zfs send --compressed`, which further reduces the amount of data sent, for the same number of blocks. Several threads are involved, but the limiting factor is the `send_prefetch` thread, which is 100% on CPU. The main job of the `send_prefetch` thread is to issue zio's for the data that will be needed by the main thread. It does this by calling `arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating an arc_hdr, which takes around 14% of one CPU. It also induces later costs by other threads: * Since the data was only prefetched, dmu_send()->dmu_dump_write() will need to call arc_read() again to get the data. This will have to look up the arc_hdr in the hash table and copy the data from the scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU. * dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one CPU. * arc_adjust() will need to evict this arc_hdr, taking about 50% of one CPU. All of these costs can be avoided by bypassing the ARC if the data is not already cached. This commit changes `zfs send` to check for the data in the ARC, and if it is not found then we directly call `zio_read()`, reading the data into a linear ABD which is used by dmu_dump_write() directly. The performance improvement is best expressed in terms of how many blocks can be processed by `zfs send` in one second. This change increases the metric by 50%, from ~100,000 to ~150,000. When the amount of data per block is small (e.g. 2KB), there is a corresponding reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes to 58 minutes in this test case). In addition to improving the performance of `zfs send`, this change makes `zfs send` not pollute the ARC cache. In most cases the data will not be reused, so this allows us to keep caching useful data in the MRU (hit-once) part of the ARC. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10067
* Add trim support to zpool waitBrian Behlendorf2020-03-041-0/+1
| | | | | | | | | | | | Manual trims fall into the category of long-running pool activities which people might want to wait synchronously for. This change adds support to 'zpool wait' for waiting for manual trim operations to complete. It also adds a '-w' flag to 'zpool trim' which can be used to turn 'zpool trim' into a synchronous operation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: John Gallagher <john.gallagher@delphix.com> Closes #10071
* Improve performance of zio_taskq_memberMatthew Ahrens2020-03-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | __zio_execute() calls zio_taskq_member() to determine if we are running in a zio interrupt taskq, in which case we may need to switch to processing this zio in a zio issue taskq. The call to zio_taskq_member() can become a performance bottleneck when we are processing a high rate of zio's. zio_taskq_member() calls taskq_member() on each of the zio interrupt taskqs, of which there are 21. This is slow because each call to taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively slow. This commit improves the performance of zio_taskq_member() by having it cache the value of tsd_get(taskq_tsd), reducing the number of those calls to 1/21th of the current behavior. In a test case running `zfs send -c >/dev/null` of a filesystem with small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of one CPU, and with this change it is reduced to 1.3%. Overall time to perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000 blocks/sec). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10070
* Make spa_history_zone platform-dependent in kernelRyan Moeller2020-03-021-0/+1
| | | | | | | | | | | This function should only return "linux" on Linux. Move the kernel part of the function out of common code. Fix the tests for FreeBSD. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10079
* Re-share zfsdev_getminor and zfs_onexit_fd_holdMatthew Macy2020-02-281-0/+1
| | | | | | | | | | By adding a zfs_file_private accessor to the common interfaces and some extensions to FreeBSD platform code it is now possible to share the implementations for the aforementioned functions. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10073
* Linux 5.6 compat: time_tBrian Behlendorf2020-02-271-2/+2
| | | | | | | | | | | | | | | | | | | As part of the Linux kernel's y2038 changes the time_t type has been fully retired. Callers are now required to use the time64_t type. Rather than move to the new type, I've removed the few remaining places where a time_t is used in the kernel code. They've been replaced with a uint64_t which is already how ZFS internally handled these values. Going forward we should work towards updating the remaining user space time_t consumers to the 64-bit interfaces. Reviewed-by: Matthew Macy <mmacy@freebsd.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10052 Closes #10064
* Refactor dnode dirty context from dbuf_dirtyMatthew Macy2020-02-261-1/+2
| | | | | | | | | | | * Add dedicated donde_set_dirtyctx routine. * Add empty dirty record on destroy assertion. * Make much more extensive use of the SET_ERROR macro. Reviewed-by: Will Andrews <wca@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9924
* Remove zfs_getattr and convoff dead codeDirkjan Bussink2020-02-241-1/+1
| | | | | | | | | | | | The `convoff` function is called only in one code path in `zfs_space`. Each caller of `zfs_space` is called with a `flock64_t` that has `l_whence` set to `SEEK_SET`. This means that `convoff` always results in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and `int whence` is `SEEK_SET` as well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Closes #10006
* Prefer org.openzfs for features and propertiesRichard Laager2020-02-181-2/+2
| | | | | | | | | | Moving forward, we wish to use org.openzfs (no dash) rather than org.open-zfs or org.zfsonlinux for feature GUIDs and property names. The existing feature GUIDs cannot be changed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #10003
* Support setting user properties in a channel programJason King2020-02-142-0/+45
| | | | | | | | | | | | | This adds support for setting user properties in a zfs channel program by adding 'zfs.sync.set_prop' and 'zfs.check.set_prop' to the ZFS LUA API. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Co-authored-by: Sara Hartse <sara.hartse@delphix.com> Contributions-by: Jason King <jason.king@joyent.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Jason King <jason.king@joyent.com> Closes #9950
* Remove limit on number of async zio_frees of non-dedup blocksMatthew Ahrens2020-02-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The module parameter zfs_async_block_max_blocks limits the number of blocks that can be freed by the background freeing of filesystems and snapshots (from "zfs destroy"), in one TXG. This is useful when freeing dedup blocks, becuase each zio_free() of a dedup block can require an i/o to read the relevant part of the dedup table (DDT), and will also dirty that block. zfs_async_block_max_blocks is set to 100,000 by default. For the more typical case where dedup is not used, this can have a negative performance impact on the rate of background freeing (from "zfs destroy"). For example, with recordsize=8k, and TXG's syncing once every 5 seconds, we can free only 160MB of data per second, which may be much less than the rate we can write data. This change increases zfs_async_block_max_blocks to be unlimited by default. To address the dedup freeing issue, a new tunable is introduced, zfs_max_async_dedup_frees, which limits the number of zio_free()'s of dedup blocks done by background destroys, per txg. The default is 100,000 free's (same as the old zfs_async_block_max_blocks default). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10000
* Remove duplicate dbufs accountingAlexander Motin2020-02-131-0/+7
| | | | | | | | | | | | | | Since AVL already has embedded element counter, use dn_dbufs_count only for dbufs not counted there (bonus buffers) and just add them. This removes two atomics per dbuf life cycle. According to profiler it reduces time spent by dbuf_destroy() inside bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #9949
* zcp: add zfs.sync.bookmarkChristian Schwarz2020-02-111-0/+17
| | | | | | | | | | Add support for bookmark creation and cloning. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #9571
* Implement bookmark copyingChristian Schwarz2020-02-112-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This feature allows copying existing bookmarks using zfs bookmark fs#target fs#newbookmark There are some niche use cases for such functionality, e.g. when using bookmarks as markers for replication progress. Copying redaction bookmarks produces a normal bookmark that cannot be used for redacted send (we are not duplicating the redaction object). ZCP support for bookmarking (both creation and copying) will be implemented in a separate patch based on this work. Overview: - Terminology: - source = existing snapshot or bookmark - new/bmark = new bookmark - Implement bookmark copying in `dsl_bookmark.c` - create new bookmark node - copy source's `zbn_phys` to new's `zbn_phys` - zero-out redaction object id in copy - Extend existing bookmark ioctl nvlist schema to accept bookmarks as sources - => `dsl_bookmark_create_nvl_validate` is authoritative - use `dsl_dataset_is_before` check for both snapshot and bookmark sources - Adjust CLI - refactor shortname expansion logic in `zfs_do_bookmark` - Update man pages - warn about redaction bookmark handling - Add test cases - CLI - pyyzfs libzfs_core bindings Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #9571
* Fix zdb -R with 'b' flagPaul Zuchowski2020-02-101-0/+9
| | | | | | | | | | | | | zdb -R :b fails due to the indirect block being compressed, and the 'b' and 'd' flag not working in tandem when specified. Fix the flag parsing code and create a zfs test for zdb -R block display. Also fix the zio flags where the dotted notation for the vdev portion of DVA (i.e. 0.0:offset:length) fails. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9640 Closes #9729
* Share some code for spa deadman tunablesRyan Moeller2020-02-101-0/+2
| | | | | | | | | | | | | We need to do the same thing to update all spas on any OS for these tunables, so let's share the code. While here let's match the types of the literals initializing the variables with the type of the variable. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #9964
* ICP: Improve AES-GCM performanceAttila Fülöp2020-02-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently SIMD accelerated AES-GCM performance is limited by two factors: a. The need to disable preemption and interrupts and save the FPU state before using it and to do the reverse when done. Due to the way the code is organized (see (b) below) we have to pay this price twice for each 16 byte GCM block processed. b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations. To solve (a) we do crypto processing in larger chunks while owning the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to 32x reduction of the time spend in the en/decryption routines, leading up to approximately 12x throughput increase for large (128 KiB) blocks. Lastly, this commit changes the default encryption algorithm from AES-CCM to AES-GCM when setting the `encryption=on` property. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Jason King <jason.king@joyent.com> Reviewed-By: Tom Caputi <tcaputi@datto.com> Reviewed-By: Richard Laager <rlaager@wiktel.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #9749
* Reduce number of atomic_add() calls in aggsumAlexander Motin2020-02-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to re-borrow after the flush. But since asc_borrowed and asc_delta are accessed only while holding asc_lock, it makes no any sense to modify as_lower_bound and as_upper_bound in multiple steps. Instead of that the new code uses only 2 atomics in all the cases, one per as_*_bound variable. I think even that is overkill, simple atomic store and load could be used here, since all modifications are done under the as_lock, but there are no such primitives in ZFS code now. While there, make borrow code consider previous borrow value, so that on mixed request patterns reduce chance of needing to borrow again if much larger request follows tiny one that needed borrow. Also reduce as_numbuckets from uint64_t to u_int. It makes no sense to use so large division operation on every aggsum_add(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #9930
* Few microoptimizations to dbuf layerAlexander Motin2020-02-051-6/+7
| | | | | | | | | | | | | | | | | | | Move db_link into the same cache line as db_blkid and db_level. It allows significantly reduce avl_add() time in dbuf_create() on systems with large RAM and huge number of dbufs per dnode. Avoid few accesses to dbuf_caches[].size, which is highly congested under high IOPS and never stays in cache for a long time. Use local value we are receiving from zfs_refcount_add_many() any way. Remove cache_size_bytes_max bump from dbuf_evict_one(). I don't see a point to do it on dbuf eviction after we done it on insertion in dbuf_rele_and_unlock(). Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #9931
* Convert dbuf dirty record record list to a list_tMatthew Macy2020-02-051-4/+27
| | | | | | | | | Additionally pull in state machine comments about upcoming async cow work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9902
* Restore aclmode and remove acltype on FreeBSDRyan Moeller2020-02-041-1/+1
| | | | | | | | | | | | | | This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE, matching what is done in the NFSv4 ACLs PR (#9709). On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide ZFS_PROP_ACLMODE. The tests already assume this arrangement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #9913
* async zvol minor node creation interferes with receiveMatthew Ahrens2020-02-032-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we finish a zfs receive, dmu_recv_end_sync() calls zvol_create_minors(async=TRUE). This kicks off some other threads that create the minor device nodes (in /dev/zvol/poolname/...). These async threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which both call dmu_objset_own(), which puts a "long hold" on the dataset. Since the zvol minor node creation is asynchronous, this can happen after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have completed. After the first receive ioctl has completed, userland may attempt to do another receive into the same dataset (e.g. the next incremental stream). This second receive and the asynchronous minor node creation can interfere with one another in several different ways, because they both require exclusive access to the dataset: 1. When the second receive is finishing up, dmu_recv_end_check() does dsl_dataset_handoff_check(), which can fail with EBUSY if the async minor node creation already has a "long hold" on this dataset. This causes the 2nd receive to fail. 2. The async udev rule can fail if zvol_id and/or systemd-udevd try to open the device while the the second receive's async attempt at minor node creation owns the dataset (via zvol_prefetch_minors_impl). This causes the minor node (/dev/zd*) to exist, but the udev-generated /dev/zvol/... to not exist. 3. The async minor node creation can silently fail with EBUSY if the first receive's zvol_create_minor() trys to own the dataset while the second receive's zvol_prefetch_minors_impl already owns the dataset. To address these problems, this change synchronously creates the minor node. To avoid the lock ordering problems that the asynchrony was introduced to fix (see #3681), we create the minor nodes from open context, with no locks held, rather than from syncing contex as was originally done. Implementation notes: We generally do not need to traverse children or prefetch anything (e.g. when running the recv, snapshot, create, or clone subcommands of zfs). We only need recursion when importing/opening a pool and when loading encryption keys. The existing recursive, asynchronous, prefetching code is preserved for use in these cases. Channel programs may need to create zvol minor nodes, when creating a snapshot of a zvol with the snapdev property set. We figure out what snapshots are created when running the LUA program in syncing context. In this case we need to remember what snapshots were created, and then try to create their minor nodes from open context, after the LUA code has completed. There are additional zvol use cases that asynchronously own the dataset, which can cause similar problems. E.g. changing the volmode or snapdev properties. These are less problematic because they are not recursive and don't touch datasets that are not involved in the operation, there is still potential for interference with subsequent operations. In the future, these cases should be similarly converted to create the zvol minor node synchronously from open context. The async tasks of removing and renaming minors do not own the objset, so they do not have this problem. However, it may make sense to also convert these operations to happen synchronously from open context, in the future. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-65948 Closes #7863 Closes #9885
* Add AltiVec RAID-ZRomain Dolbeau2020-01-231-1/+4
| | | | | | | | | | | | | Implements the RAID-Z function using AltiVec SIMD. This is basically the NEON code translated to AltiVec. Note that the 'fletcher' algorithm requires 64-bits operations, and the initial implementations of AltiVec (PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to 32-bits operations, so no 'fletcher'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu> Closes #9539
* Support inheriting properties in channel programsJason King2020-01-221-0/+9
| | | | | | | | | This adds support in channel programs to inherit properties analogous to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit` functions to the ZFS LUA API. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason King <jason.king@joyent.com> Closes #9738
* Fix errata #4 handling for resuming streamsTom Caputi2020-01-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the handling for errata #4 has two issues which allow the checks for this issue to be bypassed using resumable sends. The first issue is that drc->drc_fromsnapobj is not set in the resuming code as it is in the non-resuming code. This causes dsl_crypto_recv_key_check() to skip its checks for the from_ivset_guid. The second issue is that resumable sends do not clean up their on-disk state if they fail the checks in dmu_recv_stream() that happen before any data is received. As a result of these two bugs, a user can attempt a resumable send of a dataset without a from_ivset_guid. This will fail the initial dmu_recv_stream() checks, leaving a valid resume state. The send can then be resumed, which skips those checks, allowing the receive to be completed. This commit fixes these issues by setting drc->drc_fromsnapobj in the resuming receive path and by ensuring that resumablereceives are properly cleaned up if they fail the initial dmu_recv_stream() checks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9818 Closes #9829
* Change http://zfsonlinux.org links to https://zfsonlinux.orgBrian Behlendorf2020-01-131-1/+1
| | | | | | | | | | | Update the project website links contained in to repository to reference the secure https://zfsonlinux.org address. Reviewed-By: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Garrett Fields <ghfields@gmail.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9837
* Add 'zfs send --saved' flagTom Caputi2020-01-101-4/+6
| | | | | | | | | | | | | | | | | | This commit adds the --saved (-S) to the 'zfs send' command. This flag allows a user to send a partially received dataset, which can be useful when migrating a backup server to new hardware. This flag is compatible with resumable receives, so even if the saved send is interrupted, it can be resumed. The flag does not require any user / kernel ABI changes or any new feature flags in the send stream format. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Reviewed-by: Christian Schwarz <me@cschwarz.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9007
* Relocate common quota functions to shared codeRyan Moeller2019-12-113-0/+49
| | | | | | | | | | | The quota functions are common to all implementations and can be moved to common code. As a simplification they were moved to the Linux platform code in the initial refactoring. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9710
* Eliminate Linux specific inode usage from common code Matthew Macy2019-12-111-2/+2
| | | | | | | | | | Change many of the znops routines to take a znode rather than an inode so that zfs_replay code can be largely shared and in the future the much of the znops code may be shared. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9708
* Abstract away platform specific superblock referencesMatthew Macy2019-12-101-0/+2
| | | | | | | | The zfsvfs->z_sb field is Linux specified and should be abstracted. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9697
* Exclude data from cores unconditionally and metadata conditionallyMatthew Macy2019-12-091-0/+1
| | | | | | | | | This change allows us to align the code dump logic across platforms. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9691
* Disable EDONR on FreeBSDMatthew Macy2019-12-051-0/+2
| | | | | | | | | | | FreeBSD uses its own crypto framework in-kernel which, at this time, has no EDONR implementation. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9664
* Add ZFS_IOC offsets for FreeBSDMatthew Macy2019-12-051-12/+13
| | | | | | | | | | | | | | | | | FreeBSD requires three additional ioctls, they are ZFS_IOC_NEXTBOOT, ZFS_IOC_JAIL, and ZFS_IOC_UNJAIL. These have been added after the Linux-specific ioctls. The range 0x80-0xFF has been reserved for future optional platform-specific ioctls. Any platform may choose to implement these as appropriate. None of the existing ioctl numbers have been changed to maintain compatibility. For Linux no vectors have been registered for the new ioctls and they are reported as unsupported. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9667
* Refactor deadman set failmode to be cross platformMatthew Macy2019-12-052-1/+3
| | | | | | | | | Update zfs_deadman_failmode to use the ZFS_MODULE_PARAM_CALL wrapper, and split the common and platform specific portions. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9670
* Refactor zfs_context.h to build on FreeBSDMatthew Macy2019-12-041-18/+4
| | | | | | | | | | | | - on Linux move Linux specific headers to zfs_context_os.h - on FreeBSD move FreeBSD specific definitions to zfs_context_os.h - remove duplicate tsd_ definitions - remove unused AT_TYPE Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9668
* Move zfs_cmd_t copyin/copyout to platform codeMatthew Macy2019-12-021-1/+1
| | | | | | | | | | | FreeBSD needs to cope with multiple version of the zfs_cmd_t structure. Allowing the platform code to pre and post process the cmd structure makes it possible to work with legacy tooling. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9624
* Add FreeBSD required defines to mntent.hMatthew Macy2019-11-301-0/+9
| | | | | | | Linux and FreeBSD use different names for suid / setuid. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9632