path: root/include
Commit log, most recent first. Each entry shows the commit message followed by the author, date, number of files changed, and lines removed/added (-/+).
* Small program that converts a dataset id and an object id to a path (Paul Dagnelie, 2020-05-20; 1 file, -1/+3)

  Reviewed-by: Prakash Surya <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #10204

* Fix ZVOL_DIR (Ryan Moeller, 2020-05-16; 1 file, -3/+0)

  We only use ZVOL_DIR on FreeBSD, and on FreeBSD it isn't correct.
  Move the definition to the file where it is needed, and define it as
  /dev/zvol/.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10337

* Fix abd_enter/exit_critical wrappers (Brian Behlendorf, 2020-05-14; 1 file, -2/+13)

  Commit fc551d7 introduced the wrappers abd_enter_critical() and
  abd_exit_critical() to mark critical sections. On Linux these are
  implemented with the local_irq_save() and local_irq_restore() macros
  which set the 'flags' argument when saving. By wrapping them with a
  function the local variable is no longer set by the macro and is no
  longer properly restored.

  Convert abd_enter_critical() and abd_exit_critical() to macros to
  resolve this issue and ensure the flags are properly restored.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Atkinson <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10332

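  The mechanics are easiest to see in a sketch; local_irq_save() and
  local_irq_restore() are the Linux kernel macros named above, and the
  definitions here are illustrative rather than the actual abd code:

      /*
       * Broken: local_irq_save() writes the saved IRQ state into the
       * *caller's* 'flags' lvalue. Inside a function it only sets the
       * local parameter, so the caller's copy stays uninitialized and
       * the later restore uses garbage.
       */
      static inline void
      abd_enter_critical(unsigned long flags)
      {
              local_irq_save(flags);  /* sets the local copy only */
      }

      /*
       * Fixed: macros keep 'flags' in the caller's scope, exactly as
       * the local_irq_save()/local_irq_restore() pair expects.
       */
      #define abd_enter_critical(flags)       local_irq_save(flags)
      #define abd_exit_critical(flags)        local_irq_restore(flags)
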
* Upstream: add missing thread_exit() (Jorgen Lundman, 2020-05-14; 1 file, -22/+1)

  Undo the FreeBSD wrapper for thread_create() that was added to call
  thread_exit().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Jorgen Lundman <[email protected]>
  Closes #10314

* remove unneeded member drc_err of dmu_recv_cookie_t (Matthew Ahrens, 2020-05-14; 1 file, -1/+0)

  The member drc_err of dmu_recv_cookie_t is used only locally in
  receive_read, so we can replace it with a local variable.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10319

* Combine OS-independent ABD Code into Common Source File (Brian Atkinson, 2020-05-10; 3 files, -50/+147)

  Reorganize the ABD code base so that the OS-independent ABD code is
  placed in a common abd.c file. The OS-dependent ABD code remains in
  each OS's ABD source files, which have been renamed to abd_os.

  The OS-independent ABD code is now under:
      module/zfs/abd.c
  with the OS-dependent code in:
      module/os/linux/zfs/abd_os.c
      module/os/freebsd/zfs/abd_os.c

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Brian Atkinson <[email protected]>
  Closes #10293

* Improvements on persistent L2ARC (George Amanakis, 2020-05-07; 1 file, -12/+28)

  Functional changes:

  We implement refcounts of log blocks and their aligned size on the
  cache device along with two corresponding arcstats. The refcounts are
  reflected in the header of the device and provide valuable
  information as to whether log blocks are accounted for correctly.
  These are dynamically adjusted as log blocks are committed/evicted.
  zdb also uses this information in the device header and compares it
  to the corresponding values as reported by dump_l2arc_log_blocks(),
  which emulates l2arc_rebuild(). If the refcounts saved in the device
  header report higher values, zdb exits with an error. For this
  feature to work correctly there should be no active writes on the
  device. This is also employed in the tests of persistent L2ARC. We
  extend the structure of the cache device header by adding the two new
  variables mirroring the refcounts after the existing variables, to
  preserve backward compatibility in terms of persistent L2ARC:

  1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize"
     which reflect the total aligned size of log blocks on the device.
     This is also reflected in the header of the cache device as
     "dh_lb_asize".
  2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
     which reflect the total number of L2ARC log blocks present on
     cache devices. It is also reflected in the header of the cache
     device as "dh_lb_count".

  In l2arc_rebuild_vdev(), if the amount of committed log entries in a
  log block is 0 and the device header is valid, we update the device
  header. This will facilitate trimming of the whole device in this
  case when TRIM for L2ARC is implemented.

  Improve loop protection in l2arc_rebuild() by using the starting
  offset of the payload of each log block instead of the starting
  offset of the log block.

  If the zio in l2arc_write_buffers() fails, restore the lbps array in
  the header of the device to its previous state in l2arc_write_done().

  If l2arc_rebuild() ends the rebuild process without restoring any
  L2ARC log blocks in ARC and without any other error, this means that
  the lbps array in the header is pointing to non-existent or invalid
  log blocks. Reset the device header in this case.

  In l2arc_rebuild() change the zfs_dbgmsg messages to
  spa_history_log_internal(), making them user visible with the zpool
  history command.

  Non-functional changes:

  Make the first test in persistent L2ARC use `zdb -lll` to increase
  coverage in `zdb.c`.

  Rename psize to asize when referring to log blocks, since
  L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
  rename dh_log_blk_entries to dh_log_entries to make it clear that it
  is a mirror of l2ad_log_entries. Added comments for both changes.

  Fix inaccurate comments, for example in l2arc_log_blk_restore().

  Add asserts at the end of l2arc_evict() and l2arc_write_buffers().

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #10228

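  A sketch of the backward-compatible header growth described above;
  the surrounding fields are placeholders, and only the appended and
  renamed fields follow the names in the message:

      /*
       * New fields are appended after the existing ones, so older
       * software that reads the on-disk header simply ignores the
       * trailing bytes. Illustrative only, not the actual struct.
       */
      typedef struct l2arc_dev_hdr_phys {
              uint64_t        dh_magic;       /* existing fields ... */
              uint64_t        dh_log_entries; /* was dh_log_blk_entries */
              /* ... */
              uint64_t        dh_lb_asize;    /* aligned size of log blks */
              uint64_t        dh_lb_count;    /* number of log blocks */
      } l2arc_dev_hdr_phys_t;
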
* Add support for boot environment data to be stored in the label (Paul Dagnelie, 2020-05-07; 5 files, -8/+36)

  Modern bootloaders leverage data stored in the root filesystem to
  enable some of their powerful features. GRUB specifically has a
  grubenv file which can store large amounts of configuration data that
  can be read and written at boot time and during normal operation.
  This allows sysadmins to configure useful features like automated
  failover after failed boot attempts. Unfortunately, due to the
  Copy-on-Write nature of ZFS, the standard behavior of these tools
  cannot handle writing to ZFS files safely at boot time. We need an
  alternative way to store data that allows the bootloader to make
  changes to the data.

  This work is very similar to work that was done on Illumos to enable
  similar functionality in the FreeBSD bootloader. This patch is
  different in that the data being stored is a raw grubenv file; this
  file can store arbitrary variables and values, and the scripting
  provided by grub is powerful enough that special structures are not
  required to implement advanced behavior.

  We repurpose the second padding area in each label to store the
  grubenv file, protected by an embedded checksum. We add two ioctls to
  get and set this data, and libzfs_core and libzfs functions to access
  them more easily. There are no direct command line interfaces to
  these functions; these will be added directly to the bootloader
  utilities.

  Reviewed-by: Pavel Zakharov <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #10009

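  A sketch of the label layout this describes; the struct and field
  names here are assumptions consistent with the message (second label
  pad holding raw grubenv text plus an embedded checksum), not a
  statement of the actual on-disk format:

      /* Assumed: one vdev label pad area reused for grubenv data. */
      typedef struct vdev_boot_envblock {
              uint64_t        vbe_version;
              char            vbe_bootenv[VDEV_PAD_SIZE -
                                  sizeof (uint64_t) -
                                  sizeof (zio_eck_t)];  /* raw grubenv */
              zio_eck_t       vbe_zbt;        /* embedded checksum */
      } vdev_boot_envblock_t;
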
* Update FreeBSD SPL atomics (Ryan Moeller, 2020-05-04; 1 file, -71/+65)

  Sync up with the following changes from FreeBSD:

  ZFS: add emulation of atomic_swap_64 and atomic_load_64

  Some 32-bit platforms do not provide 64-bit atomic operations that
  ZFS requires, either in userland or at all. We emulate those
  operations for those platforms using a mutex. That is not entirely
  correct and it's not very efficient. Besides, the loads are plain
  loads, so torn values are possible. Nevertheless, the emulation seems
  to work for some definition of work. This change adds
  atomic_swap_64, which is already used in ZFS code, and atomic_load_64
  that can be used to prevent torn reads.

  Authored by: avg <[email protected]>
  FreeBSD-commit: freebsd/freebsd@3458e5d1e6354123ec2b0953d29f98126aa442e

  cleanup of illumos compatibility atomics

  atomic_cas_32 is implemented using atomic_fcmpset_32 on all
  platforms. Ditto for atomic_cas_64 and atomic_fcmpset_64 on platforms
  that have it. The only exception is sparc64, which provides MD
  atomic_cas_32 and atomic_cas_64. This is slightly inefficient as
  fcmpset reports whether the operation updated the target and that
  information is not needed for cas. Nevertheless, there is less code
  to maintain and to add for new platforms. Also, the operations are
  done inline now as opposed to function calls before.

  atomic_add_64_nv is implemented using atomic_fetchadd_64 on platforms
  that provide it.

  casptr, cas32, atomic_or_8, atomic_or_8_nv are completely removed as
  they have no users.

  atomic_mtx, which is used to emulate 64-bit atomics on platforms that
  lack them, is defined only on those platforms. As a result, platform
  specific opensolaris_atomic.S files have lost most of their code. The
  only exception is i386 where the compat+contrib code provides 64-bit
  atomics for userland use. That code assumes availability of the
  cmpxchg8b instruction. FreeBSD does not have that assumption for i386
  userland and does not provide 64-bit atomics. Hopefully, this can and
  will be fixed.

  Authored by: avg <[email protected]>
  FreeBSD-commit: freebsd/freebsd@e9642c209b4413f6afb41d3b2607c51d80a1a34

  emulate illumos membar_producer with atomic_thread_fence_rel

  membar_producer is supposed to be a store-store barrier. Also, in the
  code that FreeBSD has ported from illumos, membar_producer is used
  only with regular stores to regular memory (with respect to caching).
  We do not have an MI primitive for the store-store barrier, so
  atomic_thread_fence_rel is the closest we have, as it provides a
  (load | store) -> store barrier.

  Previously, membar_producer was an empty function call on all 32-bit
  arm-s, 32-bit powerpc, riscv and all mips variants. I think that it
  was inadequate. On other platforms, such as amd64, arm64, i386,
  powerpc64, sparc64, membar_producer was implemented using stronger
  primitives than required for a store-store barrier with respect to
  regular memory access. For example, it used sfence on amd64 and
  lock-ed nop on i386 (despite TSO). On powerpc64 we now use the
  recommended lwsync instead of eieio. On sparc64 FreeBSD uses TSO
  mode. On arm64/aarch64 we now use dmb sy instead of dmb ish. Not sure
  if this is an improvement, actually.

  After this change we can drop opensolaris_atomic.S for aarch64,
  amd64, powerpc64 and sparc64 as all required atomic operations have
  either direct or light-weight mapping to FreeBSD native atomic
  operations.

  Discussed with: kib
  Authored by: avg <[email protected]>
  FreeBSD-commit: freebsd/freebsd@50cdda62fced8d21e45858e01dc375a10f1749e

  fix up r353340, don't assume that fcmpset has strong semantics

  fcmpset can have two kinds of semantics, weak and strong. For
  practical purposes, strong semantics means that if fcmpset fails then
  the reported current value is always different from the expected
  value. Weak semantics means that the reported current value may be
  the same as the expected value even though fcmpset failed. That's a
  so called "sporadic" failure. I originally implemented atomic_cas
  expecting strong semantics, but many platforms actually have weak
  one.

  Reported by: pkubaj (not confirmed if same issue)
  Discussed with: kib, mjg
  Authored by: avg <[email protected]>
  FreeBSD-commit: freebsd/freebsd@238787c74e737e271f17330fbad900acc35651c

  [PowerPC] [MIPS] Implement 32-bit kernel emulation of atomic64
  operations

  This is a lock-based emulation of 64-bit atomics for kernel use,
  split off from an earlier patch by jhibbits. This is needed to
  unblock future improvements that reduce the need for locking on
  64-bit platforms by using atomic updates. The implementation allows
  for future integration with userland atomic64, but as that implies
  going through sysarch for every use, the current status quo of
  userland doing its own locking may be for the best.

  Submitted by: jhibbits (original patch), kevans (mips bits)
  Reviewed by: jhibbits, jeff, kevans
  Authored by: bdragon <[email protected]>
  Differential Revision: https://reviews.freebsd.org/D22976
  FreeBSD-commit: freebsd/freebsd@db39dab3a896b3d98e588736e9a2b4ddaeb31f1

  Remove sparc64 kernel support

  Remove all sparc64 specific files. Remove all sparc64 ifdefs. Remove
  indirect sparc64 ifdefs.

  Authored by: imp <[email protected]>
  FreeBSD-commit: freebsd/freebsd@48b94864c51253da92e4444f0074eec36ef391f

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Ported-by: Ryan Moeller <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10250

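  The weak-vs-strong distinction above is why a cas built on fcmpset
  needs a retry loop. A minimal sketch, assuming FreeBSD's
  atomic_fcmpset_64() semantics (returns true on success; on failure,
  writes the observed value back through the second argument):

      static inline uint64_t
      atomic_cas_64(volatile uint64_t *target, uint64_t expected,
          uint64_t newval)
      {
              uint64_t prev = expected;

              do {
                      if (atomic_fcmpset_64(target, &prev, newval))
                              return (expected);      /* swap done */
              } while (prev == expected);     /* sporadic failure: retry */

              return (prev);  /* genuine mismatch: report current value */
      }
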
* OpenZFS 742 - Resurrect the ZFS "aclmode" property
  OpenZFS 664 - Umask masking "deny" ACL entries
  OpenZFS 279 - Bug in the new ACL (post-PSARC/2010/029) semantics
  (Paul B. Henson, 2020-04-30; 1 file, -0/+1)

  Porting notes:
  * Updated zfs_acl_chmod to take 'boolean_t isdir' as first parameter
    rather than 'zfsvfs_t *zfsvfs'
  * zfs man pages changes mixed between zfs and new zfsprops man pages

  Reviewed by: Aram Hăvărneanu <[email protected]>
  Reviewed by: Gordon Ross <[email protected]>
  Reviewed by: Robert Gordon <[email protected]>
  Reviewed by: [email protected]
  Reviewed by: Brian Behlendorf <[email protected]>
  Approved by: Garrett D'Amore <[email protected]>
  Ported-by: Paul B. Henson <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/742
  OpenZFS-issue: https://www.illumos.org/issues/664
  OpenZFS-issue: https://www.illumos.org/issues/279
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c49ce110
  Closes #10266

* Support custom URI schemes for the keylocation property (Jason King, 2020-04-28; 1 file, -0/+11)

  Every platform has their own preferred methods for implementing URI
  schemes beyond the currently supported file scheme (e.g. 'https' on
  FreeBSD would likely use libfetch, while Linux distros and illumos
  would probably use libcurl, etc). It would be helpful if libzfs could
  be extended to support additional schemes in a simple manner.

  A table of (scheme, handler_function) pairs is added to
  libzfs_crypto.c, and the existing functions in libzfs_crypto.c are
  modified so that when the key format is ZFS_KEYFORMAT_URI, the scheme
  from the URI string is extracted and a matching handler is located in
  the aforementioned table (returning an error if no matching handler
  is found). The handler function is then invoked to retrieve the key
  material (in the format specified by the keyformat property) and the
  key is loaded, or the handler can return an error to abort the key
  loading process.

  Reviewed by: Sean Eric Fagan <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Jason King <[email protected]>
  Closes #10218

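  A minimal sketch of such a dispatch table; the typedef and handler
  names are illustrative, not the actual libzfs_crypto.c declarations:

      #include <stddef.h>
      #include <string.h>

      typedef int (*key_fetch_fn_t)(const char *uri, char **keybuf,
          size_t *len);

      /* Hypothetical per-scheme handlers, defined elsewhere. */
      int file_key_fetch(const char *uri, char **keybuf, size_t *len);
      int https_key_fetch(const char *uri, char **keybuf, size_t *len);

      static const struct {
              const char      *scheme;
              key_fetch_fn_t  handler;
      } uri_schemes[] = {
              { "file",       file_key_fetch },
              { "https",      https_key_fetch },
      };

      static key_fetch_fn_t
      find_scheme_handler(const char *scheme)
      {
              for (size_t i = 0;
                  i < sizeof (uri_schemes) / sizeof (uri_schemes[0]); i++) {
                      if (strcmp(uri_schemes[i].scheme, scheme) == 0)
                              return (uri_schemes[i].handler);
              }
              return (NULL);  /* no match: caller reports an error */
      }
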
* Remove deduplicated send/receive code (Matthew Ahrens, 2020-04-23; 6 files, -26/+15)

  Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of
  such streams) are deprecated. Deduplicated send streams can be
  received by first converting them to non-deduplicated with the
  `zstream redup` command.

  This commit removes the code for sending and receiving deduplicated
  send streams. `zfs send -D` will now print a warning, ignore the `-D`
  flag, and generate a regular (non-deduplicated) send stream. `zfs
  receive` of a deduplicated send stream will print an error message
  and fail.

  The resulting code simplification (especially in the kernel's support
  for receiving dedup streams) should help enable future performance
  enhancements.

  Several new tests are added which leverage `zstream redup`.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Issue #7887
  Issue #10117
  Issue #10156
  Closes #10212

* Use a struct to organize metaslab-group-allocator fields (Matthew Ahrens, 2020-04-22; 1 file, -4/+11)

  Each metaslab group (of which there is one per top-level vdev) has
  several (4, by default) "metaslab group allocators". Each "allocator"
  has its own metaslab that it prefers to allocate from (the "primary"
  allocator), and each can perform allocations concurrently with the
  other allocators.

  In addition to the primary metaslab, there are several other fields
  that need to be tracked separately for each allocator. These are
  currently stored as several arrays in the metaslab_group_t, each
  array indexed by allocator number.

  This change organizes all the metaslab-group-allocator-specific
  fields into a new struct, metaslab_group_allocator_t. The
  metaslab_group_t now needs only one array indexed by the allocator
  number, which contains the metaslab_group_allocator_t's.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10213

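  A before/after sketch of the reorganization; the field names are
  illustrative rather than the exact metaslab.h declarations:

      /* Before: parallel arrays, each indexed by allocator number. */
      struct metaslab_group_before {
              struct metaslab **mg_primaries;
              struct metaslab **mg_secondaries;
              uint64_t        *mg_cur_max_alloc_queue_depth;
      };

      /* After: one struct per allocator, one array of structs. */
      typedef struct metaslab_group_allocator {
              struct metaslab *mga_primary;   /* preferred metaslab */
              struct metaslab *mga_secondary;
              uint64_t        mga_cur_max_alloc_queue_depth;
      } metaslab_group_allocator_t;

      struct metaslab_group_after {
              metaslab_group_allocator_t *mg_allocator; /* [allocators] */
      };
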
* Fix zfs send progress reporting (Matthew Ahrens, 2020-04-20; 2 files, -0/+14)

  The progress of a send is supposed to be reported by `zfs send -v`,
  but it is not. This works by creating a new user thread (with
  pthread_create()) which does ZFS_IOC_SEND_PROGRESS ioctls to check
  how much progress has been made. This IOCTL finds the specified send
  (since there may be multiple concurrent sends in the system). The
  IOCTL also checks that the specified send was started by the current
  process.

  On Linux, different threads of the same process are represented as
  different `struct task_struct`s (and, confusingly, have different
  PID's). To check if two threads are in the same process, we need to
  check if they have the same `struct task_struct:group_leader`.

  We used to do this correctly, but it was inadvertently changed by
  30af21b02569 (Redacted Send) to simply check if the current `struct
  task_struct` is the one that started the send.

  This commit changes the code back to checking if the send was started
  by a `struct task_struct` with the same `group_leader` as the calling
  thread.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Chris Wedgwood <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10215
  Closes #10216

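  A minimal sketch of the corrected check, using the `group_leader`
  field named above (kernel-only types; the surrounding function is
  illustrative, not the actual call site):

      /*
       * Two Linux threads belong to the same process iff their task
       * structs share a group_leader. Comparing the tasks directly
       * (a == b) is the bug described above: it rejects sibling
       * threads of the thread that started the send.
       */
      static boolean_t
      same_process(struct task_struct *a, struct task_struct *b)
      {
              return (a->group_leader == b->group_leader);
      }
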
* Use new FreeBSD API to largely eliminate object locking (Matthew Macy, 2020-04-17; 1 file, -0/+16)

  Propagate changes in HEAD that mostly eliminate object locking.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Alexander Motin <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10205

* Update FreeBSD tunables (Ryan Moeller, 2020-04-15; 1 file, -2/+3)

  Remove some obsolete legacy compat, rename some misnamed tunables,
  and add some missing tunables for FreeBSD.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10203

* Add FreeBSD support to OpenZFS (Matthew Macy, 2020-04-14; 92 files, -0/+9317)

  Add the FreeBSD platform code to the OpenZFS repository. As of this
  commit the source can be compiled and tested on FreeBSD 11 and 12.
  Subsequent commits are now required to compile on FreeBSD and Linux.
  Additionally, they must pass the ZFS Test Suite on FreeBSD which is
  being run by the CI. As of this commit 1230 tests pass on FreeBSD and
  there are no unexpected failures.

  Reviewed-by: Sean Eric Fagan <[email protected]>
  Reviewed-by: Jorgen Lundman <[email protected]>
  Reviewed-by: Richard Laager <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #898
  Closes #8987

* Persistent L2ARC (George Amanakis, 2020-04-10; 4 files, -18/+297)

  This commit makes the L2ARC persistent across reboots. We implement a
  light-weight persistent L2ARC metadata structure that allows L2ARC
  contents to be recovered after a reboot. This significantly eases the
  impact a reboot has on read performance on systems with large caches.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: George Wilson <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Co-authored-by: Saso Kiselkov <[email protected]>
  Co-authored-by: Jorgen Lundman <[email protected]>
  Co-authored-by: George Amanakis <[email protected]>
  Ported-by: Yuxuan Shui <[email protected]>
  Signed-off-by: George Amanakis <[email protected]>
  Closes #925
  Closes #1823
  Closes #2672
  Closes #3744
  Closes #9582

* Don't ignore zfs_arc_max below allmem/32 (Ryan Moeller, 2020-04-09; 1 file, -1/+1)

  Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
  than the default allmem/32, zfs_arc_max can also be set lower.

  Add warning messages when tunables are being ignored.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10157
  Closes #10158

* Add separate field for indicating that spa is in middle of split (Matthew Macy, 2020-04-09; 1 file, -0/+1)

  By default it's not possible to open a device already owned by an
  active vdev. It's necessary to make an exception to this for vdev
  split. The FreeBSD platform code will make an exception if
  spa_is_splitting is set to true.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10178

* Linux 5.7 compat: blk_alloc_queue() (Brian Behlendorf, 2020-04-09; 1 file, -0/+14)

  Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified
  the blk_alloc_queue() interface by updating it to take the request
  queue as an argument. Add a wrapper function which accepts the new
  arguments and internally uses the available interfaces.

  Other minor changes include increasing the Linux-Maximum to 5.6 now
  that 5.6 has been released. It was not bumped to 5.7 because this
  release has not yet been finalized and is still subject to change.

  Added local 'struct zvol_state_os *zso' variable to zvol_alloc.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10181
  Closes #10187

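  A sketch of a compat shim of this shape, assuming a configure check
  named HAVE_BLK_ALLOC_QUEUE_REQUEST_FN (the macro and wrapper names
  are assumptions; the two kernel interfaces shown are the 5.7+ and
  pre-5.7 forms):

      #ifdef HAVE_BLK_ALLOC_QUEUE_REQUEST_FN
      /* Linux 5.7+: blk_alloc_queue() takes the request fn directly. */
      #define zvol_alloc_queue(fn)    blk_alloc_queue((fn), NUMA_NO_NODE)
      #else
      /* Older kernels: allocate, then attach the request function. */
      static inline struct request_queue *
      zvol_alloc_queue(make_request_fn *fn)
      {
              struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

              if (q != NULL)
                      blk_queue_make_request(q, fn);
              return (q);
      }
      #endif
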
* Finish refactoring for ZFS_MODULE_PARAM_CALL (Ryan Moeller, 2020-04-07; 4 files, -7/+10)

  Linux and FreeBSD have different parameters for the tunable proc
  handler. This has prevented FreeBSD from implementing the
  ZFS_MODULE_PARAM_CALL macro.

  To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
  per-platform definitions of the parameter list,
  ZFS_MODULE_PARAM_ARGS.

  With the declarations wired up we discovered an incorrect scope
  prefix for spa_slop_shift, so this is now fixed.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10179

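  A sketch of how such a per-platform parameter list can look; both
  expansions here are assumptions rather than the exact definitions:

      #if defined(__FreeBSD__)
      /* FreeBSD sysctl handler signature. */
      #define ZFS_MODULE_PARAM_ARGS   struct sysctl_oid *oidp,        \
                                      void *arg1, intmax_t arg2,      \
                                      struct sysctl_req *req
      #else
      /* Linux module_param_call() handler signature. */
      #define ZFS_MODULE_PARAM_ARGS   const char *buf,                \
                                      zfs_kernel_param_t *kp
      #endif

      /* A shared handler can then be declared once for both OSes. */
      int param_set_deadman_synctime(ZFS_MODULE_PARAM_ARGS);
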
* Add 'zfs wait' command (Paul Dagnelie, 2020-04-01; 4 files, -0/+24)

  Add a mechanism to wait for the delete queue to drain.

  When doing redacted send/recv, many workflows involve deleting files
  that contain sensitive data. Because of the way zfs handles file
  deletions, snapshots taken quickly after a rm operation can sometimes
  still contain the file in question, especially if the file is very
  large. This can result in issues for redacted send/recv users who
  expect the deleted files to be redacted in the send streams, and not
  appear in their clones.

  This change duplicates much of the zpool wait related logic into a
  zfs wait command, which can be used to wait until the internal
  deleteq has been drained. Additional wait activities may be added in
  the future.

  Reviewed-by: Matthew Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: John Gallagher <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #9707

* Let default arc_c_max be platform dependent (Ryan Moeller, 2020-03-27; 1 file, -0/+1)

  Linux changed the default max ARC size to 1/2 of physical memory to
  deal with shortcomings of the Linux SLUB allocator. Other platforms
  do not require the same logic.

  Implement an arc_default_max() function to determine a default max
  ARC size in platform code.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10155

* Compile cityhash code into libzfs (Matthew Ahrens, 2020-03-27; 3 files, -1/+1)

  Make the cityhash code compile into libzfs, in preparation for the
  new "zstream" command.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10152

* Deprecate deduplicated send streams (Matthew Ahrens, 2020-03-18; 1 file, -0/+1)

  Dedup send can only deduplicate over the set of blocks in the send
  command being invoked, and it does not take advantage of the dedup
  table to do so. This is a very common misconception among not only
  users, but developers, and makes the feature seem more useful than it
  is. As a result, many users are using the feature but not getting any
  benefit from it.

  Dedup send requires a nontrivial expenditure of memory and CPU to
  operate, especially if the dataset(s) being sent is (are) not already
  using a dedup-strength checksum.

  Dedup send adds developer burden. It expands the test matrix when
  developing new features, causing bugs in released code, and delaying
  development efforts by forcing more testing to be done.

  As a result, we are deprecating the use of `zfs send -D` and
  receiving of such streams. This change adds a warning to the man
  page, and also prints the warning whenever dedup send or receive are
  used.

  In a future release, we plan to:
  1. remove the kernel code for generating deduplicated streams
  2. make `zfs send -D` generate regular, non-deduplicated streams
  3. remove the kernel code for receiving deduplicated streams
  4. make `zfs receive` of deduplicated streams process them in
     userland to "re-duplicate" them, so that they can still be
     received.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: George Melikov <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #7887
  Closes #10117

* Separate warning for incomplete and corrupt streams (Paul Dagnelie, 2020-03-17; 1 file, -0/+1)

  This change adds a separate return code to zfs_ioc_recv that is used
  for incomplete streams, in addition to the existing return code for
  streams that contain corruption.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Paul Dagnelie <[email protected]>
  Closes #10122

* Add option for forcible unmounting dataset while receiving snapshot (Mariusz Zaborski, 2020-03-17; 1 file, -0/+3)

  Currently when the dataset is in use we can't receive snapshots.

      zfs send test/1@asd | zfs recv -FM test/2
      cannot unmount '/test/2': Device busy

  This commit adds the option 'M', which attempts to forcibly unmount
  the dataset. Thanks to this we can enforce receiving snapshots in a
  single step.

  Note that this functionality is not supported on Linux because the
  VFS will prevent active mounted filesystems from being unmounted,
  even with the force option. This is the intended VFS behavior.

  Test cases were added to verify the expected behavior based on the
  platform.

  Discussed-with: Pawel Jakub Dawidek <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Allan Jude <[email protected]>
  External-issue: https://reviews.freebsd.org/D22306
  Closes #9904

* dmu_objset_from_ds must be called with dp_config_rwlock held (Matthew Ahrens, 2020-03-12; 1 file, -1/+1)

  The normal lock order is that the dp_config_rwlock must be held
  before the ds_opening_lock. For example, dmu_objset_hold() does this.
  However, dmu_objset_open_impl() is called with the ds_opening_lock
  held, and if the dp_config_rwlock is not already held, it will
  attempt to acquire it. This may lead to deadlock, since the lock
  order is reversed.

  Looking at all the callers of dmu_objset_open_impl() (which is
  principally the callers of dmu_objset_from_ds()), almost all callers
  already have the dp_config_rwlock. However, there are a few places in
  the send and receive code paths that do not. For example:
  dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
  receive_write_byref, redact_traverse_thread.

  This commit resolves the problem by requiring all callers of
  dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
  code has been restructured such that we call dmu_objset_from_ds()
  earlier on in the send and receive processes, when we already have
  the dp_config_rwlock, and save the objset_t until we need it in the
  middle of the send or receive (similar to what we already do with the
  dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock
  in many new places.

  I also cleaned up code in dmu_redact_snap() and
  send_traverse_thread().

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Paul Zuchowski <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #9662
  Closes #10115

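  A minimal illustration of the required order, using the
  dsl_pool_config_enter()/dsl_pool_config_exit() wrappers for
  dp_config_rwlock (illustrative placement, not an actual call site
  from the change):

      static int
      hold_objset_locked(dsl_pool_t *dp, dsl_dataset_t *ds, objset_t **osp)
      {
              int error;

              dsl_pool_config_enter(dp, FTAG);   /* dp_config_rwlock first */
              error = dmu_objset_from_ds(ds, osp); /* may take opening lock */
              dsl_pool_config_exit(dp, FTAG);
              return (error);
      }
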
* Prevent race condition in dnode_dest (#10101) (John Poduska, 2020-03-12; 2 files, -0/+2)

  dnode_special_close() waits for the refcount of dn_holds to go to
  zero without holding the dn_mtx. dnode_rele_and_unlock() does the
  final remove to dn_holds with dn_mtx being held:

      refs = zfs_refcount_remove(&dn->dn_holds, tag);
      mutex_exit(&dn->dn_mtx);

  So, there is a race condition after the remove until dn_mtx is
  dropped. During that time, dnode_destroy() can get called, which ends
  up in dnode_dest() calling mutex_destroy() and a panic since the lock
  is still held.

  This change adds a condvar to wait for the final
  dnode_rele_and_unlock() to release the dn_mtx before calling
  dnode_destroy().

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: John Poduska <[email protected]>
  Closes #7814
  Closes #10101

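  A hedged sketch of the fix; the condvar name dn_nodnholds is an
  assumption, while the refcount/mutex calls mirror the snippet above:

      /* In dnode_rele_and_unlock(), with dn_mtx held: */
      refs = zfs_refcount_remove(&dn->dn_holds, tag);
      if (refs == 0)
              cv_broadcast(&dn->dn_nodnholds);   /* new condvar */
      mutex_exit(&dn->dn_mtx);

      /* In dnode_special_close(): */
      mutex_enter(&dn->dn_mtx);
      while (zfs_refcount_count(&dn->dn_holds) > 0)
              cv_wait(&dn->dn_nodnholds, &dn->dn_mtx);
      mutex_exit(&dn->dn_mtx);
      dnode_destroy(dn);  /* safe: dn_mtx is no longer held anywhere */
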
* Improve zfs send performance by bypassing the ARC (Matthew Ahrens, 2020-03-10; 2 files, -0/+7)

  When doing a zfs send on a dataset with small recordsize (e.g. 8K),
  performance is dominated by the per-block overheads. This is
  especially true with `zfs send --compressed`, which further reduces
  the amount of data sent, for the same number of blocks. Several
  threads are involved, but the limiting factor is the `send_prefetch`
  thread, which is 100% on CPU.

  The main job of the `send_prefetch` thread is to issue zio's for the
  data that will be needed by the main thread. It does this by calling
  `arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
  an arc_hdr, which takes around 14% of one CPU. It also induces later
  costs by other threads:

  * Since the data was only prefetched, dmu_send()->dmu_dump_write()
    will need to call arc_read() again to get the data. This will have
    to look up the arc_hdr in the hash table and copy the data from the
    scatter ABD in the arc_hdr to a linear ABD in an arc_buf. This
    takes 27% of one CPU.
  * dmu_dump_write() needs to call arc_buf_destroy(). This takes 11% of
    one CPU.
  * arc_adjust() will need to evict this arc_hdr, taking about 50% of
    one CPU.

  All of these costs can be avoided by bypassing the ARC if the data is
  not already cached. This commit changes `zfs send` to check for the
  data in the ARC, and if it is not found then we directly call
  `zio_read()`, reading the data into a linear ABD which is used by
  dmu_dump_write() directly.

  The performance improvement is best expressed in terms of how many
  blocks can be processed by `zfs send` in one second. This change
  increases the metric by 50%, from ~100,000 to ~150,000. When the
  amount of data per block is small (e.g. 2KB), there is a
  corresponding reduction in the elapsed time of `zfs send >/dev/null`
  (from 86 minutes to 58 minutes in this test case).

  In addition to improving the performance of `zfs send`, this change
  makes `zfs send` not pollute the ARC cache. In most cases the data
  will not be reused, so this allows us to keep caching useful data in
  the MRU (hit-once) part of the ARC.

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10067

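  A hedged sketch of the bypass; arc_cached() is a hypothetical
  stand-in for the is-it-already-cached test, and the surrounding zio
  plumbing and error handling are elided:

      if (arc_cached(spa, bp)) {
              /* Cached: read through the ARC as before. */
              error = arc_read(NULL, spa, bp, arc_getbuf_func, &abuf,
                  ZIO_PRIORITY_ASYNC_READ, zioflags, &aflags, zb);
      } else {
              /*
               * Not cached: read directly into a linear ABD, skipping
               * arc_hdr creation, hash-table insertion, the later
               * scatter-to-linear copy, and eventual eviction.
               */
              abd_t *abd = abd_alloc_linear(BP_GET_LSIZE(bp), B_FALSE);

              error = zio_wait(zio_read(NULL, spa, bp, abd,
                  BP_GET_LSIZE(bp), NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
                  zioflags, zb));
      }
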
* Add trim support to zpool wait (Brian Behlendorf, 2020-03-04; 2 files, -0/+4)

  Manual trims fall into the category of long-running pool activities
  which people might want to wait synchronously for. This change adds
  support to 'zpool wait' for waiting for manual trim operations to
  complete. It also adds a '-w' flag to 'zpool trim' which can be used
  to turn 'zpool trim' into a synchronous operation.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Signed-off-by: John Gallagher <[email protected]>
  Closes #10071

* Improve performance of zio_taskq_member (Matthew Ahrens, 2020-03-03; 2 files, -0/+2)

  __zio_execute() calls zio_taskq_member() to determine if we are
  running in a zio interrupt taskq, in which case we may need to switch
  to processing this zio in a zio issue taskq. The call to
  zio_taskq_member() can become a performance bottleneck when we are
  processing a high rate of zio's.

  zio_taskq_member() calls taskq_member() on each of the zio interrupt
  taskqs, of which there are 21. This is slow because each call to
  taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively
  slow.

  This commit improves the performance of zio_taskq_member() by having
  it cache the value of tsd_get(taskq_tsd), reducing the number of
  those calls to 1/21st of the current behavior.

  In a test case running `zfs send -c >/dev/null` of a filesystem with
  small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7%
  of one CPU, and with this change it is reduced to 1.3%. Overall time
  to perform the `zfs send` reduced by 10% (~150,000 blocks/sec to
  ~165,000 blocks/sec).

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Serapheim Dimitropoulos <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Tony Nguyen <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10070

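  A sketch of the caching described above; taskq_of_curthread() stands
  for a helper returning the cached tsd_get(taskq_tsd) value, and the
  spa_zio_taskq layout is assumed, not quoted:

      static boolean_t
      zio_taskq_member(zio_t *zio, zio_taskq_type_t q)
      {
              spa_t *spa = zio->io_spa;
              taskq_t *tq = taskq_of_curthread(); /* one tsd_get(), not 21 */

              for (zio_type_t t = 0; t < ZIO_TYPES; t++) {
                      spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];

                      for (uint_t i = 0; i < tqs->stqs_count; i++) {
                              if (tqs->stqs_taskq[i] == tq)
                                      return (B_TRUE);
                      }
              }
              return (B_FALSE);
      }
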
* Make spa_history_zone platform-dependent in kernel (Ryan Moeller, 2020-03-02; 1 file, -0/+1)

  This function should only return "linux" on Linux. Move the kernel
  part of the function out of common code. Fix the tests for FreeBSD.

  Reviewed-by: George Melikov <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10079

* Re-share zfsdev_getminor and zfs_onexit_fd_hold (Matthew Macy, 2020-02-28; 1 file, -0/+1)

  By adding a zfs_file_private accessor to the common interfaces and
  some extensions to FreeBSD platform code it is now possible to share
  the implementations for the aforementioned functions.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #10073

* Linux 5.6 compat: time_t (Brian Behlendorf, 2020-02-27; 2 files, -3/+3)

  As part of the Linux kernel's y2038 changes the time_t type has been
  fully retired. Callers are now required to use the time64_t type.

  Rather than move to the new type, I've removed the few remaining
  places where a time_t is used in the kernel code. They've been
  replaced with a uint64_t which is already how ZFS internally handled
  these values.

  Going forward we should work towards updating the remaining user
  space time_t consumers to the 64-bit interfaces.

  Reviewed-by: Matthew Macy <[email protected]>
  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10052
  Closes #10064

* Linux 5.6 compat: ktime_get_raw_ts64() (Brian Behlendorf, 2020-02-27; 1 file, -0/+5)

  The getrawmonotonic() and getrawmonotonic64() interfaces have been
  fully retired. Update gethrtime() to use the replacement interface
  ktime_get_raw_ts64() which was introduced in the 4.18 kernel.

  Reviewed-by: Tony Hutter <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #10052
  Closes #10064

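  The resulting wrapper is small; a sketch assuming the usual SPL
  gethrtime() shape (ktime_get_raw_ts64() fills a struct timespec64):

      static inline hrtime_t
      gethrtime(void)
      {
              struct timespec64 ts;

              ktime_get_raw_ts64(&ts);
              return (((hrtime_t)ts.tv_sec * NSEC_PER_SEC) + ts.tv_nsec);
      }
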
* Refactor dnode dirty context from dbuf_dirty (Matthew Macy, 2020-02-26; 1 file, -1/+2)

  * Add dedicated dnode_set_dirtyctx routine.
  * Add empty dirty record on destroy assertion.
  * Make much more extensive use of the SET_ERROR macro.

  Reviewed-by: Will Andrews <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matthew Ahrens <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #9924

* Remove zfs_getattr and convoff dead code (Dirkjan Bussink, 2020-02-24; 2 files, -2/+1)

  The `convoff` function is called only in one code path in
  `zfs_space`. Each caller of `zfs_space` is called with a `flock64_t`
  that has `l_whence` set to `SEEK_SET`. This means that `convoff`
  always results in a no-op as the `bfp` parameter has `l_whence` set
  to `SEEK_SET` and `int whence` is `SEEK_SET` as well.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Dirkjan Bussink <[email protected]>
  Closes #10006

* Enable zpool events tunables and tests on FreeBSD (Ryan Moeller, 2020-02-18; 1 file, -0/+1)

  We have made the necessary changes in our module code to expose
  zevents through both devd and the zpool events ioctl. Now the
  tunables can be exposed and zpool events tests can be enabled on both
  platforms.

  A few minor tweaks to the tests were needed to accommodate the way wc
  formats output on FreeBSD.

  zed remains to be ported.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #10008

* Factor out some dbuf subroutines and add state change tracing (Matthew Macy, 2020-02-18; 1 file, -0/+17)

  Create dedicated dbuf_read_hole and dbuf_read_bonus. Additionally,
  add a dtrace probe to allow state change tracing.

  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Will Andrews <[email protected]>
  Reviewed by: Brad Lewis <[email protected]>
  Authored-by: Will Andrews <[email protected]>
  Signed-off-by: Matt Macy <[email protected]>
  Closes #9923

* Prefer org.openzfs for features and properties (Richard Laager, 2020-02-18; 1 file, -2/+2)

  Moving forward, we wish to use org.openzfs (no dash) rather than
  org.open-zfs or org.zfsonlinux for feature GUIDs and property names.
  The existing feature GUIDs cannot be changed.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Ryan Moeller <[email protected]>
  Signed-off-by: Richard Laager <[email protected]>
  Closes #10003

* Support setting user properties in a channel program (Jason King, 2020-02-14; 2 files, -0/+45)

  This adds support for setting user properties in a zfs channel
  program by adding 'zfs.sync.set_prop' and 'zfs.check.set_prop' to the
  ZFS LUA API.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Co-authored-by: Sara Hartse <[email protected]>
  Contributions-by: Jason King <[email protected]>
  Signed-off-by: Sara Hartse <[email protected]>
  Signed-off-by: Jason King <[email protected]>
  Closes #9950

* Remove limit on number of async zio_frees of non-dedup blocks (Matthew Ahrens, 2020-02-14; 1 file, -0/+1)

  The module parameter zfs_async_block_max_blocks limits the number of
  blocks that can be freed by the background freeing of filesystems and
  snapshots (from "zfs destroy"), in one TXG. This is useful when
  freeing dedup blocks, because each zio_free() of a dedup block can
  require an i/o to read the relevant part of the dedup table (DDT),
  and will also dirty that block.

  zfs_async_block_max_blocks is set to 100,000 by default. For the more
  typical case where dedup is not used, this can have a negative
  performance impact on the rate of background freeing (from "zfs
  destroy"). For example, with recordsize=8k, and TXG's syncing once
  every 5 seconds, we can free only 160MB of data per second, which may
  be much less than the rate we can write data.

  This change increases zfs_async_block_max_blocks to be unlimited by
  default. To address the dedup freeing issue, a new tunable is
  introduced, zfs_max_async_dedup_frees, which limits the number of
  zio_free()'s of dedup blocks done by background destroys, per txg.
  The default is 100,000 free's (same as the old
  zfs_async_block_max_blocks default).

  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Matthew Ahrens <[email protected]>
  Closes #10000

* Remove duplicate dbufs accounting (Alexander Motin, 2020-02-13; 1 file, -0/+7)

  Since AVL already has an embedded element counter, use dn_dbufs_count
  only for dbufs not counted there (bonus buffers) and just add them.
  This removes two atomics per dbuf life cycle.

  According to the profiler it reduces time spent by dbuf_destroy()
  inside bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the
  core.

  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Matt Ahrens <[email protected]>
  Signed-off-by: Alexander Motin <[email protected]>
  Sponsored-By: iXsystems, Inc.
  Closes #9949

* zcp: add zfs.sync.bookmark (Christian Schwarz, 2020-02-11; 1 file, -0/+17)

  Add support for bookmark creation and cloning.

  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #9571

* Implement bookmark copying (Christian Schwarz, 2020-02-11; 3 files, -0/+7)

  This feature allows copying existing bookmarks using

      zfs bookmark fs#target fs#newbookmark

  There are some niche use cases for such functionality, e.g. when
  using bookmarks as markers for replication progress.

  Copying redaction bookmarks produces a normal bookmark that cannot be
  used for redacted send (we are not duplicating the redaction object).

  ZCP support for bookmarking (both creation and copying) will be
  implemented in a separate patch based on this work.

  Overview:
  - Terminology:
    - source = existing snapshot or bookmark
    - new/bmark = new bookmark
  - Implement bookmark copying in `dsl_bookmark.c`
    - create new bookmark node
    - copy source's `zbn_phys` to new's `zbn_phys`
    - zero-out redaction object id in copy
  - Extend existing bookmark ioctl nvlist schema to accept bookmarks as
    sources
    - => `dsl_bookmark_create_nvl_validate` is authoritative
  - Use `dsl_dataset_is_before` check for both snapshot and bookmark
    sources
  - Adjust CLI
    - refactor shortname expansion logic in `zfs_do_bookmark`
  - Update man pages
    - warn about redaction bookmark handling
  - Add test cases
    - CLI
    - pyzfs libzfs_core bindings

  Reviewed-by: Matt Ahrens <[email protected]>
  Reviewed-by: Paul Dagnelie <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Christian Schwarz <[email protected]>
  Closes #9571

* Fix zdb -R with 'b' flag (Paul Zuchowski, 2020-02-10; 1 file, -0/+9)

  zdb -R :b fails due to the indirect block being compressed, and the
  'b' and 'd' flags not working in tandem when specified. Fix the flag
  parsing code and create a zfs test for zdb -R block display. Also fix
  the zio flags where the dotted notation for the vdev portion of the
  DVA (i.e. 0.0:offset:length) fails.

  Reviewed-by: Ryan Moeller <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Signed-off-by: Paul Zuchowski <[email protected]>
  Closes #9640
  Closes #9729

* Share some code for spa deadman tunables (Ryan Moeller, 2020-02-10; 1 file, -0/+2)

  We need to do the same thing to update all spas on any OS for these
  tunables, so let's share the code.

  While here, let's match the types of the literals initializing the
  variables with the type of the variable.

  Reviewed-by: Allan Jude <[email protected]>
  Reviewed-by: Brian Behlendorf <[email protected]>
  Reviewed-by: Olaf Faaland <[email protected]>
  Signed-off-by: Ryan Moeller <[email protected]>
  Closes #9964

* ICP: Improve AES-GCM performance (Attila Fülöp, 2020-02-10; 2 files, -1/+14)

  Currently SIMD accelerated AES-GCM performance is limited by two
  factors:

  a. The need to disable preemption and interrupts and save the FPU
     state before using it and to do the reverse when done. Due to the
     way the code is organized (see (b) below) we have to pay this
     price twice for each 16 byte GCM block processed.

  b. Most processing is done in C, operating on single GCM blocks. The
     use of SIMD instructions is limited to the AES encryption of the
     counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
     This leads to the FPU not being fully utilized for crypto
     operations.

  To solve (a) we do crypto processing in larger chunks while owning
  the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
  to make this chunk size tweakable. It defaults to 32 KiB. This step
  alone roughly doubles performance.

  (b) is tackled by porting and using the highly optimized openssl
  AES-GCM assembler routines, which do all the processing (CTR, AES,
  GMULT) in a single routine. Both steps together result in up to 32x
  reduction of the time spent in the en/decryption routines, leading up
  to approximately 12x throughput increase for large (128 KiB) blocks.

  Lastly, this commit changes the default encryption algorithm from
  AES-CCM to AES-GCM when setting the `encryption=on` property.

  Reviewed-By: Brian Behlendorf <[email protected]>
  Reviewed-By: Jason King <[email protected]>
  Reviewed-By: Tom Caputi <[email protected]>
  Reviewed-By: Richard Laager <[email protected]>
  Signed-off-by: Attila Fülöp <[email protected]>
  Closes #9749

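  A hedged sketch of the chunked FPU handling described in (a):
  kfpu_begin()/kfpu_end() are the SPL FPU guards, while
  gcm_mode_encrypt_avx() is a hypothetical stand-in for the ported
  assembler routine, and gcm_ctx_t stands for the ICP GCM context type:

      static int icp_gcm_avx_chunk_size = 32 * 1024;  /* module param */

      static void
      gcm_encrypt_chunked(gcm_ctx_t *ctx, const uint8_t *in, uint8_t *out,
          size_t len)
      {
              while (len > 0) {
                      size_t chunk =
                          MIN(len, (size_t)icp_gcm_avx_chunk_size);

                      /* Hold the FPU once per chunk, not per 16B block. */
                      kfpu_begin();
                      gcm_mode_encrypt_avx(ctx, in, out, chunk); /* assumed */
                      kfpu_end();

                      in += chunk;
                      out += chunk;
                      len -= chunk;
              }
      }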