path: root/module/zfs
Commit log (commit message, author, date, files changed, lines removed/added):
* OpenZFS 8930 - zfs_zinactive: do not remove the node if the filesystem is readonly (Andriy Gapon, 2018-01-11; 1 file changed, -7/+26)

Authored by: Andriy Gapon <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8930
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/93c618e0f4
Closes #7029

* Support -fsanitize=address with --enable-asan (Brian Behlendorf, 2018-01-10; 1 file changed, -2/+5)

When --enable-asan is provided to configure then build all user space
components with -fsanitize=address. For kernel support use the Linux
KASAN feature instead.

https://github.com/google/sanitizers/wiki/AddressSanitizer

When using gcc version 4.8 any test case which intentionally generates
a core dump will fail when using --enable-asan. The default behavior
is to disable core dumps and only newer versions allow this behavior
to be controlled at run time with the ASAN_OPTIONS environment
variable.

Additionally, this patch includes some build system cleanup.

* Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS, and
  AM_LDFLAGS. Any additional flags should be added on a per-Makefile
  basis. The --enable-debug and --enable-asan options apply to all
  user space binaries and libraries.

* Compiler checks consolidated in always-compiler-options.m4 and
  renamed for consistency.

* -fstack-check compiler flag was removed; this functionality is
  provided by asan when configured with --enable-asan.

* Split DEBUG_CFLAGS into DEBUG_CFLAGS, DEBUG_CPPFLAGS, and
  DEBUG_LDFLAGS.

* Moved default kernel build flags into module/Makefile.in and split
  into ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These flags are set
  with the standard ccflags-y kbuild mechanism.

* -Wframe-larger-than checks applied only to binaries or libraries
  which include source files which are built in both user space and
  kernel space. This restriction is relaxed for user space only
  utilities.

* -Wno-unused-but-set-variable applied only to libzfs and libzpool.
  The remaining warnings are the result of an ASSERT using a variable
  which is always declared.

* -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped because
  they are Solaris specific and thus not needed.

* Ensure $GDB is defined as gdb by default in zloop.sh.

Signed-off-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7027

* Use zap_count instead of cached z_size for unlink (Alex Zhuravlev, 2018-01-09; 1 file changed, -1/+15)

As a performance optimization Lustre does not strictly update the
SA_ZPL_SIZE when adding/removing from non-directory entries. This
results in entries which cannot be removed through the ZPL layer even
though the ZAP is empty and safe to remove. Resolve this issue by
checking zap_count() directly instead of relying on the cached
SA_ZPL_SIZE. Micro-benchmarks show no significant performance impact
due to the additional overhead of using zap_count().

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Alex Zhuravlev <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #7019

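For illustration, a minimal sketch of the emptiness check described above (the helper name is hypothetical; only zap_count() is the interface this change relies on):

    /*
     * Hypothetical helper: ask the ZAP itself whether the directory is
     * empty instead of trusting the cached SA_ZPL_SIZE, which an
     * out-of-band consumer such as Lustre may not keep up to date.
     */
    static boolean_t
    dir_zap_is_empty(objset_t *os, uint64_t dir_obj)
    {
        uint64_t count = 0;

        if (zap_count(os, dir_obj, &count) != 0)
            return (B_FALSE);   /* be conservative on error */

        return (count == 0);
    }
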
* Fix -fsanitize=address memory leak (DHE, 2018-01-09; 1 file changed, -2/+4)

kmem_alloc(0, ...) in userspace returns a leakable pointer.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: DHE <[email protected]>
Issue #6941

* Fix ARC hit rate (Brian Behlendorf, 2018-01-08; 2 files changed, -2/+54)

When the compressed ARC feature was added in commit d3c2ae1 the method
of reference counting in the ARC was modified. As part of this
accounting change the arc_buf_add_ref() function was removed entirely.

This would have been fine but the arc_buf_add_ref() function served a
second undocumented purpose of updating the ARC access information
when taking a hold on a dbuf. Without this logic in place a cached
dbuf would not migrate its associated arc_buf_hdr_t to the MFU list.
This would negatively impact the ARC hit rate, particularly on systems
with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6171
Closes #6852
Closes #6989

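A minimal sketch of the reinstated behavior (hypothetical wrapper; the real change lives inside dbuf_hold(), and only arc_buf_access() comes from this patch):

    /*
     * Hypothetical wrapper: after taking a hold on an already-cached
     * dbuf, record an ARC access so the backing arc_buf_hdr_t can
     * migrate from the MRU to the MFU list.
     */
    static void
    dbuf_hold_record_access(arc_buf_t *buf)
    {
        if (buf != NULL)
            arc_buf_access(buf);
    }
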
* OpenZFS 8909 - 8585 can cause a use-after-free kernel panic (Prakash Surya, 2017-12-28; 2 files changed, -100/+129)

Authored by: Prakash Surya <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Prakash Surya <[email protected]>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with
either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970
    addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on its waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that
LWB may be freed and can result in a use-after-free situation where
the stale lwb pointer stored in the `zil_commit_waiter_t` structure of
the thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed
by `zil_free_lwb`, which will result in a use-after-free situation
when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is
called.

This race condition is prevented in `zil_close` by calling
`zil_commit` before `zil_free_lwb` is called, which will ensure all
outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or
`LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the
lwb's are freed (`zil_commit` will not return until all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only
calling `zil_free_lwb` for lwb's that do not have their `lwb_buf`
pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a
non-null value for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug. At first
glance, it would appear as though this would not be true because
`zil_suspend` will call `zil_commit`, just like `zil_close`, but the
problem is that `zil_suspend` will set the zilog's `zl_suspend` field
prior to calling `zil_commit`.

Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will
take a special branch of logic and use `txg_wait_synced` instead of
performing the normal `zil_commit` logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for its lwb to timeout, it will
maintain a non-null value for its `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are no outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used by `zil_commit_waiter_timeout`.

SOLUTION
========

The solution to this is to ensure all outstanding lwb's complete
before calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This
patch accomplishes this goal by forcing the normal `zil_commit` logic
when called from `zil_suspend`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use
the normal logic of waiting/issuing lwb's to disk before it returns.
As a result, any lwb's outstanding when `zil_commit_impl` is called
will be guaranteed to reach the `LWB_STATE_DONE` state by the time it
returns.

Further, no new lwb's will be created via `zil_commit` since the
zilog's `zl_suspend` flag will be set. This will force all new callers
of `zil_commit` to use `txg_wait_synced` instead of creating and
issuing new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this
race condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940

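A rough sketch of the solution described above (simplified; zil_commit_impl, zl_suspend, and the lwb states are the names used in the commit message, the surrounding zil_suspend() code is omitted):

    /*
     * Simplified sketch of zil_suspend()'s commit step: because
     * zl_suspend is already set, calling zil_commit() would short-cut
     * to txg_wait_synced(), so the internal entry point is used to run
     * the normal commit path and wait for every outstanding lwb to
     * reach LWB_STATE_DONE before zil_destroy() frees them.
     */
    zil_commit_impl(zilog, 0);
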
* Call commit callbacks from the tail of the list (lidongyang, 2017-12-22; 1 file changed, -1/+1)

Our zfs backed Lustre MDT had soft lockups while under heavy metadata
workloads while handling transaction callbacks from osd_zfs. The
problem is zfs is not taking advantage of the fast path in Lustre's
trans callback handling, where Lustre will skip the calls to
ptlrpc_commit_replies() when it already saw a higher transaction
number. This patch corrects this; it also has a positive impact on
metadata performance on Lustre with osd_zfs, plus some cleanup in the
headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Li Dongyang <[email protected]>
Closes #6986

* Support re-prioritizing asynchronous prefetches (Tom Caputi, 2017-12-21; 4 files changed, -35/+112)

When sequential scrubs were merged, all calls to arc_read() (including
prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ. Unfortunately, this
behaves badly with an existing issue where prefetch IOs cannot be
re-prioritized after they are issued. The result is that synchronous
reads end up in the same vdev_queue as the scrub IOs and can have (in
some workloads) multiple seconds of latency.

This patch incorporates 2 changes. The first ensures that all scrub
IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue code to
differentiate between these I/Os and user prefetches. Second, this
patch introduces zio_change_priority() to provide the missing
capability to upgrade a zio's priority.

Reviewed by: George Wilson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #6921
Closes #6926

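A minimal sketch of how the new interface can be used (hypothetical caller; only zio_change_priority() and the priority constants come from the patch):

    /*
     * Hypothetical caller: when a waiter actually needs data that is
     * still queued as an asynchronous prefetch, upgrade the in-flight
     * zio so it no longer sits behind scrub I/O in the vdev queue.
     */
    static void
    upgrade_prefetch_priority(zio_t *pio)
    {
        if (pio->io_priority == ZIO_PRIORITY_ASYNC_READ)
            zio_change_priority(pio, ZIO_PRIORITY_SYNC_READ);
    }
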
* Fix multihost stale cache file import (Brian Behlendorf, 2017-12-18; 1 file changed, -4/+8)

When the multihost property is enabled it should be impossible to
import an active pool even using the force (-f) option. This patch
prevents a forced import from succeeding when importing with a stale
cache file.

The root cause of the problem is that the kernel modules trusted the
hostid provided in the configuration. This is always correct when the
configuration is generated by scanning for the pool. However, when
using an existing cache file the hostid could be stale, which would
result in the activity check being skipped. Resolve the issue by
always using the hostid read from the label configuration where the
best uberblock was found.

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6933
Closes #6971

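For illustration, a hedged sketch of consulting the label configuration instead of the caller-supplied one (ZPOOL_CONFIG_HOSTID is the real nvlist key; label_config and the surrounding lookup are simplifications):

    /*
     * Prefer the hostid recorded in the on-disk label (taken from the
     * vdev with the best uberblock) over the one found in a possibly
     * stale cachefile configuration.
     */
    uint64_t hostid = 0;

    (void) nvlist_lookup_uint64(label_config, ZPOOL_CONFIG_HOSTID, &hostid);
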
* Suppress incorrect objtool warnings (Brian Behlendorf, 2017-12-07; 1 file changed, -0/+5)

Suppress incorrect warnings from versions of objtool which are not
aware of the x86 EVEX prefix instructions used for AVX512.

    module/zfs/vdev_raidz_math_avx512bw.o: warning: objtool:
    <func+offset>: can't find jump dest instruction at .text

Reviewed-by: Don Brady <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6928

* OpenZFS 8603 - rename zilog's "zl_writer_lock" to "zl_issuer_lock" (Prakash Surya, 2017-12-06; 1 file changed, -22/+22)

This is a purely cosmetic change. The zilog's "zl_writer_lock" field
is being renamed to "zl_issuer_lock" to try and make the code easier
to understand; no other changes are made.

Authored by: Prakash Surya <[email protected]>
Reviewed by: C Fraire <[email protected]>
Approved by: Dan McDonald <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Giuseppe Di Natale <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8603
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2daf06546b
Closes #6927

* OpenZFS 8585 - improve batching done in zil_commit() (Prakash Surya, 2017-12-05; 6 files changed, -286/+1331)

Authored by: Prakash Surya <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Prakash Surya <[email protected]>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

1. When there's outstanding ZIL blocks being written (i.e. there's
   already a "writer thread" in progress), then any new calls to
   zil_commit() will block waiting for the currently outstanding ZIL
   blocks to complete. The blocks written for each "writer thread" are
   coined a "batch", and there can only ever be a single "batch" being
   written at a time. When a batch is being written, any new ZIL
   transactions will have to wait for the next batch to be written,
   which won't occur until the current batch finishes.

   As a result, the underlying storage may not be used as efficiently
   as possible. While "new" threads enter zil_commit() and are blocked
   waiting for the next batch, it's possible that the underlying
   storage isn't fully utilized by the current batch of ZIL blocks. In
   that case, it'd be better to allow these new threads to generate
   (and issue) a new ZIL block, such that it could be serviced by the
   underlying storage concurrently with the other ZIL blocks that are
   being serviced.

2. Any call to zil_commit() must wait for all ZIL blocks in its
   "batch" to complete, prior to zil_commit() returning. The size of
   any given batch is proportional to the number of ZIL transactions
   in the queue at the time that the batch starts processing the
   queue; which doesn't occur until the previous batch completes.
   Thus, if there's a lot of transactions in the queue, the batch
   could be composed of many ZIL blocks, and each call to zil_commit()
   will have to wait for all of these writes to complete (even if the
   thread calling zil_commit() only cared about one of the
   transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

3. If a given batch takes longer to complete than normal, this results
   in larger batch sizes, which then take longer to complete and
   further drive up the latency of zil_commit(). This can occur for a
   number of reasons, including (but not limited to): transient
   changes in the workload, and storage latency irregularities.

Solution
========

The solution attempted by this change has the following goals:

1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
   already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
   transactions, without introducing significant latency to any
   individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

1. new ZIL blocks can be generated and issued, while there's already
   outstanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the
   underlying storage latency, rather than the incoming synchronous
   workload.

Porting Notes
=============

Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx
structure is kept around until the data associated with the itx is
considered to be safe on disk; this is so that the itx's callback can
be called after the data is committed to stable storage. Since OpenZFS
doesn't have this itx callback mechanism, it's able to destroy the itx
structure immediately after the itx is committed to an lwb (before the
lwb is written to disk).

To support this difference, and to ensure the itx's callbacks can
still be called after the itx's data is on disk, a few changes had to
be made:

* A list of itxs was added to the lwb structure. This list contains
  all of the itxs that have been committed to the lwb, such that the
  callbacks for these itxs can be called from
  zil_lwb_flush_vdevs_done(), after the data for the itxs is committed
  to disk.

* A list of itxs was added on the stack of the
  zil_process_commit_list() function; the "nolwb_itxs" list. In some
  circumstances, an itx may not be committed to an lwb (e.g. if
  allocating the "next" ZIL block on disk fails), so this list is used
  to keep track of which itxs fall into this state, such that their
  callbacks can be called after the ZIL's writer pipeline is
  "stalled".

* The logic to actually call the itx's callback was moved into the
  zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
  were effectively performing the same logic (i.e. if callback is
  non-null, call the callback), it seemed like useful code cleanup to
  consolidate this logic into a single function.

Additionally, the existing Linux tracepoint infrastructure dealing
with the ZIL's probes and structures had to be updated to reflect
these code changes. Specifically:

* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
  removed from "trace_zil.h" as well.

* Some of the zilog structure's fields were removed, which affected
  the tracepoint definitions of the structure.

* New tracepoints had to be added for the following 3 new probes:
  * zil__process__commit__itx
  * zil__process__normal__itx
  * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566

* Fix NFS sticky bit permission denied error (Brian Behlendorf, 2017-12-04; 1 file changed, -3/+2)

When zfs_sticky_remove_access() was originally adapted for Linux a
typo was made which altered the intended behavior. As described in
the block comment, the intended behavior is that permission should be
granted when the entry is a regular file and you have write access.
That is, S_ISREG should have been used instead of S_ISDIR.

Restricting permission to regular files made good sense for older
systems where setting the bit on executable files would instruct the
system to save the program's text segment on the swap device. On
modern systems this behavior has been replaced by the sticky bit
acting as a restricted deletion flag and the plain file restriction
has been relaxed.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6889
Closes #6910

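A simplified sketch of the intended check (variable names are hypothetical; the S_ISREG/S_ISDIR distinction is the point of the fix):

    /*
     * Under the sticky bit, removal is permitted when the caller owns
     * the directory or the entry, or when the entry is a regular file
     * the caller can write.  The typo tested S_ISDIR() here instead
     * of S_ISREG().
     */
    if (uid == dir_owner_uid || uid == entry_owner_uid ||
        (S_ISREG(entry_mode) && caller_has_write_access))
        return (0);             /* access granted */

    return (SET_ERROR(EPERM));
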
* Preserve itx alloc size for zio_data_buf_free() (Brian Behlendorf, 2017-12-04; 1 file changed, -2/+5)

Using zio_data_buf_alloc() to allocate the itx's may be unsafe because
the itx->itx_lr.lrc_reclen field is not constant from allocation to
free. Using a different itx->itx_lr.lrc_reclen size in
zio_data_buf_free() can result in the allocation being returned to the
wrong kmem cache. This issue can be avoided entirely by storing the
allocation size in itx->itx_size and using that for
zio_data_buf_free().

Reviewed by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6912

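A rough sketch of the allocate/free pairing described above (simplified; field and function names follow the commit message, the surrounding itx setup is omitted):

    /* Allocation: remember the exact size handed to the allocator. */
    size_t itxsize = offsetof(itx_t, itx_lr) + lrsize;
    itx_t *itx = zio_data_buf_alloc(itxsize);

    itx->itx_lr.lrc_reclen = lrsize;
    itx->itx_size = itxsize;

    /* Free: use the recorded size, not the mutable lrc_reclen. */
    zio_data_buf_free(itx, itx->itx_size);
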
* Unbreak the scan status ABI (Tom Caputi, 2017-11-30; 1 file changed, -5/+5)

When d4a72f23 was merged, pss_pass_issued was incorrectly added to the
middle of the pool_scan_stat_t structure instead of the end. This
patch simply corrects this issue.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #6909

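To illustrate the rule being restored (hypothetical, heavily trimmed struct; only pss_pass_issued is the field from the patch):

    /*
     * Structures exposed through a stable ABI may only grow at the
     * end: existing consumers keep reading the offsets they were
     * compiled against, and new fields are appended after them.
     */
    typedef struct pool_scan_stat_sketch {
        uint64_t pss_func;          /* existing fields, order fixed */
        uint64_t pss_state;
        /* ... remaining previously shipped fields ... */
        uint64_t pss_pass_issued;   /* new field, appended at the end */
    } pool_scan_stat_sketch_t;
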
* Fix 'zfs get {user|group}objused@' functionality (LOLi, 2017-11-29; 1 file changed, -1/+1)

Fix a regression accidentally introduced in 1b81ab4 that prevents
'zfs get {user|group}objused@' from correctly reporting the requested
value. Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to
verify this functionality.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6908

* Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT (Mark Wright, 2017-11-28; 4 files changed, -33/+33)

Fix build errors with gcc 7.2.0 on Gentoo with kernel 4.14 built with
CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as:

    module/nvpair/nvpair.c:2810:2: error: positional initialization of
    field in 'struct' declared with 'designated_init' attribute
    [-Werror=designated-init]
       nvs_native_nvlist,
       ^~~~~~~~~~~~~~~~~

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Mark Wright <[email protected]>
Closes #5390
Closes #6903

* Update for cppcheck v1.80 (Brian Behlendorf, 2017-11-18; 4 files changed, -11/+12)

Resolve new warnings and errors from cppcheck v1.80.

* [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol
* [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype
* [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link
* [module/zfs/abd.c:353]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:353]: (error) Uninitialized variable: i
* [module/zfs/abd.c:385]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:385]: (error) Uninitialized variable: i
* [module/zfs/abd.c:553]: (error) Uninitialized variable: i
* [module/zfs/abd.c:553]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:763]: (error) Uninitialized variable: i
* [module/zfs/abd.c:763]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page
* [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value
* [module/zfs/zvol.c:208]: (error) Uninitialized variable: p

Convert the following suppression to inline.

* [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov

Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these
macros will never be defined until this functionality is implemented.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6879

* Fix ARC pointer overrun (DeHackEd, 2017-11-17; 1 file changed, -9/+11)

Only access the `b_crypt_hdr` field of an ARC header if the content
is encrypted.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: DHE <[email protected]>
Closes #6877

* Sequential scrub and resilvers (Tom Caputi, 2017-11-15; 14 files changed, -644/+2669)

Currently, scrubs and resilvers can take an extremely long time to
complete. This is largely due to the fact that zfs scans process pools
in logical order, as determined by each block's bookmark. This makes
sense from a simplicity perspective, but blocks in zfs are often
scattered randomly across disks, particularly due to zfs's
copy-on-write mechanisms.

This patch improves performance by splitting scrubs and resilvers into
a metadata scanning phase and an IO issuing phase. The metadata scan
reads through the structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing phase will
then issue the scrub I/Os as sequentially as possible, greatly
improving performance.

This patch also updates and cleans up some of the scan code which has
not been updated in several years.

Reviewed-by: Brian Behlendorf <[email protected]>
Authored-by: Saso Kiselkov <[email protected]>
Authored-by: Alek Pinchuk <[email protected]>
Authored-by: Tom Caputi <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #3625
Closes #6256

* Fix dirty check in dmu_offset_next() (Brian Behlendorf, 2017-11-15; 1 file changed, -6/+4)

The correct way to determine if a dnode is dirty is to check if any of
the dn->dn_dirty_link's are active. Relying solely on the
dn->dn_dirtyctx can result in the dnode being mistakenly reported as
clean.

Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3125
Closes #6867

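A minimal sketch of the check described above (the helper name is hypothetical; dn_dirty_link, TXG_SIZE, and list_link_active() are the existing primitives):

    static boolean_t
    dnode_is_dirty_sketch(dnode_t *dn)
    {
        int i;

        /* Dirty if any per-txg dirty link is currently on a list. */
        for (i = 0; i < TXG_SIZE; i++) {
            if (list_link_active(&dn->dn_dirty_link[i]))
                return (B_TRUE);
        }
        return (B_FALSE);
    }
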
* Fix truncate(2) mtime and ctime handling (LOLi, 2017-11-13; 1 file changed, -2/+2)

On Linux, ftruncate(2) always changes the file timestamps, even if the
file size is not changed. However, in case of a successful
truncate(2), the timestamps are updated only if the file size changes.

This translates to the VFS calling the ZFS Posix Layer "setattr"
function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally
set on the iattr mask only when doing a ftruncate(2), while the
truncate(2) is left to the filesystem implementation to be dealt with.

This behaviour is consistent with POSIX:2004/SUSv3 specifications
where there's no explicit requirement for file size changes to update
the timestamps only for ftruncate(2):

http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html

This has been later updated in POSIX:2008/SUSv4 where, for both
truncate(2)/ftruncate(2), there's no mention of this size change
requirement:

http://austingroupbugs.net/view.php?id=489
http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Unfortunately the Linux VFS is still calling into the ZPL without
ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by
explicitly updating the timestamps when detecting the ATTR_SIZE bit,
which is always set in do_truncate(), on the iattr mask.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6811
Closes #6819

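A minimal sketch of the fix described above (simplified; ATTR_SIZE, ATTR_MTIME and ATTR_CTIME are the Linux iattr flags, the surrounding setattr handler is abbreviated):

    /*
     * truncate(2) reaches the ZPL with only ATTR_SIZE set, so force
     * the timestamp update that ftruncate(2) would have requested.
     */
    if (ia->ia_valid & ATTR_SIZE)
        ia->ia_valid |= ATTR_MTIME | ATTR_CTIME;
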
* OpenZFS 7531 - Assign correct flags to prefetched buffers (benrubson, 2017-11-11; 1 file changed, -0/+12)

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Authored by: abraunegg <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/468008cb

* Long hold the dataset during upgrade (Arkadiusz Bubała, 2017-11-10; 3 files changed, -9/+24)

If the receive or rollback is performed while the filesystem is
upgrading, the objset may be evicted in
`dsl_dataset_clone_swap_sync_impl`. This will lead to a NULL pointer
dereference when the upgrade tries to access the evicted objset.

This commit adds a long hold of the dataset during the whole upgrade
process. The receive and rollback will return an EBUSY error until the
upgrade is finished.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Arkadiusz Bubała <[email protected]>
Closes #5295
Closes #6837

* Fix encryption root hierarchy issue (Tom Caputi, 2017-11-08; 1 file changed, -1/+2)

After doing a recursive raw receive, zfs userspace performs a final
pass to adjust the encryption root hierarchy as needed. Unfortunately,
the FORCE_INHERIT ioctl had a bug which caused the encryption root to
always be assigned to the direct parent instead of the inheriting
parent. This patch simply fixes this issue.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #6847
Closes #6848

* Handle compressed buffers in __dbuf_hold_impl() (Tim Chase, 2017-11-08; 2 files changed, -13/+45)

In __dbuf_hold_impl(), if a buffer is currently syncing and is still
referenced from db_data, a copy is made in case it is dirtied again in
the txg. Previously, the buffer for the copy was simply allocated with
arc_alloc_buf() which doesn't handle compressed or encrypted buffers
(which are a special case of a compressed buffer). The result was
typically an invalid memory access because the newly-allocated buffer
was of the uncompressed size.

This commit fixes the problem by handling the 2 compressed cases,
encrypted and unencrypted, respectively, with arc_alloc_raw_buf() and
arc_alloc_compressed_buf().

Although using the proper allocation functions fixes the invalid
memory access by allocating a buffer of the compressed size, another
unrelated issue made it impossible to properly detect compressed
buffers in the first place. The header's compression flag was set to
ZIO_COMPRESS_OFF in arc_write() when it was possible that an attached
buffer was actually compressed. This commit adds logic to only set
ZIO_COMPRESS_OFF in the non-ZIO_RAW case which will handle both cases
of compressed buffers (encrypted or unencrypted).

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #5742
Closes #6797

* OpenZFS 8607 - variable set but not used (Brian Behlendorf, 2017-11-08; 2 files changed, -13/+11)

Reviewed by: Yuri Pankov <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Authored by: Toomas Soome <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8607
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b852c2f5
Closes #6842

* Bug fix in qat_compress.c when compressed size is < 4KB (wli5, 2017-11-07; 1 file changed, -4/+4)

When the 128KB block is compressed to less than 4KB, the pointer to
the Footer is not at the end of the compressed buffer; that's because
the Header offset was added twice for this case. So there is a gap
between the Footer and the compressed buffer.

1. Always compute the Footer pointer address from the start of the
   last page.
2. Remove the unused workaround code which has been verified fixed
   with the latest driver and this fix.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Weigang Li <[email protected]>
Closes #6827

* Build regression in c89 cleanups (Don Brady, 2017-11-07; 1 file changed, -2/+3)

Fixed build regression in non-debug builds from recent cleanups of
c89 workarounds.

Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #6832

* Undo c89 workarounds to match with upstream (Don Brady, 2017-11-04; 44 files changed, -746/+414)

With PR 5756 the zfs module now supports c99 and the remaining past
c89 workarounds can be undone.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #6816

* OpenZFS 8081 - Compiler warnings in zdb (Brian Behlendorf, 2017-10-27; 7 files changed, -61/+95)

Fix compiler warnings in zdb. With these changes, FreeBSD can compile
zdb with all compiler warnings enabled save -Wunused-parameter.

usr/src/cmd/zdb/zdb.c
usr/src/cmd/zdb/zdb_il.c
usr/src/uts/common/fs/zfs/sys/sa.h
usr/src/uts/common/fs/zfs/sys/spa.h
    Fix numerous warnings, including:
    * const-correctness
    * shadowing global definitions
    * signed vs unsigned comparisons
    * missing prototypes, or missing static declarations
    * unused variables and functions
    * Unreadable array initializations
    * Missing struct initializers

usr/src/cmd/zdb/zdb.h
    Add a header file to declare common symbols

usr/src/lib/libzpool/common/sys/zfs_context.h
usr/src/uts/common/fs/zfs/arc.c
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/spa.c
usr/src/uts/common/fs/zfs/txg.c
    Add a function prototype for zk_thread_create, and ensure that
    every callback supplied to this function actually matches the
    prototype.

usr/src/cmd/ztest/ztest.c
usr/src/uts/common/fs/zfs/sys/zil.h
usr/src/uts/common/fs/zfs/zfs_replay.c
usr/src/uts/common/fs/zfs/zvol.c
    Add a function prototype for zil_replay_func_t, and ensure that
    every function of this type actually matches the prototype.

usr/src/uts/common/fs/zfs/sys/refcount.h
    Change FTAG so it discards any constness of __func__, necessary
    since existing APIs expect it passed as void *.

Porting Notes:
- Many of these fixes have already been applied to Linux. For
  consistency the OpenZFS version of a change was applied if the
  warning was addressed in an equivalent but different fashion.

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Authored by: Alan Somers <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8081
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8a
Closes #6787

* ZFS send fails to dump objects larger than 128PiB (LOLi, 2017-10-26; 3 files changed, -17/+22)

When dumping objects larger than 128PiB it's possible for do_dump() to
miscalculate the FREE_RECORD offset due to an integer overflow
condition: this prevents the receiving end from correctly restoring
the dumped object.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes #6760

* OpenZFS 8558, 8602 - lwp_create() returns EAGAIN (Brian Behlendorf, 2017-10-26; 2 files changed, -9/+56)

8558 lwp_create() returns EAGAIN on system with more than 80K ZFS
filesystems

On a system with more than 80K ZFS filesystems, we've seen cases where
lwp_create() will start to fail by returning EAGAIN. The problem
being, for each of those 80K ZFS filesystems, a taskq will be created
for each dataset as part of the ZIL for each dataset.

Porting Notes:
- The new nomem taskq kstat was dropped.
- Added module options and documentation for new tunings
  zfs_zil_clean_taskq_nthr_pct, zfs_zil_clean_taskq_minalloc,
  zfs_zil_clean_taskq_maxalloc, and zfs_sync_taskq_batch_pct.

Reviewed by: George Wilson <[email protected]>
Reviewed by: Sebastien Roy <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Authored by: Prakash Surya <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8558
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/216d772

8602 remove unused "dp_early_sync_tasks" field from "dsl_pool"
structure

Reviewed by: Serapheim Dimitropoulos <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Authored by: Prakash Surya <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/8602
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2bcb545
Closes #6779

* Added no_scrub_restart flag to zpool reopen (Arkadiusz Bubała, 2017-10-26; 1 file changed, -9/+33)

Added -n flag to zpool reopen that allows a running scrub operation to
continue if there is a device with Dirty Time Log. By default, if a
component device has a DTL and zpool reopen is executed, all running
scan operations will be restarted.

Added functional tests for `zpool reopen`. Tests cover the following
scenarios:
* `zpool reopen` without arguments,
* `zpool reopen` with pool name as argument,
* `zpool reopen` while scrubbing,
* `zpool reopen -n` while scrubbing,
* `zpool reopen -n` while resilvering,
* `zpool reopen` with bad arguments.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Arkadiusz Bubała <[email protected]>
Closes #6076
Closes #6746

* Emit history events for 'zpool create' (Brian Behlendorf, 2017-10-23; 2 files changed, -20/+23)

History commands and events were being suppressed for the 'zpool
create' command since the history object did not yet exist. Create
the object earlier so this history doesn't get lost.

Split the pool_destroy event into pool_destroy and pool_export so
they may be distinguished.

Updated events_001_pos and events_002_pos test cases. They now check
for the expected history events and were reworked to be more
reliable.

Reviewed-by: Nathaniel Clark <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6712
Closes #6486

* Support integration with new QAT products (wli5, 2017-10-20; 1 file changed, -3/+5)

Support integration with new QAT products: Intel(R) C62x Chipset, or
Atom(R) C3000 Processor Product Family SoC:

1. Detect new file name in auto-conf.
2. Change MAX_INSTANCES to 48.
3. Change "num_inst" to U16 to clean up a build warning.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Weigang Li <[email protected]>
Closes #6767

* Remove vn_rename and vn_remove dependency (Brian Behlendorf, 2017-10-19; 1 file changed, -9/+26)

The only place vn_rename and vn_remove are used is when writing out an
updated pool configuration file. By truncating the file instead of
renaming and removing it we can avoid having to implement these
interfaces entirely. Functionally an empty cache file is treated the
same as a missing cache file. This is particularly advantageous
because the Linux kernel has never provided a way to reliably
implement vn_rename and vn_remove.

The cachefile_004_pos.ksh test case was updated to understand that an
empty cache file is the same as a missing one.

The zfs-import-* systemd service files were not updated to use
ConditionFileNotEmpty in place of ConditionPathExists. This means that
after exporting all pools and rebooting new pools will not be scanned
for on the next boot. This small change should not impact normal usage
since pools are not exported as part of a normal shutdown.

Documentation was updated accordingly.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Arkadiusz Bubała <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes zfsonlinux/spl#648
Closes #6753

* Fix ASSERT in dmu_free_long_object_raw() (Tom Caputi, 2017-10-18; 1 file changed, -3/+4)

This small patch fixes an issue where dmu_free_long_object_raw() calls
dnode_hold() after freeing the dnode a line above.

Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Closes #6766

* Fix for #6714 (Tom Caputi, 2017-10-11; 1 file changed, -2/+2)

This 2 line patch fixes a possible integer overflow reported by grsec.

Signed-off-by: Tom Caputi <[email protected]>

* Fix for #6706 (Tom Caputi, 2017-10-11; 2 files changed, -18/+72)

This patch resolves an issue where raw sends would fail to send
encryption parameters if the wrapping key was unloaded and reloaded
before the data was sent and the dataset was not an encryption root.
The code attempted to look up the values from the wrapping key which
was not being initialized upon reload. This change forces the code to
look up the correct value from the encryption root's DSL Crypto Key.

Unfortunately, this issue led to the on-disk DSL Crypto Key for some
non-encryption root datasets being left with zeroed out encryption
parameters. However, this should not present a problem since these
values are never looked at and are overwritten upon changing keys.

This patch also fixes an issue where raw, resumable sends were not
being cleaned up appropriately if an invalid DSL Crypto Key was
received.

Signed-off-by: Tom Caputi <[email protected]>

* Fix for #6703 (Tom Caputi, 2017-10-11; 1 file changed, -18/+6)

This patch resolves an issue where spa_keystore_change_key_sync_impl()
incorrectly recursed into clone DSL Directories while recursively
rewrapping encryption keys. Clones share keys with their origins, so
this logic was incorrect.

Signed-off-by: Tom Caputi <[email protected]>

* Fixes for #6639 (Tom Caputi, 2017-10-11; 6 files changed, -61/+143)

Several issues were uncovered by running stress tests with zfs
encryption and raw sends in particular. The issues and their
associated fixes are as follows:

* arc_read_done() has the ability to chain several requests for the
  same block of data via the arc_callback_t struct. In these cases,
  the ARC would only use the first request's dsobj from the bookmark
  to decrypt the data. This is problematic because the first request
  might be a prefetch zio which is able to handle the key not being
  loaded, while the second might use a different key that it is sure
  will work. The fix here is to pass the dsobj with each individual
  arc_callback_t so that each request can attempt to decrypt the data
  separately.

* DRR_FREE and DRR_FREEOBJECT records in a send file were not having
  their transactions properly tagged as raw during raw sends, which
  caused a panic when the dbuf code attempted to decrypt these blocks.

* traverse_prefetch_metadata() did not properly set
  ZIO_FLAG_SPECULATIVE when issuing prefetch IOs.

* Added a few asserts and code cleanups to ensure these issues are
  more detectable in the future.

Signed-off-by: Tom Caputi <[email protected]>

* Encryption patch follow-up (Tom Caputi, 2017-10-11; 10 files changed, -223/+286)

* PBKDF2 implementation changed to OpenSSL implementation.
* HKDF implementation moved to its own file and tests added to ensure
  correctness.
* Removed libzfs's now unnecessary dependency on libzpool and libicp.
* Ztest can now create and test encrypted datasets. This is currently
  disabled until issue #6526 is resolved, but otherwise functions as
  advertised.
* Several small bug fixes discovered after enabling ztest to run on
  encrypted datasets.
* Fixed coverity defects added by the encryption patch.
* Updated man pages for encrypted send / receive behavior.
* Fixed a bug where encrypted datasets could receive
  DRR_WRITE_EMBEDDED records.
* Minor code cleanups / consolidation.

Signed-off-by: Tom Caputi <[email protected]>

* Relax ASSERT for #6526 (Tom Caputi, 2017-10-11; 1 file changed, -1/+2)

This patch resolves a minor issue where an ASSERT in
metaslab_passivate() that only applies to non weight-based metaslabs
was erroneously applied to all metaslabs.

Signed-off-by: Tom Caputi <[email protected]>

* Fix coverity defects: CID 147474 (Tobin Harding, 2017-10-10; 1 file changed, -1/+2)

CID 147474: Logically dead code (DEADCODE)

Remove the ternary operator and return `error` directly. Currently the
return value is derived from a ternary operator whose conditional is
always true. The ternary operator is therefore redundant, i.e. dead
code.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tobin C. Harding <[email protected]>
Closes #6723

* Skip FREEOBJECTS for objects which can't exist (Fabian Grünbichler, 2017-10-10; 1 file changed, -0/+16)

When sending an incremental stream based on a snapshot, the receiving
side must have the same base snapshot. Thus we do not need to send
FREEOBJECTS records for any objects past the maximum one which exists
locally.

This allows us to send incremental streams (again) to older ZFS
implementations (e.g. ZoL < 0.7) which actually try to free all
objects in a FREEOBJECTS record, instead of bailing out early.

Reviewed by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Closes #5699
Closes #6507
Closes #6616

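A rough sketch of the clamp this implies (simplified; the exact placement and bound computation in the dump code may differ):

    /*
     * Largest object number that can exist on the sending side; the
     * receiver shares the base snapshot, so nothing past this point
     * needs a FREEOBJECTS record.
     */
    uint64_t maxobj = DNODES_PER_BLOCK *
        (DMU_META_DNODE(os)->dn_maxblkid + 1);

    if (maxobj > 0) {
        if (maxobj <= firstobj)
            return (0);                     /* nothing to send */
        if (maxobj < firstobj + numobjs)
            numobjs = maxobj - firstobj;    /* trim the range */
    }
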
* Free objects when receiving full stream as clone (Fabian Grünbichler, 2017-10-10; 1 file changed, -1/+54)

All objects after the last written or freed object are not supposed to
exist after receiving the stream. Free them accordingly, as if a
freeobjects record for them had been included in the stream.

Reviewed by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Closes #5699
Closes #6507
Closes #6616

* Fix ARC behavior on 32-bit systems (Brian Behlendorf, 2017-10-10; 1 file changed, -27/+58)

With the addition of the ABD changes consumption of the virtual
address space has been greatly reduced. This exposed an issue on
CONFIG_HIGHMEM systems where free memory was being calculated
incorrectly. Functionally this didn't cause any major problems prior
to ABD because a lack of available virtual address space was used as
an indicator of low memory.

This patch makes the following changes to address the issue and in
the process realigns the code further with OpenZFS. There are no
substantive changes in behavior for 64-bit systems.

* Added CONFIG_HIGHMEM case to the arc_all_memory() and
  arc_free_memory() functions to only consider low memory pages on
  CONFIG_HIGHMEM systems.

* The arc_free_memory() function was updated to return bytes instead
  of pages to be consistent with the other helper functions. In user
  space we make up some reasonable values since currently only
  testing is performed in this context.

* Adds three new values to the arcstats kstat to provide visibility
  into the ARC's assessment of the memory situation:
  memory_all_bytes, memory_free_bytes, and memory_available_bytes.

* Added kmem_reap() call to arc_available_memory() for 32-bit builds
  to realign code with OpenZFS.

* Reduced size of test file in /async_destroy_001_pos.ksh to speed up
  test case. Multiple txgs are still required.

* Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos to
  TEST_BASE_DIR location to speed up test cases.

Reviewed-by: David Quigley <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #5352
Closes #6734

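A minimal sketch of the CONFIG_HIGHMEM case described in the first bullet above (simplified; the real helpers also cover user space builds and other kernel variants, and the function name here is hypothetical):

    static uint64_t
    arc_all_memory_sketch(void)
    {
    #ifdef CONFIG_HIGHMEM
        /* Only low memory pages are usable for ARC buffers. */
        return (ptob(totalram_pages - totalhigh_pages));
    #else
        return (ptob(totalram_pages));
    #endif
    }
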
* Fix abdstats kstat on 32-bit systems (Brian Behlendorf, 2017-10-06; 1 file changed, -2/+2)

When decrementing the struct_size and scatter_chunk_waste kstats the
value needs to be cast to an int on 32-bit systems.

Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #6721

* Remove unnecessary equality check (Tobin Harding, 2017-10-05; 1 file changed, -1/+1)

Currently the `if` statement includes an assignment (from a function
return value) and an equality check. The parentheses are in the
incorrect place; because of this, the code clobbers the function
return value. We can fix this by simplifying the `if` statement.

`if (foo != 0)` can be more succinctly expressed as `if (foo)`

Remove the equality check, add parentheses to correct the statement.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Signed-off-by: Tobin C. Harding <[email protected]>
Closes #6685
Closes #6719

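A generic illustration of the defect (hypothetical function names; not the actual call site touched by the patch):

    int err;

    /* Buggy: '!=' is evaluated first, so err only ever gets 0 or 1. */
    if ((err = do_something() != 0))
        handle_error(err);

    /* Corrected placement: assign first, then test the stored value. */
    if ((err = do_something()) != 0)
        handle_error(err);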