aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Fix hung z_zvol tasks during 'zfs receive'LOLi2018-03-303-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During a receive operation zvol_create_minors_impl() can wait needlessly for the prefetch thread because both share the same tasks queue. This results in hung tasks: <3>INFO: task z_zvol:5541 blocked for more than 120 seconds. <3> Tainted: P O 3.16.0-4-amd64 <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. The first z_zvol:5541 (zvol_task_cb) is waiting for the long running traverse_prefetch_thread:260 root@linux:~# cat /proc/spl/taskq taskq act nthr spwn maxt pri mina spl_system_taskq/0 1 2 0 64 100 1 active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40) wait: 5541 spl_delay_taskq/0 0 1 0 4 100 1 delay: spa_deadman [zfs](0xffff880039924000) z_zvol/1 1 1 0 1 120 1 active: [5541]zvol_task_cb [zfs](0xffff88001fde6400) pend: zvol_task_cb [zfs](0xffff88001fde6800) This change adds a dedicated, per-pool, prefetch taskq to prevent the traverse code from monopolizing the global (and limited) system_taskq by inappropriately scheduling long running tasks on it. Reviewed-by: Albert Lee <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #6330 Closes #6890 Closes #7343
* zfs-functions: skip lines where mntpnt is 'none'Georgy Yakovlev2018-03-301-0/+2
| | | | | | | | | | | | | | This fixes zfs-mount initscript trying to mount swap volumes as filesystems or anything that has 'none' as a mountpoint in /etc/fstab. Additionally, fixes trying to mount swap volumes as a filesystem on RHEL. RHEL defines mountpoint for swap as `swap`. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Georgy Yakovlev <[email protected]> Closes #7346 Closes #7347
* OpenZFS 9164 - assert: newds == os->os_dsl_datasetAndriy Gapon2018-03-306-25/+32
| | | | | | | | | | | | | | | | | | | | Authored by: Andriy Gapon <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Don Brady <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Richard Lowe <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> Porting Notes: * Re-enabled and tweaked the zpool_upgrade_007_pos test case to successfully run in under 5 minutes. OpenZFS-issue: https://www.illumos.org/issues/9164 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a Closes #6112 Closes #7336
* Add support for nvme based devidsDon Brady2018-03-291-2/+13
| | | | | | | | | | | Adds a devid for nvme devices. This is very similar to how the other 'bus' (scsi|sata|usb) devids are generated. The devid resides in a name/value pair in the leaf vdevs in a zpool config. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #7356
* Resolve QAT issues with incompressible dataTom Caputi2018-03-293-37/+106
| | | | | | | | | | | | | | | | | | | | | | Currently, when ZFS wants to accelerate compression with QAT, it passes a destination buffer of the same size as the source buffer. Unfortunately, if the data is incompressible, QAT can actually "compress" the data to be larger than the source buffer. When this happens, the QAT driver will return a FAILED error code and print warnings to dmesg. This patch fixes these issues by providing the QAT driver with an additional buffer to work with so that even completely incompressible source data will not cause an overflow. This patch also resolves an error handling issue where incompressible data attempts compression twice: once by QAT and once in software. To fix this issue, a new (and fake) error code CPA_STATUS_INOMPRESSIBLE has been added so that the calling code can correctly account for the difference between a hardware failure and data that simply cannot be compressed. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Weigang Li <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7338
* Fix ASSERT in dsl_scan_fini() and cleanup commentsTom Caputi2018-03-281-13/+13
| | | | | | | | | | | | | | | This patch fixes an issue where dsl_scan_prefetch_cb() might add more prefetch I/Os to the prefetch queue after prefetching has been completed. This was happening because that code was checking scn->scn_suspending instead of scn->scn_prefetch_stop. This occasionally triggered an ASSERT during ztest runs in dsl_scan_fini() when the code attempted to destroy an AVL tree that still had entires in it. This patch also includes a number of spelling corrections and comment cleanups throughout dsl_scan.c Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7353
* chmod -x on etc/init.d/zfs-*.in automake filesTony Hutter2018-03-284-0/+0
| | | | | | | | | | | | Clear executable bit on zfs-import.in, zfs-mount.in, zfs-share.in, and zfs-zed.in. These are automake files and should not be marked executable. This fixes a RPM build error on Fedora 28. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7355 Closes #7327
* Fix mmap / libaio deadlockBrian Behlendorf2018-03-2814-5/+184
| | | | | | | | | | | | | | Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7335 Closes #7339
* Remove libattr requirementDeHackEd2018-03-274-15/+2
| | | | | | | | | | RHEL/CentOS 6 supports sys/xattr.h eliminating the need for libattr-devel as a dependency. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #7344 Closes #7351
* OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by ↵Allan Jude2018-03-261-2/+2
| | | | | | | | | | | | | | | | the wrong value Authored by: Allan Jude <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9321 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18 Closes #7333
* Fedora 28: Fix "Macro %_dracutdir has empty body"Tony Hutter2018-03-251-4/+17
| | | | | | | | | | | If you run ./configure --with-config=srpm, it will not trigger the user m4 scripts to populate the dracut and udev directories. This causes a build error on Fedora 28. Make the dracut and udev lines conditional to get around this. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7326 Closes #7328
* Don't count embedded bps in read statsTom Caputi2018-03-231-1/+3
| | | | | | | | | | | | | | Currently, ZFS tracks statistics about calls to arc_read() via the /proc/spl/kstat/zfs/<pool>/reads file for debugging. Unfortunately, this file currently counts embedded bps as disk reads since they are technically processed by the ZIO layer. This pollutes the log since the ARC will never cache embedded bps. This patch corrects this issue by preventing the logging of embedded bp reads. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7334
* OpenZFS 9193 - bootcfg -C doesn't workPaul Dagnelie2018-03-221-2/+3
| | | | | | | | | | | | | | | | | | | | When given an empty string as a rootds value, bootcfg -C fails with the error message 'could not set nextboot: '' is an invalid name'. This should be allowed because it represents clearing the nextboot configuration. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Chris Williamson <[email protected]> Reviewed by: Sebastien Roy <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Giuseppe Di Natale <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9193 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/504645d227 Closes #7230
* Clarify zpool actions for an intent log devicePeter Ashford2018-03-221-2/+3
| | | | | | | | | | Updated the "Intent Log" section of the "zpool" manual page to properly reflect the actions that may be performed. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Peter Ashford <[email protected]> Closes #6938 Closes #7318
* modprobe zfs during dracut mountkpande2018-03-221-0/+1
| | | | | | | | | | | Resolves importing root pool during boot in dracut. This case was inadvertently broken with the module autoloading change in #7287. Reviewed-by: Matthew Thode <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Kash Pande <[email protected]> Closes #7322
* enable zfs_dbgmsg() by default, without dprintf()Matthew Ahrens2018-03-215-28/+38
| | | | | | | | | | | | | | | | | | | | | | | zfs_dbgmsg() should record a message by default. As a general principal, these messages shouldn't be too verbose. Furthermore, the amount of memory used is limited to 4MB (by default). dprintf() should only record a message if this is a debug build, and ZFS_DEBUG_DPRINTF is set in zfs_flags. This flag is not set by default (even on debug builds). These messages are extremely verbose, and sometimes nontrivial to compute. SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set in zfs_flags. This flag is not set by default (even on debug builds). This brings our behavior in line with illumos. Note that the message format is unchanged (including file, line, and function, even though these are not recorded on illumos). Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #7314
* Fix spelling errors in commentsTom Caputi2018-03-214-47/+48
| | | | | | | | | | This patch simply corrects some spelling / grammar errors in the QAT and encryption code comments. No functional changes Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7319
* Add support for nvme disk detectiontimor2018-03-211-0/+14
| | | | | | | | | | | | | This treats /dev/nvme.. devices the same way as /dev/sd... devices. The motivation behind this is that whole disk detection did not work on nvme SSDs without that, because it DKC_UNKNOWN was returned for such devices. Perhaps there should be a separate DKC_ type for this, but I don't know enough about the code to know the implications of that. Reviewed-by: Don Brady <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: timor <[email protected]> Closes #7304
* Add comments for portable dnode / objset flagsTom Caputi2018-03-202-1/+13
| | | | | | | | | | | This patch adds some comments describing the purpose of "portable" dnode and objset flags so that it is clear when new flags should be added to the repective flag masks. This patch includes no functional changes. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7313
* Add JSON output support to channel programsAlek P2018-03-1913-11/+639
| | | | | | | | | | | | | | | | | | | The changes piggyback JSON output support on top of channel programs (#6558). This way the JSON output support is targeted to scripting use cases and is easily maintainable since it really only touches one function (zfs_do_channel_program()). This patch ports Joyent's JSON nvlist library from illumos to enable easy JSON printing of channel program output nvlist. To keep the delta small I also took advantage of the fact that printing in zfs_do_channel_program() was almost always done before exiting the program. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alek Pinchuk <[email protected]> Closes #7281
* Fix deadlock in IO pipelineBrian Behlendorf2018-03-161-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In vdev_queue_aggregate() the zio_execute() bypass should not be called under the vdev queue lock. This can result in a deadlock as shown in the stack traces below. Drop the vdev queue lock then walk the parents of the aggregate IO to determine the list of component IOs to be bypassed. This can be done safely without holding the io_lock since the new aggregate IO has not yet been returned and its parents cannot change. --- THREAD 1 --- arc_read() zio_nowait() zio_vdev_io_start() vdev_queue_io() <--- mutex_enter(vq->vq_lock) vdev_queue_io_to_issue() vdev_queue_aggregate() zio_execute() zio_vdev_io_assess() zio_wait_for_children() <- mutex_enter(zio->io_lock) --- THREAD 2 --- (inverse order) arc_read() zio_change_priority() <- mutex_enter(zio->zio_lock) vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock) Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7307
* Allow QAT to handle non page-aligned buffersTom Caputi2018-03-162-34/+60
| | | | | | | | | | | | | | This patch adds some handling to the QAT acceleration functions that allows them to handle buffers that are not aligned with the page cache. At the moment this never happens since callers only happen to work with page-aligned buffers, but this code should prevent headaches if this isn't always true in the future. This patch also adds some cleanups to align the QAT compression code with the encryption and checksumming code. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Weigang Li <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7305
* Report pool suspended due to MMPOlaf Faaland2018-03-1510-11/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | When the pool is suspended, record whether it was due to an I/O error or due to MMP writes failing to succeed within the required time. Change spa_suspended from uint8_t to zio_suspend_reason_t to store the reason. When userspace queries pool status via spa_tryimport(), report the reason the pool was suspended in a new key, ZPOOL_CONFIG_SUSPENDED_REASON. In libzfs, when interpreting the returned config nvlist, report suspension due to MMP with a new pool status enum value, ZPOOL_STATUS_IO_FAILURE_MMP. In status_callback(), which generates and emits the message when 'zpool status' is executed, add a case to print an appropriate message for the new pool status enum value. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7296
* SHA256 QAT accelerationTom Caputi2018-03-155-27/+235
| | | | | | | | | | | | This patch enables acceleration of SHA256 checksums using Intel Quick Assist Technology. This patch also fixes up and refactors some of the code from QAT encryption to make the behavior consistent. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chengfeix Zhu <[email protected]> Signed-off-by: Weigang Li <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7295
* OpenZFS 9076 - Adjust perf test concurrency settingsStephen Blinick2018-03-155-10/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ZFS Performance test concurrency should be lowered for better latency Work by Stephen Blinick. Nightly performance runs typically consist of two levels of concurrency; and both are fairly high. Since the IO runs are to a ZFS filesystem, within a zpool, which is based on some variable number of vdev's, the amount of IO driven to each device is variable. Additionally, different device types (HDD vs SSD, etc) can generally handle a different amount of concurrent IO before saturating. Nevertheless, in practice, it appears that most tests are well past the concurrency saturation point and therefore both perform with the same throughput, the maximum of the device. Because the queuedepth to the device(s) is so high however, the latency is much higher than the best possible at that throughput, and increases linearly with the increase in concurrency. This means that changes in code that impact latency during normal operation (before saturation) may not be apparent when a large component of the measured latency is from the IO sitting in a queue to be serviced. Therefore, changing the concurrency settings is recommended Authored by: Stephen Blinick <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: John Wren Kennedy <[email protected]> Ported-by: John Wren Kennedy <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9076 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/562 Upstream bug: DLPX-45477 Closes #7302
* receive_spill does not byte swap spill contentsPaul Zuchowski2018-03-151-0/+9
| | | | | | | | | In zfs receive, the function receive_spill should account for spill block endian conversion as a defensive measure. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #7300
* OpenZFS 9188 - increase size of dbuf cache to reduce indirect block ↵Brian Behlendorf2018-03-132-11/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | decompression With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Porting Notes: * Added modules options to zfs-module-parameters.5 man page. * Preserved scaling based on target ARC size rather than max ARC size. Authored by: George Wilson <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Prashanth Sreenivasa <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/9188 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564 Upstream bug: DLPX-46942 Closes #7273
* Add kernel module auto-loadingBrian Behlendorf2018-03-1310-17/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically a dynamic misc minor number was registered for the /dev/zfs device in order to prevent minor number collisions. This was fine but it prevented us from being able to use the kernel module auto-loaded which requires a known reserved value. Resolve this issue by adding a configure test to find an available misc minor number which can then be used in MODULE_ALIAS_MISCDEV at build time. By adding this alias the zfs kmod is added to the list of known static-nodes and the systemd-tmpfiles-setup-dev service will create a /dev/zfs character device at boot time. This in turn allows us to update the 90-zfs.rules file to make it aware this is a static node. The upshot of this is that whenever a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be automatic loaded. This even works for unprivileged users so there is no longer a need to manually load the modules at boot time. As an additional bonus the zed now no longer needs to start after the zfs-import.service since it will trigger the module load. In the unlikely event the minor number we selected conflicts with another out of tree unregistered minor number the code falls back to dynamically allocating it. In this case the modules again must be manually loaded. Note that due to the change in the method of registering the minor number the zimport.sh test case may incorrectly fail when the static node for the installed packages is created instead of the dynamic one. This issue will only transiently impact zimport.sh for this single commit when we transition and are mixing and matching methods. Reviewed-by: Fabian Grünbichler <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> TEST_ZIMPORT_SKIP="yes" Closes #7287
* Add zfs_scan_ignore_errors tunableTim Chase2018-03-132-0/+30
| | | | | | | | | When it's set, a DTL range will be cleared even if its scan/scrub had errors. This allows to work around resilver/scrub upon import when the pool has errors. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #7293
* Destroy makes full snap list before destroyingPaul Zuchowski2018-03-122-7/+18
| | | | | | | | | | Change zfs destroy logic so destroying begins before the entire list of snapshots is built. Reviewed-by: loli10K <[email protected]> Reviewed-by: Kash Pande <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #7271
* Fix race in trace point in zrl_add_implChunwei Chen2018-03-122-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We hit an illegal memory access in the zrlock trace point. The problem is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if zrl->zr_owner got assigned a longer string between when __string() calculate the strlen, and when __assign_str() does strcpy. The copy will overflow the buffer. == For example: Initial condition: zrl->zr_owner = A zrl->zr_caller = "abc" Thread A Thread B ------------------------------------------------- if (zrl->zr_owner == A) { DTRACE_PROBE2() { __string() { strlen(zrl->zr_caller) -> 3 allocate buf[4] } zrl->zr_owner = B zrl->zr_caller = "abcd" __assign_str() { strcpy(buf, zrl->zr_caller) <- buffer overflow == Dereferencing zrl->zr_owner->pid may also be problematic, in that the zrl->zr_owner got changed to other task, and that task exits, freeing the task_struct. This should be very unlikely, as the other task need to zrl_remove and exit between the dereferencing zr->zr_owner and zr->zr_owner->pid. Nevertheless, we'll deal with it as well. To fix the zrl->zr_caller issue, instead of copy the string content, we just copy the pointer, this is safe because it always points to __func__, which is static. As for the zrl->zr_owner issue, we pass in curthread instead of using zrl->zr_owner. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #7291
* Fix MMP write frequency for large poolsBrian Behlendorf2018-03-121-3/+3
| | | | | | | | | | | | | | | | | | | When a single pool contains more vdevs than the CONFIG_HZ for for the kernel the mmp thread will not delay properly. Switch to using cv_timedwait_sig_hires() to handle higher resolution delays. This issue was reported on Arch Linux where HZ defaults to only 100 and this could be fairly easily reproduced with a reasonably large pool. Most distribution kernels set CONFIG_HZ=250 or CONFIG_HZ=1000 and thus are unlikely to be impacted. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7205 Closes #7289
* Hold SCL_VDEV when counting leavesOlaf Faaland2018-03-091-1/+7
| | | | | | | | | | | | A config lock should be held while vdev_count_leaves() walks the tree, otherwise the pointers reference may become invalid during the walk. SCL_VDEV is a minimal lock provided for such uses cases. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7286
* Handle zio_resume and mmp => offOlaf Faaland2018-03-091-4/+10
| | | | | | | | | | | | | | | When multihost is disabled on a pool, and the pool is resumed via zpool clear, within a single cycle of the mmp thread's loop (e.g. while it's in the cv_timedwait call), both mmp_last_write and mmp_delay should be updated. The original code mistakenly treated the two cases as if they could not occur at the same time. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7286
* Document allowed pool namesTomohiro Kusumi2018-03-091-2/+6
| | | | | | | | | | | | PR #7208 was a patch to allow non-reserved pool names which begin with mirror, raidz, spare (but do not equal), however we'd rather document it in the man page for compatibility with other OpenZFS implementations, to avoid pool names that may not work on non-Linux platforms. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7216
* Fix zfs-kmod builds when using rpm >= 4.14LOLi2018-03-091-0/+1
| | | | | | | | | | | With rpm-software-management/rpm@5e94633 a package version containing invalid characters (most commonly a double '-') causes the kmod package generation to terminate with an error. This change takes advantage of the newly introduced rpm macro "_wrong_version_format_terminate_build" to allow kmod packages to be built. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #7284
* Change functions which return literals to return `const char*`Tomohiro Kusumi2018-03-094-5/+5
| | | | | | | | | | | | | | get_format_prompt_string() and zpool_state_to_name() return a string literal which is read-only, thus they should return `const char*`. zpool_get_prop_string() returns a non-const string after successful nv-lookup, and returns a string literal otherwise. Since this function is designed to be used for read-only purpose, the return type should also be `const char*`. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #7285
* QAT support for AES-GCMTom Caputi2018-03-0910-232/+758
| | | | | | | | | | This patch adds support for acceleration of AES-GCM encryption with Intel Quick Assist Technology. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chengfeix Zhu <[email protected]> Signed-off-by: Weigang Li <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7282
* zdb and inuse tests don't pass with real disksPaul Zuchowski2018-03-078-15/+59
| | | | | | | | | | Due to zpool create auto-partioning in Linux (i.e. sdb1), certain utilities need to use the parition (sdb1) while others use the whole disk name (sdb). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes #6939 Closes #7261
* Take user namespaces into account in policy checksWolfgang Bumiller2018-03-0717-7/+521
| | | | | | | | | | | | | | | | | | | | | | | | | Change file related checks to use user namespaces and make sure involved uids/gids are mappable in the current namespace. Note that checks without file ownership information will still not take user namespaces into account, as some of these should be handled via 'zfs allow' (otherwise root in a user namespace could issue commands such as `zpool export`). This also adds an initial user namespace regression test for the setgid bit loss, with a user_ns_exec helper usable in further tests. Additionally, configure checks for the required user namespace related features are added for: * ns_capable * kuid/kgid_has_mapping() * user_ns in cred_t Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Wolfgang Bumiller <[email protected]> Closes #6800 Closes #7270
* ZTS: fix send-c_stream_size_estimateBrian Behlendorf2018-03-071-0/+1
| | | | | | | | | | | | The test could fail when attempting to write to a newly created volume which was missing its device node. Resolve the issue by calling block_device_wait() which blocks until udev creates the needed entry. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #7276 Closes #7277
* Fix dbufstats_001_posGiuseppe Di Natale2018-03-072-4/+28
| | | | | | | | | | | | | | Implement a new helper within_tolerance to test if a value is within range of a target. Because the dbufstats and dbufs kstat file are being read at slightly different times, it is possible for stats to be slightly off. Use within_tolerance to determine if the value is "close enough" to the target. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7239 Closes #7266
* Allow to limit zed's syslog chattinessTony Hutter2018-03-069-7/+178
| | | | | | | | | | | | | | | | | | Some usage patterns like send/recv of replication streams can produce a large number of events. In such a case, the current all-syslog.sh zedlet will hold up to its name, and flood the logs with mostly redundant information. Two mitigate this situation, this changeset introduces to new variables ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE to zed.rc that give more control over which event classes end up in the syslog. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Daniel Kobras <[email protected]> Closes #6886 Closes #7260
* Record skipped MMP writes in multihost_historyOlaf Faaland2018-03-0611-50/+263
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once per pass through the MMP thread's loop, the vdev tree is walked to find a suitable leaf to write the next MMP block to. If no such leaf is found, the thread sleeps for a while and resumes at the top of the loop. Add an entry to multihost_history when no leaf can be found, and record the reason in the error column. The error code for such entries is a bitfield, displayed in hex: 0x1 At least one vdev (interior or leaf) was not writeable. 0x2 At least one writeable leaf vdev was found, but it had a pending MMP write. timestamp = the time in seconds since the epoch when no leaf could be found originally. duration = the time (in ns) during which no MMP block was written for this reason. This does not include the preceeding inter-write period nor the following inter-write period. vdev_guid = the number of sequential cycles of the MMP thread looop when this occurred. Sample output, truncated to fit: For records of skipped MMP writes the right-most column, vdev_path, is reported as "-". id txg timestamp error duration mmp_delay vdev_guid ... 936 11 1520036441 0 146264 891422313 1740883117838 ... 937 11 1520036441 0 163956 888356657 7320395061548 ... 938 11 1520036442 0 130690 885314969 7320395061548 ... 939 11 1520036442 0 2001068577 882296582 1740883117838 ... 940 11 1520036443 0 161806 882296582 7320395061548 ... 941 11 1520036443 0x2 0 998020546 1 ... 942 11 1520036444 0 136585 998020546 7320395061548 ... 943 11 1520036444 0x2 0 998020257 1 ... 944 11 1520036445 5 2002662964 994160219 1740883117838 ... 945 11 1520036445 0x2 998073118 994160219 3 ... 946 11 1520036447 0 247136 994160219 7320395061548 ... Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7212
* Detect long config lock acquisition in mmpOlaf Faaland2018-03-061-0/+6
| | | | | | | | | | | | | | If something holds the config lock as a writer for too long, MMP will fail to issue MMP writes in a timely manner. This will result either in the pool being suspended, or in an extreme case, in the pool not being protected. If the time to acquire the config lock exceeds 1/10 of the minimum zfs_multihost_interval, report it in the zfs debug log. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #7212
* Introduce a destroy_dataset helperGiuseppe Di Natale2018-03-064-24/+47
| | | | | | | | | | | | | Datasets can be busy when calling zfs destroy. Introduce a helper function to destroy datasets and use it to destroy datasets in zfs_allow_004_pos, zfs_promote_008_pos, and zfs_destroy_002_pos. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7224 Closes #7246 Closes #7249 Closes #7267
* Misc fixes and cleanup for project quotaNasf-Fan2018-03-054-6/+9
| | | | | | | | | | | | | | | | | | | | | | | 1) The Coverity Scan reports some issues for the project quota patch, including: 1.1) zfs_prop_get_userquota() directly uses the const quota type value as the condition check by wrong. 1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid to be overwritten by dnode::dn->dn_oldprojid. 2) This patch fixes related issues. It also enhances the logic for zfs_project_item_alloc() to avoid buffer overflow. 3) Skip project quota ability check if does not change project quota related things (id or flag). Otherwise, it will cause chattr (for other non project quota flags) operation failed if project quota disabled. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Fan Yong <[email protected]> Closes #7251 Closes #7265
* Linux 4.16 compat: get_disk_and_module()Giuseppe Di Natale2018-03-054-1/+29
| | | | | | | | | | As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk() is now get_disk_and_module(). Add a configure check to determine if we need to use get_disk_and_module(). Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Giuseppe Di Natale <[email protected]> Closes #7264
* Change checksum & IO delay ratelimit valuesTony Hutter2018-03-046-15/+56
| | | | | | | | | | | | | | | | Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec. This allows zed to actually trigger if a bunch of these events arrive in a short period of time (zed has a threshold of 10 events in 10 sec). Previously, if you had, say, 100 checksum errors in 1 sec, it would get ratelimited to 5/sec which wouldn't trigger zed to fault the drive. Also, convert the checksum and IO delay thresholds to module params for easy testing. Reviewed-by: loli10K <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #7252
* Increment zil_itx_needcopy_bytes properlychrisrd2018-03-021-2/+1
| | | | | | | | | | | | | | | | | In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw) into the log write block (lwb) and send it off using zil_lwb_add_txg(). If we also have WR_NEED_COPY, we additionally copy the lwr's data into the lwb to be sent off. If the lwr + data doesn't fit into the lwb, we send the lrw and as much data as will fit (dnow bytes), then go back and do the same with the remaining data. Each time through this loop we're sending dnow data bytes. I.e. zil_itx_needcopy_bytes should be incremented by dnow. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Closes #6988 Closes #7176