aboutsummaryrefslogtreecommitdiffstats
path: root/module
Commit message (Collapse)AuthorAgeFilesLines
* Linux 5.2 compat: Remove config/kernel-set-fs-pwd.m4Tomohiro Kusumi2019-05-251-4/+0
| | | | | | | | | | | | | | | | This failed on 5.2-rc1 with "error: unknown" message, for set_fs_pwd() not being visible in both const and non-const tests. This is caused by torvalds/linux@83da1bed86. It's configurable, but we would want to be able to compile with default kbuild setting. set_fs_pwd() has never been exported with exception of some distro kernels, and set_fs_pwd() wasn't used in ZoL to begin with. The test result was used for a spl function vn_set_fs_pwd(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8777
* Drop local definition of MOUNT_BUSYTomohiro Kusumi2019-05-241-2/+1
| | | | | | | | It's accessible via <sys/mntent.h>. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8765
* Device removal panics on 32-bit systemsloli10K2019-05-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue is caused by an incorrect usage of the sizeof() operator in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on 32-bit systems pointers are 4 bytes long which is not supported by zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system will cause the following failure: VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22) PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete() Showing stack for process 1315 CPU: 6 PID: 1315 Comm: txg_sync Tainted: P OE 4.4.69+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76 Call Trace: [<>] dump_stack+0x58/0x7c [<>] spl_dumpstack+0x23/0x27 [spl] [<>] spl_panic.cold.0+0x5/0x41 [spl] [<>] ? dbuf_rele+0x3e/0x90 [zfs] [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs] [<>] ? zap_lookup+0x57/0x70 [zfs] [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs] [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs] [<>] ? txg_verify+0x1d/0x160 [zfs] [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs] [<>] vdev_sync+0xbf/0x550 [zfs] [<>] ? mutex_lock+0x10/0x30 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs] [<>] ? zap_contains+0x4d/0x70 [zfs] [<>] spa_sync+0x9f1/0x1b10 [zfs] ... [<>] ? kthread_stop+0x110/0x110 This commit simply corrects the "integer_size" parameter used to lookup the vdev's ZAP object. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8790
* Fix coverity defects: CID 186143loli10K2019-05-231-1/+1
| | | | | | | | | | | | | | CID 186143: Memory - illegal accesses (USE_AFTER_FREE) This patch fixes an use-after-free in spa_import_progress_destroy() moving the kmem_free() call at the end of the function. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8788
* kernel timer API reworkRafael Kitover2019-05-232-28/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In `config/kernel-timer.m4` refactor slightly to check more generally for the new `timer_setup()` APIs, but also check the callback signature because some kernels (notably 4.14) have the new `timer_setup()` API but use the old callback signature. Also add a check for a `flags` member in `struct timer_list`, which was added in 4.1-rc8. Add compatibility shims to `include/spl/sys/timer.h` to allow using the new timer APIs with the only two caveats being that the callback argument type must be declared as `spl_timer_list_t` and an explicit assignment is required to get the timer variable for the `timer_of()` macro. So the callback would look like this: ```c __cv_wakeup(spl_timer_list_t t) { struct timer_list *tmr = (struct timer_list *)t; struct thing *parent = from_timer(parent, tmr, parent_timer_field); ... /* do stuff with parent */ ``` Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the new timer APIs instead of conditional code. Reviewed-by: Tomohiro Kusumi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rafael Kitover <[email protected]> Closes #8647
* Fix kstat state update during pool transitionRichard Elling2019-05-231-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reading kstats, the health (aka state) of the pool is stored into /proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name(). However, during import/export there is a case where the spa exists, but the root vdev does not exist. This fix checks that case and sets the state to "TRANSITIONING" Unfortunately, it is not easy to reproduce a test for this. It was detected randomly during ZTS runs while kstats were also being sampled regularly. After this change, further testing did not trip on the case and the TRANSITIONING state was collected at least once by the kstats. For posterity, the backtrace prior to this fix is: [Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs] ... Mon May 13 17:21:00 2019] Call Trace: [Mon May 13 17:21:00 2019] spa_state_data+0x1a/0x40 [zfs] [Mon May 13 17:21:00 2019] kstat_seq_show+0x117/0x440 [spl] [Mon May 13 17:21:00 2019] seq_read+0xe5/0x430 [Mon May 13 17:21:00 2019] proc_reg_read+0x45/0x70 [Mon May 13 17:21:00 2019] __vfs_read+0x1b/0x40 [Mon May 13 17:21:00 2019] vfs_read+0x8e/0x130 [Mon May 13 17:21:00 2019] SyS_read+0x55/0xc0 [Mon May 13 17:21:00 2019] ? SyS_fcntl+0x5d/0xb0 [Mon May 13 17:21:00 2019] do_syscall_64+0x73/0x130 [Mon May 13 17:21:00 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8746
* Linux 5.2 compat: rw_tryupgrade()Brian Behlendorf2019-05-232-4/+19
| | | | | | | | | | | | | | | | | | | | | | | | | Commit torvalds/linux@46ad0840b has removed the architecture specific rwsem source and headers leaving only the generic version. As part of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS macros were moved to the private kernel/locking/rwsem.h header. This results in a build failure because these macros were required to implement the rw_tryupgrade() compatibility function. In practice, this isn't a major problem because there are only a few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade should be written to retry using rw_enter(RW_WRITER). After auditing all of the callers only dmu_zfetch() was determined not to perform a retry. It has been updated in this commit to resolve this issue. That said, the rw_tryupgrade() functionality should be considered for possible removal in a future release due to the difficultly in supporting the interface. Reviewed-by: Tomohiro Kusumi <[email protected]> Reviewed-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8730
* Fix incorrect assertion in dnode_dirty_l1rangePaul Dagnelie2019-05-191-1/+2
| | | | | | | | | | | | | | | | The db_dirtycnt of an EVICTING dbuf is always 0. However, it still appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a range that contains an EVICTING dbuf, we will attempt to mark it dirty (which will fail because it's EVICTING, resulting in a new dbuf being created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the dbufs in the range are dirty. If the EVICTING dbuf is still present, this will trip the assertion erroneously. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Sara Hartse <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #8745
* zpool import progress kstatOlaf Faaland2019-05-092-2/+234
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When an import requires a long MMP activity check, or when the user requests pool recovery, the import make take a long time. The user may not know why, or be able to tell whether the import is progressing or is hung. Add a kstat which lists all imports currently being processed by the kernel (currently only one at a time is possible, but the kstat allows for more than one). The kstat is /proc/spl/kstat/zfs/import_progress. The kstat contents are as follows: pool_guid load_state multihost_secs max_txg pool_name 16667015954387398 3 15 0 tank3 load_state: the value of spa_load_state multihost_secs: seconds until the end of the multihost activity check; if over, or none required, this is 0 max_txg: current spa_load_max_txg, if rewind is occurring This could be used by outside tools, such as a pacemaker resource agent, to report import progress, or as a part of manual troubleshooting. The zpool import subcommand could also be modified to report this information. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olaf Faaland <[email protected]> Closes #8696
* Add missing trailing '\n' in printk() messagesTomohiro Kusumi2019-05-082-2/+3
| | | | | | | | | These messages will want '\n' like any other regular printk() messages. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8726
* Fix link count of root inode when snapdir is visibleTomohiro Kusumi2019-05-081-0/+6
| | | | | | | | | | | | | Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by ->getattr() of zpl inodes) also needs to consider an additional link count if "snapdir" property is set to "visible". Without this, # of directories in root inode of each dataset doesn't match the link count when snapdir is visible. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8727
* Linux 5.0 compat: ASM_BUG macroBrian Behlendorf2019-05-084-40/+40
| | | | | | | | | | | | | The 5.0 kernel defines the macro ASM_BUG. In order to prevent a conflict and build failure rename ASM_BUG to ZFS_ASM_BUG. This is currently only an issue on aarch64 but all instances of ASM_BUG we're renamed to avoid any future conflict on x86_64. Reviewed-by: Tomohiro Kusumi <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8725 Issue #8545
* Fix errant EFAULT during writes (#8719)Brian Behlendorf2019-05-084-16/+16
| | | | | | | | | | | | | | | | | | | | | Commit 98bb45e resolved a deadlock which could occur when handling a page fault in zfs_write(). This change added the uio_fault_disable field to the uio structure but failed to initialize it to B_FALSE. This uninitialized field would cause uiomove_iov() to call __copy_from_user_inatomic() instead of copy_from_user() resulting in unexpected EFAULTs. Resolve the issue by fully initializing the uio, and clearing the uio_fault_disable flags after it's used in zfs_write(). Additionally, reorder the uio_t field assignments to match the order the fields are declared in the structure. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8640 Closes #8719
* Make zfs_special_class_metadata_reserve_pct into a parameterDeHackEd2019-05-071-0/+5
| | | | | | | | Exported and documented a new module parameter. Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: DHE <[email protected]> Closes #8706
* Fix send/recv lost spill blockBrian Behlendorf2019-05-075-15/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes in to account the DRR_OBJECT record and the existing object properties. This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept. What's needed to resolve this is an additional flag in the DRR_OBJECT which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set then the object references a spill block and it must be kept. Either it is update to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed. Since previous versions of ZFS do not understand this new flag additional DRR_SPILL records will be inserted in to the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks which can be identified by checking for the DRR_SPILL_UNMODIFIED flag. The small downside to this approach is that is may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS spill blocks may still be lost. OpenZFS-issue: https://www.illumos.org/issues/9952 FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277 Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8668
* Fix `zfs set atime|relatime=off|on` behavior on inherited datasetsTomohiro Kusumi2019-05-073-9/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `zfs set atime|relatime=off|on` doesn't disable or enable the property on read for datasets whose property was inherited from parent, until a dataset is once unmounted and mounted again. (The properties start to work properly if a dataset is once unmounted and mounted again. The difference comes from regular mount process, e.g. via zpool import, uses mount options based on properties read from ondisk layout for each dataset, whereas `zfs set atime|relatime=off|on` just remounts a specified dataset.) -- # zpool create p1 <device> # zfs create p1/f1 # zfs set atime=off p1 # echo test > /p1/f1/test # sync # zfs list NAME USED AVAIL REFER MOUNTPOINT p1 176K 18.9G 25.5K /p1 p1/f1 26K 18.9G 26K /p1/f1 # zfs get atime NAME PROPERTY VALUE SOURCE p1 atime off local p1/f1 atime off inherited from p1 # stat /p1/f1/test | grep Access | tail -1 Access: 2019-04-26 23:32:33.741205192 +0900 # cat /p1/f1/test test # stat /p1/f1/test | grep Access | tail -1 Access: 2019-04-26 23:32:50.173231861 +0900 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2) -- The problem is that zfsvfs::z_atime which was probably intended to keep incore atime state just gets updated by a callback function of "atime" property change, atime_changed_cb(), and never used for anything else. Since now that all file read and atime update use a common function zpl_iter_read_common() -> file_accessed(), and whether to update atime via ->dirty_inode() is determined by atime_needs_update(), atime_needs_update() needs to return false once atime is turned off. It currently continues to return true on `zfs set atime=off`. Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super block depending on a new atime value, so that atime_needs_update() works as expected after property change. The same problem applies to "relatime" except that a self contained relatime test is needed. This is because relatime_need_update() is based on a mount option flag MNT_RELATIME, which doesn't exist in datasets with inherited "relatime" property via `zfs set relatime=...`, hence it needs its own relatime test zfs_relatime_need_update(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8674 Closes #8675
* Fix typo/etc in module/zfs/zfs_ctldir.cTomohiro Kusumi2019-05-051-4/+3
| | | | | | | | | | | | Drop duplicated phrases in comments. Also drop an obsolete comment "Perform a mount of the associated...", as all it does now is get objid from DMU and lookup incore inode. Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8707
* Linux 5.0 compat: Use totalhigh_pages()Tomohiro Kusumi2019-05-041-1/+1
| | | | | | | | | | | | | Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839 "mm: convert totalram_pages and totalhigh_pages variables to atomic" replaced `totalhigh_pages` with an inline function `totalhigh_pages()`. This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages` on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8677 Closes #8701
* Improve rate at which new zvols are processedJohn Gallagher2019-05-042-86/+116
| | | | | | | | | | | | | | | | | | | | | | | | The kernel function which adds new zvols as disks to the system, add_disk(), briefly opens and closes the zvol as part of its work. Closing a zvol involves waiting for two txgs to sync. This, combined with the fact that the taskq processing new zvols is single threaded, makes this processing new zvols slow. Waiting for these txgs to sync is only necessary if the zvol has been written to, which is not the case during add_disk(). This change adds tracking of whether a zvol has been written to so that we can skip the txg_wait_synced() calls when they are unnecessary. This change also fixes the flags passed to blkdev_get_by_path() by vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of just FMODE_EXCL. The flags were being incorrectly calculated because we were using the wrong version of vdev_bdev_mode(). Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: John Gallagher <[email protected]> Closes #8526 Closes #8615
* Reword comment in lz4_compress_zfsMatthew Ahrens2019-05-021-4/+4
| | | | | | | | | | The comment in lz4_compress_zfs could be more clear and specific. It also contains needlessly strong language. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes: #8702 Closes: #8703
* Add feature check for 'zpool resilver' commandTom Caputi2019-05-021-0/+4
| | | | | | | | | | | | The 'zpool resilver' command requires that the resilver_defer feature is active on the pool. Unfortunately, the check for this was left out of the original patch. This commit simply corrects this so that the command properly returns an error in this case. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8700
* Correct snprintf() size argumentTomohiro Kusumi2019-04-302-3/+2
| | | | | | | | | | | | | | | | The size argument of snprintf(3) in glibc and snprintf() in Linux kernel includes trailing \0, as snprintf(3) man page explains it as "write at most size bytes (including the trailing null byte ('\0'))", i.e. snprintf() can just take buffer size. e.g. For snprintf() in module/zfs/zfs_ctldir.c, a buffer size is MAXPATHLEN, and a caller is passing MAXPATHLEN to snprintf(), so size should just be `path_len` to do what the caller is trying to do. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8692
* Linux 5.0 compat: Remove incorrect ASSERTBrian Behlendorf2019-04-291-1/+0
| | | | | | | | | Not all block devices, notably scsi_debug, set a root_blkg on the request queue. Remove this assertion and allow the the existing call to blkg_tryget() to gracefully handle the NULL (which it does). Reviewed-by: Tomohiro Kusumi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8678
* Use NV_ENCODE_NATIVE for nvlist encoding variableTomohiro Kusumi2019-04-261-1/+2
| | | | | | | | Use NV_ENCODE_NATIVE for nvlist encoding variable instead of 0. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8653
* Use SEEK_{SET,CUR,END} for file seek "whence"Tomohiro Kusumi2019-04-256-13/+13
| | | | | | | | Use either SEEK_* or 0,1,2..., but not both. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8656
* Fixes for the DMU free throttleTom Caputi2019-04-251-29/+36
| | | | | | | | | | | | | | | | | | | | | | | This patch fixes 2 issues with the DMU free throttle implemented in dmu_free_long_range(). The first issue is that get_next_chunk() was calculating the number of L1 blocks the free would dirty incorrectly. In some cases involving extremely large files, this code would greatly overestimate the number of effected L1 blocks, causing excessive calls to txg_wait_open(). This patch corrects the calculation. The second issue is that the free throttle uses the total number of free'd blocks in all (open, quiescing, and syncing) txgs to determine whether to throttle. This causes large frees (such as those created by the first issue) to cause 4 txg syncs before any further frees were allowed to proceed. This patch ensures that the accounting is done entirely in a per-txg fashion, so that frees from a given txg don't affect those that immediately follow it. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8655
* Drop unused ZNODE_STATS and ZNODE_STAT_ADD()Tomohiro Kusumi2019-04-191-14/+0
| | | | | | | | | | Unused since 5649246dd3("Remove znode move functionality"), and ZNODE_STAT_ADD() will never be needed. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8636
* Fix incorrect "[UNUSED]" commentsTomohiro Kusumi2019-04-191-2/+2
| | | | | | | | | | | These aren't unused. `flag` in zfs_create() also isn't to indicate large file. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8635
* Code improvement and bug fixes for QAT supportcfzhu2019-04-164-38/+143
| | | | | | | | | | | | | | | | | | | | 1. Support QAT when ZFS is root file-system: When ZFS module is loaded before QAT started, the QAT can be started again in post-process, e.g.: echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable 2. Verify alder checksum of the de-compress result 3. Allocate Digest, IV and AAD buffer in physical contiguous memory by QAT_PHYS_CONTIG_ALLOC. 4. Update the documentation for zfs_qat_compress_disable, zfs_qat_checksum_disable, zfs_qat_encrypt_disable. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Weigang Li <[email protected]> Signed-off-by: Chengfeix Zhu <[email protected]> Closes #8323 Closes #8610
* Restructure vec_idx loopsRichard Laager2019-04-163-42/+48
| | | | | | | | | | | | | This replaces empty for loops with while loops to make the code easier to read. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reported-by: github.com/dcb314 Signed-off-by: Richard Laager <[email protected]> Closes #6681 Closes #6682 Closes #6683 Closes #8623
* Update a comment to match the codeRichard Laager2019-04-161-2/+2
| | | | | | | | | GRUB supports large_blocks. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes #8626
* Consistently captialize GUID for featuresRichard Laager2019-04-161-5/+5
| | | | | | | Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes #8626
* Fix the spelling of deferredRichard Laager2019-04-161-1/+1
| | | | | | | Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes #8626
* Fix issues with truncated files in raw sendsTom Caputi2019-04-151-8/+12
| | | | | | | | | | | | | | | | | | | | When receiving a raw send stream only reallocated objects whose contents were not freed by the standard indicators should call dmu_free_long_range(). Furthermore, if calling dmu_free_long_range() is required then the objects current block size must be used and not the new block size. Two additional test cases were added to provided realistic test coverage for processing reallocated objects which are part of a raw receive. Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8528 Closes #8607
* Cleanup nits from ab7615d92Tom Caputi2019-04-141-2/+0
| | | | | | | | | This patch simply up cleans up a nit and corrects an error message issue that were introduced in the Multiple DVA scrub patch. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8619
* Fix issue in receive_object() during reallocationBrian Behlendorf2019-04-121-1/+6
| | | | | | | | | | | | | | | When receiving an object to a previously allocated interior slot the new object should be "allocated" by setting DMU_NEW_OBJECT, not "reallocated" with dnode_reallocate(). For resilience verify the slot is free as required in case the stream is malformed. Add a test case to generate more realistic incremental send streams that force reallocation to occur during the receive. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8067 Closes #8614
* Fix TXG_MASK cstyleBrian Behlendorf2019-04-124-16/+18
| | | | | | | | | | Fix style issue for 'tx->tx_txg&TXG_MASK'. There should be white space around the '&' character. Split the dnode_reallocate() ASSERT to make it more readable to clearly separate the checks. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8606
* Always call rw_init in zio_crypt_key_unwrapJorgen Lundman2019-04-101-1/+2
| | | | | | | | | | | | The error path in zio_crypt_key_unwrap would call zio_crypt_key_destroy which calls rw_destroy(&key->zk_salt_lock); which has not yet been initialized. We move the rw_init() call to the start of zio_crypt_key_unwrap instead. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #8604 Closes #8605
* Avoid stack overwrite in zfs_setattr_dir()Tim Chase2019-04-101-1/+2
| | | | | | | | | | | | The bulk[] array index, count, must be reset per-iteration in order to not overwrite the stack. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #8072 Closes #8597 Closes #8601
* Unbreak build on Linux kernel < 3.10Tomohiro Kusumi2019-04-081-4/+0
| | | | | | | | | | | | | | | | | | | d12614521a("Fixes for procfs files backed by linked lists") uses PDE_DATA(), but since PDE_DATA() (public interface which replaced old public interface PDE()) first appeared in upstream kernel 3.10, it lacks visible local definition for kernel < 3.10. Move the local PDE_DATA() definition to a ZoL header, to unbreak build on kernel < 3.10. -- module/spl/spl-procfs-list.c: In function 'procfs_list_open': module/spl/spl-procfs-list.c:166: error: implicit declaration of function 'PDE_DATA' module/spl/spl-procfs-list.c:166: warning: assignment makes pointer from integer without a cast Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8599
* Revert "Fix issues with truncated files in raw sends"Brian Behlendorf2019-04-053-13/+14
| | | | | | | | | | | | | | | | | This partially reverts commit 5dbf8b4ed. This change resolved the issues observed with truncated files in raw sends. However, the required changes to dnode_allocate() introduced a regression for non-raw streams which needs to be understood. The additional debugging improvements from the original patch were not reverted. Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #7378 Issue #8528 Issue #8540 Issue #8565 Close #8584
* predictive prefetch disabled on new pools until export/rebootMatthew Ahrens2019-04-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | When a pool is initially created (by `zpool create`), predictive prefetch is inadvertently disabled, until the pool is export/import-ed, or the machine is rebooted. When device removal was introduced, we added some code to disable predictive prefetching until indirect vdevs have been loaded. This resulted in the "default state" of prefetch being disabled, until we proactively enable it after indirect vdevs are loaded. Unfortunately this resulted in a few bugs where in some code paths we neglect to enable predictive prefetch. The first of these was fixed by https://github.com/zfsonlinux/zfs/commit/20507534d4ede14d4dd82c99fc8d461704ce7419 This commit fixes another case where we also need to explicitly enable predictive prefetch, when the pool is initially created. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8577
* features.kernel layout should match features.poolDon Brady2019-04-041-20/+46
| | | | | | | | The features.kernel layout should match features.pool. Reviewed-by: Sara Hartse <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #8566
* Restrict kstats and print real pointersSara Hartse2019-04-0411-12/+25
| | | | | | | | | | | | | | | There are several places where we use zfs_dbgmsg and %p to print pointers. In the Linux kernel, these values obfuscated to prevent information leaks which means the pointers aren't very useful for debugging crash dumps. We decided to restrict the permissions of dbgmsg (and some other kstats while we were at it) and print pointers with %px in zfs_dbgmsg as well as spl_dumpstack Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Gallagher <[email protected]> Signed-off-by: sara hartse <[email protected]> Closes #8467 Closes #8476
* Fix txg_wait_open() load average inflationBrian Behlendorf2019-04-042-5/+25
| | | | | | | | | | | | | | | | | Callers of txg_wait_open() which set should_quiesce=B_TRUE should be accounted for as iowait time. Otherwise, the caller is understood to be idle and cv_wait_sig() is used to prevent incorrectly inflating the system load average. Similarly txg_wait_wait() has been updated to use cv_wait_io() to be accounted against iowait. Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8550 Closes #8558
* Add TRIM supportBrian Behlendorf2019-03-2920-215/+2284
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Contributions-by: Saso Kiselkov <[email protected]> Contributions-by: Tim Chase <[email protected]> Contributions-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8419 Closes #598
* Fix issues with truncated files in raw sendsTom Caputi2019-03-273-18/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a few issues with raw receives involving truncated files: * dnode_reallocate() now calls dnode_set_blksz() instead of dnode_setdblksz(). This ensures that any remaining dbufs with blkid 0 are resized along with their containing dnode upon reallocation. * One of the calls to dmu_free_long_range() in receive_object() needs to check that the object it is about to free some contents or hasn't been completely removed already by a previous call to dmu_free_long_object() in the same function. * The same call to dmu_free_long_range() in the previous point needs to ensure it uses the object's current block size and not the new block size. This ensures the blocks of the object that are supposed to be freed are completely removed and not simply partially zeroed out. This patch also adds handling for DRR_OBJECT_RANGE records to dprintf_drr() for debugging purposes. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #7378 Closes #8528
* Fix vd_path and error in spa_vdev_remove()Igor K2019-03-221-5/+12
| | | | | | | | | | | Make a local copy of the vd_path and preserve the removal error for use in spa_history_log_internal(). This is required because after spa_vdev_exit() there is nothing preventing the vdev state from changing. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes #8522
* Panic when running 'zpool split'Roman Strashkin2019-03-221-0/+12
| | | | | | | | | | | Added missing remove of detachable VDEV from txg's DTL list to avoid use-after-free for the split VDEV Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Roman Strashkin <[email protected]> Closes #5565 Closes #7856
* ZFS Reads may result in unneccesary calls to zil_commitGeorge Wilson2019-03-221-1/+9
| | | | | | | | | | | | | | | ZFS supports O_RSYNC for read operations and when specified will ensure the same level of data integrity that O_DSYNC and O_SYNC provides for writes. O_RSYNC by itself has no effect so it must be combined with either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect and causes unnecessary calls to zil_commit. Only platforms which support O_RSYNC should implement the zil_commit functionality in the read code path. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]> Closes #8523