path: root/module/zfs
* fat zap should prefetch when iterating (Matthew Ahrens, 2019-06-12; 4 files, -7/+110)

    When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins.

    For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel.

    Reviewed-by: Andreas Dilger <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    External-issue: DLPX-58347
    Closes #8862
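    A minimal sketch of the idea (the wrapper below is hypothetical; dmu_prefetch() is the real DMU prefetch entry point, but the surrounding names are illustrative):

        /*
         * Queue asynchronous reads for every ZAP leaf block before
         * iteration starts, so the reads complete in parallel instead
         * of one cache miss at a time.
         */
        static void
        zap_prefetch_leaves(objset_t *os, uint64_t zapobj,
            uint64_t nleafblks, uint64_t blksz)
        {
                for (uint64_t i = 0; i < nleafblks; i++) {
                        /* non-blocking; merely issues the read */
                        dmu_prefetch(os, zapobj, 0, i * blksz, blksz,
                            ZIO_PRIORITY_SYNC_READ);
                }
        }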
* Target ARC size can get reduced to arc_c_min (Matthew Ahrens, 2019-06-12; 1 file, -2/+0)

    Sometimes the target ARC size is reduced to arc_c_min, which impacts performance. We've seen this happen as part of the random_reads performance regression test, where the ARC size is reduced before the reads test starts, which impacts how long it takes for the system to reach good IOPS performance.

    We call arc_reduce_target_size() when arc_reap_cb_check() returns TRUE and arc_available_memory() is less than arc_c>>arc_shrink_shift. However, arc_available_memory() could easily be low even when arc_c is low, because we can have tons of unused bufs in the abd kmem cache. This would be especially true just after the DMU requests a bunch of stuff be evicted from the ARC (e.g. due to "zpool export").

    To fix this, the ARC should reduce arc_c by the requested amount, not all the way down to arc_size (or arc_c_min), which can be very small.

    Reviewed-by: Tim Chase <[email protected]>
    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    External-issue: DLPX-59431
    Closes #8864
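    The policy change reduces to the following standalone sketch (assumed shape, not the literal diff):

        #include <stdint.h>

        /*
         * Old behavior: collapse the target to the current ARC size,
         * which may be as low as arc_c_min. New behavior: shrink the
         * target only by the requested amount, clamped at arc_c_min.
         */
        static uint64_t
        arc_shrunk_target(uint64_t arc_c, uint64_t arc_c_min, uint64_t to_free)
        {
                if (arc_c > arc_c_min + to_free)
                        return (arc_c - to_free);
                return (arc_c_min);
        }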
* Fix typo in vdev_raidz_math.c (bnjf, 2019-06-12; 1 file, -1/+1)

    Fix typo in vdev_raidz_math.c

    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Brad Forschinger <[email protected]>
    Closes #8875
    Closes #8880
* single-chunk scatter ABDs can be treated as linear (Matthew Ahrens, 2019-06-11; 3 files, -53/+92)

    Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads.

    https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, performance of the sequential_reads_arc_cached tests in the test suite was reduced by around 5% (these tests do reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC).

    This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's.

    Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by 87c25d567.

    Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization cannot be made on 32-bit systems.

    Reviewed-by: Chunwei Chen <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    Closes #8580
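    Conceptually the change is a one-condition test at allocation time; a hedged sketch (field and flag names approximate the ABD internals rather than the exact patch, and `page` stands for the single allocated chunk):

        /*
         * A scatter ABD backed by exactly one chunk is virtually
         * contiguous, so flag it as linear and point abd_buf at the
         * chunk's mapping; abd_borrow_buf_copy()/abd_return_buf_copy()
         * can then skip the bounce-buffer allocation and copy.
         */
        if (abd->abd_u.abd_scatter.abd_nents == 1) {
                abd->abd_flags |= ABD_FLAG_LINEAR | ABD_FLAG_LINEAR_PAGE;
                abd->abd_u.abd_linear.abd_buf = page_address(page);
        }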
* make zil max block size tunable (Matthew Ahrens, 2019-06-10; 3 files, -9/+71)

    We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy.

    The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk.

    To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks.

    External-issue: DLPX-61719
    Reviewed-by: George Wilson <[email protected]>
    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    Closes #8865
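    The mechanism is simply a clamp on the block size the ZIL requests from the allocator; a hedged sketch (the upstream tunable is believed to be zil_maxblocksize, but verify against the patch):

        /* Illustrative: never ask the allocator for more than the cap. */
        static uint64_t
        zil_block_size(uint64_t desired, uint64_t zil_maxblocksize)
        {
                return (desired < zil_maxblocksize ? desired : zil_maxblocksize);
        }

    On a badly fragmented pool, lowering the cap (e.g. to 64K via the module parameter) trades longer ZIL block chains for allocations the metaslabs can actually satisfy.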
* Fix comparison signedness in arc_is_overflowing() (Alexander Motin, 2019-06-10; 1 file, -2/+2)

    When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values that, due to unsigned comparison, caused delays waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Using a signed comparison there fixes the problem.

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Closes #8873
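    A hedged illustration of the fix (standalone, not the literal diff):

        #include <stdint.h>

        /*
         * Compare as signed: a negative lower bound must not wrap to
         * a huge unsigned value and falsely report overflow.
         */
        static int
        arc_is_overflowing_sketch(int64_t size_lower_bound,
            uint64_t arc_c, uint64_t overflow)
        {
                return (size_lower_bound >= (int64_t)(arc_c + overflow));
        }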
* Fix incorrect error message for raw receive (Tom Caputi, 2019-06-10; 1 file, -2/+9)

    This patch fixes an incorrect error message that comes up when doing a non-forcing, raw, incremental receive into a dataset that has a newer snapshot than the "from" snapshot. In this case, the current code prints a confusing message about an IVset guid mismatch.

    This functionality is supported by non-raw receives as an undocumented feature, but was never supported by the raw receive code. If this is desired in the future, we can probably figure out a way to make it work.

    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed by: Matthew Ahrens <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Issue #8758
    Closes #8863
* Allow metaslab to be unloaded even when not freed from (Paul Dagnelie, 2019-06-06; 2 files, -22/+39)

    On large systems, the memory used by loaded metaslabs can become a concern. While range trees are a fairly efficient data structure, on heavily fragmented pools they can still consume a significant amount of memory. This problem is amplified when we fail to unload metaslabs that we aren't using.

    Currently, we only unload a metaslab during metaslab_sync_done; in order for that function to be called on a given metaslab in a given txg, we have to have dirtied that metaslab in that txg. If the dirtying was the result of an allocation, we wouldn't be unloading it (since it wouldn't be 8 txgs since it was selected), so in effect we only unload a metaslab during txgs where it's being freed from.

    We move the unload logic from sync_done to a new function, and call that function on all metaslabs in a given vdev during vdev_sync_done().

    Reviewed-by: Richard Elling <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Paul Dagnelie <[email protected]>
    Closes #8837
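    A hedged sketch of the new call site (the helper name is illustrative; vdev_ms/vdev_ms_count are the real per-vdev metaslab array fields):

        /*
         * Visit every metaslab on the vdev once per txg, so unload no
         * longer depends on the metaslab having been dirtied.
         */
        void
        vdev_sync_done(vdev_t *vd, uint64_t txg)
        {
                for (uint64_t m = 0; m < vd->vdev_ms_count; m++)
                        metaslab_maybe_unload(vd->vdev_ms[m], txg);
                /* ... existing sync-done processing ... */
        }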
* Reinstate raw receive check when truncating (Tom Caputi, 2019-06-06; 1 file, -1/+15)

    This patch re-adds a check that was removed in 369aa50. The check confirms that a raw receive is not occurring before truncating an object's dn_maxblkid. At the time, it was believed that all cases that would hit this code path would be handled in other places, but that was not the case.

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Closes #8852
    Closes #8857
* l2arc_apply_transforms: Fix typo in comment (Allan Jude, 2019-06-06; 1 file, -1/+1)

    Reviewed-by: Chris Dunlop <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Signed-off-by: Allan Jude <[email protected]>
    Closes #8822
* Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold (Serapheim Dimitropoulos, 2019-06-06; 1 file, -5/+20)

    Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically, in a hypothetical pool with two vdevs, what can happen is the following: vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth, utilizing one vdev at a time with IOPS taking a hit.

    Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases, but we don't remove it (at least for now) just in case we hit any weird behavior in the future.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Signed-off-by: Serapheim Dimitropoulos <[email protected]>
    Closes #8859
* Fix integer overflow of ZTOI(zp)->i_generation (Tom Caputi, 2019-06-06; 1 file, -1/+1)

    The ZFS on-disk format stores each inode's generation ID as a 64 bit number on disk and in-core. However, the Linux kernel's inode stores it as only a 32 bit number. In most places, the code handles this correctly, but the cast is missing in zfs_rezget(). For many pools, this isn't an issue since the generation ID is computed as the current txg when the inode is created and many pools don't have more than 2^32 txgs.

    For the pools that have more txgs, this issue causes any inode with a high enough generation number to report IO errors after a call to "zfs rollback" while holding the file or directory open. This patch simply adds the missing cast.

    Reviewed-by: Alek Pinchuk <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Closes #8858
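    The fix is a single explicit narrowing; a hedged standalone illustration:

        #include <stdint.h>

        /*
         * Truncate the 64-bit on-disk generation to the kernel's
         * 32-bit i_generation deterministically, matching the
         * inode-creation path.
         */
        static uint32_t
        inode_generation(uint64_t z_gen)
        {
                return ((uint32_t)z_gen);
        }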
* Drop objid argument in zfs_znode_alloc() (sync with OpenZFS) (Tomohiro Kusumi, 2019-06-05; 1 file, -5/+4)

    Since zfs_znode_alloc() already takes dmu_buf_t*, taking another uint64_t argument for objid is redundant. The inode's ->i_ino does, and needs to, match the znode's ->z_id. zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument, since a vnode doesn't carry a vnode number in VFS (hence ->z_id exists).

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8841
* Wait in 'S' state when send/recv pipe is blocking (DeHackEd, 2019-06-03; 1 file, -2/+2)

    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: DHE <[email protected]>
    Closes #8733
    Closes #8752
* Make zfs_async_block_max_blocks handle zero correctly (TulsiJain, 2019-06-03; 1 file, -1/+3)

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: TulsiJain <[email protected]>
    Closes #8829
    Closes #8289
* Revert "Report holes when there are only metadata changes"Brian Behlendorf2019-05-301-28/+3
| | | | | | | | | | | | This reverts commit ec4f9b8f30 which introduced a narrow race which can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO. Resolve the issue by revering this change to restore the previous behavior which depends solely on checking the dirty list. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8816 Closes #8834
* Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod) (Tomohiro Kusumi, 2019-05-29; 2 files, -9/+1)

    Per a suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd(), which are only used in zfs_ioctl.c:_init() while loading zfs.ko.

    The rest of the initialization functions called here after cwd is set to / don't depend on the cwd of the process, except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8826
* Exclude log device ashift from normal class (Brian Behlendorf, 2019-05-29; 1 file, -4/+1)

    When opening a log device during import, its allocation bias will not yet have been set by vdev_load(). This results in the log device's ashift being incorrectly applied to the maximum ashift of the vdevs in the normal class, which in turn prevents the removal of any top-level devices due to the ashift check in the spa_vdev_remove_top_check() function.

    This issue is resolved by including vdev_islog in the check, since it will be set correctly during vdev_open().

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Igor Kozhukhov <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8735
* Fix integer overflow in get_next_chunk() (madz, 2019-05-29; 1 file, -2/+2)

    dn->dn_datablksz is of type uint32_t and needs to be cast to uint64_t to avoid an overflow when the record size is greater than 4 MiB.

    Reviewed-by: Tom Caputi <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Olivier Mazouffre <[email protected]>
    Closes #8778
    Closes #8797
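    A hedged standalone illustration of the overflow class:

        #include <stdint.h>

        /*
         * Widen the 32-bit block size before multiplying; otherwise
         * the product is computed in 32 bits and wraps at 4 GiB.
         */
        static uint64_t
        chunk_span_bytes(uint32_t dn_datablksz, uint64_t nblks)
        {
                return ((uint64_t)dn_datablksz * nblks);
        }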
* Double-free of encryption wrapping key due to invalid pool properties (loli10K, 2019-05-28; 1 file, -12/+9)

    This commit fixes a double-free in zfs_ioc_pool_create() triggered by specifying an unsupported combination of properties when creating a pool with encryption enabled.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tom Caputi <[email protected]>
    Signed-off-by: loli10K <[email protected]>
    Closes #8791
* Update descriptions for vnops (Tomohiro Kusumi, 2019-05-25; 2 files, -13/+14)

    These descriptions are not up to date with the code.

    Reviewed-by: Igor Kozhukhov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8767
* Fix embedded bp accounting in count_block() (Tom Caputi, 2019-05-25; 1 file, -0/+7)

    Currently, count_block() does not correctly account for the possibility that the bp that is passed to it could be embedded. These blocks shouldn't be counted, since the work of scanning these blocks is already handled when the containing block is scanned. This patch simply resolves this issue by returning early in this case.

    Reviewed by: Allan Jude <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Authored-by: Bill Sommerfeld <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Closes #8800
    Closes #8766
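    A hedged sketch of the early return (BP_IS_EMBEDDED() is the real macro; the surrounding accounting is elided):

        static void
        count_block_sketch(const blkptr_t *bp)
        {
                if (BP_IS_EMBEDDED(bp))
                        return; /* payload lives inside the bp itself */
                /* ... per-type block statistics accounting ... */
        }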
* Linux 5.2 compat: Directly call wait_on_page_bit() (Tomohiro Kusumi, 2019-05-25; 1 file, -2/+4)

    wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd. Directly call wait_on_page_bit() without using the wait_on_page_writeback() interface, given zfs_putpage() is the only caller for now.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: loli10K <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8794
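    The call-site substitution, as a hedged sketch:

        /* Before (the export became GPL-only in Linux 5.2): */
        wait_on_page_writeback(pp);
        /* After (equivalent for zfs_putpage()'s single caller): */
        wait_on_page_bit(pp, PG_writeback);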
* Drop local definition of MOUNT_BUSY (Tomohiro Kusumi, 2019-05-24; 1 file, -2/+1)

    It's accessible via <sys/mntent.h>.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tom Caputi <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8765
* Device removal panics on 32-bit systems (loli10K, 2019-05-24; 1 file, -1/+2)

    The issue is caused by an incorrect usage of the sizeof() operator in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on 32-bit systems pointers are 4 bytes long which is not supported by zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system will cause the following failure:

        VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22)
        PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete()
        Showing stack for process 1315
        CPU: 6 PID: 1315 Comm: txg_sync Tainted: P OE 4.4.69+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
        c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0
        d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341
        da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76
        Call Trace:
        [<>] dump_stack+0x58/0x7c
        [<>] spl_dumpstack+0x23/0x27 [spl]
        [<>] spl_panic.cold.0+0x5/0x41 [spl]
        [<>] ? dbuf_rele+0x3e/0x90 [zfs]
        [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs]
        [<>] ? zap_lookup+0x57/0x70 [zfs]
        [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs]
        [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs]
        [<>] ? txg_verify+0x1d/0x160 [zfs]
        [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs]
        [<>] vdev_sync+0xbf/0x550 [zfs]
        [<>] ? mutex_lock+0x10/0x30
        [<>] ? txg_list_remove+0x9f/0x1a0 [zfs]
        [<>] ? zap_contains+0x4d/0x70 [zfs]
        [<>] spa_sync+0x9f1/0x1b10 [zfs]
        ...
        [<>] ? kthread_stop+0x110/0x110

    This commit simply corrects the "integer_size" parameter used to lookup the vdev's ZAP object.

    Reviewed-by: Giuseppe Di Natale <[email protected]>
    Reviewed-by: Igor Kozhukhov <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: loli10K <[email protected]>
    Closes #8790
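    The defect reduces to one sizeof() operand; a hedged reconstruction of the call (zap_lookup()'s integer_size argument describes the stored integers, never the receiving pointer):

        uint64_t obj;

        /* Broken on 32-bit: sizeof (uint64_t *) == 4 there, and
         * zap_lookup_impl() rejects a 4-byte integer_size for an
         * 8-byte ZAP entry. */
        err = zap_lookup(mos, vd->vdev_top_zap,
            VDEV_TOP_ZAP_INDIRECT_OBSOLETE_SM, sizeof (uint64_t *), 1, &obj);

        /* Fixed: the entry holds one 8-byte integer. */
        err = zap_lookup(mos, vd->vdev_top_zap,
            VDEV_TOP_ZAP_INDIRECT_OBSOLETE_SM, sizeof (uint64_t), 1, &obj);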
* Fix coverity defects: CID 186143 (loli10K, 2019-05-23; 1 file, -1/+1)

    CID 186143: Memory - illegal accesses (USE_AFTER_FREE)

    This patch fixes a use-after-free in spa_import_progress_destroy() by moving the kmem_free() call to the end of the function.

    Reviewed-by: Chris Dunlop <[email protected]>
    Reviewed-by: Giuseppe Di Natale <[email protected]>
    Reviewed-by: Igor Kozhukhov <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Reviewed by: Brian Behlendorf <[email protected]>
    Signed-off-by: loli10K <[email protected]>
    Closes #8788
* Fix kstat state update during pool transition (Richard Elling, 2019-05-23; 1 file, -2/+12)

    When reading kstats, the health (aka state) of the pool is stored into /proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name(). However, during import/export there is a case where the spa exists, but the root vdev does not exist. This fix checks that case and sets the state to "TRANSITIONING".

    Unfortunately, it is not easy to reproduce a test for this. It was detected randomly during ZTS runs while kstats were also being sampled regularly. After this change, further testing did not trip on the case and the TRANSITIONING state was collected at least once by the kstats.

    For posterity, the backtrace prior to this fix is:

        [Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs]
        ...
        [Mon May 13 17:21:00 2019] Call Trace:
        [Mon May 13 17:21:00 2019]  spa_state_data+0x1a/0x40 [zfs]
        [Mon May 13 17:21:00 2019]  kstat_seq_show+0x117/0x440 [spl]
        [Mon May 13 17:21:00 2019]  seq_read+0xe5/0x430
        [Mon May 13 17:21:00 2019]  proc_reg_read+0x45/0x70
        [Mon May 13 17:21:00 2019]  __vfs_read+0x1b/0x40
        [Mon May 13 17:21:00 2019]  vfs_read+0x8e/0x130
        [Mon May 13 17:21:00 2019]  SyS_read+0x55/0xc0
        [Mon May 13 17:21:00 2019]  ? SyS_fcntl+0x5d/0xb0
        [Mon May 13 17:21:00 2019]  do_syscall_64+0x73/0x130
        [Mon May 13 17:21:00 2019]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Richard Elling <[email protected]>
    Closes #8746
* Linux 5.2 compat: rw_tryupgrade() (Brian Behlendorf, 2019-05-23; 1 file, -3/+10)

    Commit torvalds/linux@46ad0840b has removed the architecture-specific rwsem source and headers, leaving only the generic version. As part of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS macros were moved to the private kernel/locking/rwsem.h header. This results in a build failure because these macros were required to implement the rw_tryupgrade() compatibility function.

    In practice, this isn't a major problem because there are only a few consumers of rw_tryupgrade(), and because consumers of rw_tryupgrade() should be written to retry using rw_enter(RW_WRITER). After auditing all of the callers, only dmu_zfetch() was determined not to perform a retry. It has been updated in this commit to resolve this issue.

    That said, the rw_tryupgrade() functionality should be considered for possible removal in a future release due to the difficulty of supporting the interface.

    Reviewed-by: Tomohiro Kusumi <[email protected]>
    Reviewed-by: Chunwei Chen <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8730
* Fix incorrect assertion in dnode_dirty_l1range (Paul Dagnelie, 2019-05-19; 1 file, -1/+2)

    The db_dirtycnt of an EVICTING dbuf is always 0. However, it still appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a range that contains an EVICTING dbuf, we will attempt to mark it dirty (which will fail because it's EVICTING, resulting in a new dbuf being created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the dbufs in the range are dirty. If the EVICTING dbuf is still present, this will trip the assertion erroneously.

    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Richard Elling <[email protected]>
    Reviewed-by: Sara Hartse <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Paul Dagnelie <[email protected]>
    Closes #8745
* zpool import progress kstat (Olaf Faaland, 2019-05-09; 2 files, -2/+234)

    When an import requires a long MMP activity check, or when the user requests pool recovery, the import may take a long time. The user may not know why, or be able to tell whether the import is progressing or is hung.

    Add a kstat which lists all imports currently being processed by the kernel (currently only one at a time is possible, but the kstat allows for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.

    The kstat contents are as follows:

        pool_guid          load_state  multihost_secs  max_txg  pool_name
        16667015954387398  3           15              0        tank3

    load_state: the value of spa_load_state
    multihost_secs: seconds until the end of the multihost activity check; if over, or none required, this is 0
    max_txg: current spa_load_max_txg, if rewind is occurring

    This could be used by outside tools, such as a pacemaker resource agent, to report import progress, or as a part of manual troubleshooting. The zpool import subcommand could also be modified to report this information.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Olaf Faaland <[email protected]>
    Closes #8696
* Add missing trailing '\n' in printk() messages (Tomohiro Kusumi, 2019-05-08; 1 file, -1/+1)

    These messages will want '\n' like any other regular printk() messages.

    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8726
* Fix link count of root inode when snapdir is visible (Tomohiro Kusumi, 2019-05-08; 1 file, -0/+6)

    Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by ->getattr() of zpl inodes) also needs to consider an additional link count if the "snapdir" property is set to "visible". Without this, the number of directories in the root inode of each dataset doesn't match the link count when snapdir is visible.

    Reviewed-by: Richard Yao <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8727
* Linux 5.0 compat: ASM_BUG macro (Brian Behlendorf, 2019-05-08; 4 files, -40/+40)

    The 5.0 kernel defines the macro ASM_BUG. In order to prevent a conflict and build failure, rename ASM_BUG to ZFS_ASM_BUG. This is currently only an issue on aarch64, but all instances of ASM_BUG were renamed to avoid any future conflict on x86_64.

    Reviewed-by: Tomohiro Kusumi <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: Chris Dunlop <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8725
    Issue #8545
* Fix errant EFAULT during writes (#8719) (Brian Behlendorf, 2019-05-08; 4 files, -16/+16)

    Commit 98bb45e resolved a deadlock which could occur when handling a page fault in zfs_write(). This change added the uio_fault_disable field to the uio structure but failed to initialize it to B_FALSE. This uninitialized field would cause uiomove_iov() to call __copy_from_user_inatomic() instead of copy_from_user(), resulting in unexpected EFAULTs.

    Resolve the issue by fully initializing the uio, and clearing the uio_fault_disable flag after it's used in zfs_write(). Additionally, reorder the uio_t field assignments to match the order the fields are declared in the structure.

    Reviewed-by: Chunwei Chen <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: Tim Chase <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8640
    Closes #8719
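    A hedged sketch of the bug class (field names follow the SPL uio_t, but the snippet is illustrative):

        uio_t uio;

        /*
         * Zero the whole structure so optional fields such as
         * uio_fault_disable start as B_FALSE; an uninitialized flag
         * here randomly selects __copy_from_user_inatomic() and
         * surfaces spurious EFAULTs.
         */
        bzero(&uio, sizeof (uio));
        uio.uio_iov = &iov;
        uio.uio_iovcnt = 1;
        uio.uio_segflg = UIO_USERSPACE;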
* Make zfs_special_class_metadata_reserve_pct into a parameter (DeHackEd, 2019-05-07; 1 file, -0/+5)

    Exported and documented a new module parameter.

    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: DHE <[email protected]>
    Closes #8706
* Fix send/recv lost spill block (Brian Behlendorf, 2019-05-07; 5 files, -15/+142)

    When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes into account the DRR_OBJECT record and the existing object properties.

    This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept.

    What's needed to resolve this is an additional flag in the DRR_OBJECT record which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set, the object references a spill block and it must be kept; either it is up to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed.

    Since previous versions of ZFS do not understand this new flag, additional DRR_SPILL records will be inserted into the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks, which can be identified by checking for the DRR_SPILL_UNMODIFIED flag.

    The small downside to this approach is that it may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS, spill blocks may still be lost.

    OpenZFS-issue: https://www.illumos.org/issues/9952
    FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277
    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Matt Ahrens <[email protected]>
    Reviewed-by: Tom Caputi <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8668
* Fix `zfs set atime|relatime=off|on` behavior on inherited datasets (Tomohiro Kusumi, 2019-05-07; 3 files, -9/+61)

    `zfs set atime|relatime=off|on` doesn't disable or enable the property on read for datasets whose property was inherited from the parent, until the dataset is unmounted and mounted again. (The properties start to work properly once the dataset is unmounted and mounted again. The difference comes from the fact that the regular mount process, e.g. via zpool import, uses mount options based on properties read from the on-disk layout for each dataset, whereas `zfs set atime|relatime=off|on` just remounts a specified dataset.)

        # zpool create p1 <device>
        # zfs create p1/f1
        # zfs set atime=off p1
        # echo test > /p1/f1/test
        # sync
        # zfs list
        NAME    USED  AVAIL  REFER  MOUNTPOINT
        p1      176K  18.9G  25.5K  /p1
        p1/f1    26K  18.9G    26K  /p1/f1
        # zfs get atime
        NAME   PROPERTY  VALUE  SOURCE
        p1     atime     off    local
        p1/f1  atime     off    inherited from p1
        # stat /p1/f1/test | grep Access | tail -1
        Access: 2019-04-26 23:32:33.741205192 +0900
        # cat /p1/f1/test
        test
        # stat /p1/f1/test | grep Access | tail -1
        Access: 2019-04-26 23:32:50.173231861 +0900  <- changed by read(2)

    The problem is that zfsvfs::z_atime, which was probably intended to keep the in-core atime state, just gets updated by a callback function of the "atime" property change, atime_changed_cb(), and is never used for anything else. Since all file reads and atime updates now go through a common function, zpl_iter_read_common() -> file_accessed(), and whether to update atime via ->dirty_inode() is determined by atime_needs_update(), atime_needs_update() needs to return false once atime is turned off. It currently continues to return true on `zfs set atime=off`.

    Fix atime_changed_cb() by setting or dropping SB_NOATIME in the VFS super block depending on the new atime value, so that atime_needs_update() works as expected after a property change.

    The same problem applies to "relatime", except that a self-contained relatime test is needed. This is because relatime_need_update() is based on a mount option flag, MNT_RELATIME, which doesn't exist in datasets with an inherited "relatime" property via `zfs set relatime=...`, hence it needs its own relatime test, zfs_relatime_need_update().

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8674
    Closes #8675
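    A hedged sketch of the callback fix (mirrors the description above; SB_NOATIME is the real VFS flag):

        static void
        atime_changed_cb(void *arg, uint64_t newval)
        {
                zfsvfs_t *zfsvfs = arg;
                struct super_block *sb = zfsvfs->z_sb;

                if (sb == NULL)
                        return;
                /*
                 * Keep the superblock flag in sync with the property so
                 * atime_needs_update() sees the change without a remount.
                 */
                if (newval)
                        sb->s_flags &= ~SB_NOATIME;
                else
                        sb->s_flags |= SB_NOATIME;
        }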
* Fix typo/etc in module/zfs/zfs_ctldir.c (Tomohiro Kusumi, 2019-05-05; 1 file, -4/+3)

    Drop duplicated phrases in comments. Also drop an obsolete comment "Perform a mount of the associated...", as all it does now is get the objid from the DMU and look up the incore inode.

    Reviewed-by: Giuseppe Di Natale <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8707
* Linux 5.0 compat: Use totalhigh_pages() (Tomohiro Kusumi, 2019-05-04; 1 file, -1/+1)

    Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839 "mm: convert totalram_pages and totalhigh_pages variables to atomic" replaced `totalhigh_pages` with an inline function `totalhigh_pages()`. This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages` on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686).

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8677
    Closes #8701
* Improve rate at which new zvols are processed (John Gallagher, 2019-05-04; 2 files, -86/+116)

    The kernel function which adds new zvols as disks to the system, add_disk(), briefly opens and closes the zvol as part of its work. Closing a zvol involves waiting for two txgs to sync. This, combined with the fact that the taskq processing new zvols is single threaded, makes processing new zvols slow.

    Waiting for these txgs to sync is only necessary if the zvol has been written to, which is not the case during add_disk(). This change adds tracking of whether a zvol has been written to, so that we can skip the txg_wait_synced() calls when they are unnecessary.

    This change also fixes the flags passed to blkdev_get_by_path() by vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of just FMODE_EXCL. The flags were being incorrectly calculated because we were using the wrong version of vdev_bdev_mode().

    Reviewed-by: George Wilson <[email protected]>
    Reviewed-by: Serapheim Dimitropoulos <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: John Gallagher <[email protected]>
    Closes #8526
    Closes #8615
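    A hedged sketch of the close path (the flag name is an assumption based on the description, not quoted from the patch):

        /*
         * Only pay the two-txg wait on close if something was actually
         * written while the zvol was open.
         */
        if (zv->zv_flags & ZVOL_WRITTEN_TO)
                txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);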
* Reword comment in lz4_compress_zfs (Matthew Ahrens, 2019-05-02; 1 file, -4/+4)

    The comment in lz4_compress_zfs could be more clear and specific. It also contains needlessly strong language.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Serapheim Dimitropoulos <[email protected]>
    Signed-off-by: Matthew Ahrens <[email protected]>
    Closes: #8702
    Closes: #8703
* Add feature check for 'zpool resilver' command (Tom Caputi, 2019-05-02; 1 file, -0/+4)

    The 'zpool resilver' command requires that the resilver_defer feature is active on the pool. Unfortunately, the check for this was left out of the original patch. This commit simply corrects this so that the command properly returns an error in this case.

    Reviewed by: Brian Behlendorf <[email protected]>
    Reviewed-by: Igor Kozhukhov <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Closes #8700
* Correct snprintf() size argument (Tomohiro Kusumi, 2019-04-30; 1 file, -2/+1)

    The size argument of snprintf(3) in glibc and snprintf() in the Linux kernel includes the trailing '\0', as the snprintf(3) man page explains: "write at most size bytes (including the trailing null byte ('\0'))". In other words, snprintf() can simply take the buffer size.

    For the snprintf() call in module/zfs/zfs_ctldir.c, the buffer size is MAXPATHLEN and the caller is passing MAXPATHLEN to snprintf(), so the size should just be `path_len` to do what the caller is trying to do.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Richard Laager <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8692
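    A hedged standalone illustration of correct sizing:

        #include <stdio.h>

        /*
         * snprintf()'s size argument already accounts for the trailing
         * '\0', so pass the full buffer size, not size - 1.
         */
        static void
        build_path(char *buf, size_t buflen, const char *dir, const char *name)
        {
                (void) snprintf(buf, buflen, "%s/%s", dir, name);
        }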
* Linux 5.0 compat: Remove incorrect ASSERT (Brian Behlendorf, 2019-04-29; 1 file, -1/+0)

    Not all block devices, notably scsi_debug, set a root_blkg on the request queue. Remove this assertion and allow the existing call to blkg_tryget() to gracefully handle the NULL (which it does).

    Reviewed-by: Tomohiro Kusumi <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #8678
* Use SEEK_{SET,CUR,END} for file seek "whence" (Tomohiro Kusumi, 2019-04-25; 5 files, -12/+12)

    Use either SEEK_* or 0,1,2..., but not both.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8656
* Fixes for the DMU free throttle (Tom Caputi, 2019-04-25; 1 file, -29/+36)

    This patch fixes 2 issues with the DMU free throttle implemented in dmu_free_long_range(). The first issue is that get_next_chunk() was calculating the number of L1 blocks the free would dirty incorrectly. In some cases involving extremely large files, this code would greatly overestimate the number of affected L1 blocks, causing excessive calls to txg_wait_open(). This patch corrects the calculation.

    The second issue is that the free throttle uses the total number of freed blocks in all (open, quiescing, and syncing) txgs to determine whether to throttle. This causes large frees (such as those created by the first issue) to cause 4 txg syncs before any further frees were allowed to proceed. This patch ensures that the accounting is done entirely in a per-txg fashion, so that frees from a given txg don't affect those that immediately follow it.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Matthew Ahrens <[email protected]>
    Signed-off-by: Tom Caputi <[email protected]>
    Closes #8655
* Drop unused ZNODE_STATS and ZNODE_STAT_ADD() (Tomohiro Kusumi, 2019-04-19; 1 file, -14/+0)

    Unused since 5649246dd3 ("Remove znode move functionality"), and ZNODE_STAT_ADD() will never be needed.

    Reviewed-by: Richard Elling <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Tomohiro Kusumi <[email protected]>
    Closes #8636
* Fix incorrect "[UNUSED]" commentsTomohiro Kusumi2019-04-191-2/+2
| | | | | | | | | | | These aren't unused. `flag` in zfs_create() also isn't to indicate large file. Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8635
* Code improvement and bug fixes for QAT support (cfzhu, 2019-04-16; 4 files, -38/+143)

    1. Support QAT when ZFS is the root file system: when the ZFS module is loaded before QAT has started, QAT can be started again in post-process, e.g.:
       echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
       echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable
       echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
    2. Verify the Adler checksum of the decompression result.
    3. Allocate the Digest, IV and AAD buffers in physically contiguous memory via QAT_PHYS_CONTIG_ALLOC.
    4. Update the documentation for zfs_qat_compress_disable, zfs_qat_checksum_disable, zfs_qat_encrypt_disable.

    Reviewed-by: Tom Caputi <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Weigang Li <[email protected]>
    Signed-off-by: Chengfeix Zhu <[email protected]>
    Closes #8323
    Closes #8610
* Update a comment to match the code (Richard Laager, 2019-04-16; 1 file, -2/+2)

    GRUB supports large_blocks.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Richard Laager <[email protected]>
    Closes #8626