aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Implement Redacted Send/ReceivePaul Dagnelie2019-06-19103-2069/+10914
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Prashanth Sreenivasa <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Chris Williamson <[email protected]> Reviewed-by: Pavel Zhakarov <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #7958
* Minimize aggsum_compare(&arc_size, arc_c) calls.Alexander Motin2019-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | For busy ARC situation when arc_size close to arc_c is desired. But then it is quite likely that aggsum_compare(&arc_size, arc_c) will need to flush per-CPU buckets to find exact comparison result. Doing that often in a hot path penalizes whole idea of aggsum usage there, since it replaces few simple atomic additions with dozens of lock acquisitions. Replacing aggsum_compare() with aggsum_upper_bound() in code increasing arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles allows to save ~5% of CPU time in aggsum code during sequential write to 12 ZVOLs with 16KB block size on large dual-socket system. I suppose there some minor arc_p behavior change due to lower precision of the new code, but I don't think it is a big deal, since it should affect only very small window in time (aggsum buckets are flushed every second) and in ARC size (buckets are limited to 10 average ARC blocks per CPU). Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #8901
* Python config cleanupRyan Moeller2019-06-132-79/+53
| | | | | | | | | | | Don't require Python at configure/build unless building pyzfs. Move ZFS_AC_PYTHON_MODULE to always-pyzfs.m4 where it is used. Make test syntax more consistent. Sponsored by: iXsystems, Inc. Reviewed-by: Neal Gompa <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #8895
* lz4_decompress_abd declared but not definedMatthew Ahrens2019-06-131-2/+1
| | | | | | | | | | | `lz4_decompress_abd` is declared in zio_compress.h but it is not defined anywhere. The declaration should be removed. Reviewed by: Dan Kimmel <[email protected]> Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-47477 Closes #8894
* panic in removal_remap test on 4K devicesMatthew Ahrens2019-06-134-14/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the zfs_remove_max_segment tunable is changed to be not a multiple of the sector size, then the device removal code will malfunction and try to create mappings that are smaller than one sector, leading to a panic. On debug bits this assertion will fail in spa_vdev_copy_segment(): ASSERT3U(DVA_GET_ASIZE(&dst), ==, size); On nondebug, the system panics with a stack like: metaslab_free_concrete() metaslab_free_impl() metaslab_free_impl_cb() vdev_indirect_remap() free_from_removing_vdev() metaslab_free_impl() metaslab_free_dva() metaslab_free() Fortunately, the default for zfs_remove_max_segment is 1MB, so this can't occur by default. We hit it during this test because removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on 4KB-sector disks, we hit the bug. This change makes the zfs_remove_max_segment tunable more robust, automatically rounding it up to a multiple of the sector size. We also turn some key assertions into VERIFY's so that similar bugs would be caught before they are encoded on disk (and thus avoid a panic-reboot-loop). Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-61342 Closes #8893
* compress metadata in later sync passesMatthew Ahrens2019-06-132-4/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable compression (including of metadata). Ostensibly this helps the sync passes to converge (i.e. for a sync pass to not need to allocate anything because it is 100% overwrites). However, in practice it increases the average number of sync passes, because when we turn compression off, a lot of block's size will change and thus we have to re-allocate (not overwrite) them. It also increases the number of 128KB allocations (e.g. for indirect blocks and spacemaps) because these will not be compressed. The 128K allocations are especially detrimental to performance on highly fragmented systems, which may have very few free segments of this size, and may need to load new metaslabs to satisfy 128K allocations. We should increase zfs_sync_pass_dont_compress. In practice on a highly fragmented system we see a few 5-pass txg's, a tiny number of 6-pass txg's, and no txg's with more than 6 passes. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: George Wilson <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-63431 Closes #8892
* Move write aggregation memory copy out of vq_lockAlexander Motin2019-06-131-10/+12
| | | | | | | | | | | | | | | | | | | Memory copy is too heavy operation to do under the congested lock. Moving it out reduces congestion by many times to almost invisible. Since the original zio removed from the queue, and the child zio is not executed yet, I don't see why would the copy need protection. My guess it just remained like this from the time when lock was not dropped here, which was added later to fix lock ordering issue. Multi-threaded sequential write tests with both HDD and SSD pools with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major reduction of lock congestion, saving from 15% to 35% of CPU time and increasing throughput from 10% to 40%. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #8890
* looping in metaslab_block_picker impacts performance on fragmented poolsMatthew Ahrens2019-06-132-60/+117
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On fragmented pools with high-performance storage, the looping in metaslab_block_picker() can become the performance-limiting bottleneck. When looking for a larger block (e.g. a 128K block for the ZIL), we may search through many free segments (up to hundreds of thousands) to find one that is large enough to satisfy the allocation. This can take a long time (up to dozens of ms), and is done while holding the ms_lock, which other threads may spin waiting for. When this performance problem is encountered, profiling will show high CPU time in metaslab_block_picker, as well as in mutex_enter from various callers. The problem is very evident on a test system with a sync write workload with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage, 84% full and 88% fragmented. It has also been observed on production systems with 90TB of storage, 76% full and 87% fragmented. The fix is to change metaslab_df_alloc() to search only up to 16MB from the previous allocation (of this alignment). After that, we will pick a segment that is of the exact size requested (or larger). This reduces the number of iterations to a few hundred on fragmented pools (a ~100x improvement). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-62324 Closes #8877
* Restrict filesystem creation if name referred either '.' or '..'Tulsi Jain2019-06-134-1/+36
| | | | | | | | | | | This change restricts filesystem creation if the given name contains either '.' or '..' Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Elling <[email protected]> Signed-off-by: TulsiJain <[email protected]> Closes #8842 Closes #8564
* ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread()Matthew Ahrens2019-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | When running zloop, we occasionally see the following crash: dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0) ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3] The error value 0x1c is ENOSPC. The transaction used by spa_vdev_remove_thread() should not be able to fail due to being out of space. i.e. we should not call dmu_tx_hold_space(). This will allow the removal thread to schedule its work even when the pool is low on space. The "slop space" will provide enough free space to sync out the txg. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-37853 Closes #8889
* Fix lockdep warning on insmodTomohiro Kusumi2019-06-121-0/+5
| | | | | | | | | | | | | | sysfs_attr_init() is required to make lockdep happy for dynamically allocated sysfs attributes. This fixed #8868 on Fedora 29 running kernel-debug. This requirement was introduced in 2.6.34. See include/linux/sysfs.h for what it actually does. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8868 Closes #8884
* fat zap should prefetch when iteratingMatthew Ahrens2019-06-126-9/+140
| | | | | | | | | | | | | | | | | When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins. For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel. Reviewed-by: Andreas Dilger <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-58347 Closes #8862
* Target ARC size can get reduced to arc_c_minMatthew Ahrens2019-06-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | Sometimes the target ARC size is reduced to arc_c_min, which impacts performance. We've seen this happen as part of the random_reads performance regression test, where the ARC size is reduced before the reads test starts which impacts how long it takes for system to reach good IOPS performance. We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE, and arc_available_memory() is less than arc_c>>arc_shrink_shift. However, arc_available_memory() could easily be low, even when arc_c is low, because we can have tons of unused bufs in the abd kmem cache. This would be especially true just after the DMU requests a bunch of stuff be evicted from the ARC (e.g. due to "zpool export"). To fix this, the ARC should reduce arc_c by the requested amount, not all the way down to arc_size (or arc_c_min), which can be very small. Reviewed-by: Tim Chase <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-59431 Closes #8864
* Fix typo in vdev_raidz_math.cbnjf2019-06-121-1/+1
| | | | | | | | | Fix typo in vdev_raidz_math.c Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Brad Forschinger <[email protected]> Closes #8875 Closes #8880
* single-chunk scatter ABDs can be treated as linearMatthew Ahrens2019-06-114-55/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads. https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, the sequential_reads_arc_cached tests in the test suite were reduced by around 5% (this is doing reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC). This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's. Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by 87c25d567. Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization can not be made on 32-bit systems. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8580
* make zil max block size tunableMatthew Ahrens2019-06-107-32/+96
| | | | | | | | | | | | | | | | | | | | | | | We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #8865
* Fix comparison signedness in arc_is_overflowing()Alexander Motin2019-06-101-2/+2
| | | | | | | | | | | | When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values, that due to unsigned comparison caused delays, waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use of signed comparison there fixes the problem. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Closes #8873
* Fix incorrect error message for raw receiveTom Caputi2019-06-101-2/+9
| | | | | | | | | | | | | | | | | | This patch fixes an incorrect error message that comes up when doing a non-forcing, raw, incremental receive into a dataset that has a newer snapshot than the "from" snapshot. In this case, the current code prints a confusing message about an IVset guid mismatch. This functionality is supported by non-raw receives as an undocumented feature, but was never supported by the raw receive code. If this is desired in the future, we can probably figure out a way to make it work. Reviewed by: Brian Behlendorf <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Issue #8758 Closes #8863
* Improve ZTS block_device_wait debuggingRichard Elling2019-06-101-1/+16
| | | | | | | | | | | | | | | | | | The udevadm settle timeout can be 120 or 180 seconds by default for some distributions. If a long delay is experienced, it could be due to some strangeness in a malfunctioning device that isn't related to the devices under test. To help debug this condition, a notice is given if settle takes too long. Arguments can now be passed to block_device_wait. The expected arguments are block device pathnames. Reviewed by: John Kennedy <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8839
* Block_device_wait does not return an error codeRichard Elling2019-06-105-7/+10
| | | | | | | | | Reviewed by: John Kennedy <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8839
* Remove redundant redundant removeRichard Elling2019-06-101-1/+0
| | | | | | | | | Reviewed by: John Kennedy <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8839
* Fix logic error in setpartition functionRichard Elling2019-06-101-9/+13
| | | | | | | | | Reviewed by: John Kennedy <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Elling <[email protected]> Closes #8839
* arc_summary: prefer python3 version and install when there is no pythonEli Schwartz2019-06-101-3/+1
| | | | | | | | | | | | | This matches the behavior of other python scripts, such as arcstat and dbufstat, which are always installed but whose install-exec-hook actions will simply touch up the shebang if a python interpreter was configured *and* that interpreter is a python2 interpreter. Fixes installation in a minimal build chroot without python available. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Eli Schwartz <[email protected]> Closes #8851
* Fix %post and %postun generation in kmodtoolSamuel VERSCHELDE2019-06-101-2/+2
| | | | | | | | | | | | During zfs-kmod RPM build, $(uname -r) gets unintentionally evaluated on the build host, once and for all. It should be evaluated during the execution of the scriptlets on the installation host. Escaping the $ character avoids evaluating it during build. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Neal Gompa <[email protected]> Signed-off-by: Samuel Verschelde <[email protected]> Closes #8866
* Allow metaslab to be unloaded even when not freed fromPaul Dagnelie2019-06-063-22/+40
| | | | | | | | | | | | | | | | | | | | | | On large systems, the memory used by loaded metaslabs can become a concern. While range trees are a fairly efficient data structure, on heavily fragmented pools they can still consume a significant amount of memory. This problem is amplified when we fail to unload metaslabs that we aren't using. Currently, we only unload a metaslab during metaslab_sync_done; in order for that function to be called on a given metaslab in a given txg, we have to have dirtied that metaslab in that txg. If the dirtying was the result of an allocation, we wouldn't be unloading it (since it wouldn't be 8 txgs since it was selected), so in effect we only unload a metaslab during txgs where it's being freed from. We move the unload logic from sync_done to a new function, and call that function on all metaslabs in a given vdev during vdev_sync_done(). Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #8837
* Avoid updating zfs_gitrev.h when rev is unchangedJorgen Lundman2019-06-061-0/+4
| | | | | | | | | | Build process would always re-compile spa_history.c due to touching zfs_gitrev.h - avoid if no change in gitrev. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Jorgen Lundman <[email protected]> Closes #8860
* Reinstate raw receive check when truncatingTom Caputi2019-06-061-1/+15
| | | | | | | | | | | | | | This patch re-adds a check that was removed in 369aa50. The check confirms that a raw receive is not occuring before truncating an object's dn_maxblkid. At the time, it was believed that all cases that would hit this code path would be handled in other places, but that was not the case. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8852 Closes #8857
* l2arc_apply_transforms: Fix typo in commentAllan Jude2019-06-061-1/+1
| | | | | | | | | Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Allan Jude <[email protected]> Closes #8822
* Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_thresholdSerapheim Dimitropoulos2019-06-062-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically in a hypothetical pool with two vdevs, what can happen is the following: Vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth utilizing one vdev at a time with IOPS taking a hit. Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases but we don't remove (at least for now) just in case we hit any weird behavior in the future. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #8859
* If $ZFS_BOOTFS contains guid, replace the guid portion with $poolGarrett Fields2019-06-061-1/+3
| | | | | | Reviewed-by: George Melikov <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Garrett Fields <[email protected]> Closes #8356
* Fix integer overflow of ZTOI(zp)->i_generationTom Caputi2019-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | The ZFS on-disk format stores each inode's generation ID as a 64 bit number on disk and in-core. However, the Linux kernel's inode is only a 32 bit number. In most places, the code handles this correctly, but the cast is missing in zfs_rezget(). For many pools, this isn't an issue since the generation ID is computed as the current txg when the inode is created and many pools don't have more than 2^32 txgs. For the pools that have more txgs, this issue causes any inode with a high enough generation number to report IO errors after a call to "zfs rollback" while holding the file or directory open. This patch simply adds the missing cast. Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes #8858
* hkdf_test binary should only have one icp instanceDon Brady2019-06-051-2/+1
| | | | | | | | | | | The build for test binary hkdf_test was linking both against libicp and libzpool. This results in two instances of libicp inside the binary but the call to icp_init() only initializes one of them! Reviewed-by: Richard Elling <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #8850
* Drop objid argument in zfs_znode_alloc() (sync with OpenZFS)Tomohiro Kusumi2019-06-051-5/+4
| | | | | | | | | | | | | Since zfs_znode_alloc() already takes dmu_buf_t*, taking another uint64_t argument for objid is redundant. inode's ->i_ino does and needs to match znode's ->z_id. zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument since vnode doesn't have vnode# in VFS (hence ->z_id exists). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8841
* Fixed a small typo in man/man1/raidz_test.1Peter Wirdemo2019-06-051-1/+1
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: Peter Wirdemo <[email protected]> Closes #8855
* Allow TRIM_UNUSED_KSYM when build as a builtin-moduleTorsten Wörtwein2019-06-041-2/+3
| | | | | | | | If ZFS is built with enable_linux_builtin, it seems to be possible to compile the kernel with TRIM_UNUSED_KSYM. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Torsten Wörtwein <[email protected]> Closes #8820
* Make Python detection optional and more portableRyan Moeller2019-06-043-24/+39
| | | | | | | | | | | | | | | | | | | | | Previously, --without-python would cause ./configure to fail. Now it is able to proceed, and the Python scripts will not be built. Use portable parameter expansion matching instead of nonstandard substring matching to detect the Python version. This test is duplicated in several places, so define a function for it. Don't assume the full path to binaries, since different platforms do install things in different places. Use AC_CHECK_PROGS instead. When building without Python, also build without pyzfs. Sponsored by: iXsystems, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Eli Schwartz <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #8809 Closes #8731
* Wait in 'S' state when send/recv pipe is blockingDeHackEd2019-06-031-2/+2
| | | | | | | | Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: DHE <[email protected]> Closes #8733 Closes #8752
* Make zfs_async_block_max_blocks handle zero correctlyTulsiJain2019-06-031-1/+3
| | | | | | | | | Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: TulsiJain <[email protected]> Closes #8829 Closes #8289
* Revert "Report holes when there are only metadata changes"Brian Behlendorf2019-05-301-28/+3
| | | | | | | | | | | | This reverts commit ec4f9b8f30 which introduced a narrow race which can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO. Resolve the issue by revering this change to restore the previous behavior which depends solely on checking the dirty list. Reviewed-by: Olaf Faaland <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8816 Closes #8834
* Add link count test for root inodeTomohiro Kusumi2019-05-293-2/+122
| | | | | | | | | | Add tests for 97aa3ba44("Fix link count of root inode when snapdir is visible") as suggested in #8727. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8732
* Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod)Tomohiro Kusumi2019-05-296-93/+1
| | | | | | | | | | | | | | | | | | Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading zfs.ko. The rest of initialization functions being called here after cwd set to / don't depend on cwd of the process except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8826
* Exclude log device ashift from normal classBrian Behlendorf2019-05-291-4/+1
| | | | | | | | | | | | | | | | When opening a log device during import its allocation bias will not yet have been set by vdev_load(). This results in the log device's ashift being incorrectly applied to the maximum ashift of the vdevs in the normal class. Which in turn prevents the removal of any top-level devices due to the ashift check in the spa_vdev_remove_top_check() function. This issue is resolved by including vdev_islog in the check since it will be set correctly during vdev_open(). Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Igor Kozhukhov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8735
* Fix integer overflow in get_next_chunk()madz2019-05-291-2/+2
| | | | | | | | | | dn->dn_datablksz type is uint32_t and need to be casted to uint64_t to avoid an overflow when the record size is greater than 4 MiB. Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Olivier Mazouffre <[email protected]> Closes #8778 Closes #8797
* grammar: it is / plural agreementJosh Soref2019-05-281-2/+2
| | | | | | | Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: Josh Soref <[email protected]> Closes #8818
* Refactor parent dataset handling in libzfs zfs_rename()Tomohiro Kusumi2019-05-281-9/+4
| | | | | | | | | | | For recursive renaming, simplify the code by moving `zhrp` and `parentname` to inner scope. `zhrp` is only used to test existence of a parent dataset for recursive dataset dir scan since ba6a24026c. Reviewed by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Laager <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Signed-off-by: Tomohiro Kusumi <[email protected]> Closes #8815
* Double-free of encryption wrapping key due to invalid pool propertiesloli10K2019-05-282-12/+14
| | | | | | | | | | This commits fixes a double-free in zfs_ioc_pool_create() triggered by specifying an unsupported combination of properties when creating a pool with encryption enabled. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8791
* Update comments to match codeRyan Moeller2019-05-281-6/+6
| | | | | | | | | | | | s/get_vdev_spec/make_root_vdev The former doesn't exist anymore. Sponsored by: iXsystems, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes #8759
* tests: fix cosmetic permission issues during `make install`Stoiko Ivanov2019-05-286-8/+12
| | | | | | | | | | | | | | | | files in dist_*_SCRIPTS get installed with 0755, those in dist_*_DATA with 0644. This commit moves all .kshlib, .shlib and .cfg files in the testsuite to dist_pkgdata_DATA, and removes the shebang from zpool_import.kshlib. This ensures that the files are installed with appropriate permissions and silences some warnings from lintian Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed by: John Kennedy <[email protected]> Signed-off-by: Stoiko Ivanov <[email protected]> Closes #8803
* test-runner.py: change shebang to python3Stoiko Ivanov2019-05-281-1/+1
| | | | | | | | | | | | | | | | | In commit 6e72a5b9b61066146deafda39ab8158c559f5f15 python scripts which work with python2 and python3 changed the shebang from /usr/bin/python to /usr/bin/python3. This gets adapted by the build-system on systems which don't provide python3. This commit changes test-runner.py to also use /usr/bin/python3, enabling the change during buildtime and fixing a minor lintian issue for those Debian packages, which depend on a specific python version (python3/python2). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed by: John Kennedy <[email protected]> Signed-off-by: Stoiko Ivanov <[email protected]> Closes #8803
* Endless loop in zpool_do_remove() on platforms with unsigned charloli10K2019-05-283-4/+4
| | | | | | | | | | | | On systems where "char" is an unsigned type the value returned by getopt() will never be negative (-1), leading to an endless loop: this issue prevents both 'zpool remove' and 'zstreamdump' for working on some systems. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Chris Dunlop <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #8789