Commit message (Author, Age, Files, Lines)
* DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone() (Dan Kimmel, 2016-09-13, 3 files, -233/+306)
| | | | | | | | Authored by: Dan Kimmel <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> Issue #5078
* Remove lint suppression from dmu.h and unnecessary dmu.h include in spa.h (Dan Kimmel, 2016-09-13, 2 files, -9/+2)
| | | | | | | | Authored by: Dan Kimmel <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> Issue #5078
* Enable raw writes to perform dedup with verification (Tom Caputi, 2016-09-13, 1 file, -7/+49)
| | | | | | | | Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Reviewed by: David Quigley <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Issue #5078
* DLPX-40252 integrate EP-476 compressed zfs send/receive (Dan Kimmel, 2016-09-13, 24 files, -540/+1052)
| | | | | | | | Authored by: Dan Kimmel <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> Issue #5078
* OpenZFS 6950 - ARC should cache compressed data (George Wilson, 2016-09-13, 27 files, -2012/+2485)

  Authored by: George Wilson <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Matt Ahrens <[email protected]>
  Reviewed by: Paul Dagnelie <[email protected]>
  Reviewed by: Tom Caputi <[email protected]>
  Reviewed by: Brian Behlendorf <[email protected]>
  Ported by: David Quigley <[email protected]>

  This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently accessed data uncompressed.

  I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block.

  Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one.

  Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata.

  When an l2arc is in use it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled, the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining that it's valid.

  OpenZFS-issue: https://www.illumos.org/issues/6950
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
  Issue #5078
* OpenZFS 7262 - remove seq from zfs_receive_010.ksh (Paul Dagnelie, 2016-09-12, 1 file, -2/+2)
| | | | | | | | | | | | | | | Authored by: Paul Dagnelie <[email protected]> Reviewed by: John Wren Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: candychencan <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7262 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b868f5d Closes #5080
* Fix memleak in zfs_do_* and zpool_do_* (luozhengzheng, 2016-09-12, 2 files, -15/+58)
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: luozhengzheng <[email protected]> Closes #5056
* Allow ZVOL bookmarks to be listed recursively (loli10K, 2016-09-12, 1 file, -3/+3)
| | | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: loli10K <[email protected]> Closes #4503 Closes #5072
* Remove redundant assignments to arc_c (Tim Chase, 2016-09-12, 1 file, -10/+0)
| | | | | | | | | | | | Several assignments to arc_c had no effect because it is ultimately initialized to arc_c_max. This aligns ZoL better with the upstream code which removed these assignments some time ago. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #5081
* Refactor spa_load_l2cache to make build happy (Nikolay Borisov, 2016-09-12, 1 file, -29/+28)
| | | | | | | | | | | | | | | | | In case sav->sav_config was NULL the body of the function would skip the iteration of the l2 cache devices and will just cleanup the old devices. However, this wasn't very obvious since the null check was performed after the loop body and after the old devices were cleaned. Refactor the code so that it's now obvious when the iteration of the l2cache devices is skipped. This fixes the following cppcheck warning: [module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Closes #5087
* Free property names with spa_strfree() rather than strfree() (Tim Chase, 2016-09-12, 1 file, -1/+1)
| | | | | | | | | | | Since they're allocated with spa_strdup(), they should be freed with spa_strfree() so the proper length buffer is freed. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #5082 Closes #5086
* Fix memory/fd leak in check_file() and is_spare() (liuhuang, 2016-09-12, 1 file, -2/+7)
| | | | | | Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: liuhuang <[email protected]> Closes #5085
* Fix make lint target (Brian Behlendorf, 2016-09-09, 1 file, -1/+1)
| | | | | | | | When errors are detected 'make lint' should return a non-zero error code. The value 2 was chosen to indicate these are warnings and not fatal. Signed-off-by: Brian Behlendorf <[email protected]>
* zfs dracut module should not assume systemd presence (Moritz Maxeiner, 2016-09-09, 1 file, -8/+10)
| | | | | | | Signed-off-by: Moritz Maxeiner <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Closes #4749 Closes #5058
* Adapt genkernel fix for zfsonlinux/zfs#4749 to zfs dracut module (Moritz Maxeiner, 2016-09-09, 1 file, -0/+6)
| | | | | | | Signed-off-by: Moritz Maxeiner <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Closes #4749 Closes #5058
* OpenZFS - Performance regression suite for zfstest (John Wren Kennedy, 2016-09-08, 31 files, -10/+1360)

  Author: John Wren Kennedy <[email protected]>
  Reviewed by: Prakash Surya <[email protected]>
  Reviewed by: Dan Kimmel <[email protected]>
  Reviewed by: Matt Ahrens <[email protected]>
  Reviewed by: Paul Dagnelie <[email protected]>
  Reviewed by: Don Brady <[email protected]>
  Reviewed by: Richard Elling <[email protected]>
  Reviewed by: Brian Behlendorf <[email protected]>
  Reviewed-by: David Quigley <[email protected]>
  Approved by: Richard Lowe <[email protected]>
  Ported-by: Don Brady <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/6950
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dcbf3bd6
  Delphix-commit: https://github.com/delphix/delphix-os/commit/978ed49
  Closes #4929

  ZFS Test Suite Performance Regression Tests

  This was pulled into OpenZFS via the compressed arc feature and was separated out in zfsonlinux as a separate pull request from PR-4768. It originally came in as QA-4903 in Delphix-OS from John Kennedy.

  Expected Usage:

      $ DISKS="sdb sdc sdd" zfs-tests.sh -r perf-regression.run

  Porting Notes:
  1. Added assertions in the setup script to make sure required tools (fio, mpstat, ...) are present.
  2. For the config.json generation in perf.shlib used arcstats and other binaries instead of dtrace to query the values.
  3. For the perf data collection:
     - use "zpool iostat -lpvyL" instead of the io.d dtrace script (currently not collecting zfs_read/write latency stats)
     - mpstat and iostat take different arguments
     - prefetch_io.sh is a placeholder that uses arcstats instead of dtrace
  4. Build machines require the fio, mdadm and sysstat packages (YMMV).

  Future Work:
  - Need a way to measure zfs_read and zfs_write latencies per pool.
  - Need tools that take two sets of output and display/graph the differences.
  - Bring over additional regression tests from Delphix.
* Real disk partitioning now enabled in test suite for Linux (Sydney Vanda, 2016-09-08, 64 files, -58/+776)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using real devices, specify DISKS="sdb sdc sdd" opposed to /dev/sdb in zfs-tests.sh - otherwise errors with directory names and disk names registering as "/dev//dev/sdb" for some tests. The same goes for mpath: DISK="mpatha mpathad mpathb" Expected Usage: $ DISKS="sdb sdc sdd" zfs-tests.sh SLICE_PREFIX is now set as "p" for a loop device (ie loop0p2) or "" for a real device (ie sdb2), or either for multipath devices (ie mpatha1 or mpath1p1) instead of only "p" by default. Note that kpartx partitioning is not currently supported in this patch (ie "partx") and may need to be disabled on Debian distributions. Functions added for determining test directory (/dev or /dev/mapper) as well as slice prefix are determined and exported mostly in the cfg file of each test group directory. Currently zpools cannot be created on whole mpath devices that have been partitioned. In order to fix this tests have either been revised to use a partition instead, or if there is a size constraint and the pool needs to be created on the whole disk, partitions are then deleted if the device is a multipath device. This functionality is added to default_cleanup() or to individual cleanup scripts if a non-default cleanup method is used. The max partitions is currently set at 8 to account for all of the tests thus far. Patch changes are generally encompassed in "if is_linux" construct. Signed-off-by: Sydney Vanda <[email protected]> Reviewed-by: John Salinas <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: David Quigley <[email protected]> Closes #4447 Closes #4964 Closes #5074
* Tag 0.7.0-rc1 [tag: zfs-0.7.0-rc1] (Brian Behlendorf, 2016-09-07, 1 file, -2/+2)
| | | | | | First release candidate. Signed-off-by: Brian Behlendorf <[email protected]>
* Bring over illumos ZFS FMA logic -- phase 1 (Don Brady, 2016-09-01, 20 files, -39/+1837)
| | | | | | | | | | | | | This first phase brings over the ZFS SLM module, zfs_mod.c, to handle auto operations in response to disk events. Disk event monitoring is provided from libudev and generates the expected payload schema for zfs_mod. This work leverages the recently added devid and phys_path strings in the vdev label. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #4673
* Delete unreferenced function zfs_ereport_send_interim_checksum (luozhengzheng, 2016-09-01, 2 files, -13/+0)
| | | | | | Signed-off-by: luozhengzheng <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5055
* kmem_zalloc with KM_SLEEP will never return NULL (luozhengzheng, 2016-09-01, 2 files, -44/+1)
| | | | | | | | | These allocations can never fail. Leaving the error handling code here gives the impression they can so it has been removed. Signed-off-by: luozhengzheng <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5048
* Fix zfs_unmount() and zfs_unshare_proto() leaks (cao, 2016-09-01, 1 file, -3/+5)
| | | | | | | | | | | Always free mnpt memory on failure in the zfs_unmount() function. In the zfs_unshare_proto() function mountpoint is a const and should not be assigned. Signed-off-by: cao.xuewen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5054
* Performance optimization of AVL tree comparator functions (Gvozden Neskovic, 2016-08-31, 25 files, -274/+148)

  perf: 2.75x faster ddt_entry_compare()
    First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1}, which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3:

        old 6.85789 s
        new 2.49089 s

  perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
    Compute the result directly instead of using conditionals

  perf: zfs_range_compare()
    Speedup between 1.1x - 2.5x, depending on compiler version and optimization level.

  perf: spa_error_entry_compare()
    `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

  perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
  perf: 2.8x faster zil_bp_compare()
  perf: 2.8x faster mze_compare()
  perf: faster dbuf_compare()
  perf: faster compares in spa_misc
  perf: 2.8x faster layout_hash_compare()
  perf: 2.8x faster space_reftree_compare()
  perf: libzfs: faster avl tree comparators
  perf: guid_compare()
  perf: dsl_deadlist_compare()
  perf: perm_set_compare()
  perf: 2x faster range_tree_seg_compare()
  perf: faster unique_compare()
  perf: faster vdev_cache _compare()
  perf: faster vdev_uberblock_compare()
  perf: faster fuid _compare()
  perf: faster zfs_znode_hold_compare()

  Signed-off-by: Gvozden Neskovic <[email protected]>
  Signed-off-by: Richard Elling <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5033
* Fix zhack argument processing (Brian Behlendorf, 2016-08-31, 2 files, -6/+6)

  The argument processing in zhack makes the assumption that getopt() will not permute argv. This isn't true for the GNU implementation of getopt() unless the optstring is prefixed with a '+', which is equivalent to setting the POSIXLY_CORRECT environment variable.

  In addition, update the usage() and optstrings to reflect the existing supported options.

  Signed-off-by: Brian Behlendorf <[email protected]>
  Signed-off-by: liaoyuxiangqin <[email protected]>
  Closes #5047
* Update zpool_import_001_pos (Brian Behlendorf, 2016-08-31, 1 file, -1/+1)
| | | | | | | | | | | Older versions of blkid may not promptly detect ZFS labels when they're located on partitions. In order to ensure this test passes reliably always perform a scan of default search paths (-s). Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: liaoyuxiangqin <[email protected]> Closes #4987 Closes #5047
* Fix "zpool get guid,freeing,leaked" source (Hajo Möller, 2016-08-30, 1 file, -6/+8)
| | | | | | | | | | | | `zpool get guid,freeing,leaked` shows SOURCE as `default`, it should be `-` as those props are not editable. Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes Signed-off-by: Hajo Möller <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4170
* Update zfs_destroy_004.ksh script (cao, 2016-08-30, 2 files, -7/+12)

  Issues:
  Under Linux, destroying $fs fails when executing zfs_destroy_004.ksh. The key issue here is that the illumos kernel treats this case differently than the Linux kernel. On illumos you can unmount and destroy a filesystem which is busy, and all consumers of it get EIO. On Linux the expected behavior is to prevent the unmount and destroy.

  Cause analysis:
  The test creates the $fs file system, mounts it at $mntp, and then runs cd $mntp. Linux does not allow $fs to be destroyed while the current directory is inside its mount point, no matter which parameters are passed to destroy.

  Solution:
  Expect the failure with log_mustnot $ZFS destroy $fs, then cd $olddir and destroy $fs.

  Signed-off-by: caoxuewen [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5012
* Update zfs_create_003_pos.ksh and zfs_create_006_pos.ksh (ChaoyuZhang, 2016-08-30, 3 files, -8/+5)

  As the scripts zfs_create_003_pos.ksh and zfs_create_006_pos.ksh can run successfully on Linux, add them to the <linux.run> file to increase test coverage.

  Signed-off-by: ChaoyuZhang <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5002
* Add log_must_{retry,busy} helpers (Brian Behlendorf, 2016-08-30, 1 file, -0/+75)

  Add helpers which automatically retry the provided command when the error message matches the provided keyword. This provides an easy way to handle the asynchronous nature of some ZFS commands.

  For example, the `zfs destroy` command may need to be retried in the case where the block device is unexpectedly busy. This can be accomplished as follows:

      log_must_busy $ZFS destroy ...

  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #5002
* Update zfs_mount_005_pos.ksh and zfs_mount_010_neg.ksh (liuhuang, 2016-08-30, 3 files, -17/+30)
| | | | | | | | | | Update zfs_mount_005_pos.ksh and zfs_mount_010_neg.ksh to reflect the expected Linux behavior. The is_linux wrapper is used so the test case may be used on Linux and non-Linux platforms. Signed-off-by: liuhuang <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5000
* Delete unused zfsctl_snapdir_inactive declaration (cao, 2016-08-30, 2 files, -2/+0)

  zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7 this declaration remains even though the implementation was removed in commit 278bee93. Removed fastreboot_disable_highpil which is also unused.

  Signed-off-by: caoxuewen [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5042
* OpenZFS 6940 - Cannot unlink directories when over quota (Simon Klinkert, 2016-08-30, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Authored by: Simon Klinkert <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6940 OpenZFS-issue: https://www.illumos.org/issues/6334 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916 Closes #5044
* OpenZFS 6322 - ZFS indirect block predictive prefetch (Alexander Motin, 2016-08-30, 5 files, -26/+91)

  For quite some time I was thinking about the possibility of prefetching ZFS indirection tables while doing sequential reads or writes. Recent changes in the predictive prefetcher made that much easier to do. My tests on a zvol with 16KB block size on a 5x striped and 2x mirrored pool of 10 disks show almost double throughput on sequential read, and almost triple on sequential rewrite. While for reads a similar effect can be achieved by increasing the maximal prefetch distance (though at higher memory cost), for rewrites there is no other solution so far.

  Authored by: Alexander Motin <[email protected]>
  Reviewed by: Matthew Ahrens <[email protected]>
  Reviewed by: Paul Dagnelie <[email protected]>
  Approved by: Robert Mustacchi <[email protected]>
  Ported-by: kernelOfTruth [email protected]
  Signed-off-by: Brian Behlendorf <[email protected]>
  OpenZFS-issue: https://www.illumos.org/issues/6322
  OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413
  Closes #5040

  Porting notes:
  - Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'
  - Difference from upstream in module/zfs/dmu_zfetch.c, uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance
  - Variables have been initialized at the beginning of the function (void dmu_zfetch) to resemble the order of occurrence and account for C99, C11 mode errors.
* OpenZFS 7086 - ztest attempts dva_get_dsize_sync on an embedded blockpointer (Matthew Ahrens, 2016-08-30, 1 file, -6/+14)
| | | | | | | | | | | | | In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Reviewed by: Prakash Surya <[email protected]> Reviewed by: George Wilson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7086 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317 Closes #5039
* Fix: Build warnings with different gcc optimization levels in debug mode (GeLiXin, 2016-08-29, 2 files, -3/+2)

  This fix resolves warnings reported during compiling with different gcc optimization levels in debug mode.

  Test tools:
    gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
    Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago)

  List of warnings:

      CFLAGS=-O1 ./configure --enable-debug ;make
      ../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’:
      ../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function
      ../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here

      CFLAGS=-Os ./configure --enable-debug ; make
      libzfs_dataset.c: In function ‘zfs_prop_set_list’:
      libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function

  Signed-off-by: GeLiXin <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5022
* Fix cv_timedwait_hires (Brian Behlendorf, 2016-08-29, 1 file, -4/+11)
| | | | | | | | | | | | The user space implementation of cv_timedwait_hires() was always passing a relative time to pthread_cond_timedwait() when an absolute time is expected. This was accidentally introduced in commit 206971d2. Replace two magic values with their corresponding preprocessor macro. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #5024
* Add zfs_arc_meta_limit_percent tunable (GeLiXin, 2016-08-23, 2 files, -12/+71)

  ARC will evict meta buffers that exceed the arc_meta_limit. Pending further investigation into whether meta buffers need special protection, this tunable makes arc_meta_limit adjustable for different workloads.

  People can set zfs_arc_meta_limit_percent to any value when loading zfs.ko with insmod, so a range check is added to guarantee a suitable arc_meta_limit.

  As suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style tunable as well.

  Signed-off-by: GeLiXin <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4957
* Prevent reclaim in send_traverse_thread() (Tim Chase, 2016-08-22, 1 file, -0/+5)
| | | | | | | | | | | | | As is the case with traverse_prefetch_thread(), the deep stacks caused by traversal require disabling reclaim in the send traverse thread. Also, do the same for receive_writer_thread() in which similar problems have been observed. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4912 Closes #4998
* Fix: Array bounds read in zprop_print_one_property() (GeLiXin, 2016-08-22, 1 file, -1/+2)

  If the loop index i reaches (ZFS_GET_NCOLS - 1), cbp->cb_columns[i + 1] actually reads cbp->cb_colwidths[0], which means the array subscript is above the array bounds.

  Luckily cbp->cb_colwidths[0] is always 0, and it seems we haven't looped enough times to exceed the array bounds so far, but it is a hidden risk that could surface someday.

  Signed-off-by: GeLiXin <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #5003
* Linux compat: Grsecurity kernel (Gvozden Neskovic, 2016-08-22, 6 files, -13/+84)
| | | | | | | | | | | API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Jason Zaman <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4997 Closes #5001
* OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object (Matthew Ahrens, 2016-08-19, 8 files, -19/+146)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7004 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109 Closes #4641 Closes #4972
* OpenZFS 7003 - zap_lockdir() should tag hold (Matthew Ahrens, 2016-08-19, 5 files, -106/+154)
| | | | | | | | | | | | | | | | zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Pavel Zakharov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7003 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108 Closes #4972
* Fix spa config generate memory leak in spa_load_best function (heary-cao, 2016-08-19, 1 file, -0/+2)
| | | | | | | | | | When spa retry load succeeds and spa recovery is requested it may leak in spa_load_best function. Always free the generated config when it is not assigned to the spa. Signed-off-by: cao.xuewen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4940
* Update zfs_create_(009,010)_neg.ksh (ChaoyuZhang, 2016-08-18, 2 files, -2/+2)
| | | | | | | | | Just cleanup the new fs created during the test, so the "$found" should be "true". Signed-off-by: ChaoyuZhang <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4978
* OpenZFS 7176 - Yet another hole birth issue (Paul Dagnelie, 2016-08-18, 2 files, -6/+32)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is another bug in the long line of hole-birth related issues. In this particular case, it was discovered that a previous hole-birth fix (illumos bug 6513, commit bc77ba73) did not cover as many cases as we thought it did. While the issue worked in the case of hole-punching (writing zeroes to a large part of a file), it did not deal with truncation, and then writing beyond the new end of the file. The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens [email protected] Reviewed by: George Wilson [email protected] Ported-by: kernelOfTruth <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7176 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad Upstream-bugs: DLPX-46009 Porting notes: - Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ; keep previous position of the initialization
* Fix do_link portion of ctime test (Nikolay Borisov, 2016-08-16, 1 file, -1/+3)
| | | | | | | | | | | | | | | | | | From the man page of dirname: " Both dirname() and basename() may modify the contents of path, so it may be desirable to pass a copy when calling one of these functions." And in fact on linux using dirname actually changes the contents of the passed parameter as evident from the following failure when running the ctime test: link(/root/zfs-mount, /root/zfs-mount/link_file) Fix this by creating a copy of the input parameter and passing that to dirname, thus not compromising the original parameter, allowing the creation of hard link to succeed. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4977
* It is not necessary to zero struct dbuf_hold_impl_data (Matthew Ahrens, 2016-08-16, 1 file, -1/+9)
| | | | | | | | | | | | | | | | | | | | | Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to `kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`, which is around 2KiB. This structure is used as a stack, to limit the size of the C stack as dbuf_hold() calls itself recursively. We make a recursive call to hold the parent's dbuf when the requested dbuf is not found. The vast majority of the time, the parent or grandparent indirect dbuf is cached, so the number of recursive calls is very low. However, we initialize this entire array for every call to dbuf_hold(). To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()` instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all members of the struct before they are used. I observed ~5% performance improvement on a workload which creates many files. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4974
* zdb: fencepost error at zdb_cb.zcb_embedded_histogram[][] (Gvozden Neskovic, 2016-08-16, 1 file, -1/+1)
| | | | | | | | | | | | Erroneous access detected by gcc UndefinedBehaviorSanitizer: `zdb.c:2424:7: runtime error: index 112 out of bounds for type 'uint64_t [112]'` Fix: increase histogram size by 1 to accommodate all possible sizes. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4934 Issue #4883
* Rework of fletcher_4 module (Gvozden Neskovic, 2016-08-16, 7 files, -178/+372)

  - Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function.

  - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced.

  - Implementation mutex lock is replaced by atomic variable.

  - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'.

  - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B).

  - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer.

  - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running.

  Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`:

      implementation   native         byteswap
      scalar           4768120823     3426105750
      sse2             7947841777     4318964249
      ssse3            7951922722     6112191941
      avx2             13269714358    11043200912
      fastest          avx2           avx2

  Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`:

      implementation   native         byteswap
      scalar           1291115967     1031555336
      sse2             2539571138     1280970926
      ssse3            2537778746     1080016762
      avx2             4950749767     1078493449
      avx512f          9581379998     4010029046
      fastest          avx512f        avx512f

  Signed-off-by: Gvozden Neskovic <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Closes #4952
* Fletcher4 implementation using avx512f instruction set (Gvozden Neskovic, 2016-08-16, 6 files, -10/+182)

  Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop iteration. Size alignment of main fletcher4 methods is adjusted accordingly. New implementation is called 'avx512f'.

  Note: byteswap method can be implemented more efficiently when avx512bw hardware becomes available. Currently, it is ~ 2x slower than native method.

  Table shows result of full (native) fletcher4 calculation for different buffer size:

      fletcher4   4KB    16KB   64KB   128KB  256KB  1MB    16MB
      --------------------------------------------------------------------
      [scalar]    1213   1228   1231   1231   1225   1200   1160
      [sse2]      2374   2442   2459   2456   2462   2250   2220
      [avx2]      4288   4753   4871   4893   4900   4050   3882
      [avx512f]   5975   8445   9196   9221   9262   6307   5620

  Signed-off-by: Gvozden Neskovic <[email protected]>
  Signed-off-by: Brian Behlendorf <[email protected]>
  Issue #4952