openzfs/zfs.git - Unnamed repository; edit this file 'description' to name the repository.

	Commit message (Collapse)	Author	Age	Files	Lines
*	OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R	Tony Hutter	2016-10-03	12	-12/+520
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported by: Tony Hutter <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Porting Notes: This code is ported on top of the Illumos Crypto Framework code: https://github.com/zfsonlinux/zfs/pull/4329/commits/b5e030c8dbb9cd393d313571dee4756fbba8c22d The list of porting changes includes: - Copied module/icp/include/sha2/sha2.h directly from illumos - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init) - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter. - Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c - Added skein & edonr to libicp/Makefile.am - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos. - Updated ztest.c with new fletcher_4_() args; used NULL for new CTX argument. - In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section to not #include the non-existant endian.h. - In skein_test.c, renane NULL to 0 in "no test vector" array entries to get around a compiler warning. - Fixup test files: - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>, - Remove <note.h> and define NOTE() as NOP. - Define u_longlong_t - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p" - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Remove "for isa in $($ISAINFO); do" stuff - Add/update Makefiles - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h. - EXPORT_SYMBOL _Init/_Update/_Final... routines in ICP modules. - Update scripts/zfs2zol-patch.sed - include <sys/sha2.h> in sha2_impl.h - Add sha2.h to include/sys/Makefile.am - Add skein and edonr dirs to icp Makefile - Add new checksums to zpool_get.cfg - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable() - Fix -Wuninitialized error in edonr_byteorder.h on PPC - Fix stack frame size errors on ARM32 - Don't unroll loops in Skein on 32-bit to save stack space - Add memory barriers in sha2.c on 32-bit to save stack space - Add filetest_001_pos.ksh checksum sanity test - Add option to write psudorandom data in file_write utility
*	Add parity generation/rebuild using 128-bits NEON for Aarch64	Romain Dolbeau	2016-10-03	3	-0/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This re-use the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitly disable vector registers, they have to be locally re-enabled explicitly. As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique. Even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fails in debug mode. This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro. It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge. The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unroll some functions some more. Reviewed-by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Romain Dolbeau <[email protected]> Closes #4801
*	fix: Shift exponent too large	Gvozden Neskovic	2016-09-29	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	Undefined operation is reported by running ztest (or zloop) compiled with GCC UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection with large enough offset values. Logically, left shift operation would work, but bit shift semantics in C, and limitation of uint64_t, do not produce desired result. Issue #5059, #4883 Signed-off-by: Gvozden Neskovic <[email protected]>
*	OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream ↵	kernelOfTruth aka. kOT, Gentoo user	2016-09-22	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	includes BEGIN and END records Authored by: Matt Krantz <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: kernelOfTruth <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7230 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2 Closes #5112
*	Fix coverity defects	cao	2016-09-20	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix coverity defects: coverity scan CID:147623, Type: Resource leak. coverity scan CID:147622, Type: Resource leak. reason: zpool_open zhp, but not zpool_close zhp. so resource leak. coverity scan CID:147621, Type: Resource fd leak. coverity scan CID:147620, Type: Resource fd leak. reason: do_write do_read open file fd,but exception not close fd. delete unuse definition DMU_OS_IS_L2COMPRESSIBLE. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: cao.xuewen <[email protected]> Closes #5137
*	DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone()	Dan Kimmel	2016-09-13	1	-1/+1
\| \| \| \| \| \| \| \|	Authored by: Dan Kimmel <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> Issue #5078
*	Remove lint suppression from dmu.h and unnecessary dmu.h include in spa.h	Dan Kimmel	2016-09-13	2	-9/+2
\| \| \| \| \| \| \| \|	Authored by: Dan Kimmel <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> Issue #5078
*	DLPX-40252 integrate EP-476 compressed zfs send/receive	Dan Kimmel	2016-09-13	11	-43/+77
\| \| \| \| \| \| \| \|	Authored by: Dan Kimmel <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> Issue #5078
*	OpenZFS 6950 - ARC should cache compressed data	George Wilson	2016-09-13	9	-72/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Authored by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Tom Caputi <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Ported by: David Quigley <[email protected]> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently access data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is inuse it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled that the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078
*	Bring over illumos ZFS FMA logic -- phase 1	Don Brady	2016-09-01	5	-2/+273
\| \| \| \| \| \| \| \| \| \| \| \| \|	This first phase brings over the ZFS SLM module, zfs_mod.c, to handle auto operations in response to disk events. Disk event monitoring is provided from libudev and generates the expected payload schema for zfs_mod. This work leverages the recently added devid and phys_path strings in the vdev label. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #4673
*	Delete unreferenced function zfs_ereport_send_interim_checksum	luozhengzheng	2016-09-01	1	-1/+0
\| \| \| \| \| \|	Signed-off-by: luozhengzheng <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5055
*	Performance optimization of AVL tree comparator functions	Gvozden Neskovic	2016-08-31	2	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Richard Elling <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #5033
*	Delete unused zfsctl_snapdir_inactive declaration	cao	2016-08-30	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \|	zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7 this is declaration remains even though the implementation was removed in commit 278bee93. Removed fastreboot_disable_highpil which is also unused. Signed-off-by: caoxuewen [email protected] Signed-off-by: Brian Behlendorf <[email protected]> Closes #5042
*	OpenZFS 6322 - ZFS indirect block predictive prefetch	Alexander Motin	2016-08-30	2	-1/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For quite some time I was thinking about possibility to prefetch ZFS indirection tables while doing sequential reads or writes. Recent changes in predictive prefetcher made that much easier to do. My tests on zvol with 16KB block size on 5x striped and 2x mirrored pool of 10 disks show almost double throughput on sequential read, and almost tripple on sequential rewrite. While for read alike effect can be received from increasing maximal prefetch distance (though at higher memory cost), for rewrite there is no other solution so far. Authored by: Alexander Motin <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: kernelOfTruth [email protected] Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6322 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413 Closes #5040 Porting notes: - Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due to commit 5f6d0b6 'Handle block pointers with a corrupt logical size' - Difference from upstream in module/zfs/dmu_zfetch.c, uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance - Variables have been initialized at the beginning of the function (void dmu_zfetch) to resemble the order of occurrence and account for C99, C11 mode errors.
*	Linux compat: Grsecurity kernel	Gvozden Neskovic	2016-08-22	2	-1/+41
\| \| \| \| \| \| \| \| \| \| \|	API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Jason Zaman <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4997 Closes #5001
*	OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object	Matthew Ahrens	2016-08-19	3	-8/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7004 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109 Closes #4641 Closes #4972
*	OpenZFS 7003 - zap_lockdir() should tag hold	Matthew Ahrens	2016-08-19	2	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <[email protected]> Signed-off-by: Pavel Zakharov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7003 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108 Closes #4972
*	OpenZFS 7176 - Yet another hole birth issue	Paul Dagnelie	2016-08-18	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is another bug in the long line of hole-birth related issues. In this particular case, it was discovered that a previous hole-birth fix (illumos bug 6513, commit bc77ba73) did not cover as many cases as we thought it did. While the issue worked in the case of hole-punching (writing zeroes to a large part of a file), it did not deal with truncation, and then writing beyond the new end of the file. The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens [email protected] Reviewed by: George Wilson [email protected] Ported-by: kernelOfTruth <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/7176 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad Upstream-bugs: DLPX-46009 Porting notes: - Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ; keep previous position of the initialization
*	Rework of fletcher_4 module	Gvozden Neskovic	2016-08-16	1	-5/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function. - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced. - Implementation mutex lock is replaced by atomic variable. - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'. - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B). - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer. - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running. Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`: implementation native byteswap scalar 4768120823 3426105750 sse2 7947841777 4318964249 ssse3 7951922722 6112191941 avx2 13269714358 11043200912 fastest avx2 avx2 Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`: implementation native byteswap scalar 1291115967 1031555336 sse2 2539571138 1280970926 ssse3 2537778746 1080016762 avx2 4950749767 1078493449 avx512f 9581379998 4010029046 fastest avx512f avx512f Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4952
*	Fletcher4 implementation using avx512f instruction set	Gvozden Neskovic	2016-08-16	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop iteration. Size alignment of main fletcher4 methods is adjusted accordingly. New implementation is called 'avx512f'. Note: byteswap method can be implemented more efficiently when avx512bw hardware becomes available. Currently, it is ~ 2x slower than native method. Table shows result of full (native) fletcher4 calculation for different buffer size: fletcher4 4KB 16KB 64KB 128KB 256KB 1MB 16MB -------------------------------------------------------------------- [scalar] 1213 1228 1231 1231 1225 1200 1160 [sse2] 2374 2442 2459 2456 2462 2250 2220 [avx2] 4288 4753 4871 4893 4900 4050 3882 [avx512f] 5975 8445 9196 9221 9262 6307 5620 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4952
*	Add support for AVX-512 family of instruction sets	Gvozden Neskovic	2016-08-16	1	-12/+233
\| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds compiler and runtime tests (user and kernel) for following instruction sets: avx512f, avx512cd, avx512er, avx512pf, avx512bw, avx512dq, avx512vl, avx512ifma, avx512vbmi. note: Linux support for AVX-512F (Foundation) instruction set started with linux v3.15 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4952
*	OpenZFS 5997 - FRU field not set during pool creation and never updated	Hans Rosenfeld	2016-08-12	8	-19/+184
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Hans Rosenfeld <[email protected]> Reviewed by: Dan Fields <[email protected]> Reviewed by: Josef Sipek <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Signed-off-by: Don Brady <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/5997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283 Porting Notes: In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS. Events which would be logged as sysevents on illumos have been been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass. The scripts provided by ZoL have been updated, however users which provide scripts for any of the following events will need to rename them based on the new subclass names. ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
*	Fix a typo in ZIL write handling comment	luozhengzheng	2016-08-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following comment in zil.h * WR_COPIED: * If we know we'll immediately be committing the * transaction (FSYNC or FDSYNC), then we allocate a larger * log record here for the data and copy the data in. The word "the" should be "then". Signed-off-by: luozhengzheng <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4961
*	Reorder HAVE_BIO_RW_* checks	Brian Behlendorf	2016-08-12	1	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \|	The HAVE_BIO_RW_* #ifdef's must appear before REQ_* #ifdef's in the bio_is_flush() and bio_is_discard() macros. Linux 2.6.32 era kernels defined both of values and the HAVE_BIO_RW_* must be used in this case. This resulted in a panic in zconfig test 5. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4951 Closes #4959
*	Use file_dentry and file_inode wrappers	Chen Haiquan	2016-08-11	2	-2/+14
\| \| \| \| \| \| \| \| \| \| \| \|	Fix bugs due to kernel change in torvalds/linux@4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay"). This problem crashes system when use zfs as a layer of overlayfs. Signed-off-by: Chen Haiquan <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4914 Closes #4935
*	Fix indefinite article	GeLiXin	2016-08-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	The indefinite article before nvlist should be "an", not "a". We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should stay the same as we are such a strict filesystem. Signed-off-by: GeLiXin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4941
*	Remove custom root pool import code	Brian Behlendorf	2016-08-11	1	-6/+0
\| \| \| \| \| \| \| \| \| \|	Non-Linux OpenZFS implementations require additional support to be used a root pool. This code should simply be removed to avoid confusion and improve readability. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4951
*	Linux 4.8 compat: Fix removal of bio->bi_rw member	Brian Behlendorf	2016-08-11	1	-56/+109
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	All users of bio->bi_rw have been replaced with compatibility wrappers. This allows the kernel specific logic to be abstracted away, and for each of the supported cases to be documented with the wrapper. The updated interfaces are as follows: * void blk_queue_set_write_cache(struct request_queue , bool, bool) boolean_t bio_is_flush(struct bio ) boolean_t bio_is_fua(struct bio ) boolean_t bio_is_discard(struct bio ) boolean_t bio_is_secure_erase(struct bio ) VDEV_WRITE_FLUSH_FUA Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4951
*	Linux 4.8 compat: posix_acl_valid()	Brian Behlendorf	2016-08-08	1	-0/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The posix_acl_valid() function has been updated to require a user namespace. Filesystem callers should normally provide the user_ns from the super block associcated with the ACL; the zpl_posix_acl_valid() wrapper has been added for this purpose. See https://github.com/torvalds/linux/commit/0d4d717f for complete details. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4922
*	Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING	Brian Behlendorf	2016-08-08	1	-13/+0
\| \| \| \| \| \| \| \| \| \|	Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING configure checks, all supported kernel provide this functionality. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4922
*	Fix interaction between userns uid/gid and SA	Nikolay Borisov	2016-08-08	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* When the uid/gid change is handled in zfs_setattr we want to actually adjust the user passed uid to a KUID and write that to disk. * In trace points use the i_uid member without doing translation, since it has already been performed. * Use kuid in zfs_aclset_common Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4928
*	Linux 4.8 compat: new s_user_ns member of struct super_block	Nikolay Borisov	2016-08-08	1	-0/+17
\| \| \| \| \| \| \| \| \| \| \| \| \|	Kernel 4.8 paved the way to enabling mounting a file system inside a non-init user namespace. To facilitate this a s_user_ns member was added holding the userns in which the filesystem's instance was mounted. This enables doing the uid/gid translation relative to this particular username space and not the default init_user_ns. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4928
*	Linux 4.8 compat: REQ_OP and bio_set_op_attrs()	Chunwei Chen	2016-07-29	1	-5/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	New REQ_OP_* definitions have been introduced to separate the WRITE, READ, and DISCARD operations from the flags. This included changing the encoding of bi_rw. It places REQ_OP_* in high order bits and other stuff in low order bits. This encoding is done through the new helper function bio_set_op_attrs. For complete details refer to: https://github.com/torvalds/linux/commit/f215082 https://github.com/torvalds/linux/commit/4e1b2d5 Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4892 Closes #4899
*	Linux 4.8 compat: REQ_PREFLUSH	Brian Behlendorf	2016-07-29	1	-0/+12
\| \| \| \| \| \| \| \| \| \| \| \|	The REQ_FLUSH flag was renamed REQ_PREFLUSH to avoid confusion with REQ_OP_FLUSH. See https://github.com/torvalds/linux/commit/28a8f0d3 for complete details. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4892 Issue #4899
*	Limit the amount of dnode metadata in the ARC	Tim Chase	2016-07-25	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4345 Issue #4512 Issue #4773 Closes #4858
*	Fix sync behavior for disk vdevs	Tim Chase	2016-07-25	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Prior to b39c22b, which was first generally available in the 0.6.5 release as b39c22b, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af. The original rationale for introducing synchronous operations in b39c22b was to hurry certains requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and ise used for the same purpose. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4858
*	Remove znode's z_uid/z_gid member	Nikolay Borisov	2016-07-25	2	-17/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid\|gid)_(read\|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227
*	Check whether the kernel supports i_uid/gid_read/write helpers	Nikolay Borisov	2016-07-25	1	-0/+53
\| \| \| \| \| \| \| \| \| \| \| \| \|	Since the concept of a kuid and the need to translate from it to ordinary integer type was added in kernel version 3.5 implement necessary plumbing to be able to detect this condition during compile time. If the kernel doesn't support the kuid then just fall back to directly accessing the respective struct inode's members Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227
*	Illumos Crypto Port module added to enable native encryption in zfs	Tom Caputi	2016-07-20	6	-2/+1097
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written). Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4329
*	RAIDZ parity kstat rework	Gvozden Neskovic	2016-07-19	1	-2/+2
\| \| \| \| \| \| \| \| \|	Print table with speed of methods for each implementation. Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4860
*	Fixes and enhancements of SIMD raidz parity	Gvozden Neskovic	2016-07-19	2	-10/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4813 Issue #1392
*	Implementation of SSE optimized Fletcher-4	Tyler J. Stachecki	2016-07-15	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4789
*	Use native inode->i_nlink instead of znode->z_links	Chris Dunlop	2016-07-14	2	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4838 Issue #227
*	Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.	Gvozden Neskovic	2016-07-13	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4783
*	Kill zp->z_xattr_parent to prevent pinning	Chunwei Chen	2016-07-12	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts e89260a. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #4359 Issue #3508 Issue #4413 Issue #4827
*	OpenZFS 6314 - buffer overflow in dsl_dataset_name	Igor Kozhukhov	2016-06-28	9	-30/+21
\| \| \| \| \| \| \| \| \| \| \|	Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee
*	Implement zfs_ioc_recv_new() for OpenZFS 2605	Brian Behlendorf	2016-06-28	3	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. Signed-off-by: Brian Behlendorf <[email protected]>
*	OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS	Andrew Stormont	2016-06-28	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Andrew Stormont <[email protected]> Reviewed by: Anil Vijarnia <[email protected]> Reviewed by: Kim Shrier <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6536 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b
*	OpenZFS 6393 - zfs receive a full send as a clone	Paul Dagnelie	2016-06-28	2	-3/+12
\| \| \| \| \| \| \| \| \| \| \| \|	Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e
*	OpenZFS 6051 - lzc_receive: allow the caller to read the begin record	Brian Behlendorf	2016-06-28	1	-1/+6
\| \| \| \| \| \| \| \| \| \|	Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6051 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322