summaryrefslogtreecommitdiffstats
path: root/include/zfs_fletcher.h
Commit message (Collapse)AuthorAgeFilesLines
* DLPX-44812 integrate EP-220 large memory scalabilityDavid Quigley2016-11-291-4/+5
|
* Add superscalar fletcher4Romain Dolbeau2016-11-041-0/+7
| | | | | | | | | | | | This is the Fletcher4 algorithm implemented in pure C, but using multiple counters using algorithms identical to those used for SSE/NEON and AVX2. This allows for faster execution on core with strong superscalar capabilities but weak SIMD capabilities. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Romain Dolbeau <[email protected]> Closes #5317
* Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bitsRomain Dolbeau2016-10-211-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | This is not useful on micro-architecture with a weak NEON implementation (only 64 bits); the native version is slower & the byteswap barely faster than scalar. On A53 or A57, it's a small improvement on scalar but OK for byteswap. Results from an A53 system: 0 0 0x01 -1 0 1499068294333000 1499101101878000 implementation native byteswap scalar 1008227510 755880264 aarch64_neon 1198098720 1044818671 fastest aarch64_neon aarch64_neon Results from a A57 system: 0 0 0x01 -1 0 4407214734807033 4407233933777404 implementation native byteswap scalar 2302071241 1124873346 aarch64_neon 2542214946 2245570352 fastest aarch64_neon aarch64_neon Reviewed-by: Gvozden Neskovic <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Romain Dolbeau <[email protected]> Closes #5248
* Fletcher4: save/reload implementation contextGvozden Neskovic2016-10-051-3/+35
| | | | | | | | | | | | | Init, compute, and fini methods are changed to work on internal context object. This is necessary because ABI does not guarantee that SIMD registers will be preserved on function calls. This is technically the case in Linux kernel in between `kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that don't require disabling preemption for using SIMD (osx). Use scalar compute methods in-place for small buffers, and when the buffer size does not meet SIMD size alignment. Signed-off-by: Gvozden Neskovic <[email protected]>
* OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-RTony Hutter2016-10-031-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Saso Kiselkov <[email protected]> Reviewed by: Richard Lowe <[email protected]> Approved by: Garrett D'Amore <[email protected]> Ported by: Tony Hutter <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Porting Notes: This code is ported on top of the Illumos Crypto Framework code: https://github.com/zfsonlinux/zfs/pull/4329/commits/b5e030c8dbb9cd393d313571dee4756fbba8c22d The list of porting changes includes: - Copied module/icp/include/sha2/sha2.h directly from illumos - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init) - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter. - Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c - Added skein & edonr to libicp/Makefile.am - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos. - Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument. - In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section to not #include the non-existant endian.h. - In skein_test.c, renane NULL to 0 in "no test vector" array entries to get around a compiler warning. - Fixup test files: - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>, - Remove <note.h> and define NOTE() as NOP. - Define u_longlong_t - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p" - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Remove "for isa in $($ISAINFO); do" stuff - Add/update Makefiles - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h. - EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules. - Update scripts/zfs2zol-patch.sed - include <sys/sha2.h> in sha2_impl.h - Add sha2.h to include/sys/Makefile.am - Add skein and edonr dirs to icp Makefile - Add new checksums to zpool_get.cfg - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable() - Fix -Wuninitialized error in edonr_byteorder.h on PPC - Fix stack frame size errors on ARM32 - Don't unroll loops in Skein on 32-bit to save stack space - Add memory barriers in sha2.c on 32-bit to save stack space - Add filetest_001_pos.ksh checksum sanity test - Add option to write psudorandom data in file_write utility
* Rework of fletcher_4 moduleGvozden Neskovic2016-08-161-5/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function. - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced. - Implementation mutex lock is replaced by atomic variable. - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'. - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B). - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer. - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running. Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`: implementation native byteswap scalar 4768120823 3426105750 sse2 7947841777 4318964249 ssse3 7951922722 6112191941 avx2 13269714358 11043200912 fastest avx2 avx2 Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`: implementation native byteswap scalar 1291115967 1031555336 sse2 2539571138 1280970926 ssse3 2537778746 1080016762 avx2 4950749767 1078493449 avx512f 9581379998 4010029046 fastest avx512f avx512f Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4952
* Fletcher4 implementation using avx512f instruction setGvozden Neskovic2016-08-161-0/+4
| | | | | | | | | | | | | | | | | | | | | | Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop iteration. Size alignment of main fletcher4 methods is adjusted accordingly. New implementation is called 'avx512f'. Note: byteswap method can be implemented more efficiently when avx512bw hardware becomes available. Currently, it is ~ 2x slower than native method. Table shows result of full (native) fletcher4 calculation for different buffer size: fletcher4 4KB 16KB 64KB 128KB 256KB 1MB 16MB -------------------------------------------------------------------- [scalar] 1213 1228 1231 1231 1225 1200 1160 [sse2] 2374 2442 2459 2456 2462 2250 2220 [avx2] 4288 4753 4871 4893 4900 4050 3882 [avx512f] 5975 8445 9196 9221 9262 6307 5620 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4952
* Implementation of SSE optimized Fletcher-4Tyler J. Stachecki2016-07-151-0/+8
| | | | | | | | | | | | | | | | | | | | | | Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4789
* Implementation of AVX2 optimized Fletcher-4Jinshan Xiong2016-06-021-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | New functionality: - Preserves existing scalar implementation. - Adds AVX2 optimized Fletcher-4 computation. - Fastest routines selected on module load (benchmark). - Test case for Fletcher-4 added to ztest. New zcommon module parameters: - zfs_fletcher_4_impl (str): selects the implementation to use. "fastest" - use the fastest version available "cycle" - cycle trough all available impl for ztest "scalar" - use the original version "avx2" - new AVX2 implementation if available Performance comparison (Intel i7 CPU, 1MB data buffers): - Scalar: 4216 MB/s - AVX2: 14499 MB/s See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Jinshan Xiong <[email protected]> Signed-off-by: Andreas Dilger <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4330
* Support custom build directories and move includesBrian Behlendorf2010-09-081-0/+53
One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.