path: root/src/lib
Commit message | Author | Age | Files | Lines
* Remove redundant checkJack Lloyd2017-10-201-3/+0
| | | | | | CBC mode already has this same size check. [ci skip]
* Add GHASH using SSSE3Jack Lloyd2017-10-204-2/+105
| | | | About 30% faster than scalar on Skylake
* Use base CBC modes to implement TLS CBC ciphersuitesJack Lloyd2017-10-193-49/+36
| | | | | This reduces the amount of code and also lets TLS make use of parallel decryption, which it was not doing before.
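For context, CBC decryption parallelizes because each plaintext block is P[i] = D(C[i]) ^ C[i-1]: the block cipher calls are all independent and can be batched. A minimal sketch of the idea, assuming a hypothetical decrypt_blocks() that decrypts n independent 16-byte blocks at once (not the actual library API):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical batched primitive: decrypts n independent 16-byte blocks.
    void decrypt_blocks(uint8_t* buf, size_t n);

    // In-place CBC decryption: one batched call, then the XOR chaining.
    void cbc_decrypt(uint8_t* buf, size_t n_blocks, const uint8_t iv[16])
       {
       const std::vector<uint8_t> prev_ct(buf, buf + 16 * n_blocks); // saved ciphertext
       decrypt_blocks(buf, n_blocks); // every block decrypted independently
       for(size_t i = 0; i != n_blocks; ++i)
          {
          const uint8_t* prior = (i == 0) ? iv : &prev_ct[16 * (i - 1)];
          for(size_t j = 0; j != 16; ++j)
             buf[16 * i + j] ^= prior[j];
          }
       }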
* Remove unused variableJack Lloyd2017-10-191-1/+1
|
* Undeprecate these exceptionsJack Lloyd2017-10-192-5/+7
| | | | Cannot figure out how to get MSVC to shut up
* Another attempt at silencing MSVC warningJack Lloyd2017-10-192-6/+2
|
* Appease SonarJack Lloyd2017-10-191-1/+1
|
* Add a destructor to Policy_ViolationJack Lloyd2017-10-191-3/+4
| | | | | MSVC produces a deranged warning that the compiler-generated destructor is deprecated; try to shut it up.
* Merge GH #1262 GCM and CTR optimizationsJack Lloyd2017-10-1913-436/+789
|\
| * PMULL optimizationsJack Lloyd2017-10-183-61/+192
| |
| * Further optimizations, and split out GHASH reduction codeJack Lloyd2017-10-183-87/+57
| |
| * GCM and CTR optimizationsJack Lloyd2017-10-1811-372/+624
| | | | | | | | | | | | | | | | | | | | | | In CTR, add special cases for counter widths of particular interest. In GHASH, use a 4x reduction technique suggested by Intel. Split out GHASH to its own source file and header. With these changes GCM is over twice as fast on Skylake and about 50% faster on Westmere.
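One of the special cases, sketched under the assumption of a 32-bit big-endian counter occupying the last four bytes of a 16-byte block (a common configuration): the increment can update one word directly instead of running the generic byte-wise carry loop.

    #include <cstdint>

    // Increment a big-endian 32-bit counter held in bytes 12..15 of the block.
    inline void ctr_inc32(uint8_t block[16])
       {
       uint32_t c = (uint32_t(block[12]) << 24) | (uint32_t(block[13]) << 16) |
                    (uint32_t(block[14]) <<  8) |  uint32_t(block[15]);
       c += 1;
       block[12] = uint8_t(c >> 24);
       block[13] = uint8_t(c >> 16);
       block[14] = uint8_t(c >>  8);
       block[15] = uint8_t(c);
       }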
* | Use conditional include in semaphore.hSimon Warta2017-10-191-1/+1
|/
* Correct usage of std::aligned_storageJack Lloyd2017-10-151-6/+6
| | | | This ended up allocating 256 KiB!
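For reference, correct usage looks like the following (the size and alignment here are only illustrative): the template parameters are byte size and alignment, and the actual storage type is the nested ::type, not the trait itself.

    #include <type_traits>

    // 1 KiB of raw storage aligned to 64 bytes.
    static std::aligned_storage<1024, 64>::type table_storage;

    static_assert(sizeof(table_storage) >= 1024, "size as intended");
    static_assert(alignof(decltype(table_storage)) >= 64, "alignment as intended");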
* Additional final annotationsJack Lloyd2017-10-1519-27/+26
|
* GMAC optimizationJack Lloyd2017-10-152-21/+32
| | | | | Avoid copying inputs needlessly; on Skylake this doubles performance (from 1 GB/s to 2 GB/s)
* Merge GH #1257 Use std::aligned_storage for AES T-tableJack Lloyd2017-10-151-32/+56
|\
| * Use overaligned storage for AES T-TableJack Lloyd2017-10-141-32/+56
| | | | | | | | | | This improves performance by ~0.5 cycles/byte. It also ensures that our cache-reading countermeasure works as expected.
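A sketch of why the alignment matters (assuming 64-byte cache lines and a 256-entry uint32_t table): when the table starts on a cache line boundary it spans exactly 16 lines, so touching one entry per line makes the whole table resident before any secret-dependent lookup.

    #include <cstddef>
    #include <cstdint>

    // Touch one entry per 64-byte cache line of the T-table.
    inline void touch_table(const uint32_t (&T)[256])
       {
       uint32_t acc = 0;
       for(size_t i = 0; i != 256; i += 64 / sizeof(uint32_t))
          acc |= T[i];
       volatile uint32_t sink = acc; // keep the loads from being optimized out
       (void)sink;
       }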
* | Merge GH #1255 Use a single T-table in AESJack Lloyd2017-10-151-127/+78
|\|
| * Reduce AES to using a single T-tableJack Lloyd2017-10-131-127/+78
| | | | | | | | | | | | | | | | | | Should have significantly better cache characteristics, though it would be nice to verify this. It reduces performance somewhat, but less than I expected, at least on Skylake. I need to check this across more platforms to make sure it won't hurt too badly.
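The trick, sketched (TE0 stands in for the remaining table; the rotation direction depends on the byte-order convention): the classical TE1..TE3 tables are byte rotations of TE0, so they can be recovered at lookup time.

    #include <cstdint>

    extern const uint32_t TE0[256]; // the single T-table

    inline uint32_t rotr8(uint32_t x) { return (x >> 8) | (x << 24); }

    // One output column: classically TE0[a] ^ TE1[b] ^ TE2[c] ^ TE3[d], but
    // since TE1[x] == rotr8(TE0[x]) and so on, one table plus rotates suffices.
    inline uint32_t aes_table_column(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
       {
       return TE0[a]
            ^ rotr8(TE0[b])
            ^ rotr8(rotr8(TE0[c]))
            ^ rotr8(rotr8(rotr8(TE0[d])));
       }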
* | De-inline bodies of exception classesJack Lloyd2017-10-153-67/+133
|/ | | | | | | | | This leads to a rather shocking decrease in binary sizes, especially the static library (~1.5 MB reduction). Saves 60KB in the shared lib. Since throwing or catching an exception is relatively expensive, these not being inlined is not a problem in that sense. It had simply not occurred to me that it would take up so much extra space in the binary.
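The shape of the change, as a generic sketch (Example_Error is illustrative, not the library's class): declare the special members in the header and define them in one source file, so the vtable and associated code are emitted once instead of in every translation unit that throws.

    #include <stdexcept>
    #include <string>

    // header: declarations only
    class Example_Error : public std::runtime_error
       {
       public:
          explicit Example_Error(const std::string& msg);
          ~Example_Error() override; // out-of-line key function anchors the vtable
       };

    // one source file: definitions emitted exactly once
    Example_Error::Example_Error(const std::string& msg) : std::runtime_error(msg) {}
    Example_Error::~Example_Error() = default;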
* Optimizations for SM4Jack Lloyd2017-10-131-35/+94
| | | | | | | | | Using a larger table helps quite a bit. Using 4 tables (à la AES T-tables) didn't seem to help much at all; it's only slightly faster than a single table with rotations. Continue to use the 8-bit table in the first and last rounds as a countermeasure against cache attacks.
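The rotation idea works for SM4 because its linear transform L is a XOR of rotations and therefore commutes with word rotation: one 256-entry table T0[x] = L(S(x) << 24) covers all four byte positions. A sketch, with SM4_S assumed to be the standard S-box (the real first/last rounds would keep the plain 8-bit lookups):

    #include <cstdint>

    extern const uint8_t SM4_S[256]; // assumed: the SM4 S-box

    inline uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

    inline uint32_t sm4_L(uint32_t b)
       { return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24); }

    // T(w) via a single table, where T0[x] = sm4_L(uint32_t(SM4_S[x]) << 24)
    inline uint32_t sm4_T(const uint32_t T0[256], uint32_t w)
       {
       return        T0[(w >> 24) & 0xFF]
         ^ rotl32(T0[(w >> 16) & 0xFF], 24)
         ^ rotl32(T0[(w >>  8) & 0xFF], 16)
         ^ rotl32(T0[ w        & 0xFF],  8);
       }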
* Accept SHA-1, SHA1, or SHA-160 equallyJack Lloyd2017-10-133-3/+3
| | | | | | Fixes #1235 [ci skip]
* Further GCM optimizationsJack Lloyd2017-10-131-17/+27
| | | | Went from 27 to 20 cycles per byte on Skylake (with clmul disabled)
* Merge GH #1253 GCM optimizationsJack Lloyd2017-10-138-174/+242
|\
| * Optimize GCMJack Lloyd2017-10-138-174/+242
| | | | | | | | | | | | | | | | | | | | Allow multiple blocks for clmul; slight speedup there, though still far behind the optimum. Precompute a table of multiples of H; 3-4x faster on systems without clmul (and still no secret indexes). Refactor GMAC to not derive from GHASH
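The "no secret indexes" property, illustrated with a toy GF(2^64) version (GHASH itself works in GF(2^128) with its own bit ordering, omitted here): the table of multiples H·x^i is computed once per key and is indexed only by bit position, with selection done by masking rather than by secret-dependent loads or branches.

    #include <cstddef>
    #include <cstdint>

    // Toy field: GF(2^64) with reduction polynomial x^64 + x^4 + x^3 + x + 1.
    inline uint64_t gf_xtime(uint64_t v)
       {
       const uint64_t carry = v >> 63;
       return (v << 1) ^ (0x1B & (uint64_t(0) - carry));
       }

    struct GFTable { uint64_t M[64]; }; // M[i] = H * x^i

    inline void gf_precompute(GFTable& t, uint64_t H)
       {
       for(size_t i = 0; i != 64; ++i)
          { t.M[i] = H; H = gf_xtime(H); }
       }

    // Constant-time multiply by H: table index is the loop counter, never data.
    inline uint64_t gf_mul(const GFTable& t, uint64_t x)
       {
       uint64_t r = 0;
       for(size_t i = 0; i != 64; ++i)
          r ^= t.M[i] & (uint64_t(0) - ((x >> i) & 1));
       return r;
       }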
* | Merge GH #1254 Add missing includeJack Lloyd2017-10-131-0/+1
|\ \
| * | Add limits.h header for INT_MAXAlon Bar-Lev2017-10-131-0/+1
| |/ | | | | | | | | Gentoo-Bug: https://bugs.gentoo.org/633468 Signed-off-by: Alon Bar-Lev <[email protected]>
* / Use memcpy trick in 3-arg xor_buf alsoJack Lloyd2017-10-131-23/+17
|/
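The trick in question, as a sketch: memcpy into and out of a local word instead of casting byte pointers to uint64_t*, which is undefined behavior when the pointers are not suitably aligned. Compilers lower these fixed-size memcpys to plain loads and stores.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    inline void xor_buf3(uint8_t out[], const uint8_t in[],
                         const uint8_t in2[], size_t len)
       {
       while(len >= 8)
          {
          uint64_t a, b;
          std::memcpy(&a, in, 8);
          std::memcpy(&b, in2, 8);
          a ^= b;
          std::memcpy(out, &a, 8);
          in += 8; in2 += 8; out += 8; len -= 8;
          }
       for(size_t i = 0; i != len; ++i)
          out[i] = in[i] ^ in2[i];
       }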
* OCB optimizationsJack Lloyd2017-10-132-58/+90
| | | | | | With fast AES-NI, this gets down to about 2 cycles per byte, which is pretty good compared to the ~5.5 cpb of release 2.3, but still a long way off the best stitched implementations, which run at ~0.6 cpb.
* Somewhat faster xor_bufJack Lloyd2017-10-121-18/+15
| | | | Avoids the cast alignment problems of yesteryear
* Remove needless mutableJack Lloyd2017-10-121-2/+2
| | | | [ci skip]
* Swapped encrypt and decrypt in BlockCipher _xex functionsJack Lloyd2017-10-121-2/+2
| | | | | This was missed by everything except the wide-block OCB tests, because most ciphers have a fixed width and get the override.
* Interleave SM3 message expansionJack Lloyd2017-10-121-141/+142
| | | | Reduces stack usage and a bit faster
* Use SIMD in ThreefishJack Lloyd2017-10-121-2/+2
| | | | GCC 7 can actually vectorize this for AVX2
* OCB optimizationsJack Lloyd2017-10-127-124/+163
| | | | From ~5 cpb to ~2.5 cpb on Skylake
* Merge GH #1247 Improve bit rotation functionsJack Lloyd2017-10-1235-644/+724
|\
| * Ugh, the GCC/Clang trick triggers C4146 under MSVCJack Lloyd2017-10-121-8/+25
| | | | | | | | | | | | And rotate.h is a visible header. Blerg. Inline asm it is.
| * Add compile-time rotation functionsJack Lloyd2017-10-1235-660/+701
| | | | | | | | | | | | | | | | | The problem with asm rol/ror is the compiler can't schedule effectively. But we only need asm in the case when the rotation is variable, so distinguish the two cases. If a compile-time constant, then static_assert that the rotation is in the correct range and do the straightforward expression, knowing the compiler will probably do the right thing. Otherwise do a tricky expression that both GCC and Clang happen to recognize. Avoid the reduction case; instead require that the rotation be in range (this reverts 2b37c13dcf). Remove the asm rotations (making this branch ill-named), because now both Clang and GCC will create a roll without any extra help. Remove the reduction/mask by the word size for the variable case. The compiler can't optimize it out well, but it's easy to ensure it is valid in the callers, especially now that the variable input cases are easy to grep for.
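A minimal sketch of the scheme described above (not the exact library code): the compile-time variant range-checks via static_assert and uses the plain expression; the variable variant guards the zero case so the complementary shift is never undefined, in a shape both GCC and Clang compile to a single rotate instruction.

    #include <cstddef>

    // Rotation count known at compile time: range-checked, plain expression.
    template<size_t ROT, typename T>
    constexpr T rotl(T x)
       {
       static_assert(ROT > 0 && ROT < 8 * sizeof(T), "rotation out of range");
       return static_cast<T>((x << ROT) | (x >> (8 * sizeof(T) - ROT)));
       }

    // Variable rotation count: the ternary avoids UB when rot == 0.
    template<typename T>
    inline T rotl_var(T x, size_t rot)
       {
       return rot ? static_cast<T>((x << rot) | (x >> (8 * sizeof(T) - rot))) : x;
       }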
| * Use rol/ror x86 instructions on GCC/ClangJack Lloyd2017-10-111-2/+24
| | | | | | | | | | | | | | Neither is very good at recognizing rotate sequences. For cases where the rotation value is a constant they do fine, but for variable rotations they do horribly. Using inline asm here improved performance of both CAST-128 and CAST-256 by ~20% on my system with both GCC and Clang.
* | Avoid std::count to skip a signed overflow warningJack Lloyd2017-10-122-3/+13
| | | | | | | | | | | | Couldn't figure out a way to silence this otherwise. Deprecate replace_char, erase_chars, replace_chars
* | Merge GH #1245 Restructure Barrier/Semaphore to avoid signed overflow warningsJack Lloyd2017-10-122-11/+9
|\ \ | |/ |/|
| * #1220 - fixed the earlier integer overflow fixesHubert Bugaj2017-10-102-7/+3
| |
| * #1220 - fixed signed overflow warningsHubert Bugaj2017-10-092-10/+12
| |
* | Merge GH #1248 Unroll SM3 compression loopJack Lloyd2017-10-111-56/+94
|\ \
| * | Unroll SM3 compression functionJack Lloyd2017-10-101-56/+94
| | |
* | | Helpful commentJack Lloyd2017-10-111-1/+2
| | |
* | | Remove SSE2 bswap_4Jack Lloyd2017-10-111-24/+0
| | | | | | | | | | | It was disabled anyway (bad macro check), and with recent GCC it turned out to be slower than just using bswap.
* | | Optimize CFB modeJack Lloyd2017-10-112-39/+97
| | | | | | | Still slower than it should be, but notably faster than before, at least with AES-NI
* | | Add missing headerJack Lloyd2017-10-111-0/+1
| | | | | | | | | | | | Error under filesystem-free builds