| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
| |
|
|
|
|
| |
This ended up allocating 256 KiB!
|
| |
|
|
|
|
|
| |
Avoid copying inputs needlessly, on Skylake doubles performance
(from 1 GB/s -> 2 GB/s)
|
|\ |
|
| |
| |
| |
| |
| | |
This improves performance by ~ .5 cycle/byte. Also it ensures that
our cache reading countermeasure works as expected.
|
|\| |
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Should have significantly better cache characteristics, though it
would be nice to verify this.
It reduces performance somewhat but less than I expected, at least
on Skylake. I need to check this across more platforms to make sure
it won't hurt too badly.
|
|/
|
|
|
|
|
|
|
| |
This leads to a rather shocking decrease in binary sizes, especially
the static library (~1.5 MB reduction). Saves 60KB in the shared lib.
Since throwing or catching an exception is relatively expensive, these
not being inlined is not a problem in that sense. It had simply not
occurred to me that it would take up so much extra space in the binary.
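A minimal sketch of the idea, assuming "these" refers to exception class constructors (the commit subject is not shown here, so this is a guess). The class name `Decoding_Failure_Sketch` is hypothetical and not taken from the library. The point is only that the constructor is declared in the header and defined once in a source file, so the string-building code is not duplicated at every throw site.

```cpp
#include <stdexcept>
#include <string>

// --- would live in the public header ---
// Hypothetical exception type. The constructor is only declared here,
// so call sites emit a plain function call instead of an inlined body.
class Decoding_Failure_Sketch : public std::runtime_error {
   public:
      explicit Decoding_Failure_Sketch(const std::string& msg);
};

// --- would live in a single .cpp file ---
// The message formatting is compiled exactly once.
Decoding_Failure_Sketch::Decoding_Failure_Sketch(const std::string& msg) :
   std::runtime_error("Decoding failed: " + msg) {}
```

Throwing is already a slow path, so the extra call is unlikely to matter for performance.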
|
|
|
|
|
|
|
|
|
| |
Using a larger table helps quite a bit. Using 4 tables (à la AES T-tables)
didn't seem to help much at all; it's only slightly faster than a single
table with rotations.
Continue to use the 8 bit table in the first and last rounds as a
countermeasure against cache attacks.
|
|
|
|
| |
[ci skip]
|
|
|
|
|
|
| |
Fixes #1235
[ci skip]
|
|
|
|
| |
Went from 27 to 20 cycles per byte on Skylake (with clmul disabled)
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
By allowing multiple blocks for clmul, slight speedup there though still
far behind optimum.
Precompute a table of multiples of H, 3-4x faster on systems without clmul
(and still no secret indexes).
Refactor GMAC to not derive from GHASH
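To illustrate the "table of multiples of H, no secret indexes" idea, here is a toy sketch in GF(2^8) with the AES polynomial rather than GCM's GF(2^128). The function names are hypothetical and this is not the library's code. The table holds H*x^i, and each entry is selected with a mask derived from one bit of the input, so the table is always read in the same order regardless of the data.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Precompute H * x^i mod (x^8 + x^4 + x^3 + x + 1) for i = 0..7.
inline std::array<std::uint8_t, 8> precompute_multiples(std::uint8_t h) {
   std::array<std::uint8_t, 8> table{};
   for(std::size_t i = 0; i != 8; ++i) {
      table[i] = h;
      // carry is 0xFF if the top bit of h is set, else 0x00 (no branch)
      const std::uint8_t carry = static_cast<std::uint8_t>(0 - (h >> 7));
      h = static_cast<std::uint8_t>((h << 1) ^ (carry & 0x1B));
   }
   return table;
}

// Multiply x by H using the table. Every entry is read on every call;
// bits of x only influence masks, never addresses or branches.
inline std::uint8_t multiply(const std::array<std::uint8_t, 8>& table, std::uint8_t x) {
   std::uint8_t z = 0;
   for(std::size_t i = 0; i != 8; ++i) {
      const std::uint8_t mask = static_cast<std::uint8_t>(0 - ((x >> i) & 1));
      z = static_cast<std::uint8_t>(z ^ (table[i] & mask));
   }
   return z;
}
```

The same shape scales to 128-bit elements by storing each multiple as two 64-bit words and iterating over 128 bits of input.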
|
|\ \ |
|
| |/
| |
| |
| |
| | |
Gentoo-Bug: https://bugs.gentoo.org/633468
Signed-off-by: Alon Bar-Lev <[email protected]>
|
|/ |
|
|
|
|
| |
[ci skip]
|
|
|
|
|
|
| |
With fast AES-NI, gets down to about 2 cycles per byte which is
pretty good compared to the ~5.5 cpb of 2.3, still a long way off
the best stitched impls which run at ~0.6 cpb.
|
|
|
|
| |
Avoids the cast alignment problems of yesteryear
|
|
|
|
| |
[ci skip]
|
|
|
|
|
| |
Missed by everything but the OCB wide tests because most ciphers
have fixed width and get the override.
|
| |
|
|
|
|
| |
Reduces stack usage and a bit faster
|
|
|
|
| |
GCC 7 can actually vectorize this for AVX2
|
|
|
|
| |
From ~5 cpb to ~2.5 cpb on Skylake
|
|\ |
|
| |
| |
| |
| |
| |
| | |
And rotate.h is a visible header.
Blerg. Inline asm it is.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The problem with asm rol/ror is the compiler can't schedule effectively.
But we only need asm in the case when the rotation is variable, so distinguish
the two cases. If a compile time constant, then static_assert that the rotation
is in the correct range and do the straightforward expression knowing the compiler
will probably do the right thing. Otherwise do a tricky expression that both
GCC and Clang happen to recognize. Avoid the reduction case; instead
require that the rotation be in range (this reverts 2b37c13dcf).
Remove the asm rotations (making this branch ill-named), because now both Clang
and GCC will create a roll without any extra help.
Remove the reduction/mask by the word size for the variable case. The compiler
can't optimize it out well, but it's easy to ensure it is valid in the callers,
especially now that the variable input cases are easy to grep for.
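The split described above might look roughly like the following sketch, fixed to 32-bit words for brevity. The names `rotl` and `rotl_var` and the exact expressions are assumptions for illustration, not a copy of rotate.h. The constant case checks the range at compile time; the variable case has no mask and relies on the caller passing a rotation below 32.

```cpp
#include <cstddef>
#include <cstdint>

// Compile-time constant rotation: the range is enforced by static_assert,
// and the plain shift/or expression is left for the compiler to recognize.
template <std::size_t ROT>
constexpr std::uint32_t rotl(std::uint32_t x) {
   static_assert(ROT > 0 && ROT < 32, "rotation must be in range");
   return (x << ROT) | (x >> (32 - ROT));
}

// Variable rotation: no reduction of rot by the word size.
// The caller must guarantee rot < 32; rot == 0 is handled explicitly
// so the right shift never uses a shift count of 32.
inline std::uint32_t rotl_var(std::uint32_t x, std::size_t rot) {
   return rot ? ((x << rot) | (x >> (32 - rot))) : x;
}
```

Having a distinct name for the variable case is what makes those call sites easy to grep for and audit.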
|
| |
| |
| |
| |
| |
| |
| | |
Neither is very good at recognizing rotate sequences. For cases where
the rotation value is a constant they do fine, but for variable rotations
they do horribly. Using inline asm here improved performance of both
CAST-128 and CAST-256 by ~20% on my system with both GCC and Clang.
|
|\ \ |
|
| | | |
|
| |/ |
|
| |
| |
| |
| |
| |
| | |
Couldn't figure out a way to silence this otherwise.
Deprecate replace_char, erase_chars, replace_chars
|
|\ \
| |/
|/| |
|
| | |
|
| | |
|
|\ \ |
|
| | | |
|
|\ \ \ |
|
| | | | |
|
| | | |
| | | |
| | | |
| | | | |
Not needed here
|
| | | | |
|
| | | | |
|
| | | |
| | | |
| | | |
| | | |
| | | | |
It was disabled anyway (bad macro check) and with recent GCC
turned out to be slower than just using bswap.
|
| | | |
| | | |
| | | |
| | | | |
Still slower but notably faster at least with AES-NI
|
| | | |
| | | |
| | | |
| | | | |
Error under filesystem-free builds
|
| | | | |
|