| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
be branch-free. This reduces performance noticeably on my Core2 (from
32 MiB/s to a bit over 27 MiB/s), but so it goes.
The IDEA implementation using SSE2 is already branch-free here, and
runs at about 135 MiB/s on my machine.
Also add more IDEA tests, generated by OpenSSL.
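For reference, a minimal sketch of one well-known branch-free formulation of the IDEA multiplication mod 2^16+1 (with 0 standing in for 2^16); the function name and exact details here are assumptions, not necessarily the code in this change:

    #include <stdint.h>

    // Branch-free multiplication mod 2^16+1 (0 represents 2^16)
    inline uint16_t idea_mul(uint16_t x, uint16_t y)
       {
       const uint32_t P = static_cast<uint32_t>(x) * y;

       // mask is 0xFFFF if P != 0 (both inputs nonzero), else 0
       const uint16_t mask = static_cast<uint16_t>(!P) - 1;

       const uint16_t P_hi = static_cast<uint16_t>(P >> 16);
       const uint16_t P_lo = static_cast<uint16_t>(P & 0xFFFF);

       const uint16_t r_mul  = (P_lo - P_hi) + (P_lo < P_hi); // both nonzero
       const uint16_t r_zero = 1 - x - y;                     // x or y was 0

       return (r_mul & mask) | (r_zero & ~mask);
       }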
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a second template param to SecureVector which specifies the initial
length.
Change all callers to be SecureVector instead of SecureBuffer.
This can go away in C++0x, once compilers implement N2712 ("Non-static
data member initializers"), and we can just write code as
SecureVector<byte> P{18};
instead.
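A rough, self-contained sketch of the shape of the change (simplified: the real classes derive from MemoryRegion and use a locking/zeroizing allocator, not std::vector):

    #include <cstddef>
    #include <vector>

    typedef unsigned char byte;

    // The second template parameter supplies the initial length, so
    // the separate fixed-size SecureBuffer type is no longer needed.
    template<typename T, std::size_t INITIAL_LEN = 0>
    class SecureVector
       {
       public:
          SecureVector() : buf(INITIAL_LEN) {}
          std::size_t size() const { return buf.size(); }
          void resize(std::size_t n) { buf.resize(n); }
       private:
          std::vector<T> buf; // stand-in for the secure storage
       };

    // Formerly: SecureBuffer<byte, 18> P;
    // Now:      SecureVector<byte, 18> P;
    // Under N2712, a plain member initializer SecureVector<byte> P{18};
    // would make the extra parameter unnecessary.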
|
|
|
|
|
|
|
|
| |
Default unless specified is now 4.
For SIMD code, use 2x the number of blocks that the cipher processes in
parallel using SIMD. It may make sense to increase this to 4x or even
more; further experimentation is necessary.
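A sketch of the convention, assuming this refers to BlockCipher::parallelism() (class names hypothetical): a cipher whose SIMD path works on 4 blocks at once advertises 2*4 = 8, and everything else inherits the new default of 4.

    #include <cstddef>

    struct BlockCipher
       {
       virtual std::size_t parallelism() const { return 4; } // new default
       virtual ~BlockCipher() {}
       };

    struct Serpent_SIMD : public BlockCipher
       {
       // SSE2 path handles 4 blocks at a time; report twice that
       std::size_t parallelism() const { return 2 * 4; }
       };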
|
|
|
|
|
|
|
|
|
| |
depend on the particular implementation. Add a new virtual function to
BlockCipher named parallelism() that returns the number of blocks the
cipher object could or might want to process in parallel. It is
currently set to 1 by default, but it may make sense to increase this
even for scalar implementations, since better caching behavior seems
to make it a win.
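A minimal sketch of the new hook and how a caller might use it (the exact signature is an assumption):

    #include <cstddef>
    #include <vector>

    class BlockCipher
       {
       public:
          enum { BLOCK_SIZE = 16 }; // illustration only

          // Number of blocks this object would like per call; 1 by
          // default, larger where processing several blocks overlaps well
          virtual std::size_t parallelism() const { return 1; }

          virtual ~BlockCipher() {}
       };

    // A mode can then size its work buffer as a multiple of the hint:
    //    std::vector<unsigned char> buf(c.parallelism() * c.BLOCK_SIZE);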
|
|
|
|
| |
and 1.6x faster using SIMD_Scalar.
|
| |
|
| |
|
|
|
|
|
|
|
| |
Invalid_Argument just a typedef for std::invalid_argument. Make
Botan::Exception a typedef for std::runtime_error. Make Memory_Exhaustion
a public exception, and use it in other places where memory allocations
can fail.
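Roughly, the new arrangement looks like this (a sketch; the exact base class and message of Memory_Exhaustion are assumptions):

    #include <stdexcept>

    namespace Botan {

    typedef std::runtime_error Exception;
    typedef std::invalid_argument Invalid_Argument;

    // Public now, so allocators and other code that can run out of
    // memory may throw it directly
    struct Memory_Exhaustion : public Exception
       {
       Memory_Exhaustion() : Exception("Ran out of memory") {}
       };

    }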
|
| |
|
|
|
|
| |
faster than the scalar version on a Core2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bswap.h); too many external apps rely on loadstor.h existing.
Define 64-bit generic bswap in terms of 32-bit bswap, since it's
not much slower if 32-bit is also generic, and much faster if
it's not. This may be quite helpful on 32-bit x86 in particular.
Change the formulation of the generic 32-bit bswap. It may be faster or
slower depending on the CPU, especially on the latency and throughput
of rotate instructions, but it should be faster on an ideally
superscalar processor with rotate instructions (i.e., what I expect
future CPUs to look more like).
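A sketch of both formulations described above (function names assumed):

    #include <stdint.h>

    inline uint32_t rotate_left(uint32_t x, int n)
       { return (x << n) | (x >> (32 - n)); }

    // Generic 32-bit bswap: swap bytes within each 16-bit half, then
    // rotate the halves into place (ABCD -> BADC -> DCBA). The two
    // masked shifts are independent, which is what a superscalar core
    // can exploit.
    inline uint32_t bswap_32(uint32_t x)
       {
       x = ((x & 0xFF00FF00) >> 8) | ((x & 0x00FF00FF) << 8);
       return rotate_left(x, 16);
       }

    // 64-bit bswap built from the 32-bit one: swap each half, then
    // exchange the halves
    inline uint64_t bswap_64(uint64_t x)
       {
       const uint64_t hi = bswap_32(static_cast<uint32_t>(x >> 32));
       const uint64_t lo = bswap_32(static_cast<uint32_t>(x));
       return (lo << 32) | hi;
       }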
|
|
|
|
| |
Move most of the engine headers to internal
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes for the amalgamation generator for internal headers.
Remove BOTAN_DLL exporting macros from all internal-only headers;
the classes/functions there don't need to be exported, and
avoiding the PIC/GOT indirection can be a big win.
Add missing BOTAN_DLLs where necessary, mostly in gfpmath and cvc.
For GCC, use -fvisibility=hidden and set BOTAN_DLL to the
visibility __attribute__ to export those classes/functions.
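The GCC side of the scheme looks roughly like this (a sketch; the real macro logic in the generated build.h has more cases, and the class names below are purely illustrative):

    // In the build configuration, under GCC:
    #define BOTAN_DLL __attribute__((visibility("default")))
    // ...with -fvisibility=hidden on the compile line hiding the rest.

    class BOTAN_DLL BigInt { /* public API: explicitly exported */ };

    // Internal-only headers carry no BOTAN_DLL, so their classes stay
    // hidden and calls to them avoid the PIC/GOT indirection.
    class Some_Internal_Helper { /* internal: not exported */ };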
|
| |
|
|
|
|
|
| |
Change serp_simd_sbox.h's header guard to use the leading BOTAN_ prefix for
proper macro namespacing.
|
| |
|
|
|
|
|
|
|
|
| |
containers (specifically vector).
Rename is_empty to empty
Remove has_items
Rename create to resize
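In caller code the renames look like this (usage sketch, with std::vector standing in for the real container):

    #include <vector>

    typedef unsigned char byte;

    void example(std::vector<byte>& buf) // stand-in for SecureVector<byte>
       {
       if(buf.empty())    // was buf.is_empty()
          buf.resize(64); // was buf.create(64)
       // has_items() is gone; write !buf.empty() instead
       }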
|
|
|
|
| |
build magic, name them asm_macr_ARCH.h. Change all including files accordingly.
|
| |
|
| |
|
|
|
|
| |
to give a 3-7% speed improvement on Core2 with GCC.
|
| |
|
|
|
|
| |
an extra 4 words at the end of EK for writing (unused) values.
|
|
|
|
|
|
|
| |
Currently this requires SSE4.1 for _mm_extract_epi32 in the key schedule;
it would be nice to remove this dependency, though all currently
known/scheduled chips with AES-NI (Intel Westmere and Sandy Bridge, and
AMD Bulldozer) are supposed to include SSE 4.1, so this is not a huge
problem.
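The dependency is on operations like this (illustrative only, not the actual key schedule code):

    #include <smmintrin.h> // SSE4.1
    #include <stdint.h>

    // Pull the four 32-bit words of a round key out of an XMM register.
    // _mm_extract_epi32 is SSE4.1; on plain SSE2 one must instead go
    // through memory, or combine _mm_cvtsi128_si32 with shuffles.
    void store_round_key(__m128i K, uint32_t EK[4])
       {
       EK[0] = _mm_extract_epi32(K, 0);
       EK[1] = _mm_extract_epi32(K, 1);
       EK[2] = _mm_extract_epi32(K, 2);
       EK[3] = _mm_extract_epi32(K, 3);
       }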
|
|
|
|
|
| |
No noticeable change under the simulator (no surprises there), but this
should help a lot with pipelining on real hardware.
|
|
|
|
|
|
|
|
|
| |
tests under Intel's emulator.
Document and enable in the engine.
Merge both versions into aes_intel.cpp - some shared code and much similar
structure which might be sharable via macros.
|
| |
|
|
|
|
| |
testing with Intel's emulator shows all green.
|
|
|
|
| |
credits.txt and thanks.txt. Remove various bits of formatting weirdness.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
on a particular ISA extension rather than a list of CPUs. Much
easier to edit and audit, too. Add markers on the AES-NI code and
SHA-1/SSE2. Serpent and XTEA don't need them because they are
generic and only depend on simd_32, which will silently substitute a
scalar version if SSE2/AltiVec isn't enabled (since it turns out that
on superscalar processors just doing 4 blocks in parallel can be a
win even in GPRs).
Also add pentium3 to the list of CPUs with rdtsc; it was missing. Odd!
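The swap-out that simd_32 performs is a compile-time dispatch on exactly these macros, roughly like this (type names from the tree; the include paths are assumptions):

    // simd_32.h, in sketch form:
    #if defined(BOTAN_TARGET_CPU_HAS_SSE2)
      #include <botan/internal/simd_sse2.h>
      namespace Botan { typedef SIMD_SSE2 SIMD_32; }
    #elif defined(BOTAN_TARGET_CPU_HAS_ALTIVEC)
      #include <botan/internal/simd_altivec.h>
      namespace Botan { typedef SIMD_Altivec SIMD_32; }
    #else
      #include <botan/internal/simd_scalar.h>
      namespace Botan { typedef SIMD_Scalar SIMD_32; }
    #endif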
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From looking at how key gen works in particular, it seems easiest to provide
only AES-128, AES-192, and AES-256, and not a general AES class that can
accept any key length. This also has the bonus of allowing full loop
unrolling, which may be a win (how much so will depend on the
latency/throughput of the AES instructions, which is currently unknown).
No block interleaving yet, though of course it would work very nicely here;
this is simply due to the desire to keep things simple until what is
currently here can actually be tested. (Intel has an emulator that is
supposed to work but just crashes on my machine...)
I'm not entirely sure if byte swapping is required. Intel has a white paper
out that suggests it isn't (and really it would have been stupid of them
not to build this into the AES instructions), but who knows. If it turns
out to be necessary there is a pretty fast bswap instruction for SSE anyway.
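For what it's worth, the encryption side with a fixed key length fully unrolled would look something like this sketch (round keys assumed already expanded; as it turned out, the instructions take the state in normal memory byte order, so no swap is needed on load):

    #include <wmmintrin.h> // AES-NI intrinsics; build with -maes

    // One 16-byte block of AES-128, ten rounds fully unrolled;
    // K[0..10] is the expanded key schedule
    __m128i aes128_encrypt_block(__m128i B, const __m128i K[11])
       {
       B = _mm_xor_si128(B, K[0]); // initial AddRoundKey
       B = _mm_aesenc_si128(B, K[1]);
       B = _mm_aesenc_si128(B, K[2]);
       B = _mm_aesenc_si128(B, K[3]);
       B = _mm_aesenc_si128(B, K[4]);
       B = _mm_aesenc_si128(B, K[5]);
       B = _mm_aesenc_si128(B, K[6]);
       B = _mm_aesenc_si128(B, K[7]);
       B = _mm_aesenc_si128(B, K[8]);
       B = _mm_aesenc_si128(B, K[9]);
       return _mm_aesenclast_si128(B, K[10]); // final round, no MixColumns
       }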
|
|
|
|
|
| |
providing it. Also stubs in the engine for VIA's AES instructions, but
needs CPUID checking also.
|
| |
|
| |
|
|\
| |
| |
| |
| |
| | |
8fb69dd1c599ada1008c4cab2a6d502cbcc468e0)
to branch 'net.randombit.botan.general-simd' (head c05c9a6d398659891fb8cca170ed514ea7e6476d)
|
| |
| |
| |
| | |
and Altivec (though Altivec is seemingly slower ATM...)
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
operations.
Also add a pure scalar code version.
Convert Serpent to use this new interface, and add an implementation of
XTEA in SIMD.
The wrappers plus the scalar version allow SIMD-ish code to work on all
platforms. This is often a win due to better ILP being visible to the
processor (as with the recent XTEA optimizations). The only real danger is
register starvation, mostly an issue on x86 these days. So it may (or may
not) be a win to consolidate the standard C++ versions and the SIMD versions
together.
Future work:
- Add AltiVec/VMX version
- Maybe also for ARM's NEON extension? Less pressing, I would think.
- Convert SHA-1 code to use SIMD_32
- Add XTEA SIMD decryption (currently only encrypt)
- Change SSE2 engine to SIMD_engine
- Modify configure.py to set BOTAN_TARGET_CPU_HAS_[SSE2|ALTIVEC|NEON|XXX] macros
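The SSE2 instantiation of the wrapper described above looks roughly like this (method names are assumptions; the scalar version exposes the same interface using four uint32_t values):

    #include <emmintrin.h>

    class SIMD_32 // SSE2-backed; four 32-bit lanes
       {
       public:
          explicit SIMD_32(__m128i v) : reg(v) {}

          SIMD_32 operator+(SIMD_32 o) const
             { return SIMD_32(_mm_add_epi32(reg, o.reg)); }

          SIMD_32 operator^(SIMD_32 o) const
             { return SIMD_32(_mm_xor_si128(reg, o.reg)); }

          SIMD_32 rotl(int n) const // rotate each lane left, n in 1..31
             {
             return SIMD_32(_mm_or_si128(_mm_slli_epi32(reg, n),
                                         _mm_srli_epi32(reg, 32 - n)));
             }

       private:
          __m128i reg;
       };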
|
|/
|
|
|
| |
Pretty much useless and unused, except for listing the module names in
build.h, and the short versions totally suffice for that.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
a time more than doubles performance (from 38 MB/s to 90 MB/s on a Core2
Q6600). Could do even better with SIMD, I'm sure, but this is fast and
easy, and works everywhere.
It will probably hurt on 32-bit x86 due to the register pressure.
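If the cipher here is XTEA (an assumption on my part, based on the XTEA work mentioned elsewhere in this log), the transformation is essentially interleaving independent blocks in one round loop so their operations can issue in parallel:

    #include <stdint.h>

    // Two XTEA blocks per iteration; (a0,a1) and (b0,b1) carry no data
    // dependence on each other, so a superscalar core can overlap them
    void xtea_encrypt_x2(uint32_t a0, uint32_t a1,
                         uint32_t b0, uint32_t b1,
                         const uint32_t key[4],
                         uint32_t out[4])
       {
       uint32_t sum = 0;
       for(int i = 0; i != 32; ++i)
          {
          a0 += (((a1 << 4) ^ (a1 >> 5)) + a1) ^ (sum + key[sum & 3]);
          b0 += (((b1 << 4) ^ (b1 >> 5)) + b1) ^ (sum + key[sum & 3]);
          sum += 0x9E3779B9;
          a1 += (((a0 << 4) ^ (a0 >> 5)) + a0) ^ (sum + key[(sum >> 11) & 3]);
          b1 += (((b0 << 4) ^ (b0 >> 5)) + b0) ^ (sum + key[(sum >> 11) & 3]);
          }
       out[0] = a0; out[1] = a1; out[2] = b0; out[3] = b1;
       }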
|
|
|
|
|
|
| |
just too fragile and not that useful. Something like Java's checked exceptions
might be nice, but simply killing the process entirely if an unexpected
exception is thrown is not exactly useful for something trying to be robust.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove encrypt and decrypt - replaced by cipher() and cipher1().
Remove seek() - it was not well supported/tested, and I want to redo it
with a new interface once CTR and OFB modes become stream ciphers.
Rename resync to set_iv().
Remove StreamCipher::IV_LENGTH and add StreamCipher::valid_iv_length() to
allow multiple IV lengths (as Turing allows, for instance, and as Salsa20
would if XSalsa20 were supported).
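Put together, the reworked interface looks roughly like this (method names from the commit; the exact signatures are assumptions):

    #include <cstddef>

    typedef unsigned char byte;

    class StreamCipher
       {
       public:
          // replaces encrypt()/decrypt(): XOR with keystream, same
          // operation in both directions
          virtual void cipher(const byte in[], byte out[],
                              std::size_t len) = 0;

          // in-place convenience form
          void cipher1(byte buf[], std::size_t len)
             { cipher(buf, buf, len); }

          // was resync(); the fixed IV_LENGTH becomes a predicate
          virtual void set_iv(const byte iv[], std::size_t iv_len) = 0;
          virtual bool valid_iv_length(std::size_t iv_len) const = 0;

          virtual ~StreamCipher() {}
       };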
|
|
|
|
|
|
|
|
| |
the prefetch is called for each block of input, so a total of
(4096+256)/64 = 68 prefetches are executed for every block. This reduces
the performance of iterative modes dramatically.
I'm not sure what the right approach for dealing with this is.
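The arithmetic, concretely: with 64-byte cache lines, touching every line of the 4 KiB of TE tables plus the 256-byte SE table costs (4096+256)/64 = 68 prefetch instructions, here paid on every 16-byte block. A sketch of that cost using GCC's __builtin_prefetch (function name hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    // 68 prefetches total: 64 lines of TE, 4 lines of SE. Tolerable
    // once per message; far too much once per block.
    void prefetch_aes_tables(const uint32_t TE[1024], const uint8_t SE[256])
       {
       const uint8_t* te = reinterpret_cast<const uint8_t*>(TE);
       for(size_t i = 0; i != 4096; i += 64)
          __builtin_prefetch(te + i);
       for(size_t i = 0; i != 256; i += 64)
          __builtin_prefetch(SE + i);
       }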
|
|
|
|
|
|
|
|
|
|
| |
timing attacks, since once all the TE/SE tables are entirely in cache,
timing attacks against the implementation become somewhat harder. However,
for this to be a full defense it would be necessary to ensure the tables
were entirely loaded into cache, which is not guaranteed by the normal SSE
prefetch instructions (or by the prefetch instructions of other CPUs,
AFAIK).
Much more importantly, it provides a 10% speedup.
|
| |
|
| |
|