| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
representation (rather than in an iterator context), instead use &buf[0],
which works for both MemoryRegion and std::vector
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
harmonising MemoryRegion with std::vector:
The MemoryRegion::clear() function would zeroise the buffer, but keep
the memory allocated and the size unchanged. This is very different
from STL's clear(), which is basically the equivalent of what is
called destroy() in MemoryRegion. So to be able to replace MemoryRegion
with a std::vector, we have to rename destroy() to clear() and we have
to expose the current functionality of clear() in some other way, since
vector doesn't support this operation. Do so by adding a global function
named zeroise() which takes a MemoryRegion which is zeroed. Remove clear()
to ensure all callers are updated.
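
The difference between the two operations can be sketched as follows. This is a minimal illustration using std::vector as the container; the template signature is an assumption made for the example and is not the MemoryRegion overload the commit added.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the zeroise() helper described above:
// overwrite every element with zero while keeping the size and the
// allocation unchanged (the behaviour of the old MemoryRegion::clear()).
template <typename T>
void zeroise(std::vector<T>& buf)
{
   std::fill(buf.begin(), buf.end(), T());
}
```

After zeroise(v) the vector still has v.size() elements, all zero; after v.clear() it is empty, which matches what destroy() used to do. Note that a plain fill like this may be optimised away by a compiler if the buffer is never read again, so it is not a substitute for a dedicated secure-wipe routine.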
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
rotations in the code. This reduces the number of cache lines
potentially accessed in the first round from 64 to 16 (assuming 64
byte cache lines). On average, about 10 cache lines will actually be
accessed, assuming a uniform distribution of the inputs, so there
definitely is still a timing channel here, just a somewhat smaller
one.
I experimented with using the 256 element table for all rounds but it
reduced performance significantly and I'm not sure if the benefit is
worth the cost or not.
|
| |
|
| |
|
|
|
|
| |
supports epi64x in 64-bit mode.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
reasons, Intel C++ rejects
const __m128i foo = _mm_set_epi64x(...)
though it will accept it if you use one of the _mm_set1 variants.
And Visual C++ doesn't know about _mm_set_epi64x() in 32-bit mode for
similarly dumb reasons - it works fine compiling for 64 bit but for
whatever reason they don't offer this function when compiling as 32
bit. Unfortunately there isn't a good way to specify it's OK with a
particular compiler with one arch but not another, so just disable it
globally for the time being. The workaround for VC++ is probably to
use _mm_set_epi32 and break up the input values into 32 bit chunks.
ICC is a lost cause I fear.
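
The _mm_set_epi32 workaround mentioned above could look roughly like this. It is a sketch only: the helper name set_epi64x_compat is made up for the example, and it assumes an x86 target with SSE2 available.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: builds the same vector as _mm_set_epi64x(hi, lo)
// using only _mm_set_epi32, which takes its arguments from the most
// significant 32-bit lane down to the least significant one.
inline __m128i set_epi64x_compat(std::uint64_t hi, std::uint64_t lo)
{
   return _mm_set_epi32(
      static_cast<int>(static_cast<std::uint32_t>(hi >> 32)),
      static_cast<int>(static_cast<std::uint32_t>(hi)),
      static_cast<int>(static_cast<std::uint32_t>(lo >> 32)),
      static_cast<int>(static_cast<std::uint32_t>(lo)));
}
```

Storing the result with _mm_storeu_si128 into two 64-bit words should give lo in the first word and hi in the second, the same layout _mm_set_epi64x produces.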
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
constant time and on a Nehalem is significantly faster than the table
based version. This implementation technique was invented by Mike
Hamburg and described in a paper in CHES 2009 "Accelerating AES with
Vector Permute Instructions". This code is basically a translation of
his public domain x86-64 assembly code into intrinsics.
Todo: add support for AES-192 and AES-256; this just requires
implementing the key schedules.
Currently only tested on an i7 with GCC (32 and 64 bit code);
testing/optimization on 32-bit processors with SSSE3 like the Atom,
and with Visual C++ and other compilers, are also todos.
|
|
|
|
| |
fine with latest SVN.
|
|
|
|
| |
process
|
| |
|
|
|
|
|
| |
for getting access to the key schedule, instead of giving the key
schedule protected status, which is much harder to audit.
|
|
|
|
|
| |
protected accessor functions for get and set. Set is needed by the x86
version since it implements the key schedule directly.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
This caused Doxygen to think this was markup meant for it, which really
caused some clutter in the namespace page.
|
|
|
|
|
|
|
|
|
|
|
|
| |
in the help.
Unfortunately we can't just remove --enable-isa, because for the
callback to work the target list has to already exist, and it only
does by virtue of the default=[] param to the enable-isa setup. We
could just use append_const, except then we can't run on Python 2.4,
and the latest release of RHEL only has 2.4 :(
Rename aes_ni to aes-ni in configuration-speak
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the implementation rather than the preferred one. Update all
implementations.
Add a new function parallel_bytes() which returns
parallelism() * BLOCK_SIZE * BUILD_TIME_CONSTANT
This is because I noticed all current callers of parallelism() just
multiplied the result by the block size already, so this simplified
that code.
The build time constant is set to 4, which was the previous default
return value of parallelism(). However the SIMD versions returned
2*native parallelism rather than 4*, so this increases the buffer
sizes used for those algorithms.
The constant multiple lives in buildh.in and build.h, and is named
BOTAN_BLOCK_CIPHER_PAR_MULT.
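
The relationship between the two functions can be sketched like this. The class names and the block_size() accessor are invented for the illustration and are not the library's actual interface; only the formula and the constant's name and value come from the message above.

```cpp
#include <cassert>
#include <cstddef>

// Value taken from the commit message; in the real tree it lives in build.h.
#define BOTAN_BLOCK_CIPHER_PAR_MULT 4

// Hypothetical cipher interface, reduced to what the formula needs.
class Example_Block_Cipher
{
public:
   virtual ~Example_Block_Cipher() {}

   virtual std::size_t block_size() const = 0;

   // Native parallelism of the implementation, in blocks.
   virtual std::size_t parallelism() const { return 1; }

   // Suggested buffer size in bytes for callers that batch blocks.
   std::size_t parallel_bytes() const
   {
      return parallelism() * block_size() * BOTAN_BLOCK_CIPHER_PAR_MULT;
   }
};

// A scalar cipher with 8-byte blocks and the default parallelism of 1.
class Example_Scalar_Cipher : public Example_Block_Cipher
{
public:
   std::size_t block_size() const { return 8; }
};

// A SIMD cipher with 16-byte blocks that natively handles 4 blocks at once.
class Example_SIMD_Cipher : public Example_Block_Cipher
{
public:
   std::size_t block_size() const { return 16; }
   std::size_t parallelism() const { return 4; }
};
```

With these assumed figures the scalar cipher asks for 1 * 8 * 4 = 32 bytes and the SIMD cipher for 4 * 16 * 4 = 256 bytes.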
|
|
|
|
|
|
|
|
|
|
| |
be branch-free. This reduces performance noticeably on my Core2 (from
32 MiB/s to a bit over 27 MiB/s), but so it goes.
The IDEA implementation using SSE2 is already branch-free here, and
runs at about 135 MiB/s on my machine.
Also add more IDEA tests, generated by OpenSSL
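
The message does not show the changed code. As an illustration only, one way to write IDEA's multiplication modulo 65537 (where the 16-bit value 0 stands for 65536) without a data-dependent branch is to compute both possible results and select between them with a mask. The function name idea_mul is an assumption for this sketch, and whether the compiled code is truly branch-free still depends on the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a branch-free multiply modulo 65537, with 0 representing 65536.
inline std::uint16_t idea_mul(std::uint16_t x, std::uint16_t y)
{
   const std::uint32_t P = static_cast<std::uint32_t>(x) * y;

   // 0xFFFF if P != 0, 0x0000 if P == 0 (that is, if x or y was zero).
   const std::uint16_t P_mask = static_cast<std::uint16_t>(!P - 1);

   const std::uint32_t P_hi = P >> 16;
   const std::uint32_t P_lo = P & 0xFFFF;

   // Result when both inputs are non-zero: (lo - hi) mod 65537.
   const std::uint16_t r_1 =
      static_cast<std::uint16_t>((P_lo - P_hi) + (P_lo < P_hi));

   // Result when at least one input is zero: 1 - x - y mod 65536.
   const std::uint16_t r_2 = static_cast<std::uint16_t>(1 - x - y);

   // Mask-select instead of branching on P.
   return static_cast<std::uint16_t>((r_1 & P_mask) | (r_2 & ~P_mask));
}
```

The select at the end replaces the usual "if the product is zero" test, which is the step that would otherwise leak timing information about the inputs.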
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a second template param to SecureVector which specifies the initial
length.
Change all callers to be SecureVector instead of SecureBuffer.
This can go away in C++0x, once compilers implement N2712 ("Non-static
data member initializers"), and we can just write code as
SecureVector<byte> P{18};
instead
|
|
|
|
|
|
|
|
| |
Default unless specified is now 4.
For SIMD code, use 2x the number of blocks which are processed in
parallel using SIMD by that cipher. It may make sense to increase this to
4x or even more; further experimentation is necessary.
|
|
|
|
|
|
|
|
|
| |
depend on the particular implementation. Add a new virtual function to
BlockCipher named parallelism that returns the number of blocks the
cipher object could or might want to process in parallel. Currently
set to 1 by default but may make sense to increase this for even
scalar implementations since it seems like better caching behavior
makes it a win.
|
|
|
|
| |
and 1.6x faster using SIMD_Scalar.
|
| |
|
| |
|
|
|
|
|
|
|
| |
Invalid_Argument just a typedef for std::invalid_argument. Make
Botan::Exception a typedef for std::runtime_error. Make Memory_Exhaustion
a public exception, and use it in other places where memory allocations
can fail.
|
| |
|
|
|
|
| |
faster than the scalar version on a Core2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bswap.h); too many external apps rely on loadstor.h existing.
Define 64-bit generic bswap in terms of 32-bit bswap, since it's
not much slower if 32-bit is also generic, and much faster if
it's not. This may be quite helpful on 32-bit x86 in particular.
Change formulation of generic 32-bit bswap. It may be faster or
slower depending on the CPU, especially the latency and throughput
of rotate instructions, but should be faster on an ideally
superscalar processor with rotate instructions (i.e., what I expect
future CPUs to look more like).
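
A rotate-based 32-bit byte swap and a 64-bit swap built on top of it might look like the following. This is a sketch of the general technique; the function names are chosen for the example and the exact formulation used in the commit is not shown in the message.

```cpp
#include <cassert>
#include <cstdint>

// Rotations by a fixed 8 bits; compilers commonly turn these into
// single rotate instructions where the CPU has them.
inline std::uint32_t rotl8(std::uint32_t v) { return (v << 8) | (v >> 24); }
inline std::uint32_t rotr8(std::uint32_t v) { return (v >> 8) | (v << 24); }

// Generic 32-bit byte swap using two independent rotates and masks.
inline std::uint32_t bswap32_generic(std::uint32_t v)
{
   return (rotr8(v) & 0xFF00FF00) | (rotl8(v) & 0x00FF00FF);
}

// Generic 64-bit byte swap defined in terms of the 32-bit one:
// swap each half, then exchange the halves.
inline std::uint64_t bswap64_generic(std::uint64_t v)
{
   const std::uint32_t hi = static_cast<std::uint32_t>(v >> 32);
   const std::uint32_t lo = static_cast<std::uint32_t>(v);
   return (static_cast<std::uint64_t>(bswap32_generic(lo)) << 32) |
          bswap32_generic(hi);
}
```

Because the two rotates do not depend on each other, a superscalar CPU can issue them in parallel. Building the 64-bit version on the 32-bit one also means any faster platform-specific 32-bit swap is picked up automatically.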
|
|
|
|
| |
Move most of the engine headers to internal
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes for the amalgamation generator for internal headers.
Remove BOTAN_DLL exporting macros from all internal-only headers;
the classes/functions there don't need to be exported, and
avoiding the PIC/GOT indirection can be a big win.
Add missing BOTAN_DLLs where necessary, mostly gfpmath and cvc.
For GCC, use -fvisibility=hidden and set BOTAN_DLL to the
visibility __attribute__ to export those classes/functions.
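
The GCC arrangement described above follows a common pattern, sketched here with a made-up macro name (EXAMPLE_DLL) rather than the project's actual build.h contents. The sketch assumes the library is compiled with -fvisibility=hidden.

```cpp
#include <cassert>

// With -fvisibility=hidden every symbol is hidden by default, so the
// export macro has to opt public symbols back in.
#if defined(__GNUC__)
  #define EXAMPLE_DLL __attribute__((visibility("default")))
#else
  #define EXAMPLE_DLL
#endif

// Part of the public API: stays visible in the shared object.
EXAMPLE_DLL int exported_function() { return 42; }

// Internal helper: no macro, so it is hidden under -fvisibility=hidden
// and calls to it from inside the library need no PLT/GOT indirection.
int internal_function() { return 7; }
```

Internal-only headers then simply omit the macro, which is what removing BOTAN_DLL from them achieves.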
|
| |
|
|
|
|
|
| |
Change serp_simd_sbox.h's header guard to use the leading BOTAN_ prefix for
proper macro namespacing.
|
| |
|
|
|
|
|
|
|
|
| |
containers (specifically vector).
Rename is_empty to empty
Remove has_items
Rename create to resize
|
|
|
|
| |
build magic, name them asm_macr_ARCH.h. Change all including files accordingly.
|
| |
|
| |
|
|
|
|
| |
to give a 3-7% speed improvement on Core2 with GCC.
|
| |
|