[openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

Andy Polyakov via RT rt at openssl.org
Mon May 25 09:55:53 UTC 2015


Hi,

Thanks for the tips and pointers. As for getting off-topic, I'm the one to
blame anyway. So I'm going to strip most of the message and comment only on
the points that might still be of public interest.

>> (*) BTW, did you try existing [multi-block SHA]?
> 
> No, totally missed it!  Found it now, good work!
> 
> $ find -name 'sha*-mb*'
> ./crypto/sha/asm/sha256-mb-x86_64.pl
> ./crypto/sha/asm/sha1-mb-x86_64.pl
> 
> How is an application using OpenSSL supposed to access this
> functionality?  Is there documentation?  So far, I only found uses in
> OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

Well, you have to admit that it's a bit too special to provide a
general-purpose interface to. Which is why an application-specific
interface is provided instead, a TLS-oriented one in
e_aes_cbc_hmac_sha*.c. The mention of multi-block SHA was not really of
the "go ahead and use it" kind, but rather "is it interesting?", with an
implied "if it is, then we can discuss how to interface your application
to it". Note that it's even possible to take those modules out of the
OpenSSL context...
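
For completeness, the capability is advertised through a cipher flag, and
the actual record packing is driven through EVP_CIPHER_CTX_ctrl() with the
EVP_CTRL_TLS1_1_MULTIBLOCK_* controls and their parameter block. Treat the
following as a minimal detection sketch rather than a documented interface;
for the exact calling convention, look at how the TLS record layer drives
e_aes_cbc_hmac_sha*.c.

#include <openssl/evp.h>

/* Sketch: does this cipher implement the TLS multi-block path?  The
 * stitched AES-CBC-HMAC-SHA ciphers set EVP_CIPH_FLAG_TLS1_1_MULTIBLOCK;
 * everything else takes the normal one-record-at-a-time path.  Packing
 * itself goes through EVP_CTRL_TLS1_1_MULTIBLOCK_AAD and
 * EVP_CTRL_TLS1_1_MULTIBLOCK_ENCRYPT with an
 * EVP_CTRL_TLS1_1_MULTIBLOCK_PARAM block. */
static int cipher_supports_multiblock(const EVP_CIPHER *cipher)
{
    return (EVP_CIPHER_flags(cipher) & EVP_CIPH_FLAG_TLS1_1_MULTIBLOCK) != 0;
}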

> You could want to add optional use of XOP there - rotates and vcmov.
> For SHA-1, F() is just one vcmov and H() is vcmov/andnot/xor (see
> sse-intrinsics.c above).  For SHA-2, we use:
> 
> #define Maj(x,y,z) vcmov(x, y, vxor(z, y))
> #define Ch(x,y,z) vcmov(y, z, x)

As for XOP: the motto is to provide near-optimal performance with minimal
code. That means that if some processor-specific optimization provides
only a small improvement, it's likely to be omitted. I don't recall
attempting XOP specifically in multi-block SHA256, but it was attempted
in SHA1 and it wasn't impressive. I even recall XOP rotates delivering
worse performance in some cases. That was likely an instruction-alignment
issue (at least I ran into an anomaly with the ChaCha code where merely
flipping the order of an instruction's input arguments affected
performance). Another case of XOP omission is plain SHA256. The point
there is that execution is dominated by the scalar part, so reducing the
number of vector instructions has no effect whatsoever. Anyway, XOP is
considered, but so far it has not been found "worthy". It does make sense
to double-check multi-block SHA256 specifically, though...
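
To make the comparison concrete, here is a rough sketch (intrinsics rather
than the perlasm we actually use) of the two XOP instructions in question
next to their plain-SSE2 equivalents. XOP saves two instructions per rotate
and two per Ch, which only matters if those instructions are what you're
actually limited by.

#include <x86intrin.h>   /* _mm_roti_epi32 / _mm_cmov_si128, compile with -mxop */

/* 32-bit rotate left by a constant: one vprotd with XOP,
 * shift+shift+or without it. */
#ifdef __XOP__
# define ROTL32(x, n)  _mm_roti_epi32((x), (n))
#else
# define ROTL32(x, n)  _mm_or_si128(_mm_slli_epi32((x), (n)), \
                                    _mm_srli_epi32((x), 32 - (n)))
#endif

/* SHA-2 Ch(x,y,z) = (x & y) ^ (~x & z): one vpcmov with XOP,
 * and/andnot/xor without it.  Maj() can be built the same way, as in
 * the macros quoted above. */
static inline __m128i Ch(__m128i x, __m128i y, __m128i z)
{
#ifdef __XOP__
    return _mm_cmov_si128(y, z, x);
#else
    return _mm_xor_si128(_mm_and_si128(x, y),
                         _mm_andnot_si128(x, z));
#endif
}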

> We're also experimenting with instruction interleaving.  Sometimes,
> especially when running only 1 thread/core (such as on cheaper Intel
> CPUs without HT, or when there's no thread-level parallelism in the
> application - not our case, though), it's optimal to interleave several
> SIMD computations, for even wider virtual SIMD vectors than the CPU
> supports natively.  e.g. for MD5 on AVX (64-bit builds only, since need
> 16 registers for interleaving), we currently interleave 3 of those (so
> 12 MD5's in parallel per thread).

It's not uncommon for cryptographic algorithms to have short dependency
chains and consequently limited ILP, instruction-level parallelism. But
processors have limited resources too, and the question is whether those
resources are sufficient to sustain the algorithm's ILP. Or rather the
other way around: if the processor has more resources than the ILP can
use, those resources run underutilized, and only then does it make sense
to interleave instructions. Processor resources can be characterized by
an IPC limit, instructions per cycle, and the maximum possible
improvement is IPC/ILP. But one should remember that IPC is not just the
number of execution ports, for example 4 on Haswell. Some instructions
are port-specific, and if an algorithm uses such instructions a lot,
you'll be limited by that port. Anyway, MD5 is known for its low ILP, and
it does make sense to interleave it (with itself or with another
algorithm). This doesn't apply to SHA: it has higher ILP, and no
contemporary processor has the capacity to fully utilize that
parallelism. Actually it's a bit worse in practice, because the thing
about multi-block is that it's limited by shifts, which are
port-specific. This is why you observe virtually no difference among
"desktop/server" processors.
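
As a toy illustration (mine, not anything in the tree): each "chain" below
is a serial add-and-rotate dependency chain standing in for an MD5-like
round function. Since chain b doesn't depend on chain a, an out-of-order
core with spare IPC executes both in roughly the time of one, which is all
that interleaving buys you.

#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

uint32_t two_chains_interleaved(uint32_t a, uint32_t b,
                                const uint32_t *ka, const uint32_t *kb,
                                int rounds)
{
    for (int i = 0; i < rounds; i++) {
        /* chain A: each step depends only on the previous step of A */
        a = rotl32(a + ka[i], 7);
        /* chain B: independent of chain A, fills otherwise idle ports */
        b = rotl32(b + kb[i], 12);
    }
    return a ^ b;
}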

As for the 4 Haswell ports: of the 4, only 3 can execute vector
instructions. So the absolute best results can be achieved by mixing
scalar integer-only and vector instructions, e.g. in addition to MD5 on
AVX, mixing in even a scalar "thread". Well, the gain would have to be
divided by the ratio of how many blocks the vector part processes to how
many blocks the scalar part adds, so it would be too little to care
about. It's more of a fun fact in this context.
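
For the curious, a toy sketch of that mix (using AVX2 for brevity, not the
actual MD5/AVX code): the vector chain occupies the vector-capable ports
while the independent scalar chain can retire on the integer-only one, but
it adds only one lane's worth of work on top of the eight the vector part
handles, which is exactly why the overall gain is small.

#include <immintrin.h>   /* compile with -mavx2 */
#include <stdint.h>

/* One vector "thread" plus one scalar "thread"; the two chains are
 * independent, so the scalar work can hide in the vector chain's shadow. */
void mixed_chains(__m256i *vec_state, uint32_t *scalar_state,
                  const uint32_t *k, int rounds)
{
    __m256i a = *vec_state;
    uint32_t b = *scalar_state;

    for (int i = 0; i < rounds; i++) {
        a = _mm256_add_epi32(a, _mm256_set1_epi32((int)k[i])); /* vector chain */
        uint32_t t = b + k[i];                                  /* scalar chain */
        b = (t << 7) | (t >> 25);
    }
    *vec_state = a;
    *scalar_state = b;
}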



