[openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

Sun May 24 07:10:00 UTC 2015

Hi Andy,

Thank you for your reply!  I am CC'ing Lei on mine.

On Wed, May 20, 2015 at 12:55:10PM +0200, Andy Polyakov via RT wrote:
> For reference. icc was not cared for for quite some time. Initially it
> was possible for me, by then university employee, to use it, but then
> they changes terms and it became impossible for me to maintain it. But
> I've just noticed they provide some starter version of something, I'll
> see...

Yes, this might be usable for you:

https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor

"Intel provides select Intel Software Development Products at no cost to
qualified open source contributors who are working on open source
projects compliant with the Open Source Initiative (OSI)."

> But linux-x86_64-icc is not present in and was never supported in
> pre-1.0.2.

Oh, I didn't realize that.  As I mentioned, we're actually building
with icc for MIC.  When we build with icc for an x86_64 host, we
typically just link against the distro's gcc-built OpenSSL, so we didn't
run into this issue ourselves until we started building for MIC and thus
had to make our own OpenSSL build with icc.  (Indeed, I've been building
OpenSSL from source on many other occasions, and as part of a distro
too, but that's not with icc and is unrelated to the JtR project.)

> So you ought to provide custom line. This remark doesn't mean
> that fix can't be backported, but out of curiosity, what's your config
> line?

Currently, Lei put this into JtR -jumbo README-MIC:

Build LibreSSL (version 2.1.6):
$ cd libressl-2.1.6
$ ./configure CC="icc -mmic" --host=k1om-linux --prefix=$MIC
$ make && make install

The previous instructions were:

Build OpenSSL (version 1.0.0q):
$ cd openssl-1.0.0q
$ patch Configure < $JOHN/src/unused/openssl.patch
$ ./Configure linux-mic shared --prefix=$MIC
$ make && make install

I'm not sure what was in $JOHN/src/unused/openssl.patch - I guess it had
to add linux-mic support.  Lei, please reply to all.

> Is assembly engaged? If so, how fast is it? Or is it so that you
> count on compiler to produce vector code that would process multiple
> inputs in parallel with SIMD?

We're using OpenSSL (or LibreSSL) as an easy but slower option,
replacing it with our own SIMD code right in JtR tree whenever we can
and where this makes sense.  So we're not trying to optimize OpenSSL's
code.  It remains scalar and unmodified, and our use of it is just to
have things working where we do not have optimized code yet or where we
prefer simpler rather than faster code (such as for some lightweight
precomputation in some rare cases where this makes sense).

This varies by crypto primitive, but overall we currently have SIMD
intrinsics code for MMX, SSE2+/AVX, XOP, AVX2, MIC/AVX-512, and for
bitslice DES also for AltiVec and NEON.

One thing for which we still use OpenSSL's code in a performance-critical
manner is SSH key passphrase cracking (which involves RSA).  There are
probably many more examples like this, but this is a prominent one that
comes to mind.  There must be a lot of room for optimization here.

As to compiler auto-vectorization - no, we are not relying on it.

> On related note. What's Xeon Phi in this context? I mean are we talking
> about Knights Corner

Unfortunately, yes.  BTW, you're welcome to play with it if you like:

http://openwall.info/wiki/HPC/Village

> (that features own compatible-with-nothing SIMD instruction set)

Yes, but at source code level many intrinsics match AVX-512.  So we use
it as a way to prepare for AVX-512.  In many cases, it's just a
recompile away.  There are some notable exceptions to this, though - in
fact, you happened to list some below.

> or Knights Landing (that features AVX512)? If latter,
> it might be interesting to extend multi-block SHA support(*), which
> should allow to achieve pretty cool results (with vector rotate and
> ternary logic instructions, not to mention 16 lanes:-). [As for
> "interesting". It's possible but not really interesting in Knights
> Corner case, because effort is too specific, just a single obscure and
> hardly available CPU, while AVX512 is planned even for other processors
> so that code will be reusable.]

This will take some #ifdef's to provide vector rotates as a macro when
building for MIC and to use the ternary logic intrinsics only when
building for true AVX-512 - nasty, but I think reasonable.  For now,
we're simply using the common subset between MIC and AVX-512:

https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/src/pseudo_intrinsics.h
https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/src/sse-intrinsics.c
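
To illustrate the two pieces mentioned above, here is a minimal scalar
sketch in C.  It is not the code from pseudo_intrinsics.h.  ROTL32 shows
the shift/shift/or shape that a vector rotate macro would take on a
target without a native vector rotate, and ternlog32 is a hypothetical
per-bit model of a three-input ternary logic operation, where the bits
of a, b and c index an 8-bit truth table.  The names ROTL32, ternlog32,
TT_CH and TT_MAJ are made up for this example.

```c
#include <stdint.h>

/* Rotate left built from two shifts and an OR, the same shape a vector
 * macro would take where no native vector rotate exists.
 * Valid for 0 < n < 32. */
#define ROTL32(x, n) \
    ((uint32_t)(((uint32_t)(x) << (n)) | ((uint32_t)(x) >> (32 - (n)))))

/* Scalar model of a three-input ternary logic operation: at every bit
 * position, the bits of a, b and c form an index (a is the high bit)
 * into the 8-bit truth table imm. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << i;
    }
    return r;
}

/* Truth tables: 0xCA picks b where a is set and c elsewhere (SHA-2 Ch);
 * 0xE8 is the bitwise majority of the three inputs (SHA-2 Maj). */
#define TT_CH  0xCA
#define TT_MAJ 0xE8

/* Returns 1 if the models agree with the textbook formulas.  The three
 * constants cover all 8 input bit combinations. */
static int check_rotl_ternlog(void)
{
    uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;
    uint32_t ch  = (x & y) ^ (~x & z);
    uint32_t maj = (x & y) ^ (x & z) ^ (y & z);

    return ROTL32(0x80000001u, 1) == 0x00000003u
        && ternlog32(x, y, z, TT_CH) == ch
        && ternlog32(x, y, z, TT_MAJ) == maj;
}
```

With a real ternary logic instruction, Ch and Maj each collapse into a
single operation, which is why it is worth the extra #ifdef.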

> (*) BTW, did you try existing one?

No, totally missed it!  Found it now, good work!

$ find -name 'sha*-mb*'
./crypto/sha/asm/sha256-mb-x86_64.pl
./crypto/sha/asm/sha1-mb-x86_64.pl

How is an application using OpenSSL supposed to access this
functionality?  Is there documentation?  So far, I only found uses in
OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

You might want to add optional use of XOP there - rotates and vcmov.
For SHA-1, F() is just one vcmov and H() is vcmov/andnot/xor (see
sse-intrinsics.c above).  For SHA-2, we use:

#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
#define Ch(x,y,z) vcmov(y, z, x)
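
As a sanity check of these two macros, here is a scalar sketch in C.  It
assumes that vcmov(x, y, z) takes bits from x where z is set and from y
elsewhere, and that vxor is plain XOR; the scalar stand-ins and the
*_ref and check_* helper names are made up for this example.

```c
#include <stdint.h>

/* Scalar stand-ins for the vector ops (assumed semantics):
 * vcmov takes bits of x where z is 1 and bits of y where z is 0. */
static uint32_t vcmov(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

static uint32_t vxor(uint32_t x, uint32_t y)
{
    return x ^ y;
}

#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
#define Ch(x,y,z) vcmov(y, z, x)

/* Textbook SHA-2 definitions, for comparison. */
static uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

static uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* Returns 1 if both macros match the textbook forms.  The functions are
 * bitwise, so inputs that cover all 8 bit combinations are sufficient. */
static int check_vcmov_forms(void)
{
    uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;

    return Ch(x, y, z) == ch_ref(x, y, z)
        && Maj(x, y, z) == maj_ref(x, y, z);
}
```

The Maj form works because where y and z agree the majority is y, and
where they differ x decides.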

We're also experimenting with instruction interleaving.  Sometimes,
especially when running only 1 thread/core (such as on cheaper Intel
CPUs without HT, or when there's no thread-level parallelism in the
application - not our case, though), it's optimal to interleave several
SIMD computations, for even wider virtual SIMD vectors than the CPU
supports natively.  For example, for MD5 on AVX (64-bit builds only,
since we need 16 registers for interleaving), we currently interleave 3
such computations (so 12 MD5's in parallel per thread).
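
A minimal scalar sketch of the interleaving idea, in C.  The mixing step
is a toy stand-in and not MD5, and all names (toy_hash, toy_hash_x3,
check_interleave) are made up for this example.  In the real code each
variable would be a SIMD vector holding several hashes, but the
structure is the same: three independent dependency chains are advanced
statement by statement, so the CPU always has independent work to issue.

```c
#include <stdint.h>

#define ROTL32(x, n) \
    ((uint32_t)(((uint32_t)(x) << (n)) | ((uint32_t)(x) >> (32 - (n)))))

#define TOY_INIT   0x67452301u
#define TOY_ROUNDS 16

/* One computation at a time: every step depends on the previous one. */
static uint32_t toy_hash(uint32_t in)
{
    uint32_t a = TOY_INIT, b = in;
    for (int i = 0; i < TOY_ROUNDS; i++) {
        a = ROTL32(a + (b ^ (uint32_t)i), 7) + b;
        b = ROTL32(b + a, 12);
    }
    return a ^ b;
}

/* Three computations interleaved: the chains 0, 1 and 2 never depend on
 * each other, so their steps can overlap in the pipeline. */
static void toy_hash_x3(const uint32_t in[3], uint32_t out[3])
{
    uint32_t a0 = TOY_INIT, a1 = TOY_INIT, a2 = TOY_INIT;
    uint32_t b0 = in[0], b1 = in[1], b2 = in[2];

    for (int i = 0; i < TOY_ROUNDS; i++) {
        a0 = ROTL32(a0 + (b0 ^ (uint32_t)i), 7) + b0;
        a1 = ROTL32(a1 + (b1 ^ (uint32_t)i), 7) + b1;
        a2 = ROTL32(a2 + (b2 ^ (uint32_t)i), 7) + b2;
        b0 = ROTL32(b0 + a0, 12);
        b1 = ROTL32(b1 + a1, 12);
        b2 = ROTL32(b2 + a2, 12);
    }
    out[0] = a0 ^ b0;
    out[1] = a1 ^ b1;
    out[2] = a2 ^ b2;
}

/* Returns 1 if the interleaved version matches the one-at-a-time one. */
static int check_interleave(void)
{
    const uint32_t in[3] = { 1u, 0xDEADBEEFu, 0x12345678u };
    uint32_t out[3];

    toy_hash_x3(in, out);
    return out[0] == toy_hash(in[0])
        && out[1] == toy_hash(in[1])
        && out[2] == toy_hash(in[2]);
}
```

The register count is the limit: each extra interleaved computation
needs its own copy of the state and temporaries.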

Is it OK that we went quite off-topic on this RT issue?

Alexander