[openssl-dev] [openssl.org #3615] [PATCH] ChaCha20 with Poly1305 TLS Cipher Suites via the EVP interface

Mon May 18 15:44:58 UTC 2015

Poly1305 implementations can be classified by the number of significant
bits in the words making up the multi-precision values involved, or in
other words by the base of the numerical representation, e.g. base 2^64,
base 2^32, base 2^26. It's not obvious that a single specific base would
be the optimal choice for all platforms, rather the contrary. The
attached code provides a pair of reference C implementations, base 2^64
and base 2^32, and at the assembly level uses a mixture of bases
depending on the processor the code is currently executing on. The
choices are discussed in commentary sections (where not obvious). For
example, as it turns out, on most recent x86_64 platforms the scalar
integer-only base 2^64 implementation is actually the best choice for
processing a single input block or a small number of them. On some of
them, most notably from the Atom family, it even turned out to be the
best choice overall. As a result it's argued that the SSE2
implementation can be omitted on x86_64, because it provides an
improvement only on old processors, too old to care about (well, one can
probably still argue in favor of Westmere, but luckily the performance
improvement is smallest on it). On a related note, given the AVX
results, one can say that if not for Bulldozer, one could have argued in
favor of omitting even AVX.
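
To make "base" concrete, here is a minimal sketch, not taken from the
attached code, of moving a 128-bit little-endian value into base 2^26
(five limbs) and back. The function names are made up for illustration.
The usual argument for base 2^26 is that a product of two limbs fits in
52 bits, so several products can be summed in a 64-bit (or SIMD 64-bit
lane) accumulator before any carry handling is needed. Base 2^64 instead
relies on a 64x64->128-bit multiply.

```c
#include <assert.h>
#include <stdint.h>

#define MASK26 0x3ffffffU

/* Hypothetical helper: split 16 little-endian bytes into 26-bit limbs. */
static void limbs26_from_le16(uint32_t h[5], const unsigned char in[16])
{
    uint64_t lo = 0, hi = 0;
    int i;

    for (i = 7; i >= 0; i--) {
        lo = (lo << 8) | in[i];        /* bits  0..63  */
        hi = (hi << 8) | in[i + 8];    /* bits 64..127 */
    }
    h[0] = (uint32_t)(lo & MASK26);                       /* bits   0..25  */
    h[1] = (uint32_t)((lo >> 26) & MASK26);               /* bits  26..51  */
    h[2] = (uint32_t)(((lo >> 52) | (hi << 12)) & MASK26);/* bits  52..77  */
    h[3] = (uint32_t)((hi >> 14) & MASK26);               /* bits  78..103 */
    h[4] = (uint32_t)(hi >> 40);                          /* bits 104..127 */
}

/* Hypothetical helper: the inverse, for fully carried limbs < 2^128. */
static void limbs26_to_le16(unsigned char out[16], const uint32_t h[5])
{
    uint64_t lo, hi;
    int i;

    assert(h[4] < (1U << 24));         /* value must fit in 128 bits */
    lo = (uint64_t)h[0] | ((uint64_t)h[1] << 26) | ((uint64_t)h[2] << 52);
    hi = ((uint64_t)h[2] >> 12) | ((uint64_t)h[3] << 14)
       | ((uint64_t)h[4] << 40);
    for (i = 0; i < 8; i++) {
        out[i]     = (unsigned char)(lo >> (8 * i));
        out[i + 8] = (unsigned char)(hi >> (8 * i));
    }
}
```

A real implementation keeps the accumulator in such limbs across blocks
and only converts back at the very end, which is where the
pre-computation and conversion costs of a given base come from.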

The too-old-to-care-about card is even used to dismiss the 32-bit FP
implementation. It's a bit tricky, because even though the reported SSE2
results are better than FP, that doesn't necessarily hold true for a
single block (because in the SIMD case we effectively calculate more and
discard unused data). "Not necessarily" means you'd have to measure both
ways. But since in the most interesting cases, contemporary Intel
i[357], there is actually no gain, the FP implementation is omitted: too
much additional complexity for too little gain, especially if we take
into account the additional pre-computation and conversion costs.

Another thing that is distinctly different in the suggested code is that
I don't attempt to process sub-block lengths in assembly. Sub-block
lengths are handled in C instead.
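
As a rough sketch of what such a split can look like (this is not the
attached code; mac_ctx, mac_blocks and mac_update are hypothetical
names, and mac_blocks is only a stand-in that counts blocks), the C
layer buffers partial input and hands the block routine nothing but
whole 16-byte blocks:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16

typedef struct {
    unsigned char buf[BLOCK_SIZE]; /* pending partial block */
    size_t num;                    /* bytes currently in buf */
    size_t blocks_seen;            /* stand-in for the real MAC state */
} mac_ctx;

/* Stand-in for the assembly routine: accepts whole blocks only. */
static void mac_blocks(mac_ctx *ctx, const unsigned char *in, size_t len)
{
    assert(len % BLOCK_SIZE == 0);
    (void)in;
    ctx->blocks_seen += len / BLOCK_SIZE;
}

static void mac_init(mac_ctx *ctx)
{
    memset(ctx, 0, sizeof(*ctx));
}

static void mac_update(mac_ctx *ctx, const unsigned char *in, size_t len)
{
    size_t rem, tail, whole;

    if (len == 0)
        return;

    if (ctx->num) {                      /* top up a pending partial block */
        rem = BLOCK_SIZE - ctx->num;
        if (len < rem) {                 /* still not a whole block */
            memcpy(ctx->buf + ctx->num, in, len);
            ctx->num += len;
            return;
        }
        memcpy(ctx->buf + ctx->num, in, rem);
        mac_blocks(ctx, ctx->buf, BLOCK_SIZE);
        in += rem;
        len -= rem;
        ctx->num = 0;
    }

    tail = len % BLOCK_SIZE;             /* sub-block remainder stays in C */
    whole = len - tail;
    if (whole)
        mac_blocks(ctx, in, whole);      /* whole blocks in one call */
    if (tail)
        memcpy(ctx->buf, in + whole, tail);
    ctx->num = tail;
}
```

Finalization is left out of the sketch. In Poly1305 the last partial
block is padded (a 1 byte followed by zeros) before being processed, and
that too can be done on the C side so that the block routine never sees
a short input.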

> (I hope I'm doing this right)
> 
> These are my Chacha implementations for reference, x86, SSE2, SSSE3, XOP,
> AVX, and AVX2, and Poly1305 implementations for reference, SSE2, AVX, and
> AVX2 for both 32 and 64 bits. (djb's floating point poly1305 is used for 32
> bits). I do things a bit differently with the assembler code to make
> supporting multiple versions easier, I don't know if it is too non-standard
> or not. Everything is as fast as possible, so hopefully some or all of it
> can be used to fill things out, or give you ideas on how to surpass it.

Thank you very much! The trouble with this submission is that the lack
of annotations (commentary and symbolic names for register variables)
makes it prohibitively hard to unravel the ideas. Instead I wonder if
you could have a look at the attached code and tell me what you think.

A side note about surpassing. Two things to keep in mind. First, the
objective is always "all-round" performance on multiple platforms, so if
some optimization benefits one processor, it's weighed against other
processors. I mean, if it harms other processors significantly, then the
optimization is omitted (or isolated to a dedicated code path).
Secondly, the question is not how fast it goes in absolute terms, but
how far it is from the theoretical estimate in every specific case
(based on front-end throughput, port availability, latencies, critical
path lengths, etc.) and of course why. In this context it's interesting
to consider the AVX result on current Intel processors. Indeed, note
that Bulldozer managed less than 1 cpb (cycles per byte), while
contemporary Intel AVX-capable processors manage ~1.15. But Intel
processors are architecturally capable of achieving a higher
instructions-per-cycle ratio, so their result should at least not be
worse than Bulldozer's. Is it a subtle port availability limitation?
Some anomaly in scheduling? It remains to be seen...

More coming in. Cheers.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: poly1305.c
Type: text/x-csrc
Size: 23872 bytes
Desc: not available
URL: <http://mta.openssl.org/pipermail/openssl-dev/attachments/20150518/552ab456/attachment-0001.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poly1305-x86.pl
Type: application/x-perl
Size: 49967 bytes
Desc: not available
URL: <http://mta.openssl.org/pipermail/openssl-dev/attachments/20150518/552ab456/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poly1305-x86_64.pl
Type: application/x-perl
Size: 43196 bytes
Desc: not available
URL: <http://mta.openssl.org/pipermail/openssl-dev/attachments/20150518/552ab456/attachment-0003.bin>