[openssl-dev] [openssl.org #4346] poly1305-x86.pl's AVX2 code

Sun Feb 28 22:27:49 UTC 2016

> There seems to be a bug in the AVX2 codepath in poly1305-x86.pl. I have not
> attempted to debug this, but I have attached a test file which produces
> different output in normal and AVX2 codepaths. Our existing poly1305
> implementation agrees with the former.
> 
> $ OPENSSL_ia32cap=0 ./poly1305_test
> PASS
> $ ./poly1305_test
> Poly1305 test failed.t
> got:      2e65f0054e36505687d937ff5e8ed112
> expected: 69d28f73dd09d39a92aa179da354b7ea
> 
> You may wish to generalize that Poly1305_Update pattern into your own
> tests. This is what I did to catch this:
> https://boringssl-review.googlesource.com/#/c/7223/

About choice of test vector lengths. From viewpoint of exercising
different branches lengths 2048 through 2048+15 would exercise same
vector path. What one needs is combinations of 64*n+16*m. But you do
compensate for this by exercising varying what you've called excess
lengths. I suppose what I'm saying is that that +15 thing is kind of
redundant in the context. This is because last block is always processed
separately.

> By the way, this assembly code is quite complicated.

Yeah, but what you gonna do, vectorizing is hard in this case...

> I wasn't able to find
> problems in the others (I tested armv4, armv8, x86, and x86_64), but I'm
> far from confident I've covered all the cases.
> 
> With the caveat that I'm no assembly programmer, much of the complexity
> seems to come the SIMD code needing a multiple of 2 or 4 blocks and the
> implementation converting internal state back and forth from base 2^26 and
> 2^64 and handling excess blocks slightly differently in different cases. (I
> counted nine distinct codepaths to test in the x86_64 AVX codepath alone.)

While it's correct that excess blocks are handled though 2^26<>2^64
conversion in 64-bit modules, in 32-bit modules 2^32->2^26 is one way. I
mean in 32-bit modules even excess blocks are handled by vector code
(because it's faster than scalar code even if you have to discard some
data). Though "excess blocks" can be a little bit misleading term in the
context, because it kind of makes you think that processing is done as
[2|4]*n + remainder, while it's actually processed as remainder +
[2|4]n. [For reference, [2|4] will become [2|4|8] as soon as support for
AVX512 is added]. All this is provided that input length is longer than
certain limit. Vectorized procedure requires expensive setup which is
postponed till we know that input is long enough to justify additional
overhead (see
https://www.openssl.org/blog/blog/2016/02/15/poly1305-revised/ for
background information).

> The C code already buffers up to 16-byte blocks. Did you consider buffering
> up to 32 or 64 bytes in C when the SIMD code called for it?

Yes, but here is the thing. You'd need to be able to handle all possible
remainders in either case. So I figured it would be appropriate to take
"straight-forward" version for a "burn-out" to real world prior next
step ;-)

> I haven't tried this, so perhaps the performance costs are prohibitive, but
> if the costs are modest, the simplifications may be worth it.

No, it's perfectly reasonable. One should recognize though that you are
unlikely to gain like a lot in TLS case, because call pattern is rather
predictable. The noticeable gain would be observed in "wilder" mix of
calls. BTW, at some point I also plan to make e_chacha_poly1305.c
process couple of 4KB chunks at a time in order to improve cache locality...

-- 
Ticket here: http://rt.openssl.org/Ticket/Display.html?id=4346
Please log in as guest with password guest if prompted