[openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

Andy Polyakov appro at openssl.org
Tue Jun 16 12:09:40 UTC 2015


Hi,

> After the changes to DH requiring longer key lengths, I switched to 2048-bit
> keys, but was finding this was now making my test runs on an embedded ARM9
> target annoyingly slow; so thought I'd investigate to see if there was
> anything to improve.
> 
> With some experimentation, it turns out that if I *stop* using the
> crypto/bn/asm/bn/armv4-mont.pl generated asm "optimised" version, the time for
> a simplish test to establish and close a simple SSL connection went from 28
> seconds to 18. (It's quite a slow target at any time).
> 
> In other words, this "optimised" version has slowed things down dramatically.
> Has anyone queried the value of the asm of armv4-mont.pl any time in the last
> few years?

Yes, of course. For reference, here are "openssl speed rsa2048 dsa2048"
results from a Cortex-A8. The sign/s and verify/s columns are operations
per second, so higher is better.

Without armv4-mont.pl:

                  sign    verify    sign/s verify/s
rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
dsa 2048 bits 0.014576s 0.017526s     68.6     57.1

With armv4-mont.pl but without NEON (ARM SIMD extension):

rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
dsa 2048 bits 0.011630s 0.013900s     86.0     71.9

With armv4-mont.pl and NEON on:

rsa 2048 bits 0.021053s 0.000606s     47.5   1650.2
dsa 2048 bits 0.006084s 0.006985s    164.4    143.2

Well, RSA/DSA are not DH, but they are very representative when it comes
to sheer BIGNUM performance. And of course Cortex-A8 is not ARM9, but at
least it shows that the statement about armv4-mont.pl being bad for
performance does not hold universally. Rather the contrary, as a
similar picture can be observed on most ARM processors (well, all I tested).
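
To make the comparison concrete, here is a small sketch that turns the ops/s columns of the three tables above into improvement ratios. The figures are copied from the tables; the names (LABELS, NO_MONT, improvement and so on) are made up for illustration and are not part of OpenSSL.

```python
# Ops/s figures copied from the three Cortex-A8 tables above.
# Column order: rsa sign/s, rsa verify/s, dsa sign/s, dsa verify/s.
LABELS = ("rsa2048 sign", "rsa2048 verify", "dsa2048 sign", "dsa2048 verify")
NO_MONT = (19.0, 703.5, 68.6, 57.1)          # without armv4-mont.pl
MONT_NO_NEON = (25.5, 877.3, 86.0, 71.9)     # with armv4-mont.pl, NEON off
MONT_NEON = (47.5, 1650.2, 164.4, 143.2)     # with armv4-mont.pl, NEON on


def improvement(baseline, candidate):
    """Return the candidate/baseline ratio per operation; > 1.0 means faster."""
    return {label: c / b for label, b, c in zip(LABELS, baseline, candidate)}


if __name__ == "__main__":
    for name, run in (("integer-only", MONT_NO_NEON), ("NEON", MONT_NEON)):
        for label, ratio in improvement(NO_MONT, run).items():
            print(f"{name:12s} {label:15s} {ratio:.2f}x")
```

On these numbers the module is worth roughly 1.25x to 1.34x without NEON and roughly 2.35x to 2.51x with it.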

> Is it just that compilers have become better (I'm only using gcc
> 4.7.3, so not bleeding edge even).

I don't think so. BIGNUM performance can be a delicate balance between
multiple factors, and it's not impossible to end up on the other side of
a breaking point. What breaking point? If you examine the performance
improvement with and without the Montgomery multiplication module, you'll
notice that there are processors on which the improvement coefficient
declines with key length. I mean the longer the key is, the lower the
improvement you'll observe. This indicates that there ought to be a point
past which you can just as well observe worse performance, not better. So
far such points fell outside practical key lengths on tested systems, ARM
or not. Well, except for the s390x-mont module [which by the way even
discusses the reasons why such a breaking point exists, see commentary in
bn/asm/s390x-mont.pl]. In other words I argue that your case is a case of
finding yourself on the other side of said breaking point on a specific
CPU, not a case of armv4-mont.pl being universally inferior. It does come
as a little unexpected, in the sense that I wouldn't expect it to hit the
point at 2048-bit key length on any specific ARM processor, but on the
other hand it's not impossible (all it takes is the multiplication
instruction stalling the pipeline for long enough to tip the balance).
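
One way to look for such a breaking point on a given CPU is to collect ops/s at several key lengths, once without and once with the module, and see where the ratio drops below 1.0. The sketch below assumes you supply those measurements yourself (for example from "openssl speed" runs); the function names are hypothetical and nothing here is OpenSSL code.

```python
def improvement_by_key_length(measurements):
    """measurements: iterable of (key_bits, ops_without_module, ops_with_module).

    Returns a list of (key_bits, ratio) sorted by key length, where ratio is
    ops_with_module / ops_without_module (> 1.0 means the module helps).
    """
    return sorted(
        (bits, with_mod / without) for bits, without, with_mod in measurements
    )


def breaking_point(measurements):
    """Smallest measured key length at which the module is slower, or None."""
    for bits, ratio in improvement_by_key_length(measurements):
        if ratio < 1.0:
            return bits
    return None
```

A None result only says that no measured key length crossed over; it says nothing about lengths you did not measure.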

> Anyway, it's uncertain to me whether armv4-mont.pl should remain.

Assuming that the majority of ARM users are not ARM9 users, most would have
to disagree :-) So where does it leave us? One can argue that OpenSSL
could detect the breaking point at run-time and act accordingly, but
it's tricky and is likely to have too narrow a use. One can argue that
OpenSSL can be further optimized so that the breaking point is moved further
out (if not eliminated), which is more practical, because it should improve
performance on all processors, but this is not something that happens
overnight. Meanwhile, just documenting the case and providing
instructions on how to disengage the module is probably a reasonable
compromise. Would you agree? One can make arrangements so that said
instructions would be super-simple...

> FYI, I couldn't discern any difference whether using armv4-gf2m or not, but
> that doesn't mean it's bad.

armv4-gf2m is involved in Elliptic Curve, and of a specific kind. Your
problem description doesn't sound like it should affect you. But even if
it did, it's unlikely that you'll notice a regression, because there are no
breaking points in that case.


