[openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

Andy Polyakov appro at openssl.org
Tue Jun 16 21:12:27 UTC 2015


>>> With some experimentation, it turns out that if I *stop* using the
>>> crypto/bn/asm/armv4-mont.pl generated asm "optimised" version, the time for
>>> a simplish test to establish and close a simple SSL connection went from 28
>>> seconds to 18. (It's quite a slow target at any time).
>>>
>>> In other words, this "optimised" version has slowed things down dramatically.
>>> Has anyone queried the value of the asm of armv4-mont.pl any time in the last
>>> few years?
>> Yes, of course. For reference, here are speed rsa2048 dsa2048 results
>> from Cortex-A8. Numbers are operations per second, so that higher is better.
>>
>> Without armv4-mont.pl:
>>
>>                   sign    verify    sign/s verify/s
>> rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
>> dsa 2048 bits 0.014576s 0.017526s     68.6     57.1
>>
>> With armv4-mont.pl but without NEON (ARM SIMD extension):
>>
>>                   sign    verify    sign/s verify/s
>> rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
>> dsa 2048 bits 0.011630s 0.013900s     86.0     71.9
> 
> 
> Wow, I get very different results on my ARM9 target. Without armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 2.567500s 0.072826s      0.4     13.7
> dsa 2048 bits 0.722857s 0.865833s      1.4      1.2
> 
> With armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 3.433333s 0.104896s      0.3      9.5
> dsa 2048 bits 1.058000s 1.253750s      0.9      0.8

Can you provide data for 'speed rsa dsa', which tests a variety of key
lengths? As mentioned earlier, we should observe a decreasing improvement
coefficient, be it positive or negative...
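
To make "improvement coefficient" concrete: one way to read it is as the
ratio of the time per operation without the assembly to the time with it,
so that a value above 1 means armv4-mont.pl helps and a value below 1 means
it hurts. Below is a minimal sketch in Python, using only the 2048-bit
figures quoted in this thread. The names TIMINGS and
improvement_coefficient are hypothetical, chosen for illustration, and no
figures for other key lengths are assumed.

```python
# Seconds per operation, copied from the 2048-bit results quoted in this
# thread. Each tuple is (without armv4-mont.pl, with armv4-mont.pl).
# TIMINGS and improvement_coefficient are illustrative names only.
TIMINGS = {
    "Cortex-A8": {
        "rsa sign": (0.052684, 0.039255),
        "rsa verify": (0.001421, 0.001140),
        "dsa sign": (0.014576, 0.011630),
        "dsa verify": (0.017526, 0.013900),
    },
    "ARM9": {
        "rsa sign": (2.567500, 3.433333),
        "rsa verify": (0.072826, 0.104896),
        "dsa sign": (0.722857, 1.058000),
        "dsa verify": (0.865833, 1.253750),
    },
    "Cortex-A9": {
        "rsa sign": (0.127342, 0.172931),
        "rsa verify": (0.003628, 0.005222),
        "dsa sign": (0.035971, 0.052565),
        "dsa verify": (0.042778, 0.061350),
    },
}


def improvement_coefficient(t_without, t_with):
    """Return t_without / t_with; above 1 means the assembly is faster."""
    return t_without / t_with


for cpu, ops in TIMINGS.items():
    for op, (t_without, t_with) in ops.items():
        coeff = improvement_coefficient(t_without, t_with)
        print(f"{cpu:10s} {op:10s} {coeff:.2f}")
```

Feeding the same function the per-key-length rows of a full 'speed rsa dsa'
run would show whether the coefficient moves with key length.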

> What's more, I dug out a Cortex-A9 target (Altera Cyclone V board, operating
> with single core only) and got this without armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
> dsa 2048 bits 0.035971s 0.042778s     27.8     23.4
> 
> and this with armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
> dsa 2048 bits 0.052565s 0.061350s     19.0     16.3
> 
> As you can see, in both cases using armv4-mont.pl makes it 30% slower. So
> whatever is going on, it isn't down to the CPU. I think there must be
> something else going on. I'll get back to you.

This is odd. Two questions. One, as far as I understand Cyclone V is an
FPGA, so what does "Cortex-A9 target" mean in this context? Is it an actual
Cortex-A9 with an FPGA beside it, or is it an ARM processor "loaded" into
the FPGA? I don't think one can give any performance guarantees in the
latter case. Two, can you show /proc/cpuinfo?
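
For reference, the parts of /proc/cpuinfo that matter here can be pulled
out with a few lines of Python. This is only a sketch: it assumes the
usual ARM Linux layout with "CPU implementer", "CPU part" and "Features"
lines, and the helper names parse_cpuinfo and has_neon are hypothetical.
On ARM-designed cores the implementer is normally 0x41, and a CPU part of
0xc09 would indicate Cortex-A9.

```python
def parse_cpuinfo(text):
    """Collect the fields of interest from /proc/cpuinfo-style text.

    Only the first occurrence of each field is kept, which is enough
    for a single-core or homogeneous multi-core system.
    """
    wanted = ("CPU implementer", "CPU part", "Features")
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in info:
            info[key] = value.strip()
    return info


def has_neon(info):
    """Return True if the kernel reports the NEON feature flag."""
    return "neon" in info.get("Features", "").split()


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            info = parse_cpuinfo(f.read())
    except OSError:
        info = {}  # not a Linux system, or /proc is not mounted
    for key, value in info.items():
        print(f"{key}: {value}")
    print("NEON reported:", has_neon(info))
```

The full, unfiltered output of the file is still preferable when
reporting, since other lines may turn out to be relevant.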

On a side note, Cortex-A9 specifically has turned out to be an odd-ball. As
mentioned in the commentary section, for some reason NEON doesn't give any
improvement on A9 at longer key lengths, but the losses are considered
acceptable because it improves performance on other NEON-capable
processors. Well, this doesn't explain the above discrepancies, which is
why it's a side note...


