[openssl-users] BN_MUL_MONT for ARM64 v8

Vijay Chander vijay.chander at gmail.com
Tue Feb 7 16:42:38 UTC 2017


Thanks Andy.

The A72 is running at 1 GHz compared to the x86 at 2.1 GHz, so normalized
for clock speed the gap should hopefully come down to about 1:5 (10 / 2.1 ≈ 4.8).
There is no L3 cache on the A72 eval board and performance counters do show
9x more DRAM accesses for ARM compared to x86.

Will check out Mongoose and Kryo.

Do you know of any good hardware crypto intellectual property (from Synopsys,
for example) that can help with the asymmetric crypto part of the TLS handshake?

Thanks,

Vijay


On Feb 7, 2017 7:06 AM, "Andy Polyakov" <appro at openssl.org> wrote:

>   Is big number Montgomery multiplication as optimized as it can be for
> ARM64 as compared to x86-64 in the latest OpenSSL from GitHub?
>   We are not seeing vmull (or pmull/pmull2) instructions in
> armv8-mont.pl.
>
>    On an ARM Cortex-A72 (1 GHz) and an E5-2620 (2.1 GHz) we are seeing
> roughly a 10x difference in RSA signing performance for 2048-bit keys.

When it comes to performance, the correct question is not what the
result is in absolute terms, but how far it is from the possible maximum
for the specific processor [taking into consideration all the factors, from
ISA capabilities to the specific hardware implementation]. So implying
that a 10x difference between the processors in question is the result of
insufficient optimization for one of them is somewhat unjustified. Well, to
be completely honest, there are some minor tricks one can pull on ARMv8, but
they will only make the gap a *little* bit smaller. Or in other words, suck
it up, that's the way Cortex [currently?] is. If it's so critical *and*
you're in a position to choose the processor, then the Samsung Mongoose core
would be a much better choice (but I don't know anything about Qualcomm Kryo).
Yet, even though it would be a better choice, it still wouldn't actually
close the gap, so don't get your hopes too high :-)

As for not seeing vector instructions: pmull[2] is about something
completely different, namely polynomial (carry-less) multiplication. As for
vmull, you have to recognize that it's limited to 32-bit inputs and that
there is no carry handling in vector instructions. This means that it would
take more instructions to do the same job, even though you perform a pair of
multiplications in one vector instruction. Well, it's more complicated than
just the number of instructions, but nevertheless, scalar 64x64
multiplication with the carry processing offered by the ARMv8 ISA does
deliver a better result than 128-bit vector instructions would.
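
To make the limb-width argument concrete, here is a minimal sketch in C of
one row of schoolbook multiplication (r += a * w) with 64-bit limbs and with
32-bit limbs. It is illustrative only and is not the code in armv8-mont.pl:
the function names are hypothetical, and `unsigned __int128` is assumed to be
available (a GCC/Clang extension). The last helper only counts limb products,
it does not model instruction counts or timing.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only, not OpenSSL's code. One row of schoolbook
 * multiplication with 64-bit limbs: r[i] += a[i] * w for all i,
 * returning the final carry limb.
 * Assumption: unsigned __int128 is supported by the compiler.
 */
static uint64_t mul_add_limbs64(uint64_t *r, const uint64_t *a,
                                size_t n, uint64_t w)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 64x64 product plus two 64-bit addends always fits in 128 bits. */
        unsigned __int128 t = (unsigned __int128)a[i] * w + r[i] + carry;
        r[i] = (uint64_t)t;           /* low half stays in this limb */
        carry = (uint64_t)(t >> 64);  /* high half propagates to the next limb */
    }
    return carry;
}

/* The same row with 32-bit limbs, i.e. the input width vmull is limited to. */
static uint32_t mul_add_limbs32(uint32_t *r, const uint32_t *a,
                                size_t n, uint32_t w)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 32x32 product plus two 32-bit addends always fits in 64 bits. */
        uint64_t t = (uint64_t)a[i] * w + r[i] + carry;
        r[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* Number of limb products in a full schoolbook multiply of two bits-bit numbers. */
static size_t schoolbook_products(size_t bits, size_t limb_bits)
{
    size_t n = bits / limb_bits;
    return n * n;
}
```

For a 2048-bit operand, schoolbook_products(2048, 64) gives 1024 limb
products, while schoolbook_products(2048, 32) gives 4096. Halving the limb
width quadruples the number of products, and in the scalar loops above the
carry travels through every step, which is the part vector instructions
give no help with.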
