<div dir="auto"><div>Thanks Andy. </div><div dir="auto"><br></div><div dir="auto">The A72 is running at 1 GHz compared to the x86 at 2.1 GHz, so normalizing for clock speed should hopefully bring the gap down to about 1:5.</div><div dir="auto">There is no L3 cache on the A72 eval board, and performance counters show 9x more DRAM accesses for ARM than for x86. </div><div dir="auto"><br></div><div dir="auto">Will check out Mongoose and Kryo.<div dir="auto"><br></div><div dir="auto">Do you know of any good hardware crypto intellectual property (from Synopsys, for example) that can help with the asymmetric crypto part of the TLS handshake?</div><div dir="auto"><br></div><div dir="auto">Thanks, <br><div dir="auto"><br></div><div dir="auto">Vijay </div></div><br><div class="gmail_extra" dir="auto"><br><div class="gmail_quote">On Feb 7, 2017 7:06 AM, "Andy Polyakov" <<a href="mailto:appro@openssl.org">appro@openssl.org</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="quoted-text">>   Is big number Montgomery multiplication as optimized as it can be for<br>

> ARM64 as compared to x86-64 in the latest OpenSSL on GitHub?<br>

>   We are not seeing vmull (or pmull/pmull2) instructions in<br>

</div>> <a href="http://armv8-mont.pl" rel="noreferrer" target="_blank">armv8-mont.pl</a> <<a href="http://armv8-mont.pl" rel="noreferrer" target="_blank">http://armv8-mont.pl</a>>.<br>

<div class="quoted-text">><br>

>    On an ARM Cortex-A72 (1 GHz) and an E5-2620 (2.1 GHz) we are seeing<br>

> roughly a 10x difference in RSA signing performance for 2048-bit keys.<br>

<br>

</div>When it comes to performance, the correct question is actually not what<br>
the result is in absolute terms, but how far it is from the possible maximum<br>
for the specific processor [taking into consideration all the factors, from<br>
ISA capabilities to the specific hardware implementation]. So implying<br>
that a 10x difference between the processors in question is the result of<br>
insufficient optimization for one of them is somewhat unjustified. Well, to be<br>
completely honest, there are some minor tricks one can pull on ARMv8, but<br>
they will only make the gap a *little* bit smaller. Or in other words, suck<br>
it up, that's the way Cortex [currently?] is. If it's so critical *and*<br>
you're in a position to choose the processor, then the Samsung Mongoose core<br>
would be a much better choice (but I don't know anything about Qualcomm Kryo).<br>
Yet, even though it would be a better choice, it still wouldn't actually<br>
close the gap, so don't get your hopes too high :-)<br>

<br>

As for not seeing vector instructions: pmull[2] is about something<br>
completely different (polynomial, i.e. carry-less, multiplication). As for<br>
vmull, you have to recognize that it's limited to 32-bit inputs and that<br>
there is no carry handling in vector instructions. This means that it would<br>
take more instructions to do the same job, even though you perform a pair of<br>
multiplications in one vector instruction. Well, it's more complicated than<br>
just the number of instructions, but nevertheless, scalar 64x64<br>
multiplication with the carry processing offered by the ARMv8 ISA does<br>
deliver a better result than 128-bit vector instructions would.<br>
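<br>
To put rough numbers on the limb-width argument, here is a minimal Python sketch, not<br>
taken from armv8-mont.pl. It assumes plain schoolbook multiplication of two 2048-bit<br>
operands and counts limb products only, not instructions. The names to_limbs and<br>
schoolbook_mul are hypothetical helpers introduced for illustration.<br>

```python
def to_limbs(n, bits, count):
    """Split n into `count` little-endian limbs of `bits` bits each."""
    base = 2 ** bits
    limbs = []
    for _ in range(count):
        n, limb = divmod(n, base)
        limbs.append(limb)
    return limbs


def schoolbook_mul(a, b, bits, size=2048):
    """Multiply a*b limb by limb; return (product, number of limb products)."""
    base = 2 ** bits
    count = size // bits
    x = to_limbs(a, bits, count)
    y = to_limbs(b, bits, count)
    out = [0] * (2 * count)
    muls = 0
    for i in range(count):
        carry = 0
        for j in range(count):
            # Multiply-accumulate with carry; t always fits in 2*bits bits.
            t = out[i + j] + x[i] * y[j] + carry
            carry, out[i + j] = divmod(t, base)
            muls += 1
        # The final carry of row i lands in a limb no earlier row has written.
        out[i + count] = carry
    product = sum(limb * base ** k for k, limb in enumerate(out))
    return product, muls


a = 2 ** 2047 + 12345
b = 2 ** 2048 - 1
p64, m64 = schoolbook_mul(a, b, 64)  # 32 limbs per operand
p32, m32 = schoolbook_mul(a, b, 32)  # 64 limbs per operand
print(m64, m32, m32 // 2)
```

Under these assumptions, halving the limb width quadruples the number of limb products<br>
(1024 versus 4096), so pairing two 32-bit products per vector instruction still leaves<br>
2048 vector multiplies, and the carries then have to be propagated separately because<br>
vector instructions have no carry flag. Real Montgomery multiplication adds the<br>
reduction step on top, so treat the figures as an illustration of scaling only.<br>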

</blockquote></div><br></div></div></div>