[openssl-dev] [openssl.org #3615] [PATCH] ChaCha20 with Poly1305 TLS Cipher Suites via the EVP interface

Sun May 24 13:36:12 UTC 2015

> More coming in.

Here are preliminary results for 32- and 64-bit ARM. "Preliminary" means
that they are incomplete and subject to change. But in a sense they
underpin some of the points in the previous post, both in the message
itself and in the source code commentary.

Consider the 32-bit results. The first column is the assembly result for
base 2^32 integer-only code in comparison to compiler-generated code.
The second column is my result for NEON, and the last column is the
result for Andrew Moon's NEON implementation; both are base 2^26.

#                       IALU/gcc-4.4    NEON    poly1305-opt
#
# Cortex-A5             6.30/+130%      2.96    4.90
# Cortex-A8             6.25/+115%      2.40    2.36
# Cortex-A9             5.10/+95%       2.56    2.25
# Cortex-A15            3.79/+85%       1.30    1.53
# Snapdragon S4         5.70/+100%      1.48    7.58(?)

As mentioned earlier, the goal is "all-round" performance, i.e.
near-optimal performance across a *range* of platforms. Judging from the
Cortex-A9 result I have some room for improvement; hopefully it will
benefit all processors. As for the (?): it's not clear why poly1305-opt
performed so poorly on Snapdragon S4. It might be that it failed to opt
for NEON for some reason. I have no way to verify, because it's somebody
else's mobile phone.
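
For readers unfamiliar with the "base" terminology in the table above,
here is a minimal sketch of what base 2^26 means for Poly1305. It is
written in Python purely for illustration and is not taken from the patch
or from poly1305-opt; the helper names (to_limbs, from_limbs, mul_limbs)
are hypothetical. The point it tries to show: with 26-bit limbs, every
column sum of the product stays well below 2^64, so the carries can be
deferred, which suits SIMD units that have no add-with-carry. With base
2^32 a single 32x32 product already fills a 64-bit register, so carries
have to be propagated with integer add-with-carry instructions.

```python
# Sketch: Poly1305-style multiplication mod 2^130 - 5 in base 2^26.
# Illustrative only; helper names are hypothetical, not from the patch.
P = (1 << 130) - 5
MASK26 = (1 << 26) - 1


def to_limbs(x):
    """Split a value below 2^130 into five 26-bit limbs, least significant first."""
    return [(x >> (26 * i)) & MASK26 for i in range(5)]


def from_limbs(limbs):
    """Recombine limbs (which may exceed 26 bits) into one integer."""
    return sum(limb << (26 * i) for i, limb in enumerate(limbs))


def mul_limbs(h, r):
    """Schoolbook product of two 5-limb values, folded using 2^130 = 5 (mod P).

    Returns five column sums without carry propagation.
    """
    d = [0] * 5
    for i in range(5):
        for j in range(5):
            k = i + j
            if k < 5:
                d[k] += h[i] * r[j]
            else:
                # Limb position k wraps: 2^(26*k) = 5 * 2^(26*(k-5)) mod P.
                d[k - 5] += 5 * h[i] * r[j]
    return d


# Largest accumulator value below 2^130 that is still reduced mod P.
h = (1 << 130) - 6
# Largest key value that survives the standard Poly1305 clamping mask.
r = 0x0ffffffc0ffffffc0ffffffc0fffffff

d = mul_limbs(to_limbs(h), to_limbs(r))

# The limb product agrees with plain big-integer arithmetic ...
assert from_limbs(d) % P == (h * r) % P
# ... and every column sum fits a 64-bit accumulator without carries.
assert max(d) < (1 << 64)
```

A real implementation would follow this with a carry pass that brings the
limbs back to roughly 26 bits; that step is omitted here.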

Here are some results for the base 2^64 integer-only implementation on
64-bit ARM, together with base 2^26 32-bit NEON results. The latter
means that I haven't ventured into NEON on 64-bit ARM yet, but as
performance would be virtually the same (because NEON instruction set
capabilities are essentially the same and it would be the same base), we
can use it to compare and assess options.

#               IALU    gcc-4.9 gcc-4.7 NEON    poly1305-opt
#
# Cortex-A53    2.72    4.16    9.09    1.57    2.52
# Cortex-A57    2.70    2.89    6.46    1.30    1.46
# Denver        1.45    2.09    5.63    1.50    1.34

IALU vs. compiler-generated code basically tells the reason why we
program in assembly, doesn't it? I mean if you compare assembly and
gcc-4.9 on Cortex-A57, you'd probably say that assembly doesn't make
sense. But if you look at the remaining results, you'll see that you are
kind of left at the compiler's mercy, and it's not that "mighty" in
every situation. These results also confirm the concern in the
commentary section in poly1305.c about base 2^64 not being optimal for
every 64-bit case. Indeed, gcc-4.7 base 2^64 results are actually worse
than base 2^32. Well, to be honest I was actually referring more to
instruction set capabilities, but it can be extended even to the
compiler.
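
To put numbers on the assembly-vs-compiler comparison, the following
sketch computes how many times slower each compiler build is than the
IALU assembly, using only the figures from the 64-bit table above. It
assumes the figures are a cost measure where lower is better (most likely
cycles per processed byte, though the post does not state the unit); the
ratio helper is hypothetical.

```python
# Figures copied from the 64-bit ARM table above (lower is better).
results = {
    "Cortex-A53": {"IALU": 2.72, "gcc-4.9": 4.16, "gcc-4.7": 9.09},
    "Cortex-A57": {"IALU": 2.70, "gcc-4.9": 2.89, "gcc-4.7": 6.46},
    "Denver":     {"IALU": 1.45, "gcc-4.9": 2.09, "gcc-4.7": 5.63},
}


def ratio(cpu, build, reference="IALU"):
    """Return how many times slower `build` is than `reference` on `cpu`."""
    return results[cpu][build] / results[cpu][reference]


for cpu in results:
    print(f"{cpu:12s} gcc-4.9/IALU = {ratio(cpu, 'gcc-4.9'):.2f}   "
          f"gcc-4.7/IALU = {ratio(cpu, 'gcc-4.7'):.2f}")
```

On Cortex-A57 the gcc-4.9 ratio is close to 1, which is the case where
assembly "doesn't make sense"; the gcc-4.7 column and the other two
processors give noticeably larger ratios.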