[openssl-dev] AES-GCM for ARM: what is the status of the new work published by

Mon Jul 13 15:16:10 UTC 2015

Hi,

> What is the status of the improvements on security and performance for
> AES-GCM on ARM published recently by Conrado P. L. Gouvêa, Julio López ?
> 
> Implementing GCM on ARMv8. Conrado P. L. Gouvêa, Julio López. 2015 [1]
> Which details also the ARMv7 case, and was presented at the RSA
> Conference 2015 in the US, 2 months ago.
> The paper is here [2].
> The code is available here [3]
> 
> My question goes primarily to Andy Polyakov.
> 
> Is there any plan for integrating the code into openssl ?

I don't quite understand... On one hand you effectively imply that
OpenSSL doesn't have support for ARMv8 crypto extensions, which would
mean that you didn't do your homework. On the other hand you explicitly
call on me by name, which would mean that you did some homework...

In either case, OpenSSL does have support for ARMv8 crypto extensions,
and there are no "holes" in it, in the sense that it utilizes all
available extensions. Both 64- and 32-bit modes of operation are
supported. To be more specific in the context of the question, ARMv8 AES
instructions are used to implement AES-CTR and PMULL ones to implement
GHASH, the two GCM components. Unlike the referred code, the OpenSSL code
is endian-neutral (in the sense that it can be compiled for either
endianness) and supports all AES key lengths and more encryption modes,
including decrypt. Here are the performance metrics, in cycles per
processed byte, for the referred code vs. OpenSSL:

              AES-128-CTR  GHASH
Cortex A53    1.88/1.46    1.21/1.01
Cortex A57    1.84/0.93    0.95/1.17
Apple A7      1.21/1.20    0.51/0.92

The OpenSSL code is organized so that AES-CTR and GHASH costs are
basically additive, so you have to add the corresponding numbers to
obtain the GCM result.
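
To make that arithmetic concrete, here is a minimal Python sketch that adds
the two columns of the table above. The additivity is stated only for the
OpenSSL code; applying the same sum to the referred code is an assumption
made purely for comparison, and the names CPB and gcm_estimate are
illustrative.

```python
# Cycles per byte from the table above, as (referred code, OpenSSL).
CPB = {
    "Cortex A53": {"AES-128-CTR": (1.88, 1.46), "GHASH": (1.21, 1.01)},
    "Cortex A57": {"AES-128-CTR": (1.84, 0.93), "GHASH": (0.95, 1.17)},
    "Apple A7":   {"AES-128-CTR": (1.21, 1.20), "GHASH": (0.51, 0.92)},
}


def gcm_estimate(cpu):
    """Return (referred, OpenSSL) GCM estimates computed as CTR + GHASH.

    Assumption: the sum is also a fair estimate for the referred code.
    """
    ctr = CPB[cpu]["AES-128-CTR"]
    ghash = CPB[cpu]["GHASH"]
    return tuple(round(c + g, 2) for c, g in zip(ctr, ghash))


for cpu in CPB:
    referred, openssl = gcm_estimate(cpu)
    print(f"{cpu:<12} {referred:.2f}/{openssl:.2f}")
```

Under that assumption the sums come out as 3.09/2.47 on Cortex A53,
2.79/2.10 on Cortex A57 and 1.72/2.12 on Apple A7.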

As can be seen, OpenSSL GHASH is slower on Cortex A57 (though the sum of
CTR and GHASH is not) and on Apple A7. There is an explanation for that.
One of the parameters of a GHASH implementation is the "aggregate
factor", which denotes the number of multiplications performed prior to
reduction. OpenSSL uses a factor of 4, while the referred code uses 8. A
higher aggregate factor is on the to-do list, and there is no reason to
believe that performance would be worse than reported in the referred
paper.
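
For readers unfamiliar with the aggregate factor, the following Python
sketch illustrates the idea: n blocks are multiplied by precomputed powers
of H, the unreduced products are XORed together, and a single reduction is
done per group. It is not OpenSSL's code or the referred code. It assumes
field elements are plain integers in natural bit order rather than GCM's
bit-reflected representation (the algebra is the same), and all function
names are hypothetical.

```python
# Illustration of deferred ("aggregated") reduction in a GHASH-style hash.
# Assumption: natural bit order, not GCM's bit-reflected convention.

POLY = (1 << 128) | 0x87  # x^128 + x^7 + x^2 + x + 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce(p):
    """Reduce a polynomial of any degree modulo POLY."""
    while p.bit_length() > 128:
        p ^= POLY << (p.bit_length() - 129)
    return p


def gfmul(a, b):
    """Multiply in GF(2^128): one multiplication, one reduction."""
    return reduce(clmul(a, b))


def ghash_serial(h, blocks):
    """Horner form: reduce after every block."""
    x = 0
    for c in blocks:
        x = gfmul(x ^ c, h)
    return x


def ghash_aggregated(h, blocks, factor=4):
    """Same result, but one reduction per group of `factor` blocks.

    Returns (hash value, number of reductions in the main loop).
    """
    powers = [h]  # powers[i] == h^(i+1)
    for _ in range(factor - 1):
        powers.append(gfmul(powers[-1], h))
    x = 0
    reductions = 0
    for i in range(0, len(blocks), factor):
        group = blocks[i:i + factor]
        n = len(group)
        acc = 0
        for j, c in enumerate(group):
            operand = c ^ x if j == 0 else c  # fold running value into first block
            acc ^= clmul(operand, powers[n - 1 - j])  # unreduced product
        x = reduce(acc)  # single reduction for the whole group
        reductions += 1
    return x, reductions
```

With 8 blocks, ghash_aggregated(h, blocks, 4) performs 2 reductions and
ghash_aggregated(h, blocks, 8) performs 1, versus 8 in ghash_serial, while
all three return the same value. The price is the table of precomputed
powers of H.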

The paper also discusses non-crypto-extension code. It is a valid
question, because not all ARMv8 processors implement crypto extensions.
For example, APM X-Gene doesn't, nor does Qualcomm Snapdragon 810. What's
going to happen there? As for AES, there was an open question about which
NEON implementation would provide the best all-round performance, i.e.
across a range of processors. There were three contenders: a) the
straightforward vtbl-based implementation that can be found in the Linux
kernel source tree; b) vector-permutation AES; c) bit-sliced AES. As it
turned out, it's a combination of b) and c) that provides the best
performance. Vector-permutation code is already committed to the source
tree, and the existing bit-sliced ARMv7 code will be adapted for 64-bit
mode. On a side note, unlike the original, the OpenSSL bit-sliced AES
module supports all key lengths and more encryption modes, including
decrypt. Performance metrics for the vector-permutation and bit-sliced
code are collected in the vpaes-armv8 module (comparison to the referred
paper is left as an exercise for the reader). As for non-crypto-extension
GHASH, options are to be carefully examined and the course of action is
to be determined. [On a side note, one also has to keep in mind that even
NEON support is specified to be optional.]
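
As a rough illustration of the fallback question, here is a Python sketch
that reads the "Features" line Linux exposes in /proc/cpuinfo and picks a
code path per component. This is not how OpenSSL performs its capability
detection; the helper names and the selection policy are assumptions that
merely mirror the options listed above.

```python
# Hypothetical helpers; only the /proc/cpuinfo "Features" line is real.

def parse_features(cpuinfo_text):
    """Collect the flags from the first 'Features' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def choose_paths(features):
    """Map feature flags to an AES and a GHASH code path (assumed policy)."""
    # 64-bit kernels report "asimd", 32-bit kernels report "neon".
    simd = "asimd" in features or "neon" in features
    if "aes" in features:
        aes = "ARMv8 AES instructions"
    elif simd:
        aes = "NEON (vector-permutation / bit-sliced)"
    else:
        aes = "generic code"
    if "pmull" in features:
        ghash = "PMULL"
    else:
        ghash = "non-crypto-extension GHASH (to be determined)"
    return {"AES": aes, "GHASH": ghash}


# Sample text standing in for a processor without crypto extensions.
SAMPLE = "processor\t: 0\nFeatures\t: fp asimd evtstrm crc32\n"
print(choose_paths(parse_features(SAMPLE)))
```

On a Linux system the sample text could be replaced with the contents of
/proc/cpuinfo, e.g. open("/proc/cpuinfo").read().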

> [1]
> https://www.rsaconference.com/writable/presentations/file_upload/cryp-w01-secure-and-efficient-implementation-of-aes-based-cryptosystems.pdf
> 
> [2] http://conradoplg.cryptoland.net/files/2010/12/gcm14.pdf
> 
> [3] https://github.com/conradoplg/authenc