[openssl-dev] Usage of assembler code on ARM architectures

Thu Mar 12 19:37:45 UTC 2015

> I can't speak directly to your question on the iphone-cross target, but
> can warn you that your mileage will vary when using the ARM assembly
> modules.  We observed that some algorithms actually run slower when
> using the ARM assembly modules.  It's been a couple of years and I don't
> recall the details, but want to say some of the hash algorithms were
> actually faster when using no-asm.

Well, I can imagine a compiler generating better code than
sha1-armv4-large, but I can't imagine a compiler beating sha256 or sha512.
Was it really some of the algorithm*s* or just one? Anyway, why
sha1-armv4-large? Two reasons: a) inner loops are not unrolled; b)
over-reliance on merged rotate-n-arithmetic. "Over-reliance" means that
it uses more such instructions than actually necessary, which can
negatively affect performance. I realized this after having a hard time
getting sha256/512 to work well on Cortex-A53 (see sha512-armv8.pl; it's
a 64-bit module, but the principle of merged rotate-n-arithmetic is the
same). It should also be noted that there are now additional code paths
in sha1-armv4-large, namely NEON and ARMv8.
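
To illustrate what "merged rotate-n-arithmetic" refers to: on 32-bit ARM
the second operand of a data-processing instruction can pass through the
barrel shifter, so a rotate can be folded into e.g. an eor or add instead
of being a separate instruction. Below is a minimal C sketch using the
SHA-256 Sigma0 function as the example. The function names are made up for
illustration and this is not code taken from the assembly modules; whether
a given compiler emits the merged form, and whether it is faster on a given
core, is not something the sketch shows.

```c
#include <stdint.h>

/* Rotate right; n must be in 1..31 so neither shift is out of range. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* SHA-256 Sigma0 written directly: three rotates, two xors. */
uint32_t sigma0_direct(uint32_t x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

/*
 * Same value with the common rotate by 2 factored out. Each xor takes a
 * rotated operand, and the final rotate could in turn be merged into the
 * addition that consumes the result.
 */
uint32_t sigma0_factored(uint32_t x)
{
    uint32_t t = x ^ rotr32(x, 11) ^ rotr32(x, 20);
    return rotr32(t, 2);
}
```

Both functions return the same value for every input, since rotation
distributes over xor. The point is only that each rotate above is a
candidate for merging, and the more of them a module merges, the more it
depends on how well a particular core executes such instructions.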

> The results are likely to vary
> depending on the actual chipset used.

Right, the ARM universe is very diverse. Assembly modules, i.e. all of
them, not only ARM, are maintained to provide near-optimal performance
across a range of platforms, but sometimes optimizations conflict. In
either case the prerequisite is access to a wide range of platforms and
feedback. In other words, bring it up.

> You'll probably want to test the
> performance on the target hardware using the "openssl speed" command. 
> You can do this on a jailbroken iOS device via SSH.

For the record, I do development on a non-jailbroken unit, so
jailbreaking is not a hard requirement.