[openssl-dev] Making assembly language optimizations working on Cortex-M3

Fri Aug 7 16:27:32 UTC 2015

Hi,

> In ./Configure, there is this comment:
> 
>         # big-endian platform. This is because ARMv7 processor always
>         # picks instructions in little-endian order. Another similar
>         # limitation is that -mthumb can't "cross" -march=armv6t2
>         # boundary, because that's where it became Thumb-2. Well, this
>         # limitation is a bit artificial, because it's not really
>         # impossible, but it's deemed too tricky to support. 
> 
> Cortex-M3 and Cortex-M4 processors are -mthumb, -march=armv7-m, which is
> exactly the problematic configuration, if I understand that comment
> correctly.

The comment in question applies *exclusively* to cases where you attempt
to build a "universal" binary, one that works on multiple platforms,
e.g. on both ARMv6 and ARMv7.

> I am interested in getting libcrypto working for these
> processors with the assembly language optimizations enabled.

This is somewhat problematic, but for a reason different from what [I
think] you imply. As you point out, Cortex-Mx supports *only* the
Thumb[-2] instruction set (that was news to me). The trouble is that
not all OpenSSL assembly modules can be compiled for Thumb-2. So far we
have relied on the fact that target ARM processors can switch between
the real ARM and Thumb instruction sets (for those who wonder, yes,
within the same application, so that it's possible to mix them freely:
compiler-generated code can be Thumb[-2] while assembly modules remain
ARM). Originally this was a perfectly natural assumption to make,
because Thumb (not Thumb-2!) really required a separate development
effort (fewer registers, poorer instruction set). With the introduction
of Thumb-2 and the Unified Assembler Language syntax it became possible
to re-use ARM code, though small adjustments are normally required. Or,
to paraphrase the beginning of this paragraph, not all OpenSSL assembly
modules are Thumb-2 savvy.
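
To make the ARM / Thumb / Thumb-2 distinction concrete, here is a
minimal sketch of how it shows up at compile time. It relies only on
the GCC pre-defines __arm__, __thumb__ and __thumb2__; the function
name example_target_isa() is hypothetical and used purely for
illustration, it is not part of OpenSSL.

```c
/*
 * Illustration only: example_target_isa() is a hypothetical helper.
 * It reports which instruction set the compiler is generating for
 * this translation unit, based on pre-defines set by the compiler
 * driver.
 */
const char *example_target_isa(void)
{
#if defined(__thumb2__)
    /* Thumb-2, e.g. -mthumb with -march=armv7-m or armv6t2 and up */
    return "Thumb-2";
#elif defined(__thumb__)
    /* original 16-bit Thumb */
    return "Thumb";
#elif defined(__arm__)
    /* classic 32-bit ARM encoding (-marm) */
    return "ARM";
#else
    /* not an ARM target at all, e.g. when compiled on the build host */
    return "non-ARM";
#endif
}
```

With -mcpu=cortex-m3 -mthumb one would expect the first branch to be
taken, since that target has no ARM mode to fall back to.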

> Specifically, the configuration I am interested in is:
> 
> CC=arm-none-eabi-gcc -mcpu=cortex-m3 -march=armv7-m
> -mthumb  -D__ARM_MAX_ARCH__=7.

As a side note, for reference, -D__ARM_MAX_ARCH__ is redundant if it
matches -march. Secondly, the recommended way to engage a cross-compiler
is to pass --cross-compile-prefix to Configure, e.g.
--cross-compile-prefix=arm-none-eabi-.

> Currently, the assembly language source files don't compile because they
> expect to be able to use ARM (-marm) instructions when __ARM_MAX_ARCH__
> >= 7. Further, they try to do feature detection for NEON in that case,
> but I'd prefer to not have the feature detection compiled in, since I
> know NEON is never available.

It's a little more nuanced than that, and can even be split into two
sub-problems, namely a) making the code Thumb-2 savvy and b) making
NEON support conditionally compiled.

> Has anybody started working on this?
> 
> If not, my thinking was to refactor the assembly language code so that
> all the ARM-only (-marm) code is cleanly separated from the unified
> (-mthumb and -marm) code,

As implied, there are a few assembly modules that *can* be compiled for
Thumb-2 today, namely aes-armv4, bsaes-armv7, sha256-armv4 and
sha512-armv4. Is there evidence that we can't adhere to this strategy of
adjusting modules for ARM/Thumb-2 "duality"? (I think I have ghash...)
(BTW, can you confirm that you can get the mentioned modules to work?)

> move the detection of NEON from the assembly
> language code to the C wrappers,

... I'd vote against...

> and recognize two new settings,
> OPENSSL_NO_ARM_NEON and OPENSSL_ARM_THUMB_ONLY, to accommodate this.

While NO_NEON might make sense, I really see no reason to introduce
THUMB_ONLY, because the pre-defines set by the compiler driver are
sufficient. Actually, one can argue that even the decision to omit NEON
code can be made based on pre-defines, e.g. __ARM_ARCH_7M__. Well, this
doesn't exclude the possibility of defining NO_NEON based on a
pre-define and using NO_NEON in the code. Note that omitting NEON code
implies omitting NEON detection as well. This is basically why I object
to moving detection to C. Keeping it in the same place makes it more
maintainable.
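
Deriving NO_NEON from a pre-define could look roughly like the sketch
below. EXAMPLE_NO_NEON and example_neon_compiled_in() are hypothetical
names chosen for illustration, not existing OpenSSL symbols; the only
thing taken as given is that the compiler pre-defines __ARM_ARCH_7M__
for -march=armv7-m.

```c
/*
 * Illustration only: EXAMPLE_NO_NEON is a hypothetical macro.
 * If the compiler says we target ARMv7-M, NEON can never be present,
 * so switch the NEON code off unless the builder already did.
 */
#if defined(__ARM_ARCH_7M__) && !defined(EXAMPLE_NO_NEON)
# define EXAMPLE_NO_NEON
#endif

/*
 * Returns 1 if NEON code paths (and with them the run-time NEON
 * detection) would be compiled in, 0 if both are omitted.
 */
int example_neon_compiled_in(void)
{
#if defined(EXAMPLE_NO_NEON)
    return 0;
#else
    return 1;
#endif
}
```

The macro can still be set by hand with -DEXAMPLE_NO_NEON, so the
pre-define only supplies the default and a single guard covers both the
NEON code and its detection.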

A word of warning. When looking at the ARM assembly code, you might find
yourself asking "why isn't this done this way?" It's likely that the
answer to that question is "because an old assembler fails to compile
it." I mean there is a certain level of legacy support there, and it's
not a coincidence.