[openssl/openssl] f5b5a3: Ensure proper memory barriers around ossl_rcu_dere...

Neil Horman noreply at github.com
Wed Apr 10 07:20:57 UTC 2024


  Branch: refs/heads/master
  Home:   https://github.com/openssl/openssl
  Commit: f5b5a35c84626823364b0c8535b968c106690a56
      https://github.com/openssl/openssl/commit/f5b5a35c84626823364b0c8535b968c106690a56
  Author: Neil Horman <nhorman at openssl.org>
  Date:   2024-04-10 (Wed, 10 Apr 2024)

  Changed paths:
    M crypto/threads_pthread.c

  Log Message:
  -----------
  Ensure proper memory barriers around ossl_rcu_deref/ossl_rcu_assign_ptr

Since the addition of macos14 M1 runners in our CI jobs we've been
seeing periodic random failures in the test_threads CI job.
Specifically we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t went
backwards.

>From taking a look at the disassembled code in the failing case, we see
that __atomic_load_n when emitted in clang 15 looks like this
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000     ldapr   x0, [x0]
10012048c: d65f03c0     ret

Notably, when compiling with gcc on the same system we get this output
instead:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000     ldar   x0, [x0]
10012048c: d65f03c0     ret

Checking the arm docs for the difference between ldar and ldapr:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDAPR--Load-Acquire-RCpc-Register-
https://developer.arm.com/documentation/dui0802/b/A64-Data-Transfer-Instructions/LDAR

It seems that the ldar instruction provides a global cpu fence, not
completing until all writes in a given cpus writeback queue have
completed

Conversely, the ldapr instruction attmpts to achieve performance
improvements by honoring the Local Ordering register available in the
system coprocessor, only flushing writes in the same address region as
other cpus on the system.

I believe that on M1 virtualized cpus the ldapr is not properly ordering
writes, leading to an out of order read, despite the needed fencing.
I've opened an issue with apple on this here:
https://developer.apple.com/forums/thread/749530

I believe that it is not safe to issue an ldapr instruction unless the
programmer knows that the Local order registers are properly configured
for use on the system.

So to fix it I'm proposing with this patch that we, in the event that:
1) __APPLE__ is defined
AND
2) __clang__ is defined
AND
3) __aarch64__ is defined

during the build, that we override the ATOMIC_LOAD_N macro in the rcu
code such that it uses a custom function with inline assembly to emit
the ldar instruction rather than the ldapr instruction.  The above
conditions should get us to where this is only used on more recent MAC
cpus, and only in the case where the affected clang compiler emits the
offending instruction.

I've run this patch 10 times in our CI and failed to reproduce the
issue, whereas previously I could trigger it within 5 runs routinely.

Reviewed-by: Tom Cosgrove <tom.cosgrove at arm.com>
Reviewed-by: Paul Dale <ppzgs1 at gmail.com>
Reviewed-by: Tomas Mraz <tomas at openssl.org>
(Merged from https://github.com/openssl/openssl/pull/23974)



To unsubscribe from these emails, change your notification settings at https://github.com/openssl/openssl/settings/notifications


More information about the openssl-commits mailing list