[OTC VOTE PROPOSAL] Don't merge PR#14759 (blinding=yes and similar properties)

Nicola Tuveri nic.tuv at gmail.com
Thu Apr 8 17:02:04 UTC 2021


Background
==========

[PR#14759](https://github.com/openssl/openssl/pull/14759) (Set
blinding=yes property on some algorithm implementations) is a fix for
[Issue#14654](https://github.com/openssl/openssl/issues/14654) which
itself is a spin-off of
[Issue#14616](https://github.com/openssl/openssl/issues/14616).

The original issue is about _Deprecated low level key API's that have no
replacements_, among which the following received special attention and
were given a dedicated issue during an OTC meeting:

~~~c
// RSA functions on RSA_FLAG_*
void RSA_clear_flags(RSA *r, int flags);
int RSA_test_flags(const RSA *r, int flags);
void RSA_set_flags(RSA *r, int flags);

// RSA_FLAG_* about blinding
#define RSA_FLAG_BLINDING
#define RSA_FLAG_NO_BLINDING

// RSA functions directly on blinding
int RSA_blinding_on(RSA *rsa, BN_CTX *ctx);
void RSA_blinding_off(RSA *rsa);
BN_BLINDING *RSA_setup_blinding(RSA *rsa, BN_CTX *ctx);
~~~

The decision that came out of Issue#14616 and led to PR#14759 was to use
the propquery mechanism to let providers advertise algorithms as
`blinding=yes`, so that secure implementations can be selected when
insecure ones are present as well.

A potential `consttime=yes` property was also discussed, which would
work in a similar way: if applied properly to our current
implementations that are not fully constant time, it would allow a
user/sysadmin/developer to prefer a third-party implementation of the
same algorithm with better security guarantees.
In some contexts the consttime implementation might carry a serious
performance penalty, and in the contexts where constant time is not
required this would allow selecting accordingly.
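
As a rough illustration of how a developer could express such a
preference today, the sketch below uses the existing
`EVP_set_default_properties()` API; note that `consttime` itself is only
a proposed/hypothetical property, not one our providers define:

```c
#include <openssl/evp.h>

/*
 * Sketch only: "consttime" is not an existing property. The "?" prefix
 * marks the clause as preferred rather than required, so algorithm
 * selection still succeeds if no implementation advertises it.
 */
int prefer_consttime(OSSL_LIB_CTX *libctx)
{
    return EVP_set_default_properties(libctx, "?consttime=yes");
}
```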

Definition for the blinding property
------------------------------------

The current definition of the `blinding` property applies to
provider-native algorithm implementations for the `asym_cipher` and
`signature` operations:

```pod
=head2 Properties

The following property is set by some of the OpenSSL signature
algorithms.

=over 4

=item "blinding"

This boolean property is set to "yes" if the implementation performs
blinding to prevent some side-channel attacks.
```

Rationale
=========

Property queries are our decision-making mechanism for implementation
selection, and they have been part of the design for 3.0 since the
Brisbane design meetings: I am not opposing the use of the property
query mechanism to select algorithms here, but rather the semantics of
the properties we decide to adopt (and thus endorse and encourage for
use by 3rd parties as well).
In particular I see the following issues with choices like
`blinding=yes` or `consttime=yes`.

Design promotes insecure by default
-----------------------------------

This design is a slippery slope into patterns that go against the
"secure by default" best practices.
Users/administrators/developers should not have to opt in for the safer
implementation, but the other way around: the opt-in should be for the
less safe but more performant implementations, after evaluating the
consequences of such a choice in specific contexts.
Users shouldn't have to query for `consttime=yes` algorithms; rather,
they should have to ask for `consttime=no` explicitly under specific
conditions.
So if this were the only issue with PR#14759, my recommendation would be
to flag the non-blinding implementations as such, rather than the other
way around as is currently done.
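
As a sketch of what that inverted flagging could look like in a
provider's algorithm table, with all names hypothetical (and leaving
aside for now whether registering the same algorithm twice in one
provider is even supported, a question raised again further below):

```c
#include <openssl/core.h>

/* Dispatch tables assumed to exist elsewhere in this hypothetical provider */
extern const OSSL_DISPATCH example_rsa_blinded_signature_functions[];
extern const OSSL_DISPATCH example_rsa_fast_signature_functions[];

static const OSSL_ALGORITHM example_signatures[] = {
    /* only the less safe variant carries an explicit marker ... */
    { "RSA", "provider=examplep,blinding=no", example_rsa_fast_signature_functions },
    /* ... while the safer one needs no opt-in property at all */
    { "RSA", "provider=examplep", example_rsa_blinded_signature_functions },
    { NULL, NULL, NULL }
};
```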

The scenario in which 3rd party providers offer insecure algorithms not
flagged as such with `consttime=no` or `blinding=no` IMO would then fall
under the "know your providers" umbrella: if you are installing provider
X you should be well aware that you are trusting X's authors "to do the
right thing".

The project's history has shown us that the risks of insecure-by-default
designs and their consequences are not just a hypothetical matter of
debate: for example, `BN_FLG_CONSTTIME` has been with us since version
0.9.8f (and even before that under the `BN_FLG_EXP_CONSTTIME` name),
well over 14 years, during which the decision to have the flag off by
default and require it to be enabled manually when desired has been the
cause of many vulnerabilities and fixes. These are still a concern
today, as we fixed yet another instance of forgetting to set it just a
few weeks ago (PR#13889).
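
For readers less familiar with that flag, the burden it places on
callers looks roughly like the following minimal sketch (not taken from
any specific call site):

```c
#include <openssl/bn.h>

/*
 * Every caller handling a secret exponent has to remember this opt-in:
 * forgetting the single BN_set_flags() call silently selects the
 * variable-time code paths.
 */
int mod_exp_with_secret(BIGNUM *r, const BIGNUM *base, BIGNUM *secret_exp,
                        const BIGNUM *mod, BN_CTX *ctx)
{
    BN_set_flags(secret_exp, BN_FLG_CONSTTIME);
    return BN_mod_exp(r, base, secret_exp, mod, ctx);
}
```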

In this case, though, I expect that just flipping the default would not
be enough to make this PR acceptable: the problem with going with the
negative versions of these flags is that such a design cannot be made
future-proof as we would wish. We cannot know in advance all the
possible properties that should be declared as `=no` by an algorithm
released today as part of our providers, when new relevant properties
might be defined in the future.
E.g., what if, after 3.0 is released, an ECCKiila provider appeared and
opted to tag its implementations as `formally_verified`?
They could add `formally_verified=yes`, but then they would fall into
the "insecure by default" pattern: users would need to opt in for the
safer version instead of getting it by default, and there is no other
option, given that our and others' providers' offerings were not
preemptively flagged with `formally_verified=no`.
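
In that hypothetical scenario the opt-in would look something like the
fetch below; `formally_verified` is entirely made up here:

```c
#include <openssl/evp.h>

/*
 * Hypothetical opt-in: "formally_verified" is not a real property, and
 * without the explicit query the default selection simply ignores it.
 */
EVP_SIGNATURE *fetch_verified_ecdsa(OSSL_LIB_CTX *libctx)
{
    return EVP_SIGNATURE_fetch(libctx, "ECDSA", "formally_verified=yes");
}
```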

Scope of propquery semantics
----------------------------

This follows directly from the previous example. The design of the
propquery mechanism seems (IMO) to be limited to providing
__well-defined__ semantics for the properties only within the scope of a
single provider (and a single provider version at that).

If a given provider offered at runtime two or more versions of the same
algorithm, one flagged as `consttime=no` and one with consttime
guarantees, then it could document for its userbase how to set up
default propqueries in the configuration file at installation time, or
for application developers how to embed them in their code depending on
the use case, so as to select the `consttime=no` offering when such a
thing is perceived as desirable for whatever (ill-guided :P) reason.
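
Such provider documentation could, for instance, point at the existing
configuration mechanism for default property queries; in this sketch of
an `openssl.cnf` fragment the `consttime` property is, again, purely
hypothetical:

```
openssl_conf = openssl_init

[openssl_init]
alg_section = evp_properties

[evp_properties]
# "consttime" is not a real property; this only shows where a
# provider's documented default query would go
default_properties = ?consttime=no
```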

But doing this across providers does not really work, because other
provider authors might not have flagged their insecure implementations
with `consttime=no`, or because there might be very different opinions
on what qualifies as "fully consttime" (opinions that evolve over time
with different threat models and attacker capabilities):
provX's `consttime=yes` might be "more consttime" than provY's
`consttime=yes`.
For example, say provX is similar to our default provider and offers an
EC algorithm built on top of a generic BIGNUM module, so its generic
implementation supports any valid EC curve, named or custom. orgX made
sure that the ECDH/ECDSA and ECmath layers of the implementation follow
state-of-the-art practices regarding secret-dependent code execution and
data access, so they flag their EC implementation as `consttime=yes`,
even though the underlying BIGNUM module has limited consttime
guarantees: in certain circumstances a dynamic `realloc` or some other
seemingly minor thing could leak a few secret bits to a motivated
attacker.
provY is also similar to our default provider and provides an EC
algorithm, but instead of having a generic codepath they support only a
subset of named curves (e.g. only P-256 and P-521), each with a
dedicated implementation that removes the need for a generic BIGNUM
layer and instead embeds its specific consttime field arithmetic in each
curve implementation (e.g. they provide only our `ecp_nistz256.c` and
`ecp_nistp521.c` implementations). provY's ECDH/ECDSA, ECmath and FFmath
layers all follow best consttime practices, so they flag their EC
offering as `consttime=yes`.
Now `consttime=yes` is no longer well-defined across providers, making
it quite useless for users and providers to rely on `consttime=yes` or
`consttime=no` to pick among offerings from different providers.
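
Concretely, with both providers loaded, a single query like the one
below (still using the hypothetical `consttime` property) cannot express
which of the two rather different interpretations of "consttime" the
caller actually gets back:

```c
#include <openssl/evp.h>

/* Selection sketch: the same query matches either provX's or provY's
 * implementation, each with different real-world guarantees behind the
 * same "consttime=yes" label. */
EVP_KEYEXCH *fetch_some_consttime_ecdh(OSSL_LIB_CTX *libctx)
{
    return EVP_KEYEXCH_fetch(libctx, "ECDH", "consttime=yes");
}
```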

The same would be true within the same organization and provider,
across provider versions. Let's say tomorrow we merged a PR that made
all of BIGNUM resistant against all currently known timing side-channel
techniques, and at the same time we decided to mark all our pubkey
implementations as `consttime=yes` after verifying that all the layers
above BIGNUM are also timing-resistant to current techniques.
Fast-forward a year, and a new CPU design flaw is discovered, giving a
remote attacker a side channel with unprecedented resolution and SNR:
our previous `consttime=yes` semantics would not stand the test of time,
and the disruption of semantics could have ripple effects within our
providers and across the stack of applications that embedded
`consttime=yes` as a selection mechanism to pick among offerings from
various provider versions with now ill-defined qualifiers.

If we can agree, even just on some level or for some specific cases
like these, that the definitions of properties in propqueries have a
scope limited to a single provider version, what is the point of
exposing `blinding=*` to our users, given that at runtime we don't offer
alternative implementations of the same algorithm with different
properties?
(I mean, yes, we do have alternative choices at runtime in some cases,
but not at the algorithm level in the provider API; they are buried well
within each algorithm implementation that offers such choices.)

Too vague/coarse semantics
--------------------------

Elaborating further on the examples above, there is also a conceptual
issue at the base of the OTC discussion that led to PR#14759, namely the
idea that "blinding is a property of an algorithm: either an
implementation does blinding or it doesn't".

But this is not entirely true: blinding is applied to individual
operations at various levels.
RSA is a special case that has the luxury of resting its security on
basically a single mathematical operation (modular exponentiation) over
a single mathematical layer of abstraction (finite field math provided
by BIGNUM): so it is enough to apply blinding to the modular
exponentiation to say "this algorithm implementation does blinding".
That is already no longer true for DSA signing, which relies on the same
single finite field math abstraction provided by BIGNUM but rests on two
operations, exponentiation and modular inversion.
It gets one layer deeper with ECDH/ECDSA, where you have two operations
(scalar multiplication and modular inversion), of which the first is
composed of math operations coming from the EC arithmetic layer of
abstraction, and each of those operations is itself built as a
composition of many operations from the same FF arithmetic layer in
BIGNUM.
The number of levels of abstraction involved in an implementation keeps
rising steeply when we move into the realm of upcoming post-quantum
algorithms.
When you have multiple operations from multiple layers, applying
blinding efficiently is non-trivial: one might blind only the EC
scalarmul but not the FF modinv, because the former takes the bulk of
the computation time and the common consensus at some point in time is
that there is little point in blinding the modinv if its SNR is deemed
too low to be exploitable. But what happens when this no longer holds?
What is the meaning of `blinding=yes` for an EC implementation that
applied blinding only to scalarmul and not to modinv because that was
the best practice at the time it was written?
What if blinding is applied to the scalarmul, but the FF operation that
performs the blinding already leaks a small amount because of BIGNUM?
A small leakage might not be considered a problem with 2048+ bit RSA
keys (the shared opinion might be that 1 or 3 bits of knowledge wouldn't
give the attacker a noticeable boost compared to other tools in their
arsenal), but the same few bits have been proven sufficient to break
even P-384 or P-521 keys (which are above the security level of
RSA-2048) with limited effort and cost: what is the meaning of
`blinding=yes` in the case where the blinding implementation itself
reveals information about the secret it should protect?

What if the blinding operation itself does not leak a single bit by
itself, but it magnifies some effect from the underlying level of
abstraction? E.g., blinding the scalarmul could make the nonce bigger
than the prime order subgroup size `n`, while the specific scalarmul
implementation must guarantee that its scalar input is less than `n` to
compute correctly, so it has a special branch to be robust and handle
the bigger scalars coming from blinding; that branch becomes a
magnifying lens for an attacker when it is taken, as it says something
about the probability of the high-order bits of the scalar being set.
This might seem to have nothing to do with the semantics of
`blinding=yes`, but it does for providers like ours, where we apply
blinding at the top abstraction layer (and, for certain paths, also at
strategic points in the second layer) but then have a number of
different sub-algorithm implementations that behave very differently:
depending on the curve name (and the encoding format when parsing a key)
you might end up in the generic prime/binary curves implementation, or
in one of the specific ones like nistz256 or nistp521.

Applying `blinding=yes` like this:

```c
{"ECDSA","provider=default,blinding=yes",ossl_ecdsa_signature_functions}
```

is already making the semantics of the `blinding` property ill-defined,
and it affects its effectiveness/purpose as a decision mechanism exposed
to users: we have different codepaths under the same umbrella flagged as
applying blinding, but even within that umbrella the blinding comes in
very different shades of "yes".

As @t8m said "Security is not a boolean value" and my point here is that
neither is `blinding` or `consttime` and similar: they come in shades
(even the `formally_verified` one: maybe the EC layer is formally
verified by tools, but the underlying FF module might come with
different guarantees, etc.) and the threshold by which SOTA deems the
quality to become "yes*" or "no*" changes over time (e.g., it was fine
to only blind scalarmul yesterday, so implementation X was considered as
a blinding implementation, but today also modinv blinding is required so
the same implementation X is now deemed non-blinding).

Alternative
-----------

Support for the specific `RSA_FLAG_BLINDING`&co functionality could
instead just be dropped.
My understanding is that its usefulness is already limited to legacy
compatibility with engines which anyway would not play nicely with the
new provider/core architecture.

More generally, as a mechanism to select secure implementations when
insecure ones are present as well, the problem can already be solved by
applying the "know your providers" mantra in combination with the use of
the `provider=default`, `provider=provX` properties to pick across the
offerings of multiple providers.
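
A minimal sketch of that approach, using only the existing `provider`
property and existing APIs (the choice of the default provider here is
just an example):

```c
#include <openssl/evp.h>

/* "Know your providers": pin the implementation source explicitly with
 * the existing "provider" property, instead of relying on a
 * security-quality property such as blinding/consttime. */
EVP_PKEY_CTX *new_sign_ctx_from_default(OSSL_LIB_CTX *libctx, EVP_PKEY *key)
{
    return EVP_PKEY_CTX_new_from_pkey(libctx, key, "provider=default");
}
```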

This wouldn't cover the use case (is this even supported/tested right
now?) of a single provider registering the same algorithm twice with
different alternative implementations and properties: in this case,
though, it would be possible to use a property definition local to the
specific provider version, and its documentation would inform users how
to use a propquery to select among them, together with the related
conditions and caveats. Luckily this last case does not affect our
implementations, as we don't offer competing algorithms at the provider
level.

Although I wouldn't propose to do it, if we really, really wanted to
retain the functionality of `RSA_FLAG_BLINDING` and have algorithms
expose knobs capable of turning on/off (or 1000/500/0) whatever degree
of blinding they support, we could achieve this with existing
ctrls/params, as we do for many other tunable knobs.
Using ctrls/params rather than the propquery level has the advantage
that the definition/semantics can be scoped with more granularity than
is offered at the provider level.
As a user/sysadmin, if I have done the evaluation of the specific
context that could lead me to responsibly decide to pick no-blinding,
no-consttime, no-secure or similar hazardous choices, I would expect to
be in a situation anyway where I have so much control over my system,
software and environment that patching the code to pass a param or call
a ctrl function deep inside my application shouldn't be a blocker at
all.
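
For illustration only, such a knob could be a per-context parameter
along the following lines; the parameter name "blinding" is made up and
does not exist in our providers:

```c
#include <openssl/evp.h>
#include <openssl/params.h>

/* Hypothetical per-context knob: "blinding" is NOT an existing
 * parameter name, it only illustrates the ctrls/params alternative. */
int example_disable_blinding(EVP_PKEY_CTX *ctx)
{
    int off = 0;
    OSSL_PARAM params[] = {
        OSSL_PARAM_int("blinding", &off),
        OSSL_PARAM_END
    };

    return EVP_PKEY_CTX_set_params(ctx, params);
}
```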

Summary
-------

I am opposed to merging PR#14759 for several reasons:

- it lends itself to insecure-by-default patterns, repeating mistakes
  from the past
- the validity scope of such properties is in practice limited to a
  single provider version, so its effectiveness for its purpose of being
  a selection mechanism for users to pick among competing algorithms
  offered by different providers is limited (and has the potential to
  haunt us in the future)
- its definition and semantics are vague, its applicability as an
  algorithm property too coarse to be well-defined even inside our own
  providers
- `secure`, `blinding`, `consttime`, etc. are not boolean properties of
  algorithms (even if I really wish they could be)
- it's one more vague concept for everyone to understand, and one more
  item with not-too-clear logic for the project to handle and maintain
  for the foreseeable future
- we have already established alternatives in the form of
  per-implementation params or ctrls to tune some security parameters
  that can alter the execution flow of our implementations

Also, from a more generic usability perspective and thinking of
supporting those in our Community developing 3rd party providers, I
think we should also

- [in the vote] take a position on what are good/bad ideas when devising
  provider-specific properties, and have some official guidelines about
  it in the documentation by the time the final 3.0.0 is released, and
- [not in the vote, could be a separate discussion] document what
  mechanisms we plan to put in place to coordinate efforts
  which will allow for some carefully evaluated properties to have
  well-defined semantics (with versioning?) across providers/vendors
  (some form of registry of properties, with well-defined criteria for
  maintaining it?)

Proposed vote text
==================

    Do not merge PR#14759, prevent declaring properties similar to
    `blinding=yes` or `consttime=yes` in our implementations and
    discourage 3rd parties from adopting similar designs.




Please provide feedback on the vote text; I would like to open the vote
on Monday and have some public debate about its points via the list
rather than compress it into the OTC meeting, which already has enough
items on its agenda!

Thanks,

Nicola