[openssl-project] [TLS] Yet more TLS 1.3 deployment updates

Wed Jan 23 17:42:07 UTC 2019

On Wed, Jan 23, 2019 at 4:24 AM Matt Caswell <matt at openssl.org> wrote:

> On 22/01/2019 20:41, David Benjamin wrote:
> > On Tue, Jan 22, 2019 at 1:48 PM Viktor Dukhovni <
> openssl-users at dukhovni.org
> > <mailto:openssl-users at dukhovni.org>> wrote:
> >     As for applications mishandling "SSL_CB_HANDSHAKE_START", not quite
> sure
> >     what to do there, but perhaps we could define a new even for
> keyUpdates
> >     that does not mislead applications into assuming a new "handshake".
> >
> > I think this is clearly the right option. This is a compatibility break,
> IMO.
> >
> > Prior to SSL_OP_NO_RENEGOTIATION (new in the same release that added
> 1.3), this
> > was the only way to disable renegotiations. I've seen a lot of codebases
> do
> > this. I don't think one could even call it a mishandling. The natural
> > interpretation of "SSL_CB_HANDSHAKE_START" is that a handshake has
> started.
> > Thus, if you wish to block renegotiations, absent a more direct API, it's
> > natural to count those and fail if you see more than one.
> >
> > A KeyUpdate is not a handshake.
>
> That's depends on your perspective. One peer sends a message, and the
> other peer
> (may) respond with a message. Sounds like a handshake to me. KeyUpdate
> gets sent
> in handshake records, with a HandshakeType value of 24 and is defined in
> section
> 4 of the RFC ("Handshake Protocol").
>

This notion of "handshake" is not supported by RFC 8446 uses the terms "the
handshake", "a handshake", and "post-handshake". "Post-handshake", in
particular, implies KeyUpdate are after the handshake, not part of it.

KeyUpdate is also not quite a request/response pair. The caller is allowed
to coalesce consecutive key_update_requests
<https://tools.ietf.org/html/rfc8446#section-4.6.3>. This was done to avoid
DoS <https://mailarchive.ietf.org/arch/msg/tls/cfw4paCGxI7Fj8QNmj6k1I66VII> and
allow a lighter strategy: if you see key_update_request on read, set a flag
and continue. When you make an outgoing record on write, if the flag is
set, queue a KeyUpdate up first and clear the flag. This avoids the DoS:
write overhead is bounded and read/write flows are independent.

Finally, compare with TCP socket APIs, which most applications have in
mind. There is a connect() phase that happens *once* at the start,
afterwards one calls read() and write(). During read() and write(), TCP
still ends other non-data packets, but those are abstracted away.

> To me it makes perfect sense to signal it this way. In fact it was a
> deliberate
> decision to do so and is documented accordingly:
>
>  =item SSL_CB_HANDSHAKE_START
>
>  Callback has been called because a new handshake is started. In TLSv1.3
> this is
>  also used for the start of post-handshake message exchanges such as for
> the
>  exchange of session tickets, or for key updates. It also occurs when
> resuming a
>  handshake following a pause to handle early data.
>
>  =item SSL_CB_HANDSHAKE_DONE           0x20
>
>  Callback has been called because a handshake is finished. In TLSv1.3 this
> is
>  also used at the end of an exchange of post-handshake messages such as for
>  session tickets or key updates. It also occurs if the handshake is paused
> to
>  allow the exchange of early data.
>

This documentation does not help code written prior to TLSv1.3, which is
the problem here. OpenSSL claims that TLS 1.3 is a backwards-compatible
addition, so documentation updates for 1.3 may clarify but cannot be
necessary. More on this below.

> Ironically it was done this way with a view to *avoiding* a compatibility
> break.
> The thinking was that applications written with TLSv1.2 in mind may find it
> surprising to start receiving events after the initial connection setup
> that are
> not enclosed in HANDSHAKE_START/HANDSHAKE_DONE. The info callback is
> intended as
> a tracing API IMO. It's only the fact that it has been used in ways other
> than
> we might have expected that we have this problem. In fact if your
> application is
> using the API to detect renegs because it wants to report when key updates
> are
> happening (rather than blocking them)....well that application will
> continue to
> work and will stop working if we change the events.
>
> Given this is documented that way, some applications may already be using
> it -
> so changing the events now should be done with caution. That itself could
> cause
> a break. We know that this is causing some problems now because of the way
> that
> some applications are (mis)using this API. What we don't know is what new
> problems would surface if we changed its semantics in a letter release.
>
> Really the root of this problem is not in the info callback at all. It is
> the
> fact that in earlier 1.1.0 releases we did not have an effective way of
> blocking
> renegs. We now have SSL_OP_NO_RENEGOTIATION which is the "right" way to do
> this.
> I'd also note that this was backported to the 1.1.0 branch in 1.1.0h (March
> 2018) and so has been available to all 1.1.x users for some while.
>

The fact of the matter is the TLS 1.3 ecosystem is intolerant to
KeyUpdates, as a result of behavior in OpenSSL and its consumers. As a
result of this failure, *no one* can use KeyUpdate. This warrants
consideration on both sides of the API boundary.

You're right that adapting this API to 1.3 required a judgement call. It
observed a state machine which changes across versions. There isn't a right
answer, a priori. Elsewhere in this thread, I pushed against adding new
tracing events without a clear need. This is counter to OpenSSL's usual API
style but comes from experience developing on both sides of the API. (My
day job involves working both on BoringSSL itself and code calling into it,
much of which was written to OpenSSL.) APIs like that are problematic. I'll
note BoringSSL doesn't expose state constants at all.

But OpenSSL 1.1.0 and 1.0.x had that API without SSL_OP_NO_RENEGOTIATION,
so we must deal with it. Note March 2018 is not "for some while" on the
timescales of OpenSSL consumers, sadly. Ubuntu 18.04 LTS still ships 1.1.0g
<https://packages.ubuntu.com/bionic-updates/openssl> (plus security fix
backports). OpenSSL 1.0.x is also relevant. Projects did not port to
OpenSSL 1.1.0 by revising every API call against the new documentation.
They tried to compile it, fixed any errors, maybe ran some basic tests,
fixed any obvious runtime issues, and called it done. This is simply the
natural way to do it. Change strategies must take this reality into account
<https://boringssl.googlesource.com/boringssl/+/HEAD/BREAKING-CHANGES.md#fail-early_fail-closed>
.

OpenSSL's API thus has a long history. Its consumers grew up organically
with OpenSSL, sometimes even SSLeay. Judgement calls cannot be made in a
vacuum. When we were designing TLS 1.3 support for BoringSSL, every
question like this was backed by sampling existing callers. From that
experience, I have seen the following uses of the info callback:

(a) Debugging hooks for tracing, often copied from the openssl binary
<https://git.openssl.org/gitweb/?p=openssl.git;a=blob;f=apps/s_cb.c;h=af57c34558684d4bc027d1471b63d5e4f7a316ab;hb=HEAD#l454>
.
(b) As a callback to know when the handshake (in the RFC8446 sense
described above, not the OpenSSL sense) is done, sensitive
to SSL_CB_HANDSHAKE_DONE.
(c) As a callback to block renegotiations.

The problem here is (c), and empirically has affected versions of NGINX,
Node, HAProxy and the real TLS 1.3 ecosystem. There may be more yet
undiscovered problems; we only had KeyUpdates on in Chrome for a week on
Chrome Canary before we had to shut it off. At three affected callers, one
cannot simply say this is the consumer's fault.

As for the others, (b) also doesn't want to trigger on KeyUpdate, though it
may tolerate it. (I have seen versions of (b) which ignore duplicates and
versions which break on renegos---no one tests against it, which is why
it's off by default in BoringSSL.) (a) is closest to the scenario you are
concerned about, but such debugging notes are just that: debugging. I have
never seen code which cares about their particulars. Indeed, if it did,
adding TLS 1.3 would not be compatible because 1.3 changes the state
machine.

Thus, the fix is clear: don't signal HANDSHAKE_START and HANDSHAKE_DONE on
KeyUpdate. Not signaling has some risk
<https://boringssl.googlesource.com/boringssl/+/HEAD/BREAKING-CHANGES.md#breakage-risk>,
but it is low, especially in comparison to the known breakage and ecosystem
damage caused by signaling.

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mta.openssl.org/pipermail/openssl-project/attachments/20190123/9b162e2f/attachment-0001.html>