Cert hot-reloading

Tue Sep 1 04:57:43 UTC 2020

On Mon, Aug 31, 2020 at 11:00:31PM -0500, David Arnold wrote:

> 1. Construe symlinks to current certs in a folder (old or new / file by file)
> 2. Symlink that folder
> 3. Rename the current symlink to that new symlink atomically.

This is fine, but does not provide atomicity of access across files in
that directory.  It just lets you prepare the new directory with
non-atomic operations on the list of published files or file content.

But if clients need to see consistent content across files, this does
not solve the problem, a client might read one file before the symlink
is updated and another file after.  To get actual atomicity, the client
would need to be sure to open a directory file descriptor, and then
openat(2) to read each file relative to the directory in question.

Most application code is not written that way, but conceivably OpenSSL
could have an interface for loading a key and certchain from two (or
perhaps even more for the cert chain) files relative to a given
directory.  I know how to do this on modern Unix systems, no idea
whether something similar is possible on Windows.

The above is *complicated*.  Requiring a single file for both key and
cert is far simpler.  Either PEM with key + cert or perhaps (under
duress) even PKCS#12.

> Does it look like we are actually getting somewhere here?

So far, not much, just some rough notes on the obvious obstacles.
There's a lot more to do to design a usable framework for always fresh
keys.  Keeping it portable between Windows and Unix (assuming MacOS will
be sufficiently Unix-like) and gracefully handling processes that drop
privs will be challenging.

Not all applications will want the same approach, so there'd need to be
various knobs to set to choose one of the supported modes.  Perhaps
the sanest approach (but one that does nothing for legacy applications)
is to provide an API that returns the *latest* SSL_CTX via some new
handle that under the covers constructs a new SSL_CTX as needed.

    SSL_CTX *SSL_Factory_get1_CTX(SSL_CTX_FACTORY *);

This would yield a reference-counted SSL_CTX that each caller must
ultimately release via SSL_CTX_free() to avoid a leak. 

    ... factory construction API calls ...
    ctx = SSL_Factory_get1_CTX(factory);    -- ctx ref count >= 1
    SSL *ssl = SSL_CTX_new(ctx);            -- ctx ref count >= 2
    ...
    SSL_free(ssl);                          -- ctx ref count >= 1
    SSL_CTX_free(ctx);                      -- ctx may be freed here

To address the needs of legacy clients is harder, because they
expect an SSL_CTX "in hand" to be valid indefinitely, but now
we want to be able age out and free old contexts, so we want
some mechanism by which it becomes safe to free old contexts
that we're sure no thread is still using.  This is difficult
to do right, because some thread may be blocked for a long
time, before becoming active again and using an already known
SSL_CTX pointer.

It is not exactly clear how multi-threaded unmodified legacy software
can be ensured crash free without memory leaks while behind the scenes
we're constantly mutating the SSL_CTX.  Once a pointer to an SSL_CTX
has been read, it might be squirreled away in all kinds of places, and
here's just no way to know that it won't be used indefinitely.

-- 
    Viktor.