[openssl-dev] State machine rewrite

Fri Sep 11 23:20:52 UTC 2015

On 11/09/15 16:07, John Foley wrote:
> +1
> 
> It's great to see improvements in the state machine along with
> consolidated handlers for TLS/DTLS.  Having said that, have you
> considered using a state transition table instead of long switch
> statements to enforce the state transition rules?  This would improve
> the maintainability of the code.  Here's a trivial example:
> 
> http://www.gedan.net/2008/09/08/finite-state-machine-matrix-style-c-implementation/
> 

Hi John,

That's a really good question that's going to need quite a long answer
(sorry!).

To summarise: the article states that any finite state machine can be
represented as a state transition table. The article shows such a table
with states along one axis, and events along the other. The contents of
each "cell" of the table are the new state to move to, and an action to
perform. The argument is that such an approach is easier to maintain
because changing the structure of the state machine does not change the
structure of the underlying code (you just change the table), whereas
traditional approaches do.

The article presents a very simplified table structure (for
understandable reasons). A real table approach (for SSL/TLS) would have
to be a bit more complicated than that. In particular the "events" we
are talking about here are the arrival of new messages (at least they
are when we are reading rather than writing). However we are working in
a security setting and we have to deal with the possibility of "illegal"
events. e.g. if you are server and the last message you read was
ClientKeyExchange and then you receive a ChangeCipherSpec - is that ok?
Well the answer is "it depends". Dependant on the preceding messages we
might need to have a CertificateVerify next. So transitions are actually
"guarded" - there is logic which determines whether a particular event
is "allowed" in the current scenario or not.

This is still fine from a table driven approach - it just means that the
contents of our "cells" have to include a transition guard entry as well.

The other thing to think about here is why is it that a table driven
approach is easier to maintain (assuming we accept the argument that it
is)? I would argue the reason is that each entry in the cells conforms
to a common interface. The logic that does all the hard work of moving
around that state machine can be generic, and all the state specific
work is contained within the functions implementing the individual cells
- which all have the same interface that the generic code can
understand. Therefore changing the state machine should not impact the
generic code - its just about changing the states/events/cells.

I would further argue that the mechanism that you use to implement the
table itself is fairly irrelevant - what is important is the generic
table "look up" code doing the work of moving around the state machine,
and the common interface to the state specific cells. The article
presents a fairly obvious table implementation: an array. In an array
based approach you first scan the "rows" of the table until you find
your current state, and then you scan the "columns" of the table until
you find the event. Now you've found the right cell.

This is not the only implementation approach however. You could
implement this in code directly:

switch(my_state) {
case state1:
	if (event1)
		cell_1_1_action();
	else if (event2)
		cell_1_2_action();
	else
		error();
	break;
case state2:
	if (event1)
		cell_2_1_action();
	else if (event2)
		cell_2_2_action();
	else
		error();
	break;
/* etc */
}

So why might you take this approach rather than an array based approach?
Well one reason is that if you have a lot of states and a lot of
possible events an array based approach ends up being quite large and
difficult to read...particularly if the table is quite sparsely
populated. In SSL/TLS (for reading messages) the current state will be
the last state that we read, and the event will be the arrival of the
next message. There are quite a large number of states (there are 37
entries in the HANSHAKE_STATE enum), so you're talking about a 37x37
array. That's going to get quite messy to maintain. Given that the table
is going to be sparsely populated I think the code based approach is
probably better. You could probably come up more complicated data driven
approaches that can cope with sparse data, but I reckon the code based
approach given above is equivalent, quite simple, and does the job
equally well.

There is another reason as well why the code based approach might be
preferable. I mentioned earlier about the necessity for transition
"guard" logic. This makes the code above look more like this:

switch(my_state) {
case state1:
	if (event1 && guard_1_1)
		cell_1_1_action();
	else if (event2 && guard_1_2)
		cell_1_2_action();
	else
		error();
	break;
case state2:
	if (event1 && guard_2_1)
		cell_2_1_action();
	else if (event2 && guard_2_2)
		cell_2_2_action();
	else
		error();
	break;
/* etc */
}

In an array based approach the guard logic would have to go into the
cell data. In practice this means that each guard would have to be a
function. That's going to see quite an explosion in the number of
functions required and will distribute the transition logic around the
code to wherever the functions are defined. In reality the guards are
often quite simple. There may be the odd occasion where the guards are
complicated enough to be functions, but more often than not you can just
inline the guards directly into the if statements. This has the
*significant* advantage that you can keep all the transition logic *in
one place*. This was actually one of my design goals for the whole
rewrite. The old state machine code has transition logic everywhere. Its
very difficult to reason about and verify its correctness. By keeping
everything together in one place the job of sanity checking that the
logic is correct is made much easier. I would also argue that code
maintenance is also much easier than having the logic distributed around
lots of different functions.

I have simplified things quite considerably. The whole thing is further
complicated by the need to handle non-blocking IO, and so the state
machine needs to be able to stop at any point, return control back to
the calling application in order to resume where it left off later. For
that reason (amongst others) the actual code implementation is slightly
more complicated than I have outlined. However I would argue that the
approach that has been taken is:
- still table driven (using a code based implementation approach)
- exhibits the core features of the table based approach: generic code
along with standard interfaces to the "cells" of the table
- benefits from the advantages of such an approach, i.e. code
maintainability

It's not an array based table which is what I think you were looking
for. But I think there are good reasons not to go down that path.

Matt