[openssl-users] How to handle DTLS Certificate Reassembly Error

Sat Sep 17 22:43:02 UTC 2016

On 17/09/16 16:11, Chad Phillips wrote:
>     Was this packet capture done on the client side or the server side or
>     somewhere in the middle? There appears to be some messages missing.
>     In particular I don’t see any CCS or Finished messages being
>     exchanged. Is the network this is over potentially noisy that might
>     explain packet loss?
> 
> 
> From the perspective of the DTLS handshake, my server hosting the Licode
> library is the client, and latest stable Chrome browser is the server,
> if I understand the terminology correctly. The packet capture was taken
> from the client (Licode) side.
> 
> Would the CCS or Finished messages have gotten filtered out by the
> ’dtls’ filter I applied to the packet capture? I do have the full trace
> and can re-filter to just one complete connection over a specific UDP
> port as you suggested, let me know if that would be helpful

I took another look at the packet trace. I found the CCS/Finished
messages! They are actually there but wireshark is not showing them for
some reason (at least my version of wireshark isn't).

On the end of the packet which contains three Certificate fragments, the
ClientKeyExchange and the Certificate Verify, my wireshark is then
saying "Malformed Packet". This in in relation to a load of data that is
in the packet after the Certificate Verify. Looking at it by eye the
packets look well formed so I'm not sure why wireshark is complaining.
Anyway after the Certificate Verify I can see the CCS, and an encrypted
handshake message - which will be the Finished.

What is odd is that we are seeing 3 Certificate fragments and 2
CertificateVerify fragments in a single network packet. OpenSSL will
only fragment if it thinks the MTU isn't big enough for anything larger.
It looks like licode is then combining the multiple fragments into a
single packet anyway. This is probably something to do with the way
licode is written meaning that OpenSSL is not getting the right MTU
value. I assume the licode developers are trying to compensate for that
by sending it all in one go anyway. That shouldn't cause any problems -
but its a bit odd and it would be better to make sure OpenSSL gets the
right MTU in the first place.

I speculate that the reason I'm seeing the "malformed packet" is that,
normally, you'd only see a maximum of 5 DTLS handshake records in a
single packet. However, because we have fragmented the Certificate and
CertificateVerify messages we've got more the 5 DTLS records in a
packet. My guess is that there is a bug in wireshark that fails if it
gets more the 5 records in one go. But that really is pure guess work.

> 
> I see these failures only in situations where browser users with slow
> and/or lossy internet are joining, and usually when the group size gets
> to be six or more participants. The particular testing scenario that
> generated the packets you saw was a user with 225kbps upload, 5120kbps
> down, 70ms delay, 0% packet loss.
> 
> I’ll grant you those network conditions aren’t the best for group video
> chat, but if Google Hangouts can pull it off, I’d like to as well.
>  
> 
>     On receiving that the client should respond with a retransmit of
>     the Certificate/ClientKeyExchange/CertificateVerify/CCS/Finished
>     flight of messages. But it does not appear to do so…the retransmit
>     does not happen until after the encrypted alert.
> 
> 
> This sounds like it might be a bug in the Licode library, not resending
> the retransmit properly?

Possibly. It could be that or it could also feasibly be a bug in
OpenSSL. However I have a theory that might explain it (but it is just a
theory).

DTLS uses a timer to retransmit messages that may have got lost. If it
hasn't had the response it expects following the last set of messages it
sent by the time the timer expires, then it retransmits them.

I wrote above that "On receiving that the client should respond with a
retransmit of the
Certificate/ClientKeyExchange/CertificateVerify/CCS/Finished flight of
messages.". Actually in reality that's a bit of an over-simplification.
What actually happens is the peer is waiting for some messages that it
doesn't receive within its timeout (either because they are lost or
delayed), and so it retransmits them. This is why we see the second set
of ServerHello (etc) messages from the server. The client application
usually then notices that the socket has become readable and the
application calls OpenSSL to process that data. OpenSSL reads it,
realises that they are retransmits of messages that it has already
processed and drops them. It also checks its own timer to see if the
client needs to retransmit any messages. If the client timer hasn't
expired yet then it does nothing. This could be why no retransmit
happens immediately after the second set of ServerHello messages, i.e.
the client timer hasn't expired yet.

Normally the server would continue to retransmit periodically, which
would cause the client application to try and process the retransmits
again, and eventually the timer will have expired and it will retransmit
its last set of messages again. However in this case it seems the server
only retransmits once and then doesn't try again. Perhaps boring has a
different retransmit policy to us - I'm not sure.

This means though that unless something else happens no further data
will be received by the client until it retransmits its last set of
messages...but with no further data being received the application never
calls openssl again to cause it to check its timer again and make the
retransmits happen! Therefore the client is sat there waiting for the
server to send it something...and the server is also sat there waiting
for something to happen.

There is an OpenSSL API which is intended to resolve this issue:

DTLSv1_handle_timeout()

The application is expected to call this periodically during the
handshake if no other data has been sent or received. The causes OpenSSL
to check its timer and do any retransmits if necessary. If licode
doesn't call this, then its plausible that this is the cause of the issue.

Unfortunately there is at least one OpenSSL bug here - in the documentation:
DTLSv1_handle_timeout() is undocumented :-(

Matt