Blocking on a non-blocking socket?
Wiebe Cazemier
wiebe at halfgaar.net
Sat Jun 1 02:54:06 UTC 2024
----- Original Message -----
> From: "Wiebe Cazemier" <wiebe at halfgaar.net>
> To: openssl-users at openssl.org
> Sent: Thursday, 23 May, 2024 12:22:31
> Subject: Blocking on a non-blocking socket?
>
> Hi List,
>
> I have a very obscure problem with an application using O_NONBLOCK still
> blocking. Over the course of a year of running with hundreds of thousands of
> clients, it has happened twice over the last month that a worker thread froze.
> It's a long story, but I'm pretty sure it's not a deadlock or spinning event
> loop or something, primarily because the application recovers after about 20
> minutes with a client erroring out with ETIMEDOUT. Coincidentally, that 20
> minutes matches the timeout description of the tcp man page [1].
>
> It really looks like a non-blocking socket is still blocking. I found something
> with a similar problem ([2]), but what they think of SSL_MODE_AUTO_RETRY does
> not match the documentation.
>
> So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on
> (which is default since 1.1.1) can block? Looking at the source code, I don't
> see any calls to fcntl() that remove O_NONBLOCK.
>
> My IO method is SSL_read() and SSL_write() with an SSL object given to
> SSL_set_fd().
>
> The only change I make to the default SSL modes is that I set
> SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.
>
> There are two primary deployments of this application, one with OpenSSL 1.1.1
> and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a
> coincidence.
>
> Side question, is it a problem to set SSL_set_fd() before using fcntl to set the
> fd to O_NONBLOCK? I ask, because the docs say "The BIO and hence the SSL engine
> inherit the behaviour of fd. If fd is non-blocking, the ssl will also have
> non-blocking behaviour.". The 'inherit' may be a key word here; not sure when
> it's done.
>
> Regards,
>
> Wiebe Cazemier
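As a follow-up: the fault turned out to be my own, as I imagine is also the case in [1] (cited as [2] in my original message). That issue describes SSL_MODE_AUTO_RETRY as something that 'attempts to renegotiate a broken SSL connection', but all it really does is let SSL_read() carry on with the next record when the record it just processed was not application data (a handshake message, for example), instead of returning to the caller to ask for a retry. It does not make a non-blocking socket block.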
Despite what I thought before, my code did have an unfortunate edge case: a while loop that kept spinning on SSL_write() when there was no room in the socket's send buffer. The loop only ended when the connection eventually failed with ETIMEDOUT, which is why the worker thread looked blocked for about 20 minutes.
Well, it was educational at least...
[1] https://github.com/alanxz/rabbitmq-c/issues/586