Blocking on a non-blocking socket?

Wiebe Cazemier wiebe at halfgaar.net
Thu May 23 02:22:31 UTC 2024


Hi List,

I have a very obscure problem: an application using O_NONBLOCK sockets still appears to block. Over the course of a year of running with hundreds of thousands of clients, a worker thread has frozen twice, both times in the last month. It's a long story, but I'm pretty sure it's not a deadlock or a spinning event loop, primarily because the application recovers after about 20 minutes, with a client erroring out with ETIMEDOUT. Notably, those 20 minutes match the timeout described in the tcp man page [1].

It really looks like a non-blocking socket is still blocking. I found a report of a similar problem ([2]), but their understanding of SSL_MODE_AUTO_RETRY does not match the documentation.

So, is there any way an application that has SSL_MODE_AUTO_RETRY on (the default since 1.1.1) can block on a non-blocking socket? Looking at the source code, I don't see any calls to fcntl() that remove O_NONBLOCK.
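A minimal sketch of how the flag could be checked at runtime rather than by reading the source, assuming a POSIX system. The helper name `fd_is_nonblocking` is hypothetical; it is not part of OpenSSL or of the application described here.

```c
#include <fcntl.h>

/* Hypothetical helper: returns 1 if fd has O_NONBLOCK set, 0 if it does
 * not, and -1 if fcntl() fails (errno is left as set by fcntl). */
int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```

Logging the result of such a check just before a read or write on a suspect connection would show whether the flag was actually lost, or whether the stall has another cause.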

My IO methods are SSL_read() and SSL_write(), on an SSL object attached to the socket with SSL_set_fd().

The only SSL mode I change from the default is SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER, which I set.

There are two primary deployments of this application, one with OpenSSL 1.1.1 and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a coincidence.

Side question: is it a problem to call SSL_set_fd() before using fcntl() to set the fd to O_NONBLOCK? I ask because the docs say "The BIO and hence the SSL engine inherit the behaviour of fd. If fd is non-blocking, the ssl will also have non-blocking behaviour.". The word 'inherit' may be key here; I'm not sure when that inheritance happens.
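A sketch of the ordering that sidesteps the question, assuming a POSIX system: set O_NONBLOCK on the fd first, and only then hand it to SSL_set_fd(). The helper name `make_nonblocking` is hypothetical, and the OpenSSL call appears only in a comment so the sketch builds without the library.

```c
#include <fcntl.h>

/* Hypothetical helper: adds O_NONBLOCK to fd's existing status flags.
 * Returns 0 on success, -1 if either fcntl() call fails. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}

/* Intended call order (assumption: ssl is an already created SSL object):
 *
 *   if (make_nonblocking(fd) == -1) { handle the error }
 *   SSL_set_fd(ssl, fd);
 */
```

Reading the current flags before writing them back preserves any other status flags already set on the descriptor.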

Regards,

Wiebe Cazemier



[1] https://man7.org/linux/man-pages/man7/tcp.7.html
[2] https://github.com/alanxz/rabbitmq-c/issues/586


More information about the openssl-users mailing list