problems with too many ssl_read and ssl_write errors

Mon Aug 23 15:22:15 UTC 2021

Hello Michael,

Thank you very much for your detailed response.

We had previously checked the Registry settings for the TCPIP Parameters and
have been using the default values.  I also ran the PowerShell script for
the ephemeral ports and you are correct - the ports are not being exhausted,
as it used the same inbound port for the clients.  We did get CLOSE_WAIT
and TIME_WAIT states once in a while using the netstat command, but most
of the time the connections were ESTABLISHED.

We get SSL_ERROR_SYSCALL from SSL_read and SSL_write quite often.  We
never got this error from SSL_connect on the client or SSL_accept
on the server side.  It seems the handshake completes correctly and, after a
random period of time (a few hours to 2-3 days), SSL_read/SSL_write
fails.  We do not get the WSAEWOULDBLOCK error code, nor OpenSSL's
SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE errors.
We get WSAETIMEDOUT on Receive more often and a few times on the Send. We
are not using SO_KEEPALIVE but using application specific heartbeat TO to
keep the socket alive.
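
To double-check which timeouts are actually in effect on a given socket, the
options can be read back with getsockopt. The following is a minimal sketch,
not our production code: read_timeout_settings and sock_timeout_info are
hypothetical names, and the non-Windows branch is there only so the sketch
can be compiled and tried on other platforms.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <sys/socket.h>
#include <sys/time.h>
typedef int sock_t;
#endif

/* Hypothetical result type for this sketch. */
typedef struct {
    long recv_timeout_ms;   /* 0 means no receive timeout is set */
    int  keepalive_enabled; /* nonzero if SO_KEEPALIVE is on */
} sock_timeout_info;

/* Reads SO_RCVTIMEO and SO_KEEPALIVE from an existing socket.
   Returns 0 on success, -1 if either getsockopt call fails. */
int read_timeout_settings(sock_t s, sock_timeout_info *out)
{
    assert(out != NULL);
#ifdef _WIN32
    /* Winsock reports SO_RCVTIMEO as a DWORD in milliseconds. */
    DWORD tmo = 0;
    int len = (int)sizeof tmo;
    if (getsockopt(s, SOL_SOCKET, SO_RCVTIMEO, (char *)&tmo, &len) != 0)
        return -1;
    out->recv_timeout_ms = (long)tmo;

    BOOL ka = FALSE;
    len = (int)sizeof ka;
    if (getsockopt(s, SOL_SOCKET, SO_KEEPALIVE, (char *)&ka, &len) != 0)
        return -1;
    out->keepalive_enabled = (ka != FALSE);
#else
    /* POSIX reports SO_RCVTIMEO as a struct timeval. */
    struct timeval tv = {0, 0};
    socklen_t len = sizeof tv;
    if (getsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, &len) != 0)
        return -1;
    out->recv_timeout_ms = (long)tv.tv_sec * 1000L + (long)tv.tv_usec / 1000L;

    int ka = 0;
    len = sizeof ka;
    if (getsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &ka, &len) != 0)
        return -1;
    out->keepalive_enabled = (ka != 0);
#endif
    return 0;
}
```

Logging these two values once per accepted connection would show whether a
receive timeout or keepalive is being set somewhere we are not aware of.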

Thank you again for the response; we now have a direction to check and
will probably tweak the timeouts on the application side.  We are mainly
concerned about the SSL_ERROR_SYSCALL we get quite often from
SSL_read/SSL_write, where the Windows error code is WSAETIMEDOUT.  Based on
blogs and searching, we have seen that OpenSSL quite often reports
SSL_ERROR_SYSCALL when a timeout is encountered (see
https://github.com/openssl/openssl/issues/12416 and similar posts).
When we restart our server application, everything gets reset and connections
get established again. We have looked at the Windows event logs on the
server, but they have not told us much.
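
A minimal sketch of how the two error values could be combined into one log
category, so that timeouts, resets and other failures can be counted
separately. The enum and function names are hypothetical; the numeric values
are the documented ones for SSL_ERROR_SYSCALL, WSAECONNRESET and
WSAETIMEDOUT, defined here only so the sketch compiles without the OpenSSL
and Winsock headers.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback definitions; the real headers take precedence if included. */
#ifndef SSL_ERROR_SYSCALL
#define SSL_ERROR_SYSCALL 5
#endif
#ifndef WSAECONNRESET
#define WSAECONNRESET 10054
#endif
#ifndef WSAETIMEDOUT
#define WSAETIMEDOUT 10060
#endif

/* Hypothetical log categories; not part of OpenSSL or Winsock. */
typedef enum {
    IO_FAIL_NOT_SYSCALL,     /* SSL error was not SSL_ERROR_SYSCALL */
    IO_FAIL_TIMED_OUT,       /* system error was WSAETIMEDOUT */
    IO_FAIL_PEER_RESET,      /* system error was WSAECONNRESET */
    IO_FAIL_NO_SYSTEM_ERROR, /* SSL_ERROR_SYSCALL but system error was 0 */
    IO_FAIL_OTHER            /* any other system error */
} io_failure_kind;

/* ssl_err is the value from SSL_get_error(); sys_err is the value from
   WSAGetLastError(), captured right after the failed call. */
io_failure_kind classify_io_failure(int ssl_err, int sys_err)
{
    if (ssl_err != SSL_ERROR_SYSCALL)
        return IO_FAIL_NOT_SYSCALL;
    switch (sys_err) {
    case 0:             return IO_FAIL_NO_SYSTEM_ERROR;
    case WSAETIMEDOUT:  return IO_FAIL_TIMED_OUT;
    case WSAECONNRESET: return IO_FAIL_PEER_RESET;
    default:            return IO_FAIL_OTHER;
    }
}

/* Short text for log lines; the order matches io_failure_kind. */
const char *io_failure_name(io_failure_kind kind)
{
    static const char *const names[] = {
        "not a syscall-level failure",
        "timed out",
        "connection reset by peer",
        "no system error recorded",
        "other system error"
    };
    assert((size_t)kind < sizeof names / sizeof names[0]);
    return names[kind];
}
```

After a failed SSL_read or SSL_write that returned ret, the call would look
like classify_io_failure(SSL_get_error(ssl, ret), WSAGetLastError()), with
both values read before any other socket or OpenSSL call on that thread.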

Thanks
Kamala

Kamala Ayyar

On Thu, Aug 19, 2021 at 6:23 PM Michael Wojcik <
Michael.Wojcik at microfocus.com> wrote:

> > From: openssl-users <openssl-users-bounces at openssl.org> On Behalf Of David Bowers via openssl-users
> > Sent: Wednesday, 18 August, 2021 16:38
>
> I don't think this is OpenSSL-related, but at this point it's not clear
> what the issue is.
>
> > . After maybe a few hours/days we see the clients dropping connections. The logs
> > indicate the SSL_Read or SSL_Write on the Server fails for a client with SSL_Error
> > number 5 (SSL_ERROR_SYSCALL) and the equivalent Windows error of WSATimeOut.  We
> > then observe the WSAECONNRESET as the Client closed connection.  We see this
> > behavior for multiple sites.
>
> I assume this is a Server-edition version of Windows and you're not trying
> to support that kind of connection load on a desktop edition.
>
> What's set in the Registry under
> HKLM\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters? In particular I'd
> be suspicious of SynAttackProtect and NetworkThrottlingIndex (which
> shouldn't be set on Server, but you never know).
>
> Many online references will suggest altering settings that affect the
> ephemeral-port space, such as TcpTimedWaitDelay, but those are irrelevant
> on the server side (since the connection tuples will use the server port,
> not an ephemeral port, for the server side).
>
> Many of the settings under the TCPIP/Performance key are undocumented.
> This page describes a number of them:
>
>
> https://forums.alliedmods.net/showpost.php?s=5fedba9ea66557ccea3bfee9e192aaf4&p=1744400&postcount=1
>
> It also discusses a number of netsh commands for TCP/IP tuning.
>
> > . The number of Clients disconnected starts increasing and we see the logs in the
> > Client where the server refuses any more connections from Clients (10061 -
> > WSAECONNREFUSED). There is nothing to indicate this state in the server logs. Our
> > theory is the backlog is filled and Server refusing further connections.
>
> That's possible. Windows, unlike BSD-based stacks, sends an RST when the
> listen queue is full. (BSD-based stacks simply discard the inbound SYN,
> which is a better choice for a number of reasons. Windows did this wrong
> and stubbornly refuses to change.)
>
> You say you're specifying a backlog of 500 in the call to listen().
> Microsoft recommends just passing SOMAXCONN and letting the provider set a
> "suitable" value. Worth trying.
>
> But this appears to be a secondary issue. The primary one seems to be that
> for whatever reason you get an increasing number of conversation failures,
> and then the client's aggressive retry behavior means you get a cascade of
> connection flooding until the listen queues are full. The clients ought to
> be changed to use random backoff or another strategy that avoids flooding
> the server, but at this point that seems to be addressing a symptom rather
> than the underlying problem.
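
One way the random backoff suggested above could look on the client side is
exponential backoff with full jitter. This is a sketch under assumptions:
the function name, the base delay and the cap are illustrative and not taken
from our clients, and the caller supplies the random number.

```c
#include <assert.h>
#include <stdint.h>

/* Returns how long to wait, in milliseconds, before reconnect attempt
   number `attempt` (0 for the first retry). The upper bound doubles with
   each attempt, starting at base_ms and never exceeding cap_ms; the delay
   is that bound scaled by unit_random, which the caller draws uniformly
   from [0, 1). */
uint32_t backoff_delay_ms(unsigned attempt, uint32_t base_ms,
                          uint32_t cap_ms, double unit_random)
{
    assert(base_ms > 0 && cap_ms >= base_ms);
    assert(unit_random >= 0.0 && unit_random < 1.0);

    uint64_t ceiling = base_ms;
    /* Double the bound once per attempt, stopping once the cap is reached. */
    for (unsigned i = 0; i < attempt && ceiling < cap_ms; i++)
        ceiling *= 2;
    if (ceiling > cap_ms)
        ceiling = cap_ms;

    return (uint32_t)(unit_random * (double)ceiling);
}
```

With a base of 500 ms and a cap of 60000 ms, the third retry waits somewhere
between 0 and 4 seconds, so clients that lost their connections at the same
moment do not all reconnect at the same moment.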
>
> > . We are trying to find why we get the SSL_Read/SSL_Write Error as it is a Blocking
> > socket. We cannot switch to a non-blocking socket due to platform and application
> > limitation
>
> You said you're specifically getting SSL_ERROR_SYSCALL from SSL_read and
> SSL_write. That has nothing to do with whether the socket is in blocking
> mode -- system calls on blocking sockets can certainly return errors. I
> don't understand this question.
>
> There are any number of reasons why the server's ability to handle this
> load might be compromised. Network congestion, bufferbloat, load on the CPU
> or NIC (particularly if TCP offload is enabled to the NIC), contention for
> DMA, other application I/O, .... Years ago, I had one customer who had
> similar problems which turned out to be due to intermittent failures in a
> bad DRAM module in the server. Distributed computing is inherently fragile.
>
> But in my experience, this sort of problem is most often due to one or
> more of:
>
> - Application-logic errors or design issues. Are you multiplexing all
> these blocking sockets, or running a thread per conversation, or something
> else?
>
> - Middlebox problems. Routers, load balancers, firewall appliances, and so
> forth frequently cause issues.
>
> - Application firewalls and other "anti-malware" software (much of which
> is rubbish) running on the server.
>
> WSAETIMEDOUT on a send operation, assuming OpenSSL didn't need to do a
> receive under the covers for TLS-protocol reasons, could mean that a client
> app isn't doing its receives and consequently its receive window has
> filled; or it could mean that something is interfering with the delivery of
> network traffic in one direction or the other.
>
> WSAETIMEDOUT on a receive, though, again assuming OpenSSL didn't need to
> send under the covers, implies that something set a receive timeout on the
> socket, or that a keepalive wasn't responded to in the required time. Are
> you setting a receive timeout (typically with SO_RCVTIMEO)? Are you setting
> SO_KEEPALIVE? What about SO_KEEPALIVE_VALS? If you're not setting
> SO_KEEPALIVE_VALS, what are KeepAliveTime and KeepAliveInterval set to in
> the Registry? (See the MSDN docs for SO_KEEPALIVE.)
>
> Has the system administrator analyzed the Windows event logs and the
> network statistics? Has anyone looked at network traces when the problem is
> occurring?
>
> --
> Michael Wojcik
>