
I have a Node.js (v14.19.1) process that collects data and then inserts it into VictoriaMetrics by making an HTTP request to /api/v1/import. I'm using axios to send the requests. The vast majority of the time this works perfectly. Occasionally, the client errors out with "socket hang up" and the following stack, and the data never makes it to the database:

Error: socket hang up
    at connResetException (internal/errors.js:639:14)
    at Socket.socketOnEnd (_http_client.js:499:23)
    at Socket.emit (events.js:412:35)
    at Socket.emit (domain.js:475:12)
    at endReadableNT (internal/streams/readable.js:1334:12)
    at processTicksAndRejections (internal/process/task_queues.js:82:21)

I have sometimes seen what seems to be the same problem manifest as an EPIPE instead. It is always EPIPE when the database and the data collection process are on the same virtual machine, but EPIPE also sometimes occurs when they are on different machines:

Error: write EPIPE
    at afterWriteDispatched (internal/stream_base_commons.js:156:25)
    at writeGeneric (internal/stream_base_commons.js:147:3)
    at Socket._writeGeneric (net.js:798:11)
    at Socket._write (net.js:810:8)
    at writeOrBuffer (internal/streams/writable.js:358:12)
    at Socket.Writable.write (internal/streams/writable.js:303:10)
    at ClientRequest._writeRaw (_http_outgoing.js:351:17)
    at ClientRequest._send (_http_outgoing.js:327:15)
    at ClientRequest.end (_http_outgoing.js:849:10)

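For context, each write looks roughly like this (simplified; the host, port and metric shape are placeholders rather than my real setup):

    const axios = require('axios');

    // One JSON line per time series, POSTed to VictoriaMetrics' /api/v1/import.
    const line = JSON.stringify({
      metric: { __name__: 'some_metric', host: 'collector-1' },
      values: [42],
      timestamps: [Date.now()],
    });

    axios.post('http://victoria-metrics:8428/api/v1/import', line)
      .then(() => console.log('write ok'))
      .catch((err) => console.error(err)); // occasionally rejects with "socket hang up"
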
Logging has shown that the socket hang up tends to occur around 5-6s after the request starts, sometimes up to 15s or so; this doesn't seem to be long enough to trigger any connection timeouts I'm aware of. The errors seem to be more prevalent at times when more and/or larger data write requests are being made, but sometimes they will occur with small writes as well.

I increasingly suspect that the problem is with the client (or the virtual machine it's running on) rather than the database. My evidence for this is:

  1. VictoriaMetrics has logged no errors corresponding to the socket hang ups
  2. At one point, a dev machine that was running the data collection process had its root hard drive fill up, which created a cascade of problems, and during this time every attempt that machine made to write to the database resulted in a socket hang up (or EPIPE). In that situation it can't have been the database's fault because the database was on a different, unaffected virtual machine, and other clients were writing to it without error.

That second point, especially, makes me suspect that I'm occasionally hitting some node or OS (Ubuntu 20.04) limit on connections/files/network traffic/etc. that is causing some connections to be closed prematurely by the client (or its OS) during periods of higher traffic. What can I do to confirm or refute that suspicion? Or what other root cause might explain what I'm observing?

All of the virtual machines involved are running on a private network (which has otherwise been reliable) in the Azure cloud.

mactyr
    Does the problem go away if you make your requests with `{agent: new https.Agent({maxSockets: 10})}`, or some other limit on the number of parallel sockets? – Heiko Theißen Jul 22 '22 at 15:15
  • @HeikoTheißen, this is a good idea, I'm trying it in my dev environment. There are often multiple processes writing to the db so this doesn't give me a firm limit on the total number of sockets being used on the machine, but at least it ensures that any given process isn't taking a huge amount at once. For what it's worth, it looks like the axios config option for this is actually `httpsAgent:` (or `httpAgent:`, depending on the protocol) not simply `agent:`. – mactyr Jul 22 '22 at 21:07

1 Answer


I seem to have solved this by making two changes to my client code:

  1. As suggested by @HeikoTheißen in the comments, I added the following to my axios config to limit the number of sockets per process:

         httpAgent: new http.Agent({maxSockets: 10}),
         httpsAgent: new https.Agent({maxSockets: 10})

  2. I used the axios-retry package to retry any write requests that failed due to network errors, with `retryCondition: axiosRetry.isNetworkError`. I had to set this explicitly because, by default, axios-retry conservatively retries only requests that are guaranteed to be idempotent, but in this case I needed to retry POSTs (and I know that in this context it's safe to do so).
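
Roughly, the combined setup looks like this (a simplified sketch; the retry count and the use of a shared axios instance are illustrative, not exactly what I deployed):

    const axios = require('axios');
    const axiosRetry = require('axios-retry');
    const http = require('http');
    const https = require('https');

    // Shared axios instance with a cap on concurrent sockets per protocol.
    const client = axios.create({
      httpAgent: new http.Agent({ maxSockets: 10 }),
      httpsAgent: new https.Agent({ maxSockets: 10 }),
    });

    // Retry network-level failures (including POSTs), not just idempotent requests.
    axiosRetry(client, {
      retries: 3,
      retryCondition: axiosRetry.isNetworkError,
    });

    // All writes then go through `client.post(...)`.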

I'm not sure which of these changes solved the problem or if they both helped, since I implemented them simultaneously. But I haven't seen any EPIPE or "socket hang up" errors since deploying the changes.

mactyr