I have a Node (v14.19.1) process that collects data and then inserts it into VictoriaMetrics by making an HTTP request to /api/v1/import. I'm using axios to send the requests. The vast majority of the time this works perfectly, but occasionally the client errors out with "socket hang up", the data never makes it to the database, and I get the following stack:
Error: socket hang up
at connResetException (internal/errors.js:639:14)
at Socket.socketOnEnd (_http_client.js:499:23)
at Socket.emit (events.js:412:35)
at Socket.emit (domain.js:475:12)
at endReadableNT (internal/streams/readable.js:1334:12)
at processTicksAndRejections (internal/process/task_queues.js:82:21)
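For reference, the write path boils down to something like this (a stripped-down sketch; the endpoint and the use of axios are real, but the host, payload shape, and helper name are illustrative):

const axios = require('axios');

// Stripped-down sketch of the write path; the real code batches collected
// data before sending. VM_HOST is a placeholder for the actual address,
// and 8428 is just the default VictoriaMetrics port.
async function writeToVictoriaMetrics(jsonLines) {
  // /api/v1/import takes newline-delimited JSON lines in the request body
  await axios.post('http://VM_HOST:8428/api/v1/import', jsonLines.join('\n'));
}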
Sometimes what seems to be the same problem manifests as an EPIPE instead. When the database and the data-collection process are on the same virtual machine the error is always EPIPE rather than a socket hang up, but EPIPE also sometimes shows up when they are on different machines:
Error: write EPIPE
at afterWriteDispatched (internal/stream_base_commons.js:156:25)
at writeGeneric (internal/stream_base_commons.js:147:3)
at Socket._writeGeneric (net.js:798:11)
at Socket._write (net.js:810:8)
at writeOrBuffer (internal/streams/writable.js:358:12)
at Socket.Writable.write (internal/streams/writable.js:303:10)
at ClientRequest._writeRaw (_http_outgoing.js:351:17)
at ClientRequest._send (_http_outgoing.js:327:15)
at ClientRequest.end (_http_outgoing.js:849:10)
Logging shows that the socket hang up tends to occur around 5-6 seconds after the request starts, and sometimes as long as 15 seconds or so; that doesn't seem long enough to trigger any connection timeout I'm aware of. The errors seem more prevalent when more and/or larger write requests are being made, but they sometimes occur with small writes as well.
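For what it's worth, here is where a client-side timeout could be configured if one were in play; as far as I know my code doesn't set any of these, and the values below are purely illustrative:

const axios = require('axios');

// Illustrative only -- not my actual configuration. axios applies no request
// timeout by default (timeout: 0), so an explicit client-side timeout
// shouldn't be what is cutting these requests off.
const client = axios.create({
  baseURL: 'http://VM_HOST:8428', // placeholder host, default VictoriaMetrics port
  timeout: 30000,                 // per-request timeout in milliseconds
});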
I increasingly suspect that the problem is with the client (or the virtual machine it's running on) rather than the database. My evidence for this is:
- VictoriaMetrics has logged no errors corresponding to the socket hang ups
- At one point, a dev machine that was running the data collection process had its root hard drive fill up, which created a cascade of problems, and during this time every attempt that machine made to write to the database resulted in a socket hang up (or EPIPE). In that situation it can't have been the database's fault because the database was on a different, unaffected virtual machine, and other clients were writing to it without error.
That second point, especially, makes me suspect that I'm occasionally hitting some Node or OS (Ubuntu 20.04) limit on connections, file descriptors, network traffic, etc. that causes some connections to be closed prematurely by the client (or its OS) during periods of higher traffic. What can I do to confirm or refute that suspicion? Or what other root cause might explain what I'm observing?
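One thing I'm considering is routing all writes through a single http.Agent and dumping its socket counts (plus basic process stats) whenever a request fails, roughly like this (a sketch only; the agent, helper names, and keep-alive setting are things I would be adding, not existing code):

const http = require('http');
const axios = require('axios');

// Diagnostic sketch (not existing code): send every write through one agent
// so its socket bookkeeping can be inspected at the moment of failure.
// keepAlive differs from Node's default agent; it's here so freeSockets is meaningful.
const agent = new http.Agent({ keepAlive: true });
const client = axios.create({ httpAgent: agent });

function agentStats() {
  const count = (byHost) =>
    Object.values(byHost).reduce((n, list) => n + list.length, 0);
  return {
    activeSockets: count(agent.sockets),   // sockets currently in use
    freeSockets: count(agent.freeSockets), // idle keep-alive sockets
    queuedRequests: count(agent.requests), // requests waiting for a socket
  };
}

async function writeWithDiagnostics(url, body) {
  try {
    return await client.post(url, body);
  } catch (err) {
    // Capture client-side state at the moment of failure, to see whether
    // socket or memory pressure correlates with the socket hang ups.
    console.error('write failed:', err.code || err.message, agentStats(), {
      rssBytes: process.memoryUsage().rss,
    });
    throw err;
  }
}

Would that kind of instrumentation actually catch a file-descriptor or connection limit being hit, or is there a better way to check?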
All of the virtual machines involved are running on a private network (which has otherwise been reliable) in the Azure cloud.