I have a Windows 10 C++ program using ZeroMQ that aborts very often on the same group of computers due to assertion failures.

The assert statement is buried deep in the libzmq code.

On other machines, the same program runs fine without those problems (but in all fairness, that's with different OS build numbers and program configurations).

The assertion failure seems to happen because internal zeromq (socket and/or pipe based) connection(s)/handles get unexpectedly closed.

What could possibly cause something like that?

More information:

The assertion failure seems to have something to do with the channels/mailboxes that ZeroMQ uses for internal signaling. In older versions of the library this works with several loopback TCP sockets while modern versions rely on a solution involving IOCP (I/O completion ports).

Here's a long-standing and possibly related issue in which the original author himself describes a similar crash that happened to him:

https://github.com/zeromq/libzmq/issues/1108

Working with the crash dumps of our application, I see that the assertion usually fires right after an attempt to read from a socket (or socket file descriptor?). The read or receive call fails and then the library panics.

So, suddenly a socket handle no longer seems valid. Examples of errors that I see are "The resource is temporarily unavailable" and things like "Invalid handle/parameter".

Can it be that something or someone is forcefully closing the socket for us? What could be causing this behavior?

This happens for an old version of zeromq (4.0.10) as well as a modern one (4.3.5). This leads me to believe that the fault is somewhere else if such different implementations fail roughly the same way.

When trying to reproduce the problem I can trigger a similar assertion failure for 4.0.x by manually force closing an internal TCP connection that ZeroMQ uses with TCPView. The resulting assertion failure is instant and the crash dump looks identical to what happens in the wild.

But the modern version doesn't seem to use loopback sockets, so I couldn't close the "private" connections there. Maybe it uses pipes or Unix-style sockets instead (which I have heard is now possible on Windows 10).

For a moment I considered ephemeral port exhaustion as the reason for all this trouble, but that alone doesn't make sense to me: I wouldn't expect the OS to force close existing connections; they should keep working. Only new connections should fail in that case.

E. van Putten
  • It could be your program closing garbage handles and sometimes closing a valid one that belongs to ZeroMQ – user253751 Feb 15 '21 at 16:18
  • Interesting insight! Now this program doesn't have a garbage collector, but in theory there could still be code - in the same process - that is closing handles by mistake. – E. van Putten Feb 15 '21 at 16:56
  • Garbage doesn't mean garbage-collected, it means nonsense, gibberish, probably because of uninitialized variables – user253751 Feb 15 '21 at 17:00
  • Ah yes, sorry - I misread your comment. You are right, that could indeed be the case if the numbers somehow magically align (Murphy!) on those machines and not on others. – E. van Putten Feb 15 '21 at 17:02
  • This is more a topic for a bug report, not a programming question for SO. Really, if you manage to trigger that assertion without abusing the API in any way, then add an example program to the existing or perhaps a new bug report. – Ulrich Eckhardt Feb 15 '21 at 18:59
  • I'm sorry, this does indeed look a lot like a bug report. Except that question really was "what could forcefully close a socket". Probably doesn't belong in a zmq bug report either, since the cause might be external to the library. – E. van Putten Feb 15 '21 at 19:26
  • https://stackoverflow.com/questions/3178952/is-it-safe-to-double-close-a-handle-using-closehandle – E. van Putten Feb 17 '21 at 13:13

1 Answer

As @user253751 suggested, the culprit seems to be a particular piece of code in the application that closes the same HANDLE twice. A serious bug in our code, not ZeroMQ!

On Windows, the value of a closed handle can be reused immediately, so anything that is opened right after the first CloseHandle is at risk of being unexpectedly closed when the buggy second CloseHandle strikes.
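The reuse hazard can be replayed in a small, portable simulation. This is only a sketch: `ToyHandleTable` and `socket_survives_double_close` are hypothetical names, the table is not the Windows handle table, and the "lowest free slot first" policy is an assumption chosen to make the reuse deterministic. The real allocation order is an OS implementation detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a per-process handle table. This is NOT the Windows API; it
// only mimics one property: a freed handle value may be handed out again.
class ToyHandleTable {
public:
    using Handle = std::size_t;

    // Returns the lowest free slot, so a just-closed value is reused first
    // (an assumption of this model, not a documented OS guarantee).
    Handle open() {
        for (Handle h = 0; h < in_use_.size(); ++h) {
            if (!in_use_[h]) {
                in_use_[h] = true;
                return h;
            }
        }
        in_use_.push_back(true);
        return in_use_.size() - 1;
    }

    // Closes whatever currently lives at h. The table cannot tell whether the
    // caller still owns that value. Returns false if the slot is already free.
    bool close(Handle h) {
        if (!is_open(h)) return false;
        in_use_[h] = false;
        return true;
    }

    bool is_open(Handle h) const { return h < in_use_.size() && in_use_[h]; }

private:
    std::vector<bool> in_use_;
};

// Replays the double-close sequence; returns true if the second object survived.
bool socket_survives_double_close() {
    ToyHandleTable table;
    const ToyHandleTable::Handle timer = table.open();
    table.close(timer);  // first, legitimate close of the timer

    // Stands in for a socket that ZeroMQ opens right afterwards.
    const ToyHandleTable::Handle signaler = table.open();
    assert(signaler == timer);  // in this toy model the value is recycled at once

    table.close(timer);  // buggy second close through the stale timer value
    return table.is_open(signaler);
}
```

In this model `socket_survives_double_close()` returns `false`: the second `close(timer)` succeeds because the value is valid again, only it now names a different object.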

E. van Putten
  • Exactly the same issue here. Would you please post some of your sample bug code? How could you close the handle while the socket is managed by ZMQ internally? – MasterBeta Nov 15 '22 at 11:05
  • The "rightful owner" of the HANDLE (ZMQ) couldn't prevent another part of the application closing the same HANDLE. – E. van Putten Nov 15 '22 at 11:45
  • Thanks for the reply. Could you please provide a few more details? By saying "closing the same HANDLE", did you mean you accidentally zmq_closed a zmq_socket object which is being used by other threads? – MasterBeta Nov 16 '22 at 02:27
  • I mean that I have an unrelated (non ZMQ!) object. In my case it's a TIMER. A timer is also identified by a HANDLE. Handles are just values assigned by the OS. Let's suppose I close the timer handle. The OS is now free to reuse that "ex timer" handle for any new object the process asks for. Now ZMQ opens a new socket. It gets the "recycled" ex timer HANDLE value. See where this is going? What if I now accidentally try to close the TIMER object again! Well, the handle now identifies a ZMQ object and no longer the expected TIMER object as before. It unintentionally closes the socket! – E. van Putten Nov 16 '22 at 12:50
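One common way to rule out the stale second close described in the last comment is to give each handle a single owner that closes it once and then forgets the value. The following is a minimal sketch, not the code from the application in question: `UniqueHandle` is a hypothetical name, and the handle type, the "invalid" sentinel and the closer are parameters so the example stays portable.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Owns one handle-like value and closes it at most once (hypothetical helper).
template <typename Handle>
class UniqueHandle {
public:
    UniqueHandle(Handle handle, Handle invalid, std::function<void(Handle)> closer)
        : handle_(handle), invalid_(invalid), closer_(std::move(closer)) {
        assert(closer_ && "a closer function is required");
    }

    ~UniqueHandle() { reset(); }

    // Non-copyable: two copies would mean two closes of the same value.
    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // Closes the handle if it is still owned, then forgets the value so that
    // any later reset() and the destructor become no-ops.
    void reset() {
        if (handle_ != invalid_) {
            closer_(handle_);
            handle_ = invalid_;
        }
    }

    Handle get() const { return handle_; }
    bool valid() const { return handle_ != invalid_; }

private:
    Handle handle_;
    Handle invalid_;
    std::function<void(Handle)> closer_;
};
```

On Windows, such a wrapper might be instantiated as `UniqueHandle<HANDLE>` with a closer that calls `CloseHandle`; the right sentinel depends on the creating API (some return `NULL` on failure, others `INVALID_HANDLE_VALUE`). Calling `reset()` twice, or calling it and then letting the destructor run, invokes the closer only once, so a recycled value that now belongs to ZeroMQ is left alone.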