I have an issue with an Aeron application that uses an ArchivingMediaDriver and spy subscriptions to record a publication, where the corresponding recording at times becomes inactive, i.e. recording is stopped.
The application consists of two processes, typically running on two separate hosts, a sender and a receiver.
The sender creates an ArchivingMediaDriver and sets spiesSimulateConnection(true) on the media driver's context. Before it creates a publication, it scans the archive for previous recordings of the same stream. If a previous recording is found, a new publication is created with its initial position equal to the stop position of that recording, and archive.extendRecording(...) (archive being the archive client in the sender process) is called to extend the recording with everything published through the new publication. If no previous recording is found, a new publication is created without setting the initial position, and archive.startRecording(...) is called, again to record all data published through the new publication.
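To make the sender's start-or-extend decision concrete, here is a minimal sketch in plain Java. It does not call Aeron itself: Descriptor, Plan, and plan(...) are hypothetical names that stand in for the recording descriptors returned by the archive listing and for the subsequent startRecording / extendRecording call. Picking the recording with the highest id when several match is also an assumption, not necessarily what the application does.

```java
import java.util.List;

class RecordingPlan {
    /** Hypothetical stand-in for the recording descriptor fields used here. */
    public static final class Descriptor {
        public final long recordingId;
        public final int streamId;
        public final long stopPosition;

        public Descriptor(long recordingId, int streamId, long stopPosition) {
            this.recordingId = recordingId;
            this.streamId = streamId;
            this.stopPosition = stopPosition;
        }
    }

    public enum Action { START_NEW, EXTEND_EXISTING }

    /** What the sender should do before creating its publication. */
    public static final class Plan {
        public final Action action;
        public final long recordingId;     // NO_RECORDING when starting fresh
        public final long initialPosition; // only meaningful for EXTEND_EXISTING

        Plan(Action action, long recordingId, long initialPosition) {
            this.action = action;
            this.recordingId = recordingId;
            this.initialPosition = initialPosition;
        }
    }

    public static final long NO_RECORDING = -1L;

    /** Chooses between starting a new recording and extending the latest one for the stream. */
    public static Plan plan(List<Descriptor> found, int streamId) {
        Descriptor latest = null;
        for (Descriptor d : found) {
            // Assumption: the highest recording id for the stream is the one to extend.
            if (d.streamId == streamId && (latest == null || d.recordingId > latest.recordingId)) {
                latest = d;
            }
        }
        if (latest == null) {
            // No previous recording: create the publication without an initial
            // position, then call archive.startRecording(...).
            return new Plan(Action.START_NEW, NO_RECORDING, 0L);
        }
        // Previous recording found: create the publication at its stop position,
        // then call archive.extendRecording(...) with this recording id.
        return new Plan(Action.EXTEND_EXISTING, latest.recordingId, latest.stopPosition);
    }

    public static void main(String[] args) {
        List<Descriptor> found = List.of(
            new Descriptor(3, 1001, 4096),
            new Descriptor(5, 1001, 8192));
        Plan p = plan(found, 1001);
        System.out.println(p.action + " recordingId=" + p.recordingId
            + " initialPosition=" + p.initialPosition);
    }
}
```

In the real sender, the initial position from the plan would go into the publication's channel URI and the recording id into the extendRecording call.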
The subscriber connects to the archive on the sender host, finds the correct recording for the stream it wants to subscribe to, and then uses a ReplayMerge instance to replay any previously unseen messages, merging with the live stream once caught up.
This setup has been working fine most of the time, but occasionally I've seen the recording on the sender side stop unexpectedly. If the subscriber is consuming the live stream from the publication, this doesn't immediately create a problem, but since the archive here serves as a safety measure, I can't really afford for it to stop. Also, since some data will not be written to the recording once it has stopped, this may create problems further down the line, should the receiver restart and query the (incomplete) archive. Naturally, one could also try to restart the recording session if a stopped recording is detected, but this may still result in loss of data if the publication has already advanced beyond the recording's stop position.
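The gap I am worried about can be expressed as a small check, sketched here in plain Java under some assumptions. The two positions would come from publication.position() and archive.getRecordingPosition(recordingId), which as far as I know returns AeronArchive.NULL_POSITION (-1) when no recording is active. RecordingHealth, classify, unrecordedBytes, and the maxLagBytes threshold are hypothetical names for illustration only.

```java
class RecordingHealth {
    /** Assumed to mirror AeronArchive.NULL_POSITION, i.e. "no active recording". */
    static final long NOT_ACTIVE = -1L;

    enum Status { IN_SYNC, LAGGING, STOPPED }

    /** Classifies the recording by comparing it with the publication position. */
    static Status classify(long publicationPosition, long recordingPosition, long maxLagBytes) {
        if (recordingPosition == NOT_ACTIVE) {
            return Status.STOPPED;
        }
        long lag = publicationPosition - recordingPosition;
        return lag <= maxLagBytes ? Status.IN_SYNC : Status.LAGGING;
    }

    /** Bytes published beyond the stop position of a stopped recording. */
    static long unrecordedBytes(long publicationPosition, long stopPosition) {
        return Math.max(0L, publicationPosition - stopPosition);
    }

    public static void main(String[] args) {
        // Hypothetical positions: the publication is 2048 bytes past the stop position.
        System.out.println(classify(10_240L, NOT_ACTIVE, 1_024L));
        System.out.println(unrecordedBytes(10_240L, 8_192L));
    }
}
```

Polling such a check periodically would at least detect the stop early, though it does not recover data already published past the stop position.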
I would be grateful for any hints as to how to best investigate, mitigate, or resolve the issue. At first, I thought this may be a problem of the underlying storage being slow, but I wasn't able to find any proof for that. Which metrics could I look at to find out what the cause of the issues actually is? What's a good strategy to make sure a recording works as expected and is in sync with the corresponding publication?
Any help appreciated.
Thanks,
Jens