What is the best and safe solution to remove expired archive files after taking snapshots, and also to periodically remove invalid snapshot files?
2 Answers
You can use the class RecordingLog to inspect the various entries (logs, snapshots) belonging to the Consensus Module (CM) and the clustered service(s).
Once you have identified which snapshots are safe to delete (according to your business requirements), you can delete the corresponding recordings from the archive and invalidate the entries in the recording log.
The next thing you have to do is purge the CM log up to the position of the oldest CM snapshot you kept. There is a snippet in the Aeron project that you can take inspiration from: io.aeron.test.cluster.TestCluster#purgeLogToLastSnapshot().
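To make the inspection step concrete, here is a minimal sketch of what it could look like. The cluster directory path is a placeholder, the RecordingLog/AeronArchive calls are the public Aeron API as I know it (constructor and method signatures vary a bit between Aeron versions), and the retention rule is deliberately simple: keep the most recent snapshot round and purge the snapshot recordings of older rounds. Treat it as a starting point, not a drop-in tool.

```java
import io.aeron.archive.client.AeronArchive;
import io.aeron.cluster.RecordingLog;

import java.io.File;
import java.util.List;

public class SnapshotRecordingCleanup
{
    public static void main(final String[] args) throws Exception
    {
        // Placeholder: the consensus module's cluster directory on this node.
        final File clusterDir = new File("/var/lib/aeron-cluster");

        // RecordingLog(dir, createNew) is the recent two-arg constructor; older Aeron
        // versions take only the directory. RecordingLog is AutoCloseable in recent versions.
        try (AeronArchive archive = AeronArchive.connect();
            RecordingLog recordingLog = new RecordingLog(clusterDir, false))
        {
            final List<RecordingLog.Entry> entries = recordingLog.entries();

            // Find the log position of the most recent snapshot; snapshot entries with a
            // lower position belong to older snapshot rounds (CM plus each service).
            long latestSnapshotPosition = -1;
            for (final RecordingLog.Entry entry : entries)
            {
                if (RecordingLog.ENTRY_TYPE_SNAPSHOT == entry.type)
                {
                    latestSnapshotPosition = Math.max(latestSnapshotPosition, entry.logPosition);
                }
            }

            for (final RecordingLog.Entry entry : entries)
            {
                if (RecordingLog.ENTRY_TYPE_SNAPSHOT == entry.type &&
                    entry.logPosition < latestSnapshotPosition)
                {
                    // Delete the snapshot recording from the archive. The matching entry in
                    // the recording log must also be invalidated (see RecordingLog's API) so
                    // the CM never tries to load a snapshot that no longer exists.
                    archive.purgeRecording(entry.recordingId);
                    System.out.println(
                        "purged snapshot recording " + entry.recordingId +
                        " at logPosition " + entry.logPosition);
                }
            }
        }
    }
}
```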

There are two system tests within the aeron repo that demonstrate different ways to purge Aeron archive data and reclaim disk space.
In the first test, a connection to the Aeron cluster is made and the log is purged up to the latest snapshot: the test is shouldRecoverWhenFollowerWithInitialSnapshotAndArchivePurgeThenIsMultipleTermsBehind in the ClusterTest class.
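As a rough illustration of what that first approach amounts to, the sketch below purges archive segments of the cluster log up to the latest snapshot position, similar in spirit to io.aeron.test.cluster.TestCluster#purgeLogToLastSnapshot. The cluster directory path is a placeholder, and the RecordingLog/AeronArchive signatures shown may differ between Aeron versions:

```java
import io.aeron.archive.client.AeronArchive;
import io.aeron.cluster.RecordingLog;

import java.io.File;
import org.agrona.collections.MutableLong;

public class PurgeLogToLastSnapshot
{
    public static void main(final String[] args) throws Exception
    {
        // Placeholder: the consensus module's cluster directory on this node.
        final File clusterDir = new File("/var/lib/aeron-cluster");

        try (AeronArchive archive = AeronArchive.connect();
            RecordingLog recordingLog = new RecordingLog(clusterDir, false))
        {
            // Position of the most recent snapshot and the recording id of the cluster log.
            long lastSnapshotPosition = -1;
            long logRecordingId = -1;
            for (final RecordingLog.Entry entry : recordingLog.entries())
            {
                if (RecordingLog.ENTRY_TYPE_SNAPSHOT == entry.type)
                {
                    lastSnapshotPosition = Math.max(lastSnapshotPosition, entry.logPosition);
                }
                else if (RecordingLog.ENTRY_TYPE_TERM == entry.type)
                {
                    logRecordingId = entry.recordingId;
                }
            }

            if (lastSnapshotPosition < 0 || logRecordingId < 0)
            {
                return; // nothing to purge
            }

            // Fetch the recording geometry needed to align the purge position to a segment boundary.
            final MutableLong recStartPosition = new MutableLong();
            final MutableLong recTermBufferLength = new MutableLong();
            final MutableLong recSegmentFileLength = new MutableLong();
            archive.listRecording(
                logRecordingId,
                (controlSessionId, correlationId, recordingId, startTimestamp, stopTimestamp,
                    startPosition, stopPosition, initialTermId, segmentFileLength, termBufferLength,
                    mtuLength, sessionId, streamId, strippedChannel, originalChannel, sourceIdentity) ->
                {
                    recStartPosition.set(startPosition);
                    recTermBufferLength.set(termBufferLength);
                    recSegmentFileLength.set(segmentFileLength);
                });

            // Segment files can only be removed on segment boundaries, so round the snapshot
            // position down to the base position of the segment that contains it.
            final long purgePosition = AeronArchive.segmentFileBasePosition(
                recStartPosition.get(),
                lastSnapshotPosition,
                (int)recTermBufferLength.get(),
                (int)recSegmentFileLength.get());

            final long deletedSegments = archive.purgeSegments(logRecordingId, purgePosition);
            System.out.println(
                "deleted " + deletedSegments + " segment file(s) from log recording " + logRecordingId);
        }
    }
}
```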
The other way is described in the test class StartFromTruncatedRecordingLogTest, where the RecordingLog is inspected and unnecessary files are removed. My understanding is that in this case the cluster should be shut down while the recording log is adjusted and then replaced.
However, once the data in the archive is purged, it's not clear to me how a fresh new cluster node without any data could join the cluster. When I try to do that, I get the following error:
io.aeron.archive.client.ArchiveException: ERROR - response for correlationId=270, error: requested replay start position=0 is less than recording start position=134217728 for recording 0

- In my case, I run these purge tasks in each node during startup: after the ClusteredMediaDriver is started, and before clustered services are launched. Regarding your last point, I've just stumbled on the same error while starting a fresh node, didn't have time to investigate yet. – pcdv Jul 31 '23 at 08:06
- @pcdv Just wondering, why do you run them during startup? Couldn't you run them when the clustered services are already launched, and make this a maintenance process that runs while the cluster is running? – Anton Aug 08 '23 at 10:44
- Don't remember exactly :) I guess it's because I didn't want to mess with the CM while it is running. In my case nodes are restarted frequently enough that this is not an issue. I suppose purging old data while the cluster is running could be achieved if needed. – pcdv Aug 17 '23 at 08:59