You are describing a distributed computing setup.
But it is racy.
When some cooperating partner sends you a "start" event,
they probably ought to await your "ack" before embarking
on anything adventurous. Otherwise they won't know whether
you (a) heard and (b) recorded the event.
That is to say, the lack of an acknowledgement invites
races and lost events whenever a host can reboot at random.
Ideally that partner would persist such events to
stable storage on their own, before beginning an
expensive operation.
Given that there are apparently no ACKs, it sounds
like your app needs to persist each event as soon as feasible,
either across the LAN to a redundant host, or to a filesystem.
A simple approach is to:
- receive the partner's message
- write() a line, appending to a text file
- fsync() to flush it from memory to disk / NVRAM / SSD
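A minimal sketch of that append-and-sync step in Python, assuming
an illustrative file name and a one-line record per event:

    import os

    def log_event(log, record: str, sync: bool = True) -> None:
        """Append one event record; optionally force it to stable storage."""
        log.write(record + "\n")
        log.flush()                  # userspace buffer -> kernel page cache
        if sync:
            os.fsync(log.fileno())   # kernel page cache -> disk / NVRAM / SSD

    with open("events.log", "a", encoding="utf-8") as log:
        log_event(log, "start 1234")   # durable before we ack the partner

Only once fsync() returns is the event durable, and only then is it
safe to send the ack.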
When you receive "completed" events, and when
you execute "terminate" commands, log those as well.
There is no need to fsync immediately: presumably other events
arrive with some frequency, and the next synchronous write will
flush all pending log records out to disk before long.
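With the log_event sketch above, that just means passing sync=False;
a later sync=True write carries the pending records down with it:

    log_event(log, "completed 1234", sync=False)   # rides along with the
    log_event(log, "terminate 1235", sync=False)   # next synchronous write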
Upon restarting, seek to near the end of the file
and replay the logged events: set up a timeout counter
for each started task, and cancel it when the log
reveals that the task already finished. Some timeouts may fire
immediately after you finish reading the log,
because they are stale. Presumably it is harmless to
issue a terminate(task_id) command for a task that
already did a normal exit.
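A sketch of that replay, reading the whole log for simplicity rather
than seeking to the tail, and reusing the illustrative record format
from above; terminate() here is a hypothetical stand-in for however
your app actually kills a task:

    import threading

    TIMEOUT_SECS = 300.0   # illustrative fixed value

    def terminate(task_id: str) -> None:
        """Hypothetical: kill the task; assumed harmless if it already exited."""
        print("terminate", task_id)

    def replay(path: str = "events.log") -> set[str]:
        """Rebuild the set of tasks that started but never finished."""
        pending: set[str] = set()
        with open(path, encoding="utf-8") as log:
            for line in log:
                event, task_id = line.split()
                if event == "start":
                    pending.add(task_id)       # needs a timeout counter
                elif event in ("completed", "terminate"):
                    pending.discard(task_id)   # already done; cancel it
        return pending

    for task_id in replay():
        # Without start times in the log we re-arm a full timeout here;
        # log a timestamp per record and subtract the elapsed time to let
        # stale tasks fire immediately instead.
        threading.Timer(TIMEOUT_SECS, terminate, args=[task_id]).start()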
An alternative strategy, which depends less heavily
on accurate logging, is to query the status of all
currently running jobs when you come back up.
Set a conservative timeout in the somewhat distant
future, and hope you stay up long enough to see that
time arrive.
Or use extra information, such as each task's
size and start_time, to pick more sensible timeout values.
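For example, a remaining-time estimate scaled from task size and
clipped by the time already elapsed; the per-megabyte rate and the
floor here are pure assumptions to be tuned for your workload:

    import time

    def pick_timeout(size_bytes: int, start_time: float,
                     secs_per_mb: float = 2.0, floor_secs: float = 60.0) -> float:
        """Estimate how much longer a task may reasonably need."""
        expected = max(floor_secs, secs_per_mb * size_bytes / 1e6)
        remaining = expected - (time.time() - start_time)
        return max(0.0, remaining)   # 0 means the task is already overdue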
Consider using Kafka, Redis, or a similar distributed
message broker to coordinate your cluster's actions,
rather than relying on a filesystem or an RDBMS.
There are low-latency solutions available which
do a good job of balancing Consistency,
Availability, and Partition tolerance.
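For instance, Redis Streams offers an append-only log with consumer
groups that track per-message acknowledgement, so unacked events are
redelivered after a crash. A minimal sketch with the redis-py client,
where the stream, group, and consumer names are all illustrative:

    import redis

    r = redis.Redis(decode_responses=True)   # assumes a local Redis >= 5.0

    # Producer side: append an event to the stream.
    r.xadd("events", {"type": "start", "task": "1234"})

    # Consumer side: create the group once, then read and explicitly ack.
    try:
        r.xgroup_create("events", "workers", id="0", mkstream=True)
    except redis.ResponseError:
        pass   # group already exists

    for _stream, messages in r.xreadgroup("workers", "me", {"events": ">"},
                                          count=10):
        for msg_id, fields in messages:
            print("handling", fields)            # your processing goes here
            r.xack("events", "workers", msg_id)  # ack only after processing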