
Is there a Python library that provides (generic) job state journaling and recovery functionality?

Here's my use case:

  1. data received to start a job
  2. job starts processing
  3. job finishes processing

I then want to be able to restart a job from just after step 1 if the process aborts or the power fails. A job would write an entry to a journal file when it starts and mark that entry done when it completes. On startup, the process would check the journal file for uncompleted jobs and use the journal data to restart any job that did not complete.

What Python tools exist to solve this? (Or are there other Python approaches to fault tolerance and recovery for critical jobs that must complete?) I know a job queue like RabbitMQ would work quite well for this case, but I want a solution that doesn't need an external service. I searched PyPI for "journaling" and didn't find much.

A library for this seems like it would be useful, since a journal involves several concerns that are hard to get right but that a library could handle (such as multiple async writes, file splitting and truncating, etc.).
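To make the idea concrete, here is a minimal sketch of the kind of journal described above, using only the standard library. The class name `FileJournal` and its methods are hypothetical names made up for illustration, not an existing package. It assumes a single writer process and JSON-serializable job data.

```python
import json
import os


class FileJournal:
    """Append-only JSON-lines journal (hypothetical helper, not a published library)."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        # Append one record and force it to disk before returning,
        # so the entry survives a crash right after this call.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def start(self, job_id, payload):
        # Record the data needed to restart the job.
        self._append({"op": "start", "id": job_id, "payload": payload})

    def done(self, job_id):
        # Mark the job as completed.
        self._append({"op": "done", "id": job_id})

    def pending(self):
        """Return {job_id: payload} for jobs that started but never finished."""
        jobs = {}
        if not os.path.exists(self.path):
            return jobs
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    # Skip a line that was only partially written before a crash.
                    continue
                if record["op"] == "start":
                    jobs[record["id"]] = record["payload"]
                elif record["op"] == "done":
                    jobs.pop(record["id"], None)
        return jobs
```

On startup the process would call `pending()` and re-run whatever it returns. This sketch deliberately leaves out the hard parts mentioned above: it never truncates or rotates the file, it does not coordinate multiple concurrent writers, and it does not repair a partially written last line before appending after it.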

Garrett Motzner
  • You could add a thread that checks for journal entries. We are doing something similar; we refer to it as recovery logic. Basically, all our jobs go into what you call a "journal." There is a separate thread that processes the tasks/jobs in this directory. If a task/job fails, we don't remove it from the directory. We have a recovery thread that runs every 'x' seconds to reprocess any failed jobs. Threading is really easy in Python, and it seems like you already have a lot of the business rules in place, so if it's feasible for your use case I would just add some threads. – spyder1329 Apr 22 '20 at 22:54
  • @spyder1329 Makes sense, but that doesn't solve the stated problem of writing to and reading from the journal in the first place (I'll clarify that in the question). There are a lot of concerns when it comes to journaling, especially when you might have multiple async writes to a journal. There are also things like file rotation, the number of files to write to, etc. – Garrett Motzner Apr 22 '20 at 23:14
  • Have you looked into log4python? – spyder1329 Apr 23 '20 at 00:11
  • @spyder1329 That's a good idea, using a logger for journaling, since they are somewhat similar... However, most loggers don't have a mechanism to discard _specific_ completed logs or to truncate based on which jobs have been completed. Ideally I'd like to use a journaling-specific library. But if you can give an example of how to use a logger to journal data, that would be awesome! – Garrett Motzner Apr 23 '20 at 00:32
  • I'm looking forward to seeing an answer too, but I would put more trust in a popular and widely deployed messaging service than in a specialized library. – VPfB May 08 '20 at 09:38
  • @VPfB I would too, but my particular use case would be better served with an 'embedded' solution. Honestly, I'm very surprised this isn't a more common thing, though. – Garrett Motzner May 08 '20 at 16:12

1 Answer


I think you can do this using either crontab or APScheduler. I think the latter has all the features you need, but even with cron you can do something like:

1: schedule a process to run at a specific interval

2: that process checks whether a job is already running

3: if no job is running, it starts one

4: the job keeps working and saves its state to disk or a DB

5: if the job fails or finishes, the next scheduled run goes through step 3 again

APScheduler is likely what you're looking for. Its feature list is extensive, and it's extensible if it doesn't fulfill your requirements.

  • I don't think that's quite the right solution, because none of those focus on data storage, which is what I want. Those solutions focus on running the job, which is not the problem in this case. What isn't handled is resilience in the case of catastrophic failure, and that's what I want to solve. However, that does give me an idea... SQLite might be a good solution here. It handles paging, storage, and deletion, and might make a good journal file... – Garrett Motzner May 13 '20 at 17:33
  • Basically, the part I need solved is mostly step 4, saving to the drive, _without_ an external service. And it seems like those tools rely on other services. – Garrett Motzner May 13 '20 at 17:39
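
The SQLite idea from the comments could look roughly like the sketch below, using the standard-library `sqlite3` module, so no external service is involved. The class name `SqliteJournal`, the `jobs` table, and its columns are hypothetical choices made for illustration. It assumes JSON-serializable job data and a single connection.

```python
import json
import sqlite3


class SqliteJournal:
    """Job journal backed by a local SQLite file (hypothetical helper)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def start(self, job_id, payload):
        # The connection used as a context manager commits on success
        # and rolls back if an exception is raised.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO jobs (id, payload) VALUES (?, ?)",
                (job_id, json.dumps(payload)),
            )

    def done(self, job_id):
        # Completed jobs are simply deleted; SQLite reuses the freed pages.
        with self.conn:
            self.conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))

    def pending(self):
        """Return {job_id: payload} for every job not yet marked done."""
        rows = self.conn.execute("SELECT id, payload FROM jobs").fetchall()
        return {job_id: json.loads(payload) for job_id, payload in rows}
```

On startup, anything returned by `pending()` is a job that was started but never completed, so it can be re-run from its stored payload. How durable this is against power loss depends on SQLite's journal and synchronous settings, which this sketch leaves at their defaults.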