0

Given an SQL table with timestamped records. Every once in a while an application App0 does something like foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp), i.e. periodically it consumes a batch of "fresh" data, processes it sequentially and commits success after each record and then just sleeps for reasonable time (to accumulate yet another batch). That works perfect with single instance.. however how to load balance multiple ones?

In exactly the same environment App0 and App1 concurrently competite for the fresh data. The idea is that ready query executed by the App0 must not overlay with the same read query executed by the App1 - such that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?

P.S. Postgres is preferred option.

Zazaeil
  • 3,900
  • 2
  • 14
  • 31

1 Answers1

1

The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:

  • all Apps somehow know what the last certainTimestamp is and it is the same for all Apps whenever they start a DB query.
  • while App0 is processing, say the 10 records it found when it started working, new records come in. That means, the pile of new records with respect to certainTimestamp grows.
  • when App1 (or any further App) starts, the should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
  • yet, if on App fails/crashes, the unfinished records should be picked the next time another App runs.

This can be achieved by locking records in many SQL databases.

One way to go about this is to use

 SELECT ... FOR UPDATE SKIP LOCKED

This statement, in combination with the range-selection since(certainTimestamp) selects and locks all records matching the condition and not being locked currently. Whenever a new App instance runs this query, it only gets "what's left" to do and can work on that.

This solves the problem of "overlay" or working on the same data.

What's left is then the definition and update of the certainTimestamp. In order to keep this answer short, I don't go into that here and just leave the pointer to the OP that this needs to be thought through properly to avoid situations where e.g. a single record that cannot be processed for some reason keeps the certainTimestamp at a permanent minimum.

Lars Br.
  • 9,949
  • 2
  • 15
  • 29