
I have a single database table on a relational database. Data will be loaded into it. I then want to have multiple servers processing that data concurrently (I don't want to have only one server running at a time). E.g. each server will:

  • Query for a fixed number of rows
  • Do some work for each row retrieved
  • Update each row to show it has been processed

How do I ensure that each row is only processed once? Note I don't want to pre-assign a row of data to a server; I'm designing for high availability, so the solution should keep running if one or more servers go down.

The solution I've gone for so far is as follows:

  1. The table has three columns: LOCKED_BY (VARCHAR), LOCKED_AT (TIMESTAMP) and PROCESSED (CHAR)
  2. Each server starts by attempting to "pseudo-lock" some rows by doing:
UPDATE THE_TABLE
SET LOCKED_BY = $servername,
    LOCKED_AT = CURRENT_TIMESTAMP
WHERE (LOCKED_BY IS NULL OR CURRENT_TIMESTAMP - LOCKED_AT > $timeout)
AND PROCESSED = 'N'

i.e. try to "pseudo-lock" rows that aren't locked already, or where the pseudo-lock has expired. Only do this for unprocessed rows.
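A sketch of the same statement with a batch-size cap (see also the note at the end): on DB2, which the comments mention, one way is to route the update through a keyed subquery. The 10-row limit and 5-minute timeout are illustrative assumptions, and FETCH FIRST inside a subselect needs a reasonably recent DB2 version:

UPDATE THE_TABLE
SET LOCKED_BY = $servername,
    LOCKED_AT = CURRENT_TIMESTAMP
WHERE PRIMARYKEYCOL IN (
    -- candidate rows: unlocked, or lock older than the assumed 5-minute timeout
    SELECT PRIMARYKEYCOL
    FROM THE_TABLE
    WHERE (LOCKED_BY IS NULL
           OR LOCKED_AT < CURRENT_TIMESTAMP - 5 MINUTES)
      AND PROCESSED = 'N'
    FETCH FIRST 10 ROWS ONLY  -- assumed batch size
)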

  3. More than one server may have attempted this at the same time, so the current server then queries to find out whether its "pseudo-lock" attempt succeeded:
SELECT * FROM THE_TABLE
WHERE LOCKED_BY = $servername
AND PROCESSED = 'N'

If any rows are returned, the server can process them.
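A refinement worth considering (a sketch; the 5-minute timeout is the same assumed value as above): also check that the pseudo-lock is still fresh, which narrows, though does not eliminate, the window in which a stalled server and its replacement could both process the same rows:

SELECT * FROM THE_TABLE
WHERE LOCKED_BY = $servername
  -- only keep rows whose pseudo-lock has not yet expired
  AND LOCKED_AT > CURRENT_TIMESTAMP - 5 MINUTES
  AND PROCESSED = 'N'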

  4. Once the processing has been done, the row is updated:
UPDATE THE_TABLE SET PROCESSED = 'Y' WHERE PRIMARYKEYCOL = $pk

Note: the pseudo-lock UPDATE in step 2 should ideally limit the number of rows it updates (as in the sketch under step 2).
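In the same spirit, the final update could be guarded on LOCKED_BY, so a server whose pseudo-lock expired mid-processing does not mark a row that another server has since re-claimed (a sketch; an update count of 0 tells the server it lost the row):

UPDATE THE_TABLE
SET PROCESSED = 'Y'
WHERE PRIMARYKEYCOL = $pk
  -- guard: only mark the row if this server still holds the pseudo-lock
  AND LOCKED_BY = $servername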

BigOakTree
  • Which DBMS are you using? – jarlh May 13 '22 at 09:20
  • I think you can fetch rows using a stored procedure with a locking mechanism. Take a look at this [question](https://stackoverflow.com/questions/3662766/sql-server-how-to-lock-a-table-until-a-stored-procedure-finishes) – mtdot May 13 '22 at 11:00
  • I think you're asking about idempotency. One way you can go about this is by hooking your servers up to a Redis DB, saving all changes to the database from any server to the Redis DB, and reading from there before making changes – maestro.inc May 13 '22 at 12:38
  • What you initially describe is parallel/distributed processing (workload being distributed across multiple processing engines) but then you say you are designing for High Availability. Parallel processing and HA are 2 different things - so are you looking for a solution for one or the other or both? – NickW May 13 '22 at 14:32
  • jarlh, I'm using DB2 currently, though I may move to a different RDBMS in the future. – BigOakTree May 13 '22 at 15:06
  • NickW, I have six servers. If one or more fail, I want the remaining servers to continue processing. I also want to make use of the processing capacity available. I could solve the contention by designing a scheme where only one server processes data at a time, but then I'm not using the available capacity. – BigOakTree May 13 '22 at 15:09
  • Have you considered leveraging some sort of centralized queueing mechanism in your architecture? Generally, a database isn't designed to provide airtight guarantees around idempotency in these types of distributed-operation use cases. A design that revolves around multiple "worker" nodes attempting to access/update the same rows in an RDBMS will generally run the risk of race conditions and other odd edge cases. – esqew May 13 '22 at 15:30

1 Answer


If you are open to changing platform then I would suggest moving to a modern, cloud-based solution like Snowflake. This will do what you want but in the background and by default - so you don't need to know what it's doing or how it's doing it (unless you want to).

This may come across as patronising, which is not my intention, but what you are attempting (in the way you are attempting it) is very complex; so if you don't already know how to do it, someone telling you how to do it is not going to give you the skills/experience you need to implement it successfully.

NickW