I have a single database table on a relational database. Data will be loaded into it. I then want to have multiple servers processing that data concurrently (I don't want to have only one server running at a time). E.g. each server will:
- Query for a fixed number of rows
- Do some work for each row retrieved
- Update each row to show it has been processed
How do I ensure that each row is only processed once? Note I don't want to pre-assign a row of data to a server; i'm designing for high availability so the solution should keep running if one or more servers goes down.
The solution I've gone for so far is as follows:
- The table has three columns: LOCKED_BY (VARCHAR), LOCKED_AT (TIMESTAMP) and PROCESSED (CHAR)
- Each server starts by attempting to "pseudo-lock" some rows by doing:
UPDATE THE_TABLE
SET LOCKED_BY= $servername,
LOCKED_AT = CURRENT_TIMESTAMP,
WHERE (LOCKED_BY = null OR (CURRENT_TIMESTAMP- LOCKED_AT > $timeout)
AND PROCSSED = 'N'
i.e. try to "pseudo-lock" rows that aren't locked already or where the pseudo-lock as expired. Only do this for unprocessed rows.
- More than one server may have attempted this at the same time. The current server needs to query to find out if it was successful in the "pseudo-lock":
SELECT * FROM THE_TABLE
WHERE LOCKED_BY = $server_name
AND PROCESSED = 'N'
If any rows are returned the server can process them.
- Once the processing has been done the row is updated
UPDATE THE_TABLE SET PROCESSED = 'Y' WHERE PRIMARYKEYCOL = $pk
Note: the update statement should ideally limit the number of rows updated.