1

What I want to do is to execute the same script every few minutes with cron.

The script needs to process some data read from the database, so obviously I need it to work on a different row each time.

My concept was to use row locking to make sure each instance works on a different row, but it doesn't seem to work that way. Is it even possible to use row locks this way? Any other solutions?

Example:

while ($c < $limit) {
    $result = mysql_query("SELECT * FROM table WHERE ... LIMIT 1 FOR UPDATE");
    $data = mysql_fetch_assoc($result);

    // (process data)

    mysql_query("UPDATE table SET value = 'something', timestamp = NOW() WHERE id = " . (int) $data['id']);
    $c++;
}

Basically what I need is: SCRIPT1 reads R1 from the table; SCRIPT2 reads R2 (the next non-locked row matching the criteria).

EDIT: Let's say, for example, that: 1) the table stores a list of URLs; 2) the script checks whether each URL responds, and updates its status (and timestamp) in the database.

  • Couldn't you use transactions? – crush Feb 07 '13 at 20:03
  • 1
    are these rows independent of each other? might be better if you had one master script make up a list of rows to handle, and slice that list up to a bunch of separate sub-scripts to do the processing on. – Marc B Feb 07 '13 at 20:05
  • are you scared the instance you are running each 5 minutes is going to last longer than 5 minutes or you want to explicitly run X instances? – fsw Feb 07 '13 at 20:23

2 Answers

4

This should essentially be treated as two separate problems:

  1. Finding a job for each worker to process. Ideally this should be very efficient and pre-emptively avoid failures in step 2, which comes next.
  2. Ensuring that each job gets processed at most once or exactly once. No matter what happens the same job should not be concurrently processed by multiple workers. You may want to ensure that no jobs are lost due to buggy/crashing workers.

Both problems have multiple workable solutions. I'll give some suggestions about my preference:

Finding a job to process

For low-velocity systems it should be sufficient just to look for the oldest un-processed job. You do not want to take the job yet, just identify it as a candidate. This could be:

SELECT id FROM jobs ORDER BY created_at ASC LIMIT 1

(Note that this will process the oldest job first, in FIFO order, and we assume that rows are deleted after processing.)

Claiming a job

In this simple example, this would be as simple as (note I am avoiding some potential optimizations that will make things less clear):

BEGIN;
SELECT * FROM jobs WHERE id = <id> FOR UPDATE;
DELETE FROM jobs WHERE id = <id>;
COMMIT;

If the SELECT returns our job when queried by id, we've now locked it. If another worker has already taken this job, an empty set will be returned, and we should look for a different job. If two workers are competing for the same job, they will block each other from the SELECT ... FOR UPDATE onwards, such that the previous statements are universally true. This will allow you to ensure that each job is processed at most once. However...

Processing a job exactly once

A risk in the previous design is that a worker takes a job, fails to process it, and crashes. The job is now lost. Most job processing systems therefore do not delete the job when they claim it; instead they mark it as claimed by some worker and implement a job-reclaim system.

This can be achieved by keeping track of the claim itself using either additional columns in the job table, or a separate claim table. Normally some information is written about the worker, e.g. hostname, PID, etc., (claim_description) and some expiration date (claim_expires_at) is provided for the claim e.g. 1 hour in the future. An additional process then goes through those claims and transactionally releases claims which are past their expiration (claim_expires_at < NOW()). Claiming a job then also requires that the job row is checked for claims (claim_expires_at IS NULL) both at selection time and when claiming with SELECT ... FOR UPDATE.
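Concretely, the claim columns and queries might look something like the following sketch. The column names claim_description and claim_expires_at are taken from the text above; the 1-hour expiration, the 'host:pid' value, and the exact column types are illustrative assumptions:

-- Illustrative claim columns added to the jobs table:
ALTER TABLE jobs
  ADD COLUMN claim_description VARCHAR(255) NULL,
  ADD COLUMN claim_expires_at DATETIME NULL;

-- Claiming a job, but only if it is currently unclaimed:
BEGIN;
SELECT * FROM jobs WHERE id = <id> AND claim_expires_at IS NULL FOR UPDATE;
UPDATE jobs
  SET claim_description = 'host:pid',
      claim_expires_at = NOW() + INTERVAL 1 HOUR
  WHERE id = <id> AND claim_expires_at IS NULL;
COMMIT;

-- Run periodically to release claims past their expiration:
UPDATE jobs
  SET claim_description = NULL, claim_expires_at = NULL
  WHERE claim_expires_at < NOW();

As before, if the SELECT ... FOR UPDATE returns no row, another worker holds the claim and you should move on to a different candidate.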

Note that this solution still has problems: If a job is processed successfully, but the worker crashes before successfully marking the job as completed, we may eventually release the claim and re-process the job. Fixing that requires a more advanced system which is left as an exercise for the reader. ;)

jeremycole
  • 2,741
  • 12
  • 15
1

If you are going to read each row once, and only once, then I would create an is_processed column and simply update that column on the rows that you've processed. Then you can simply query for the first row that has is_processed = 0.
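In SQL that approach might look like the following sketch, where the table name and column types are illustrative assumptions:

-- One-time schema change:
ALTER TABLE urls ADD COLUMN is_processed TINYINT(1) NOT NULL DEFAULT 0;

-- Each run of the script:
SELECT * FROM urls WHERE is_processed = 0 LIMIT 1;
-- ... process the row ...
UPDATE urls SET is_processed = 1 WHERE id = <id>;

Note that without a transaction or lock between the SELECT and the UPDATE, two concurrent scripts can still pick up the same row.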

PickYourPoison
  • 108
  • 2
  • 8
jhansen
  • 284
  • 2
  • 10
  • If you're gonna down vote, it would be courteous to explain why. – jhansen Feb 07 '13 at 20:33
  • 1) There's still a slight chance the same row will be read twice. 2) What if the script crashes? The row stays marked as processed... After all I think I'll just update the `timestamp` after the select, since I'm selecting based on `timestamp <= yesterday` – giveupyourname Feb 11 '13 at 10:39