I have a MySQL database table that I need to process. It takes about 1 second to process 3 rows (due to the CURL connections I need to make for each row), so I need to fork the PHP script to finish in a reasonable time (since I will process up to 10,000 rows in one batch).

I'm going to run 10-30 processes at once, and obviously I need some way to make sure that processes are not overlapping (in terms of which rows they are retrieving and modifying).

From what I've read, there are three ways to accomplish this. I'm trying to decide which method is best for this situation.

Option 1: Begin a transaction and use SELECT ... FOR UPDATE, limiting the number of rows each process retrieves. Save the data to an array. Update the selected rows with a status flag of "processing". Commit the transaction, do the processing, and then update the selected rows to a status of "finished".
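
Roughly what I imagine Option 1 looking like (just a sketch; I'm assuming PDO, and the table/column names like items, status, etc. are placeholders):

    $pdo->beginTransaction();

    // Lock a batch of pending rows so other processes can't grab them
    $stmt = $pdo->query(
        "SELECT id, text FROM items WHERE status = 'pending' LIMIT 100 FOR UPDATE"
    );
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    $ids = array_map(function ($r) { return (int) $r['id']; }, $rows);
    $idList = implode(',', $ids);

    if ($ids) {
        // Mark the locked rows as claimed, then commit to release the locks
        $pdo->exec("UPDATE items SET status = 'processing' WHERE id IN ($idList)");
    }
    $pdo->commit();

    // ... process $rows via CURL ...

    if ($ids) {
        $pdo->exec("UPDATE items SET status = 'finished' WHERE id IN ($idList)");
    }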

Option 2: Update a certain number of rows with a status flag of "processing" and the process ID. Select all rows for that process ID and flag. Work with the data like normal. Update those rows and set the flag to "finished".
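
Something like this is what I have in mind for Option 2 (again a sketch; PDO assumed, table/column names are placeholders):

    $pid = getmypid();

    // Atomically claim a batch of rows by tagging them with this process's ID
    $pdo->exec("UPDATE items SET status = 'processing', worker_pid = $pid
                WHERE status = 'pending' LIMIT 100");

    // Fetch only the rows this process claimed
    $stmt = $pdo->query(
        "SELECT id, text FROM items WHERE status = 'processing' AND worker_pid = $pid"
    );
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // ... process $rows via CURL ...

    $pdo->exec("UPDATE items SET status = 'finished'
                WHERE status = 'processing' AND worker_pid = $pid");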

Option 3: Set a LIMIT ... OFFSET ... clause for each process's SELECT query, so that each process gets unique rows to work with. Then store the row IDs and perform an UPDATE when done.
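
And a sketch of Option 3 (PDO assumed; $processNumber would be passed to each process, and the names are placeholders):

    // Each process is launched with a distinct number: 0, 1, 2, ...
    $offset = $processNumber * 1000;

    $stmt = $pdo->query("SELECT id, text FROM items ORDER BY id LIMIT 1000 OFFSET $offset");
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // ... process $rows via CURL ...

    $idList = implode(',', array_map(function ($r) { return (int) $r['id']; }, $rows));
    if ($idList !== '') {
        $pdo->exec("UPDATE items SET status = 'finished' WHERE id IN ($idList)");
    }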

I'm not sure which option is the safest. I think Option 3 seems simple enough, but I wonder whether there is any way it could fail. Option 2 also seems very simple, but I'm not sure whether the locking caused by the UPDATE will slow everything down. Option 1 seems like the best bet, but I'm not very familiar with FOR UPDATE and transactions, and could use some help.

UPDATE: For clarity, I currently have just one file, process.php, which selects all the rows and POSTs the data to a third party via cURL one by one. I'd like to fork within this file so the 10,000 rows can be split among 10-30 child processes.
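
Roughly the forking I have in mind (assuming the pcntl extension is available; postRowViaCurl() is just a placeholder for the existing per-row work):

    // Split the rows into ~20 roughly equal chunks, one per child process
    $chunks = array_chunk($allRows, (int) ceil(count($allRows) / 20));

    $children = array();
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("Could not fork\n");
        } elseif ($pid === 0) {
            // Child: open its own DB connection, then process its chunk
            foreach ($chunk as $row) {
                postRowViaCurl($row); // placeholder for the existing per-row CURL work
            }
            exit(0);
        }
        $children[] = $pid; // parent keeps track of child PIDs
    }

    // Parent waits for all children to finish
    foreach ($children as $pid) {
        pcntl_waitpid($pid, $status);
    }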

Trevor Gehman
  • Do you need to spin up a process for each job? Is each row an independent job? – Brad Feb 09 '13 at 20:41
  • Each row contains an ID and some text, which will be POSTED to a third-party via Curl. The response back will also be put into a different table. Currently I have just one while loop that cycles through all the rows returned and does this one-by-one. I'd like to spread it out over 10-30 processes. – Trevor Gehman Feb 09 '13 at 20:48
  • Why not use multi-curl and leave it in a single process? – Brad Feb 09 '13 at 21:16
  • I actually did not know that existed. It looks like one limitation I may run into is that I'm actually using a PHP library API to do the cURL commands (connecting to a third-party web service)... so to use multi-curl I'd have to ditch that and write the Curl commands myself. This may be a little too complex for me. Is this a much better way to accomplish what I want, compared to multiple processes? – Trevor Gehman Feb 09 '13 at 23:09
  • Yes, much better, more efficient, and actually much easier than dealing with external processes. Check out ParallelCurl to make it easy: https://github.com/petewarden/ParallelCurl – Brad Feb 10 '13 at 00:52
  • Why aren't you getting the whole 10000 rows in one query and work on them in the same process. What you do with them later is a different story. – Assaf Karmon Feb 12 '13 at 21:03

2 Answers


Another way of handling this is to put the IDs that you need to process into a Redis queue (list). You can then pop/push items from the list. When the list's length is zero, you know that there is nothing left to process.
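
A rough sketch of the idea, assuming the phpredis extension (the queue key name and processRow() are placeholders):

    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // Producer: push every row ID onto the queue
    foreach ($rowIds as $id) {
        $redis->rPush('rows_to_process', $id);
    }

    // Worker (run 10-30 of these in parallel): pop IDs until the list is empty
    while (($id = $redis->lPop('rows_to_process')) !== false) {
        processRow($id); // placeholder for the per-row CURL + UPDATE work
    }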

There is also the php-resque project, which implements some of the job queuing you want to do.

https://github.com/chrisboulton/php-resque

Daniel

I ended up using the curl_multi functions (as proposed by Brad) to accomplish this task. I divided the array of rows into groups of 100 using array_chunk() and then configured a curl_multi task to process each group. I started out using ParallelCurl, but it did not end up working correctly, so I just coded the curl_multi calls myself.
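
A rough sketch of the approach (not my exact code; the endpoint URL and field names are placeholders):

    $chunks = array_chunk($rows, 100, true);

    foreach ($chunks as $chunk) {
        $mh = curl_multi_init();
        $handles = array();

        foreach ($chunk as $rowId => $row) {
            $ch = curl_init('https://api.example.com/endpoint'); // placeholder URL
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, array('text' => $row['text']));
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$rowId] = $ch;
        }

        // Run the whole chunk of requests in parallel
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);
        } while ($running > 0);

        foreach ($handles as $rowId => $ch) {
            $response = curl_multi_getcontent($ch);
            // save $response and mark row $rowId as finished here
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }

        curl_multi_close($mh);
    }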

It went from taking almost 2 hours to process 10,000 curl connections to taking just a few minutes.

Trevor Gehman
  • Hi Trevor, I have the same question. Can you please post a link to an example or a script? That would be highly appreciated! – mongotop Feb 27 '13 at 03:54