I've been building PHP tools with a MySQL backend. Using multi-curl I can run dozens, and sometimes hundreds, of instances of each script concurrently on just one old PC (which now runs the latest Ubuntu desktop). Each machine, and each VM on each machine, is capable of running dozens of concurrent instances.

As a result I rapidly run out of MySQL connections. When I upped the number of connections in MySQL, I froze the quad-core machine I was using to host the MySQL server. Since then I've moved to a scheme where a dedicated DB manages requests to the other DBs, and cron jobs that run for hours maintain a few open connections per machine. At this point I'm still in the light-testing phase and haven't attempted hundreds of simultaneous requests to see how the MySQL server handles it.

Instead I've run into another issue that I've tried several solutions for, all with roughly the same results. The issue is duplicate data being entered into the DB because concurrent instances of the tool run on different machines. For example, I feed a big chunk of new data into my DB, and that data needs a few tasks performed on it, such as linking an email address to a user profile. I've chosen to have a background cron process handle this a little at a time, and each machine does its share every 5 minutes. Since they all start at the same time, they all grab the same email data and filter it using the same logic about which email addresses are higher priority. Then each machine begins to do what needs to be done with the emails it picked. Because they all kick off at the same time, they frequently grab the exact same data and attempt the same linkings. This causes a primary-key exception in the linkage table but not in some of the other tables, so I end up with duplicate data in some tables and occasionally incomplete linkage.

I've tried randomizing the SELECT data so the machines work on different data sets. However, this is of course sub-optimal with respect to what I'd like the tool to accomplish, so I need to run the tool more often for certain tasks to be completed in the desired time frame. I've also tried creating a flag in the DB that designates that one server is actively using the data so all other servers should wait. This works sometimes, but sometimes two machines poll that flag at the same moment and both proceed. Since we are talking about multiple machines, I don't think flock will work, and from what I've read, locking the table in the DB may not be a good solution either.
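To make the race concrete, the flag check boils down to a separate read and a separate write, roughly like the sketch below (the table and column names are just placeholders):

```php
<?php
// Sketch of the flag scheme: a hypothetical "job_flags" table with columns
// (name, status). Because the SELECT and the UPDATE are separate statements,
// two machines can both read 'FREE' before either one writes 'OCCUPIED'.
$db = new mysqli('db-host', 'user', 'pass', 'tools');

$res = $db->query("SELECT status FROM job_flags WHERE name = 'email_linking'");
$row = $res->fetch_assoc();

if ($row && $row['status'] === 'FREE') {
    // Race window: another machine may have just read 'FREE' as well.
    $db->query("UPDATE job_flags SET status = 'OCCUPIED' WHERE name = 'email_linking'");
    // ... do the work, then set the flag back to 'FREE' ...
}

// One common way to close that window is to claim the flag in a single
// statement and check how many rows changed -- only one machine can win:
//   UPDATE job_flags SET status = 'OCCUPIED'
//    WHERE name = 'email_linking' AND status = 'FREE';
// then test $db->affected_rows === 1 before doing the work.
```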

So I've come to StackOverflow for advice rather than continuing to beat my head against a wall.

== Update ==

Gearman looked like a great solution, so I gave it the thumbs up as the answer. However, I was never able to get it working with my PHP install. I attempted several sets of suggestions/instructions found online, many of which never even got Gearman installed. The suggestion of `apt-get install gearman-server` did get Gearman installed as far as I could tell, i.e. no errors were generated and gearmand would run. However, when I attempted to use Gearman clients and workers in my scripts, I got errors about not being able to find those classes.

After that I added a gearman.ini file to the correct directory. It had one line: extension=gearman.so. This led to a different error where PHP told me it couldn't find gearman.so. I attempted to use `sudo find / -name gearman.*` to locate gearman.so, with no luck; it returned C files but not gearman.so.

At this time I'd really, really, really like to implement Gearman, but since I can't get it to work I'm stuck with my hack-and-slash PHP code for implementing my distributed toolset. My "solution" to date is to create a flag that is set to "OCCUPIED" when one instance of the tool is doing something that would cause duplicate-data issues. I have created 5 copies of the same flag with postfixes _1, _2, ..., so that 5 instances can run concurrently. (I use the _1, _2, ... to create an offset into the returned DB data so that no 2 tool instances are working on the same data set. In other words, if a SELECT statement would return 100+ rows and I only work on 10 at a time, then _1 works on rows 1-10, _2 works on rows 11-20, and so on. Not ideal, but it should allow multiple servers to operate on the DB simultaneously without creating duplicate data.)
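Concretely, the _N offset scheme looks roughly like this (a sketch; the table, columns, and batch size are placeholders):

```php
<?php
// Sketch of the _1, _2, ... offset scheme: instance N works only on its own
// 10-row slice of the SELECT, so no two instances touch the same rows.
$instance  = isset($argv[1]) ? (int) $argv[1] : 1;   // e.g. "php worker.php 2" for _2
$batchSize = 10;
$offset    = ($instance - 1) * $batchSize;

$db  = new mysqli('db-host', 'user', 'pass', 'tools');
$sql = sprintf(
    "SELECT id, email FROM pending_emails ORDER BY priority DESC, id LIMIT %d OFFSET %d",
    $batchSize,
    $offset
);

$res = $db->query($sql);
while ($row = $res->fetch_assoc()) {
    // link $row['email'] to its user profile here
}
```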

The DB API tool times out if it doesn't see a result in 30 seconds. The problem now is frequent timeouts when attempting to grab those flag states...

Gabe Spradlin

1 Answer

You should use a queue system: insert each data set as an item in the queue, and have your distributed server setup pull jobs from the queue and execute queries against the database. Using a queue will prevent multiple workers from working on the same job.

Take a look at Gearman for an example.
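For example, with the PECL gearman extension loaded, a bare-bones worker looks roughly like this (the job name and payload fields are placeholders):

```php
<?php
// worker.php -- run one (or several) of these per machine.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);   // host/port of gearmand

// Register a handler for the "link_email" job. gearmand hands each job to
// exactly one worker, which removes the duplicate-insert race.
$worker->addFunction('link_email', function (GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    // ... link $data['email'] to $data['user_id'] in MySQL here ...
});

while ($worker->work());   // block and process jobs
```

And the submitting side enqueues one job per row:

```php
<?php
// client.php -- whatever ingests new data submits the jobs.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('link_email', json_encode(array('email' => 'a@b.c', 'user_id' => 42)));
```

The workers are long-running processes, so the per-machine cron and flag polling described in the question aren't needed.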

Mike B
  • I'll take a look, but right now what I have is a DB for a queue. Then crons pull tasks and execute them. However, multiple machines may put the same task into the queue, and eliminating duplicate entries in the queue can get pretty complicated and slow, since I may want some entries to be duplicated (or nearly so). Thanks for the reference, I'll go look right now. – Gabe Spradlin Mar 13 '12 at 19:56
  • Gearman looks wonderful. Unfortunately I've been attempting to get it installed (Ubuntu 10.04 desktop) for the past hour without success. – Gabe Spradlin Mar 13 '12 at 21:09
  • @gabe I believe it's in the aptitude repos: `apt-get install gearman-server` (tried with 11.10, not sure about 10.04) – Mike B Mar 13 '12 at 22:23
  • Thanks, that worked better than the other 5 suggestions for installing Gearman. However, I've added the extension=gearman.so to the ini files but it can't find it. And `sudo find / -name gearman.*` doesn't find it either. Do I still need to compile it? – Gabe Spradlin Mar 13 '12 at 23:11