I've been building PHP tools with a MySQL backend. Using multi-curl I can run dozens, and sometimes hundreds, of instances of each script concurrently on just one old PC (which now runs the latest Ubuntu desktop). Each machine, and each VM on each machine, is capable of running dozens of concurrent instances.
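For context, the per-machine fan-out is plain multi-curl, roughly like this (the URL and count are placeholders, not my real endpoints):

    <?php
    // Fan out N concurrent requests to the tool with curl_multi.
    $urls = array_fill(0, 50, 'http://example.com/tool.php');

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers concurrently until every handle finishes.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);   // block until any handle has activity
        }
    } while ($running > 0 && $status === CURLM_OK);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);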
As a result I rapidly run out of MySQL connections. When I upped the number of connections in MySQL, I froze the quad-core machine I was using to host the MySQL server. Since then I've moved to a scheme where a dedicated DB manages requests for the other DBs, and cron jobs that run for hours maintain a few open connections per machine. At this point I'm still in the light testing phase and haven't attempted hundreds of simultaneous requests to see how the MySQL server handles it.
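The long-running connection holders are sketched below; the host, credentials, and loop body are placeholders for my actual code:

    <?php
    // One cron-launched worker that holds a single connection open for hours
    // instead of reconnecting on every request.
    $db = new mysqli('db-host', 'user', 'pass', 'request_db');
    if ($db->connect_errno) {
        exit("Connect failed: {$db->connect_error}\n");
    }

    $deadline = time() + 4 * 3600;   // run for a few hours, then let cron respawn us
    while (time() < $deadline) {
        // ... pull pending requests from the request-management DB,
        //     forward them to the target DBs, record the results ...
        sleep(5);
    }
    $db->close();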
Instead I've run into another issue that I've tried several solutions for, all with roughly the same results: duplicate data being entered into the DB because concurrent instances of the tool run on different machines. For example, I feed a big chunk of new data into my DB, and that data needs a few tasks performed on it, such as linking an email address to a user profile. I've chosen to have a background cron process handle this a little at a time, and each machine does its share every 5 minutes. Since they all start at the same time, they all grab the same email data and filter it using the same logic about which email addresses are higher priority. Then each machine begins to do what needs to be done with the emails it picked. Since they all kick off at the same time, they frequently grab the exact same data and attempt the same linkings. This causes a primary key exception in the linkage table but not in some of the others, so I end up with duplicate data in some tables and occasionally incomplete linkage.
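Here's a simplified sketch of what every machine's 5-minute cron does (table and column names are illustrative, not my real schema). Because every machine runs the same SELECT with the same priority ordering, two machines that start together fetch the same rows, and the loser of the race hits the primary key on the linkage table:

    <?php
    $db = new mysqli('db-host', 'user', 'pass', 'maindb');

    // Every machine runs the same query with the same priority logic...
    $res = $db->query(
        "SELECT email_id, user_id FROM pending_emails
         ORDER BY priority DESC LIMIT 10"
    );

    // ...so concurrent runs pick the same 10 rows and race on the INSERTs.
    while ($row = $res->fetch_assoc()) {
        $db->query(sprintf(
            "INSERT INTO email_user_link (email_id, user_id) VALUES (%d, %d)",
            $row['email_id'], $row['user_id']
        ));
    }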
I've tried randomizing the SELECT data so the machines work on different data sets. However, this is sub-optimal with respect to what I'd like the tool to accomplish, so I have to run the tool more often for certain tasks to complete in the desired time frame. I've tried creating a flag in the DB that designates that one server is actively using the data, so all other servers should wait. This works sometimes, but sometimes two machines poll the flag at the same time. Since we're talking about multiple machines, I don't think flock will work. And from what I've read, locking the table in the DB may not be a good solution either.
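The flag attempt looks roughly like this (names are illustrative). The SELECT and the UPDATE are two separate statements, which I assume is why two machines polling at the same moment can both see the flag as free:

    <?php
    $db = new mysqli('db-host', 'user', 'pass', 'maindb');

    // Check the flag... (machine B can run this same SELECT right now)
    $res   = $db->query("SELECT state FROM flags WHERE name = 'email_link'");
    $state = $res->fetch_row()[0];

    // ...then claim it. If both machines saw 'FREE', both proceed.
    if ($state === 'FREE') {
        $db->query("UPDATE flags SET state = 'OCCUPIED' WHERE name = 'email_link'");
        // ... do the work, then set the flag back to 'FREE' ...
    }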
So I've come to StackOverflow for advice rather than continuing to beat my head against a wall.
== Update ==
Gearman looked like a great solution, so I gave it the thumbs up as the answer. However, I was never able to get it working with my PHP install. I attempted several sets of suggestions/instructions found online, many of which never even got gearman installed. The suggestion of using apt-get install gearman-server did get gearman installed as far as I could tell - i.e. no errors were generated and gearmand would run. However, when I attempted to use gearman clients and workers in scripts, I got errors about those classes not being found.
After that I added a gearman.ini file to the correct directory containing the single line extension=gearman.so. This led to a different error, where PHP told me it couldn't find gearman.so. I used sudo find / -name gearman.* to look for gearman.so with no luck - it returned C files but not gearman.so.
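For reference, the scripts I was testing were a minimal client/worker pair modeled on the pecl gearman extension's documented API, roughly like the two files below. Both fail with class-not-found errors, which is why I believe the extension simply never loads:

    <?php // worker.php
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);   // gearmand's default port
    $worker->addFunction('reverse', function (GearmanJob $job) {
        return strrev($job->workload());     // placeholder task
    });
    while ($worker->work());

    <?php // client.php
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    echo $client->doNormal('reverse', 'hello'), "\n";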
At this time I'd really, really, really like to implement gearman, but since I can't get it to work I'm stuck with my hack-and-slash PHP code for implementing my distributed toolset. My "solution" to date is to create a flag that is set to "OCCUPIED" when one instance of the tool is doing something that could cause duplicate data issues. I have created 5 copies of the flag, with postfixes _1, _2, ..., so that 5 instances can run concurrently. (I use the _1, _2, ... postfix to create an offset into the returned DB data so that no two tool instances work on the same data set. In other words, if a SELECT statement returns 100+ rows and I only work on 10 at a time, then _1 works on rows 1-10, _2 on rows 11-20, and so on. Not ideal, but it should allow multiple servers to operate on the DB simultaneously without creating duplicate data.)
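The offset logic is roughly this (names are illustrative):

    <?php
    // Map the flag postfix (_1 .. _5) to a window of rows so no two
    // instances work on the same data set.
    $instance = 3;                          // 1..5, parsed from the flag postfix
    $batch    = 10;
    $offset   = ($instance - 1) * $batch;   // _1 -> rows 1-10, _2 -> 11-20, ...

    $db  = new mysqli('db-host', 'user', 'pass', 'maindb');
    $res = $db->query(sprintf(
        "SELECT email_id FROM pending_emails
         ORDER BY priority DESC LIMIT %d OFFSET %d",
        $batch, $offset
    ));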
The DB API tool times out if it doesn't see a result in 30 seconds. The problem now is frequent timeouts when attempting to grab those flag states...