
I'm about to undertake a large project where I'll need scheduled tasks (cron jobs) to run a script that loops through my entire database of entities and makes calls to multiple APIs such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.

I can already foresee a few potential pitfalls...

  1. Fetching data from APIs is slow.
  2. With thousands of records in my database (and the count constantly increasing), it's going to take too long to process every record within 10 minutes.
  3. Some shared servers kill scripts that run longer than 30 seconds.
  4. Server load issues from constantly running intensive scripts.

My question is: how should I structure my application?

  1. Could I create multiple cron jobs to handle small segments of my database (this would have to be automated)?
  2. That could mean potentially thousands of cron jobs. Is that sustainable?
  3. How can I get around the 30-second limit imposed by some servers?
  4. Is there a better way to go about this?

Thanks!

Danny
  • The biggest issue here is your choice of a shared server; it does not sound like the right kind of hosting for this - get a VPS. As to the questions: 1. yes; 2. one cron job could be used to do all the scheduling; 3. don't use a shared server; 4. see above. – Dagon Oct 09 '12 at 19:34
  • If you're looking to do something on this scale then a shared server shouldn't even be in the picture. You should be looking at a VPS or dedicated server, or multiples. – prodigitalson Oct 09 '12 at 19:36
  • Ok thanks, @Dagon - how can one cron job be used to do all the scheduling if I were to segment my database? I need every record in my database to be processed within 10 minutes, so surely I would need multiple cron jobs running, e.g. cron 1 schedules script.php?records=1-999 and cron 2 schedules script.php?records=1000-1999? – Danny Oct 09 '12 at 19:47
  • One cron job that checks a db to see what it should now be spawning is easier to manage than multiple cron jobs. You should also look at running things in parallel: launch two scripts, one to get records 1-100 and another to get 101-200, and so on. I find one script per core works well, depending on what else the server is doing. – Dagon Oct 09 '12 at 19:48
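
A minimal sketch of that dispatcher pattern, assuming a hypothetical worker.php that takes a start and end record ID and a users table with an auto-increment id column; a single cron entry runs dispatch.php every 10 minutes and fans the work out to parallel background workers:

    <?php
    // dispatch.php - run from a single cron entry every 10 minutes.
    // Splits the entity table into ranges and spawns one background
    // worker per range so the batches run in parallel.

    $batchSize  = 1000;   // records per worker
    $maxWorkers = 4;      // roughly one per CPU core

    $pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $maxId = (int) $pdo->query('SELECT MAX(id) FROM users')->fetchColumn();

    $launched = 0;
    for ($start = 1; $start <= $maxId; $start += $batchSize) {
        $end = $start + $batchSize - 1;
        // Launch worker.php in the background and return immediately.
        exec(sprintf('php worker.php %d %d > /dev/null 2>&1 &', $start, $end));

        // Naive throttle: pause after launching a full set of workers.
        if (++$launched % $maxWorkers === 0) {
            sleep(5);
        }
    }

In practice you would record which ranges have finished (e.g. in a status table) rather than sleeping blindly, but it shows how one cron entry can fan out to as many parallel workers as the machine can handle.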

2 Answers


I'm about to undertake a large project where I'll need scheduled tasks (cron jobs) to run a script that loops through my entire database of entities and makes calls to multiple APIs such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.

Your best option is to design the application to make use of a distributed database, and deploy it on multiple servers.

You can design it to work in two "ranks" of servers, not unlike the map-reduce approach: lightweight servers that only perform queries and "pre-digest" some data ("map"), and servers that aggregate the data ("reduce").

Once you do that, you can establish a performance baseline and calculate that, say, if you can generate 2,000 queries per minute and handle as many responses, then you need a new server for every 20,000 users (20,000 users polled once every 10 minutes works out to 2,000 queries per minute). In that "generate 2,000 queries per minute" you need to factor in:

  • data retrieval from the database
  • traffic bandwidth from and to the control servers
  • traffic bandwidth to Facebook, Foursquare, Twitter etc.
  • necessity to log locally (and maybe distill and upload log digests to Command and Control)
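
As a rough sketch of the "map" rank described above (the table, column, API URL and reducer endpoint below are all placeholders, not anything from the original post): a worker pulls a slice of entities, queries the remote API for each one, keeps only the fields the aggregator needs, and posts that digest to the "reduce" server.

    <?php
    // mapper.php <startId> <endId>
    // "Map" rank: fetch a slice of entities, query the remote API,
    // pre-digest the responses, and push the digest to the reducer.

    list(, $startId, $endId) = $argv;

    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT id, handle FROM users WHERE id BETWEEN ? AND ?');
    $stmt->execute(array((int) $startId, (int) $endId));

    $digest = array();
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $user) {
        // Placeholder endpoint; each real API has its own auth and format.
        $raw  = file_get_contents('https://api.example.com/profile/' . urlencode($user['handle']));
        $data = json_decode($raw, true);

        // "Pre-digest": keep only what the reducer actually needs.
        $digest[] = array(
            'user_id'   => $user['id'],
            'followers' => isset($data['followers']) ? $data['followers'] : null,
        );
    }

    // Ship the compact digest to the aggregating ("reduce") server.
    $ctx = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => 'Content-Type: application/json',
        'content' => json_encode($digest),
    )));
    file_get_contents('https://reducer.example.internal/ingest', false, $ctx);

Because each digest is small, the traffic back to the aggregating server is a fraction of the raw API traffic, which is where the bandwidth items above come in.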

An advantage of this architecture is that you can start small - a testbed can be built with a single machine running the Connector, Mapper, Reducer, Command and Control, and Persistence roles. When you grow, you just move the different services onto different servers.

On several distributed computing platforms, this also allows you to run queries faster by judiciously allocating Mappers geographically or connectivity-wise, and to reduce the traffic costs between your various platforms by playing with, e.g., Amazon "zones" (Amazon also has a message service that you might find valuable for communicating between the tasks).

One note: I'm not sure that PHP is the right tool for this whole thing. I'd rather think Python.

At the 20,000-users-per-instance traffic level, though, I think you'd better take this up with the guys at Facebook, Foursquare, etc. At a minimum you might glean some strategies, such as running the connector scripts as independent tasks, each connector sorting its queue by that service's user IDs to leverage what little data locality there might be, and taking advantage of pipelining to squeeze more bandwidth out of less server load. At most, they might point you to bulk APIs or different protocols, or buy you for one trillion bucks :-)
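
A small sketch of the pipelining idea, assuming the remote APIs are plain HTTP endpoints (the URLs are placeholders): PHP's curl_multi functions let a single connector keep several requests in flight at once instead of waiting on each slow API call in turn.

    <?php
    // Fetch several API URLs concurrently with curl_multi instead of
    // issuing one blocking request at a time.

    $urls = array(
        'https://api.example.com/profile/alice',
        'https://api.example.com/profile/bob',
        'https://api.example.com/profile/carol',
    );

    $mh      = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on a slow API
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh, 1.0); // wait for activity instead of busy-looping
    } while ($running > 0);

    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        // ... decode and pre-digest $body here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);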

LSerni
  • Thanks @LSerni, that's exactly what I was looking for. I'll do some more research around this idea. – Danny Oct 09 '12 at 20:11

See http://php.net/manual/en/function.set-time-limit.php to bypass the 30 second limit.
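
A minimal sketch of that call; note that it only lifts PHP's own max_execution_time, so a shared host that disables the function or kills long processes at a higher level will not be affected by it.

    <?php
    // Remove PHP's execution time limit for this long-running script.
    // 0 means "no limit"; any positive integer is a limit in seconds.
    set_time_limit(0);

    // ... long-running batch work goes here ...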

For scheduling jobs in PHP look at:

  1. http://www.phpjobscheduler.co.uk/
  2. http://www.zend.com/en/products/server/zend-server-job-queue

I personally would look at a more robust framework that handles job scheduling (see Grails with Quartz) instead of reinventing the wheel and writing your own job scheduler. Don't forget that you will probably need to check on the status of tasks from time to time, so you will need a logging solution around the tasks.
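
Whatever scheduler you choose, this is a sketch of the kind of logging wrapper meant here, assuming a hypothetical job_log table with job name, status, timestamps and an error column:

    <?php
    // Wrap a task so its start, success, or failure is recorded in a
    // job_log table that a dashboard or alerting script can watch.

    function runLoggedJob(PDO $pdo, $jobName, $job)
    {
        $pdo->prepare('INSERT INTO job_log (job_name, status, started_at) VALUES (?, ?, NOW())')
            ->execute(array($jobName, 'running'));
        $logId = $pdo->lastInsertId();

        try {
            $job();
            $pdo->prepare('UPDATE job_log SET status = ?, finished_at = NOW() WHERE id = ?')
                ->execute(array('done', $logId));
        } catch (Exception $e) {
            $pdo->prepare('UPDATE job_log SET status = ?, error = ?, finished_at = NOW() WHERE id = ?')
                ->execute(array('failed', $e->getMessage(), $logId));
        }
    }

    // Usage: runLoggedJob($pdo, 'twitter-sync', function () { /* fetch and store */ });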

John Moses