
I've run into a weird problem and hope someone can help me out. I've written a multiCurl spider in PHP that scrapes keywords off websites, and I'm hitting a strange performance issue.
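
Since the code belongs to a business project, I can only post generalities; each run follows roughly this curl_multi pattern (function, option, and variable names below are illustrative, not my actual implementation):

    <?php
    // Illustrative sketch only: one curl_multi batch per level of a site.
    function fetchBatch(array $urls) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($urls as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        // Drive every transfer in the batch to completion.
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);
        } while ($running > 0);
        // Collect the bodies, then release each easy handle.
        $pages = array();
        foreach ($handles as $url => $ch) {
            $pages[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
        return $pages;
    }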

When I run the spider to scrape the first few levels of a site, it takes about 2 minutes to complete, which isn't a big problem for my purposes. What's strange is that when I run one spider after another in the same script, the runtime balloons. For example, when I run it sequentially on 7 sites, I'd expect it to take about 14 minutes (2 minutes per site), but instead it takes upwards of 45 minutes. I've tested each of the sites separately and they do average 2 minutes or less apiece, yet run in sequence they take almost an hour.

I thought it might have something to do with memory, so I implemented APC caching to store the keyword data while the script is running. The thing is, when I look at my Task Manager (I'm running XAMPP on Windows 7), the Apache process doesn't seem to go much higher than 46K/23% of the CPU, and everything else on my machine runs just fine.
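
The caching itself is just plain APC calls, along these lines (the key scheme and names are illustrative):

    // Roughly how keyword data gets stashed in APC between batches.
    // One-hour TTL so data from stale runs eventually expires.
    function cacheKeywords($site, array $keywords) {
        $key = 'keywords_' . md5($site);
        $existing = apc_fetch($key);
        if ($existing === false) {
            $existing = array();
        }
        apc_store($key, array_merge($existing, $keywords), 3600);
    }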

I've taken a close look and made sure all the appropriate handles are closed as soon as possible and large variables are unset/cached, and yet I'm still scratching my head as to why it takes 3 times longer than expected to run one site after another. I'm about to try forking the spiders into separate processes using pcntl (via a thumb-drive install of Linux), but I was wondering if anyone has ideas about what might be giving my application this performance hit. Thanks!
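
For reference, the between-site cleanup I'm doing looks roughly like this (general shape only):

    // After a site finishes: drop the big buffers and nudge the collector.
    unset($pages, $keywords);    // large arrays left over from the last site
    gc_collect_cycles();         // force the PHP 5.3+ cycle collector to run
    echo 'still allocated: ' . memory_get_usage(true) . " bytes\n";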

Zero Wing
  • Gonna be tough to tell without any code... – Mansfield Aug 26 '13 at 00:51
  • @Mansfield: Yeah I'm sorry about that, unfortunately the code is being written for a business project, so I can't post anything here besides generalities. I was just wondering if there might be some general aspect to PHP performance I might not know about. – Zero Wing Aug 26 '13 at 01:04
  • @ZeroWing, without at least part of the code, it is hard to answer. If you cannot post your code from your business application, why not reproduce the problem in a small sample application and post it here? – invisal Aug 26 '13 at 01:07
  • I would add some benchmarking code and see if the slow times are due to fetching the data over the network (more likely) or to processing of the downloaded data. Did the APC cache make much of a speed improvement? – bumperbox Aug 26 '13 at 01:09
  • @bumperbox: Thanks so much for the suggestion to benchmark the code; I've never worked on high-volume data projects before, and it never occurred to me as something I ought to try. Benchmarking various areas of the code showed that the problem was partly that the subsequent scripts were running out of memory, compounded by various inefficient array functions and a bad approach to the SQL insertions I was using to upload the collected data. Fixing those removed most of the slowdown, though the memory exhaustion is still an issue; rough sketches of the benchmarking and the insert fix are below. – Zero Wing Aug 28 '13 at 23:01
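
Update, following the comments: the benchmarking was nothing fancy, just microtime() around each phase; extractKeywords() here is a hypothetical stand-in for my parsing step:

    // Time the network and processing phases separately per site.
    $t = microtime(true);
    $pages = fetchBatch($urls);                   // network phase
    printf("fetch: %.1fs\n", microtime(true) - $t);

    $t = microtime(true);
    $keywords = extractKeywords($pages);          // parsing phase (hypothetical helper)
    printf("parse: %.1fs\n", microtime(true) - $t);

    printf("peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);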
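
The SQL fix amounted to swapping one INSERT per keyword for a single multi-row INSERT per batch, something like this (table and column names are made up):

    // Build one multi-row INSERT instead of issuing a query per keyword.
    function insertKeywords(PDO $db, $site, array $keywords) {
        if (empty($keywords)) {
            return;
        }
        $placeholders = rtrim(str_repeat('(?, ?),', count($keywords)), ',');
        $stmt = $db->prepare("INSERT INTO keywords (site, keyword) VALUES $placeholders");
        $params = array();
        foreach ($keywords as $kw) {
            $params[] = $site;   // same site for every row in the batch
            $params[] = $kw;
        }
        $stmt->execute($params);
    }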

0 Answers