
I have a web scraper that runs great: I can pass it a list of domain names, and it will scrape each site to see which ones are missing SEO. I realize there are tools for this (like Screaming Frog), but I am trying to create a PHP script that does this for me and writes the results to a Google Sheet.

I've got a Google Sheet with a list of 300 sites. My script pulls from this Google Sheet like so:

 $domain_batch = getBatchOfFive(1,5);

This returns 5 of the sites from the Google Sheet, and then I take the returned array and pass it to a function that scrapes each site like so:

 foreach ($domain_batch as $site){

      $seo = getAllPagesSEO($site);

      //then logic to add the results to a spreadsheet
 }

Then when I run it again, I change that to:

 $domain_batch = getBatchOfFive(6,10);

And so on until I get through all of the sites in the Google Sheet.
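Rather than editing the offsets by hand between runs, the batch bounds can be computed in a loop. A minimal sketch of that idea, with the question's `getBatchOfFive()` and `getAllPagesSEO()` passed in as callables (the total row count is assumed to be known, e.g. 300):

```php
<?php
// Process the whole sheet in fixed-size batches instead of editing
// the offsets by hand between runs. $fetchBatch and $scrapeSite stand
// in for the question's getBatchOfFive() and getAllPagesSEO().
function processAllSites(int $total, int $batchSize, callable $fetchBatch, callable $scrapeSite): int
{
    $processed = 0;
    for ($start = 1; $start <= $total; $start += $batchSize) {
        $end = min($start + $batchSize - 1, $total); // last batch may be short
        foreach ($fetchBatch($start, $end) as $site) {
            $seo = $scrapeSite($site);
            // ...then logic to add the results to a spreadsheet...
            $processed++;
        }
    }
    return $processed;
}
```

Called as `processAllSites(300, 5, 'getBatchOfFive', 'getAllPagesSEO');`. Note this still does all the work in one request, so on its own it doesn't solve the timeout; it just removes the manual offset editing.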

To run this script, I just pull it up in my browser:

https://example.com/seo-scraper.php

The problem is I can only scrape about 5 sites at a time before the script times out. I'm wondering if it would be possible to run this script incrementally somehow.

Is there any way I could programmatically run the script for the first 5 sites, then, once it finishes, automatically run it again for the next 5 sites, and keep going until every site in the Google Sheet has been run through the script?

That way I don't have to go into seo-scraper.php after each run and change the values here:

 $domain_batch = getBatchOfFive(6,10);

I'm thinking this might not be possible, but I'm looking for any ideas!
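One browser-driven way to get exactly this "finish a batch, then run again" behaviour is to have the script read its start offset from the query string, process one batch, and then redirect to itself with the next offset until the sheet is exhausted. A sketch under that assumption; `getBatchOfFive()` and `getAllPagesSEO()` are the question's own helpers, and the total row count (300) is hard-coded for illustration:

```php
<?php
// Visit seo-scraper.php?start=1 once; each request handles one batch
// of 5, then redirects to itself with the next offset, so every
// individual request stays well under the timeout.

// Pure helper: start offset of the next batch, or null when finished.
function nextStart(int $end, int $total): ?int
{
    return $end < $total ? $end + 1 : null;
}

// Call handleBatchRequest($_GET) at the bottom of seo-scraper.php.
function handleBatchRequest(array $get, int $total = 300, int $batchSize = 5): void
{
    $start = isset($get['start']) ? (int) $get['start'] : 1;
    $end   = min($start + $batchSize - 1, $total);

    foreach (getBatchOfFive($start, $end) as $site) {
        $seo = getAllPagesSEO($site);
        // ...then logic to add the results to a spreadsheet...
    }

    if (($next = nextStart($end, $total)) !== null) {
        header('Location: seo-scraper.php?start=' . $next); // chain the next batch
        exit;
    }
    echo 'All done.';
}
```

The browser has to stay open for the chain to keep going, since each batch is a fresh page load triggered by the redirect.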

Alex Douglas
  • Sure, you could check whether it's an XHR or a POST request, or else serve some JS which in turn calls back with a POST or XHR request and then runs your main code; you could even use server-sent events or long polling. But if you just want it not to time out, perhaps add [ignore_user_abort](https://www.php.net/manual/en/function.ignore-user-abort.php) so it carries on even if your browser times out, and [set_time_limit](https://www.php.net/manual/en/function.set-time-limit.php) so PHP doesn't time out. Also, you should loop over records 0 to n rather than manually shifting along by five. – Lawrence Cherone Aug 16 '22 at 21:23
  • It seems that most of the time I am getting a 504 Gateway Timeout. I'm wondering if ignore_user_abort would help bypass that issue? – Alex Douglas Aug 16 '22 at 21:49
  • You could have a look at [How to run a large PHP script](https://stackoverflow.com/questions/2840711/how-to-execute-a-large-php-script). Some solutions might work for you, like running it on the command line instead of in a browser, or via a cron job. – Uwe Aug 16 '22 at 22:50
  • Both of these comments were VERY helpful. Thank you both! – Alex Douglas Aug 18 '22 at 19:49
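The two timeout-related calls from the comments go at the very top of the script. One caveat worth noting: `set_time_limit()` only lifts PHP's own `max_execution_time`; a 504 Gateway Timeout comes from the web server or gateway in front of PHP, which these calls do not change. That is why the other comment's suggestion of running the script from the command line (`php seo-scraper.php`) or a cron job sidesteps the problem entirely: there is no gateway in the way. A sketch of the in-script part:

```php
<?php
// At the top of seo-scraper.php:
ignore_user_abort(true); // keep running even if the browser disconnects
set_time_limit(0);       // lift PHP's max_execution_time (0 = no limit);
                         // has no effect on gateway/proxy timeouts like a 504
```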

0 Answers