
I'm developing a web scraper in PHP and I've run into the problem of low data processing speed. When I load a web page I receive too much unnecessary data. Is there any way to receive not the whole page but only pieces of it, i.e. a specific HTML tag and its content?

Now I have code like this:

<?php
include_once 'simple_html_dom.php'; // file_get_html() comes from the Simple HTML DOM library

$html = file_get_html('http://www.google.com/');
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo $title->plaintext."<br>\n";
echo $image->src;
?>
  • You could eventually receive part of a page, but without any guarantee of finding what you're looking for in the received fragment. Unless you just want the title (or meta tags), this seems like a dead end. – Calimero Sep 26 '17 at 20:28
  • Unless the source has some API that lets you fetch just the specific data, you need to download the full page and parse it yourself. I guess you could theoretically download it in chunks, check whether the elements you want are in there, and stop fetching once they are (a sketch of that idea follows). – M. Eriksson Sep 26 '17 at 20:33
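
For what it's worth, here is a minimal sketch of the chunked approach from the comment above, using cURL's write callback (CURLOPT_WRITEFUNCTION) to abort the transfer as soon as the wanted tag has arrived; the </title> target and URL are only illustrative:

<?php
$buffer = '';
$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer) {
    $buffer .= $chunk;
    // Stop as soon as the closing </title> tag shows up in the stream.
    if (strpos($buffer, '</title>') !== false) {
        return 0; // returning fewer bytes than received aborts the transfer
    }
    return strlen($chunk); // keep downloading
});
curl_exec($ch); // reports an error when we abort deliberately - that's expected
curl_close($ch);

if (preg_match('~<title>(.*?)</title>~si', $buffer, $m)) {
    echo trim($m[1]), "\n";
}

Note this still downloads everything up to the tag, so it only helps when the tag sits near the top of the page.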

1 Answer


problem of low data processing speed

Really? IME the DOM parser works quite well. Assuming you have confirmed that this is the cause of your woes, there are three obvious solutions:

  • If you're scraping multiple pages, shard the workload across all your CPUs.
  • Use an event-based parser instead of the DOM parser (your code gets a lot more complicated at this point) and discard the trailing content you don't need; a sketch follows this list.
  • Upgrade your hardware.
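
A minimal sketch of the event-based route, using PHP's SAX-style xml_parser_* functions, which fire callbacks per element instead of building a whole tree in memory. Note they expect well-formed markup, so messy real-world HTML may need tidying first; the URL is a placeholder:

<?php
$inTitle = false;
$title   = '';

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    // Element names arrive uppercased because case folding is on by default.
    function ($p, $name) use (&$inTitle) { $inTitle = ($name === 'TITLE'); },
    function ($p, $name) use (&$inTitle) { $inTitle = false; }
);
xml_set_character_data_handler($parser, function ($p, $data) use (&$inTitle, &$title) {
    if ($inTitle) { $title .= $data; }
});

xml_parse($parser, file_get_contents('http://www.example.com/'), true);
xml_parser_free($parser);

echo $title, "\n";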

While HTTP supports range requests (i.e. you can fetch only part of a page), you don't know where the tag blocks align with the byte stream - so you can't reliably fetch just the part of the page you want.
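
For illustration, a ranged request with cURL looks like this (the URL and byte range are arbitrary); whether the wanted tag falls inside those bytes, or whether the server honours the range at all, is pure luck:

<?php
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, '0-1023'); // ask for the first 1024 bytes only
$fragment = curl_exec($ch);
$status   = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 206 means a partial body was served
curl_close($ch);

echo "HTTP $status, got ", strlen($fragment), " bytes\n";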

OTOH, if you haven't bothered to check that the problem is with the code execution, then it's far more likely that the slowness arises from network latency; you've not told us anything about how you are fetching the pages, and you've not shown us any of the code which retrieves the content (there is no "file_get_html" in native PHP - it comes from the Simple HTML DOM library).

If the problem is actually latency, then the solution would be to run a batch process which fetches several pages at a time asynchronously, using curl_multi_exec - a sketch follows.
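
A minimal sketch of that pattern (the URL list is illustrative); all transfers run concurrently, so total wall time is close to that of the slowest page rather than the sum of them all:

<?php
$urls = ['http://www.example.com/a', 'http://www.example.com/b'];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until every one has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch); // hand this to your DOM parser
    echo $url, ': ', strlen($html), " bytes\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);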

symcbean