
I'm trying to scrape a page that uses infinite scroll, using PhantomJS, CasperJS, and SpookyJS. The script is supposed to keep clicking the "more" button and collecting the new links from the results until it is stopped manually. However, it uses more and more memory until it crashes. I wrote the following script; is there a way to optimise it so it doesn't use as much memory?

function pressMore(previousLinksLength) {
  this.click('#projects > div.container-flex.px2 > div > a');
  this.wait(1000, function() {
    links = this.evaluate(function() {
      var projectPreview = document.querySelectorAll('.project-thumbnail a');
      return Array.prototype.map.call(projectPreview, function(e) {
        return e.getAttribute('href');
      });
    });
    this.emit('sendScrapedLinks', links.slice(previousLinksLength));
    // repeat the scrape function
    pressMore.call(this, links.length);
  });
}
// SpookyJS starts here
spooky.start(scrapingUrl);

// press the "more" button
spooky.then(pressMore);

spooky.run();
Bunker

1 Answer


I've also run into this problem on infinite-scrolling sites. I could never find a way around the memory leaks.

In short, what I ended up doing was using scrollTo. Essentially, I would run the app for a while, log the last scrolled-to position, and then restart the app using the logged value to keep memory from getting too high. It's a pain because on many sites you have to scroll sequentially through a series of positions to load more and more content, and finding those positions so you can divide up your last scrolled-to position can be challenging. A rough sketch of the restart approach is below.
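A minimal CasperJS-only sketch of that idea (the question uses SpookyJS, but the mechanism is the same). The URL, selector, scroll step, iteration count, and the state.json file name are all placeholders, not values from the question:

// restart-scroll.js -- sketch: scroll for a while, log the position, exit, let a wrapper restart us.
// scrapingUrl, the selector, STEP, MAX_STEPS_PER_RUN and state.json are placeholders.
var casper = require('casper').create();
var fs = require('fs');                      // PhantomJS fs module

var scrapingUrl = 'http://example.com/projects';
var STATE_FILE = 'state.json';
var STEP = 2000;                             // pixels scrolled per step
var MAX_STEPS_PER_RUN = 50;                  // exit after this many scrolls

// resume from the position logged by the previous run, if any
var lastY = 0;
if (fs.exists(STATE_FILE)) {
  lastY = JSON.parse(fs.read(STATE_FILE)).lastY || 0;
}

casper.start(scrapingUrl);

casper.then(function() {
  this.scrollTo(0, lastY);                   // jump back to where the previous run stopped
});

casper.repeat(MAX_STEPS_PER_RUN, function() {
  this.wait(1000, function() {
    lastY += STEP;
    this.scrollTo(0, lastY);                 // trigger the next chunk of infinite scroll
    var links = this.evaluate(function() {   // same link extraction as in the question
      var projectPreview = document.querySelectorAll('.project-thumbnail a');
      return Array.prototype.map.call(projectPreview, function(e) {
        return e.getAttribute('href');
      });
    });
    this.echo(links.join('\n'));             // deduplication is left out of this sketch
  });
});

casper.then(function() {
  // log the last scrolled-to position so the next run can resume from it
  fs.write(STATE_FILE, JSON.stringify({ lastY: lastY }), 'w');
});

casper.run(function() {
  this.exit();
});

You would then re-run this script from a shell loop or a supervisor; each run reads lastY from state.json, so no single PhantomJS process lives long enough to exhaust memory.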

Chris Hawkes
  • How does this help? Just because you know the last scroll position before a crash doesn't mean that you get further on a second attempt. – Artjom B. Sep 09 '14 at 16:28