How do you spider with PhantomJS?

Question

I am trying to use PhantomJS to spider an entire domain. I want to start at the root domain, e.g. www.domain.com, pull all links (a.href), and then keep a queue of links to fetch, adding each newly found link to the queue only if it has not already been crawled and is not already in the queue.

Ideas, Help?

Thanks in advance!

Show us some code that you've implemented and we can help. As an aside, I'm not sure that JavaScript alone is going to help you. — Spence, Nov 16 '11 at 07:46

score 20 · Answer 1 · answered Dec 06 '11 at 20:28

You might be interested in checking out Pjscrape (disclaimer: this is my project), an open-source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each page as it goes. You could spider an entire site, following every anchor link, with a short script like this:

pjs.addSuite({
    url: 'http://www.example.com/your_start_page.html',
    moreUrls: function() {
        // get all URLs from anchor links,
        // restricted to the current domain by default
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // scrapers can use jQuery
        return $('h1').first().text();
    }
});

By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.

nrabinowitz

How does PhantomJS match up to rendering pages with Selenium? Can we expect the same rendering quality? – KJW Apr 03 '12 at 05:18
@KimJongWoo PhantomJS uses the WebKit rendering engine. I don't think it uses the latest version, as it relies on Qt. However, WebKit is the rendering engine that powers the likes of Safari and Chrome, so it's pretty darn good. Looking at Selenium, it just seems to automate browsers rather than being a headless browser. – thomas Aug 23 '12 at 15:11
@nrabinowitz any interest in making this a node module? – morgs32 Dec 09 '13 at 18:16
@morgs32 - depending what you mean, this may or may not be possible. See http://phantomjs.org/related-projects.html for node modules that integrate PhantomJS with node. It might be useful to package pjscrape with npm, but I don't have the time to do so at the moment. – nrabinowitz Dec 09 '13 at 22:40
project looks dead. examples page is broken. – chovy May 30 '14 at 19:28
project is... not dead, necessarily, but not actively maintained at the moment. – nrabinowitz Jun 02 '14 at 23:45
@nrabinowitz How about now? It seems dead to me – samayo Jan 26 '17 at 13:27
Yeah, I haven't touched it in several years. Sorry, just haven't had the time. :/ – nrabinowitz Jan 26 '17 at 14:48

score 6 · Answer 2 · answered May 22 '15 at 22:03

This is an old question, but to update, an awesome modern answer is http://www.nightmarejs.org/ ( github: https://github.com/segmentio/nightmare )

Quoting a compelling example from their homepage:

RAW PHANTOMJS:

phantom.create(function (ph) {
  ph.createPage(function (page) {
    page.open('http://yahoo.com', function (status) {
      page.evaluate(function () {
        var el =
          document.querySelector('input[title="Search"]');
        el.value = 'github nightmare';
      }, function (result) {
        page.evaluate(function () {
          var el = document.querySelector('.searchsubmit');
          var event = document.createEvent('MouseEvent');
          event.initEvent('click', true, false);
          el.dispatchEvent(event);
        }, function (result) {
          ph.exit();
        });
      });
    });
  });
});

WITH NIGHTMARE:

new Nightmare()
  .goto('http://yahoo.com')
  .type('input[title="Search"]', 'github nightmare')
  .click('.searchsubmit')
  .run();

score 3 · Answer 3 · answered Nov 17 '11 at 18:38

First, select all anchors on the index page and make a list of their href values. You can do this either with standard DOM selectors (for example document.querySelectorAll, run inside page.evaluate) or with jQuery selectors. Then do the same for each linked page, and repeat until no page yields any new links. Keep a master list of every link found so far, plus the list of links for each page, so you can tell whether a link has already been processed.

You can think of web crawling as walking a tree. The root node is the index page, and its child nodes are the pages linked from the index page. Each child node can in turn have one or more children, depending on the links its page contains. Because pages often link back to pages you have already seen, the master list is what stops the crawl from looping. I hope this helps.
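The queue-and-master-list approach described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested PhantomJS script: crawl, hostOf, and fetchLinks are hypothetical names, and fetchLinks(url, callback) stands in for whatever loads a page and returns its absolute a.href values (in PhantomJS that would typically be page.open followed by page.evaluate). It is stubbed here with an in-memory link map so the logic can be run on its own.

```javascript
// Rough hostname extraction (assumption: plain http/https URLs without
// user:password parts). Returns null for anything it cannot parse.
function hostOf(url) {
    var match = /^https?:\/\/([^\/?#:]+)/i.exec(url);
    return match ? match[1].toLowerCase() : null;
}

// Breadth-first crawl restricted to the host of startUrl.
// fetchLinks(url, callback) is a hypothetical helper that must call
// callback with an array of absolute URL strings found on that page.
// done(visited) receives the URLs in the order they were fetched.
function crawl(startUrl, fetchLinks, done) {
    var host = hostOf(startUrl);
    var queue = [startUrl];          // URLs waiting to be fetched
    var seen = Object.create(null);  // URLs already crawled or queued
    var visited = [];
    seen[startUrl] = true;

    function next() {
        if (queue.length === 0) {
            done(visited);
            return;
        }
        var url = queue.shift();
        visited.push(url);
        fetchLinks(url, function (links) {
            links.forEach(function (link) {
                // Enqueue only unseen links on the same host.
                if (hostOf(link) === host && !seen[link]) {
                    seen[link] = true;
                    queue.push(link);
                }
            });
            next();
        });
    }

    next();
}

// Stand-in for real page loading: a made-up link map for illustration.
var demoSite = {
    'http://www.domain.com/': ['http://www.domain.com/a', 'http://www.domain.com/b'],
    'http://www.domain.com/a': ['http://www.domain.com/', 'http://other.example/x'],
    'http://www.domain.com/b': ['http://www.domain.com/c'],
    'http://www.domain.com/c': []
};

function demoFetchLinks(url, callback) {
    callback(demoSite[url] || []);
}

crawl('http://www.domain.com/', demoFetchLinks, function (visited) {
    console.log(visited.join('\n'));
});
```

Marking a URL as seen at the moment it is queued, rather than when it is fetched, is what keeps the same link from being queued twice. Note that because the stub calls back synchronously, next() recurses once per page; a real asynchronous page load would not grow the stack that way.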
