Is there any JavaScript web crawler framework?
-
Could you be more specific? Are you looking for a web crawler implemented in JavaScript? Server-side (Node.js) or client-side (in a browser)? – Matt Ball Apr 05 '11 at 17:31
-
Is there a client-side webcrawler framework? How would that work? – Shakakai Apr 05 '11 at 17:36
-
I wrote three APIs using server-side JavaScript. You can run `nodejs` from your command line as easily as you can `python`. This is a perfectly valid question. – salezica Mar 13 '13 at 00:06
3 Answers
There's a new framework that was just released for Node.js called spider. It uses jQuery under the hood to crawl/index a website's HTML pages. The API and configuration are really nice, especially if you already know jQuery.
From the test suite, here's an example of crawling the New York Times website:
var spider = require('../main');

spider()
  .route('www.nytimes.com', '/pages/dining/index.html', function (window, $) {
    $('a').spider();
  })
  .route('travel.nytimes.com', '*', function (window, $) {
    $('a').spider();
    if (this.fromCache) return;

    var article = { title: $('nyt_headline').text(), body: '', photos: [] };
    $('div.articleBody').each(function () {
      article.body += this.outerHTML;
    });
    $('div#abColumn img').each(function () {
      var p = $(this).attr('src');
      if (p.indexOf('ADS') === -1) {
        article.photos.push(p);
      }
    });
    console.log(article);
  })
  .route('dinersjournal.blogs.nytimes.com', '*', function (window, $) {
    var article = { title: $('h1.entry-title').text() };
    console.log($('div.entry-content').html());
  })
  .get('http://www.nytimes.com/pages/dining/index.html')
  .log('info');

Shakakai
-
I spent a morning trying to get spider to work; it can't be run on the latest Node.js (0.6.6). – Kuroro Jan 01 '12 at 04:43
-
This is a good start, but it doesn't seem to handle meta redirects or document base overrides, so it will fail to crawl many sites. Still, it is the best implementation I've seen for Node, and with support for cookies it's better than other open-source crawlers. – Marcus Pope Jun 19 '12 at 19:13
Server-side?
Try node-crawler: https://github.com/joshfire/node-crawler

bpierre
-
I wouldn't consider this a crawler, since it doesn't collect subsequent URIs to crawl. It basically downloads the source of a given URL and triggers a callback on completion. It's up to the consumer to define the logic for crawling the links found in that page, something that isn't very straightforward. – Marcus Pope Jun 19 '12 at 18:56
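As that comment notes, the link-following logic is left to the consumer. A minimal sketch of that bookkeeping — hypothetical helper names, with naive regex extraction standing in for the jQuery handle node-crawler passes to its callback, so this is not the library's API — might look like:

```javascript
// Extract href values from a page's raw HTML (naive: assumes
// double-quoted attributes, ignores relative-URL resolution).
function extractLinks(html) {
  var links = [];
  var re = /href="([^"]+)"/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

// A visited-set queue: each URL is enqueued at most once,
// so the crawl terminates even when pages link to each other.
function makeQueue() {
  var visited = {};
  var pending = [];
  return {
    enqueue: function (url) {
      if (!visited[url]) {
        visited[url] = true;
        pending.push(url);
      }
    },
    next: function () {
      return pending.shift(); // undefined when the queue is empty
    }
  };
}

// Usage: feed each downloaded page's links back into the queue.
var queue = makeQueue();
queue.enqueue('http://example.com/');
extractLinks('<a href="http://example.com/a">A</a> <a href="http://example.com/">home</a>')
  .forEach(queue.enqueue);
```

In a real crawl you would call `queue.next()` in a loop, download each URL (e.g. with node-crawler), and pass the resulting HTML back through `extractLinks`.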