Is there any JavaScript web crawler framework?
-
Could you be more specific? Are you looking for a web crawler implemented in JavaScript? Server-side (Node.js) or client-side (in a browser)? – Matt Ball Apr 05 '11 at 17:31
-
Is there a client-side webcrawler framework? How would that work? – Shakakai Apr 05 '11 at 17:36
-
I wrote three APIs using server-side JavaScript. You can run `nodejs` from your command line as easily as you can `python`. This is a perfectly valid question. – salezica Mar 13 '13 at 00:06
3 Answers
There's a new framework that was just released for Node.js called spider. It uses jQuery under the hood to crawl/index a website's HTML pages. The API and configuration are really nice, especially if you already know jQuery.
From the test suite, here's an example of crawling the New York Times website:
var spider = require('../main');

spider()
  .route('www.nytimes.com', '/pages/dining/index.html', function (window, $) {
    $('a').spider();
  })
  .route('travel.nytimes.com', '*', function (window, $) {
    $('a').spider();
    if (this.fromCache) return;

    var article = { title: $('nyt_headline').text(), body: '', photos: [] };
    $('div.articleBody').each(function () {
      article.body += this.outerHTML;
    });
    $('div#abColumn img').each(function () {
      var p = $(this).attr('src');
      if (p.indexOf('ADS') === -1) {
        article.photos.push(p);
      }
    });
    console.log(article);
  })
  .route('dinersjournal.blogs.nytimes.com', '*', function (window, $) {
    var article = { title: $('h1.entry-title').text() };
    console.log($('div.entry-content').html());
  })
  .get('http://www.nytimes.com/pages/dining/index.html')
  .log('info');

Shakakai
-
I spent a morning trying to get spider to work; it can't be run on the latest Node.js (0.6.6). – Kuroro Jan 01 '12 at 04:43
-
This is a good start, but it doesn't seem to handle meta redirects or document base overrides, so it will fail to crawl many sites. Still, it is the best implementation I've seen for Node, and with support for cookies it's better than other open-source crawlers. – Marcus Pope Jun 19 '12 at 19:13
Server-side?
Try node-crawler: https://github.com/joshfire/node-crawler

bpierre
-
I wouldn't consider this a crawler, since it doesn't collect subsequent URIs to crawl. It basically downloads the source of a given URL and triggers a callback on completion. It's up to the consumer to define the logic for crawling the links found in that page, something that isn't very straightforward. – Marcus Pope Jun 19 '12 at 18:56
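As that comment notes, the link-following logic is left to the consumer. A minimal sketch of that bookkeeping — hypothetical helper names, with naive regex extraction standing in for the jQuery handle node-crawler passes to its callback, so this is not the library's API — might look like:

```javascript
// Extract href values from a page's raw HTML (naive: assumes
// double-quoted attributes, ignores relative-URL resolution).
function extractLinks(html) {
  var links = [];
  var re = /href="([^"]+)"/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

// A visited-set queue: each URL is enqueued at most once,
// so the crawl terminates even when pages link to each other.
function makeQueue() {
  var visited = {};
  var pending = [];
  return {
    enqueue: function (url) {
      if (!visited[url]) {
        visited[url] = true;
        pending.push(url);
      }
    },
    next: function () {
      return pending.shift(); // undefined when the queue is empty
    }
  };
}

// Usage: feed each downloaded page's links back into the queue.
var queue = makeQueue();
queue.enqueue('http://example.com/');
extractLinks('<a href="http://example.com/a">A</a> <a href="http://example.com/">home</a>')
  .forEach(queue.enqueue);
```

In a real crawl you would call `queue.next()` in a loop, download each URL (e.g. with node-crawler), and pass the resulting HTML back through `extractLinks`.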