I need a JavaScript library to crawl a web application. I found https://github.com/riccardo-forina/status-jquery-crawler but, as the author states, it is at an early stage of development. I could not find anything else after a lot of googling. Thanks for any input.
I know the tags say JavaScript, but as a side note, you could do this very easily with PHP. – Jay Aug 26 '14 at 19:18
This may help you out: http://stackoverflow.com/questions/11083522/is-it-possible-to-write-web-crawler-in-javascript – S1r-Lanzelot Aug 26 '14 at 19:41
2 Answers
JavaScript has many utilities you can use.
The biggest question when choosing your tool is, "does my site use JavaScript to load the content I want?". For example, almost all of Google's search page is contained in the HTML they send in response to an HTTP GET request.
Other sites may use JavaScript to load comments, notifications, or pictures that aren't contained in the initial HTML. This means that if you just said "give me the HTML for Site A", the page you'd get back would be missing much of the content you wanted.
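One rough way to answer that question for a given page is to fetch the raw HTML and check whether a piece of text you expect to see is already in it. The following is only a sketch: the function names `htmlContains` and `isStaticallyRendered` are made up for illustration, and the default fetcher assumes Node 18+ where a global `fetch` is available.

```javascript
// Pure check: does the raw HTML already contain the text we care about?
// The comparison is case-insensitive.
function htmlContains(html, expectedText) {
  return html.toLowerCase().includes(expectedText.toLowerCase());
}

// Fetch the page without running any of its scripts and look for the text.
// Assumption: fetchImpl behaves like the standard fetch() (Node 18+ global).
async function isStaticallyRendered(url, expectedText, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  const html = await response.text();
  return htmlContains(html, expectedText);
}
```

If this returns false for content you can see in your browser, the content is probably being loaded by JavaScript and you are likely in the "Dynamic Sites" case below.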
Static Sites
For most sites where what you want is in the HTML, there are some excellent node.js scraping libraries at your disposal:
x-ray - a neat package that bundles up cheerio inside a declarative scrape object. Provides some simple structure with which to build robust scrapes.
cheerio + request - this is a popular combination, using cheerio to parse the HTML and request to get it for you. You'll find lots of resources explaining the basics of requesting web-pages, extracting the HTML, and even adding authentication and maintaining sessions where required using these tools.
artoo.js - an in-browser scraping utility. Extremely useful for prototyping and one-off scrapes. You can add it as a bookmarklet and run it from your browser's developer console. It allows jQuery-like selectors and has some basic link-following logic.
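To give an idea of what a crawl over a static site involves, here is a minimal sketch that uses only Node built-ins. It is not how any of the libraries above work internally: a simple regex stands in for a real HTML parser such as cheerio, the names `extractLinks` and `crawl` are hypothetical, and the page fetcher is passed in as a function so you can swap in whatever HTTP client you prefer.

```javascript
// Collect absolute http(s) links from an HTML string.
// Assumption: a regex is good enough for a sketch; a real scrape should
// use a proper parser such as cheerio.
function extractLinks(html, baseUrl) {
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  const links = new Set();
  for (const match of html.matchAll(hrefPattern)) {
    try {
      const url = new URL(match[1], baseUrl); // resolve relative links
      url.hash = ''; // "#section" variants point at the same page
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        links.add(url.href);
      }
    } catch {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return [...links];
}

// Breadth-first crawl that stays on the start URL's origin.
// fetchHtml is a caller-supplied async function: url -> HTML string.
async function crawl(startUrl, fetchHtml, maxPages = 50) {
  const start = new URL(startUrl);
  const visited = new Set();
  const queue = [start.href];
  const pages = new Map(); // url -> html

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    let html;
    try {
      html = await fetchHtml(url);
    } catch {
      continue; // skip pages that fail to load
    }
    pages.set(url, html);

    for (const link of extractLinks(html, url)) {
      if (new URL(link).origin === start.origin && !visited.has(link)) {
        queue.push(link);
      }
    }
  }
  return pages;
}
```

On Node 18+ you could call it as `crawl('https://example.com/', (url) => fetch(url).then((res) => res.text()))`. The `maxPages` cap is there so a mistake in the link filter cannot turn into an unbounded crawl.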
Dynamic Sites
If you need a browser-like environment to get content from your site, you'll want to check out headless web browsing and drivers in node.js. PhantomJS is the most popular, but there are many others. Be warned: PhantomJS is not a node.js module, so to use it with other JavaScript libraries you'll need to find a node.js driver:
Nightmare - a node library that talks to PhantomJS and simplifies basic web-page workflow and scraping.
SpookyJS - a node library for CasperJS, a tool built on top of PhantomJS that is also a separate package.
PhantomJS-Node - the most popular PhantomJS driver for node.
(Sorry for the lack of links - I don't have enough reputation to post more than 2 right now)

PhantomJS is a JavaScript-scriptable headless WebKit browser, so you could use it for crawling. There is also a newer wrapper built on top of PhantomJS called Nightmare: http://www.nightmarejs.org/.
