18

I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to what the Firebug HTML window shows. The best example is Kayak.com, where you cannot see the resulting DOM when you 'view source' in the browser, but you can save the resulting HTML through Firebug.

How would I go about doing this? What tools exist that would help me?

user320662
  • 16
    Cool. What is your question? – Jørn Schou-Rode Apr 19 '10 at 19:13
  • I am looking to write a web crawler that can execute the JavaScript code on the pages I am trying to crawl. For example, some pages use JavaScript to populate themselves with data from an AJAX call or from a JavaScript array. If you open such a page in Firefox and click View -> 'Page Source', you don't see the complete HTML DOM that is shown in the browser window. But if you have the Firebug plugin installed, you can open Firebug, click the HTML tab, right-click in the debug window, choose "Copy HTML", and paste it into a text editor; then you see the HTML DOM generated by the JavaScript code. – user320662 Apr 19 '10 at 20:51
  • Basically I want to write a program that takes a URL as input, goes to that page, executes any JavaScript code on it, and returns the resulting page. This program should be able to run on Linux machines. The reason I brought up Firebug is that Firebug does exactly what I want, but I want some tool or webkit that can automate what Firebug does. Please let me know if it's still not clear. – user320662 Apr 19 '10 at 20:57
  • 2
    I think the context is pretty clear, but you still have not presented the actual question. Are you asking if this is possible? Are you looking for some library to build this on top of? Do you need ideas for building such a library yourself? – Jørn Schou-Rode Apr 19 '10 at 21:28
  • Do you want to **write** a web crawler which can execute JS *or* do you want to **use** a (3rd-party) web crawler which can execute JS? So far you've consistently stated the first, but it seems you actually meant the second and want our recommendations for it. – BalusC Apr 20 '10 at 01:39
  • Sorry for not being clear; as BalusC said, I want to build a web crawler that can crawl JavaScript-rich pages. I need some ideas/tools that can help me build this. Thank you again – user320662 Apr 20 '10 at 03:30
  • user320662 ... are you still around? I am interested in knowing what, if any, solution you developed for a JavaScript capable web crawler. – Chad Jul 21 '10 at 03:40
  • http://phantomjs.org/ – ijk May 07 '12 at 01:51
  • You could use CasperJS/PhantomJS, see http://webmasters.stackexchange.com/questions/34819/monitoring-gwt-website – Raf Nov 08 '12 at 09:19

5 Answers

6

Ruby's Capybara is an integration-testing library, but it can also be used to write stand-alone web crawlers. Since it uses backends like Selenium or headless WebKit, it interprets JavaScript out of the box:

require 'capybara/dsl'
require 'capybara-webkit'

include Capybara::DSL
# Use the headless WebKit driver so JavaScript is actually executed
Capybara.current_driver = :webkit
# Relative paths are resolved against this host
Capybara.app_host = "http://www.google.com"
page.visit("/")
# Dump the DOM as it stands after JavaScript has run
puts(page.html)
tokland
5

I've been using HtmlUnit (Java). It was originally designed for unit testing web pages. Its JavaScript support isn't perfect, but it hasn't failed me in my limited usage. According to the site, it can run the following JS frameworks to a reasonable degree (a minimal usage sketch follows the list):

  • jQuery 1.2.6
  • MochiKit 1.4.1
  • GWT 2.0.0
  • Sarissa 0.9.9.3
  • MooTools 1.2.1
  • Prototype 1.6.0
  • Ext JS 2.2
  • Dojo 1.0.2
  • YUI 2.3.0
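
As a rough sketch of what the asker describes (fetch a URL, run its JavaScript, dump the resulting DOM), fetching a page with HtmlUnit looks something like the following. The class name and URL are placeholders, and the try-with-resources usage plus the options call assume a reasonably recent HtmlUnit release, so the exact calls may differ in older versions:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedDomDumper {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the page you want to crawl
        String url = "http://www.example.com/";
        try (WebClient webClient = new WebClient()) {
            // Real-world pages often have script errors; don't let them abort the crawl
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(url);
            // Give pending AJAX calls and timers a few seconds to finish
            webClient.waitForBackgroundJavaScript(5000);
            // Serialize the DOM as it stands after JavaScript has run
            System.out.println(page.asXml());
        }
    }
}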
Jeff
2

You are more likely to have success in Java than in PHP. There is an existing JavaScript interpreter for Java called Rhino. It's a reference implementation, and well documented.

Rhino is used in lots of existing Java apps to provide JavaScript scripting ability within the app. I have also heard of it being used to run automated tests written in JavaScript.
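
As a minimal sketch (using the standard org.mozilla.javascript API; the class name and script are just for illustration), embedding Rhino and evaluating a script looks like this. Note that only the bare ECMAScript objects exist; there is no document or window unless you supply one yourself, which is the hard part for a crawler:

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoEmbeddingSketch {
    public static void main(String[] args) {
        // Enter a Rhino context for the current thread
        Context cx = Context.enter();
        try {
            // A fresh scope containing only the standard ECMAScript objects
            // (no document, no window; a crawler would have to provide those)
            Scriptable scope = cx.initStandardObjects();
            Object result = cx.evaluateString(scope, "var x = 6 * 7; x;", "<inline>", 1, null);
            System.out.println(Context.toString(result)); // prints 42
        } finally {
            Context.exit();
        }
    }
}

For what it's worth, HtmlUnit (mentioned in another answer) takes roughly this approach, wiring a DOM and browser emulation on top of a Rhino-based engine.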

I also know that Java includes code that can parse and render HTML, though someone who knows more about Java than I do can probably advise further on that. I won't deny it would be very difficult to achieve something like this; you'd essentially be re-implementing a lot of what a browser does.

thomasrutter
  • hi thomasrutter, thank you for the pointer, but I guess Rhino is just a JavaScript engine, so I would probably need to build a prototype browser using Rhino as the JavaScript engine to crawl a JavaScript-heavy page. Please correct me if I am wrong – user320662 Apr 20 '10 at 03:28
  • Java also includes HTML parsing/rendering abilities. Someone who knows more about Java than I do might be able to advise better on that; my knowledge ends here. – thomasrutter Apr 20 '10 at 04:11
1

Have a look here: http://snippets.scrapy.org/snippets/22/. It shows Scrapy, a Python screen-scraping and web-crawling framework, used with webdrivers that open a page, render everything you need, and let you "capture" anything you want on the page.

rollsappletree
1

You could use Mozilla's rendering engine Gecko:

https://developer.mozilla.org/en/Gecko

RoToRa