
I'm looking for a crawling tool, written in Java, that can detect invalid URLs on our sites.

The difficulty is that many of the URLs are generated by JavaScript, CSS3 and Ajax, so simply fetching a page's static content won't do.

The ideal would be a headless tool that executes the JavaScript, applies the CSS styling, performs the AJAX calls, and then reports the various URLs it accessed in doing so.

I do realize this is a tall order, but maybe it exists somewhere?

Jan Goyvaerts

2 Answers


I suggest using HtmlUnit (http://htmlunit.sourceforge.net/), which is made for exactly this kind of thing.
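
For the link-checking use case in the question, a rough sketch with HtmlUnit (assuming a recent 2.x release; the start URL and the 10-second script timeout are placeholders) could look like this:

    import java.util.HashSet;
    import java.util.Set;

    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class LinkChecker {

        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                // Execute JavaScript and apply CSS, but don't abort on bad status codes or script errors
                webClient.getOptions().setJavaScriptEnabled(true);
                webClient.getOptions().setCssEnabled(true);
                webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
                webClient.getOptions().setThrowExceptionOnScriptError(false);

                // Load the page and give background AJAX calls time to finish (placeholder URL)
                HtmlPage page = webClient.getPage("http://www.example.com/");
                webClient.waitForBackgroundJavaScript(10000);

                // Collect the absolute URLs of all anchors present after the scripts have run
                Set<String> urls = new HashSet<>();
                for (HtmlAnchor anchor : page.getAnchors()) {
                    String href = anchor.getHrefAttribute().trim();
                    if (href.isEmpty() || href.startsWith("#") || href.startsWith("javascript:")) {
                        continue;
                    }
                    String absolute = page.getFullyQualifiedUrl(href).toExternalForm();
                    if (absolute.startsWith("http")) {
                        urls.add(absolute);
                    }
                }

                // Request each URL and report the ones that come back with an error status
                for (String url : urls) {
                    Page linked = webClient.getPage(url);
                    int status = linked.getWebResponse().getStatusCode();
                    if (status >= 400) {
                        System.out.println("Broken: " + url + " (HTTP " + status + ")");
                    }
                }
            }
        }
    }

Because HtmlUnit actually runs the scripts, anchors injected by JavaScript or Ajax show up in getAnchors() just like static ones. URLs that only appear as script-internal requests (e.g. XHR endpoints) would need something like a WebConnectionWrapper to capture, which is beyond this sketch.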

Daniel Teply

Apache HttpComponents HttpClient: http://hc.apache.org/httpcomponents-client-ga/index.html
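
HttpClient won't execute JavaScript (see the comment below), but it can do the actual status checks once the URLs have been collected by some other means. A minimal sketch, assuming HttpClient 4.3+ and a placeholder URL list:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpHead;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class StatusChecker {

        public static void main(String[] args) throws Exception {
            // Placeholder list; in practice these would come from the crawl
            List<String> urls = Arrays.asList(
                    "http://www.example.com/",
                    "http://www.example.com/missing");

            try (CloseableHttpClient client = HttpClients.createDefault()) {
                for (String url : urls) {
                    // A HEAD request is enough to get the status code without downloading the body
                    try (CloseableHttpResponse response = client.execute(new HttpHead(url))) {
                        int status = response.getStatusLine().getStatusCode();
                        if (status >= 400) {
                            System.out.println("Broken: " + url + " (HTTP " + status + ")");
                        }
                    }
                }
            }
        }
    }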

keuleJ
  • "Note that HttpClient is not a browser. It lacks the UI, HTML renderer and a JavaScript engine that a browser will possess." – user77115 Jan 20 '13 at 08:55