
I'm looking for a crawling tool, written in Java, that can detect invalid URLs on our sites.

The difficulty is that many of the URLs are generated by JavaScript, CSS3 and Ajax, so simply fetching a page's static content won't do.

The ideal would be a headless tool that executes the JavaScript, applies the CSS styling, performs the AJAX calls, and then reports the various URLs it accessed in doing so.

I do realize this is a tall order, but maybe it exists somewhere?

Jan Goyvaerts

2 Answers


I suggest using HtmlUnit (http://htmlunit.sourceforge.net/), which is made for exactly this kind of thing.
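
For the link-checking use case in the question, a rough sketch with HtmlUnit (assuming a recent 2.x release; the start URL and the 10-second script timeout are placeholders) could look like this:

    import java.util.HashSet;
    import java.util.Set;

    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class LinkChecker {

        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                // Execute JavaScript and apply CSS, but don't abort on bad status codes or script errors
                webClient.getOptions().setJavaScriptEnabled(true);
                webClient.getOptions().setCssEnabled(true);
                webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
                webClient.getOptions().setThrowExceptionOnScriptError(false);

                // Load the page and give background AJAX calls time to finish (placeholder URL)
                HtmlPage page = webClient.getPage("http://www.example.com/");
                webClient.waitForBackgroundJavaScript(10000);

                // Collect the absolute URLs of all anchors present after the scripts have run
                Set<String> urls = new HashSet<>();
                for (HtmlAnchor anchor : page.getAnchors()) {
                    String href = anchor.getHrefAttribute().trim();
                    if (href.isEmpty() || href.startsWith("#") || href.startsWith("javascript:")) {
                        continue;
                    }
                    String absolute = page.getFullyQualifiedUrl(href).toExternalForm();
                    if (absolute.startsWith("http")) {
                        urls.add(absolute);
                    }
                }

                // Request each URL and report the ones that come back with an error status
                for (String url : urls) {
                    Page linked = webClient.getPage(url);
                    int status = linked.getWebResponse().getStatusCode();
                    if (status >= 400) {
                        System.out.println("Broken: " + url + " (HTTP " + status + ")");
                    }
                }
            }
        }
    }

Because HtmlUnit actually runs the scripts, anchors injected by JavaScript or Ajax show up in getAnchors() just like static ones. URLs that only appear as script-internal requests (e.g. XHR endpoints) would need something like a WebConnectionWrapper to capture, which is beyond this sketch.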

Daniel Teply

Apache HttpComponents HttpClient: http://hc.apache.org/httpcomponents-client-ga/index.html
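
HttpClient won't execute JavaScript (see the comment below), but it can do the actual status checks once the URLs have been collected by some other means. A minimal sketch, assuming HttpClient 4.3+ and a placeholder URL list:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpHead;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class StatusChecker {

        public static void main(String[] args) throws Exception {
            // Placeholder list; in practice these would come from the crawl
            List<String> urls = Arrays.asList(
                    "http://www.example.com/",
                    "http://www.example.com/missing");

            try (CloseableHttpClient client = HttpClients.createDefault()) {
                for (String url : urls) {
                    // A HEAD request is enough to get the status code without downloading the body
                    try (CloseableHttpResponse response = client.execute(new HttpHead(url))) {
                        int status = response.getStatusLine().getStatusCode();
                        if (status >= 400) {
                            System.out.println("Broken: " + url + " (HTTP " + status + ")");
                        }
                    }
                }
            }
        }
    }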

keuleJ
  • "Note that HttpClient is not a browser. It lacks the UI, HTML renderer and a JavaScript engine that a browser will possess." – user77115 Jan 20 '13 at 08:55