Right now I'm using crawler4j and I'm pretty happy with it, but it cannot crawl AJAX-based websites. I used Selenium once for another approach, and it worked fine combined with PhantomJS. So is there a way to plug Selenium into crawler4j?

If not, is there another good Java library for handling AJAX-based websites?

(By web spider I mean that I give the program one URL and it automatically starts extracting the content from the site.)

Fabian Lurz

1 Answer


Basically yes. The source code of crawler4j is hosted on GitHub.

You are free to contribute an extension so that crawler4j can fetch AJAX-based websites. Out of the box, crawler4j is not able to fetch such sites.
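In the meantime, one pragmatic way to combine the two without touching crawler4j's internals is to let crawler4j handle link discovery and scheduling, and re-render each visited URL with Selenium inside `visit()`. Here is a minimal sketch, assuming crawler4j 4.x, the GhostDriver binding for PhantomJS on the classpath, and the PhantomJS binary on the PATH (the class name and URL filter are just illustrative):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

public class AjaxCrawler extends WebCrawler {

    // crawler4j creates one crawler instance per thread, so each
    // instance gets its own PhantomJS process.
    private final WebDriver driver = new PhantomJSDriver();

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Illustrative filter: stay on one domain.
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        // Load the page in PhantomJS so its JavaScript runs, then
        // read the rendered DOM instead of the raw HTTP response.
        driver.get(url);
        String renderedHtml = driver.getPageSource();
        // ... extract your content from renderedHtml ...
    }

    @Override
    public void onBeforeExit() {
        driver.quit(); // shut down the PhantomJS process
    }
}
```

The obvious trade-off is that every page is fetched twice (once by crawler4j's HTTP client, once by PhantomJS). For a proper integration you would replace crawler4j's page fetching itself, which is exactly the kind of extension mentioned above.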

Apache Nutch is able to render JS while crawling web pages, as described here. However, setting up Apache Nutch for web crawling is much more work than adapting your existing crawler4j-based code.
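For orientation: in Nutch 1.x the JS rendering comes from its protocol-selenium plugin, which you enable in conf/nutch-site.xml. A rough sketch (the exact plugin list depends on your Nutch version and index backend, so treat this as an example, not a drop-in config):

```xml
<!-- nutch-site.xml: swap the default HTTP protocol plugin for the
     Selenium-backed one so pages are rendered before being parsed.
     The rest of the plugin list is just a typical example. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```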

rzo1