I am using Apache Nutch 1.10 to crawl web pages and extract their content. Some of the pages contain dynamic content that is loaded through AJAX calls. Nutch is not able to crawl and extract this AJAX-loaded content. How can I solve this? Is there a solution? If so, please help me with your answers.

Thanks in advance.

yoganandh

2 Answers

Most web crawler libraries do not offer JavaScript rendering out of the box. You usually have to plug in another library or product that offers JavaScript rendering, such as Selenium or PhantomJS.

Here is a tutorial using Nutch and Selenium.

sjdirect
  • Thanks for your response. I followed the instructions in that link and included the Selenium plugin, and everything runs fine, but after crawling there is no data. If I don't use the Selenium plugin, I do get the page content. – yoganandh Oct 08 '15 at 08:37
  • I have the same problem: no content after crawling. Did you compile Nutch as instructed in the tutorial? – jmng Nov 11 '15 at 12:18

Check out the latest Nutch 1.11 trunk, which includes a new plugin, protocol-interactiveselenium (https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium).

This plugin allows you to write your own handler and execute JavaScript to get dynamic content.
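To give a rough idea of what "write your own handler" means, here is a minimal, self-contained sketch of the pattern. Note that `Browser`, `PageHandler`, `LoadMoreHandler` and `HandlerDispatcher` are hypothetical names made up for this illustration; they are not the plugin's actual interfaces, and `Browser` only stands in for a real browser driver such as Selenium's. The `#load-more` selector and the `/products` URL check are assumptions as well. Check the plugin source for the real handler interface before writing one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a real browser driver (not a Nutch or Selenium type).
interface Browser {
    String currentUrl();

    // Runs a script in the loaded page and returns its result as text.
    String executeScript(String script);
}

// Hypothetical handler contract: decide per URL, then act on the loaded page.
interface PageHandler {
    boolean shouldProcessUrl(String url);

    String process(Browser browser);
}

// Example handler: on product pages, click an assumed "load more" button
// and return the HTML as it looks after the script has run.
final class LoadMoreHandler implements PageHandler {
    @Override
    public boolean shouldProcessUrl(String url) {
        return url.contains("/products");
    }

    @Override
    public String process(Browser browser) {
        // Trigger the AJAX call that the page would normally issue on user action.
        browser.executeScript("document.querySelector('#load-more').click();");
        // Read back the DOM, which now includes the dynamically loaded content.
        return browser.executeScript("return document.documentElement.outerHTML;");
    }
}

// Picks the first registered handler that accepts the current URL.
final class HandlerDispatcher {
    private final List<PageHandler> handlers = new ArrayList<>();

    void register(PageHandler handler) {
        handlers.add(handler);
    }

    // Returns the handler's output, or null when no handler matches the URL.
    String fetch(Browser browser) {
        for (PageHandler handler : handlers) {
            if (handler.shouldProcessUrl(browser.currentUrl())) {
                return handler.process(browser);
            }
        }
        return null;
    }
}
```

In a real setup the browser object would be supplied by the plugin, and the handler would only contain the page-specific steps (clicking, scrolling, waiting) needed to make the AJAX content appear.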

Sujen Shah