Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster.

The system can be extended through a plugin mechanism (e.g., to parse other document formats or to extract custom information).

For more information about Nutch, please see the Nutch wiki.

Nutch also has a mailing list where users can post questions and developers can respond. Sometimes it is faster to get a reply there.

How Nutch Works

1571 questions
7
votes
1 answer

How to parse content located in specific HTML tags using nutch plugin?

I am using Nutch to crawl websites and I want to parse specific sections of html pages crawled by Nutch. For example, title to search
content to search
other…
abhijeet
  • 849
  • 2
  • 19
  • 54
6
votes
1 answer

Nutch-Cygwin How to set JAVA_HOME

I am trying to run Nutch with Cygwin and am having problems setting JAVA_HOME. $ export JAVA_HOME='/cygdrive/f/program files/java/jdk1.6.0_21' When I run the nutch command $ bin/nutch crawl I get cygpath: can't convert empty path bin/nutch: line…
Kennedy
  • 2,146
  • 6
  • 31
  • 44
6
votes
1 answer

Nutch on EMR problem reading from S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR. To do this I specify an input directory from S3. I get the following error: Fetcher: java.lang.IllegalArgumentException: This file system object…
Peter H
  • 608
  • 4
  • 12
6
votes
4 answers

How to get the html content from nutch

Is there any way to get the HTML content of each web page in Nutch while crawling?
ragaa
  • 61
  • 1
  • 5
6
votes
1 answer

Nutch API advice

I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to save the data to disk and I need it to be able to…
Eugen
  • 8,523
  • 8
  • 52
  • 74
6
votes
1 answer

nutch 1.10 input path does not exist /linkdb/current

When I run nutch 1.10 with the following command, assuming that TestCrawl2 did not previously exist and needs to be created,... sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20 I receive an…
Anonymous Man
  • 2,776
  • 5
  • 19
  • 38
6
votes
2 answers

Apache Nutch steps explanation

I have followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr, but I want to check whether I have understood the Nutch steps correctly. 1). Inject: In this part, apache reads the URL list from the given seed.txt,…
user3089214
  • 267
  • 3
  • 14
6
votes
1 answer

Where is the crawled data stored when running nutch crawler?

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis. I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr since I may need to search text in the future)…
Marco99
  • 1,639
  • 1
  • 19
  • 32
6
votes
2 answers

zookeeper unable to open socket to localhost/0:0:0:0:0:0:0:1:2181

I am using a ZooKeeper ensemble for HBase. ZooKeeper is running on 3 machines, and HBase is also in fully distributed mode. I have Nutch 2.x. When I start Nutch to crawl some data, it gives the following errors in the Nutch log file. ERROR…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
6
votes
4 answers

Nutch message "No IndexWriters activated" while loading to solr

I have run the Nutch crawler as per the Nutch tutorial http://wiki.apache.org/nutch/NutchTutorial but when I started loading it into Solr I got this message: "No IndexWriters activated - check your configuration" bin/nutch solrindex…
Subodh Gupta
  • 193
  • 1
  • 3
  • 12
6
votes
5 answers

Latest compatible versions of Nutch and Solr

I see different combinations of Nutch and Solr versions being used by people posting about this subject on the web. Which are the latest stable (non-beta) and compatible versions of Nutch and Solr that I can download and set up without building…
MarioCannistra
  • 275
  • 3
  • 12
6
votes
1 answer

Creating an Akka fat Jar

I need to create a Nutch plugin that communicates with some external applications using Akka. In order to do this, I need to package the plugin as a fat Jar - I am using sbt-assembly version 0.8.3. When I try to run the plugin, I get the exception…
Andrea
  • 20,253
  • 23
  • 114
  • 183
6
votes
2 answers

Using nutch in Windows 7

I am trying to use Nutch 1.6 in a Windows environment, but every time I follow the procedure given on the site Nutch Tutorial Apache I end up with the following exception: Exception in thread "main" java.io.IOException: Failed to…
Ajay Nair
  • 1,827
  • 3
  • 20
  • 33
6
votes
2 answers

connection refused error when running Nutch 2

I am trying to run Nutch 2 crawler on my system but I get the following error: Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionException: java.net.ConnectException: Connection…
orezvani
  • 3,595
  • 8
  • 43
  • 57
6
votes
1 answer

Crawling using Nutch...Shows an IOException

I've started using Nutch and everything was fine until I encountered an IOException exception, $ ./nutch crawl urls -dir myCrawl -depth 2 -topN 4 cygpath: can't convert empty path solrUrl is not set, indexing will be skipped... crawl started in:…
python-coder
  • 2,128
  • 5
  • 26
  • 37