Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It allows fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

The system can be extended (e.g. to parse other document formats or extract custom information) using a plugin mechanism.
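As a rough illustration of what a plugin mechanism for document formats can look like, here is a minimal Python sketch. It is not Nutch's plugin API (Nutch plugins are written in Java and wired up through its own configuration); the names `PARSERS`, `register_parser`, and `parse` are hypothetical and exist only for this example. It shows a registry that maps a MIME type to a parser, so that support for a new format can be added without touching the dispatch code.

```python
from html.parser import HTMLParser
from typing import Callable, Dict

# Hypothetical registry mapping a MIME type to a parser function.
# This is NOT Nutch's plugin API; it only illustrates the general idea.
PARSERS: Dict[str, Callable[[bytes], dict]] = {}


def register_parser(mime_type: str):
    """Decorator that registers a parser function for one MIME type."""
    def decorator(func):
        PARSERS[mime_type] = func
        return func
    return decorator


class _TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


@register_parser("text/html")
def parse_html(content: bytes) -> dict:
    """Extract custom information (here, only the title) from HTML."""
    extractor = _TitleExtractor()
    extractor.feed(content.decode("utf-8", errors="replace"))
    return {"title": extractor.title.strip()}


@register_parser("text/plain")
def parse_text(content: bytes) -> dict:
    """Treat the first line of a plain-text document as its title."""
    lines = content.decode("utf-8", errors="replace").splitlines()
    return {"title": lines[0].strip() if lines else ""}


def parse(mime_type: str, content: bytes) -> dict:
    """Dispatch to whichever parser is registered for the MIME type."""
    parser = PARSERS.get(mime_type)
    if parser is None:
        raise ValueError(f"no parser registered for {mime_type}")
    return parser(content)
```

Supporting another format in this sketch means writing one more function decorated with `register_parser`; the `parse` dispatcher stays unchanged, which is the property a plugin mechanism is meant to provide.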

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list, a place where users can post questions and developers can respond. Sometimes, it is faster to get a reply over there.

1571 questions

1 answer

How to parse content located in specific HTML tags using nutch plugin?

I am using Nutch to crawl websites and I want to parse specific sections of the HTML pages crawled by Nutch. For example: title to search, content to search, other…

nutch

asked Jul 31 '13 at 14:02

abhijeet

1 answer

Nutch-Cygwin How to set JAVA_HOME

I am trying to run Nutch with Cygwin. I am having problems setting JAVA_HOME. $ export JAVA_HOME='/cygdrive/f/program files/java/jdk1.6.0_21' When I run the nutch command $ bin/nutch crawl I get cygpath: can't convert empty path bin/nutch: line…

cygwin nutch

asked Feb 19 '12 at 00:47

Kennedy

1 answer

Nutch on EMR problem reading from S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR. To do this I specify an input directory from S3. I get the following error: Fetcher: java.lang.IllegalArgumentException: This file system object…

java hadoop amazon-web-services nutch

asked Aug 30 '11 at 01:52

Peter H

4 answers

How to get the html content from nutch

Is there any way to get the HTML content of each web page in Nutch while crawling it?

nutch

asked Feb 25 '11 at 23:16

ragaa

1 answer

Nutch API advice

I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to save the data to disk and I need it to be able to…

java web-crawler nutch

asked Dec 02 '10 at 21:37

Eugen

1 answer

nutch 1.10 input path does not exist /linkdb/current

When I run nutch 1.10 with the following command, assuming that TestCrawl2 did not previously exist and needs to be created,... sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20 I receive an…

hadoop solr nutch

asked Nov 03 '15 at 20:44

Anonymous Man

2 answers

Apache Nutch steps explanation

I have followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr, but I want to check whether I have understood the Nutch steps correctly. 1). Inject: In this part, Apache reads the URL list from the given seed.txt,…

apache nutch

asked Apr 12 '15 at 12:21

user3089214

1 answer

Where is the crawled data stored when running nutch crawler?

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis. I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr since I may need to search text in the future)…

web-crawler nutch

asked Mar 30 '15 at 09:43

Marco99

2 answers

zookeeper unable to open socket to localhost/0:0:0:0:0:0:0:1:2181

I am using a ZooKeeper ensemble for HBase. ZooKeeper is running on 3 machines, and HBase is also in fully distributed mode. I have Nutch 2.x. When I start Nutch to crawl some data, it gives the following errors in the Nutch log file. ERROR…

apache hbase nutch apache-zookeeper

asked Jan 23 '15 at 12:13

Hafiz Muhammad Shafiq

4 answers

Nutch message "No IndexWriters activated" while loading to solr

I have run the Nutch crawler as per the Nutch tutorial http://wiki.apache.org/nutch/NutchTutorial, but when I started loading it to Solr I got this message: "No IndexWriters activated - check your configuration" bin/nutch solrindex…

solr nutch

asked Jul 15 '13 at 08:13

Subodh Gupta

5 answers

Latest compatible versions of Nutch and Solr

I see different combinations of Nutch and Solr versions being used by people posting about this subject on the web. Which are the latest stable (non-beta) and compatible versions of Nutch and Solr that I can download and set up without building…

solr nutch

asked May 15 '13 at 17:32

MarioCannistra

1 answer

Creating an Akka fat Jar

I need to create a Nutch plugin that communicates with some external applications using Akka. In order to do this, I need to package the plugin as a fat Jar - I am using sbt-assembly version 0.8.3. When I try to run the plugin, I get the exception…

scala sbt akka nutch sbt-assembly

asked Mar 04 '13 at 13:51

Andrea

2 answers

Using nutch in Windows 7

I am trying to use Nutch 1.6 in a Windows environment, but every time I follow the procedure given in the Apache Nutch Tutorial I end up with the following exception: Exception in thread "main" java.io.IOException: Failed to…

windows windows-7 cygwin nutch

asked Dec 24 '12 at 07:03

Ajay Nair

2 answers

connection refused error when running Nutch 2

I am trying to run Nutch 2 crawler on my system but I get the following error: Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionException: java.net.ConnectException: Connection…

java web-crawler nutch

asked Sep 25 '12 at 10:53

orezvani

1 answer

Crawling using Nutch...Shows an IOException

I've started using Nutch and everything was fine until I encountered an IOException: $ ./nutch crawl urls -dir myCrawl -depth 2 -topN 4 cygpath: can't convert empty path solrUrl is not set, indexing will be skipped... crawl started in:…

java open-source web-crawler nutch ioexception

asked Jun 22 '12 at 22:27

python-coder
