Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It allows fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

The system can be extended through a plugin mechanism (e.g., to parse other document formats or extract custom information).

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list where users can post questions and developers can respond. It is sometimes faster to get a reply there.

1571 questions

1 answer

Crawling redirects later with Nutch

The nutch-default.xml suggests that there is a way to save redirect destinations on the first crawl and fetch them on the next crawl by setting http.redirect.max to 0. The first crawl finished successfully and we could see the redirect response…

nutch

asked Sep 17 '12 at 10:31

Enno Shioji

1 answer

Error in Nutch NoClassDefFoundError

I am studying Nutch, and I am getting this error. I am not really sure how to fix this problem. Does anyone know how to fix it? I am running Nutch on OS X Mountain Lion. apache-nutch-1.5.1 3 bin/nutch admin db -create bin/nutch:…

asked Sep 12 '12 at 06:30

Dc Redwing

1 answer

Hadoop 1.03 and Nutch 1.5 issue

I get the following error when I try to run nutch-1.5 on Hadoop 1.03. hadoop jar nutch-1.5.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5 Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus…

hadoop nutch

asked Sep 10 '12 at 18:29

Roger Garzon Nieto

1 answer

Hit rate limitation in nutch

Is it possible to limit the hit rate per IP address in Nutch? In other words, can I configure Nutch so that it will only hit an IP x number of times per hour, etc.?

web-crawler nutch robots.txt

asked Sep 10 '12 at 13:03

Enno Shioji

3 answers

Updating Solr Field Value

Is there any way to update the value of a Solr field without reindexing the whole document?

solr lucene nutch

asked Sep 10 '12 at 12:43

Julian Riedinger

1 answer

How to crawl English site and avoid crawling other languages?

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of sites with plugins like the language detector, but I need to prevent Nutch from crawling non-English sites. Although I know we need to crawl a page to…

nutch language-detection

asked Sep 05 '12 at 06:40

a.toraby

1 answer

Nutch crawl fails when run as a background process on linux

When I run the Nutch crawl as a background process on Ubuntu in local mode, the Fetcher aborts with hung threads. The message is something like: WARN fetcher.Fetcher - Aborting with "X" hung threads. I start off the script using nohup and & as I…

linux ubuntu ssh nutch

asked Aug 29 '12 at 15:18

cprsd

1 answer

fetch specific title in every page with nutch and solr

I have Solr and Nutch installed, and my web pages are structured so that every page has the same title, e.g. Bank Something, but every page also has a tag with an ID of TITLE, something like:

my page specific…

apache solr lucene nutch dismax

asked Aug 26 '12 at 05:43

Amir

1 answer

automatic recrawl sites in nutch 1.4?

I want to recrawl my sites 3 times a day. I know I should write a script for this, but I don't know how, and I don't know how to run the script. Can someone explain this step by step? Thanks.

nutch web-crawler

asked Aug 23 '12 at 07:23

user1618925

1 answer

Nutch 2 parse and outlinks

I've noticed that parse plugins like Tika extract the outlinks from the content, but the WebPage object passed to the getParse/2 method already has 2 arrays containing outlinks and inlinks. What's the difference between the extraction in getParse and…

nutch

asked Aug 13 '12 at 10:35

Hugo Alves

1 answer

Nutch Parsing plugin and redirects

I am using Nutch 2.0. I've created a plugin for parsing HTML that implements Parser and works just fine. The problem is that I also need to "parse" pages that generate redirects (301, 300) to get the URL and the HTTP code. My plugin ignores the…

nutch web-crawler

asked Aug 08 '12 at 12:11

Hugo Alves

1 answer

Nutch - does not crawl, says "Stopping at depth=1 - no more URLs to fetch"

I've been trying to crawl with Nutch for a long time, but it just doesn't seem to run. I'm trying to build a Solr search for a website, using Nutch for crawling and indexing in Solr. There were some permission problems originally but…

nutch web-crawler

asked Jul 29 '12 at 15:34

Abhay

2 answers

error when using solr and Integrating nutch and solr(HTTP ERROR 500)

I have Ubuntu Linux 12.04 installed and I'm trying to install Nutch 1.5.1 and Solr 3.6.1 and integrate them to crawl seed URLs. I'm using this tutorial to get this to work. I followed the steps before 3.2 and skipped to step 4 and I can…

solr integration web-crawler nutch

asked Jul 24 '12 at 14:40

Soroush

1 answer

Can Nutch crawl video sites?

Is it possible to use Nutch to crawl sites with only video files? I would appreciate any insight into this.

video indexing nutch

asked Jul 20 '12 at 09:19

Namrata Hangal

1 answer

Is it possible to have a static index field for Liferay using solr-web plugin?

Can anyone tell me if I can associate a static index field for Liferay using the solr-web.plugin? Is there a way to define a static index in solr? I need something similar to the following configuration in Nutch …

solr indexing liferay nutch

asked Jul 20 '12 at 05:13

Namrata Hangal
