Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It allows fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on top of Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

The system can be extended (e.g., to parse other document formats or extract custom information) using a plugin mechanism.

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list where users can post questions and developers can respond. Sometimes it is faster to get a reply there.

How Nutch Works

1571 questions
0
votes
1 answer

Crawling redirects later with Nutch

The nutch-default.xml suggests that there is a way to save redirect destinations on the first crawl and fetch them on the next crawl by setting http.redirect.max to 0. The first crawl finished successfully and we could see the redirect response…
Enno Shioji
  • 26,542
  • 13
  • 70
  • 109
0
votes
1 answer

Error in Nutch NoClassDefFoundError

I am studying Nutch, and I am getting this error. I am not really sure how to fix this problem. Does anyone know how to fix it? I am running Nutch on OS X Mountain Lion. apache-nutch-1.5.1 3 bin/nutch admin db -create bin/nutch:…
Dc Redwing
  • 1,681
  • 11
  • 32
  • 42
0
votes
1 answer

Hadoop 1.03 and Nutch 1.5 issue

I get the following error when I try to run nutch-1.5 on hadoop 1.03. hadoop jar nutch-1.5.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5 Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus…
Roger Garzon Nieto
  • 6,554
  • 2
  • 28
  • 24
0
votes
1 answer

Hit rate limitation in nutch

Is it possible to limit the hit rate/IP address in nutch? In other words, can I configure nutch so that it will only hit an IP x number of times per hour, etc.?
Enno Shioji
  • 26,542
  • 13
  • 70
  • 109
0
votes
3 answers

Updating Solr Field Value

Is there any way to update the value of a Solr field without reindexing the whole document?
0
votes
1 answer

How to crawl English site and avoid crawling other languages?

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of sites with plugins like the language detector, but I need to prevent Nutch from crawling non-English sites. Although I know we need to crawl a page to…
a.toraby
  • 3,232
  • 5
  • 41
  • 73
0
votes
1 answer

Nutch crawl fails when run as a background process on linux

When I run the Nutch crawl as a background process on Ubuntu in local mode, the Fetcher aborts with hung threads. The message is something like: WARN fetcher.Fetcher - Aborting with "X" hung threads. I start off the script using nohup and & as I…
cprsd
  • 473
  • 4
  • 13
0
votes
1 answer

fetch specific title in every page with nutch and solr

I have solr and nutch installed and my web page structure is that in every page the title is the same; e.g. Bank Something; but in every page there is a tag with an ID of TITLE, something like:

my page specific…

Amir
  • 341
  • 1
  • 5
  • 16
0
votes
1 answer

automatic recrawl sites in nutch 1.4?

I want to recrawl my sites 3 times a day. I know I should write a script for this, but I don't know how, and I don't know how to run the script. Can someone explain this step by step? Thanks.
0
votes
1 answer

Nutch 2 parse and outlinks

I've noticed that parse plugins like Tika extract the outlinks from the content, but the WebPage object passed to the getParse/2 method already has 2 arrays containing outlinks and inlinks. What's the difference between the extraction in getParse and…
Hugo Alves
  • 188
  • 1
  • 10
0
votes
1 answer

Nutch Parsing plugin and redirects

I am using Nutch 2.0. I've created a plugin for parsing HTML that implements Parser and works just fine. The problem is that I also need to "parse" pages that generate redirects (301,300), to get the URL and the HTTP code. My plugin ignores the…
Hugo Alves
  • 188
  • 1
  • 10
0
votes
1 answer

Nutch - does not crawl, says "Stopping at depth=1 - no more URLs to fetch"

I've been trying to crawl using Nutch for a long time, but it just doesn't seem to run. I'm trying to build a Solr search for a website, using Nutch for crawling and indexing in Solr. There were some permission problems originally but…
Abhay
  • 6,545
  • 2
  • 22
  • 17
0
votes
2 answers

error when using solr and Integrating nutch and solr(HTTP ERROR 500)

I have Ubuntu Linux 12.04 installed and I'm trying to install Nutch 1.5.1 and Solr 3.6.1 and integrate them to crawl seed URLs. I'm using this tutorial to get this to work. I followed the steps before 3.2 and skipped to step 4 and I can…
Soroush
  • 989
  • 2
  • 10
  • 16
0
votes
1 answer

Can Nutch crawl video sites?

Is it possible to use Nutch to crawl sites with only video files? Appreciate any insight into this.
0
votes
1 answer

Is it possible to have a static index field for Liferay using solr-web plugin?

Can anyone tell me if I can associate a static index field for Liferay using the solr-web.plugin? Is there a way to define a static index in solr? I need something similar to the following configuration in Nutch