Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open-source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.

The system can be extended (e.g. to parse other document formats or extract custom information) using a plugin mechanism.

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list, a place where users can post questions and developers can respond. Sometimes, it is faster to get a reply over there.

How Nutch Works

1571 questions
9
votes
4 answers

How to Open an Ant project (Nutch Source) in IntelliJ IDEA?

I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse However, I am not familiar with Ant (I use Maven)…
kamaci
  • 72,915
  • 69
  • 228
  • 366
8
votes
5 answers

How to produce massive amount of data?

I'm doing some testing with nutch and hadoop and I need a massive amount of data. I want to start with 20GB, go to 100 GB, 500 GB and eventually reach 1-2 TB. The problem is that I don't have this amount of data, so I'm thinking of ways to produce…
AAaa
  • 3,659
  • 7
  • 35
  • 40
8
votes
4 answers

Have you indexed nutch crawl results using elasticsearch before?

Has anyone had any luck writing custom indexers for nutch to index the crawl results with elasticsearch? Or do you know of any that already exist?
neildf
  • 95
  • 1
  • 6
8
votes
1 answer

Nutch: Invoke in Java, not command line?

Am I being thick or is there really no way to invoke Apache Nutch through some Java code programmatically? Where is the documentation (or a guide or tutorial) on how to do this? Google has failed me. So I actually tried Bing. (Yes, I know,…
ChrisJF
  • 6,822
  • 4
  • 36
  • 41
8
votes
1 answer

could to find or load main class org.apache.nutch.crawl.InjectorJob

I'm using Linux with Hadoop, Cloudera and HBase. Could you tell me how to correct this error? Error: could to find or load main class org.apache.nutch.crawl.InjectorJob The following command gave me the error: src/bin/nutch inject crawl/crawldb…
orilion
  • 81
  • 4
8
votes
10 answers

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic and/or queries, then show them the document…
anon
8
votes
3 answers

Solr indexing following a Nutch crawl fails, reports "Job Failed"

I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given on the Nutch site…
rldrummer
  • 81
  • 1
  • 2
8
votes
5 answers

Nutch in Windows: Failed to set permissions of path

I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error: Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700 From a lot of…
8
votes
1 answer

Apache Nutch 2.1 different batch id (null)

I crawl a few sites with Apache Nutch 2.1. While crawling I see the following message on a lot of pages, e.g.: Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (null). What causes this error? How can I resolve…
Dragan Menoski
  • 1,092
  • 14
  • 33
8
votes
1 answer

Error while indexing in solr data crawled by nutch

I have started working with Nutch and Solr and I have a problem with integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3…
user1831647
  • 81
  • 1
  • 2
7
votes
0 answers

Nutch problems executing crawl on Windows

I am trying to get nutch 1.11 to execute a crawl. I am using cygwin to run these commands in Windows 8. I have put hadoop-core jar into lib folder but when I try to run a crawl I obtain: Exception in thread "main" java.lang.NoSuchMethodError:…
Daniel Z.
  • 71
  • 3
7
votes
1 answer

Maximum number of Apache Nutch worker instances

What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?
7
votes
0 answers

Solr dedup error Failed with exit value 255

I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears in the console: IndexingJob: done. SOLR dedup ->…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
7
votes
1 answer

Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink: Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You…
user3367701
  • 823
  • 1
  • 8
  • 17
7
votes
4 answers

Does any open, easily extensible web crawler exist?

I am searching for a web crawler solution that is mature enough and can be easily extended. I am interested in the following features... or the possibility of extending the crawler to meet them: partly just to read the feeds of several sites to scrape the…
fifigyuri
  • 5,771
  • 8
  • 30
  • 50