Questions tagged [nutch]

Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Nutch is open source web-search software. It builds on Hadoop, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but gains much of its strength from running in a Hadoop cluster.

The system can be extended (e.g. to parse other document formats or extract custom information) using a plugin mechanism.

For more information about Nutch, please see the Nutch wiki.

Nutch has a mailing list where users can post questions and developers can respond. Sometimes it is faster to get a reply there.

1571 questions

4 answers

How to open an Ant project (Nutch source) in IntelliJ IDEA?

I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse However, I am not familiar with Ant (I use Maven)…

ant intellij-idea nutch

asked Mar 12 '13 at 09:27

kamaci

5 answers

How to produce a massive amount of data?

I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, 500 GB and eventually reach 1-2 TB. The problem is that I don't have this amount of data, so I'm thinking of ways to produce…

java hadoop nutch bigdata

asked Dec 29 '11 at 12:59

AAaa

4 answers

Have you indexed Nutch crawl results using Elasticsearch before?

Has anyone had any luck writing custom indexers for Nutch to index the crawl results with Elasticsearch? Or do you know of any that already exist?

lucene full-text-search web-crawler nutch elasticsearch

asked May 15 '11 at 23:58

neildf

1 answer

Nutch: Invoke in Java, not command line?

Am I being thick or is there really no way to invoke Apache Nutch through some Java code programmatically? Where is the documentation (or a guide or tutorial) on how to do this? Google has failed me. So I actually tried Bing. (Yes, I know,…

java web-crawler nutch

asked Mar 24 '11 at 14:50

ChrisJF

1 answer

Could not find or load main class org.apache.nutch.crawl.InjectorJob

I'm using Linux with Hadoop, Cloudera and HBase. Could you tell me how to correct this error? Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob The following command gave me the error: src/bin/nutch inject crawl/crawldb…

hadoop solr nutch

asked Mar 09 '15 at 09:27

orilion

10 answers

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic and/or queries then show them the document…

lucene solr nutch

asked Oct 21 '08 at 21:15

anon

3 answers

Solr indexing following a Nutch crawl fails, reports "Job Failed"

I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given on the Nutch site…

solr nutch

asked Feb 07 '14 at 00:40

rldrummer

5 answers

Nutch in Windows: Failed to set permissions of path

I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error: Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700 From a lot of…

windows solr hadoop cygwin nutch

asked Mar 03 '13 at 16:53

Boris Crismancich

1 answer

Apache Nutch 2.1 different batch id (null)

I crawl a few sites with Apache Nutch 2.1. While crawling I see the following message on a lot of pages, e.g.: Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (null). What causes this error? How can I resolve…

apache nutch web-crawler

asked Feb 12 '13 at 08:33

Dragan Menoski

1 answer

Error while indexing data crawled by Nutch in Solr

I have started working with Nutch and Solr and I have a problem integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3…

solr indexing runtime-error nutch

asked Nov 17 '12 at 09:56

user1831647

0 answers

Nutch problems executing crawl on Windows

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands on Windows 8. I have put the hadoop-core jar into the lib folder, but when I try to run a crawl I get: Exception in thread "main" java.lang.NoSuchMethodError:…

windows web-crawler nutch

asked May 12 '16 at 08:48

Daniel Z.

1 answer

Maximum number of Apache Nutch worker instances

What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?

hadoop nutch

asked Dec 17 '15 at 02:39

Sanaz Marshall

0 answers

Solr dedup error Failed with exit value 255

I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears in the console: IndexingJob: done. SOLR dedup ->…

java apache solr web-crawler nutch

asked Jan 28 '15 at 05:53

Hafiz Muhammad Shafiq

1 answer

Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink: Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You…

apache hadoop web-scraping nutch

asked Mar 09 '14 at 14:47

user3367701

4 answers

Does any open, simply extendible web crawler exist?

I am searching for a web crawler solution which is mature enough and can be simply extended. I am interested in the following features... or the possibility to extend the crawler to meet them: partly just to read the feeds of several sites to scrape the…

web-scraping web-crawler nutch

asked Jan 18 '10 at 10:11

fifigyuri
