Highest Voted 'nutch2' Questions

0

votes

0 answers

Errors using curl for Nutch REST API calls

I am using curl to make REST API calls to my Nutch server running on an Ubuntu instance. When I make the following call with curl to create my seed file on my server: curl -X POST http://**.185.***.**:8081/seed/create -d {"id": "ubuntu", "name":…

asked Mar 17 '20 at 04:35

Ciaran

451
1
4
14

0

votes

1 answer

Apache Nutch skipping URLs & truncating

In my nutch-site.xml, I add the following to stop truncating; however, during the fetch process, I get the following error. I want it to stop truncating and provide the results I need, which I assumed a -1 value would achieve. I'm using version…

java nutch nutch2

asked Aug 07 '19 at 14:57

Karl Hill

12,937
5
58
95

0

votes

1 answer

Apache Nutch 2.3.1, increase reducer memory

I have set up a small Hadoop cluster with HBase for Nutch 2.3.1. The Hadoop version is 2.7.7 and HBase is 0.98. I have customized a Hadoop job and now I have to set the memory for the reducer task in the driver class. I have come to know, in simple…

hadoop web-crawler nutch nutch2

asked Feb 12 '19 at 05:27

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Configuring RAM in Nutch

I am using Nutch 1.10 to crawl websites for my organization. I use a system with 16 GB of RAM to do this crawl. As of now, my nutch file uses only 3-4 GB of RAM while crawling the data and it takes almost 10 hours to finish it. Is there some way where I…

nutch nutch2

asked Jan 22 '19 at 04:53

UMA MAHESWAR

167
3
16

0

votes

1 answer

Find the number of already-existing documents in Solr with the solrindexing job in Nutch

In Nutch's solrindex job, how can we calculate the number of documents that have been updated in Solr and the number of documents that have been indexed as new documents?

solr nutch2

asked Nov 07 '18 at 11:11

Naser Aslam

29
1
4

0

votes

1 answer

Apache Nutch ranking algorithm for specific language content

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem to crawl Urdu-language content. For language detection, I have customized the fetcher to detect the language at that point. If a document does not have enough Urdu-language content (bytes), then I deliberately…

web-crawler nutch nutch2

asked Aug 27 '18 at 11:28

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch section pages handling trick

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. The idea is to crawl and index mostly story pages. For that I have prepared a seed list of some domains. Now I am facing a logical problem in Nutch: it behaves similarly for all levels of a…

solr web-crawler nutch nutch2

asked Aug 03 '18 at 09:44

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch title parsing issue for Language specific websites

I have configured Apache Nutch 2.3.1 with Hadoop 2.7.5 and HBase 0.98. I have to crawl some Urdu websites. I am using its default parsers, i.e., html and tika. Some documents have titles in Urdu that are fine, but some documents have titles in Urdu and their…

parsing nutch apache-tika nutch2

asked Aug 02 '18 at 11:22

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch 2.3.1 opic scoring filter not working

I have configured Nutch 2.3.1 with the complete Hadoop/HBase ecosystem on a small cluster. I am curious about the scoring algorithm used in Nutch. I have found and used the opic scoring filter in Nutch. To find its impact, I have checked the score at different steps…

web-crawler nutch scoring nutch2

asked May 09 '18 at 05:05

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Nutch time schedule to visit a page again

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have a few hundred domains that I want to fetch. I have fetched many of them so far. I am curious when Nutch will visit an already fetched document again and refetch it if it has been updated.…

apache web-crawler nutch nutch2

asked May 04 '18 at 07:28

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch SolrIndexer error in SolrCloud mode

I have configured Apache Nutch 2.3.1 and crawled a few websites. I have to index these documents to Solr (6.6.3), which is running in Cloud mode. When I execute the solrindex command, I get the following exception: 2018-05-02 13:10:40,679 INFO [main]…

java solr nutch solrcloud nutch2

asked May 02 '18 at 09:07

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch 2.3.1 give more preference to seed domains at selection point

I have configured Apache Nutch 2.3.1 with the complete Hadoop/HBase ecosystem. I want my crawler to give more preference to the domains given in the seed in each iteration. According to my testing, it can go completely in either direction…

web-crawler nutch giraph nutch2

asked Mar 28 '18 at 11:03

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch 2.3.1 Fetcher giving Invalid uri exception

I have configured Apache Nutch 2.3.1 with the Hadoop ecosystem. I have to fetch some Perso-Arabic script websites. Nutch is throwing an exception for a few URLs at fetch time. The following is an example exception: java.lang.IllegalArgumentException: Invalid uri…

java exception web-crawler nutch nutch2

asked Mar 20 '18 at 08:00

Hafiz Muhammad Shafiq

8,168
12
63
121

0

votes

1 answer

Apache Nutch 2.3.1 fetch specific MIME type documents

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e., documents having textual content only. I have found regex-urlfilter.txt to exclude MIME types but could not find any option to specify the MIME types that I…

apache web-crawler nutch mime-filter nutch2

asked Mar 15 '18 at 08:51

Hafiz Muhammad Shafiq

8,168
12
63
121
