Questions tagged [nutch2]
29 questions
3
votes
1 answer
Combiner function in Apache Hadoop with Gora
I have a simple Hadoop, Nutch 2.x, HBase cluster. I have to write an MR job that computes some statistics. It is a two-step job, i.e., I think I also need a combiner function. In plain Hadoop jobs this is not a big problem, as plenty of guides are available, e.g.,…
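For the combiner piece, a minimal sketch in plain MapReduce terms (StatsReducer is a hypothetical class; in a Gora-based job the map side would typically extend GoraMapper, not shown here). Because addition is commutative and associative, the same reducer class can safely double as the combiner:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer summing per-key counts. Registered twice on the Job:
    //   job.setCombinerClass(StatsReducer.class);  // runs map-side, pre-aggregates
    //   job.setReducerClass(StatsReducer.class);   // runs reduce-side, final sums
    public class StatsReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }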

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Updating max depth for the Apache Nutch crawler in the scoring-depth filter is not working
I have set up Apache Nutch 1.18 to crawl the web. For ranking, I am using the scoring-depth filter. By default, the max depth is set to 1000 (for each page crawled). Now I have to update this value (increase it, for example). I have updated the following…
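For reference, the scoring-depth plugin reads its limit from scoring.depth.max in nutch-site.xml; a sketch with an assumed larger value. Note that URLs already in the CrawlDB carry the previously recorded max depth in their metadata, so a raised limit may only take effect for newly discovered or re-injected URLs:

    <property>
      <name>scoring.depth.max</name>
      <value>2000</value>
    </property>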

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Apache Nutch not reading a new configuration file when run with job file
I have configured Apache Nutch 1.x for web crawling. There is a requirement to add some extra information to the Solr document for each domain that is indexed. The configuration is a JSON file. I have developed the following code for this and tested…
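A sketch of the usual fix, assuming the JSON file is packaged inside the job file (the resource name custom-domains.json is hypothetical): load it from the classpath via Hadoop's Configuration, since local file paths on the submitting machine do not exist on the task nodes:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import org.apache.hadoop.conf.Configuration;

    public class DomainConfigLoader {
        // Reads a resource bundled in the job jar (e.g. placed under conf/
        // before building the job file). "custom-domains.json" is hypothetical.
        public static String load(Configuration conf) throws IOException {
            Reader reader = conf.getConfResourceAsReader("custom-domains.json");
            if (reader == null) {
                throw new IOException("custom-domains.json not found on classpath");
            }
            StringBuilder sb = new StringBuilder();
            try (BufferedReader br = new BufferedReader(reader)) {
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
    }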

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
0 answers
Nutch - Visit a few pages again and again to find new links
I have set up Nutch 1.17 to crawl a few thousand domains, crawling inlinks only. One of my main requirements is that I have to visit the home pages again and again (let's say every 2 hours), and if there is any new page, only that page should be…
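A sketch of one common approach, assuming the AdaptiveFetchSchedule is acceptable: shorten the default re-fetch interval in nutch-site.xml. The caveat is that this interval applies to all pages, not just home pages; restricting frequent revisits to home pages alone would need per-URL metadata or a custom fetch schedule. Values below are illustrative:

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <property>
      <name>db.fetch.interval.default</name>
      <value>7200</value> <!-- seconds: re-fetch roughly every 2 hours -->
    </property>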

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Apache Nutch not crawling all websites in in-links
I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. The configuration is as follows:
    <property>
      <name>db.score.link.internal</name>
      <value>5.0</value>
    </property>
…
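One setting worth checking alongside the above is db.ignore.external.links, which controls whether outlinks to other hosts are followed at all; a sketch of the nutch-site.xml entry (the value shown is what cross-site crawling requires, not the asker's confirmed setting):

    <property>
      <name>db.ignore.external.links</name>
      <value>false</value> <!-- must be false for links to other sites to be crawled -->
    </property>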

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Apache Nutch flushes Gora records after a limit
I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have not changed gora.buffer.read.limit and gora.buffer.write.limit, i.e., I am using their default values, which are 10000 in both cases. At the generate phase, I set topN to 100,000. During generate…
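For reference, both buffer limits can be pinned explicitly; a sketch of the nutch-site.xml entries, using the defaults the question mentions (depending on the setup they may instead live in gora.properties):

    <property>
      <name>gora.buffer.read.limit</name>
      <value>10000</value>
    </property>
    <property>
      <name>gora.buffer.write.limit</name>
      <value>10000</value>
    </property>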

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
How can I connect Apache Nutch 2.x to a remote HBase cluster?
I have two machines. One machine runs HBase 0.92.2 in pseudo-distributed mode, while the other runs the Nutch 2.x crawler. How can I configure these two machines so that the machine with HBase 0.92.2 acts as back-end storage and the other with…
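The usual sketch: put an hbase-site.xml on the Nutch machine's classpath (e.g. in Nutch's conf/) that points at the ZooKeeper quorum used by the HBase machine; the hostname below is a placeholder:

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>hbase-host.example.com</value> <!-- placeholder for the HBase machine -->
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>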

zahid adeel
- 123
- 4
0
votes
1 answer
Apache Nutch is crawling some domains more and others less with the default configuration
I have set up Apache Nutch 1.18 on a Hadoop cluster. I have given it a seed of around 10k URLs. After some time, I ran the domainstats command to get statistics for each domain. I found that Nutch is crawling some websites more…
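Out of the box the generator simply takes the highest-scoring URLs, so a few hosts can dominate each segment. If a more even spread across domains is the goal, the generator can cap URLs per domain; a sketch of the relevant nutch-site.xml entries (the cap of 100 is illustrative):

    <property>
      <name>generate.count.mode</name>
      <value>domain</value>
    </property>
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>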

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
I had some questions on db_redir_temp
I injected some URLs to crawl, ran one round, and found some URLs marked as db_redir_temp.
{"url":"http://www.universityhealth.org","pst":"temp_moved(13), lastModified=0:…

Ravi Kiran
- 65
- 6
0
votes
1 answer
Nutch http.redirect.max: may I know what it means?
I am crawling, for example, 1000 websites. When I run readdb, some websites show db_redirect_temp and db_redirect_moved. If I set http.redirect.max=10, is this value applied per website, or are only 10 redirects allowed for the entire crawl?
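For context: http.redirect.max caps how many redirect hops the fetcher follows while fetching one page, so it applies per URL fetch, not as a total budget for the whole crawl. In nutch-site.xml:

    <property>
      <name>http.redirect.max</name>
      <value>10</value> <!-- redirect hops followed per URL fetch, not per crawl -->
    </property>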

Ravi Kiran
- 65
- 6
0
votes
0 answers
org.apache.tika.utils.XMLReaderUtils acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
When running Nutch jobs, the following warning is shown:
Oct 13, 2020 8:46:18 AM org.apache.tika.utils.XMLReaderUtils
acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider
increasing the XMLReaderUtils.POOL_SIZE
May I know what it means? I am using…
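The warning means parser threads are waiting on Tika's shared pool of SAX parsers. If it shows up constantly, the pool can be enlarged; a sketch, assuming a recent Tika version where XMLReaderUtils.setPoolSize is available (20 is an arbitrary illustrative value):

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.utils.XMLReaderUtils;

    public class TikaPoolTuning {
        public static void main(String[] args) throws TikaException {
            // The default pool is small; raising it reduces contention when
            // many parsing threads run concurrently.
            XMLReaderUtils.setPoolSize(20);
        }
    }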

Ravi Kiran
- 65
- 6
0
votes
1 answer
nutch fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com
I am running parsechecker for the URL https://www.nicobuyscars.com. Output: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com
May I know what the issue is and how to solve it? I tried changing the…
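HTTP 403 means the server refused the request, which for crawlers is very often user-agent based blocking (and some sites block bots regardless of agent string). The first thing to verify is the agent configuration in nutch-site.xml; the value below is a placeholder:

    <property>
      <name>http.agent.name</name>
      <value>MyCrawler</value> <!-- placeholder; some sites return 403 for empty/default agents -->
    </property>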

Ravi Kiran
- 65
- 6
0
votes
1 answer
Nutch 1.17 web crawling with storage optimization
I am using Nutch 1.17 to crawl over a million websites. I have to do the following.
First, run the crawler once as a deep crawl so that it fetches the maximum number of URLs from the given (1 million) domains. For the first time, it can be run for a maximum of 48…
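As a reference point, the deep first pass is usually just the bundled crawl script run for many rounds; a sketch, assuming the Nutch 1.17 bin/crawl usage with the -i (index) and -s (seed dir) flags, with illustrative paths and round count:

    # inject seeds once, then run 20 generate/fetch/parse/updatedb rounds
    bin/crawl -i -s urls/ crawldir/ 20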

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
2 answers
Restrict Nutch to the seed path and the pages under it only
I have set up Nutch 2.x to crawl a few domains that are multilingual. I can restrict Nutch to inlinks only, but not to subfolders. For example, for the following seed,
https://www.bbc.com/urdu
I just want to crawl URLs under /urdu, as this website contains…
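A common sketch for this kind of path restriction is conf/regex-urlfilter.txt, where the first matching rule wins (the pattern below assumes the https://www.bbc.com/urdu seed above):

    # accept only URLs under the /urdu path
    +^https?://www\.bbc\.com/urdu
    # reject everything else
    -.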

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch: index only article pages to Solr
I have set up Nutch 1.17 to crawl a few websites. At a high level, there are two types of web pages. First, category or home pages, which do not contain the details of any specific story but provide links and short text…
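One way to keep category/home pages out of Solr is a custom indexing filter plugin that returns null for documents failing an "article" test; a sketch against the Nutch 1.x IndexingFilter interface (the isArticle heuristic here is a hypothetical placeholder, not a recommended rule):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class ArticleOnlyFilter implements IndexingFilter {
        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                CrawlDatum datum, Inlinks inlinks) throws IndexingException {
            // Returning null drops the document from indexing entirely.
            return isArticle(parse) ? doc : null;
        }

        // Hypothetical heuristic: treat pages with a long extracted body as articles.
        private boolean isArticle(Parse parse) {
            String text = parse.getText();
            return text != null && text.length() > 1000;
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
    }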

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121