Questions tagged [nutch2]
29 questions
3
votes
1 answer
Combiner function in Apache Hadoop with Gora
I have a simple Hadoop, Nutch 2.x, HBase cluster. I have to write an MR job that computes some statistics. It is a two-step job, i.e., I think I also need a combiner function. In plain Hadoop jobs this is not a big problem, as plenty of guides are available, e.g.,…
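For the combiner piece, a minimal sketch in plain MapReduce terms (StatsReducer is a hypothetical class; in a Gora-based job the map side would typically extend GoraMapper, not shown here). Because addition is commutative and associative, the same reducer class can safely double as the combiner:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer summing per-key counts. Registered twice on the Job:
    //   job.setCombinerClass(StatsReducer.class);  // runs map-side, pre-aggregates
    //   job.setReducerClass(StatsReducer.class);   // runs reduce-side, final sums
    public class StatsReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }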

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Updating max depth for the Apache Nutch crawler in the scoring-depth filter is not working
I have set up Apache Nutch 1.18 to crawl the web. For ranking, I am using the scoring-depth filter. By default, the max depth is set to 1000 (for each page crawled). Now I have to update this value (increase it, for example). I have updated the following…
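For reference, the scoring-depth plugin reads its limit from scoring.depth.max in nutch-site.xml; a sketch with an assumed larger value. Note that URLs already in the CrawlDB carry the previously recorded max depth in their metadata, so a raised limit may only take effect for newly discovered or re-injected URLs:

    <property>
      <name>scoring.depth.max</name>
      <value>2000</value>
    </property>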

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Apache Nutch not reading a new configuration file when run with job file
I have configured Apache Nutch 1.x for web crawling. There is a requirement to add some extra information to the Solr document for each domain that is indexed. The configuration is a JSON file. I have developed the following code for this and tested…
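A sketch of the usual fix, assuming the JSON file is packaged inside the job file (the resource name custom-domains.json is hypothetical): load it from the classpath via Hadoop's Configuration, since local file paths on the submitting machine do not exist on the task nodes:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import org.apache.hadoop.conf.Configuration;

    public class DomainConfigLoader {
        // Reads a resource bundled in the job jar (e.g. placed under conf/
        // before building the job file). "custom-domains.json" is hypothetical.
        public static String load(Configuration conf) throws IOException {
            Reader reader = conf.getConfResourceAsReader("custom-domains.json");
            if (reader == null) {
                throw new IOException("custom-domains.json not found on classpath");
            }
            StringBuilder sb = new StringBuilder();
            try (BufferedReader br = new BufferedReader(reader)) {
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
    }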

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
0 answers
Nutch - Visit a few pages again and again to find new links
I have set up Nutch 1.17 to crawl a few thousand domains, crawling inlinks only. One of my main requirements is that I have to visit the home pages again and again (let's say every 2 hours), and if there is any new page, only that page should be…
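A sketch of one common approach, assuming the AdaptiveFetchSchedule is acceptable: shorten the default re-fetch interval in nutch-site.xml. The caveat is that this interval applies to all pages, not just home pages; restricting frequent revisits to home pages alone would need per-URL metadata or a custom fetch schedule. Values below are illustrative:

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <property>
      <name>db.fetch.interval.default</name>
      <value>7200</value> <!-- seconds: re-fetch roughly every 2 hours -->
    </property>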

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Apache Nutch not crawling all websites in in-links
I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. The configuration is as follows:
    <property>
      <name>db.score.link.internal</name>
      <value>5.0</value>
    </property>
…
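One setting worth checking alongside the above is db.ignore.external.links, which controls whether outlinks to other hosts are followed at all; a sketch of the nutch-site.xml entry (the value shown is what cross-site crawling requires, not the asker's confirmed setting):

    <property>
      <name>db.ignore.external.links</name>
      <value>false</value> <!-- must be false for links to other sites to be crawled -->
    </property>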

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
Apache Nutch flushes Gora records after a limit
I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have not changed gora.buffer.read.limit and gora.buffer.write.limit, i.e., I am using their default values, which are 10000 in both cases. At the generate phase, I set topN to 100,000. During generate…
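For reference, both buffer limits can be pinned explicitly; a sketch of the nutch-site.xml entries, using the defaults the question mentions (depending on the setup they may instead live in gora.properties):

    <property>
      <name>gora.buffer.read.limit</name>
      <value>10000</value>
    </property>
    <property>
      <name>gora.buffer.write.limit</name>
      <value>10000</value>
    </property>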

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
1
vote
1 answer
How can I connect Apache Nutch 2.x to a remote HBase cluster?
I have two machines. One machine runs HBase 0.92.2 in pseudo-distributed mode, while the other runs the Nutch 2.x crawler. How can I configure these two machines so that the machine with HBase 0.92.2 acts as back-end storage and the other with…
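The usual sketch: put an hbase-site.xml on the Nutch machine's classpath (e.g. in Nutch's conf/) that points at the ZooKeeper quorum used by the HBase machine; the hostname below is a placeholder:

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>hbase-host.example.com</value> <!-- placeholder for the HBase machine -->
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>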

zahid adeel
- 123
- 4
0
votes
1 answer
Apache Nutch is crawling some domains more and others less with the default configuration
I have set up Apache Nutch 1.18 on a Hadoop cluster. I have given it a seed of around 10k URLs. After some time, I ran the domainstats command to get statistics for each domain. I found that Nutch is crawling some websites more…
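Out of the box the generator simply takes the highest-scoring URLs, so a few hosts can dominate each segment. If a more even spread across domains is the goal, the generator can cap URLs per domain; a sketch of the relevant nutch-site.xml entries (the cap of 100 is illustrative):

    <property>
      <name>generate.count.mode</name>
      <value>domain</value>
    </property>
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>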

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
I had some questions on db_redir_temp
I injected some URLs to crawl, ran one round, and found some URLs marked as db_redir_temp.
{"url":"http://www.universityhealth.org","pst":"temp_moved(13), lastModified=0:…

Ravi Kiran
- 65
- 6
0
votes
1 answer
Nutch http.redirect.max: may I know what it means?
I am crawling, for example, 1000 websites. When I run readdb, some websites show db_redirect_temp and db_redirect_moved. If I set http.redirect.max=10, is this value applied per website, or are only 10 redirects allowed for the entire crawl?
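For context: http.redirect.max caps how many redirect hops the fetcher follows while fetching one page, so it applies per URL fetch, not as a total budget for the whole crawl. In nutch-site.xml:

    <property>
      <name>http.redirect.max</name>
      <value>10</value> <!-- redirect hops followed per URL fetch, not per crawl -->
    </property>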

Ravi Kiran
- 65
- 6
0
votes
0 answers
org.apache.tika.utils.XMLReaderUtils acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
When running Nutch jobs, the following warning is shown:
Oct 13, 2020 8:46:18 AM org.apache.tika.utils.XMLReaderUtils
acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider
increasing the XMLReaderUtils.POOL_SIZE
May I know what it means? I am using…
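The warning means parser threads are waiting on Tika's shared pool of SAX parsers. If it shows up constantly, the pool can be enlarged; a sketch, assuming a recent Tika version where XMLReaderUtils.setPoolSize is available (20 is an arbitrary illustrative value):

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.utils.XMLReaderUtils;

    public class TikaPoolTuning {
        public static void main(String[] args) throws TikaException {
            // The default pool is small; raising it reduces contention when
            // many parsing threads run concurrently.
            XMLReaderUtils.setPoolSize(20);
        }
    }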

Ravi Kiran
- 65
- 6
0
votes
1 answer
nutch fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com
I am running parsechecker for the URL https://www.nicobuyscars.com. Output: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com
May I know what the issue is and how to solve it? I tried changing the…
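HTTP 403 means the server refused the request, which for crawlers is very often user-agent based blocking (and some sites block bots regardless of agent string). The first thing to verify is the agent configuration in nutch-site.xml; the value below is a placeholder:

    <property>
      <name>http.agent.name</name>
      <value>MyCrawler</value> <!-- placeholder; some sites return 403 for empty/default agents -->
    </property>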

Ravi Kiran
- 65
- 6
0
votes
1 answer
Nutch 1.17 web crawling with storage optimization
I am using Nutch 1.17 to crawl over a million websites. I have to do the following.
First, run the crawler once as a deep crawl so that it fetches the maximum number of URLs from the given (1 million) domains. For the first time, it can be run for a maximum of 48…
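As a reference point, the deep first pass is usually just the bundled crawl script run for many rounds; a sketch, assuming the Nutch 1.17 bin/crawl usage with the -i (index) and -s (seed dir) flags, with illustrative paths and round count:

    # inject seeds once, then run 20 generate/fetch/parse/updatedb rounds
    bin/crawl -i -s urls/ crawldir/ 20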

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
2 answers
Restrict Nutch to the seed path and the pages under it only
I have set up Nutch 2.x to crawl a few domains that are multilingual. I can restrict Nutch to inlinks only, but not to subfolders. For example, for the following seed,
https://www.bbc.com/urdu
I just want to crawl URLs under /urdu, as this website contains…
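A common sketch for this kind of path restriction is conf/regex-urlfilter.txt, where the first matching rule wins (the pattern below assumes the https://www.bbc.com/urdu seed above):

    # accept only URLs under the /urdu path
    +^https?://www\.bbc\.com/urdu
    # reject everything else
    -.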

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch: index only article pages to Solr
I have set up Nutch 1.17 to crawl a few websites. At a high level, there are two types of web pages. First, category or home pages, which do not contain the details of any specific story but provide links and short text…
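One way to keep category/home pages out of Solr is a custom indexing filter plugin that returns null for documents failing an "article" test; a sketch against the Nutch 1.x IndexingFilter interface (the isArticle heuristic here is a hypothetical placeholder, not a recommended rule):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class ArticleOnlyFilter implements IndexingFilter {
        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                CrawlDatum datum, Inlinks inlinks) throws IndexingException {
            // Returning null drops the document from indexing entirely.
            return isArticle(parse) ? doc : null;
        }

        // Hypothetical heuristic: treat pages with a long extracted body as articles.
        private boolean isArticle(Parse parse) {
            String text = parse.getText();
            return text != null && text.length() > 1000;
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
    }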

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121