Questions tagged [nutch2]
29 questions
0
votes
0 answers
Errors using curl for nutch RESTapi calls
I am using curl to make REST API calls to my Nutch server running on an Ubuntu instance. When I make the following call with curl to create my seeds file on my server:
curl -X POST http://**.185.***.**:8081/seed/create -d {"id": "ubuntu", "name":…

Ciaran
- 451
- 1
- 4
- 14
0
votes
1 answer
Apache Nutch skipping URLs & truncating
In my nutch-site.xml, I add the following to stop truncating; however, during the fetch process, I get the following error. I want it to stop truncating and provide the results I need, which I assumed a -1 value would achieve. I'm using version…

Karl Hill
- 12,937
- 5
- 58
- 95
0
votes
1 answer
Apache Nutch 2.3.1, increase reducer memory
I have set up a small Hadoop cluster with HBase for Nutch 2.3.1. The Hadoop version is 2.7.7 and HBase is 0.98. I have customized a Hadoop job and now have to set the memory for the reducer task in the driver class. I have come to know, in simple…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Configuring RAM in Nutch
I am using Nutch 1.10 to crawl websites for my organization. I use a system with 16 GB of RAM for this crawl. As of now, Nutch uses only 3-4 GB of RAM while crawling the data, and the crawl takes almost 10 hours to finish. Is there some way where I…

UMA MAHESWAR
- 167
- 3
- 16
0
votes
1 answer
Find number of already exist documents in solr with solrindexing job in nutch
In Nutch's solrindex job, how can we calculate the number of documents that have been updated in Solr and the number of documents that have been indexed as new documents?

Naser Aslam
- 29
- 1
- 4
0
votes
1 answer
Apache Nutch ranking algorithm for specific language content
I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem to crawl Urdu-language content. For language detection, I have customized the fetcher to detect the language at that point. If a document does not have enough Urdu content (in bytes), then I deliberately…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch section pages handling trick
I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. The idea is to crawl and index mostly story pages. For that, I have prepared a seed list of some domains. Now I am facing a logical problem in Nutch: it behaves similarly at all levels of a…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch title parsing issue for Language specific websites
I have configured Apache Nutch 2.3.1 with Hadoop 2.7.5 and HBase 0.98. I have to crawl some Urdu websites. I am using its default parsers, i.e., html and tika. Some documents have titles in Urdu that are fine, but some documents have titles in Urdu and their…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch 2.3.1 opic scoring filter not working
I have configured Nutch 2.3.1 with the complete Hadoop/HBase ecosystem on a small cluster. I am curious about the scoring algorithm used in Nutch. I have found and used the OPIC scoring filter in Nutch. To find its impact, I have checked the score at different steps…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
nutch time schedule to visit a page again
I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have a few hundred domains that I want to fetch, and I have fetched many of them so far. I am curious when Nutch will visit an already fetched document again and refetch it if it has been updated.…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch SolrIndexer error in SolrCloud mode
I have configured Apache Nutch 2.3.1 and crawled a few websites. I have to index these documents to Solr (6.6.3), which is running in Cloud mode. When I execute the solrindex command, I get the following exception:
2018-05-02 13:10:40,679 INFO [main]…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch 2.3.1 give more preference to seed domains at selection point
I have configured Apache Nutch 2.3.1 with the complete Hadoop/HBase ecosystem. I want my crawler to give more preference, in each iteration, to the domains given in the seed. According to my testing, it can go completely in either direction…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch 2.3.1 Fetcher giving Invalid uri exception
I have configured Apache Nutch 2.3.1 with the Hadoop ecosystem. I have to fetch some Perso-Arabic script websites. Nutch throws an exception for a few URLs at fetch time. The following is an example exception:
java.lang.IllegalArgumentException: Invalid uri…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121
0
votes
1 answer
Apache Nutch 2.3.1 fetch specific MIME type documents
I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e., documents having textual content only. I have found regex-urlfilter.txt to exclude MIME types but could not find any option to specify the MIME types that I…

Hafiz Muhammad Shafiq
- 8,168
- 12
- 63
- 121