Questions tagged [nutch2]

29 questions
0
votes
0 answers

Errors using curl for nutch RESTapi calls

I am using curl to make REST API calls to my Nutch server running on an Ubuntu instance. When I make the following call with curl to create my seeds file on my server: curl -X POST http://**.185.***.**:8081/seed/create -d {"id": "ubuntu", "name":…
Ciaran
  • 451
  • 1
  • 4
  • 14
0
votes
1 answer

Apache Nutch skipping URLs & truncating

In my nutch-site.xml, I add the following to stop truncating; however, during the fetch process, I get the following error. I want it to stop truncating and provide the results I need, which I assumed a -1 value would achieve. I'm using version…
Karl Hill
  • 12,937
  • 5
  • 58
  • 95
0
votes
1 answer

Apache Nutch 2.3.1, increase reducer memory

I have set up a small Hadoop cluster with HBase for Nutch 2.3.1. The Hadoop version is 2.7.7 and HBase is 0.98. I have customized a Hadoop job and now I have to set the memory for the reducer task in the driver class. I have come to know, in simple…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Configuring RAM in Nutch

I am using Nutch 1.10 to crawl websites for my organization. I use a system with 16 GB of RAM for this crawl. As of now, my Nutch file uses only 3-4 GB of RAM while crawling the data, and it takes almost 10 hours to finish. Is there some way where I…
UMA MAHESWAR
  • 167
  • 3
  • 16
0
votes
1 answer

Find number of already exist documents in solr with solrindexing job in nutch

In the Nutch solrindex job, how can we calculate the number of documents that have been updated in Solr and the number of documents that have been indexed as new documents?
Naser Aslam
  • 29
  • 1
  • 4
0
votes
1 answer

Apache Nutch ranking algorithm for specific language content

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem to crawl Urdu-language content. For language detection, I have customized the fetcher to detect the language at that point. If a document does not have enough Urdu-language content (bytes), then I deliberately…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch section pages handling trick

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. The idea is to crawl and index mostly story pages. For that, I have prepared a seed of some domains. Now I am facing a logical problem in Nutch: it behaves similarly at all levels of a…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch title parsing issue for Language specific websites

I have configured Apache Nutch 2.3.1 with Hadoop 2.7.5 and HBase 0.98. I have to crawl some Urdu websites. I am using its default parsers, i.e., html and tika. Some documents have a title in Urdu and are fine, but some documents have a title in Urdu and their…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch 2.3.1 opic scoring filter not working

I have configured Nutch 2.3.1 with the complete Hadoop/HBase ecosystem on a small cluster. I am curious about the scoring algorithm used in Nutch. I have found and used the opic scoring filter in Nutch. To find its impact, I have checked the score at different steps…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

nutch time schedule to visit a page again

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have a few hundred domains that I want to fetch, and I have fetched many of them so far. I am curious about when Nutch will revisit an already fetched document and refetch it if it has been updated.…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch SolrIndexer error in SolrCloud mode

I have configured Apache Nutch 2.3.1 and crawled a few websites. I have to index these documents to Solr (6.6.3), which is running in Cloud mode. When I execute the solrindex command, I get the following exception: 2018-05-02 13:10:40,679 INFO [main]…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch 2.3.1 give more preference to seed domains at selection point

I have configured Apache Nutch 2.3.1 with the complete Hadoop/HBase ecosystem. I want my crawler to give more preference, in each iteration, to the domains that are given in the seed. According to my testing, it can go completely in either direction…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch 2.3.1 Fetcher giving Invalid uri exception

I have configured Apache Nutch 2.3.1 with the Hadoop ecosystem. I have to fetch some Perso-Arabic script websites. Nutch is throwing an exception for a few URLs at fetch time. The following is an example exception: java.lang.IllegalArgumentException: Invalid uri…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
0
votes
1 answer

Apache Nutch 2.3.1 fetch specific MIME type documents

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e., documents having textual content only. I have found regex-urlfilter.txt for excluding MIME types but could not find any option to specify the MIME type that I…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121