
I'm currently crawling 28 sites (some small, some large), and the crawls are generating about 25 MB of data. I'm indexing with Elasticsearch and using an edge_n_gram strategy for autocomplete. After some testing, it seems I need more data to create better multi-word (phrase) suggestions. I know I can simply crawl more sites, but is there a way to make Nutch crawl each site completely, or as much as possible, to create more data for better search suggestions via edge_n_grams?

OR

Is this even a lost cause? No matter how much data I have, is the best way to create better multi-word suggestions simply to log users' search queries?

user3125823
  • Also have a look at the ES [Completion Suggester](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html) as a better alternative to ngram, especially when it comes to phrase suggestions. – rustyx Mar 29 '18 at 11:49

2 Answers


You could always increase the number of links that you want to crawl. If you're using the bin/crawl command, you can simply increase the number of iterations, or modify the script and increase the sizeFetchlist parameter (https://github.com/apache/nutch/blob/master/src/bin/crawl#L117). This parameter is just used as the topN argument in the conventional bin/nutch script.
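For reference, a minimal sketch of both options (the seed directory urls/ and crawl directory crawlDir/ are illustrative, and the exact bin/crawl arguments vary between Nutch versions):

```sh
# More rounds = deeper crawls of each site; every round generates and
# fetches up to sizeFetchlist URLs (10 rounds here instead of the usual 1-2).
bin/crawl -i urls/ crawlDir/ 10

# Equivalently, drive the steps yourself and raise topN when generating
# the fetch list with the conventional bin/nutch script:
bin/nutch generate crawlDir/crawldb crawlDir/segments -topN 100000
```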

Keep in mind that these options are also available on the 2.x branch.

What kind of suggestions are you trying to accomplish? In an app I developed some time ago, we used a combination of both approaches (we were using Solr instead of Elasticsearch, but the essence is the same): we indexed the user queries in a separate collection/index, and on that index we configured an EdgeNGramFilterFactory (Solr's equivalent to ES's edge_n_grams). This provided some basic query suggestions based on what users had already searched. When no suggestions could be found with this approach, we tried to suggest single terms based on the crawled content, which required some JavaScript tweaking in the frontend.
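As a rough Elasticsearch counterpart of that Solr setup, here is a sketch (the index name query_log and field name query are made up for illustration, and the mapping syntax assumes a recent ES version; older versions still require a mapping type):

```sh
# A separate index for logged user queries, analyzed with edge_ngram
# (ES's analogue of Solr's EdgeNGramFilterFactory) at index time only.
curl -XPUT 'localhost:9200/query_log' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 20 }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "query": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}'
```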

I'm not sure that using edge_n_grams on the whole textual content of a webpage would be that helpful: n-grams would be created for the entire content, and the suggestions wouldn't be very relevant because of the huge number of matches. But I don't know your specific use case.

Jorge Luis
  • Thank you for your answer, it certainly gives me some things to try out. I know the edge_n_gram strategy is not the best, but it's just a start; eventually logging will be employed – user3125823 May 03 '16 at 20:01

If you are planning to run the crawl command and pass a topN parameter, you can follow http://big-analytics.blogspot.com.au/2016/05/building-apache-nutch-job-running.html, which shows where to add the crawl code in the latest Apache Nutch and how to rebuild the nutch.job file.
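For reference, the rebuild step itself is just the standard Ant build (a sketch, assuming a Nutch 1.x source checkout):

```sh
# Rebuild Nutch after modifying the crawl code; Ant regenerates the job file.
ant clean runtime
# The rebuilt job file ends up under runtime/deploy/, e.g. apache-nutch-1.x.job
```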