You could always increase the number of links that you want to crawl. If you're using the bin/crawl
command you could just increase the number of iterations, or modify the script and increase the sizeFetchlist
parameter (https://github.com/apache/nutch/blob/master/src/bin/crawl#L117). This parameter is simply passed
as the topN argument to the conventional bin/nutch
script.
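For example (a rough sketch; the exact bin/crawl flags vary between Nutch versions, so check the usage output of your copy of the script):

    # Run more crawl rounds: each round generates and fetches another fetchlist.
    bin/crawl -i -s urls/ crawl/ 5

    # Or drive the steps yourself and raise topN on the generate step.
    bin/nutch generate crawl/crawldb crawl/segments -topN 5000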
Keep in mind that these options are also available on the 2.x branch.
What kind of suggestions are you trying to provide? In an app I developed some time ago we used a combination of both approaches (we were using Solr instead of Elasticsearch, but the essence is the same): we indexed the user queries in a separate collection/index, and on it we configured an EdgeNGramFilterFactory
(Solr's equivalent to the edge_ngram filter
in ES). This provided some basic query suggestions based on what users had already searched. When no suggestions could be found with this approach, we tried to suggest single terms based on the crawled content itself, which required some JavaScript tweaking in the frontend.
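On the ES side, that first part would look roughly like this (a minimal sketch for recent Elasticsearch versions; the query_suggestions index and the query field are just illustrative names):

    curl -X PUT 'localhost:9200/query_suggestions' \
         -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram", "min_gram": 2, "max_gram": 15
            }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "autocomplete_filter"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "query": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "standard"
          }
        }
      }
    }'

Note the search_analyzer set back to standard: you want the edge n-grams generated only at index time, otherwise a short query would itself be expanded into n-grams and match far too many suggestions.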
I'm not sure that using the edge_ngram filter
on the whole textual content of a webpage would be that helpful: n-grams would be created for the entire content, and with such a great number of matches the suggestions wouldn't be very relevant. But I don't know your specific use case.