I'm building a small vertical search engine using Elasticsearch as the indexer and Nutch as the crawler. I was using the HTML title field to build search suggestions in ES with an edge n-gram strategy. My thinking was that the title should contain terms relevant to the subject of the page, and that it would keep the suggestion index smaller, whether the suggestions are single words or phrases. However, in testing so far it's not working out as I expected: there just aren't many suggestions appearing.
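
To illustrate what I mean by the edge n-gram strategy, here is a minimal Python sketch of the terms a single title would produce. This is not my actual Elasticsearch analyzer config; the function names `edge_ngrams` and `title_suggestion_terms` and the `min_gram`/`max_gram` values are made up for illustration only.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Return the leading-edge prefixes of text, from min_gram up to max_gram characters."""
    text = text.lower().strip()
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


def title_suggestion_terms(title, min_gram=2, max_gram=20):
    """Collect prefixes of each word (term suggestions) and of the whole title (phrase suggestions)."""
    terms = set()
    for word in title.split():
        terms.update(edge_ngrams(word, min_gram, max_gram))
    # Hypothetical choice: also treat the full title as one token so phrases can match.
    terms.update(edge_ngrams(title, min_gram, max_gram))
    return terms


if __name__ == "__main__":
    for term in sorted(title_suggestion_terms("Solar Panels")):
        print(term)
```

In this sketch every suggestion has to start at the beginning of a word or at the beginning of the title, so a handful of short titles can only ever yield a small set of terms and very few distinct phrases.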
At present I'm only testing with about 10 sites, but I will eventually reach about 500. I suspect that with such a small data set (10 sites, HTML title field only) there aren't enough terms or phrases available to make good suggestions, at least not phrase suggestions.
Would it be advisable to just crawl more sites to create more suggestions (terms and phrases) with the edge n-gram strategy on the title field, or should I use the content field instead (which is obviously much larger than the title field)?
I'm trying to fine-tune this to get more search suggestions, especially phrase suggestions, while being mindful of the index size so that performance doesn't suffer. Any ideas?