I'm building a small vertical search engine using Elasticsearch as the indexer and Nutch as the crawler. I have been using the HTML title field to build search suggestions in ES with an edge n-gram strategy. My thinking was that the title should contain terms relevant to the subject of the page, and that using only the title would keep the suggestion index small, whether the suggestions are single words or phrases. However, in testing so far it's not working out as I expected: there just aren't many suggestions appearing.

At present I'm only testing with about 10 sites, but I will eventually reach about 500. I suspect that with such a small data set (10 sites, HTML title field only) there simply aren't enough terms or phrases available to make good suggestions, at least not phrase suggestions.
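To make the small-data-set concern concrete, here is a minimal Python sketch of what an edge n-gram strategy does with a handful of titles. It is not Elasticsearch's implementation: the function names, the `min_gram`/`max_gram` defaults, and the sample titles are all made up for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_titles(titles, min_gram=2, max_gram=10):
    """Map each prefix to the set of whole words it can complete to."""
    index = {}
    for title in titles:
        for word in title.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index.setdefault(gram, set()).add(word)
    return index


# Hypothetical titles standing in for a tiny crawl.
titles = [
    "Elasticsearch Reference Guide",
    "Apache Nutch Tutorial",
    "Elastic Stack Overview",
]
index = index_titles(titles)

# A prefix only yields suggestions if some indexed title word starts with it.
print(sorted(index.get("el", set())))
print(sorted(index.get("py", set())))
```

With only three titles, any prefix that does not start one of nine words returns nothing, which is roughly the behaviour described above.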

Would it be advisable to simply crawl more sites, so that the edge n-gram strategy on the title field has more terms and phrases to work with, or should I use the content field instead (which is obviously much larger than the title field)?

I'm trying to fine-tune this to get more search suggestions, especially phrase suggestions, while keeping an eye on index size so that performance doesn't suffer. Any ideas?

user3125823

1 Answer

These days one could argue that suggestions are even more important than the search results themselves, which is slightly nonsensical, I know. But users tend to assume that if there is no suggestion, there is no search result. So make sure every searchable field is properly reflected in your suggestions, in particular your content field. And optimize later: don't worry about performance too early. 500 sites does not sound like a lot of documents to index anyway. What kind of hardware are you using?
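As a rough sketch of what "reflect the content field too" could look like, the Python snippet below builds an index-creation body that applies an edge n-gram analyzer to both fields. Everything named here is an assumption: the analyzer and filter names, the gram sizes, and the `title`/`content` field names are hypothetical. Field type names also differ between Elasticsearch versions (older releases use `string` where newer ones use `text`), so check this against the version you run.

```python
import json

# Hypothetical names and sizes; tune min_gram/max_gram to your data.
autocomplete_field = {
    "type": "text",                       # "string" on older Elasticsearch versions
    "analyzer": "autocomplete",           # edge n-grams at index time
    "search_analyzer": "standard",        # plain tokens at query time
}

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": dict(autocomplete_field),
            "content": dict(autocomplete_field),
        }
    },
}

# Print the JSON you would send when creating the index.
print(json.dumps(index_body, indent=2))
```

Using a separate `search_analyzer` keeps the query text from being split into n-grams itself, so a typed prefix is matched against the indexed prefixes as a whole token.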

Harald
  • For development, just my local Ubuntu machine, but once development is done I plan to use AWS. – user3125823 Apr 29 '16 at 19:48
  • I agree with you that suggestions are probably a bit more important than the results, at least initially. What you say makes sense: better to have suggestions first and worry about performance later. – user3125823 Apr 29 '16 at 19:49