I started using RavenDB a couple of days ago and I'm already stuck on something that, I think, should be quite easy to do.

What I would like to do is search for a list of products whose Title property contains all the words typed by a user.

An example:

product/1 -> title: "my awesome product"
product/2 -> title: "super product asd"


If I search "prod per" I would expect only the second product to appear as the result.

In my head, I would do something like this

public IList<Product> GetBySearchTerms(string searchTerms, int pageIndex, int pageSize, out int totalItems)
{
  // pageIndex arrives 1-based; convert to 0-based and clamp at 0
  pageIndex--;
  if (pageIndex < 0)
    pageIndex = 0;
  IList<Product> result = new List<Product>();
  RavenQueryStatistics stats;
  var query = session.Query<Product>().Statistics(out stats);
  var termsList = searchTerms.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
  // require every term to appear somewhere in the title
  foreach (var term in termsList)
    query = query.Where(x => x.Title.Contains(term));

  // pageSize <= 0 means "no paging"
  if (pageSize > 0)
    result = query.Skip(pageIndex * pageSize).Take(pageSize).ToList();
  else
    result = query.ToList();
  totalItems = stats.TotalResults;
  return result;
}

After some digging I found out that the first problem is the Contains method. It is not implemented/supported because of how search behaves in RavenDB.

I should use the Search method instead, but I also read that wrapping a term in wildcards (*term*) should be avoided because of performance issues.

So I ended up creating an Index in RavenDB like this one

Name: ProductSearchByName
Map: from doc in docs.Products select new { Title = doc.Title }

And the code

public IList<Product> GetBySearchTerms(string searchTerms, int pageIndex, int pageSize, out int totalItems)
{
  pageIndex--;
  if (pageIndex < 0)
    pageIndex = 0;
  IList<Product> result = new List<Product>();
  RavenQueryStatistics stats;
  var query = session.Query<Product>("ProductSearchByName").Statistics(out stats);
  query = searchTerms
            .Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries)
            .Aggregate(query, (q, term) => q.Search(x => x.Title, "*" + term + "*", options: SearchOptions.And, escapeQueryOptions: EscapeQueryOptions.AllowAllWildcards));
  if (pageSize > 0)
    result = query.Skip(pageIndex * pageSize).Take(pageSize).ToList();
  else
    result = query.ToList();
  totalItems = stats.TotalResults;
  return result;
}

This search does what I need, but I'm concerned about all the warnings against using wildcards.

Is there a way to obtain the Contains result without using *term*?

What should be a correct approach / solution to this problem?

Matt Johnson-Pint
Iridio

1 Answer

RavenDB uses Lucene for its searching, which is optimized for search terms, not substrings. It uses analyzers to define what terms exist within a string.

Using any of the built-in analyzers, when you take a string like "hello world", the terms are "hello" and "world". Only two index entries are created. If you search with a wildcard at the end, such as he*, it can still scan the index sequentially and match the terms. But when you place a wildcard at the beginning, such as *old, then it has to scan the entire index in order to respond.
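
To make the difference concrete, here is a minimal sketch of the two lookups over a sorted list of terms. It only illustrates the idea; it is not how Lucene or RavenDB actually store or scan their indexes, and `TermLookupSketch`, `PrefixMatch` and `SuffixMatch` are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermLookupSketch
{
    // Trailing wildcard ("he*"): binary-search to the first candidate, then
    // read neighbouring terms until the prefix stops matching.
    // Assumes sortedTerms is sorted with ordinal comparison and has no duplicates.
    public static List<string> PrefixMatch(string[] sortedTerms, string prefix)
    {
        int i = Array.BinarySearch(sortedTerms, prefix, StringComparer.Ordinal);
        if (i < 0) i = ~i; // not found: ~i is the insertion point

        var hits = new List<string>();
        for (; i < sortedTerms.Length &&
               sortedTerms[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            hits.Add(sortedTerms[i]);
        }
        return hits;
    }

    // Leading wildcard ("*ld"): the sort order gives no starting point,
    // so every term has to be inspected.
    public static List<string> SuffixMatch(string[] sortedTerms, string suffix)
    {
        return sortedTerms
            .Where(t => t.EndsWith(suffix, StringComparison.Ordinal))
            .ToList();
    }
}
```

With the terms "hello", "help" and "world", `PrefixMatch(terms, "he")` jumps straight to "hello" and stops as soon as it reaches "world", while `SuffixMatch(terms, "ld")` has to look at all three terms to find "world".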

In the vast majority of use cases, full substring searching is overkill. But if you want to enable it without killing performance, the trick is to use an analyzer that creates terms from the substrings. This is implemented in the NGram Analyzer. So the same "hello world" analyzed with NGram would create an index with the terms:

d
e
el
ell
ello
h
he
hel
hell
hello
l
ld
ll
llo
lo
o
or
orl
orld
r
rl
rld
w
wo
wor
worl
world

Now when you search for a substring, every possible substring already exists as a term in the index, so it can be matched directly without a wildcard scan.

As you can imagine, using NGram leads to much larger indexes. It's a trade-off: more disk usage in exchange for faster query response times. It should only be used where absolutely necessary.
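
The term list above can be reproduced with a short sketch that emits every substring of every word. This is an illustration only, not the NGram analyzer's actual implementation: real n-gram analyzers usually bound the minimum and maximum gram length, and `NGramSketch` and `AllGrams` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class NGramSketch
{
    // Returns every distinct substring of every whitespace-separated word,
    // lower-cased and sorted ordinally.
    public static SortedSet<string> AllGrams(string text)
    {
        var grams = new SortedSet<string>(StringComparer.Ordinal);
        var words = text.ToLowerInvariant()
                        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            for (int start = 0; start < word.Length; start++)
            {
                for (int len = 1; start + len <= word.Length; len++)
                {
                    grams.Add(word.Substring(start, len));
                }
            }
        }
        return grams;
    }
}
```

For "hello world" this yields the 27 terms listed above. Applied to the titles in the question, the grams of "super product asd" include both "prod" and "per", while the grams of "my awesome product" include "prod" but not "per", so an exact-term lookup for each typed fragment is enough to return only the second product.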

In most cases, you are better off doing whole word searches, or "starts-with" searches - which don't require special analysis.

Matt Johnson-Pint
  • Thanks for the explanation. I guess I have to run some tests and see whether the potentially slow performance will be a real issue or not. – Iridio Jun 28 '13 at 07:03
  • You probably just need to reconsider whether you really need full substring matches. Many use cases can get by with full-word or starts-with matching. Also, if you're just looking for a substitute for `.Contains`, you can invert the relationship and use `.In`. – Matt Johnson-Pint Jun 28 '13 at 18:28
  • Thanks for the insights. I need to review some use cases with my customer. I agree with you that searching every word with "term*" should be enough. – Iridio Jun 29 '13 at 05:47