I have a set of objects of type Idea

public class Idea
{
    public string Title { get; set; }
    public string Body { get; set; }
}

I want to search these objects by substring. For example, when I have an object with the title "idea", I want it to be found when I enter any substring of "idea": i, id, ide, idea, d, de, dea, e, ea, a.

I'm using RavenDB for storing data. The search query looks like this:

var ideas = session
              .Query<IdeaByBodyOrTitle.IdeaSearchResult, IdeaByBodyOrTitle>()
              .Where(x => x.Query.Contains(query))
              .As<Idea>()
              .ToList();

and the index is as follows:

public class IdeaByBodyOrTitle : AbstractIndexCreationTask<Idea, IdeaByBodyOrTitle.IdeaSearchResult>
{
    public class IdeaSearchResult
    {
        public string Query;
        public Idea Idea;
    }

    public IdeaByBodyOrTitle()
    {
        Map = ideas => from idea in ideas
                       select new
                           {
                               Query = new object[] { idea.Title.SplitSubstrings().Concat(idea.Body.SplitSubstrings()).Distinct().ToArray() },
                               idea
                           };
        Indexes.Add(x => x.Query, FieldIndexing.Analyzed);
    }
}

SplitSubstrings() is an extension method which returns all distinct substrings of the given string:

static class StringExtensions
{
    public static string[] SplitSubstrings(this string s)
    {
        s = s ?? string.Empty;
        List<string> substrings = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {                
            for (int j = 1; j <= s.Length - i; j++)
            {
                substrings.Add(s.Substring(i, j));
            }
        }            
        return substrings.Select(x => x.Trim()).Where(x => !string.IsNullOrEmpty(x)).Distinct().ToArray();
    }
}

This is not working, in particular because RavenDB does not recognize the SplitSubstrings() method, since it lives in my custom assembly. How can I make this work, that is, how can I force RavenDB to recognize this method? Besides that, is my approach appropriate for this kind of searching (searching by substring)?

EDIT

Basically, I want to build an auto-complete feature on top of this search, so it needs to be fast.

Btw: I'm using RavenDB - Build #960

jwaliszko
  • RavenDB indexes run on the server and so don't have access to custom code like that. The index you write gets turned into a string, sent to the server and compiled over there, the StringExtension code doesn't go with it, hence the error. – Matt Warren Mar 19 '12 at 10:58
  • I know that this is the server's responsibility, but is there any way to inject my custom code there? Maybe using reflection? – jwaliszko Mar 19 '12 at 14:13

4 Answers

10

You can perform a substring search across multiple fields using the following approaches:

( 1 )

public class IdeaByBodyOrTitle : AbstractIndexCreationTask<Idea>
{
    public IdeaByBodyOrTitle()
    {
        Map = ideas => from idea in ideas
                       select new
                           {
                               idea.Title,
                               idea.Body
                           };
    }
}

According to the RavenDB documentation:

"By default, RavenDB uses a custom analyzer called LowerCaseKeywordAnalyzer for all content. (...) The default values for each field are FieldStorage.No in Stores and FieldIndexing.Default in Indexes."

So by default, if you check the index terms inside the Raven client, they look like this:

Title                    Body
------------------       -----------------
"the idea title 1"       "the idea body 1"
"the idea title 2"       "the idea body 2" 

Based on that, a wildcard query can be constructed:

var wildquery = string.Format("*{0}*", QueryParser.Escape(query));

which is then used with the .In and .Where constructions (using the OR operator inside):

var ideas = session.Query<Idea, IdeaByBodyOrTitle>()
                   .Where(x => x.Title.In(wildquery) || x.Body.In(wildquery));

( 2 )

Alternatively, you can use a pure Lucene query:

var ideas = session.Advanced.LuceneQuery<Idea, IdeaByBodyOrTitle>()
                   .Where("(Title:" + wildquery + " OR Body:" + wildquery + ")");

( 3 )

You can also use the .Search expression, but you have to construct your index differently if you want to search across multiple fields:

public class IdeaByBodyOrTitle : AbstractIndexCreationTask<Idea, IdeaByBodyOrTitle.IdeaSearchResult>
{
    public class IdeaSearchResult
    {
        public string Query;
        public Idea Idea;
    }

    public IdeaByBodyOrTitle()
    {
        Map = ideas => from idea in ideas
                       select new
                           {
                               Query = new object[] { idea.Title, idea.Body },
                               idea
                           };
    }
}

var result = session.Query<IdeaByBodyOrTitle.IdeaSearchResult, IdeaByBodyOrTitle>()
                    .Search(x => x.Query, wildquery, 
                            escapeQueryOptions: EscapeQueryOptions.AllowAllWildcards,
                            options: SearchOptions.And)
                    .As<Idea>();

Summary:

Also keep in mind that a *term* query is rather expensive, especially because of the leading wildcard. In this post you can find more info about it. It says that a leading wildcard forces Lucene to do a full scan of the index and can thus drastically slow down query performance. Lucene internally stores its indexes (actually the terms of string fields) sorted alphabetically and "reads" from left to right. That is why a search with a trailing wildcard is fast and one with a leading wildcard is slow.

Alternatively, x.Title.StartsWith("something") can be used, but this obviously does not search across all substrings. If you need a fast search, you can change the Index option for the fields you want to search on to Analyzed, but again this will not search across all substrings.
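The difference between a trailing and a leading wildcard can be illustrated with a small in-memory sketch. This is only a toy model of a sorted term list, not Lucene's actual data structure, and the names SortedTermsSketch, PrefixSearch and ContainsSearch are made up for this illustration. It assumes the terms are sorted ordinally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of a sorted term dictionary (hypothetical, NOT Lucene's implementation).
public static class SortedTermsSketch
{
    // Prefix lookup ("idea*"): binary-search to the first term >= prefix,
    // then read forward only while terms still start with the prefix.
    public static List<string> PrefixSearch(string[] sortedTerms, string prefix)
    {
        int lo = 0, hi = sortedTerms.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedTerms[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        var matches = new List<string>();
        for (int i = lo;
             i < sortedTerms.Length && sortedTerms[i].StartsWith(prefix, StringComparison.Ordinal);
             i++)
        {
            matches.Add(sortedTerms[i]);
        }
        return matches;
    }

    // Substring lookup ("*dea*"): the sort order gives no starting point,
    // so every single term has to be inspected.
    public static List<string> ContainsSearch(string[] sortedTerms, string fragment)
    {
        return sortedTerms.Where(t => t.Contains(fragment)).ToList();
    }
}
```

With the terms sorted via Array.Sort(terms, StringComparer.Ordinal), PrefixSearch touches only the matching neighbourhood of the array, while ContainsSearch always walks all of it, which is the cost the leading wildcard incurs.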

If the substring query contains a space, please check this question for a possible solution. For making suggestions, see http://architects.dzone.com/articles/how-do-suggestions-ravendb.

Koen
jwaliszko
3

This appears to be a duplicate of "RavenDB fast substring search".

The answer there, which was not mentioned here, is to use a custom Lucene analyzer called NGram.
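To give a feel for what an n-gram approach indexes, here is a minimal sketch that produces the character n-grams of a string in plain C#. It is not the Lucene NGram analyzer itself, whose configuration and output order may differ; NGramSketch, NGrams, minGram and maxGram are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: emits the distinct character n-grams of a string.
// This is NOT the Lucene NGram analyzer, just the idea behind it.
public static class NGramSketch
{
    public static IEnumerable<string> NGrams(string text, int minGram, int maxGram)
    {
        if (minGram < 1 || maxGram < minGram)
            throw new ArgumentOutOfRangeException(nameof(minGram));

        // Lower-case so that lookups can be case-insensitive (an assumption of this sketch).
        text = (text ?? string.Empty).ToLowerInvariant();

        var seen = new HashSet<string>();
        for (int size = minGram; size <= maxGram; size++)
        {
            for (int start = 0; start + size <= text.Length; start++)
            {
                var gram = text.Substring(start, size);
                if (seen.Add(gram)) yield return gram; // skip duplicates
            }
        }
    }
}
```

For example, NGramSketch.NGrams("idea", 1, 4) yields the ten substrings listed in the question (i, d, e, a, id, de, ea, ide, dea, idea). Because the substrings are produced at indexing time, a query can then be matched as an exact term instead of with a leading wildcard.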

Community
Matt Johnson-Pint
  • Hi, good to know about NGram. Actually, this question was asked before the other one, but both address the same topic. – jwaliszko Oct 10 '12 at 16:31
2

In case anyone else comes across this: Raven 3 has a Search() extension method that allows for substring searching.

A couple of gotchas:

  • Pay special attention to the "Query escaping" section at the bottom
  • I didn't see it mentioned anywhere, but it only worked for me if Search() was added directly to Query() (i.e. without any Where(), OrderBy(), etc. between them)

Hope this saves someone some frustration.

Jay Querido
0

I managed to do this in memory with the following code:

public virtual ActionResult Search(string term)
{
    var clientNames = from customer in DocumentSession.Query<Customer>()
                        select new { label = customer.FullName };

    var results = from name in clientNames.ToArray()
                    where name.label.Contains(term,
                                             StringComparison.CurrentCultureIgnoreCase)
                    select name;

    return Json(results.ToArray(), JsonRequestBehavior.AllowGet);
}

This saved me the trouble of going the RavenDB way of searching for strings with the Contains method, as described in Daniel Lang's post.

The Contains extension method is this:

public static bool Contains(this string source, string toCheck, StringComparison comp)
{
     return source.IndexOf(toCheck, comp) >= 0;
}
Leniel Maccaferri
  • The problem with this is that you are pulling back ALL the Customer docs from RavenDB and then filtering them in memory (as you point out). This might work with a few docs, but when you have hundreds or even thousands, you're going to have to start paging through them and the perf won't be great. – Matt Warren Mar 19 '12 at 10:56
  • So whilst the method outlined in Daniel's post might be a bit of extra work, the perf is better because it's doing all the work on the server and then only sending back the matching docs. – Matt Warren Mar 19 '12 at 11:01
  • @MattWarren: No doubt, my friend. I considered this implication when writing the code, but I'm satisfied with the current perf. Maybe I'll change this in the future. I forgot to mention that I'm using this code to create an auto-complete functionality. By the way, Daniel's post doesn't really show how to mimic the Contains functionality, since it uses just StartsWith and EndsWith. – Leniel Maccaferri Mar 19 '12 at 15:45
  • Yeah it does, see the update in the last paragraph, you just use a leading wildcard, like this: *atthe*. However, there is a better way, see this thread https://groups.google.com/d/topic/ravendb/WHSLk5EQC_4/discussion – Matt Warren Mar 19 '12 at 23:11
  • @MattWarren: Just one word: AWESOME! I'm so grateful for being part of StackOverflow that good things just happen and come our way... – Leniel Maccaferri Mar 20 '12 at 01:50