5

As usual, I turn to the massive brain power that is the Stack Overflow user base to help solve a Lucene.NET problem I am battling with. First off, I am a complete noob when it comes to Lucene and Lucene.NET, and by using the scattered tutorials and code snippets online, I have cobbled together the following solution for my scenario.

The Scenario

I have an index of the following structure:

---------------------------------------------------------
| id  |    date    | security |           text          |
---------------------------------------------------------
|  1  | 2011-01-01 | -1-12-4- | some analyzed text here |
---------------------------------------------------------
|  2  | 2011-01-01 |  -11-3-  | some analyzed text here |
---------------------------------------------------------
|  3  | 2011-01-01 |    -1-   | some analyzed text here |
---------------------------------------------------------

I need to be able to query the text field, but restrict the results to users that have specific role IDs.

What I came up with to accomplish this (after many, many trips to Google) is to use a "security field" and a Lucene filter to restrict the result set as outlined below:

class SecurityFilter : Lucene.Net.Search.Filter
{
    public override System.Collections.BitArray Bits(Lucene.Net.Index.IndexReader indexReader)
    {
        BitArray bitarray = new BitArray(indexReader.MaxDoc());

        for (int i = 0; i < bitarray.Length; i++)
        {
            if (indexReader.Document(i).Get("security").Contains("-1-"))
            {
                bitarray.Set(i, true);
            }
        }

        return bitarray;
    }
}

... and then ...

Lucene.Net.Search.Sort sort = new Lucene.Net.Search.Sort(new Lucene.Net.Search.SortField("date", true));
Lucene.Net.Analysis.Standard.StandardAnalyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Store.FSDirectory.Open(indexDirectory), true);
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
Lucene.Net.Search.Query query = parser.Parse("some search phrase");
SecurityFilter filter = new SecurityFilter();
Lucene.Net.Search.Hits hits = searcher.Search(query, filter, sort);

This works as expected and returns only the documents with ids 1 and 3. The problem is that on large indexes this process becomes very slow.

Finally, my question... Does anyone out there have any tips on how to speed it up, or have an alternate solution that would be more efficient than the one I have presented here?

nokturnal

2 Answers

6

If you index your security field as analyzed (such that it splits your security string as 1 12 4 ...)

you can create a filter like this

Filter filter = new QueryFilter(new TermQuery(new Term("security", "1")));

or

form a query like some text +security:1
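
Both options might look roughly like the sketch below. It assumes the Lucene.NET 2.9.x API used in the question and a "security" field indexed so that each role ID is its own token; the class name, method names and the limit of 25 hits are made up for illustration. Note that the second variant marks the user's query as required too, which is slightly stricter than the literal query string above, where the text terms stay optional.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical helper class, not part of Lucene.NET.
static class SecuritySearchSketch
{
    // Option 1: leave the user's query untouched and restrict it with a filter.
    // The filter is resolved from the index terms, not from stored documents,
    // and it does not influence scoring.
    public static TopDocs SearchWithFilter(IndexSearcher searcher, Query userQuery, string roleId)
    {
        Filter roleFilter = new QueryWrapperFilter(
            new TermQuery(new Term("security", roleId)));
        return searcher.Search(userQuery, roleFilter, 25);
    }

    // Option 2: add the role as a required clause of the query itself.
    // Here the role term is part of the query, so it takes part in scoring.
    public static TopDocs SearchWithRequiredClause(IndexSearcher searcher, Query userQuery, string roleId)
    {
        BooleanQuery combined = new BooleanQuery();
        combined.Add(userQuery, BooleanClause.Occur.MUST);
        combined.Add(new TermQuery(new Term("security", roleId)), BooleanClause.Occur.MUST);
        return searcher.Search(combined, 25);
    }
}
```

If the role should never affect ranking, the filter variant is probably the safer choice.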

Dimitar Dimitrov
L.B
  • Intriguing solution, I will mess with this tomorrow and let you know how it goes. – nokturnal Oct 03 '11 at 02:38
  • @LB: This solution is working well, but it is now impacting the solution you helped me with on this http://stackoverflow.com/questions/7662829/lucene-net-range-queries-highlighting. Is there a way to not have the "1" from +security:1 highlighted? – nokturnal Oct 05 '11 at 18:51
  • No, then you have to use the filter based solution – L.B Oct 05 '11 at 19:17
5

I changed my answer to include a simple example that explains what I meant in my previous answer.

I wrote this quickly and it doesn't follow best practices, but it should give you the idea.

Note that the security field will need to be tokenized so that each ID in it is a separate token, using the WhitespaceAnalyzer for example.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Search;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Analysis.Standard;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        public class RoleFilterCache
        {
            static public Dictionary<string, Filter> Cache = new Dictionary<string,Filter>();

            static public Filter Get(string role)
            {
                Filter cached = null;
                if (!Cache.TryGetValue(role, out cached))
                {
                    return null;
                }
                return cached;
            }

            static public void Put(string role, Filter filter)
            {
                if (role != null)
                {
                    Cache[role] = filter;
                }
            }
        }

        public class User
        {
            public string Username;
            public List<string> Roles;
        }

        public static Filter GetFilterForUser(User u)
        {
            BooleanFilter userFilter = new BooleanFilter();
            foreach (string rolename in u.Roles)
            {   
                // call GetFilterForRole and add to the BooleanFilter
                userFilter.Add(
                    new BooleanFilterClause(GetFilterForRole(rolename), BooleanClause.Occur.SHOULD)
                );
            }
            return userFilter;
        }

        public static Filter GetFilterForRole(string role)
        {
            Filter roleFilter = RoleFilterCache.Get(role);
            if (roleFilter == null)
            {
                roleFilter =
                    // the caching wrapper filter makes it cache the BitSet per segmentreader
                    new CachingWrapperFilter(
                        // builds the filter from the index and not from iterating
                        // stored doc content which is much faster
                        new QueryWrapperFilter(
                            new TermQuery(
                                new Term("security", role)
                            )
                        )
                );
                // put in cache
                RoleFilterCache.Put(role, roleFilter);
            }
            return roleFilter;
        }


        static void Main(string[] args)
        {
            IndexWriter iw = new IndexWriter(new FileInfo("C:\\example\\"), new StandardAnalyzer(), true);
            Document d = new Document();

            Field aField = new Field("content", "", Field.Store.YES, Field.Index.ANALYZED);
            Field securityField = new Field("security", "", Field.Store.NO, Field.Index.ANALYZED);

            d.Add(aField);
            d.Add(securityField);

            aField.SetValue("Only one can see.");
            securityField.SetValue("1");
            iw.AddDocument(d);
            aField.SetValue("One and two can see.");
            securityField.SetValue("1 2");
            iw.AddDocument(d);
            aField.SetValue("One and two can see.");
            securityField.SetValue("1 2");
            iw.AddDocument(d);
            aField.SetValue("Only two can see.");
            securityField.SetValue("2");
            iw.AddDocument(d);

            iw.Close();

            User userone = new User()
            {
                Username = "User one",
                Roles = new List<string>()
            };
            userone.Roles.Add("1");
            User usertwo = new User()
            {
                Username = "User two",
                Roles = new List<string>()
            };
            usertwo.Roles.Add("2");
            User userthree = new User()
            {
                Username = "User three",
                Roles = new List<string>()
            };
            userthree.Roles.Add("1");
            userthree.Roles.Add("2");

            PhraseQuery phraseQuery = new PhraseQuery();
            phraseQuery.Add(new Term("content", "can"));
            phraseQuery.Add(new Term("content", "see"));

            IndexSearcher searcher = new IndexSearcher("C:\\example\\", true);

            Filter securityFilter = GetFilterForUser(userone);
            TopDocs results = searcher.Search(phraseQuery, securityFilter,25);
            Console.WriteLine("User One Results:");
            foreach (var aResult in results.ScoreDocs)
            {
                Console.WriteLine(
                    searcher.Doc(aResult.doc).
                    Get("content")
                );
            }
            Console.WriteLine("\n\n");

            securityFilter = GetFilterForUser(usertwo);
            results = searcher.Search(phraseQuery, securityFilter, 25);
            Console.WriteLine("User two Results:");
            foreach (var aResult in results.ScoreDocs)
            {
                Console.WriteLine(
                    searcher.Doc(aResult.doc).
                    Get("content")
                );
            }
            Console.WriteLine("\n\n");

            securityFilter = GetFilterForUser(userthree);
            results = searcher.Search(phraseQuery, securityFilter, 25);
            Console.WriteLine("User three Results (should see everything):");
            foreach (var aResult in results.ScoreDocs)
            {
                Console.WriteLine(
                    searcher.Doc(aResult.doc).
                    Get("content")
                );
            }
            Console.WriteLine("\n\n");
            Console.ReadKey();
        }
    }
}
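
For the tokenization note above, one possible way to get from the question's "-1-12-4-" format to separate tokens is sketched below. It assumes Lucene.NET 2.9.x; the class and method names are invented for this example, and PerFieldAnalyzerWrapper is used here only so that the "security" field gets the WhitespaceAnalyzer while the other fields keep the StandardAnalyzer.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;

// Hypothetical helper class, not part of Lucene.NET.
static class SecurityFieldSketch
{
    // Turns the question's "-1-12-4-" format into "1 12 4",
    // so that a whitespace tokenizer emits one token per role ID.
    public static string ToTokenString(string delimitedRoles)
    {
        string[] ids = delimitedRoles.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", ids);
    }

    // StandardAnalyzer for every field except "security",
    // which is split on whitespace only.
    public static Analyzer CreateIndexAnalyzer()
    {
        PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
        analyzer.AddAnalyzer("security", new WhitespaceAnalyzer());
        return analyzer;
    }

    // The field does not need to be stored, because the filter
    // works on the indexed terms.
    public static Field CreateSecurityField(string delimitedRoles)
    {
        return new Field("security", ToTokenString(delimitedRoles),
                         Field.Store.NO, Field.Index.ANALYZED);
    }
}
```

The analyzer returned by CreateIndexAnalyzer would be passed to the IndexWriter in place of the plain StandardAnalyzer.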
Jf Beaulac
  • +1 on using cached filters. They work so well that they're the obvious choice here. I mildly disagree with using a query to do the security query: I think (cached) filters should be used for everything that you don't want scored and queries for term queries that need scoring. – Adrian Conlon Sep 30 '11 at 21:06
  • The (main) problem in nokturnal's code is not caching. He scans all the documents in the index to form a filter ("Contains" is another pitfall here). – L.B Oct 01 '11 at 08:50
  • I am mildly confused but I am sure that is due to my inexperience with Lucene. Let me see if I have this... Basically, I would create a cached filter for each role that is rebuilt every time the index is modified. Does the scenario above allow for multiple cached filters to be applied at one time (one for each role the user is valid for)? – nokturnal Oct 03 '11 at 02:43
  • (>> That is rebuilt every time the index is modified) What happens is filters are associated with individual segments, so only modified segments have their filters rebuilt. As long as you don't optimise your index, filters can exist as they are, for the unchanged segments in your index. – Adrian Conlon Oct 03 '11 at 08:48
  • You can use this contrib to apply multiple filters. https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g/src/contrib/Queries/BooleanFilter.cs – Jf Beaulac Oct 03 '11 at 14:23
  • Amazing and thanks for the complete example. For some reason I am battling to get my head around Lucene.NET. I will have some time tomorrow to fully review the provided solution and will advise how it goes for me. Thanks guys – nokturnal Oct 03 '11 at 15:34
  • Looks like I am heading down the filter path! First problem, does "BooleanFilter" exist in Lucene.NET 2.9.4.1? I seem to only be able to find BooleanQuery... – nokturnal Oct 05 '11 at 19:24
  • Why don't you use [new QueryFilter(new TermQuery(new Term("security", "1")))] or [new QueryWrapperFilter(new TermQuery(new Term("security", "1")))]? They are not related to your query type. – L.B Oct 05 '11 at 19:46
  • @L.B: The short reason why is that I don't know what I am doing :) I am fumbling my way through Lucene.NET as best I can – nokturnal Oct 05 '11 at 20:02
  • Just replace the SecurityFilter in your question with one of these filters. (Assuming you are analyzing the "security" field.) – L.B Oct 05 '11 at 20:16
  • @nokturnal see the link to Lucene's SVN in one of my previous comments for the BooleanFilter class in .NET – Jf Beaulac Oct 05 '11 at 20:32
  • @JfBeaulac: Finally did, and I think I have been making this WAY too hard on myself. I am finally starting to get this through my thick Canadian skull (too many snowboarding bails over the years). – nokturnal Oct 05 '11 at 20:59
  • @L.B: Really appreciate the patience with me on this one. Thanks for all the tips! – nokturnal Oct 05 '11 at 20:59