If I have a field x that can contain a value of y, z, etc., is there a way to query the index so that it returns only the distinct values that have actually been indexed?

Example: the values that can be set for x are test1, test2, test3, test4

Item 1 : Field x = test1

Item 2 : Field x = test2

Item 3 : Field x = test4

Item 4 : Field x = test1

Performing the required query would return the list: test1, test2, test4

mickyjtwin

5 Answers

I've implemented this before as an extension method:

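using System.Collections.Generic;
using Lucene.Net.Index;

public static class ReaderExtentions
{
    public static IEnumerable<string> UniqueTermsFromField(
                                          this IndexReader reader, string field)
    {
        // Positions the enumerator on the first term of the given field (if any).
        var termEnum = reader.Terms(new Term(field));

        try
        {
            do
            {
                var currentTerm = termEnum.Term();

                // Stop when the index has no more terms or we have moved past this field.
                if (currentTerm == null || currentTerm.Field() != field)
                    yield break;

                yield return currentTerm.Text();
            } while (termEnum.Next());
        }
        finally
        {
            // Release the enumerator once iteration ends or is abandoned.
            termEnum.Close();
        }
    }
}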

You can use it very easily like this:

var allPossibleTermsForField = reader.UniqueTermsFromField("FieldName");

That returns each distinct term that has been indexed for the field, which is what you want.

EDIT: I was skipping the first term above, due to some absent-mindedness. I've updated the code accordingly to work properly.

Christopher Currens
  • How is this solution different to the approach that uses the FieldCache in Lucene? `String[] fieldValues = FieldCache.DEFAULT.getStrings(indexReader, fieldname);` – basZero Oct 21 '14 at 07:36
  • @basZero using the TermEnum works in the general case, when there might be more than one value per field, and doesn't consume memory to store the values in a cache. – Hakanai Feb 27 '15 at 02:08
  • Is there a Java solution for this? – basZero Feb 27 '15 at 08:36
  • Note that in lucene++ this will fail on `currentTerm.field` if there are no terms returned by the reader. Not sure whether this is handled gracefully in lucene.net. – JoshS Nov 30 '18 at 18:51
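
// Position the enumerator on the first term of "fieldx".
TermEnum te = indexReader.Terms(new Term("fieldx"));
do
{
    Term t = te.Term();
    // Stop at the end of the index or once the terms belong to another field.
    if (t == null || t.Field() != "fieldx") break;
    Console.WriteLine(t.Text());
} while (te.Next());
te.Close();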
guest
  • indexReader.Terms locates the first term. You will lose the first term if you call Next in the while condition before accessing that term. – guest Sep 08 '11 at 06:59

You can use facets to return the first N values of a field, provided the field is indexed as a string or is analyzed with KeywordTokenizer and no filters. In both cases the field is not tokenized; its value is indexed as-is.

Set the following parameters on the query (these are Solr request parameters, so this applies when the index is served through Solr):

facet=true
facet.field=fieldname
facet.limit=N //the number of values you want to retrieve
basZero
Dorin

I once used Lucene 2.9.2, and there I used the FieldCache approach described in the book "Lucene in Action" (published by Manning):

String[] fieldValues = FieldCache.DEFAULT.getStrings(indexReader, fieldname);

The array fieldValues contains all values in the index for field fieldname (Example: ["NY", "NY", "NY", "SF"]), so it is up to you now how to process the array. Usually you create a HashMap<String,Integer> that sums up the occurrences of each possible value, in this case NY=3, SF=1.
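
The counting step described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: `FieldValueCounter` and `countValues` are hypothetical names, the array literal stands in for the result of `FieldCache.DEFAULT.getStrings(...)`, and the null check assumes that documents without a value for the field appear as null entries in the array.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: tallies the values of one field.
public class FieldValueCounter {

    // Returns each distinct value mapped to the number of times it occurs.
    // Null entries (assumed to mean "no value for this document") are skipped.
    public static Map<String, Integer> countValues(String[] fieldValues) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : fieldValues) {
            if (value == null) {
                continue;
            }
            // Insert 1 for a new value, otherwise add 1 to the existing count.
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by the FieldCache call above.
        String[] fieldValues = {"NY", "NY", "NY", "SF"};

        Map<String, Integer> counts = countValues(fieldValues);
        System.out.println(counts);          // occurrences per value
        System.out.println(counts.keySet()); // just the distinct values
    }
}
```

If only the distinct values are needed, as in the question, the key set of the map is enough and the counts can be ignored.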

Maybe this helps. It is quite slow and memory-consuming for very large indexes (e.g. 1,000,000 documents), but it works.

basZero

I think a WildcardQuery searching on field 'x' and value of '*' would do the trick.

goalie7960
  • Wildcard query is not allowed if you have '*' as your first character. – Dorin Sep 07 '11 at 09:15
  • Not true for Lucene.Net 2.9.2 at least. It is just slow since it has to visit every doc. http://stackoverflow.com/questions/3412585/wildcard-at-the-beginning-of-a-searchterm-lucene – goalie7960 Sep 07 '11 at 12:17