
I am trying to build an application that implements a search system over a Lucene index. The index is built, I can search for documents in it, and everything seems to work, but when I search on a field value that appears in many documents, the search only returns some of them. I have tried the same search in Luke and it behaves the same way.

For example, my index has 2 fields:

Field A: A unique identifier. Field B: A String.

First Example:

We have 5 documents:

Doc 1: FieldA:1; FieldB:hello world

Doc 2: FieldA:2; FieldB:hello world!

Doc 3: FieldA:3; FieldB:hello world

Doc 4: FieldA:4; FieldB:anything

Doc 5: FieldA:5; FieldB:hello world

When I search for "B: hello world" it should return documents 1, 3 and 5, but it only returns 1 and 3.

When I search for "A: 5" it returns document 5, and its field B value is "hello world".

Second Example: (one token)

Doc 6: FieldA:6; FieldB:token

Doc 7: FieldA:7; FieldB:token

Doc 8: FieldA:8; FieldB:TOKEN

Doc 9: FieldA:9; FieldB:token

When I search for FieldB:"token" it only returns Doc 6 and Doc 9. The only way I can find Doc 7 is by searching on its FieldA.

I am using WhitespaceAnalyzer and both Fields are NOT_ANALYZED.

IndexGenerator Main:

...

IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(200);

List<Work> works = getWorks(); //Retrieves the information from the DB

for (Work work: works) {

   Document luceneDocument = createLuceneDocument(work);
   writer.addDocument(luceneDocument);

}
writer.commit();

...

CreateLuceneDocument Method:

private static Document createLuceneDocument(Work work) {

    try {
        Document luceneDoc = new Document();

        ...

        // NOT_ANALYZED: the whole value is indexed verbatim as a single term
        Field id = new Field("ID", work.getId(), Field.Store.YES, Field.Index.NOT_ANALYZED);
        luceneDoc.add(id);

        Field name = new Field("NAME", work.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED);
        luceneDoc.add(name);

        ...

        // Return the document built above (the original returned an undefined variable)
        return luceneDoc;

    }
    catch (LuceneException e) {
        ...
    }
}

I have noticed that the documents that are not returned have a low score value. Assuming the problem lies in how the index is created, since Luke behaves the same way as the application, what am I doing wrong?

Thanks in advance!

  • I don't see anything in your example which would give rise to such a problem. I also don't understand how you can have a score for a document which is not found with your query. Perhaps some more information would be useful, such as your search code, and some further information on data where this issue is actually occurring? – femtoRgon Nov 14 '13 at 20:11
  • Thanks @femtoRgon! The example is the easiest way to explain what is happening. The real index has over 12 fields and is way more complex than the example. As I said in the first post, Luke doesn't show the documents even though the fields fulfill the search request. So, the problem should be at the index generation process. I am going to add more information about the index generation. – user2993510 Nov 14 '13 at 20:58
  • Where do you see "that the Documents that are not returned have a low score value"? – groverboy Nov 15 '13 at 23:37
  • I have 2 fields with the same values: the first is not analyzed and is the one with the problem; the second is tokenized and analyzed and works fine. When I search for "hello world" in the first field I get 2 results, while the same search in the second field returns 3 results (following the example). The same happens when I search for strings with only one token, so the problem is not related to the number of tokens. When I search by the second field I get the documents that are not returned when searching by the first field, and all of them have very low score values even though the field value is the same. – user2993510 Nov 16 '13 at 12:31

2 Answers


I'll just give you my suspicion here, I suppose. You say you are using WhitespaceAnalyzer, but since your fields are NOT_ANALYZED, that analyzer isn't doing anything to the indexed content. They are indexed precisely as they are, as a single token.

If you are indexing the value "hello there", searching with a TermQuery on "hello" won't find anything. Neither will it find anything if you have indexed "Hello", "hello!", or even "hello ". The match is sensitive to case, punctuation, whitespace, and so on, and it requires a match on the entire input. So I suspect that your un-found document has a problem along these lines.
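To make the point about exact matching concrete, here is a minimal plain-Java sketch. It does not use the Lucene API: `ExactTermDemo`, `buildIndex`, `search`, and `reveal` are hypothetical names, the map is only a stand-in for how a `NOT_ANALYZED` field keeps each whole value as one term, and the trailing space on document 5 is an assumption made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExactTermDemo {

    // Hypothetical stand-in for a NOT_ANALYZED field: each whole value is one
    // term, mapped to the 1-based ids of the documents that contain it.
    static Map<String, List<Integer>> buildIndex(String[] values) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            index.computeIfAbsent(values[i], k -> new ArrayList<>()).add(i + 1);
        }
        return index;
    }

    // Exact lookup: no trimming, no lowercasing, no punctuation stripping.
    static List<Integer> search(Map<String, List<Integer>> index, String term) {
        return index.getOrDefault(term, new ArrayList<>());
    }

    // Wraps a value in brackets and escapes anything outside printable ASCII,
    // so trailing spaces, tabs, or non-breaking spaces become visible.
    static String reveal(String value) {
        StringBuilder sb = new StringBuilder("[");
        for (char c : value.toCharArray()) {
            if (c >= 32 && c <= 126) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Assumption for illustration only: document 5 carries a trailing space.
        String[] values = {"hello world", "hello world!", "hello world", "anything", "hello world "};
        Map<String, List<Integer>> index = buildIndex(values);

        System.out.println(search(index, "hello world")); // prints [1, 3]
        for (String value : values) {
            System.out.println(reveal(value));
        }
    }
}
```

Running something like `reveal` over the stored field values of a document that is unexpectedly missing (for instance, values read back from the database) is one cheap way to check whether an invisible difference is what keeps it out of the results.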

femtoRgon
  • That's right @femtoRgon , I want these fields to be indexed as a single token because I am using another `Field` to index the same value using different tokens. The problem is that I am searching "hello world" and only 2 of 3 documents are found (I have made some changes on the example to clarify this). Thanks again! – user2993510 Nov 15 '13 at 07:51
  • +1 @femtoRgon for helping me to answer a question that's less clear than it could be. – groverboy Nov 23 '13 at 00:25

Lucene will resolve the search expression B:hello world to B:hello D:world, an expression of two terms. Here D is the default search field, probably "another Field" mentioned in your comment on @femtoRgon's answer.

I'm guessing the results include documents 1 and 3 because they match on token "world" in field D, but this token is absent from document 5 field D. But this is possible only if the default search operator is OR not AND, because B:hello cannot match these documents.

You may get the results you expect by using a phrase expression: B:"hello world". But you may not; WhitespaceAnalyzer will break this phrase into two tokens when it builds a Query object.

You could get around the problem by using KeywordAnalyzer for field B, as described in my answer to another question.
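The mismatch between a single indexed term and a tokenized query can be sketched in plain Java. This is not the Lucene API: `TokenizeDemo`, `whitespaceTokens`, `keywordTokens`, and `matches` are hypothetical helpers that only imitate, roughly, what a whitespace-splitting analyzer and a non-tokenizing analyzer would hand to the query, and the matching rule is deliberately simplified to "some query token equals the indexed term".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TokenizeDemo {

    // Rough imitation of whitespace tokenization: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Rough imitation of keyword tokenization: the whole input is one token.
    static List<String> keywordTokens(String text) {
        return Collections.singletonList(text);
    }

    // Simplified matching rule: true when some query token equals the indexed term.
    static boolean matches(String indexedTerm, List<String> queryTokens) {
        return queryTokens.contains(indexedTerm);
    }

    public static void main(String[] args) {
        // One term, the way a NOT_ANALYZED field keeps the value.
        String indexed = "hello world";

        System.out.println(whitespaceTokens("hello world"));                   // prints [hello, world]
        System.out.println(matches(indexed, whitespaceTokens("hello world"))); // prints false
        System.out.println(matches(indexed, keywordTokens("hello world")));    // prints true
    }
}
```

Under these assumptions, neither `hello` nor `world` equals the stored term `hello world`, which is why the query side needs a non-tokenizing analyzer when the field is indexed as a single term.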

groverboy
  • I am facing this problem in fields with only 1 token and also in fields that have more than one token. I am trying to search with Luke using KeywordAnalyzer but I get the same results, maybe because the index is generated using WhitespaceAnalyzer, but does that affect 1-token fields? Thanks for your help @groverboy ! – user2993510 Nov 15 '13 at 11:04
  • @user2993510 Depends what you mean by "1 token fields". Just to be clear, if field B, value "hello world" is created with `NOT_ANALYZED` then the field has 1 token [hello world]. If field C, value "hello world" is created with `ANALYZED` then the field has 2 tokens [hello] [world]. – groverboy Nov 18 '13 at 13:38
  • @user2993510 please give an example search expression you've tried using Luke with `KeywordAnalyzer`? – groverboy Nov 18 '13 at 13:45
  • I have added another example to the explanation to clarify this. I was trying to explain that it doesn't matter whether I search using one or two tokens. If I search "hello world" I can't get all the results, and the same happens if I search using only one token, as in the second example with "token". For instance, I am searching in Luke for this: FieldB:token. – user2993510 Nov 18 '13 at 15:40
  • @user2993510 - you are confusing _word_ and _token_, they are different things. Anyway if you want field B `NOT_ANALYZED` and there may be multiple words in this field, you need to use `KeywordAnalyzer` (or other non-tokenizing analyzer) for _both_ indexing and searching. – groverboy Nov 19 '13 at 23:47
  • Thanks @groverboy the key was to generate the index using `KeywordAnalyzer` instead of `WhitespaceAnalyzer`. – user2993510 Nov 21 '13 at 15:47