Problem with proximity search in Lucene: field "content" was indexed without position data

Question

As the title says, when I try to run a search query I get this error:

    Exception in thread "main" java.lang.IllegalStateException: field "content" was indexed without position data; cannot run PhraseQuery (phrase=content:"to be not"~1)
        at org.apache.lucene.search.PhraseQuery$1.getPhraseMatcher(PhraseQuery.java:497)
        at org.apache.lucene.search.PhraseWeight.scorer(PhraseWeight.java:64)
        at org.apache.lucene.search.Weight.bulkScorer(Weight.java:166)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:731)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:655)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:649)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:487)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:501)
        at ProximitySearch.main(ProximitySearch.java:81)

Here is my code:

    public static void main(String[] args) throws IOException, ParseException {

        Analyzer analyzer = new StandardAnalyzer();

        List<KeyValuePairs> listOfDocs = new LinkedList<>();

        KeyValuePairs file1 = new KeyValuePairs("file1", "to be or not to be that is the question");
        KeyValuePairs file2 = new KeyValuePairs("file2", "make a long story short");
        KeyValuePairs file3 = new KeyValuePairs("file3", "see eye to eye");

        listOfDocs.add(file1);
        listOfDocs.add(file2);
        listOfDocs.add(file3);

        Path indexPath = Files.createTempDirectory("tempIndex");
        Directory directory = FSDirectory.open(indexPath);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        for (KeyValuePairs listOfDoc : listOfDocs) {
            Document doc = new Document();
            String text = listOfDoc.getKey();
            System.out.println(text);
            String title = listOfDoc.getValue();
            doc.add(new StringField("content", text, Field.Store.YES));
            doc.add(new Field("title", title, TextField.TYPE_STORED));
            iwriter.addDocument(doc);
        }
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);

        // Parse a simple query that searches for "something that you want to search":
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("\"to be not\"~1");

        ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
        System.out.println(Arrays.toString(Arrays.stream(hits).toArray()));
        System.out.println("Search terms found in :: " + hits.length + " files");

        ireader.close();
        directory.close();
        IOUtils.rm(indexPath);
    }

I don't know what I am doing wrong.

andrewJames · Accepted Answer · 2022-08-11T15:17:51.687


Short Answer

You cannot run proximity queries for data stored in a StringField. You have to use a TextField.

You did not show us the definition for KeyValuePairs, so I have made some assumptions below about that.

(Small point: I would also suggest that you do not need to use LinkedList - you probably only need ArrayList.)

Longer Answer for More Background

Your problem is related to the field types you are using.

You have a document containing 2 fields:

content - which uses a StringField
title - which uses a TextField.

An example of data in the content field is to be or not to be that is the question.

You are attempting to run a proximity query against the content field.

Remember from this question that StringField data "is indexed but not tokenized: the entire String value is indexed as a single token."

A single token means the token's position is always effectively the only position, and therefore position data is not captured in the index (it would be basically meaningless).

That is why your query throws that error. That query requires the data to be split up into separate tokens - and each token's position needs to be captured in the index.

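Therefore you need to use a TextField for that type of data. In your indexing loop, that means replacing new StringField("content", text, Field.Store.YES) with new TextField("content", text, Field.Store.YES).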

When you use a TextField for to be or not to be that is the question, then the StandardAnalyzer causes the following data to be captured in the index:

field content
  term be
    doc 0
      freq 2
      pos 1
      pos 5
  term is
    doc 0
      freq 1
      pos 7
  term not
    doc 0
      freq 1
      pos 3
  term or
    doc 0
      freq 1
      pos 2
  term question
    doc 0
      freq 1
      pos 9
  term that
    doc 0
      freq 1
      pos 6
  term the
    doc 0
      freq 1
      pos 8
  term to
    doc 0
      freq 2
      pos 0
      pos 4

You can see that the index now contains the required position data. The proximity query requires this position data to evaluate whether the words in your query are close enough to each other to match your query.
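To make the role of position data concrete, here is a minimal sketch in plain Java (no Lucene dependency) of how a sloppy phrase check can be computed from term positions. The class SlopSketch and its methods are hypothetical names invented for this illustration, the whitespace tokenizer is only a stand-in for StandardAnalyzer, and the spread calculation is a simplified model that ignores Lucene's special handling of terms repeated within a phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a simplified model of sloppy phrase matching.
class SlopSketch {

    // Lowercase and split on whitespace; a crude stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Map each term to the list of positions at which it occurs.
    static Map<String, List<Integer>> positions(List<String> tokens) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            index.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    // Smallest slop for which the phrase would match, or -1 if any term is missing.
    static int minSlop(String phrase, String document) {
        List<String> terms = tokenize(phrase);
        Map<String, List<Integer>> index = positions(tokenize(document));
        return search(terms, index, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Try every combination of occurrences. For one combination the required slop is
    // max(pos - offset) - min(pos - offset), where offset is the term's place in the phrase.
    private static int search(List<String> terms, Map<String, List<Integer>> index,
                              int i, int min, int max) {
        if (i == terms.size()) {
            return max - min;
        }
        List<Integer> occurrences = index.get(terms.get(i));
        if (occurrences == null) {
            return -1; // term not in the document: no slop value can match
        }
        int best = -1;
        for (int pos : occurrences) {
            int shifted = pos - i;
            int result = search(terms, index, i + 1,
                    Math.min(min, shifted), Math.max(max, shifted));
            if (result >= 0 && (best < 0 || result < best)) {
                best = result;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String doc = "to be or not to be that is the question";
        System.out.println(minSlop("to be not", doc));                       // 1
        System.out.println(minSlop("long story short", "long short story")); // 2
        System.out.println(minSlop("see eye", doc));                         // -1
    }
}
```

Under this model the query "to be not"~1 matches because not sits one position later in the document than the phrase expects. None of this can be computed when the whole value is indexed as a single token.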

And just for completeness, here is what you get in the index if you use StringField instead of TextField:

doc 0
  field 0
    name content
    type string
    value to be or not to be that is the question

As you can see - only one token - and no position data.
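The contrast between the two field types can also be modelled in a few lines of plain Java. This is only a toy sketch under stated assumptions: FieldModelSketch, analyzed and singleToken are hypothetical names, the whitespace split is a stand-in for StandardAnalyzer, and an empty position list is used here to represent "no position data". It does not reflect how Lucene stores its index internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model, not Lucene code: what each field type makes searchable.
class FieldModelSketch {

    // TextField-like behaviour: split the value into tokens and record every position.
    static Map<String, List<Integer>> analyzed(String value) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        String[] tokens = value.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // StringField-like behaviour: the whole value is one term and no positions are kept.
    static Map<String, List<Integer>> singleToken(String value) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put(value, new ArrayList<>());
        return postings;
    }

    public static void main(String[] args) {
        String text = "to be or not to be that is the question";
        // {be=[1, 5], is=[7], not=[3], or=[2], question=[9], that=[6], the=[8], to=[0, 4]}
        System.out.println(analyzed(text));
        // {to be or not to be that is the question=[]}
        System.out.println(singleToken(text));
    }
}
```

In the single-token model there is no entry for be or not at all, so a phrase or proximity query has nothing to compare.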

edited Aug 11 '22 at 15:17

answered Aug 11 '22 at 15:12

andrewJames


Thank you a lot, that solved my problem. I have a question though. How can I change the query to support prefix search? For example, the string "long story short" should be matched by "long story sho". Should I still check positions? I tried with TextField but it didn't work. My query was "\"long story sho*\"~0" – ReallyNicePerson Aug 16 '22 at 10:05
Also, how should I modify the query used in the exercise to match files by input strings within a given edit distance and with the same order of words (i.e. word permutations are not allowed)? – ReallyNicePerson Aug 16 '22 at 10:09

Your first comment: If you have a document containing `long short story` and you want to find a match using `"long story short"` then that requires `~2` because it takes 2 moves to transform your query word order into the document word order. Step 1: move `short` from pos n to pos n-1 (`short` is now in the same position as `story`); and then step 2: move `story` from pos n-1 to pos n (where `short` used to be). You have swapped the positions of `short` and `story` in two moves. I don't think you can combine wildcard searches and proximity searches - so you cannot do `"long story sho*"~2`. – andrewJames Aug 16 '22 at 13:11
Your second comment: I did not quite understand. Maybe you should ask a new question where you can explain with more details. Maybe I already answered that with my "word moves" explanation. – andrewJames Aug 16 '22 at 13:12
But I don't think I want to mix proximity search with wildcards. My document contains "long story short", not "long short story", and I am searching for "long story sho*". So I guess that if I am searching without changing the positions of words, I don't have to check for that with ~. If I'm right, I should just do a wildcard search and store my text in a document as a string file. Am I correct? – ReallyNicePerson Aug 16 '22 at 14:07
Could you also take a look at my question about token filter? I would really appreciate that. – ReallyNicePerson Aug 16 '22 at 14:14
I am not sure what you mean by "_store my text in a document as a string file_". You store text in a document as text. That has nothing to do with Lucene. Are you referring to how Lucene _indexes_ that data? By using a `TextField` vs. a `StringField`? Either way, the original document is just the original document. It's just a file containing some text. – andrewJames Aug 16 '22 at 14:30
For searching, if you don't put your search term inside double-quotes, then you can use wildcards - and Lucene searches for each separate token (using the Standard Analyzer). And you know that searching for _separate tokens_ requires an index which _stores_ those separate tokens: therefore you need to use a `TextField` for that, during indexing. – andrewJames Aug 16 '22 at 14:30
I am sorry, but I did not really understand your [token filter question](https://stackoverflow.com/q/73371758/12567365). What outputs are you expecting for different inputs? Can you show some examples? Also what have you tried? Can you show some code? Are you actually trying to do [something like this](https://stackoverflow.com/q/59723144/12567365)? – andrewJames Aug 16 '22 at 15:22
