I'm trying to create a Lucene 4.10 index. I just want to save in the index the exact strings that I put into the document, without tokenization.

I'm using the StandardAnalyzer.

    Directory dir = FSDirectory.open(new File("myDire"));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(dir, iwc);
    Document doc = new Document();
    StringField field1 = new StringField("1", content1, Store.YES);
    StringField field2 = new StringField("2", content2, Store.YES);
    StringField field3 = new StringField("3", content3, Store.YES);
    doc.add(field1);
    doc.add(field2);
    doc.add(field3);
    writer.addDocument(doc, analyzer);
    writer.close();

If I print the index's content, I can see my data being stored, for example, my document has this "field 3":

    stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<3:"Fuel Tank Capacity"@en>

I'm trying to query the index in order to get it back:

    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("3", analyzer);
    String queryString = "\"\"Fuel Tank Capacity"\@en\"";
    Query query = parser.createPhraseQuery("3", QueryParser.escape(queryString));
    TopDocs docs = searcher.search(query, null, 20);

I want to search for the term "Fuel Tank Capacity"@en (quotation marks included), so I tried to escape the quotes, and I wrapped the term in another pair of quotes so that Lucene understands that I'm searching for the entire text.

If I print the query, I get: 3:"fuel tank capacity en" but I don't want the text to be split on the @ symbol.

I think that my first problem is the StandardAnalyzer, because it seems to tokenize, if I'm not mistaken. However, I cannot understand how to query the index in order to get exactly "Fuel Tank Capacity"@en (quotation marks included).

Thank you

LucaT

2 Answers

You could simplify matters, and just cut the QueryParser out of the equation entirely. Since you are using a StringField, the whole content of the field is a single term, so a simple TermQuery should work well:

    Query query = new TermQuery(new Term("3","\"Fuel Tank Capacity\"@en"));
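
To make the mismatch concrete, here is a minimal plain-Java sketch that uses no Lucene classes. The `crudeTokenize` helper is hypothetical and only a crude stand-in for an analyzer (it lowercases and splits on non-alphanumeric characters); it is not StandardAnalyzer's real algorithm. It illustrates why an analyzed query yields terms that cannot equal the single term a StringField keeps:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ExactTermSketch {
    // Hypothetical stand-in for an analyzer: lowercases the text and splits it
    // on anything that is not a letter or digit. NOT StandardAnalyzer's logic.
    static List<String> crudeTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A StringField keeps the whole value as one term, quotes and @ included.
        String indexedTerm = "\"Fuel Tank Capacity\"@en";

        List<String> analyzed = crudeTokenize(indexedTerm);
        System.out.println(analyzed); // prints [fuel, tank, capacity, en]

        // None of the analyzed tokens equals the single indexed term.
        System.out.println(analyzed.contains(indexedTerm)); // prints false
    }
}
```

A TermQuery skips analysis altogether, so the term it looks up is exactly the string passed to `new Term(...)`.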
femtoRgon

To escape a quote (or any other special character) in a Lucene query, prefix it with \, and remember that the backslash itself must be escaped inside a Java string literal.

The following works for me:

    Query q = new QueryParser(
            Version.LUCENE_4_10_0,
            "",
            new StandardAnalyzer(Version.LUCENE_4_10_0)
    ).parse("3:\"\\\"Fuel Tank Capacity\\\"@en\"");

How did I arrive at this?

  1. Took the original string: "Fuel Tank Capacity"@en
  2. Escaped each " with \, as Lucene's query syntax requires: \"Fuel Tank Capacity\"@en
  3. Wrapped the result in unescaped quotes so that Lucene treats it as a single phrase: "\"Fuel Tank Capacity\"@en"
  4. Escaped it for a Java string literal (each backslash is doubled, and each double quote is preceded by a backslash): \"\\\"Fuel Tank Capacity\\\"@en\"
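
The layering in the steps above can be checked with a small plain-Java sketch. The helpers `escapeQuotes` and `buildQueryString` are hypothetical names introduced here for illustration, not Lucene APIs, and they escape only double quotes. The sketch shows only how the string is built; it does not show how the parser or analyzer treats it afterwards:

```java
class EscapeLayers {
    // Hypothetical helper (not a Lucene API): puts a backslash in front of
    // every double quote, which is step 2 of the list above.
    static String escapeQuotes(String s) {
        return s.replace("\"", "\\\"");
    }

    // Step 3: wrap the escaped value in plain quotes and prefix the field name.
    static String buildQueryString(String field, String value) {
        return field + ":\"" + escapeQuotes(value) + "\"";
    }

    public static void main(String[] args) {
        // Step 1: the original value, written as a Java literal.
        String original = "\"Fuel Tank Capacity\"@en";

        String built = buildQueryString("3", original);
        System.out.println(built); // prints 3:"\"Fuel Tank Capacity\"@en"

        // Step 4: the hand-written Java literal from the answer is the same string.
        String literal = "3:\"\\\"Fuel Tank Capacity\\\"@en\"";
        System.out.println(built.equals(literal)); // prints true
    }
}
```

Building the string programmatically and comparing it with the hand-escaped literal is a quick way to confirm that no backslash was lost or doubled.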
mindas
  • Thank you for your answer, but maybe I'm missing something. I tried querying my index using the string escaped as you said, but when I print the query with q.toString(), what I get is: 3:"fuel tank capacity en" and, once again, I did not get any document from my index... – LucaT Sep 12 '14 at 20:36