19

I'm currently writing a program that uses Elasticsearch as a back-end database/search index. I'd like to mimic the functionality of the /_search endpoint, which currently uses a match query:

{
    "query": {
        "match" : {
            "message" : "Neural Disruptor"
        }
    }
}

Running some sample queries against a massive World of Warcraft database yielded the following results:

   Search Term          Search Result      
------------------ ----------------------- 
 Neural Disruptor   Neural Needler         
 Lovly bracelet     Ruby Bracelet          
 Lovely bracelet    Lovely Charm Bracelet  

After looking through Elasticsearch's documentation, I found that the match query is fairly complex. What's the easiest way to simulate a match query with just Lucene in Java? (It appears to be doing some fuzzy matching, as well as looking for terms.)

Importing elasticsearch code for MatchQuery (I believe org.elasticsearch.index.search.MatchQuery) doesn't seem to be that easy. It's heavily embedded into Elasticsearch, and doesn't look like something that can be easily pulled out.

I don't need a foolproof "must match exactly what Elasticsearch matches" solution; I just need something close, or that can fuzzy match/find the best match.

Blue
  • The only way of doing this is to parse the input and create a `query_string` query, which uses Lucene's syntax. The documentation says so (the match query is a subset of query_string). It's not trivial though. I once had to do something like that and I used antlr to generate an AST, parsed it and created something else. – Alkis Kalogeris Feb 20 '18 at 20:49
  • It's not that easy, otherwise I would have. I had to read a book in order to implement what I mentioned above (in order to use antlr4). In your case you could use an analyzer to tokenize the input, check the operator specified (or use the default), and try to add the boolean operators needed. On the other hand, elasticsearch is open source; wouldn't it be possible to just locate and isolate that implementation from the source code? – Alkis Kalogeris Feb 21 '18 at 08:15
  • As of now, the answer currently added does give *some* direction, but for the full bounty, I'd like to see an actual QueryParser that can generate a query to get similar results to elasticsearch. – Blue Feb 28 '18 at 11:51

2 Answers

8

Whatever is sent to the q= parameter of the _search endpoint is used as is by the query_string query (not org.elasticsearch.index.search.MatchQuery) which understands the Lucene expression syntax.

The query parser syntax is defined in the Lucene project using JavaCC and the grammar can be found here if you wish to have a look. The end-product is a class called QueryParser (see below).

The class inside the ES source code that is responsible for parsing the query string is QueryStringQueryParser which delegates to Lucene's QueryParser class (generated by JavaCC).

So basically, if you build a query string equivalent to what gets passed to _search?q=..., then you can feed that string to QueryParser.parse("query-string-goes-here") and run the resulting Query using just Lucene.
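To make that concrete, here is a minimal sketch of parsing such a string with plain Lucene (the class name is invented; the `message` field comes from the question's match query, and the `~` fuzzy suffixes are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

public class QueryStringDemo {
    // Parse a query_string-style expression and show the resulting Lucene query
    static String parseToString(String input) throws Exception {
        QueryParser parser = new QueryParser("message", new StandardAnalyzer());
        return parser.parse(input).toString();
    }

    public static void main(String[] args) throws Exception {
        // Same syntax the _search?q=... endpoint accepts; ~ marks a fuzzy term
        System.out.println(parseToString("neural~ disruptor~"));
        // prints: message:neural~2 message:disruptor~2
    }
}
```

The resulting `Query` object can then be handed straight to an `IndexSearcher`.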

Val
  • So I need to rip apart the `QueryStringParser`, filling in the default context, and then run that to generate a query string? It appears that this is extremely complex, involving multiple embedded queries (MultiMatchQuery, which uses [even more Queries](https://github.com/elastic/elasticsearch/blob/190f1e1fb317a9f9e1e1d11e9df60c0aeb7e267c/server/src/main/java/org/elasticsearch/index/search/MultiMatchQuery.java#L61)). The main issue being all these take a `ShardContext`, which appears to be elasticsearch specific, and extremely complex. – Blue Feb 23 '18 at 14:08
  • I would start with Lucene's `QueryParser` first and not care too much about `QueryStringQueryParser` which is only a wrapper around Lucene's `QueryParser` and responsible for parsing parameters for the ES `query_string` query Lucene is not aware of anyway. – Val Feb 23 '18 at 14:15
  • Is there a way to debug the actual query being generated on ElasticSearch's API? The explain commands don't seem to be doing much, and while this answer helps, I feel like I'm still light-years away from a solution. – Blue Feb 25 '18 at 10:41
6

It's been a while since I've worked directly with Lucene, but what you want should, initially, be fairly straightforward. The base behavior of a Lucene query is very similar to the match query (query_string is exactly equivalent to Lucene syntax, but match is very close). I put together a small example that works with just Lucene (7.2.1) if you want to try it out. The main code is as follows:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public static void main(String[] args) throws Exception {
    // Create the in-memory lucene index
    RAMDirectory ramDir = new RAMDirectory();

    // Create the analyzer (has default stop words)
    Analyzer analyzer = new StandardAnalyzer();

    // Create a set of documents to work with
    createDocs(ramDir, analyzer);

    // Query the set of documents
    queryDocs(ramDir, analyzer);
}

private static void createDocs(RAMDirectory ramDir, Analyzer analyzer) 
        throws IOException {
    // Setup the configuration for the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    // IndexWriter creates and maintains the index
    IndexWriter writer = new IndexWriter(ramDir, config);

    // Create the documents
    indexDoc(writer, "document-1", "hello planet mercury");
    indexDoc(writer, "document-2", "hi PLANET venus");
    indexDoc(writer, "document-3", "howdy Planet Earth");
    indexDoc(writer, "document-4", "hey planet MARS");
    indexDoc(writer, "document-5", "ayee Planet jupiter");

    // Close down the writer
    writer.close();
}

private static void indexDoc(IndexWriter writer, String name, String content) 
        throws IOException {
    Document document = new Document();
    document.add(new TextField("name", name, Field.Store.YES));
    document.add(new TextField("body", content, Field.Store.YES));

    writer.addDocument(document);
}

private static void queryDocs(RAMDirectory ramDir, Analyzer analyzer) 
        throws IOException, ParseException {
    // IndexReader maintains access to the index
    IndexReader reader = DirectoryReader.open(ramDir);

    // IndexSearcher handles searching of an IndexReader
    IndexSearcher searcher = new IndexSearcher(reader);

    // Setup a query
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("hey earth");

    // Search the index
    TopDocs foundDocs = searcher.search(query, 10);
    System.out.println("Total Hits: " + foundDocs.totalHits);

    for (ScoreDoc scoreDoc : foundDocs.scoreDocs) {
        // Get the doc from the index by id
        Document document = searcher.doc(scoreDoc.doc);
        System.out.println("Name: " + document.get("name") 
                + " - Body: " + document.get("body") 
                + " - Score: " + scoreDoc.score);
    }

    // Close down the reader
    reader.close();
}

The important parts to extend this are going to be the analyzer and understanding Lucene query parser syntax.

The Analyzer is used by both indexing and querying, so that both sides break text down and think about it in the same way. It defines how to tokenize (what to split on, whether to lowercase, etc.). The StandardAnalyzer splits on whitespace and most punctuation (Unicode word-break rules), lowercases each token, and drops a default set of English stop words.
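A quick sketch to see what the StandardAnalyzer actually emits for a given string (the sample text is made up):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    // Run a string through the same pipeline used at index time
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                result.add(term.toString());
            }
            ts.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Splits on punctuation/whitespace and lowercases
        System.out.println(tokens(new StandardAnalyzer(), "Hey, Planet-Earth!"));
        // prints: [hey, planet, earth]
    }
}
```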

The QueryParser is going to do some of the work for you. If you look at my example above, I do two things: I tell the parser what the default field is, and I pass in the string hey earth. The parser turns this into a query that looks like body:hey body:earth, which looks for documents that have either hey or earth in the body. Two documents will be found.

If we were to pass hey AND earth, the query is parsed as +body:hey +body:earth, which requires docs to have both terms. Zero documents will be found.

To apply fuzzy matching you add a ~ to the terms you want to be fuzzy. So if the query is hey~ earth, fuzziness is applied to hey and the query looks like body:hey~2 body:earth. Three documents will be found.

You can also write queries more directly and the parser still handles things. If you pass it hey name:\"document-1\", it creates a query like body:hey name:"document 1" (it still tokenizes on the -). Two documents will be returned, since it looks for the phrase document 1. Whereas if I did hey name:document-1, it writes body:hey (name:document name:1), which returns all documents since they all have document as a term. There is some nuance to understand here.


I'll try to cover a bit more of how they are similar, referencing the match query documentation. Elastic says the main difference is: "It does not support field name prefixes, wildcard characters, or other "advanced" features." These would probably stand out more going the other direction.

Both the match query and the Lucene query, when working with an analyzed field, will take the query string and apply the analyzer to it (tokenize it, lowercase, etc.). So they will both turn HEY Earth into a query that looks for the terms hey or earth.

A match query can set the operator by providing "operator" : "and". This changes our query to look for hey and earth. The analogy in Lucene is to do something like parser.setDefaultOperator(QueryParser.Operator.AND);
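As a small sketch of that setting (using the same body field as the listing above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

public class OperatorDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());

        // Default operator is OR
        System.out.println(parser.parse("hey earth"));  // body:hey body:earth

        // Equivalent of the match query's "operator": "and"
        parser.setDefaultOperator(QueryParser.Operator.AND);
        System.out.println(parser.parse("hey earth"));  // +body:hey +body:earth
    }
}
```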

The next thing is fuzziness. Both are working with the same settings. I believe elastic's "fuzziness": "AUTO" is equivalent to Lucene's auto when applying ~ to a query (though I think you have to add it to each term yourself, which is a little cumbersome).
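Since the parser won't add the ~ for you, one rough way to mimic applying fuzziness everywhere is to suffix every term before parsing. The helper below is invented for illustration and only handles plain whitespace-separated terms (no phrases, field prefixes, or other operators):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class FuzzyAll {
    // Naive sketch: append ~ to every bare term so the parser builds fuzzy queries
    static String fuzzify(String queryString) {
        return Arrays.stream(queryString.trim().split("\\s+"))
                .map(term -> term.endsWith("~") ? term : term + "~")
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(fuzzify("hey earth")); // prints: hey~ earth~
    }
}
```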

Zero terms query appears to be an elastic construct. If you wanted the ALL setting, you would have to replicate the match_all query whenever the query parser removes all tokens from the query.
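A sketch of that fallback; note the empty-result type the classic QueryParser returns for an all-stop-word input has varied between Lucene versions, so this checks several cases rather than assuming one:

```java
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.MatchNoDocsQuery;
import org.apache.lucene.search.Query;

public class ZeroTermsAll {
    // Mimic "zero_terms_query": "all" - if analysis stripped every token,
    // match everything instead of nothing
    static Query parseOrMatchAll(QueryParser parser, String input) throws ParseException {
        Query parsed = parser.parse(input);
        boolean empty = parsed == null
                || parsed instanceof MatchNoDocsQuery
                || (parsed instanceof BooleanQuery
                        && ((BooleanQuery) parsed).clauses().isEmpty());
        return empty ? new MatchAllDocsQuery() : parsed;
    }
}
```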

Cutoff frequency looks to be related to the CommonTermsQuery. I've not used this, so you may have some digging to do if you want to use it.

Lucene has a synonym filter that can be applied to an analyzer, but you may need to build the synonym map yourself.
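A sketch of wiring that up with Lucene 7.x classes; the synonym pairs here are made up, and real code would usually load them from a file:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;

public class SynonymAnalyzer {
    static Analyzer build() throws Exception {
        // Hand-built map; true = also keep the original token
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("hi"), new CharsRef("hello"), true);
        builder.add(new CharsRef("earth"), new CharsRef("world"), true);
        SynonymMap map = builder.build();

        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(source);
                // Graph filter keeps multi-word synonyms positionally correct
                stream = new SynonymGraphFilter(stream, map, true);
                return new TokenStreamComponents(source, stream);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // Print the tokens for "hi earth"; synonyms appear alongside originals
        try (TokenStream ts = build().tokenStream("body", "hi earth")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}
```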


The differences you find will probably be in scoring. When I run the query hey earth against Lucene, document-3 and document-4 are both returned with a score of 1.3862944. When I run the query in the form of:

curl -XPOST http://localhost:9200/index/_search?pretty -d '{
  "query" : {
    "match" : {
      "body" : "hey earth"
    }
  }
}'

I get the same documents, but with a score of 1.219939. You can run an explain on both of them: in Lucene by printing each document with

System.out.println(searcher.explain(query, scoreDoc.doc));

And in elastic by querying each document like

curl -XPOST http://localhost:9200/index/docs/3/_explain?pretty -d '{
  "query" : {
    "match" : {
      "body" : "hey earth"
    }
  }
}'

I see some differences, but I cannot exactly explain them. The doc does get a value of 1.3862944, but the fieldLength is different, and that affects the weight.

phospodka