3

I have an application for which I need to allow the user to perform full text search on documents, and use the Lucene Query Parser syntax if desired. The eXist database is queried from a Django backend that uses eulexistdb to talk to eXist.

The problem is that when the user uses an incorrect syntax for the full text search, this is discovered late in the game. The Django application has to query a SQL database to determine some of the parameters of the search. By the time the complete XQuery is built and eXist is accessed, the SQL query has already run, which means that the cost of the SQL query has already been spent. (I know I could marshal the data queried on the SQL side into eXist so that only eXist is queried. It's just not an option for now.)

I'd like to know ahead of time whether the Lucene query has a syntax error, so that I can avoid querying the SQL database for nothing.

I've checked the eXist documentation, but I've not found a simple function in the API that checks whether a full-text query is syntactically valid.

Louis

2 Answers

1

Here is a simple function that will return True if a Lucene query is fine, or False if there is a syntax error in the query. db must be an instance of eulexistdb.db.ExistDB and query is the Lucene query:

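# ExistDBException is importable from eulexistdb.db.
# safe_interpolate is your own helper (see the note below the explanation).
from eulexistdb.db import ExistDBException

def check(db, query):
    try:
        # Run the query against a throwaway document; only parsing matters.
        db.query(safe_interpolate("ft:query(<doc/>, {lucene_query})",
                                   lucene_query=query))
    except ExistDBException as ex:
        if ex.message().startswith(
                "exerr:ERROR Syntax error in Lucene query string"):
            return False

        raise  # Don't swallow other problems that may occur.

    return True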

This should be adaptable to any language that has a library for accessing eXist. The idea is to run the query of interest against a bogus document (<doc/>), which avoids actually searching the database. (An empty node sequence might seem better, but then the XQuery optimizer could skip parsing and running the Lucene query altogether, since a valid query on an empty sequence necessarily returns an empty sequence, irrespective of the actual Lucene query.) It does not matter whether the query returns any results. If the query has no syntax error, no exception is raised. If it has one, an exception is raised. I've not found a more robust way to tell a Lucene syntax error from some other problem than checking the error message stored with the exception.
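The check only pays off if it runs before the SQL work. Here is a minimal sketch of that ordering. Every name in it (run_search, InvalidFullTextQuery, and the three callables) is hypothetical and stands in for whatever your Django code actually does; none of it is eulexistdb API.

```python
class InvalidFullTextQuery(ValueError):
    """Hypothetical error for a full-text query that failed the syntax check."""


def run_search(lucene_query, is_valid, load_sql_params, run_xquery):
    """Validate first, then do the expensive work.

    The three callables are placeholders (assumptions for this sketch):
      is_valid(lucene_query) -> bool, e.g. lambda q: check(db, q)
      load_sql_params() -> whatever the XQuery needs from the SQL database
      run_xquery(lucene_query, params) -> the search results from eXist
    """
    if not is_valid(lucene_query):
        # Bail out before any SQL cost is incurred.
        raise InvalidFullTextQuery(lucene_query)

    params = load_sql_params()
    return run_xquery(lucene_query, params)
```

In a view, you would catch InvalidFullTextQuery and report the problem to the user instead of running the search.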

(safe_interpolate is a function that should interpolate lucene_query into the XQuery in a way that prevents injection. What exactly it needs to do depends on your application.)
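For illustration only, here is one possible shape for such a helper. It is a sketch under assumptions, not the implementation I use: it assumes every parameter is a string that should end up as a double-quoted XQuery string literal, and that the template uses str.format-style placeholders. Inside an XQuery string literal a double quote is escaped by doubling it, and & has to be written as &amp; because it starts an entity reference. xquery_string_literal is a hypothetical helper name.

```python
def xquery_string_literal(value):
    """Return value as a double-quoted XQuery string literal.

    '&' is escaped first so that the '&amp;' we produce is not escaped again.
    """
    escaped = value.replace("&", "&amp;").replace('"', '""')
    return '"' + escaped + '"'


def safe_interpolate(template, **params):
    """Fill str.format-style placeholders with escaped XQuery string literals.

    Limitation of this sketch: the template itself must not contain literal
    curly braces, since str.format would treat them as placeholders.
    """
    literals = {name: xquery_string_literal(value)
                for name, value in params.items()}
    return template.format(**literals)
```

With this sketch, a query such as a "b" & c is embedded as "a ""b"" &amp; c", so it cannot break out of the string literal.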

Louis
  • I'm curious to know if there is something more direct I managed to miss. – Louis Jun 06 '16 at 19:20
  • The other option is to write your own small syntax checker or parser for the Lucene syntax in python, then you would never have to access eXist to know if the syntax was correct. – adamretter Jun 07 '16 at 09:01
  • I may yet end up settling on doing just that eventually. – Louis Jun 07 '16 at 09:08
0

Here is an approach I consider complementary to the one I posted earlier. I'm using lucene-query-parser to perform the check client-side (i.e. in the browser):

define(function (require, exports, _module) {
  "use strict";

  var lqp = require("lucene-query-parser");

  function preDrawCallback() {
    // We get the content of the search field.
    var search = this.api().search(); 
    var good = true;
    try {
      lqp.parse(search); // Here we check whether it is syntactically valid.
    }
    catch (ex) {
      if (!(ex instanceof lqp.SyntaxError)) {
        throw ex; // Don't swallow exceptions.
      }
      good = false;
    }

    // Some work is performed here depending on whether
    // the query is good or bad.

    return good;  // And finally we tell DataTables whether to inhibit the draw.
  }

  // ....
});

preDrawCallback is used with a DataTables instance. Returning false inhibits drawing the table, which also inhibits performing a query to the server. So if the query is syntactically incorrect, it won't ever make it to the backend. (The define and require calls are there because both my code and lucene-query-parser are AMD modules.)

Potential issues:

  1. If the library that performs the check is buggy or otherwise does not support the entire syntax that Lucene supports, it will block queries that should go through. I found a few buggy (or at best severely obsolete) libraries before I settled on lucene-query-parser.

  2. If the client-side library supports a construct that was introduced in a later version of Lucene and is not supported in the version used with eXist, it will let through queries that eXist will reject. Keeping the backend check I show in my other answer makes sure that anything that slips through is caught there.

Louis