We are comparing search engines for our research archives. Having browsed the Xapian-Omega documentation, we decided to try it, since Omega appears to be an appropriate solution with several interesting search options.

We installed Xapian-Omega on a Linux server (Debian 7) and tested the setup successfully. However, we are unsure how to use, or even enable, wildcards or regular expressions with Xapian-Omega.

We read that with Xapian the wildcard option has to be enabled through the QueryParser flags. Could someone clarify this, i.e. explain it or point to a page with an example or two?

We did not find many examples for the Omega CGI. Although the CGI runs well, wildcards (such as * for any run of characters and ? for a single character) do not work as expected by default. They would be useful even where stemming, substring matching and so on are available.

For example, it would be useful to run simple standard wildcard searches with some precision, such as medic* for medicine, medical, medicament, or to use ? for single characters.

Can Omega recognise regular expressions, e.g. sep[ae]r[ae]te(\w+)?, or search for structured formats such as email addresses, credit card numbers, or certain types of formula in research papers?

In a note long ago on the dev mailing list, Olly Betts suggested grepping the index file, but this would defeat the RAD advantage of Omega.

Any examples of searches using Omega with wildcards or regular expressions would be most appreciated. Even a pointer to a page that covers this topic well, with examples showing how to build advanced searches using Xapian alone (PHP or Python perhaps), would be most welcome.

(For the moment we are not concerned about a possible substantial increase in the size of the index or in the time needed to index the archive.)

Cœur
Nos Nix

2 Answers

You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.

Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support, if potentially costly (depending on the regex). It would have to run the regular expression against unstemmed terms in the database, and then feed them into the search. Where it becomes difficult is if the regex expands to a lot of terms (eg just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
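To make the idea above concrete, here is a minimal pure-Python sketch of expanding a regex against a sorted term list. It is not Xapian code and uses no Xapian API: the plain list `terms` stands in for the database's unstemmed term list, and `expand_regex` and its `literal_prefix` parameter are hypothetical names introduced for illustration. The caller supplies the constant prefix by hand; deriving it from the regex automatically is left out.

```python
import bisect
import re


def expand_regex(sorted_terms, pattern, literal_prefix=""):
    """Return the terms in sorted_terms that fully match pattern.

    literal_prefix is a constant string every match must start with. It is
    supplied by the caller (an assumption of this sketch) and lets us jump
    straight to the relevant slice of the sorted term list.
    """
    regex = re.compile(pattern)
    # Jump to the first term >= the prefix instead of scanning from the start.
    start = bisect.bisect_left(sorted_terms, literal_prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(literal_prefix):
            break  # sorted order: no later term can share the prefix
        if regex.fullmatch(term):
            matches.append(term)
    return matches


# Stand-in for a database's term list (hypothetical data).
terms = sorted([
    "medical", "medicine", "omega", "separate",
    "separated", "seperate", "september", "xapian",
])

expanded = expand_regex(terms, r"sep[ae]r[ae]te(\w+)?", literal_prefix="sep")
print(expanded)
```

With a prefix, only the terms starting with "sep" are tested. With an empty prefix (for instance a pattern such as `.*a.*`), every term has to be checked and the expansion can grow very large, which is the costly case described above. The expanded terms would then be combined into a single OR query.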

For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.

If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.

James Aylett

James Aylett: Thank you for your answer and help, and apologies for the belated reply; we were distracted by other work. We had already seen the Omegascript page, but it was not clear to us how to use these options with the CGI interface. Also, * seems to apply only to trailing characters, is that correct? That is, it cannot be used inside a word, e.g. omeg*ipt; there are cases where stemming would not be sufficient. We did not see an option for single-character wildcards, sometimes represented by ? in other search engines. Could you comment on this?

Regarding regular expressions, we had imagined that it might not be as simple as one could hope. The examples in the question were only simple possible uses; there are of course many more. Your comment on using stemming seems appropriate.

In certain cases it would be useful to enable some kind of regexp option for extracting text patterns such as those mentioned. Quickly extracting such text, perhaps together with some surrounding text, could be very useful. We will certainly follow your suggestion and raise this on the mailing list.

Thank you again.

Nos Nix