For simplicity's sake, consider two documents with the following fields and values:

RecordId: "12345"
CreatedAt: "27/02/1992"
Event: "Manchester, Dubai, Paris"
Event: "Manchester, Rome, Madrid"
Event: "Madrid, Sidney"


RecordId: "99999"
CreatedAt: "27/02/1992"
Event: "Manchester, Barcelona, Rome"
Event: "Rome, Paris"
Event: "Milan, Barcelona"

Is it possible to search for multiple terms within a single instance of an "Event" field?

Let's say I want to search for "Manchester" and "Paris" appearing in the same instance of the field. The second record contains both "Manchester" and "Paris", but in different instances of the Event field, so it should not be part of the result set.

Ideally, the result set would contain only the first record (12345).

pelican_george
  • Hey, Pelican. Perhaps index each record (RecordID) once for each Event field, with a suffix added to the RecordID for each one. In your example you would then have six index entries: 12345-1, 12345-2, 12345-3, etc. You would end up with a much bigger index and you would need to filter out duplicate hits (if, for example, you also had a "Manchester, Detroit, Paris" Event), but I think it would work. – Michael Gorsich Mar 04 '16 at 13:07
  • I see your point, but that approach in the long run would eventually give me nightmares. Nevertheless, it would work. – pelican_george Mar 04 '16 at 13:28
  • Yeah, I didn't make it a formal answer because it seems kludgy, even though it would work. If you go with that approach, please let me know. – Michael Gorsich Mar 04 '16 at 13:32
  • @MichaelGorsich Just to follow up on your comment: how would you search those fields at runtime without knowing their names in advance (e.g. 12345-1, 12345-2, 12345-3, etc.)? – pelican_george Mar 07 '16 at 08:55
  • In your example, plus the one in my first comment, the results for "Manchester" and "Paris" will get you 12345-1 and 12345-4. You initially accumulate all results, then you lop the suffixes off (LastIndexOf()) and eliminate duplicates to reduce the results to 12345, so you end up with a single result, which you use to retrieve your document. – Michael Gorsich Mar 07 '16 at 11:15
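
The post-processing step described in these comments (strip the suffix, then drop duplicates) can be sketched in plain Java as follows. This is only an illustration: the class and method names are hypothetical, the hit ids are assumed to arrive as strings of the form RecordId-N, and no Lucene API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Lucene.NET.
class SuffixedHits {
    // Strips the "-N" suffix that was appended to RecordId at index time.
    static String baseRecordId(String suffixedId) {
        int dash = suffixedId.lastIndexOf('-');
        return dash < 0 ? suffixedId : suffixedId.substring(0, dash);
    }

    // Collapses per-Event hits to distinct record ids, keeping first-seen order.
    static List<String> distinctRecordIds(List<String> hitIds) {
        Set<String> seen = new LinkedHashSet<>();
        for (String id : hitIds) {
            seen.add(baseRecordId(id));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Assumed hits for "Manchester" AND "Paris", one per matching Event instance.
        List<String> hits = Arrays.asList("12345-1", "12345-4");
        System.out.println(distinctRecordIds(hits)); // prints [12345]
    }
}
```

Using lastIndexOf rather than indexOf keeps the sketch working even if a RecordId itself contains a dash.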

2 Answers

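How about indexing Event as a non-tokenized field and then using a KeywordAnalyzer for it, so that each Event value is kept as a single term? You could then use Lucene's regex query to match values that contain both Manchester and Paris:

Event:/.*Manchester.+Paris.*/

Lucene's regular-expression syntax always matches against the whole term, so no ^ or $ anchors are needed. As written, the pattern only matches when Manchester comes before Paris; add the reverse ordering as an alternative if the order should not matter.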
cris almodovar
  • regex queries are not available in the current version of lucene.net (based on 3.0.3) – AndyPook Mar 21 '16 at 12:44
  • Regex queries are slow pre-4.0 (Lucene Java), but with v4.0+ regex queries are executed using dynamically constructed automatons, so it's much faster. If you're on .NET, I'd recommend using FlexLucene (an IKVM-based port of the latest Lucene Java) instead of Lucene.NET. – cris almodovar Mar 22 '16 at 14:35
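
To try the regex idea out on the sample data without an index, here is a small sketch that uses java.util.regex as a stand-in. That is an assumption made for illustration: Lucene's own regex syntax differs in places, and the class and method names are hypothetical. Each Event value is treated as one untokenized string and must match as a whole, and the pattern lists both orderings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex query over untokenized Event values.
class SingleEventRegex {
    // Both orderings, since ".*A.+B.*" alone only matches A before B.
    static final Pattern BOTH_CITIES =
        Pattern.compile(".*Manchester.+Paris.*|.*Paris.+Manchester.*");

    // True if at least one Event value, taken as a whole, contains both cities.
    static boolean anyEventMatches(List<String> eventValues) {
        for (String value : eventValues) {
            if (BOTH_CITIES.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> record12345 = Arrays.asList(
            "Manchester, Dubai, Paris", "Manchester, Rome, Madrid", "Madrid, Sidney");
        List<String> record99999 = Arrays.asList(
            "Manchester, Barcelona, Rome", "Rome, Paris", "Milan, Barcelona");
        System.out.println(anyEventMatches(record12345)); // prints true
        System.out.println(anyEventMatches(record99999)); // prints false
    }
}
```

Because every value is tested separately, "Manchester" in one Event and "Paris" in another never combine into a match, which is the behaviour the question asks for.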

Depending on the analyser you use for the field (it would need to tokenise and remove the punctuation), you could use a slop phrase query.

"manchester paris"~2 should find just 12345. Depending on the number and order of values in each field you may need to use a larger slop.

The slop is the maximum number of position moves allowed when lining the phrase up against the terms in the field. Moves are spent on skipping additional terms inside the phrase or on reordering its terms.

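So "x y"~1 could match

  • "x fred y" (one additional term)
  • but not "y x" (in Lucene, swapping two adjacent terms costs two moves, so it needs ~2)
  • and not "y fred x" (a swap plus an addition, which needs a larger slop still)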

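For your needs, the slop probably ought to equal the maximum number of terms allowed in a field. I haven't worked it through, but I think that would suffice even if you query for more than two terms. One caveat: term positions in a multi-valued field carry on from one instance to the next, so the analyser's position increment gap should be larger than the slop; otherwise a large slop can match across instances.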
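
To sanity-check slop values against the sample data, the position arithmetic can be modelled in a few lines of Java. This is a simplified model rather than Lucene's scorer: it assumes each phrase term occurs at most once in the field, it tokenises by lower-casing and splitting on non-letters, and the class, method names and the gap parameter (standing in for the position increment gap between field instances) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of sloppy phrase matching; not Lucene code.
class SlopModel {
    // Lower-cases a value and splits on non-letters, roughly what a
    // tokenising, punctuation-stripping analyser would produce.
    static List<String> tokens(String value) {
        List<String> out = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Lays the values of a multi-valued field out on one position axis,
    // leaving `gap` empty positions between consecutive values.
    static List<String> flatten(List<String> values, int gap) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (!out.isEmpty()) {
                for (int g = 0; g < gap; g++) out.add("");
            }
            out.addAll(tokens(v));
        }
        return out;
    }

    // Smallest slop at which the phrase lines up, or -1 if a term is missing.
    // Each term's position is shifted by its offset in the phrase; the spread
    // of the shifted positions is the number of moves needed.
    static int requiredSlop(List<String> docTokens, List<String> phrase) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < phrase.size(); i++) {
            int pos = docTokens.indexOf(phrase.get(i)); // first occurrence only
            if (pos < 0) return -1;
            min = Math.min(min, pos - i);
            max = Math.max(max, pos - i);
        }
        return max - min;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("manchester", "paris");
        List<String> record99999 = Arrays.asList(
            "Manchester, Barcelona, Rome", "Rome, Paris", "Milan, Barcelona");
        System.out.println(requiredSlop(tokens("Manchester, Dubai, Paris"), phrase)); // prints 1
        System.out.println(requiredSlop(flatten(record99999, 0), phrase));   // prints 3
        System.out.println(requiredSlop(flatten(record99999, 100), phrase)); // prints 103
    }
}
```

Under this model, record 99999 stays out of the results for ~2, but it would match at ~3 when no gap separates the Event instances. With a gap of 100 it only matches from ~103 upwards, which is why the gap should exceed whatever slop is chosen.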

AndyPook
  • I wasn't aware of slop phrase queries. So, a slop query of 2 moves would find all records containing within the same field the terms "manchester" and "paris" in any given order? – pelican_george Mar 21 '16 at 09:27
  • Short answer: yes, correct. I've updated the answer with a bit more on what slop does. – AndyPook Mar 21 '16 at 12:36
  • For a field without a set maximum number of terms, would it be overkill to set the slop to int.MaxValue? – pelican_george Mar 21 '16 at 12:52
  • It'd work. But NOT recommended. There aren't an infinite number of cities :) A little analysis of the requirement ought to suggest a reasonable upper number. I assume this is coming from some dataset. Can you spin over that to see the max number? then add some tolerance. – AndyPook Mar 21 '16 at 13:01