I have a multivalued field in my schema called citation. One of the documents in the database has values for this field like:

 "citation":["13-33",
             "12-44"],

I want to be able to do a query like: citation:(13 44) and not have this document returned. In other words, I do not want queries to span individual values for the field.

Is there a way to do this?


Some further examples using the document above of how I want this to work:

  • citation:(13 33) --> Returns it.
  • citation:(12 44) --> Returns it.
  • citation:(12) --> Returns it.
  • citation:(33 13) --> Returns it.
  • citation:(33 12) --> DOES NOT RETURN IT.
mlissner

3 Answers

SurroundQueryParser is your best bet for checking whether two terms occur in the same value of a multiValued field. Internally, a multivalued field is one long sequence of tokens, with a big position gap between tokens that belong to different "values". That gap is controlled by the positionIncrementGap parameter in schema.xml, and is usually 100. So a proximity query whose maximum distance is below 100 requires both terms to be within one field value.
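
To make the position gap concrete, here is a small Python sketch. It is only a model of the idea, not Solr or Lucene code: the function names `index_positions` and `within` are made up for illustration, and it assumes a tokenizer that splits "13-33" into the tokens "13" and "33". The exact positions Lucene assigns may differ in detail.

```python
import re

def index_positions(values, gap=100):
    """Give every token a position, skipping `gap` positions between values.

    Hypothetical model of positionIncrementGap; tokenizing on non-word
    characters is an assumption, not necessarily what your analyzer does.
    """
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # jump ahead before the next value starts
        for token in re.findall(r"\w+", value):
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions

def within(positions, a, b, max_distance):
    """True if some occurrence of a and of b are at most max_distance apart."""
    return any(
        abs(pa - pb) <= max_distance
        for pa in positions.get(a, [])
        for pb in positions.get(b, [])
    )

doc = index_positions(["13-33", "12-44"])
print(doc)                          # {'13': [0], '33': [1], '12': [102], '44': [103]}
print(within(doc, "13", "33", 99))  # True: same value
print(within(doc, "33", "13", 99))  # True: order does not matter here
print(within(doc, "33", "12", 99))  # False: the gap separates the two values
print(within(doc, "13", "44", 99))  # False
```

In this model a maximum distance of 99 can never bridge two values, as long as no single value is itself longer than the gap.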

Alexandre Rafalovitch
  • This assumes the entire value is less than 99 tokens long, right? – mlissner Nov 23 '15 at 22:33
  • If you have longer text, set the gap to 1000, or 10000. It does not take any extra space, the value is just an increment in token index position. – Alexandre Rafalovitch Nov 24 '15 at 01:45
  • I worked with this some today, and it looks like I can make it work using `~`, but not using `{~surround}`. It could be I can't figure out the syntax of `surround`, but is there a difference between these? – mlissner Feb 14 '16 at 01:11

Solr doesn't support this kind of query, but perhaps you can try block join to achieve it. https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

Zhitao Yue

I think you can solve this with the correct field type and tokenisation for the citation field. If you use a field type like this:

<fieldType name="citation" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.PatternCaptureGroupFilterFactory" 
           pattern="([0-9]+)-[0-9]+" preserve_original="true"/>
 </analyzer>
</fieldType>

Then your example document will be indexed thus:

"citation":["13", "13-33", "12", "12-44"]

This means the document will match on citation:"13" and citation:"13-33", but not on citation:"13-12" or citation:"13-44".
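
As a rough illustration of that analysis chain, the Python sketch below mimics a keyword tokenizer (the whole value is one token) followed by a capture-group filter that keeps the original token. It is not Solr code: `analyze` and `index_terms` are hypothetical helper names, and only the regular expression is taken from the field type above.

```python
import re

# Same pattern as in the fieldType definition above.
PATTERN = re.compile(r"([0-9]+)-[0-9]+")

def analyze(value):
    """Return the original value plus every captured group found in it."""
    tokens = [value]  # models preserve_original="true"
    for match in PATTERN.finditer(value):
        tokens.extend(group for group in match.groups() if group)
    return tokens

def index_terms(values):
    """Collect the set of terms indexed for a multivalued field."""
    return {token for value in values for token in analyze(value)}

terms = index_terms(["13-33", "12-44"])
print(sorted(terms))     # ['12', '12-44', '13', '13-33']
print("13" in terms)     # True
print("13-33" in terms)  # True
print("13-44" in terms)  # False: no single value produces this term
```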

brendanh
  • That's an interesting strategy, but in practice the citations aren't that regular. Unfortunately, different courts use different formats and I don't think a regex could ever match them all. – mlissner Jan 27 '16 at 13:42