
Background: I am doing key phrase extraction on some documents. I have a list of terms that I want to use as facets for uploaded documents (this part is done). For my list of colon cancer terms, an issue comes up: the facet states that 10 documents contain a particular term, but selecting it returns 400 documents, 10 of which actually contain the term while the other 390 do not. I believe this is because the term contains another term.

The term I am looking for is "no evidence". There is another term, "no", that actually comes up 400 times. Similarly, I am looking for the term "free of", which appears 1 time across all of the documents, but I get 31 results. There is a term "free" which shows up 31 times.

Here is my schema:

<field name="ColonCancer" type="ColonCancer" indexed="true" stored="true" multiValued="true"
       termPositions="true"
       termVectors="true"
       termOffsets="true"/>
<fieldType name="ColonCancer" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="5"
            outputUnigramsIfNoShingles="true"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ColonCancer.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory"
            words="prefLabels_ColonCancer.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

Is there a way to make it return only the correct count (so that "no evidence" shows only 10 results)?

EDIT: This seems to give what I want:

http://localhost:8983/solr/Cytokine/tvrh?q=%22no%22%20OR%20%22no%20evidence%22&fq=ColonCancer:no&fq=ColonCancer:no%20evidence&tv=true&tv.offsets=true
Kevin

1 Answer

You can fix this in multiple ways.

You can change the field to a string field. A string field is not analyzed, so matching becomes exact: looking for "no evidence" finds only the value "no evidence", and the match is case sensitive.

Another option is to use facet queries when looking for particular combinations of words. You can then put the ~ symbol and a number after a quoted phrase to set how far apart the words may be.
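As a rough illustration of the facet-query option, the sketch below builds a request whose facet.query is a quoted phrase with a ~ distance. The helper name facet_query_url and the distance of 2 are made up for this example; the core URL and field name come from the question. Whether the phrase matches the way you expect still depends on how the field's analyzer tokenizes it.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def facet_query_url(base_url, field, phrase, slop=0):
    """Build a /select URL that counts documents matching a quoted phrase.

    facet_query_url is a hypothetical helper, not part of Solr.
    """
    clause = f'{field}:"{phrase}"'
    if slop > 0:
        # "~N" after a quoted phrase lets the words be up to N positions apart
        clause += f"~{slop}"
    params = [
        ("q", "*:*"),        # match everything; only the facet count matters
        ("rows", "0"),       # do not return documents
        ("facet", "true"),
        ("facet.query", clause),
    ]
    return f"{base_url}/select?{urlencode(params)}"


url = facet_query_url(
    "http://localhost:8983/solr/Cytokine", "ColonCancer", "no evidence", slop=2
)
# Decode the query string again to show the facet.query Solr would receive
print(parse_qs(urlsplit(url).query)["facet.query"][0])
```

The count for this clause comes back under facet_counts.facet_queries in the response.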

Example (string-field approach):

<field name="ColonCancer" type="ColonCancer" indexed="true" stored="true" multiValued="true"
       termPositions="true"
       termVectors="true"
       termOffsets="true"/>

<!-- String type: the class is solr.StrField, and it cannot take an <analyzer> -->
<fieldType name="ColonCancerString" class="solr.StrField" sortMissingLast="true"/>

<!-- copyField needs a destination field, so declare one with the string type -->
<field name="ColonCancerString" type="ColonCancerString" indexed="true" stored="true" multiValued="true"/>

<copyField source="ColonCancer" dest="ColonCancerString"/>

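Here I've added another field called ColonCancerString that should hold the same text, but as a string.

The copyField line in the schema tells Solr to copy the field value. Note that it copies the raw value as submitted, before any analysis, so the string field matches a term exactly only when the whole submitted value is that term.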

See here for copy field thread:

How to use SOLR copyField directive
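Once the string copy exists, filtering on an exact facet value could look roughly like the sketch below. It uses Solr's {!term} query parser so that a value containing a space, such as "no evidence", is treated as one term instead of being split into separate clauses. The helper name exact_filter_url is made up for this example, and ColonCancerString is assumed to be defined as described above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def exact_filter_url(base_url, field, value):
    """Build a /select URL that filters on one exact, un-analyzed term.

    exact_filter_url is a hypothetical helper, not part of Solr.
    """
    # {!term f=FIELD}VALUE hands VALUE to the field as a single term,
    # so no quoting or escaping of spaces is needed
    fq = f"{{!term f={field}}}{value}"
    params = [("q", "*:*"), ("fq", fq)]
    return f"{base_url}/select?{urlencode(params)}"


filter_url = exact_filter_url(
    "http://localhost:8983/solr/Cytokine", "ColonCancerString", "no evidence"
)
# Decode the query string again to show the fq Solr would receive
print(parse_qs(urlsplit(filter_url).query)["fq"][0])
```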

Uri Shtand