Background: I am doing key phrase extraction on some documents. I have a list of terms that I want to use as facets for the uploaded documents (this part is done). One list is for colon cancer, and here I hit a problem: the facet states that 10 documents contain a particular term, but querying for it returns 400 documents. Only 10 of them actually contain the term; the other 390 do not. I believe this happens because the term contains another, shorter term.
The term I am looking for: no evidence
Another term that actually comes up 400 times: no
Similarly, I am looking for the term: free of
It appears once across all of the documents, but I get 31 results. There is also a term, free, which shows up 31 times.
Here is my schema:
<field name="ColonCancer" type="ColonCancer" indexed="true" stored="true" multiValued="true"
       termPositions="true"
       termVectors="true"
       termOffsets="true"/>
<fieldType name="ColonCancer" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="5"
            outputUnigramsIfNoShingles="true"
    />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ColonCancer.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory"
            words="prefLabels_ColonCancer.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
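To make the overlap between terms concrete, here is a rough Python simulation of what I think this analyzer chain emits for a piece of text. It is only a sketch, not Solr's actual code: it assumes the filters run in the order listed after the tokenizer, that ShingleFilterFactory also emits single words (its outputUnigrams default), it skips the synonym step entirely, and the KEEP set is a made-up stand-in for prefLabels_ColonCancer.txt.

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Emit each single token (optionally) plus every word n-gram
    of min_size..max_size words starting at that token."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text, keep_words):
    """Approximate the ColonCancer field type (synonym step omitted)."""
    tokens = text.split()                           # WhitespaceTokenizerFactory
    tokens = shingles(tokens)                       # ShingleFilterFactory, sizes 2..5
    tokens = [t.lower() for t in tokens]            # LowerCaseFilterFactory
    return [t for t in tokens if t in keep_words]   # KeepWordFilterFactory


# Hypothetical keep list; the real one lives in prefLabels_ColonCancer.txt.
KEEP = {"no", "no evidence", "free", "free of"}

indexed_terms = analyze("No evidence of recurrence", KEEP)
print(indexed_terms)
```

If this approximation is right, any document that contains no evidence is also indexed under no, and any document with free of is also indexed under free, so the shorter term always has at least as many documents as the longer one.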
Is there a way to make it return only the correct count (so that no evidence shows only 10 results)?
EDIT: This seems to give what I want:
http://localhost:8983/solr/Cytokine/tvrh?q=%22no%22%20OR%20%22no%20evidence%22&fq=ColonCancer:no&fq=ColonCancer:no%20evidence&tv=true&tv.offsets=true
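For readability, this small Python snippet decodes the parameters of that URL using only the standard library. It does not talk to Solr; it just shows what each parameter contains after URL decoding.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://localhost:8983/solr/Cytokine/tvrh"
       "?q=%22no%22%20OR%20%22no%20evidence%22"
       "&fq=ColonCancer:no&fq=ColonCancer:no%20evidence"
       "&tv=true&tv.offsets=true")

# parse_qs percent-decodes values and groups repeated keys (here: fq) into lists.
params = parse_qs(urlsplit(url).query)

for name, values in params.items():
    for value in values:
        print(f"{name} = {value}")
```

Decoded, the phrases in q are quoted, while the second fq is the unquoted ColonCancer:no evidence. I am not sure whether the query parser treats that unquoted value as a single phrase or splits it on the space, so that may be worth checking.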