I am trying to remove unwanted words, apply stemming, and finally create shingles. However, after the stop words are removed, I get shingles with "_" in place of the stop words. I tried using PatternReplaceFilterFactory to replace the _, but it's not working. My field type is defined as below:

<fieldType name="common_shingle" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
    </analyzer>
</fieldType>

When I analyse "A brown fox quickly jumps over the lazy dog", it gives me the following result:

  1. _ brown fox
  2. brown fox quickli
  3. fox quickli jump
  4. quickli jump _
  5. jump _ _
  6. _ _ lazi
  7. _ lazi dog

How do I remove the _ from the shingle tokens? Also, is there a way to create shingles only from stop words?

  • see http://stackoverflow.com/questions/28459949/solr-stop-words-replaced-with-symbol as well – Marco Jan 10 '17 at 10:01

3 Answers

1

That's because of the stop words. Set enablePositionIncrements to false and luceneMatchVersion to 4.3.

Replace your StopFilterFactory with this.

  <filter class="solr.StopFilterFactory" luceneMatchVersion="4.3" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
Balu
  • `luceneMatchVersion` as a filter parameter doesn't exist anymore as of Solr 6. You'd have to set `4.3` in solrconfig.xml. – Marco Jan 10 '17 at 09:45
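
To illustrate the comment above, a minimal sketch of where the version setting would live in solrconfig.xml. The `<luceneMatchVersion>` element is a standard part of that file; whether `enablePositionIncrements="false"` is still honored at this setting on a given Solr release is an assumption the thread does not confirm, so check it against your version before relying on it.

```xml
<!-- solrconfig.xml (fragment, not a complete file) -->
<config>
    <!-- Assumption: pins analysis behaviour to 4.3 so that the StopFilterFactory
         option enablePositionIncrements="false" is accepted. Not verified in this thread. -->
    <luceneMatchVersion>4.3</luceneMatchVersion>
</config>
```

Note that this setting is global to the core, so it would affect every analyzer, not just the shingle field type.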
1

In Solr's Jira there is an improvement request with an available patch: https://issues.apache.org/jira/browse/SOLR-11604

Compile a new lucene-analyzers-common.jar with this patch and use the skipFillerTokens="true" option in your schema.xml:

<filter class="solr.ShingleFilterFactory" ... skipFillerTokens="true"/>

If you want this patch to be included in the next Solr version, vote for this Jira issue.

0

The _ is inserted by the ShingleFilter: wherever the StopFilter removed a token and left a gap in the position increments, the ShingleFilter fills that position with the filler token _.

If you want to remove the value, you'll have to perform the PatternReplace after the ShingleFilter, as it doesn't exist in the token stream before that.
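
As a rough sketch of that reordering, here is the analyzer from the question with the PatternReplace step moved after the ShingleFilter. The trailing `LengthFilterFactory` is not part of the original thread; it is an assumption added here to drop the empty tokens that the replacement leaves behind.

```xml
<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Shingles are built first, so the filler token "_" now exists in the stream -->
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
    <!-- Blank out every shingle that contains the filler token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
    <!-- Assumption (not in the original): discard the resulting empty tokens -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
</analyzer>
```

This discards any shingle that spans a removed stop word; it does not close the gap and build new shingles across it.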

Elasticsearch exposes an option to select the replacement character as "filler_token", and Solr's implementation seems to use the Lucene implementation directly, so you should be able to use fillerToken to set this yourself. Try setting fillerToken="" in your ShingleFilter definition instead of using the PatternReplaceFilter.
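
A minimal sketch of that suggestion, reusing the shingle settings from the question. It assumes a Solr/Lucene version whose ShingleFilterFactory accepts the `fillerToken` attribute.

```xml
<!-- An empty filler token replaces "_", but the token separators around the gap remain -->
<filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3" fillerToken=""/>
```

Because only the filler text is emptied, shingles that span a removed stop word can still carry extra separator spaces where the stop word used to be.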

MatsLindh
  • It is working. But I want exactly 3-word shingles formed after the stop words have been removed, like this: "brown fox quickli, fox quickli jump, quickli jump lazi, jump lazi dog". I don't want the space or _ in the shingles. – Sanjay Lama Oct 18 '15 at 02:11
  • @SanjayLama OK - then that's what you should have asked :-) If you keep a magic filler token (such as `_`) in your text, you can move the PatternReplace to after the ShingleFilter and replace all tokens that contain `_` with "" (which is what your filter does), in effect removing them from the value of the field. – MatsLindh Oct 18 '15 at 15:11
  • It's working like you said. However, it is skipping the shingles that contain stop words. Currently it gives me only 2 shingles: 'brown fox quickli' and 'fox quickli jump'. The rest are just removed. I want the shingles to be formed after the stop words have been removed, like this: "brown fox quickli, fox quickli jump, quickli jump lazi, jump lazi dog". If you could guide me through this, it would be really helpful. Thank you. – Sanjay Lama Oct 20 '15 at 00:27