Can't remove punctuation in Solr

Question

I have a Solr install to query content on a Drupal site. Many of the title fields have punctuation at the start of the string, so when I sort by title the punctuation appears at the top of the list.

I would like to get Solr to ignore the punctuation when sorting by title, but none of the solutions I have tried work.

I am fairly new to solr and so it may be something really simple that I am doing wrong... I don't really understand much of what is going on in the schema.xml file!

The title field is called label in solr and I have tried various methods in solr.PatternReplaceFilterFactory which do not work.

<field name="label" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
<copyField source="label" dest="sort_label"/>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
           pattern="(^\p{Punct}+)" replacement="" replace="all"
    />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

  </analyzer>
  <analyzer type="query">
…
</analyzer>

My query is start=0&rows=25&q=education&fl=id%2Centity_id%2Centity_type%2Cbundle%2Cbundle_name%2Csort_label%2Css_language%2Cis_comment_count%2Cds_created%2Cds_changed%2Cscore%2Cpath%2Curl%2Cis_uid%2Ctos_name%2Czm_parent_entity%2Css_filemime%2Css_file_entity_title%2Css_file_entity_url&pf=content%5E2.0&&sort=sort_label%20asc

Can you share the fieldType text from your schema.xml? – Abhijit Bashetti, Apr 11 '16 at 09:18
Done - sorry, the opening fieldType tag was missing. – ankles, Apr 11 '16 at 13:07

Abhijit Bashetti · Accepted Answer · 2016-04-11T13:11:00.680

This is done with the WordDelimiterFilterFactory. Set generateWordParts=1. Add this filter to your fieldType definition.

After modifying schema.xml, restart the server and re-index the data.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>

        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                protected="protwords.txt"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="0"
                preserveOriginal="1"/>
        <filter class="solr.LengthFilterFactory" min="2" max="100" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

      </analyzer>
    </fieldType>
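
One related point that may be worth checking: copyField copies the raw input value, not the analyzed tokens, so sorting on sort_label is governed by whatever type sort_label itself has, and that definition is not shown in the question. A minimal sketch of a dedicated sort type follows, assuming a hypothetical type name `sort_ignore_punct` and assuming sort_label can be redefined to use it. The pattern is the one from the question, applied here to the whole title as a single token.

```xml
<!-- Hypothetical sort-only type: the name "sort_ignore_punct" is an assumption -->
<fieldType name="sort_ignore_punct" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole title as one token so it can be sorted as a unit -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Drop surrounding whitespace before looking at the first character -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- Strip punctuation at the start of the title only -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\p{Punct}+" replacement="" replace="all"/>
    <!-- Sort case-insensitively -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed definition of the sort field; the real one is not in the question -->
<field name="sort_label" type="sort_ignore_punct" indexed="true" stored="true"/>
<copyField source="label" dest="sort_label"/>
```

Token filters such as PatternReplaceFilterFactory run after the tokenizer, so they are listed after it inside the analyzer. As with any schema change, a restart and a re-index would be needed before the sort order changes.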

edited Apr 11 '16 at 13:11

answered Apr 11 '16 at 09:19

Abhijit Bashetti

Can you try the one I have added now? Let's check whether it works for you. – Abhijit Bashetti Apr 11 '16 at 13:11
This seems to have caused some problems on the search, I don't think my settings for bias and other tweaks are working any more. Is it supposed to have the analyser type="index" removed? I think that is causing the problem – ankles Apr 12 '16 at 08:18
If you don't specify the type, the same analyzer is used for both indexing and querying. If you want a different analyzer for indexing, you need to specify the types. Here, if you want to use the same analyzer for index and query, there is no need to specify it. If you want separate analyzers for the two stages, define them separately and give each one its type. – Abhijit Bashetti Apr 12 '16 at 08:23
BTW, did you analyse the same text in the Solr analysis tool, to see how it is parsed while indexing and while querying? Let me know what is not working. – Abhijit Bashetti Apr 12 '16 at 08:24
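
To illustrate the comment above, here is a sketch of the two forms. The type names `text_shared` and `text_split` are hypothetical, and the filter chains are deliberately short; this is not the asker's actual configuration.

```xml
<!-- One untyped analyzer: used at both index time and query time -->
<fieldType name="text_shared" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Two typed analyzers: each stage gets its own chain -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the original schema had query-side settings that differed from the index side, collapsing both into a single untyped analyzer would drop them, which may be worth ruling out when search behaviour changes after such an edit.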
I think it is working. However, I have a lot of errors in my Drupal log, and various modules are not working at all. I don't know if this is connected or a different problem... – ankles Apr 12 '16 at 09:37
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/108910/discussion-between-abhijit-bashetti-and-ankles). – Abhijit Bashetti Apr 12 '16 at 09:42
Hi thanks for helping fix this! I have had to upgrade from Solr 4 to Solr 5 now, and the problem is occurring again... I have opened a new question at http://stackoverflow.com/questions/36798803/cant-ignore-punctuation-in-titles-solr-5-and-drupal - hope you can help! – ankles Apr 25 '16 at 09:01
