SOLR and accented characters

Question

I have an index for occupations (identifier + occupation):

<field name="occ_id" type="int" indexed="true" stored="true" required="true" />
<field name="occ_tx_name" type="text_es" indexed="true" stored="true" multiValued="false" />


<!-- Spanish -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This is a real query, for three identifiers (1, 195 and 129):

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_id:1+occ_id:195+occ_id:129&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_id:1 occ_id:195 occ_id:129",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944},
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680},
      {
        "occ_id":195,
        "occ_tx_name":"Osteópata",
        "_version_":1565225103858335746}]
  }}

Two of them have accented characters, and one does not. So let's search by occ_tx_name without accents:

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:abogado&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_tx_name:abogado",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944}]
  }}

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:informatico&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:informatico",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680}]
  }}


curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:osteopata&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:osteopata",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

I am very annoyed that the last search, 'osteopata', fails while 'informatico' succeeds. The source data for the index is a simple MySQL table:

-- -----------------------------------------------------
-- Table `mydb`.`occ_occupation`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`occ_occupation` (
  `occ_id` INT UNSIGNED NOT NULL,
  `occ_tx_name` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`occ_id`))
ENGINE = InnoDB;

The collation of the table is “utf8mb4_general_ci”. The index is created with DataImportHandler. This is the definition:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://192.168.1.11:3306/mydb"
        user="mydb" password="mydb" />
    <document name="occupations">
        <entity name="occupation" pk="occ_id"
            query="SELECT occ.occ_id, occ.occ_tx_name FROM occ_occupation occ WHERE occ.sta_bo_deleted = false">
            <field column="occ_id" name="occ_id" />
            <field column="occ_tx_name" name="occ_tx_name" />
        </entity>
    </document>
</dataConfig>

I need some clue to detect the problem. Can anyone help me? Thanks in advance.

I forgot to mention that I'm using solr-6.3.0, and I'm starting the server with this command: solr start -a "-Duser.language=es -Duser.country=ES -Duser.timezone=Europe/Madrid" — Ernesto Salgado, Apr 20 '17 at 20:28

score 1 · Answer 1 · answered Apr 20 '17 at 22:43

Just add solr.ASCIIFoldingFilterFactory to your analyzer's filter chain or, even better, create a new fieldType:

<!-- Spanish -->
<fieldType name="text_es_ascii_folding" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block (the first 128 code points, i.e. ASCII) to their ASCII equivalents, where one exists.

This should let you match the search even if the accented character is missing. The downside is that words like "cañon" and "canon" are now equivalent and both hit the same documents, IIRC.
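
To make the folding idea concrete, here is a rough Python sketch of accent folding. It is only an approximation for illustration, not how Solr implements the filter, and `fold_accents` is a hypothetical helper name: it decomposes each character and drops the combining marks.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding (hypothetical helper, not Solr code)."""
    # NFD splits a precomposed letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Escapes are used so the exact code points are unambiguous:
# \u00e1 is the precomposed "a with acute", \u00f1 is "n with tilde".
print(fold_accents("Inform\u00e1tico"))
print(fold_accents("ca\u00f1on") == fold_accents("canon"))
```

With this kind of folding, "cañon" and "canon" collapse to the same string, which is the downside mentioned above.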

Hi. I have added the filter "solr.ASCIIFoldingFilterFactory", but I get the exact same results... — Ernesto Salgado, Apr 21 '17 at 19:34

score 0 · Answer 2 · answered Apr 20 '17 at 21:02

I don't think MySQL or your JVM settings have anything to do with this. I suspect that one works and the other does not because of the SpanishLightStemFilterFactory.

The right way to achieve matching no matter the diacritics is to use the following:

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

Put that before your tokenizer in both the index and query analyzer chains, and any diacritic should be converted to its ASCII version. That should make it work every time.

Persimmonium


Go to the Analysis tab and look at the verbose output for that word on both the index and query sides. – Persimmonium Apr 21 '17 at 20:33
It is insane! In Solr Admin, I selected my index and clicked on the Schema section. Then I selected the field 'occ_tx_name' and clicked the "Load term info" button to see the top 10 terms. I changed 10 to 278 to see all the terms. Each term in the list is an HTML anchor that links to a SOLR query. And I can't believe what I'm seeing... – Ernesto Salgado Apr 21 '17 at 20:47
In the list, all the terms have accented characters... All but "informatico"! Here is an example. This is the anchor associated with the stored term "informatico" (lowercase, without the accented character): informatico. – Ernesto Salgado Apr 21 '17 at 20:48
And this is the anchor associated with "osteópata": osteópata... As you can see, "osteópata" has been stored with the accented character... I don't know why, but this search succeeds: curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?q=occ_tx_name:osteo%CC%81pata&indent=on&wt=json" – Ernesto Salgado Apr 21 '17 at 20:48
You did reindex after changing the schema to add the charFilter, right? – Persimmonium Apr 21 '17 at 20:59
Yes, I have reindexed several times to make sure. Only the 'á' is transformed to 'a'; é, í, ó and ú remain the same... – Ernesto Salgado Apr 21 '17 at 21:05
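
The percent-encoded query above is a useful clue. A small Python sketch, using only the standard library, decodes the term from that URL and lists its code points; the variable names are just for illustration.

```python
import unicodedata
from urllib.parse import unquote

# Term copied from the query that succeeded; unquote decodes as UTF-8 by default.
term = unquote("osteo%CC%81pata")

# List every code point so that invisible combining marks show up.
for ch in term:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Compare with the precomposed spelling (\u00f3 is "o with acute").
precomposed = "oste\u00f3pata"
print(term == precomposed)
print(unicodedata.normalize("NFC", term) == precomposed)
```

The sixth code point is COMBINING ACUTE ACCENT (U+0301), so the term contains a plain "o" followed by a separate accent rather than a single "ó".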

score 0 · Accepted Answer · answered Apr 21 '17 at 21:52

OK, I have discovered the source of the problem. I opened my SQL load script with vi in hex mode.

This is the hex content for 'Agrónomo' in an INSERT statement: 41 67 72 6f cc 81 6e 6f 6d 6f.

6f cc 81! That is "o" followed by COMBINING ACUTE ACCENT (U+0301) in UTF-8!

So that's the problem... It should be "c3 b3", the precomposed "ó" (U+00F3)... I got the literals by copying and pasting from a web page, so the characters in the original source were the problem.
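
A minimal Python sketch of the difference, and of one possible fix, assuming the data can be normalized to NFC before it is loaded (the variable names are only illustrative):

```python
import unicodedata

# Bytes found in the load script for 'Agronomo' with its accent (decomposed form).
raw = bytes.fromhex("41 67 72 6f cc 81 6e 6f 6d 6f")
decomposed = raw.decode("utf-8")

# NFC composes 'o' + COMBINING ACUTE ACCENT into the single character U+00F3.
composed = unicodedata.normalize("NFC", decomposed)
encoded = composed.encode("utf-8")

# The decomposed string is one code point longer than the composed one.
print(len(decomposed), len(composed))
# Hex dump of the normalized bytes, for comparison with the original dump.
print(" ".join(f"{b:02x}" for b in encoded))
```

After normalization the "6f cc 81" sequence becomes "6f"-free "c3 b3" for that letter, which is the form that accent-folding filters and mapping files typically expect.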

Thanks to both of you, because I have learned more about SOLR's soul.

Regards.
