I have Sphinx installed on a Vagrant machine running CentOS 6, and I'm trying to use the Dutch libstemmer from Snowball. The installation completed successfully, but my tests do not give the expected results.

I created two indexes with exactly the same data; only the first one has the morphology setting. My configuration is:

index shop_products1 {
  type = rt
  dict = keywords
  min_prefix_len = 3
  rt_mem_limit = 2046M

  path = /var/lib/sphinxsearch/data/shop_products2

  morphology = libstemmer_nl, stem_en
  
  html_strip = 1
  html_index_attrs = img=alt,title; a=title;

  preopen = 1
  inplace_enable = 1
  index_exact_words = 1

  
  rt_field = name
  rt_field = brand
  rt_field = description
  rt_field = specifications
  rt_field = tags
  rt_field = ourtags
  rt_field = searchfield
  rt_field = shop
  rt_field = category
  
  rt_field = color
  rt_field = ourcolor
  rt_field = gender
  rt_field = material

  rt_field = ean
  rt_field = sku

  rt_attr_string = ean
  rt_attr_string = sku
  rt_attr_float = price
  rt_attr_float = discount
  rt_attr_uint = shopid
  rt_attr_uint = itemid
  rt_attr_uint = deleted
  rt_attr_uint = duplicate
  rt_attr_uint = brandid
  rt_attr_uint = duplicates
  rt_attr_timestamp = updated_at
}

index shop_products2 {
  type = rt
  dict = keywords
  min_prefix_len = 3
  rt_mem_limit = 2046M

  path = /var/lib/sphinxsearch/data/shop_products20

  html_strip = 1
  html_index_attrs = img=alt,title; a=title;

  preopen = 1
  inplace_enable = 1
  index_exact_words = 1

  
  rt_field = name
  rt_field = brand
  rt_field = description
  rt_field = specifications
  rt_field = tags
  rt_field = ourtags
  rt_field = searchfield
  rt_field = shop
  rt_field = category
  
  rt_field = color
  rt_field = ourcolor
  rt_field = gender
  rt_field = material

  rt_field = ean
  rt_field = sku

  rt_attr_string = ean
  rt_attr_string = sku
  rt_attr_float = price
  rt_attr_float = discount
  rt_attr_uint = shopid
  rt_attr_uint = itemid
  rt_attr_uint = deleted
  rt_attr_uint = duplicate
  rt_attr_uint = brandid
  rt_attr_uint = duplicates
  rt_attr_timestamp = updated_at
}




searchd {
  listen = 127.0.0.1:9306:mysql41
  log = /var/log/sphinxsearch/searchd.log
  workers = threads
  binlog_path = /var/lib/sphinxsearch/rt-binlog

  read_timeout = 5
  client_timeout = 200
  max_children = 0
   
  # 2 hours
  rt_flush_period = 7200
  pid_file = /var/run/searchd.pid
  
}

When I search for the Dutch word "afzuigkappen", for example, it should return exactly the same results as "afzuigkap".

Can someone explain how to get this working? P.S. Sorry for my bad English.

Assem

2 Answers

The Dutch stemmer in Snowball stems afzuigkappen and afzuigkap differently:

afzuigkappen -> afzuigkapp
afzuigkap    -> afzuigkap

So to reach your goal you would have to modify the stemmer algorithm itself. The algorithm is described in the Snowball documentation for the Dutch stemmer.
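To make the difference concrete, here is a minimal Python sketch of why the two words end up with different stems. It is not the real Snowball code: it assumes only the "en" suffix rule and the undoubling rule (kk, dd, tt) as I understand them from the published Dutch algorithm description, and it ignores the R1/R2 regions and every other step. The function name strip_en is made up for this illustration.

```python
# Simplified sketch, NOT the real Snowball Dutch stemmer: it covers only
# the "en" suffix rule and the undoubling rule, and skips the R1/R2
# region checks and all other steps of the algorithm.

VOWELS = set("aeiouyè")


def strip_en(word: str) -> str:
    """Remove a trailing 'en' after a consonant, then undouble kk/dd/tt."""
    if (
        word.endswith("en")
        and len(word) > 3
        and word[-3] not in VOWELS
        and not word.endswith("gemen")
    ):
        word = word[:-2]
        # Only these three doubled endings are undoubled; "pp" is left alone,
        # which is why "afzuigkappen" keeps its double p.
        if word.endswith(("kk", "dd", "tt")):
            word = word[:-1]
    return word


for w in ("afzuigkappen", "afzuigkap"):
    print(w, "->", strip_en(w))
```

Under these assumptions the plural keeps its double "p" after "en" is removed, so it never collapses to the same stem as the singular.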

Assem
  • OK, but there should at least be some difference in the results. As far as I can tell it does nothing at the moment, because index 1 returns exactly the same as index 2. – Rick Bongers Sep 07 '15 at 06:53
  • They don't give the same results because the stems are different. – Assem Sep 07 '15 at 07:57

Alright, I have created some specific tests. These are the indexes I created:

index test1 {
  type = rt
  dict = keywords
  min_prefix_len = 3
  rt_mem_limit = 2046M

  morphology = libstemmer_nl, stem_en

  path = /var/lib/sphinxsearch/data/test1

  preopen = 1
  inplace_enable = 1
  index_exact_words = 1

  rt_field = name
  rt_attr_uint = shopid
  rt_attr_uint = itemid
    
}

index test2 {
  type = rt
  dict = keywords
  min_prefix_len = 3
  rt_mem_limit = 2046M

  path = /var/lib/sphinxsearch/data/test2

  preopen = 1
  inplace_enable = 1
  index_exact_words = 1

  rt_field = name
  rt_attr_uint = shopid
  rt_attr_uint = itemid
    
}

I indexed a smaller database of football products and searched both indexes with Sphinx. The results: https://i.stack.imgur.com/GGBOO.jpg

As you can see, both give the same output of 53 records. If I search directly in MySQL with select * from tests1 WHERE name LIKE '%keeper%', I get 360 results.
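One possible reason the two counts differ, independent of the stemmer, is that LIKE '%keeper%' matches the substring anywhere inside a name, while a plain full-text keyword has to match a whole indexed word. The Python sketch below only illustrates that difference. The product names are hypothetical, not taken from the real data, and it assumes a case-insensitive collation on the MySQL side.

```python
import re

# Hypothetical product names, invented for this illustration.
names = [
    "Keeper handschoenen",
    "Keepershandschoenen junior",
    "Doelkeeper shirt",
    "Voetbal maat 5",
]


def like_substring(rows, needle):
    """Mimic SQL LIKE '%needle%' with a case-insensitive substring test."""
    return [r for r in rows if needle.lower() in r.lower()]


def whole_word(rows, word):
    """Mimic a plain keyword match: the query must equal a whole token."""
    return [r for r in rows if word.lower() in re.findall(r"\w+", r.lower())]


print(len(like_substring(names, "keeper")))  # substring hits
print(len(whole_word(names, "keeper")))      # whole-token hits
```

With this sample data the substring test finds three names and the whole-word test finds one, so a gap between the SQL count and the Sphinx count does not by itself show that the stemmer is inactive.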