4

I'm looking for an implementation of a Croatian word stemming algorithm, ideally in Java, but I would also accept any other language.

Is there a community of English-speaking developers who build search applications for the Croatian language?

Thanks,

Chris

2 Answers

6

Slavic languages are highly inflected. The most accurate and fastest approach would be a combination of rules and large mappings/dictionaries.
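As a rough illustration of the "rules plus dictionary" combination, here is a minimal sketch. The exception table, the suffix list and the `MIN_STEM_LENGTH` threshold are all toy values made up for this example, not a real Croatian lexicon or rule set; a usable stemmer would need far larger resources.

```python
# Toy data for illustration only; a real system would load a large lexicon.
EXCEPTIONS = {
    "ljudi": "čovjek",  # suppletive plural: "people" -> "person"
    "psa": "pas",       # genitive of "dog", stem vowel drops out
}

# A handful of inflectional endings, longest first so "ama" is tried before "a".
SUFFIXES = sorted(
    ["ama", "ima", "om", "em", "a", "e", "i", "u", "o"],
    key=len,
    reverse=True,
)

# Assumed threshold: never strip a suffix if fewer than 3 characters remain.
MIN_STEM_LENGTH = 3


def stem(word: str) -> str:
    """Look the word up in the exception table, else strip a known suffix."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


print(stem("ljudi"))      # handled by the dictionary
print(stem("gradovima"))  # handled by the suffix rules
```

The dictionary lookup runs first so that irregular forms never reach the rules; the rules then cover the regular words that the dictionary does not list.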

Work has been done, but it has been held back. The Croatian Morphological Lexicon will help, but it is only available behind a slow API. More work is available for Bosnian, Serbian and Croatian taken together than for Croatian alone.

Large mappings aren't always convenient, and one could in effect build a better rule-based transformer from the mappings, dictionaries or a corpus.

Implementing the stemmer with Hunspell and affix files could be a good way to get community and Java support. E.g. Google search: hr_hr.aff

Not tested: one should be able to reverse all the words, build a trie of their ending characters, traverse it using some rules (e.g. LCS), and build an accurate statistical transformer from corpus text.
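A minimal sketch of that idea follows, under several assumptions: the training data is a list of (inflected form, lemma) pairs, the rewrite rule is derived from the longest common prefix (a simpler stand-in for the LCS mentioned above), and the `PAIRS` list, `make_rule` and `SuffixTrie` are hypothetical names and toy data invented for this example.

```python
from collections import Counter


def make_rule(word, lemma):
    """Return (suffix_to_strip, suffix_to_add) from the longest common prefix."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


class SuffixTrie:
    """Trie over reversed words; each node counts the rules seen below it."""

    def __init__(self):
        self.children = {}
        self.rules = Counter()

    def add(self, word, lemma):
        rule = make_rule(word, lemma)
        node = self
        node.rules[rule] += 1
        for ch in reversed(word):
            node = node.children.setdefault(ch, SuffixTrie())
            node.rules[rule] += 1

    def stem(self, word):
        # Follow the word's ending as deep as the trie allows.
        node = self
        best = node.rules
        for ch in reversed(word):
            node = node.children.get(ch)
            if node is None:
                break
            best = node.rules
        # Apply the most frequent rule at that node that fits the word.
        for (strip, add), _count in best.most_common():
            if word.endswith(strip):
                return word[: len(word) - len(strip)] + add
        return word


# Toy training pairs (inflected form, lemma); a real model needs a corpus.
PAIRS = [
    ("gradovima", "grad"), ("gradova", "grad"), ("stolovima", "stol"),
    ("knjige", "knjiga"), ("knjigama", "knjiga"), ("ženama", "žena"),
]

trie = SuffixTrie()
for inflected, lemma in PAIRS:
    trie.add(inflected, lemma)

print(trie.stem("zidovima"))  # unseen word, matched by its ending
print(trie.stem("rukama"))    # unseen word, matched by its ending
```

Counting rules at every node lets an unseen word fall back to the statistics of the longest ending it shares with the training data.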

Best I can do is some Python:

import hunspell

# Load the Croatian dictionary and affix files (paths depend on your system).
hs = hunspell.HunSpell(
    '/usr/share/myspell/hr_HR.dic',
    '/usr/share/myspell/hr_HR.aff')

# The following should return ['hrvatska']:
print(hs.stem('hrvatski'))
12345
0

Here you can find a recent Python implementation of a stemmer for Croatian, developed at FFZG.

We performed basic evaluation of the stemmer on a lemmatized newspaper corpus as gold standard with a precision of 0.986 and recall of 0.961 (F1 0.973) for adjectives and nouns. On all parts of speech we obtained precision of 0.98 and recall of 0.92 (F1 0.947).
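For reference, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the figures quoted above; the function name `f1_score` is just a label chosen for this example. The second value comes out slightly higher than the reported 0.947, presumably because the quoted precision and recall are themselves rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.986, 0.961), 3))  # adjectives and nouns -> 0.973
print(round(f1_score(0.98, 0.92), 3))    # all parts of speech -> 0.949
```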

It is released under a GNU licence, but feel free to contact the author for further help (I only know the original author, Nikola, but not his student).

mislavcimpersak