I've started playing with Snowball. Below is the basic code I'm using; I'm stuck on the non-ASCII letter č. From what I can see in the generated files, it gets no special handling.
The .sbl in a nutshell is:
externals ( stem )
stringdef cv '{U+010D}' // č
define stem as (
    backwards (
        [substring] among (
            'le{cv}en' (<-'leko')
        )
    )
)
The .py is created with this command:
snowball -py sl.sbl -o out/sl
The test file:
#!/usr/bin/env python
from out.sl import Sl
stemmer = Sl()
print(stemmer.stemWords(['mlečen', 'mle{cv}en']))
The result is that {cv} is matched literally instead of being expanded, and the actual Unicode č in the input is not matched:
(venv) pooh@dell ~/metagrocery/stemmer $ ./test.py
['mlečen', 'mleko']
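One thing I can at least rule out on the Python side is the encoding of the input word. This is only a minimal sketch, assuming the stringdef is meant to match the precomposed code point U+010D: it lists the code points in a word and checks that the decomposed spelling (c followed by a combining caron) normalizes to the same string. The helper name describe is made up for illustration and is not part of Snowball.

```python
import unicodedata


def describe(word):
    """Return a (code point, Unicode name) pair for each character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]


# č written as the single precomposed code point U+010D
precomposed = "mle\u010den"
# č written as 'c' followed by U+030C COMBINING CARON
decomposed = "mlec\u030cen"

for label, word in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, describe(word))

# NFC normalization folds 'c' + combining caron into U+010D,
# so both spellings compare equal afterwards.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)
```

If the word passed to stemWords shows U+010D at the fourth position, the input itself is presumably fine and the mismatch is more likely in how the .sbl string is interpreted.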
Maybe I'm missing something obvious?