Good day.
I'm trying to use Hunspell as a stemmer in my application. I don't much like Porter and Snowball stemming because they produce "chopped" results such as "abus" and "exampl". Lemmatizing seems like a good alternative, but I don't know of any good alternatives to CoreNLP, and I'm certainly not ready to port my project's source code to Java or use bridges yet. Ideally I would like to get the base, dictionary form of a given word.
As I've noticed, most dictionaries have separate entries in the .dic file for: bid and bidding, set and setting, get and getting, etc. I'm not that experienced with Hunspell, but isn't there a clever way to handle the doubled d or t for 3-letter words? Is there a way to make it treat "setting" as derived from "set"?
My particular problem with Hunspell right now is that I can't find good, comprehensive documentation for creating/editing an affix file. This is what the documentation says here: http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html
(4) condition.
Zero stripping or affix are indicated by zero. Zero condition is
indicated by dot. Condition is a simplified, regular
expression-like pattern, which must be met before the affix can
be applied. (Dot signs an arbitrary character. Characters in
braces sign an arbitrary character from the character subset.
Dash hasn’t got special meaning, but circumflex (^) next the
first brace sets the complementer character set.)
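To check my reading of that paragraph, here is a minimal Python sketch of how I understand a suffix condition to be tested. It is my own emulation, not Hunspell's code: the function name `suffix_condition_matches` is made up, and I am assuming the condition is compared against the end of the word.

```python
import re

def suffix_condition_matches(word: str, condition: str) -> bool:
    """Emulate a Hunspell-style suffix condition (assumed to be anchored at the word end)."""
    if condition == ".":
        return True  # a lone dot is the "zero condition": always satisfied
    parts = []
    i = 0
    while i < len(condition):
        ch = condition[i]
        if ch == ".":
            parts.append(".")  # dot stands for an arbitrary character
            i += 1
        elif ch == "[":
            end = condition.index("]", i)
            body = condition[i + 1:end]
            negate = body.startswith("^")  # circumflex right after "[" complements the set
            if negate:
                body = body[1:]
            # re.escape keeps "-" literal, since the dash has no special meaning here
            parts.append("[" + ("^" if negate else "") + re.escape(body) + "]")
            i = end + 1
        else:
            parts.append(re.escape(ch))  # any other character matches itself
            i += 1
    return re.search("".join(parts) + r"\Z", word) is not None

print(suffix_condition_matches("make", "e"))     # True
print(suffix_condition_matches("make", "[^e]"))  # False
print(suffix_condition_matches("set", "[^e]"))   # True
```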
The default rule in the affix file is this:
SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]
I've tried this one:
SFX G Y 4
SFX G e ing e
SFX G 0 ing [^e]
SFX G 0 ting [bcdfghjklmnpqrstvwxz][aeiou]t
SFX G 0 ding [bcdfghjklmnpqrstvwxz][aeiou]d
but it clearly also matches longer words that merely end in the same letters, such as "asset" (asSET). Is there any way to get around this? I've tried putting a ^ symbol at the start of the condition, but that doesn't seem to work. What can I do to make it work?
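To make the problem concrete, here is a small Python sketch that emulates the four rules I tried. The regexes are my hand translation of the conditions, the trailing `$` reflects my assumption that suffix conditions are checked at the end of the word, and `RULES` and `apply_rules` are names I made up for illustration.

```python
import re

# (strip, add, condition) for each SFX G line I tried.
# Assumption: the condition is anchored at the end of the word, hence "$".
RULES = [
    ("e", "ing", r"e$"),
    ("", "ing", r"[^e]$"),
    ("", "ting", r"[bcdfghjklmnpqrstvwxz][aeiou]t$"),
    ("", "ding", r"[bcdfghjklmnpqrstvwxz][aeiou]d$"),
]

def apply_rules(stem):
    """Return every form the emulated rules would generate for `stem`."""
    forms = []
    for strip, add, cond in RULES:
        if re.search(cond, stem):
            # Remove the stripped characters (if any) before appending the suffix.
            base = stem[: len(stem) - len(strip)] if strip else stem
            forms.append(base + add)
    return forms

print(apply_rules("set"))    # ['seting', 'setting']
print(apply_rules("asset"))  # ['asseting', 'assetting']
print(apply_rules("make"))   # ['making']
```

If this emulation is right, nothing in the condition limits the match to 3-letter words, so "asset" produces "assetting". It also shows that the `[^e]` rule still applies to "set", so "seting" would be generated alongside "setting".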
Thanks in advance.