Hunspell affix condition regex format. Any way to match the start?

Question

Good day.

I'm trying to use Hunspell as a stemmer in my application. I don't much like Porter and Snowball stemming because of their "chopped" results such as "abus" and "exampl". Lemmatizing seems like a good alternative, but I don't know of any good alternatives to CoreNLP, and I'm certainly not ready to port my project's source code to Java or use bridges yet. Ideally I would like to get the initial, dictionary form of a given word.

As I've noticed, most dictionaries have separate entries in the .dic file for bid and bidding, set and setting, get and getting, etc. I'm not that experienced with Hunspell, but isn't there some clever way to handle the doubled d or t for three-letter words? Is there a way to make it recognize that "setting" is actually derived from "set"?

My particular problem with Hunspell right now is that I can't find good, comprehensive documentation on creating or editing an affix file. This is what the documentation says here: http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html

(4) condition.

Zero stripping or affix are indicated by zero. Zero condition is
indicated by dot. Condition is a simplified, regular
expression-like pattern, which must be met before the affix can
be applied. (Dot signs an arbitrary character. Characters in
braces sign an arbitrary character from the character subset.
Dash hasn’t got special meaning, but circumflex (^) next the
first brace sets the complementer character set.)

The default rule set is this:

SFX G Y 2
SFX G   e     ing        e
SFX G   0     ing        [^e]

I've tried this one:

SFX G Y 4
SFX G   e     ing        e
SFX G   0     ing        [^e] 
SFX G   0     ting       [bcdfghjklmnpqrstvwxz][aeiou]t 
SFX G   0     ding       [bcdfghjklmnpqrstvwxz][aeiou]d

but it will clearly also match asSET. Is there any way around this? I've tried putting a ^ symbol at the start of the condition, but it doesn't seem to work. What can I do to make it work?

Thanks in advance.

score 2 · Accepted Answer · answered Jan 12 '15 at 20:25

Why would it match asset? That's not a verb, and as such shouldn't have that suffix attached to it.
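
To make that concrete, here is a rough Python sketch of the idea, not Hunspell's actual code. It assumes that a suffix condition is only compared against the end of the word (so it cannot be anchored to the start), and that a rule is only tried on words that carry its flag in the .dic file. The helper names `condition_to_regex` and `can_attach` are made up for illustration, and the translation only handles letters, dots and bracket sets.

```python
import re

def condition_to_regex(condition: str) -> re.Pattern:
    """Translate a simple Hunspell-style suffix condition into a regex
    anchored at the END of the word (assumption: letters, dots and
    bracket sets only, so the text can be reused as regex syntax)."""
    # A lone dot means "no condition", i.e. every word qualifies.
    body = "" if condition == "." else condition
    return re.compile(f"(?:{body})$")

def can_attach(word: str, flags: set, rule_flag: str, condition: str) -> bool:
    """A rule is considered only if the word carries the rule's flag
    AND the end of the word satisfies the condition."""
    return rule_flag in flags and condition_to_regex(condition).search(word) is not None

cond = "[bcdfghjklmnpqrstvwxz][aeiou]t"

# The condition by itself cannot tell "set" from "asset" ...
print(bool(condition_to_regex(cond).search("set")))    # True
print(bool(condition_to_regex(cond).search("asset")))  # True

# ... but "asset" never gets the suffix if its .dic entry lacks the flag.
print(can_attach("set", {"G"}, "G", cond))    # True
print(can_attach("asset", set(), "G", cond))  # False
```

In other words, the filtering happens in the dictionary entries rather than in the condition.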

The problem is that languages aren't perfectly regular. The solution that we've used in the Asturian spell checker at SoftAstur is to keep lists of verbs that form certain suffixes one way or another, and have a script construct the .dic file based on those lists.

So for English, you'd define two separate affixes¹:

SFX Gs Y 3
SFX Gs e ing [^eoy]e
SFX Gs 0 ing [eoy]e
SFX Gs 0 ing [^e]

SFX Gd Y 9
SFX Gd 0 bing [^aeiou][aeiou]b
SFX Gd 0 king [^aeiou][aeiou]c
SFX Gd 0 ding [^aeiou][aeiou]d
SFX Gd 0 ling [^aeiou][aeiou]l   # for British English
SFX Gd 0 ming [^aeiou][aeiou]m
SFX Gd 0 ning [^aeiou][aeiou]n
SFX Gd 0 ping [^aeiou][aeiou]p
SFX Gd 0 ring [^aeiou][aeiou]r
SFX Gd 0 ting [^aeiou][aeiou]t
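
If you want to sanity-check rules like these before loading them into Hunspell, a small Python sketch along the following lines might help. It is only an approximation of how suffix rules are applied: the `RULES` table and `apply_suffix` are names I made up, it stops at the first matching rule (fine here because the conditions within each flag don't overlap), and it treats the condition as a regex matched at the end of the word. Note also that two-character flags such as Gs and Gd need `FLAG long` declared in the .aff file.

```python
import re

# Consonant at the end of the stem -> consonant prepended to "ing" (c doubles as k).
DOUBLING = {"b": "b", "c": "k", "d": "d", "l": "l", "m": "m",
            "n": "n", "p": "p", "r": "r", "t": "t"}

# (strip, add, condition) triples mirroring the Gs and Gd rules above.
RULES = {
    "Gs": [
        ("e", "ing", "[^eoy]e"),  # bake -> baking
        ("", "ing", "[eoy]e"),    # free -> freeing
        ("", "ing", "[^e]"),      # visit -> visiting
    ],
    "Gd": [
        ("", extra + "ing", "[^aeiou][aeiou]" + last)
        for last, extra in DOUBLING.items()
    ],
}

def apply_suffix(word: str, flag: str):
    """Return the suffixed form for the first rule whose condition
    matches the end of the word, or None if no rule applies."""
    for strip, add, condition in RULES[flag]:
        if re.search(f"(?:{condition})$", word) and word.endswith(strip):
            stem = word[: len(word) - len(strip)]
            return stem + add
    return None

for word, flag in [("bake", "Gs"), ("dye", "Gs"), ("visit", "Gs"),
                   ("stop", "Gd"), ("picnic", "Gd"), ("refer", "Gd")]:
    print(word, "->", apply_suffix(word, flag))
# bake -> baking
# dye -> dyeing
# visit -> visiting
# stop -> stopping
# picnic -> picnicking
# refer -> referring
```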

There are still other irregulars like singeing (to contrast with singing) that are uncommon enough that they are probably best coded as separate entries. So your dictionary file would then look more or less like the following:

admit/Gd    --> admitting
bake/Gs     --> baking
commit/Gd   --> committing
free/Gs     --> freeing
dye/Gs      --> dyeing
inherit/Gs  --> inheriting
picnic/Gd   --> picnicking
target/Gs   --> targeting
tiptoe/Gs   --> tiptoeing
travel/Gs   --> traveling  (if American English)
travel/Gd   --> travelling (if British English)
refer/Gd    --> referring
sing/Gs     --> singing
singe
singeing
sob/Gd      --> sobbing
smile/Gs    --> smiling
stop/Gd     --> stopping
tap/Gd      --> tapping
visit/Gs    --> visiting
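
The script that builds the .dic file from the verb lists can be very small. The following Python sketch is not our actual SoftAstur script, just an illustration of the approach; `build_dic` and the three example lists are hypothetical. The only format detail it relies on is that a .dic file starts with a line giving the number of entries, followed by one `word/FLAGS` entry per line.

```python
def build_dic(single, doubled, plain=()):
    """Build the text of a .dic file from hand-maintained word lists.

    single  -- verbs whose gerund does not double the consonant (flag Gs)
    doubled -- verbs whose gerund doubles the final consonant (flag Gd)
    plain   -- entries stored as-is, e.g. irregular forms like "singeing"
    """
    entries = [f"{word}/Gs" for word in single]
    entries += [f"{word}/Gd" for word in doubled]
    entries += list(plain)
    entries.sort()
    # First line of a .dic file is the entry count.
    return "\n".join([str(len(entries))] + entries) + "\n"

# Hypothetical lists; in practice these would be read from files you maintain.
no_doubling = ["bake", "sing", "visit"]
doubling = ["stop", "tap"]
irregular = ["singe", "singeing"]

print(build_dic(no_doubling, doubling, irregular), end="")
```

Keeping the lists separate means that moving a verb such as travel between American and British behavior is a one-line change in a list rather than an edit to the generated dictionary.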

¹ I prefer two-letter tags, as they can be easier to read if you have a word with lots of tags, such that Gd = gerund doubled and Gs = gerund single, or similar. That's probably not a problem for English, but it definitely is for other languages. If you don't have a lot of affixes, you might just go with g (no doubling) and G (doubling).

Well, it seems like this really is the **right** way to do it. And it seems so obvious and simple that it makes me see how stupid my initial question was. Nice intuition behind these two-letter suffix names, by the way. Thanks for answering. I wonder if there is any Hunspell dictionary that uses this suffix format, because the one I'm currently using has only single-consonant suffixes, and whenever there's a word with a doubled-consonant suffix, it treats it as an exception in the `.dic` file. — SimpleV, Jan 18 '15 at 15:30
Hunspell is frustrating for two reasons: it's not as well documented as it could be, and it requires exact input (miscount the affixes by one and it refuses to recognize any of them). I'm strongly considering writing a tutorial based on my experiences, as we've used every feature except compound words. It really needs a rewrite, but its source code is virtually unreadable and not well commented. I may in fact do a rewrite at some point, but it will likely be two years before I have the time. — user0721090601, Jan 18 '15 at 17:03
@SimpleV As for two-letter suffix examples, our Asturian one uses them, and I'd imagine the Hungarian one would too, since Hunspell was designed to add features they needed for Hungarian. Dictionaries for other highly synthetic languages might have them as well. But if you want one two-letter affix, they all have to be two-letter. — user0721090601, Jan 18 '15 at 17:16
