I'm trying to create a CRF model that segments Japanese sentences into words. At the moment I'm not worried about perfect results, as it's just a test. Training goes fine, but once it's finished the model gives the same guess, the same repeating label pattern, for every sentence I try to tag.
"""Labels: X: Character is mid word, S: Character starts a word, E:Character ends a word, O: One character word"""
Sentence:広辞苑や大辞泉には次のようにある。
Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
Truth:['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O', 'O', 'O', 'S', 'E', 'O', 'S', 'E', 'O']
Sentence:他にも、言語にはさまざまな分類がある。
Prediction:['S', 'X', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E', 'S', 'E']
Truth:['O', 'O', 'O', 'O', 'S', 'E', 'O', 'O', 'S', 'X', 'X', 'X', 'E', 'S', 'E', 'O', 'S', 'E', 'O']
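For context, the truth labels come from pre-segmented sentences; my conversion to labels is essentially this (simplified):

def words_to_labels(words):
    """Convert a segmented sentence (list of words) into S/X/E/O labels."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append('O')                     # one-character word
        else:
            # word start, zero or more mid-word characters, word end
            labels.append('S')
            labels.extend('X' * (len(word) - 2))
            labels.append('E')
    return labels

# words_to_labels(['広辞苑', 'や', '大辞泉', 'に', 'は'])
# -> ['S', 'X', 'E', 'O', 'S', 'X', 'E', 'O', 'O']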
Looking at the transition info for the trained model:
{('E', 'E'): -3.820618,
('E', 'O'): 3.414133,
('E', 'S'): 2.817927,
('E', 'X'): -3.056175,
('O', 'E'): -4.249522,
('O', 'O'): 2.583123,
('O', 'S'): 2.601341,
('O', 'X'): -4.322003,
('S', 'E'): 7.05034,
('S', 'O'): -4.817578,
('S', 'S'): -4.400028,
('S', 'X'): 6.104851,
('X', 'E'): 4.985887,
('X', 'O'): -5.141898,
('X', 'S'): -4.499069,
('X', 'X'): 4.749289}
This looks good, since all the transitions with negative values are impossible: E -> X, for example, would mean going from the end of a word to the middle of the following one. S -> E has the highest value, and as seen above the model simply gets into a pattern of labeling S then E repeatedly until the sentence ends. I followed this demo when trying this, though that demo is for segmenting Latin-script text. My features are similarly just n-grams:
['bias',
'char=ま',
'-2-gram=さま',
'-3-gram=はさま',
'-4-gram=にはさま',
'-5-gram=語にはさま',
'-6-gram=言語にはさま',
'2-gram=まざ',
'3-gram=まざま',
'4-gram=まざまな',
'5-gram=まざまな分',
'6-gram=まざまな分類']
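The function that generates these features is essentially the following (simplified sketch; it reproduces the list above):

def char_features(sentence, i, max_n=6):
    """Character and n-gram features for position i, as in the list above."""
    features = ['bias', 'char=' + sentence[i]]
    # backward n-grams: the n characters ending at position i
    for n in range(2, max_n + 1):
        if i - n + 1 >= 0:
            features.append('-%d-gram=%s' % (n, sentence[i - n + 1:i + 1]))
    # forward n-grams: the n characters starting at position i
    for n in range(2, max_n + 1):
        if i + n <= len(sentence):
            features.append('%d-gram=%s' % (n, sentence[i:i + n]))
    return features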
I've tried changing the labels to just S and X (start and other), but this just causes the model to repeat S, X, S, X until it runs out of characters. I've gone up to 6-grams in both directions, which took a lot longer but didn't change anything. I've also tried training for more iterations and changing the L1 and L2 regularization constants a bit. I've trained on up to 100,000 sentences, which is about as far as I can go, since that takes almost all 16 GB of my RAM. Are my features structured wrong? How do I get the model to stop guessing in a pattern, and is that even what's happening? Help would be appreciated, and let me know if I need to add more info to the question.
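For completeness, here's roughly my training and tagging flow, using the helpers above. It's the stock python-crfsuite pattern, and the hyperparameter values shown are illustrative rather than my exact settings:

import pycrfsuite

# tiny illustrative training set: pre-segmented sentences (lists of words)
train_sents = [['広辞苑', 'や', '大辞泉', 'に', 'は'],
               ['言語', 'に', 'は', '分類', 'が', 'ある']]

trainer = pycrfsuite.Trainer(verbose=False)
for words in train_sents:
    sentence = ''.join(words)
    xseq = [char_features(sentence, i) for i in range(len(sentence))]
    yseq = words_to_labels(words)
    trainer.append(xseq, yseq)

# illustrative hyperparameters, not my exact settings
trainer.set_params({'c1': 1.0, 'c2': 1e-3, 'max_iterations': 100})
trainer.train('segmenter.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('segmenter.crfsuite')
test = '広辞苑や大辞泉には次のようにある。'
print(tagger.tag([char_features(test, i) for i in range(len(test))]))
print(tagger.info().transitions)   # the transition weights shown above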