I use StanfordNLP to tokenize a set of messages written on smartphones. These texts have a lot of typos and do not respect the punctuation rules. Blank spaces are very often missing, which affects the tokenization.
For instance, the following sentences are missing the blank space in "California.This" and "university,founded":
Stanford University is located in California.This university is a great university,founded in 1891.
The tokenizer returns:
{"Stanford", "University", "is", "located", "in", "California.This", "university", "is", "a", "great", "university", ",", "founded", "in", "1891", "."}
As you can see, all the tokens are split correctly except "California.This" (I expect three tokens: {"California", ".", "This"}). I took a look at the tokenization rules and observed that the regular expression for words accepts sentence-ending punctuation marks inside a word:
WORD = {LETTER}({LETTER}|{DIGIT})*([.!?]{LETTER}({LETTER}|{DIGIT})*)*
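To check my reading of the rule, I translated it into a plain java.util.regex pattern (an approximation: {LETTER} and {DIGIT} are simplified to ASCII classes here):

import java.util.regex.Pattern;

public class WordRuleDemo {
    public static void main(String[] args) {
        // Approximate java.util.regex version of the JFlex WORD rule above.
        Pattern word = Pattern.compile(
                "[A-Za-z][A-Za-z0-9]*(?:[.!?][A-Za-z][A-Za-z0-9]*)*");
        System.out.println(word.matcher("California.This").matches()); // true
        System.out.println(word.matcher("California").matches());     // true
    }
}

The (?:[.!?]...)* tail is what lets the lexer swallow "California.This" as a single WORD token.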
I removed the last part and recompiled, but the tokenizer's behaviour still does not change.
Does anyone have an idea how to avoid this unwanted behaviour? Or can anyone point me to another tokenizer that works well with these types of texts?
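In the meantime, my workaround is to post-process the token list and split any token that looks like a word glued to the start of the next sentence (a sketch under my own assumptions: the GLUED pattern requires an uppercase letter after the punctuation mark, and it will mis-split genuine abbreviations such as "U.S"):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSplitter {
    // A token like "California.This": letters, one sentence-final mark,
    // then a capitalized word. The uppercase requirement is a heuristic.
    private static final Pattern GLUED =
            Pattern.compile("^(\\p{L}+)([.!?])(\\p{Lu}\\p{L}*)$");

    public static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            Matcher m = GLUED.matcher(tok);
            if (m.matches()) {
                out.add(m.group(1)); // "California"
                out.add(m.group(2)); // "."
                out.add(m.group(3)); // "This"
            } else {
                out.add(tok);
            }
        }
        return out;
    }
}

This fixes the example above, but I would prefer a solution inside the tokenizer itself.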