I use StanfordNLP to tokenize a set of messages written on smartphones. These texts have a lot of typos and do not respect the punctuation rules. Blank spaces are very often missing, which affects the tokenization.
For instance, the following sentences are missing the blank space in "California.This" and "university,founded":
Stanford University is located in California.This university is a great university,founded in 1891.
The tokenizer returns:
{"Stanford", "University", "is", "located", "in", "California.This", "university", "is", "a", "great", "university", ",", "founded", "in", "1891", "."}
As you can see, all the tokens are split correctly except "California.This" (I expect three tokens: {"California", ".", "This"}). I took a look at the tokenization rules and observed that the regular expression for words accepts sentence-ending punctuation marks inside a word:
WORD = {LETTER}({LETTER}|{DIGIT})*([.!?]{LETTER}({LETTER}|{DIGIT})*)*
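To check my reading of the rule, I translated it into a plain java.util.regex pattern (an approximation: {LETTER} and {DIGIT} are simplified to ASCII classes here):

import java.util.regex.Pattern;

public class WordRuleDemo {
    public static void main(String[] args) {
        // Approximate java.util.regex version of the JFlex WORD rule above.
        Pattern word = Pattern.compile(
                "[A-Za-z][A-Za-z0-9]*(?:[.!?][A-Za-z][A-Za-z0-9]*)*");
        System.out.println(word.matcher("California.This").matches()); // true
        System.out.println(word.matcher("California").matches());     // true
    }
}

The (?:[.!?]...)* tail is what lets the lexer swallow "California.This" as a single WORD token.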
I removed the last part and recompiled, but the tokenizer's behaviour still does not change.
Does anyone have an idea how to avoid this unwanted behaviour? Or can anyone point me to another tokenizer that works well with these types of texts?
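In the meantime, my workaround is to post-process the token list and split any token that looks like a word glued to the start of the next sentence (a sketch under my own assumptions: the GLUED pattern requires an uppercase letter after the punctuation mark, and it will mis-split genuine abbreviations such as "U.S"):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSplitter {
    // A token like "California.This": letters, one sentence-final mark,
    // then a capitalized word. The uppercase requirement is a heuristic.
    private static final Pattern GLUED =
            Pattern.compile("^(\\p{L}+)([.!?])(\\p{Lu}\\p{L}*)$");

    public static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            Matcher m = GLUED.matcher(tok);
            if (m.matches()) {
                out.add(m.group(1)); // "California"
                out.add(m.group(2)); // "."
                out.add(m.group(3)); // "This"
            } else {
                out.add(tok);
            }
        }
        return out;
    }
}

This fixes the example above, but I would prefer a solution inside the tokenizer itself.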