A possibly very basic question about NLP best practices.
Does punctuation affect the behaviour of NLTK's Parts-of-Speech tagger? Or is it fine to remove punctuation from a sentence before passing it to the POS tagger?
Typically, punctuation is separated from word tokens before POS tagging, so that each punctuation mark becomes a token in its own right. Punctuation has its own orthographic role, distinct from that of the surrounding word tokens.
For example, tokenize this sentence: Noun verbs.
For PREP
example N
, ,
tokenize V
this PRON
sentence N
: :
Noun N
verbs V
. .
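As a rough illustration of the separation step, here is a minimal sketch using only Python's standard `re` module. The helper name `split_punctuation` is hypothetical, and the regex is a crude stand-in for a real tokenizer (with NLTK you would normally call `nltk.word_tokenize` and then `nltk.pos_tag`); it only shows punctuation ending up as separate tokens.

```python
import re

# Hypothetical helper: a crude stand-in for a real tokenizer.
# Each run of word characters becomes one token, and every other
# non-space character (i.e. punctuation) becomes its own token.
def split_punctuation(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokens = split_punctuation("For example, tokenize this sentence: Noun verbs.")
print(tokens)
# ['For', 'example', ',', 'tokenize', 'this', 'sentence', ':', 'Noun', 'verbs', '.']
```

A real tokenizer handles cases this regex does not (contractions, abbreviations, hyphenated words), so treat this as a sketch of the idea rather than a replacement.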
Whether or not to keep the punctuation past this stage depends on your ultimate goal. For grammatical markup, punctuation does have a grammatical role, and removing it before tagging will typically reduce the quality of the analysis. For sentiment analysis, punctuation typically does not contribute any polarity (though a large number of bangs, i.e. exclamation marks, might convey something like emphasis or strong polarity!!!!!!!).
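If you do decide to drop punctuation, one option is to do it after tagging rather than before, so the tagger still sees the full sentence. The sketch below assumes the tagged output is a list of `(token, tag)` pairs; the tags are the illustrative labels from the example in this answer, not a real tagset, and the helper names `drop_punctuation` and `count_bangs` are hypothetical.

```python
import string

# (token, tag) pairs using the illustrative labels from the example sentence.
tagged = [
    ("For", "PREP"), ("example", "N"), (",", ","), ("tokenize", "V"),
    ("this", "PRON"), ("sentence", "N"), (":", ":"),
    ("Noun", "N"), ("verbs", "V"), (".", "."),
]

def drop_punctuation(tagged_tokens):
    """Keep only pairs whose token has at least one non-punctuation character."""
    return [
        (tok, tag)
        for tok, tag in tagged_tokens
        if not all(ch in string.punctuation for ch in tok)
    ]

def count_bangs(tokens):
    """Count exclamation marks, e.g. as a crude emphasis feature."""
    return sum(tok.count("!") for tok in tokens)

words_only = drop_punctuation(tagged)
print(words_only)
# [('For', 'PREP'), ('example', 'N'), ('tokenize', 'V'), ('this', 'PRON'),
#  ('sentence', 'N'), ('Noun', 'N'), ('verbs', 'V')]
```

Note that `string.punctuation` covers ASCII punctuation only, so curly quotes or other Unicode marks would need separate handling.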