POS tagging before/after punctuation removal?

Question

A possibly very basic question about NLP best practices.

Does punctuation affect the behaviour of NLTK's part-of-speech (POS) tagger? Or is it fine to remove punctuation from a sentence before passing it to the tagger?

Maybe take a look at https://www.kaggle.com/alvations/basic-nlp-with-nltk ? — alvas, Sep 09 '19 at 09:44

score 4 · Accepted Answer · answered Sep 09 '19 at 09:56

Typically, punctuation is split off from the word tokens before POS tagging and then tagged as tokens in its own right. Punctuation has its own orthographic role, distinct from that of the surrounding word tokens.

For example, tokenize this sentence: Noun verbs.

For       PREP
example   N
,         ,
tokenize  V
this      PRON
sentence  N
:         :
Noun      N
verbs     V
.         .
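
The separation step shown above can be sketched with a small regex tokenizer. This is only a stand-in for illustration: `split_punctuation` and `TOKEN_RE` are hypothetical names, and NLTK's own `nltk.word_tokenize` handles many more cases (abbreviations, contractions, quotes) than this pattern does.

```python
import re

# A word (optionally with internal apostrophes or hyphens),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def split_punctuation(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = split_punctuation("For example, tokenize this sentence: Noun verbs.")
print(tokens)
# ['For', 'example', ',', 'tokenize', 'this', 'sentence', ':', 'Noun', 'verbs', '.']
```

The resulting list is the kind of input a tagger expects: one entry per word and one per punctuation mark.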

Whether to keep the punctuation past this stage depends on your ultimate goal. For grammatical markup, punctuation does have a grammatical role, and removing it will typically reduce the quality of the analysis. For sentiment analysis, punctuation typically does not contribute any polarity (though a long run of exclamation marks might convey something like emphasis or strong polarity!!!!!!!).
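
If the punctuation is not needed downstream, one option is to tag the full token sequence first and drop the punctuation afterwards, so the tagger still sees it. A minimal sketch, assuming the tagger returns `(token, tag)` pairs the way `nltk.pos_tag` does; `drop_punctuation` is a hypothetical helper, and the `tagged` list is written by hand from the example above, not real tagger output.

```python
import string

def drop_punctuation(tagged):
    """Keep only (token, tag) pairs whose token has at least one non-punctuation character."""
    return [
        (token, tag)
        for token, tag in tagged
        if not all(ch in string.punctuation for ch in token)
    ]

# Hand-written pairs mirroring the example above (not actual tagger output).
tagged = [
    ("For", "PREP"), ("example", "N"), (",", ","),
    ("tokenize", "V"), ("this", "PRON"), ("sentence", "N"), (":", ":"),
    ("Noun", "N"), ("verbs", "V"), (".", "."),
]

words_only = drop_punctuation(tagged)
print(words_only)
```

Filtering on the token text rather than on the tag keeps the sketch independent of the tagset. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or dashes would need extra handling.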

My application (for now) is to implement textrank for keyword extraction / summarisation. So I guess I can do without punctuation... Thanks! — gmason, Sep 09 '19 at 16:14
