
I have the sample data frame shown below. It has already been tokenized.

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I want to do part of speech tagging on this data frame. Below is the beginning of my code. It is erroring out:

from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer 

train_text = state_union.raw(df['problem_definition_stopwords'])

Error

TypeError: join() argument must be str or bytes, not 'list'

My desired result is below, where 'XXX' is a tokenized word and after it comes its part-of-speech tag (e.g. NNP):

[('XXX', 'NNP'), ('XXX', 'VBD'), ('XXX', 'POS')]

PineNuts0
  • What is your expected output? – BENY Dec 18 '18 at 21:15
    I think you're confused about what `state_union.raw()` is. It is a collection (corpus) of documents of presidential state of the union addresses. You can't "call" it on your dataframe because your dataframe is not a document in the `state_union` corpus – G. Anderson Dec 18 '18 at 21:38
  • oh gosh, you are right! – PineNuts0 Dec 18 '18 at 21:41

1 Answer


If you are trying to tokenize and get the POS tags with pos_tag, convert problem_definition_stopwords to a string and pass it to nltk.sent_tokenize.
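
A minimal sketch of that approach, assuming a pandas DataFrame shaped like the sample in the question (the tag_tokens helper and the pos_tagged column name are just illustrative):

import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads for the tokenizer and tagger models
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

df = pd.DataFrame({
    'category': [2521, 1438, 2698, 2521],
    'problem_definition_stopwords': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

def tag_tokens(tokens):
    # Join the token list back into a string, sentence-tokenize it,
    # then word-tokenize and POS-tag each sentence.
    text = ' '.join(tokens)
    return [nltk.pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]

df['pos_tagged'] = df['problem_definition_stopwords'].apply(tag_tokens)
print(df['pos_tagged'].iloc[0])

Since each row is already a list of tokens, you can also skip the string round trip and apply nltk.pos_tag directly to the column (df['problem_definition_stopwords'].apply(nltk.pos_tag)), which yields the [('XXX', 'NNP'), ...] pairs shown in the question.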

emendez