
I have the sample data frame shown below. It has already been tokenized.

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I want to do part of speech tagging on this data frame. Below is the beginning of my code. It is erroring out:

from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer 

train_text = state_union.raw(df['problem_definition_stopwords'])

Error

TypeError: join() argument must be str or bytes, not 'list'

My desired result is below, where 'XXX' is a tokenized word and after it comes its part-of-speech tag (e.g. NNP):

[('XXX', 'NNP'), ('XXX', 'VBD'), ('XXX', 'POS')]

PineNuts0
  • What is your expected output? – BENY Dec 18 '18 at 21:15
    I think you're confused about what `state_union.raw()` is. It is a collection (corpus) of documents of presidential state of the union addresses. You can't "call" it on your dataframe because your dataframe is not a document in the `state_union` corpus – G. Anderson Dec 18 '18 at 21:38
  • oh gosh, you are right! – PineNuts0 Dec 18 '18 at 21:41

1 Answer


If you are trying to tokenize and get the POS tags with pos_tag, convert problem_definition_stopwords to a string and pass it to nltk.sent_tokenize.
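
A minimal sketch of that approach, assuming a pandas DataFrame shaped like the sample in the question (the tag_tokens helper and the pos_tagged column name are just illustrative):

import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads for the tokenizer and tagger models
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

df = pd.DataFrame({
    'category': [2521, 1438, 2698, 2521],
    'problem_definition_stopwords': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

def tag_tokens(tokens):
    # Join the token list back into a string, sentence-tokenize it,
    # then word-tokenize and POS-tag each sentence.
    text = ' '.join(tokens)
    return [nltk.pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]

df['pos_tagged'] = df['problem_definition_stopwords'].apply(tag_tokens)
print(df['pos_tagged'].iloc[0])

Since each row is already a list of tokens, you can also skip the string round trip and apply nltk.pos_tag directly to the column (df['problem_definition_stopwords'].apply(nltk.pos_tag)), which yields the [('XXX', 'NNP'), ...] pairs shown in the question.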

emendez