from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
dataset['text'] = dataset['text'].apply(lambda word_list: [tokenizer.tokenize(word) for word in word_list])
dataset['text'].head()

The above code raises the following error:

expected string or bytes-like object, got 'list'


1 Answer


Assuming that dataset['text'] contains strings, the tokenizer needs to be applied to each whole string, not to each word within a string. Try making this change in your code:

dataset['text'] = dataset['text'].apply(lambda text: tokenizer.tokenize(text))
dataset['text'].head()

If dataset['text'] itself is a list of lists (where each inner list contains words), then a different approach is needed, for example the one sketched below.
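
As a minimal sketch, assuming each row of dataset['text'] is already a list of word strings (e.g. ['hello', 'world']), you could re-join the words into a single string before tokenizing, so tokenize() receives a str rather than a list:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

# Join the word list back into one string, then tokenize that string.
dataset['text'] = dataset['text'].apply(
    lambda word_list: tokenizer.tokenize(' '.join(word_list))
)
dataset['text'].head()

If each inner word is already a clean token, joining and re-tokenizing may be unnecessary, but it is a safe way to strip punctuation with the \w+ pattern.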