from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
dataset['text'] = dataset['text'].apply(lambda word_list: [tokenizer.tokenize(word) for word in word_list])
dataset['text'].head()

The above code raises the following error:

expected string or bytes-like object, got 'list'


1 Answer


Assuming that dataset['text'] contains strings, the tokenizer needs to be applied to each whole string, not to each word within a string. Try making this change in your code:

dataset['text'] = dataset['text'].apply(lambda text: tokenizer.tokenize(text))
dataset['text'].head()

If dataset['text'] itself is a list of lists (where each inner list contains words), then a different approach is needed, for example the one sketched below.
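
As a minimal sketch, assuming each row of dataset['text'] is already a list of word strings (e.g. ['hello', 'world']), you could re-join the words into a single string before tokenizing, so tokenize() receives a str rather than a list:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

# Join the word list back into one string, then tokenize that string.
dataset['text'] = dataset['text'].apply(
    lambda word_list: tokenizer.tokenize(' '.join(word_list))
)
dataset['text'].head()

If each inner word is already a clean token, joining and re-tokenizing may be unnecessary, but it is a safe way to strip punctuation with the \w+ pattern.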