I have a pandas DataFrame with one column of conversational data. I preprocessed it in the following way:
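from gensim.utils import simple_preprocess

# stop_words is a collection of stop words defined elsewhere
def preprocessing(text):
    return [word for word in simple_preprocess(str(text), min_len = 2, deacc = True) if word not in stop_words]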
dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis = 1)
To make it one-dimensional, I tried both:
processed_docs = data['preprocessed']
as well as:
processed_docs = data['preprocessed'].tolist()
Which now looks as follows:
>>> processed_docs[:2]
0 ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1 ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...
For both cases, I used:
dictionary = gensim.corpora.Dictionary(processed_docs)
However, in both cases I got the error:
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
How can I modify my data, so that I don't get this TypeError?
Given that similar questions have been asked before, I've considered:
Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string
Based on the first answer, I tried the solution of:
dictionary = gensim.corpora.Dictionary([processed_docs.split()])
And got these errors (for the Series and the list version, respectively):
AttributeError: 'Series' object has no attribute 'split'
AttributeError: 'list' object has no attribute 'split'
The second answer says that the input needs to be tokens, which is already the case for my data.
Furthermore, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach described above, which does not work either.
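One thing that may be worth checking: the printed Series shows the tokens with quotes, which suggests that each row could be a string that merely looks like a list (for example after saving to CSV and reloading), rather than an actual list. That is only a guess from the output. The sketch below uses made-up sample data (`raw_docs` and the helper `to_tokens` are hypothetical names, not from my code) to show how the element type could be checked and how such strings could be turned back into token lists with the standard library:

```python
import ast

# Hypothetical sample: each "document" is the string form of a list,
# as can happen after a CSV round trip (an assumption, not confirmed).
raw_docs = [
    "['klinkt', 'alsof', 'zwaar']",
    "['waar', 'liefst', 'meedenk']",
]

def to_tokens(doc):
    """Return a list of tokens, parsing a stringified list when needed."""
    if isinstance(doc, str):
        # literal_eval only parses Python literals; it does not run code
        return ast.literal_eval(doc)
    return list(doc)

token_docs = [to_tokens(doc) for doc in raw_docs]

# Compare the element type before and after the conversion
print(type(raw_docs[0]).__name__)
print(type(token_docs[0]).__name__)
```

If the check shows `str` for the original rows, passing `token_docs` (a list of token lists) to `gensim.corpora.Dictionary` would presumably avoid the TypeError, since each document would then be a sequence of tokens rather than a single string.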