How to input a series/list consisting of different tokens in a Gensim Dictionary?

Question

I have a pandas DataFrame that has one column with conversational data. I preprocessed it as follows:

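from gensim.utils import simple_preprocess

def preprocessing(text):
    # stop_words is assumed to be a collection of stop words defined elsewhere
    return [word for word in simple_preprocess(str(text), min_len = 2, deacc = True) if word not in stop_words]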

dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis = 1)

To make it one-dimensional I used (both):

processed_docs = data['preprocessed']

as well as:

processed_docs = data['preprocessed'].tolist()

Which now looks as follows:

>>> processed_docs[:2]
0    ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1    ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...

For both cases, I used:

dictionary = gensim.corpora.Dictionary(processed_docs)

However, in both cases I got the error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

How can I modify my data, so that I don't get this TypeError?

Given that similar questions have been asked before, I've considered:

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Based on the first answer, I tried the solution of:

dictionary = gensim.corpora.Dictionary([processed_docs.split()])

And got the error(/s):

AttributeError: 'Series'('List') object has no attribute 'split'

And in the second answer someone says that the input needs to be tokens, which already holds for me.

Furthermore, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach as I described above, which does not work either.

I found the problem. Apparently there were empty fields in my series/list object. The following code solved my problem: `processed_docs = processed_docs.dropna(axis = 'rows')` — Emil, Apr 25 '19 at 11:53
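
A minimal sketch of how such entries could be located before building the dictionary, assuming the documents can be iterated as plain Python objects. The helper name `find_invalid_docs` and the sample data are hypothetical and not part of gensim or pandas; the helper only reports positions whose value is not a list of strings (for example `None`, a float `NaN`, or a single untokenized string).

```python
def find_invalid_docs(docs):
    """Return (position, value) pairs for entries that are not lists of str tokens."""
    invalid = []
    for position, doc in enumerate(docs):
        is_token_list = isinstance(doc, (list, tuple)) and all(
            isinstance(token, str) for token in doc
        )
        if not is_token_list:
            invalid.append((position, doc))
    return invalid


# Hypothetical sample data: two valid documents, one missing value, one raw string.
sample_docs = [
    ["klinkt", "alsof", "zwaar"],
    float("nan"),
    ["waar", "liefst"],
    "not tokenized",
]

bad = find_invalid_docs(sample_docs)
bad_positions = {position for position, _ in bad}

# Keep only the entries that passed the check.
clean_docs = [
    doc for position, doc in enumerate(sample_docs) if position not in bad_positions
]
```

If `bad` is non-empty, those rows can be dropped (as with `dropna` above) or repaired, and `clean_docs` can then be passed to `gensim.corpora.Dictionary`.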

Djensonsan · Answer 1 · 2020-11-26T17:30:41.383

2

This question was posted a long time ago, but for anyone still wondering: a pandas column can end up holding lists as strings (for example after it has been written to CSV and read back), hence the TypeError. One way of interpreting such a string as a list is:

from ast import literal_eval

And then:

dictionary = gensim.corpora.Dictionary()
for doc in processed_docs:
  dictionary.add_documents([literal_eval(doc)])
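
To illustrate how a token list can turn into a string, here is a small sketch that uses only the standard-library `csv` module as a stand-in for a CSV round trip. It assumes the token lists were written to text and read back, which the question does not confirm. `parse_doc` is a hypothetical helper, not a gensim or pandas function.

```python
import csv
import io
from ast import literal_eval


def parse_doc(value):
    """Return a token list, parsing the value first if it is a stringified list."""
    if isinstance(value, str):
        value = literal_eval(value)
    return value


# Write one token list to an in-memory CSV file, then read it back.
buffer = io.StringIO()
csv.writer(buffer).writerow([["klinkt", "alsof", "zwaar"]])
buffer.seek(0)
stored = next(csv.reader(buffer))[0]  # comes back as a str, not a list

restored = parse_doc(stored)  # a real list of str tokens again
```

Because `parse_doc` leaves real lists untouched, it can be applied to every document regardless of whether the data went through such a round trip.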

edited Nov 26 '20 at 17:30

answered Nov 26 '20 at 13:09

Djensonsan


score 1 · Accepted Answer · answered Apr 24 '19 at 15:42

I think you need:

dictionary = gensim.corpora.Dictionary([processed_docs[:]])

This slice covers the whole collection. You can write [2:] to start at index 2 and continue to the end, [:7] to start at 0 and stop before index 7, or [2:7] to combine both. You can also try [:len(processed_docs)]
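
For reference, a small sketch of what these slice expressions return on a plain Python list; the sample `docs` is made up for illustration. It also shows that wrapping a slice in an extra pair of brackets adds a level of nesting, which may matter because `Dictionary` expects an iterable of token lists.

```python
docs = [["klinkt", "alsof"], ["waar", "liefst"], ["dingen", "spelen"]]

full_copy = docs[:]    # shallow copy containing all three documents
from_one = docs[1:]    # documents at positions 1 and 2
first_two = docs[:2]   # documents at positions 0 and 1 (position 2 is excluded)

# Wrapping a slice in another list adds one level of nesting:
# the outer list holds a single element, which is the whole list of documents.
wrapped = [docs[:]]
```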

I hope this helps :)

answered Apr 24 '19 at 15:42

Sara


Thanks for thinking with me, but your suggestion still results in the same error unfortunately – Emil Apr 25 '19 at 09:22
