I'm trying to build a TF-IDF model that can score bigrams as well as unigrams using gensim. To do this, I build a gensim dictionary and then use it to create the bag-of-words representations of the corpus from which I build the model.
The step to build the dictionary looks like this:
dictionary = gensim.corpora.Dictionary(tokens)
where tokens is a list of unigrams and bigrams like this:
[('restore',),
('diversification',),
('made',),
('transport',),
('The',),
('grass',),
('But',),
('distinguished', 'newspaper'),
('came', 'well'),
('produced',),
('car',),
('decided',),
('sudden', 'movement'),
('looking', 'glasses'),
('shapes', 'replaced'),
('beauties',),
('put',),
('college', 'days'),
('January',),
('sometimes', 'gives')]
However, when I provide a list such as this to gensim.corpora.Dictionary(), the algorithm reduces all tokens to unigrams, e.g.:
test = gensim.corpora.Dictionary([(('happy', 'dog'))])
[test[id] for id in test]
=> ['dog', 'happy']
Is there a way to generate a dictionary with gensim that includes bigrams?
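For context, gensim.corpora.Dictionary expects an iterable of documents, each of which is an iterable of string tokens, so a tuple such as ('happy', 'dog') is read as a document with two tokens. Below is a minimal sketch of one possible workaround, assuming that joined strings such as 'happy_dog' are acceptable as tokens. The helper name ngrams_to_document and the underscore separator are my own illustrative choices, not part of gensim.

```python
def ngrams_to_document(ngrams, sep="_"):
    """Join each n-gram tuple into a single string token (hypothetical helper)."""
    return [sep.join(gram) for gram in ngrams]


# A shortened version of the token list from the question.
tokens = [("restore",), ("distinguished", "newspaper"), ("came", "well"), ("car",)]

# One document whose tokens are plain strings; bigrams become e.g. "came_well".
document = ngrams_to_document(tokens)

# Wrap the document in an outer list, since a corpus is a list of documents.
documents = [document]

print(documents)
```

Passing documents (rather than the raw list of tuples) to gensim.corpora.Dictionary should then keep each joined bigram as a single dictionary entry, though I have not confirmed this against every gensim version.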