To use the latent semantic indexing (LSI) method from gensim, I want to begin with a small, classic example like this:

import logging, gensim, bz2
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
etc..

My question is: how do I get the corpus iterator 'wiki_en_tfidf.mm'? Must I download it from somewhere? I have searched the Internet but did not find anything. Any help would be appreciated.

Arij SEDIRI

1 Answer

The first page of search results includes a link to:

https://radimrehurek.com/gensim/wiki.html

which says "First let’s load the corpus iterator and dictionary, created in the second step above."

Step 2 is:

  2. Convert the articles to plain text (process Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don’t even need to uncompress the whole archive to disk. There is a script included in gensim that does just that, run:

    $ python -m gensim.scripts.make_wiki

Tom Morris