Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation, and Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document can be succinctly expressed in the new semantic representation and queried for topical similarity against other documents.
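The idea of expressing documents as vectors and querying them for similarity can be sketched in plain Python. This is a simplified stand-in, not gensim's implementation: it uses raw bag-of-words counts and cosine similarity instead of LSA or LDA topics, and the helper names (`bow`, `cosine`, `rank`) and the tiny corpus are assumptions made for illustration only.

```python
import math
from collections import Counter

def bow(text):
    """Lowercase, split on whitespace, and count tokens (bag-of-words)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy corpus; a real one would be far larger.
corpus = [
    "human machine interface for lab computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors and trees",
]
vectors = [bow(doc) for doc in corpus]

def rank(query):
    """Return (document index, similarity) pairs, most similar first."""
    q = bow(query)
    scores = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank("human computer interaction")
for index, score in ranking:
    print(index, round(score, 3))
```

A topic model would replace the raw counts with a lower-dimensional representation learned from the corpus, so that documents sharing no literal words could still score as similar; the query step would look much the same.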


2433 questions
19 votes, 4 answers

Getting "__init__() got an unexpected keyword argument 'document'" error in Python while working with Word2Vec and gensim

I'm working on a project using Word2vec and gensim: model = gensim.models.Word2Vec( documents = 'userDataFile.txt', size=150, window=10, min_count=2, workers=10) model =…
dubooduboo
19 votes, 3 answers

In spacy, how to use your own word2vec model created in gensim?

I have trained my own word2vec model in gensim and I am trying to load that model in spacy. First, I need to save it to disk and then load it as an init-model in spacy, but I am unable to figure out exactly…
Subigya Upadhyay
19 votes, 1 answer

Why are multiple model files created in gensim word2vec?

When I try to create a word2vec model (skip-gram with negative sampling), I receive three output files: word2vec (File), word2vec.syn1neg.npy (NPY file), and word2vec.wv.syn0.npy (NPY file). I am just worried why this happens as for my previous…
user8871463
19 votes, 4 answers

LDA model generates different topics every time I train on the same corpus

I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate…
alvas
18 votes, 5 answers

How to remove a word completely from a Word2Vec model in gensim?

Given a model, e.g. from gensim.models.word2vec import Word2Vec documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management…
alvas
18 votes, 4 answers

gensim word2vec accessing in/out vectors

In the word2vec model, there are two linear transforms that take a word in vocab space to a hidden layer (the "in" vector), and then back to the vocab space (the "out" vector). Usually this out vector is discarded after training. I'm wondering if…
Alex R.
17 votes, 2 answers

Is there a pre-trained doc2vec model?

Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?
Idriss Brahimi
17 votes, 3 answers

Get bigrams and trigrams in word2vec Gensim

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list of words # #NLTK tokenizer to split the…
user8566323
17 votes, 3 answers

How to use TaggedDocument in gensim?

I have two directories from which I want to read their text files and label them, but I don't know how to do this via TaggedDocument. I thought it would work as TaggedDocument([Strings],[Labels]) but this doesn't work apparently. This is my code:…
Farhood
17 votes, 2 answers

get_document_topics and get_term_topics in gensim

The ldamodel in gensim has the two methods: get_document_topics and get_term_topics. Despite their use in this gensim tutorial notebook, I do not fully understand how to interpret the output of get_term_topics and created the self-contained code…
tkja
17 votes, 2 answers

Gensim train word2vec on wikipedia - preprocessing and parameters

I am trying to train the word2vec model from gensim using the Italian Wikipedia "http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2". However, I am not sure what the best preprocessing for this corpus is. gensim model…
Luca Fiaschi
17 votes, 2 answers

Document topical distribution in Gensim LDA

I've derived an LDA topic model using a toy corpus as follows: documents = ['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface…
Moses Xu
16 votes, 2 answers

Using word2vec to classify words in categories

BACKGROUND I have vectors with some sample data and each vector has a category name (Places,Colors,Names). ['john','jay','dan','nathan','bob'] -> 'Names' ['yellow', 'red','green'] -> 'Colors' ['tokyo','bejing','washington','mumbai'] -> 'Places' My…
Dinero
16 votes, 2 answers

Using a Word2Vec model pre-trained on Wikipedia

I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec module that's pre-trained on the English Wikipedia corpus. Does anyone know where to download it, how to install it, and how to use…
Boris
16 votes, 2 answers

Chunkize warning while installing gensim

I have installed gensim (through pip) in Python. After the installation is over I get the following warning: C:\Python27\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial …
user7420652