Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.
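As a rough, library-independent sketch of the "query for similarity" step described above: the example below does not use gensim and substitutes plain bag-of-words counts for the topic models named earlier (LSA, LDA), so it illustrates only how documents in a vector representation can be ranked against a query. The toy corpus and the helper names `bow`, `cosine` and `rank` are made up for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Lowercase, split on whitespace, and count tokens (a bag-of-words vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus; a real one would be much larger.
corpus = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors and trees",
]
vectors = [bow(doc) for doc in corpus]


def rank(query):
    """Return (document index, similarity) pairs, most similar first."""
    q = bow(query)
    scores = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank("computer interface")
for index, score in ranking:
    print(index, round(score, 3))
```

A topic model would replace the raw counts with a lower-dimensional representation, but the ranking step would look much the same.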

2433 questions
7
votes
4 answers

TypeError: 'Word2Vec' object is not subscriptable

I am trying to build a Word2vec model, but when I try to reshape the vector for tokens, I get this error. Any ideas? wordvec_arrays = np.zeros((len(tokenized_tweet), 100)) for i in range(len(tokenized_tweet)): wordvec_arrays[i,:] =…
Nishant Kashyap
  • 73
  • 1
  • 1
  • 3
7
votes
2 answers

pyLDAvis visualization from gensim not displaying the result in google colab

import pyLDAvis.gensim # Visualize the topics pyLDAvis.enable_notebook() vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word) vis The above code displayed the visualization of the LDA model in Google Colab, but then after reopening the notebook it…
Ravi Prajapati
  • 71
  • 1
  • 1
  • 4
7
votes
4 answers

Is it possible to do sentiment analysis of unlabelled text using word2vec model?

I have some text data for which I need to do sentiment classification. I don't have positive or negative labels on this data (unlabelled). I want to use the Gensim word2vec model for sentiment classification. Is it possible to do this? Because till…
Piyush Ghasiya
  • 515
  • 7
  • 25
7
votes
3 answers

How do you save a model, dictionary and corpus to disk in Gensim, and then load them again?

In Gensim's documentation, it says: You can save trained models to disk and later load them back, either to continue training on new training documents or to transform new documents. I would like to do this with a dictionary, corpus and tf.idf…
Data
  • 689
  • 7
  • 23
7
votes
2 answers

Word2vec Gensim Accuracy Analysis

I'm working on an NLP application, where I have a corpus of text files. I would like to create word vectors using the Gensim word2vec algorithm. I did a 90% training and 10% testing split. I trained the model on the appropriate set, but I would like…
Sam
  • 641
  • 1
  • 7
  • 17
7
votes
1 answer

What does epochs mean in Doc2Vec and train when I have to manually run the iteration?

I am trying to understand the epochs parameter in the Doc2Vec function and epochs parameter in the train function. In the following code snippet, I manually set up a loop of 4000 iterations. Is it required or passing 4000 as epochs parameter in the…
Suhail Gupta
  • 22,386
  • 64
  • 200
  • 328
7
votes
3 answers

pyLDAvis with Mallet LDA implementation : LdaMallet object has no attribute 'inference'

Is it possible to plot a pyLDAvis visualization with a Mallet implementation of LDA? I have no trouble with LDA_Model, but when I use Mallet I get: 'LdaMallet' object has no attribute 'inference' My code: pyLDAvis.enable_notebook() vis =…
Saguaro
  • 233
  • 3
  • 12
7
votes
1 answer

Pipeline and GridSearch for Doc2Vec

I currently have the following script that helps to find the best model for a doc2vec model. It works like this: First train a few models based on given parameters and then test against a classifier. Finally, it outputs the best model and classifier (I…
Christopher
  • 2,120
  • 7
  • 31
  • 58
7
votes
2 answers

Applying word2vec to find all words above a similarity threshold

The command model.most_similar(positive=['france'], topn=100) gives the top 100 most similar words to "france". However, I would like to know if there is a method which will output the most similar words above a similarity threshold to a given word.…
sss90
  • 125
  • 1
  • 1
  • 6
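One library-independent way to picture the threshold idea in this question is to compute the cosine similarity between the query word and every other word vector, then keep only the scores at or above the cutoff. This is a minimal sketch that does not call gensim; the two-dimensional toy vectors and the helper name `similar_above` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similar_above(word, vectors, threshold):
    """Return (word, similarity) pairs scoring at least `threshold`, best first."""
    target = vectors[word]
    hits = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    kept = [hit for hit in hits if hit[1] >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy vectors; real word vectors have many more dimensions.
toy_vectors = {
    "france": [1.0, 0.0],
    "spain": [0.9, 0.1],
    "paris": [0.6, 0.8],
    "banana": [0.0, 1.0],
}

result = similar_above("france", toy_vectors, threshold=0.5)
print(result)
```

Unlike a fixed `topn`, the number of results here depends on how many words clear the cutoff.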
7
votes
6 answers

Does gensim.corpora.Dictionary have term frequency saved?

Does gensim.corpora.Dictionary have term frequency saved? From gensim.corpora.Dictionary, it's possible to get the document frequency of the words (i.e. how many documents a particular word occurred in): from nltk.corpus import brown from…
alvas
  • 115,346
  • 109
  • 446
  • 738
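The distinction this question draws between document frequency and term frequency can be shown without gensim. The sketch below is plain Python and says nothing about what `Dictionary` itself stores; the tokenized toy documents are made up.

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
]

term_freq = Counter()  # total occurrences of each token across the corpus
doc_freq = Counter()   # number of documents that contain each token

for tokens in docs:
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # set() so each document counts once per token

print(term_freq["the"], doc_freq["the"])
```

Here "the" occurs three times in total but appears in only two documents, so the two counts differ.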
7
votes
3 answers

TypeError: Object of type 'complex' is not JSON serializable while using pyLDAvis.display() function

I have a document-term matrix with nine documents. I am running the code below: import pyLDAvis.gensim topicData = pyLDAvis.gensim.prepare(ldamodel, docTermMatrix, dictionary) pyLDAvis.display(topicData) I am getting the below error when…
Gaurav Pandey
  • 71
  • 1
  • 4
7
votes
1 answer

C extension not loaded for Word2Vec

I reinstalled the gensim package and Cython, but it continuously shows this warning. Does anybody know about this? I am using Python 3.6, PyCharm, Linux Mint. UserWarning: C extension not loaded for Word2Vec, training will be slow. Install a C compiler and…
user8349292
7
votes
1 answer

What is different between doc2vec models when the dbow_words is set to 1 or 0?

I read this page, but I do not understand the difference between the models built with the following code. I know that when dbow_words is 0, training of doc-vectors is faster. First model model = doc2vec.Doc2Vec(documents1, size = 100,…
user3092781
  • 313
  • 2
  • 16
7
votes
2 answers

Python Gensim how to make WMD similarity run faster with multiprocessing

I am trying to run gensim WMD similarity faster. Typically, this is what is in the docs: Example corpus: my_corpus = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system…
jxn
  • 7,685
  • 28
  • 90
  • 172
7
votes
1 answer

How can I access output embedding(output vector) in gensim word2vec?

I want to use the output embedding of word2vec, as in this paper (Improving document ranking with dual word embeddings). I know input vectors are in syn0, and output vectors are in syn1, or in syn1neg if negative sampling is used. But when I calculated…
Suin SEO
  • 83
  • 1
  • 6