Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.

2433 questions
7 votes · 2 answers

Troubleshooting tips for clustering word2vec output with DBSCAN

I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of…
Ian • 2,078
7 votes · 1 answer

gensim Getting Started Error: No such file or directory: 'text8'

I am learning about the word2vec and GloVe models in Python, so I am going through the tutorial available here. After I ran this code step by step in IDLE3: >>>from gensim.models import word2vec >>>import logging >>>logging.basicConfig(format='%(asctime)s…
user7399214
7 votes · 3 answers

Is there any way to get the vocabulary size from doc2vec model?

I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) then this won't be efficient.
Rashmi Singh • 519
7 votes · 2 answers

Understanding the output of Doc2Vec from Gensim package

I have some sample sentences that I want to run through a Doc2Vec model. My end goal is a matrix of size (num_sentences, num_features). I'm using the Gensim package. from gensim.models.doc2vec import TaggedDocument from gensim.models import…
7 votes · 1 answer

Using LDA (topic model): the distribution of each topic over words is similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model that finds latent variables (topics) underlying a bunch of documents. I'm using the Python gensim package and having two problems: I printed out the most frequent words for each topic (I tried 10, 20, 50…
Ruby • 284
7 votes · 2 answers

NLTK - Automatically translating similar words

Big picture goal: I am making an LDA model of product reviews in Python using NLTK and Gensim. I want to run this on varying n-grams. Problem: Everything is great with unigrams, but when I run with bigrams, I start to get topics with repeated…
user2979931 • 101
7 votes · 1 answer

How do you initialize a gensim corpus variable with a csr_matrix?

I have X as a csr_matrix that I obtained using scikit-learn's tfidf vectorizer, and y, which is an array. My plan is to create features using LDA; however, I failed to find how to initialize a gensim corpus variable with X as a csr_matrix. In other words,…
IssamLaradji • 6,637
7 votes · 2 answers

Finding topics of an unseen document via Gensim

I am using Gensim to do some large-scale topic modeling. I am having difficulty understanding how to determine predicted topics for an unseen (non-indexed) document. For example: I have 25 million documents which I have converted to vectors in LSA…
Peter Kirby • 1,915
6 votes · 1 answer

Using BERT to generate similar word or synonyms through word embeddings

As we all know, the BERT model's word embeddings are powerful, probably better than word2vec and other models. I want to create a model on BERT word embeddings to generate synonyms or similar words, the same as we do in Gensim…
DevPy • 439
6 votes · 1 answer

How to load large dataset to gensim word2vec model

So I have multiple text files (around 40), and each file has around 2000 articles (average of 500 words each). Each document is a single line in the text file. Because of memory limitations, I wanted to use dynamic loading of these text…
little JJ • 71
6 votes · 0 answers

Gensim LDA gives a negative log-perplexity value - is it normal and how can I interpret it?

I am currently using Gensim LDA for topic modeling. While tuning hyper-parameters I found that the model always gives a negative log-perplexity. Is it normal for the model to behave like this (is it even possible)? If it is, is a smaller perplexity…
nowheretogo • 125
6 votes · 1 answer

Using Gensim Fasttext model with LSTM nn in keras

I have trained a fasttext model with Gensim over a corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my training corpus, i.e. some of the words in my corpus are like "Oxytocin", "Lexitocin",…
Latent • 556
6 votes · 1 answer

gensim - fasttext - Why `load_facebook_vectors` doesn't work?

I've tried to load pre-trained FastText vectors from fastText - wiki word vectors. My code is below, and it works well. from gensim.models import FastText model = FastText.load_fasttext_format('./wiki.en/wiki.en.bin') but the warning message is a…
frhyme • 966
6 votes · 1 answer

Training time of gensim word2vec

I'm training word2vec from scratch on a 34 GB pre-processed MS_MARCO corpus (of 22 GB). (The preprocessed corpus is sentencepiece-tokenized, hence its larger size.) I'm training my word2vec model using the following code: from gensim.test.utils import…
Ruchit Patel • 733
6 votes · 2 answers

How do I calculate the coherence score of an sklearn LDA model?

Here, best_model_lda is an sklearn-based LDA model and we are trying to find a coherence score for this model. coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v') coherence_lda =…
Arvind Sudheer • 113