Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.

2433 questions
7 votes · 2 answers

Troubleshooting tips for clustering word2vec output with DBSCAN

I'm analyzing a corpus of roughly 2M raw words. I build a model using gensim's word2vec, embed the vectors using sklearn TSNE, and cluster the vectors (from word2vec, not TSNE) using sklearn DBSCAN. The TSNE output looks about right: the layout of…
Ian • 2,078
7 votes · 1 answer

gensim Getting Started Error: No such file or directory: 'text8'

I am learning about the word2vec and GloVe models in Python, so I am going through the tutorial available here. After I ran this code step by step in IDLE3: >>>from gensim.models import word2vec >>>import logging >>>logging.basicConfig(format='%(asctime)s…
user7399214
7 votes · 3 answers

Is there any way to get the vocabulary size from doc2vec model?

I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) then this won't be efficient.
Rashmi Singh • 519
7 votes · 2 answers

Understanding the output of Doc2Vec from Gensim package

I have some sample sentences that I want to run through a Doc2Vec model. My end goal is a matrix of size (num_sentences, num_features). I'm using the Gensim package. from gensim.models.doc2vec import TaggedDocument from gensim.models import…
7 votes · 1 answer

Using LDA (topic model): the distribution of each topic over words is similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model that finds latent variables (topics) underlying a bunch of documents. I'm using the Python gensim package and having two problems: I printed out the most frequent words for each topic (I tried 10, 20, 50…
Ruby • 284
7 votes · 2 answers

NLTK - Automatically translating similar words

Big picture goal: I am making an LDA model of product reviews in Python using NLTK and Gensim. I want to run this on varying n-grams. Problem: Everything is great with unigrams, but when I run with bigrams, I start to get topics with repeated…
user2979931 • 101
7 votes · 1 answer

How do you initialize a gensim corpus variable with a csr_matrix?

I have X as a csr_matrix that I obtained using scikit-learn's tfidf vectorizer, and y, which is an array. My plan is to create features using LDA; however, I failed to find how to initialize a gensim corpus variable with X as a csr_matrix. In other words,…
IssamLaradji • 6,637
7 votes · 2 answers

Finding topics of an unseen document via Gensim

I am using Gensim to do some large-scale topic modeling. I am having difficulty understanding how to determine predicted topics for an unseen (non-indexed) document. For example: I have 25 million documents which I have converted to vectors in LSA…
Peter Kirby • 1,915
6 votes · 1 answer

Using BERT to generate similar word or synonyms through word embeddings

As we all know, the BERT model's word embeddings are powerful, probably better than word2vec and other models. I want to create a model on BERT word embeddings to generate synonyms or similar words, the same as we do in Gensim…
DevPy • 439
6 votes · 1 answer

How to load large dataset to gensim word2vec model

So I have multiple text files (around 40), and each file has around 2000 articles (average of 500 words each). Each document is a single line in the text file. Because of memory limitations, I wanted to use dynamic loading of these text…
little JJ • 71
6 votes · 0 answers

Gensim LDA gives a negative log-perplexity value - is it normal and how can I interpret it?

I am currently using Gensim LDA for topic modeling. While tuning hyper-parameters I found that the model always gives a negative log-perplexity. Is it normal for the model to behave like this (is it even possible)? If it is, is a smaller perplexity…
nowheretogo • 125
6 votes · 1 answer

Using Gensim Fasttext model with LSTM nn in keras

I have trained a fasttext model with Gensim over a corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my training corpus, i.e. some of the words in my corpus are like "Oxytocin", "Lexitocin",…
Latent • 556
6 votes · 1 answer

gensim - fasttext - Why `load_facebook_vectors` doesn't work?

I've tried to load pre-trained FastText vectors from fastText - wiki word vectors. My code is below, and it works well. from gensim.models import FastText model = FastText.load_fasttext_format('./wiki.en/wiki.en.bin') but the warning message is a…
frhyme • 966
6 votes · 1 answer

Training time of gensim word2vec

I'm training word2vec from scratch on a 34 GB pre-processed MS_MARCO corpus (of 22 GB). (The preprocessed corpus is sentencepiece-tokenized, hence its larger size.) I'm training my word2vec model using the following code: from gensim.test.utils import…
Ruchit Patel • 733
6 votes · 2 answers

How do I calculate the coherence score of an sklearn LDA model?

Here, best_model_lda is an sklearn-based LDA model and we are trying to find a coherence score for this model. coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v') coherence_lda =…
Arvind Sudheer • 113