Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim is built to process raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents.
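A rough sketch of what such a similarity query looks like. This is not Gensim's API or implementation: it swaps in a plain bag-of-words vector and cosine similarity, using only the Python standard library, in place of a learned semantic representation such as LSA or LDA. The function names (`bow`, `cosine`, `query`) and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Turn a plain-text document into a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def query(corpus_vectors, text):
    """Rank corpus documents by similarity to `text`, best match first."""
    q = bow(text)
    scores = [(doc_id, cosine(q, vec)) for doc_id, vec in enumerate(corpus_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "human machine interface for computer applications",
    "graph of trees and paths",
    "user survey of computer system response time",
]
vectors = [bow(doc) for doc in corpus]
ranking = query(vectors, "computer interface")
```

`ranking` holds `(document_id, similarity)` pairs sorted best-first; with this toy corpus, document 0 ranks first because it shares both query words.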

2433 questions
10
votes
2 answers

Word2vec training using gensim starts swapping after 100K sentences

I'm trying to train a word2vec model using a file with about 170K lines, with one sentence per line. I think I may represent a special use case because the "sentences" have arbitrary strings rather than dictionary words. Each sentence (line) has…
Felipe
  • 11,557
  • 7
  • 56
  • 103
10
votes
2 answers

Retrieve string version of document by ID in Gensim

I am using Gensim for some topic modelling and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back the set of IDs and similarities, eg. (299501, 0.64505910873413086). How do I get the text…
jisaw
  • 200
  • 1
  • 7
9
votes
3 answers

Not able to import from `gensim.summarization` module in Django

I have included these 2 import statements in my views.py: `from gensim.summarization.summarizer import summarizer` and `from gensim.summarization import keywords`. However, even after I installed gensim using pip, I am getting the error: ModuleNotFoundError:…
Alpha
  • 237
  • 1
  • 2
  • 7
9
votes
1 answer

How to get the document_topics distribution of all of the documents in gensim LDA?

I'm new to Python and I need to build an LDA project. After some preprocessing steps, here is my code: dictionary = Dictionary(docs) corpus = [dictionary.doc2bow(doc) for doc in docs] from gensim.models import LdaModel num_topics =…
9
votes
2 answers

Why does Doc2vec give 2 different vectors for the same text?

I am using Doc2vec to get vectors from words. Please see my code below: from gensim.models.doc2vec import TaggedDocument f = open('test.txt','r') trainings = [TaggedDocument(words = data.strip().split(","),tags = [i]) for i,data in…
Thanh Bui
  • 103
  • 5
9
votes
1 answer

Difference between most_similar and similar_by_vector in gensim word2vec?

I was confused with the results of most_similar and similar_by_vector from gensim's Word2vecKeyedVectors. They are supposed to calculate cosine similarities in the same way - however: Running them with one word gives identical results, for…
peidaqi
  • 673
  • 1
  • 7
  • 18
9
votes
1 answer

Gensim: how to load precomputed word vectors from text file

I have a text file with my precomputed word vectors in the following format (example): 'word -0.0762464299711 0.0128308048976 ... 0.0712385589283\n' on each line for every word (with 297 extra floats in place of the ...). I am trying to load these…
iloveseals
  • 93
  • 1
  • 4
9
votes
1 answer

UnpicklingError: invalid load key, '3'

I am creating a chatbot, so I need a word2vec file in binary format. When I load the bin file, I get this error: import gensim model = gensim.models.Word2Vec.load('GoogleNews-vectors-negative300.bin') Traceback (most recent…
surya
  • 159
  • 2
  • 9
9
votes
1 answer

Improving Gensim Doc2vec results

I tried to apply doc2vec to 600000 rows of sentences. Code as below: from gensim import models model = models.Doc2Vec(alpha=.025, min_alpha=.025, min_count=1, workers = 5) model.build_vocab(res) token_count = sum([len(sentence) for sentence in…
Hackerds
  • 1,195
  • 2
  • 16
  • 34
9
votes
2 answers

Python: What is the "size" parameter in Gensim Word2vec model class

I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the Gensim documentation, size is the dimensionality of the vector. Now, as far as my knowledge goes, word2vec creates a vector of the probability of…
Krishnang K Dalal
  • 2,322
  • 9
  • 34
  • 55
9
votes
3 answers

gensim.interfaces.TransformedCorpus - How to use?

I'm relatively new to the world of Latent Dirichlet Allocation. I am able to generate an LDA model following the Wikipedia tutorial and I'm able to generate an LDA model with my own documents. My next step is to try to understand how I can use a previous…
Marco Oliveira
  • 167
  • 1
  • 10
9
votes
1 answer

Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors

I'm training a Word2Vec model like: model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1) and Doc2Vec model like: doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4,…
ScientiaEtVeritas
  • 5,158
  • 4
  • 41
  • 59
9
votes
1 answer

Docker unable to install numpy, scipy, or gensim

I am trying to build a Docker application that uses Python's gensim library, version 2.1.0, which is being installed via pip from a requirements.txt file. However, Docker seems to have trouble installing numpy, scipy, and gensim. I googled the error…
Shuklaswag
  • 1,003
  • 1
  • 10
  • 27
9
votes
2 answers

Reduce Google's Word2Vec model with Gensim

Loading the complete pre-trained word2vec model by Google is time intensive and tedious, therefore I was wondering if there is a chance to remove words below a certain frequency to bring the vocab count down to e.g. 200k words. I found Word2Vec…
neurix
  • 4,126
  • 6
  • 46
  • 71
9
votes
1 answer

doc2vec: How is PV-DBOW implemented

I know that there already exists an implementation of PV-DBOW (paragraph vector) in Python (gensim). But I'm interested in knowing how to implement it myself. The explanation from the official paper for PV-DBOW is as follows: Another way is to…
саша
  • 521
  • 5
  • 20