Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary; you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.
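
As a rough illustration of querying documents for similarity, here is a minimal sketch in plain Python. It is not gensim's implementation and uses no topic model: it substitutes simple bag-of-words counts for the learned semantic representation and ranks documents by cosine similarity. The function names and the toy corpus are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (index, score) pairs sorted from most to least similar."""
    query_vec = bag_of_words(query)
    scores = [(i, cosine_similarity(query_vec, bag_of_words(doc)))
              for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
corpus = [
    "human machine interface for computer applications",
    "graph of trees and minors",
    "user interface response time for the computer system",
]
ranking = rank_by_similarity("computer interface", corpus)
print(ranking)
```

A topic model such as LSA or LDA would replace the raw word counts with a lower-dimensional vector per document, so that documents sharing no literal words could still score as similar; the ranking step stays the same in spirit.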

2433 questions
6
votes
7 answers

Getting an error to install pyemd even though I just installed it

Here is the code: from pyemd import emd print("sentence 1:") print(input_document_lower[0]) print("sentence 2:") print(input_document_lower[1]) print("similarity:") model_w2v.wmdistance(input_document_lower[0], input_document_lower[1]) Here's the…
madsthaks
  • 2,091
  • 6
  • 25
  • 46
6
votes
1 answer

Is it possible to use gensim word2vec model in deeplearning4j.word2vec?

I'm new to deeplearning4j, and I want to make a sentence classifier using word vectors as input for the classifier. I was using Python before, where the vector model was generated using gensim, and I want to use that model for this new classifier. Is it…
zunzelf
  • 85
  • 1
  • 5
6
votes
1 answer

Does Doc2Vec learn representations for the tags?

I'm using the Doc2Vec tags as a unique identifier for my documents; each document has a different tag with no semantic meaning. I'm using the tags to find specific documents so I can calculate the similarity between them. Do the tags influence the…
Stanko
  • 4,275
  • 3
  • 23
  • 51
6
votes
1 answer

gensim KeyedVectors dimensions

In gensim's latest version, loading trained vectors from a file is done using KeyedVectors, and doesn't require instantiating a new Word2Vec object. But now my code is broken because I can't use the model.vector_size property. What is the alternative…
proton
  • 393
  • 6
  • 31
6
votes
1 answer

Incremental Word2Vec Model Training in gensim

I have tried to incrementally train a word2vec model produced by gensim, but I found that the vocabulary size doesn't increase; only the word2vec model weights are updated. But I need to update both the vocabulary and the model size. #Load data…
Rabindra Nath Nandi
  • 1,433
  • 1
  • 15
  • 28
6
votes
2 answers

AttributeError: type object 'Word2Vec' has no attribute 'load_word2vec_format'

I am trying to implement a word2vec model and am getting an attribute error: AttributeError: type object 'Word2Vec' has no attribute 'load_word2vec_format' Below is the code: wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz",…
Rishabh Rusia
  • 173
  • 2
  • 4
  • 19
6
votes
1 answer

Doc2Vec: Differentiate Sentence and Document

I am just playing around with Doc2Vec from gensim, analysing stackexchange dump to analyze semantic similarity of questions to identify duplicates. The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences. But the original…
Vikash Balasubramanian
  • 2,921
  • 3
  • 33
  • 74
6
votes
4 answers

AttributeError: 'list' object has no attribute 'lower' gensim

I have a list of 10k words in a text file like so: G15 KDN C30A Action Standard Air Brush Air Dilution I am trying to convert them into lower cased tokens using this code for subsequent processing with GenSim: data = [line.strip() for line in…
tom
  • 315
  • 1
  • 3
  • 10
6
votes
1 answer

gensim: custom similarity measure

Using gensim, I want to calculate the similarity within a list of documents. This library is excellent at handling the amounts of data that I have got. The documents are all reduced to timestamps and I have got a function time_similarity to compare…
Simon
  • 5,464
  • 6
  • 49
  • 85
6
votes
1 answer

Semantic Similarity between Phrases Using GenSim

Background I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the corpus document pre-tokenized: **Corpus** Car Insurance Car Insurance Coverage Auto Insurance Best…
user3682157
  • 1,625
  • 8
  • 29
  • 55
6
votes
1 answer

Doc2vec MemoryError

I am using the doc2vec model from the gensim framework to represent a corpus of 15 500 000 short documents (up to 300 words): gensim.models.Doc2Vec(sentences, size=400, window=10, min_count=1, workers=8 ) After creating the vectors there are …
6
votes
1 answer

Understanding LDA Transformed Corpus in Gensim

I tried to examine the contents of the BOW corpus vs. the LDA[BOW Corpus] (transformed by an LDA model trained on that corpus with, say, 35 topics). I found the following output: DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)] LDA 1 : [(29,…
Ravi Karan
  • 445
  • 1
  • 7
  • 13
6
votes
3 answers

Are there any efficient python libraries for Dynamic Topic Models, preferably extending Gensim?

I'm trying to model Twitter stream data with topic models. Gensim, being an easy-to-use solution, is impressive in its simplicity. It has a truly online implementation for LSI, but not for LDA. For a changing content stream like Twitter, Dynamic…
Ravi Karan
  • 445
  • 1
  • 7
  • 13
6
votes
1 answer

Gensim Dictionary Implementation

I was just curious about the gensim dictionary implementation. I have the following code: def build_dictionary(documents): dictionary = corpora.Dictionary(documents) dictionary.save('/tmp/deerwester.dict') # store the dictionary …
dmil
  • 119
  • 1
  • 9
5
votes
1 answer

How to visualize Gensim Word2vec Embeddings in Tensorboard Projector

Following gensim word2vec embedding tutorial, I have trained a simple word2vec model: from gensim.test.utils import common_texts from gensim.models import Word2Vec model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1,…
G. Macia
  • 1,204
  • 3
  • 23
  • 38