Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary; you only need a corpus of plain-text documents.

Once these statistical patterns are found, any plain-text document can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents.
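As a rough illustration of what "querying for topical similarity" means, here is a minimal standard-library sketch. It is not gensim's API: it uses plain TF-IDF weights and cosine similarity as a stand-in for the learned semantic representation, and the helper names `tfidf_vectors` and `cosine` as well as the toy corpus are assumptions made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term count times log inverse document frequency.
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "human machine interface for computer applications".split(),
    "a survey of user opinion of computer system response time".split(),
    "the generation of random binary unordered trees".split(),
    "graph minors and the intersection of paths in trees".split(),
]

vectors = tfidf_vectors(corpus)
# Similarity of the first document to every document in the corpus.
sims = [cosine(vectors[0], v) for v in vectors]
print(sims)
```

The first document is most similar to itself, shares only "computer" with the second, and shares no terms with the last two. Gensim's topic models go further than this sketch by mapping documents into a lower-dimensional topic space, so documents can be similar without sharing exact words.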

2433 questions
6 votes, 1 answer

Gensim Word2Vec model getting worse by increasing the number of epochs

I'm building a Word2Vec model for category recommendation on a dataset consisting of ~35.000 sentences for a total of ~500.000 words but only ~3.000 distinct ones. I build the model basically like this: def train_w2v_model(df, epochs): …
6 votes, 2 answers

How to save fasttext model in binary and text formats?

The documentation is a bit unclear on how to save the fasttext model to disk. How do you specify a path in the argument? I tried doing so and it failed with an error. Example in documentation: >>> from gensim.test.utils import get_tmpfile >>> >>> fname…
erotavlas
6 votes, 9 answers

Gensim mallet CalledProcessError: returned non-zero exit status

I'm getting an error while trying to access gensim's mallet in Jupyter notebooks. I have the specified file 'mallet' in the same folder as my notebook, but can't seem to access it. I tried routing to it from the C drive but I still get the same…
Sara
6 votes, 2 answers

gensim word2vec print log loss

How can I print the loss of each epoch to a log (file or stdout) during the training phase when using a gensim word2vec model? I tried: logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s') logging.root.setLevel(level=logging.INFO) But I…
Dkova
6 votes, 1 answer

Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one. Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc…
Borislav Stoilov
6 votes, 2 answers

Error when loading FastText's french pre-trained model with gensim

I am trying to use FastText's French pre-trained binary model (downloaded from the official FastText GitHub page). I need the .bin model and not the .vec word vectors so as to approximate misspelled and out-of-vocabulary words. However, when I…
Clara-sininen
6 votes, 1 answer

How can I get the vectors for words that were not present in word2vec vocabulary?

I have checked the previous post link, but it doesn't seem to work for my case: I have a pre-trained word2vec model: import gensim model = Word2Vec.load('w2v_model') Now I have a pandas dataframe with…
James
6 votes, 2 answers

How to perform k-means clustering from Gensim TFIDF values

I am using Gensim for a vector space model. After creating a dictionary and corpus with Gensim, I calculated the TFIDF (term frequency * inverse document frequency) using the following lines: Term_IDF = TfidfModel(corpus) corpus_tfidf =…
Nhqazi
6 votes, 2 answers

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieves only the text. After searching the web I found these pages: Page from gensim github issues section. It…
Ghaliamus
6 votes, 2 answers

Negative Values: Evaluate Gensim LDA with Topic Coherence

I'm currently trying to evaluate my topic models with gensim topiccoherencemodel: from gensim.models.coherencemodel import CoherenceModel cm_u_mass = CoherenceModel(model = model1, corpus = corpus1, coherence = 'u_mass') coherence_u_mass =…
Nils_Denter
6 votes, 1 answer

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences: The corpora were lemmatized and POS-tagged with the…
Simon Hessner
6 votes, 1 answer

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length>10) as training data. The goal is to get doc vectors for all training docs, but only 10 vectors can be found in model.docvecs. An example of the training data (length>10): docs = ['This is…
GemOfRoe
6 votes, 1 answer

Load vectors into gensim Word2Vec model - not KeyedVectors

I'm attempting to load some pre-trained vectors into a gensim Word2Vec model, so they can be retrained with new data. My understanding is I can do the retraining with gensim.Word2Vec.train(). However, the only way I can find to load the vectors is…
Mike S
6 votes, 2 answers

gensim - Word2vec continue training on existing model - AttributeError: 'Word2Vec' object has no attribute 'compute_loss'

I am trying to continue training on an existing model: model = gensim.models.Word2Vec.load('model/corpus.zhwiki.word.model') more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more',…
dididaisy
6 votes, 1 answer

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what an ideal training data size for a doc2Vec model should be. I will be sharing my understanding…
Shalabh Singh