Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document can be succinctly expressed in the new semantic representation and queried for topical similarity against other documents.
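
To make the "express documents as vectors, then query for similarity" idea concrete, here is a minimal, dependency-free sketch. It is not gensim's implementation: it uses plain bag-of-words counts and cosine similarity instead of an LSA or LDA model, and the function names (`bow_vector`, `cosine_similarity`, `rank_by_similarity`) and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def bow_vector(text):
    """Return a sparse bag-of-words vector (token -> count) for a plain-text document."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(count * vec_b.get(token, 0) for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (document index, score) pairs, most similar document first."""
    query_vec = bow_vector(query)
    scores = [
        (index, cosine_similarity(query_vec, bow_vector(doc)))
        for index, doc in enumerate(documents)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy corpus (assumed for illustration only).
corpus = [
    "human machine interface for computer applications",
    "graph of trees and paths",
    "user survey of computer system response time",
]
ranking = rank_by_similarity("computer interface", corpus)
for index, score in ranking:
    print(index, round(score, 3), corpus[index])
```

A topic model would replace `bow_vector` with a projection into a lower-dimensional topic space, so that documents sharing no literal words can still score as similar; the querying step stays the same in spirit.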

2433 questions
11 votes, 1 answer

Doc2Vec and PySpark: Gensim Doc2vec over DeepDist

I am looking at the DeepDist (link) module and thinking of combining it with Gensim's Doc2Vec API to train paragraph vectors on PySpark. The link actually provides the following clean example of how to do it for Gensim's Word2Vec model: from…
Patrick the Cat
  • 2,138
  • 1
  • 16
  • 33
11 votes, 3 answers

Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?

I am using the pre-trained Google News dataset to get word vectors with the Gensim library in Python: model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) After loading the model I am converting training reviews…
Nomiluks
  • 2,052
  • 5
  • 31
  • 53
11 votes, 2 answers

What is the best way to obtain the optimal number of topics for a LDA-Model using Gensim?

I am trying to obtain the optimal number of topics for an LDA-model within Gensim. One method I found is to calculate the log likelihood for each model and compare each against each other, e.g. at The input parameters for using latent Dirichlet…
Akantor
  • 151
  • 1
  • 1
  • 6
11 votes, 2 answers

How to obtain antonyms through word2vec?

I am currently working on a word2vec model using gensim in Python, and want to write a function that can help me find the antonyms and synonyms of a given word. For example: antonym("sad")="happy" synonym("upset")="enraged" Is there a way to do that…
Salamander
  • 179
  • 5
  • 15
11 votes, 3 answers

How to predict the topic of a new query using a trained LDA model using gensim?

I have trained a corpus for LDA topic modelling using gensim. Going through the tutorial on the gensim website (this is not the whole code): question = 'Changelog generation from Github issues?'; temp = question.lower() for i in…
Animesh Pandey
  • 5,900
  • 13
  • 64
  • 130
10 votes, 1 answer

LDA Topic Model Performance - Topic Coherence Implementation for scikit-learn

I have a question around measuring/calculating topic coherence for LDA models built in scikit-learn. Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic…
10 votes, 3 answers

Continue training a FastText model

I have downloaded a .bin FastText model, and I use it with gensim as follows: model = FastText.load_fasttext_format("cc.fr.300.bin") I would like to continue the training of the model to adapt it to my domain. After checking FastText's Github and…
ted
  • 13,596
  • 9
  • 65
  • 107
10 votes, 2 answers

gensim: pickle or not?

I have a question related to gensim. I would like to know whether it is recommended or necessary to use pickle when saving or loading a model (or multiple models), as I find scripts on GitHub that do either. mymodel = Doc2Vec(documents, size=100,…
Christopher
  • 2,120
  • 7
  • 31
  • 58
10 votes, 2 answers

How to do Text classification using word2vec

I want to perform text classification using word2vec. I got vectors of words. ls = [] sentences = lines.split(".") for i in sentences: ls.append(i.split()) model = Word2Vec(ls, min_count=1, size = 4) words =…
Shubham Agrawal
  • 109
  • 1
  • 1
  • 4
10 votes, 5 answers

How to access topic words only in gensim

I built an LDA model using Gensim and I want to get the topic words only. How can I get just the words of the topics, with no probabilities and no IDs? I tried the print_topics() and show_topics() functions in gensim, but I can't get clean words! This…
Muhammed Eltabakh
  • 375
  • 1
  • 10
  • 24
10 votes, 1 answer

How to use the infer_vector in gensim.doc2vec?

def cosine(vector1,vector2): cosV12 = np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2)) return cosV12 model=gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game') string='民生 为了 父亲 我 要 坚强 地 ...' list=string.split('…
Jeffery
  • 151
  • 1
  • 1
  • 7
10 votes, 4 answers

Python NLP British English vs American English

I'm currently working on NLP in Python. However, my corpus contains both British and American English (realize/realise). I'm thinking of converting British to American, but I did not find a good tool/package to do that. Any suggestions?
Mr.cysl
  • 1,494
  • 6
  • 23
  • 37
10 votes, 1 answer

Getting TF-IDF Scores Of Words Using Gensim

I am trying to find the most important words in a corpus based on their TF-IDF scores. Been following along the example at https://radimrehurek.com/gensim/tut2.html. Based on >>> for doc in corpus_tfidf: ... print(doc) the TF-IDF score is…
user799188
  • 13,965
  • 5
  • 35
  • 37
10 votes, 1 answer

Why Gensim doc2vec give AttributeError: 'list' object has no attribute 'words'?

I am trying to experiment with gensim doc2vec using the following code. As far as I understand from the tutorials, it should work. However, it gives AttributeError: 'list' object has no attribute 'words'. from gensim.models.doc2vec import LabeledSentence,…
W.S.
  • 647
  • 1
  • 6
  • 19
10 votes, 3 answers

Error while loading Word2Vec model in gensim

I'm getting an AttributeError while loading the gensim model available at word2vec repository: from gensim import models w = models.Word2Vec() w.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) print…
Tarantula
  • 19,031
  • 12
  • 54
  • 71