Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim is designed to process raw, unstructured digital texts (“plain text”). Its algorithms, such as Latent Semantic Analysis, Latent Dirichlet Allocation, and Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human annotation is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document can be succinctly expressed in the new semantic representation and queried for topical similarity against other documents.

2433 questions
24 votes, 1 answer

Gensim: What is the difference between word2vec and doc2vec?

I'm kind of a newbie and not a native English speaker, so I have some trouble understanding Gensim's word2vec and doc2vec. I think both give me the words most similar to the query word I request, via most_similar() (after training). How can I tell which case I have to…
user3595632
24 votes, 5 answers

Interpreting the sum of TF-IDF scores of words across documents

First let's extract the TF-IDF scores per term per document: from gensim import corpora, models, similarities documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system…
alvas
24 votes, 1 answer

Export pyLDAvis graphs as standalone webpage

I am analysing text with topic modelling, using Gensim and pyLDAvis. I would like to share the results with remote colleagues without them needing to install Python and all the required libraries. Is there a way to export interactive…
Darius
24 votes, 2 answers

Get most similar words, given the vector of the word (not the word itself)

With gensim.models.Word2Vec, you can provide a model and a "word" for which you want to find the list of most similar words: model = gensim.models.Word2Vec.load_word2vec_format(model_file,…
amin
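
The idea behind this question, ranking words by similarity to an arbitrary vector rather than to a named word, can be sketched in plain Python. This is only an illustration under stated assumptions, not gensim's implementation: the helper `most_similar_to_vector` and the three-dimensional `toy_vectors` are hypothetical and made up for the example. In gensim itself, `KeyedVectors` offers a `similar_by_vector` method for this purpose.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_to_vector(vectors, query, topn=3):
    """Rank every word in `vectors` by cosine similarity to `query`."""
    scored = [(word, cosine(vec, query)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

# The query is a raw vector, not a word in the vocabulary.
query_vector = [0.88, 0.82, 0.12]
neighbours = most_similar_to_vector(toy_vectors, query_vector, topn=2)
```

With a trained model, the same ranking would normally be done by the library over its own vector matrix, which is far faster than this loop.
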
24 votes, 2 answers

Word2Vec: Effect of window size used

I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence or example is very short, I believe the window size I can use is at most 2. I am trying to understand what the implications of such a small window size are…
vkmv
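
To make the effect of the window size concrete, here is a minimal sketch that enumerates the (center, context) pairs a fixed symmetric window would produce for a single 5-token phrase. The function `context_pairs` and the sample phrase are hypothetical, and the sketch is a simplification: real word2vec implementations typically also sample a smaller effective window at random for each center word.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical 5-gram used only for illustration.
phrase = ["the", "quick", "brown", "fox", "jumps"]

pairs_w2 = context_pairs(phrase, window=2)  # only nearby words are paired
pairs_w4 = context_pairs(phrase, window=4)  # every word sees the whole phrase
```

On a 5-token phrase, a window of 4 already covers the entire phrase from every position, so larger values add no further pairs.
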
23 votes, 3 answers

Gensim 3.8.0 to Gensim 4.0.0

I have trained a Word2Vec model using Gensim 3.8.0. Later I tried to use the pretrained model with Gensim 4.0.0 on GCP. I used the following code: model = KeyedVectors.load_word2vec_format(wv_path, binary=False) words =…
23 votes, 1 answer

Does the Gensim library support GPU acceleration?

The Word2vec and Doc2vec methods provided by Gensim have a distributed version which uses BLAS, ATLAS, etc. to speed things up (details here). However, does it support GPU mode? Is it possible to get a GPU working with Gensim?
Irene Li
22 votes, 2 answers

Visualise word2vec generated from gensim using t-sne

I have trained a doc2vec and corresponding word2vec on my own corpus using gensim. I want to visualise the word2vec using t-sne with the words. That is, each dot in the figure should be labelled with its word. I looked at a similar question here: t-sne on…
Dreams
22 votes, 4 answers

How to use gensim BM25 ranking in python

I found that gensim has a BM25 ranking function. However, I cannot find a tutorial on how to use it. In my case, I have one query and a few documents that were retrieved from a search engine. How do I use gensim BM25 ranking to compare the query and…
dd90p
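
As a rough illustration of what BM25 ranking computes when comparing one query against a handful of retrieved documents, the sketch below scores tokenized documents in plain Python. It is not gensim's BM25 code: the function `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the `log(1 + ...)` IDF variant, and the sample documents are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant; the added 1 keeps the value positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical documents and query, invented for illustration.
docs = [
    "human machine interface for computer applications".split(),
    "a survey of user opinion of computer system response time".split(),
    "graph minors a survey".split(),
]
query = "computer survey".split()

scores = bm25_scores(query, docs)
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the second document ranks first because it matches both query terms, while the shorter third document outranks the first thanks to length normalization.
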
22 votes, 2 answers

What does the vector of a word in word2vec represent?

word2vec is an open source tool by Google. For each word it provides a vector of float values; what exactly do they represent? There is also a paper on paragraph vectors. Can anyone explain how they use word2vec to obtain fixed length…
user168983
21 votes, 4 answers

word2vec - what is best? add, concatenate or average word vectors?

I am working on a recurrent language model. To learn word embeddings that can be used to initialize my language model, I am using gensim's word2vec model. After training, the word2vec model holds two vectors for each word in the vocabulary: the…
Lemon
21 votes, 3 answers

Interpreting negative Word2Vec similarity from gensim

E.g. we train a word2vec model using gensim: from gensim import corpora, models, similarities from gensim.models.word2vec import Word2Vec documents = ["Human machine interface for lab abc computer applications", "A survey of user…
alvas
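
A negative similarity is less surprising once you recall that word-vector similarity is usually cosine similarity, which ranges from -1 to 1. The sketch below uses hand-picked two-dimensional vectors (hypothetical, not output from any trained model) to show the three regimes: same direction, orthogonal, and opposite direction.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked vectors, for illustration only.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # exactly 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # close to -1
```

A negative value only says that the angle between the two vectors exceeds 90 degrees. Whether that corresponds to any meaningful "opposite" relation between the words depends on the model and training data.
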
21 votes, 4 answers

Matching words and vectors in gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings. As a next step, I would…
patrick
20 votes, 2 answers

Gensim word2vec in python3 missing vocab

I'm using the gensim implementation of Word2Vec. I have the following code snippet: print('training model') model = Word2Vec(Sentences(start, end)) print('trained model:', model) print('vocab:', model.vocab.keys()) When I run this in python2, it runs…
Sam Lee
20 votes, 6 answers

Using scikit-learn vectorizers and vocabularies with gensim

I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first of all, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third,…
emiguevara