Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation, or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.
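
To make the idea of querying documents for similarity concrete, here is a minimal plain-Python sketch. It is not gensim's API: it swaps in raw bag-of-words counts and cosine similarity for the semantic representation described above, and the helper names `bow` and `cosine` as well as the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Hypothetical helper: lower-cased bag-of-words counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus, invented for this sketch.
corpus = [
    "human machine interface for computer applications",
    "graph of trees and minors",
    "user survey of computer system response time",
]

query = bow("human computer interaction")

# Rank document indices from most to least similar to the query.
ranked = sorted(
    range(len(corpus)),
    key=lambda i: cosine(query, bow(corpus[i])),
    reverse=True,
)
print(ranked)
```

A topic model would replace the raw counts with topic weights, but the ranking step would look much the same.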


2433 questions
14 votes, 1 answer

Issues in getting trigrams using Gensim

I want to get bigrams and trigrams from the example sentences I have mentioned. My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., human computer interaction, which is mentioned in 5 places of my…
user8566323
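
One possible reason for seeing bigrams but no trigrams is that pair-based phrase detection only joins adjacent tokens in a single pass, so a three-word phrase needs a second pass over the already-merged output. The sketch below illustrates that two-pass idea in plain Python. It is not gensim's Phrases API: `merge_frequent_pairs` is a hypothetical helper, the sentences are invented, and the raw-count threshold is a simplifying assumption (gensim scores candidate pairs rather than relying on counts alone).

```python
from collections import Counter


def merge_frequent_pairs(sentences, min_count):
    """Join adjacent token pairs that occur at least min_count times (hypothetical helper)."""
    pair_counts = Counter(
        (a, b) for sent in sentences for a, b in zip(sent, sent[1:])
    )
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and pair_counts[(sent[i], sent[i + 1])] >= min_count:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2  # skip both tokens of the merged pair
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


# Invented example sentences for this sketch.
sentences = [
    ["human", "computer", "interaction", "is", "fun"],
    ["i", "study", "human", "computer", "interaction"],
    ["human", "computer", "interaction", "research"],
]

bigrammed = merge_frequent_pairs(sentences, min_count=3)   # first pass: pairs
trigrammed = merge_frequent_pairs(bigrammed, min_count=3)  # second pass: pair + token
print(trigrammed[0])
```

The first pass can only produce `human_computer`; the second pass sees that token next to `interaction` and joins them.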
14 votes, 2 answers

How should I interpret "size" parameter in Doc2Vec function of gensim?

I am using the Doc2Vec function of gensim in Python to convert a document to a vector. An example of usage: model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4). How should I interpret the size parameter? I know that if I set size = 100,…
mamatv
14 votes, 6 answers

Ensure the gensim generate the same Word2Vec model for different runs on the same data

According to "LDA model generates different topics everytime i train on the same corpus", setting np.random.seed(0) ensures that the LDA model is always initialized and trained in exactly the same way. Is it the same for the Word2Vec models from gensim? By…
alvas
14 votes, 7 answers

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

I am using the Gensim HDP module on a set of documents.
>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics…
Sam Weisenthal
13 votes, 1 answer

How does the Gensim Fasttext pre-trained model get vectors for out-of-vocabulary words?

I am using gensim to load a pre-trained fastText model. I downloaded the English Wikipedia trained model from the fastText website. Here is the code I wrote to load the pre-trained model: from gensim.models import FastText as…
Baktaawar
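
As background to this question: fastText represents a word through its character n-grams, wrapped in the boundary markers `<` and `>`, and a vector for an out-of-vocabulary word is composed from the vectors of those n-grams. The sketch below shows only the n-gram extraction step in plain Python. `char_ngrams` is a hypothetical helper, not a gensim function, and the 3 to 6 length range is assumed here to mirror fastText's usual defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' (hypothetical helper)."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Only the 3-grams of "where", to keep the output short.
print(char_ngrams("where", min_n=3, max_n=3))
```

Because an unseen word still shares many of these n-grams with words seen during training, a vector can be assembled for it even though the full word has no entry of its own.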
13 votes, 4 answers

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's…
I. Blum
13 votes, 2 answers

How to get word2index from gensim

According to the docs, we can use this to read a word2vec model with gensim: model = KeyedVectors.load_word2vec_format('word2vec.50d.txt', binary=False). This is an index-to-word mapping, e.g., model.index2word[2]. How can I derive an inverted mapping…
GabrielChu
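
Given any index-to-word list, the inverted mapping can be built with a dictionary comprehension. This is a minimal sketch, assuming `index2word` is a plain list of unique words; the three-word list here is a made-up stand-in for `model.index2word`, so that the example runs without a model file.

```python
# Hypothetical stand-in for model.index2word from the question.
index2word = ["the", "of", "and"]

# Invert the list: each word maps back to its position.
word2index = {word: index for index, word in enumerate(index2word)}

print(word2index["and"])
```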
13 votes, 4 answers

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk. But when I try to load the file back, I get the error below. -HP-dx2280-MT-GR541AV:~$…
user168983
12 votes, 3 answers

Gensim install in Python 3.11 fails because of missing longintrepr.h file

Operating System: macOS Monterey 12.6. Chip: Apple M1. Python version: 3.11.1. I try: pip3 install gensim. The install process starts well but fatally fails towards the end while running 'clang'. The error message is: clang -Wsign-compare…
Halim Gurgenci
12 votes, 3 answers

Using pretrained gensim Word2vec embedding in keras

I have trained word2vec in gensim. In Keras, I want to use it to build a matrix of sentences using that word embedding. Storing the matrix of all the sentences is very space- and memory-inefficient, so I want to make an embedding layer in Keras to…
shivank01
12 votes, 2 answers

Doc2Vec.infer_vector keeps giving different result everytime on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb I modified the code in line 10 to determine the best matching document for the given…
Rohan
12 votes, 2 answers

Necessary to apply TF-IDF to new documents in gensim LDA model?

I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation where it explains that tf-idf is used during training (at least for LSA, not so clear with LDA). I expected to apply a…
Luke W
12 votes, 2 answers

Gensim LDA topic assignment

I am hoping to assign each document to one topic using LDA. Now I realise that what you get is a distribution over topics from LDA. However as you see from the last line below I assign it to the most probable topic. My question is this. I have to…
sachinruk
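
Assigning a document to its most probable topic, as described in this question, amounts to taking the argmax over the topic distribution. The sketch below assumes the distribution is available as a list of (topic_id, probability) pairs; the values are invented for illustration and do not come from a trained model.

```python
# Hypothetical topic distribution for one document: (topic_id, probability) pairs.
doc_topics = [(0, 0.12), (3, 0.71), (7, 0.17)]

# Pick the pair with the highest probability.
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])

print(best_topic, best_prob)
```

Keeping `best_prob` alongside the topic id makes it easy to flag documents whose top topic is only weakly dominant.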
11 votes, 1 answer

Understanding LDA / topic modelling -- too much topic overlap

I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach). I have a small number of literary texts (novels) and would like to extract some…
zinfandel
11 votes, 2 answers

Gensim word2vec on predefined dictionary and word-indices data

I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen on gensim my data is not raw, but has already been preprocessed. I have a dictionary in a text document containing 65k words (incl. an…
pir