Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary; you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.
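To make "queried for topical similarity" concrete, here is a minimal sketch in plain Python rather than gensim's own similarity classes. It assumes documents have already been mapped to sparse lists of (topic id, weight) pairs; the function name `cosine_similarity` and the three example vectors are made up for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(weight * b[i] for i, weight in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic-space representations of three documents.
doc_query = [(0, 0.9), (1, 0.1)]
doc_close = [(0, 0.8), (1, 0.2)]
doc_far = [(1, 0.1), (2, 0.9)]

print(cosine_similarity(doc_query, doc_close))
print(cosine_similarity(doc_query, doc_far))
```

A document whose weight sits on the same topics as the query scores close to 1, while one dominated by a different topic scores close to 0.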

2433 questions

1 answer

Issues in getting trigrams using Gensim

I want to get bigrams and trigrams from the example sentences I have mentioned. My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., human computer interaction, which is mentioned in 5 places of my…

asked Sep 11 '17 at 04:28

user8566323

2 answers

How should I interpret "size" parameter in Doc2Vec function of gensim?

I am using the Doc2Vec function of gensim in Python to convert a document to a vector. An example of usage: model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4). How should I interpret the size parameter? I know that if I set size = 100,…

python gensim word2vec

asked Jan 22 '16 at 14:07

mamatv

6 answers

Ensure the gensim generate the same Word2Vec model for different runs on the same data

An LDA model generates different topics every time I train on the same corpus; by setting np.random.seed(0), the LDA model will always be initialized and trained in exactly the same way. Is it the same for the Word2Vec models from gensim? By…

python random gensim word2vec word-embedding

asked Jan 16 '16 at 20:05

alvas

7 answers

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

I am using the Gensim HDP module on a set of documents. >>> hdp = models.HdpModel(corpusB, id2word=dictionaryB) >>> topics = hdp.print_topics(topics=-1, topn=20) >>> len(topics) 150 >>> hdp = models.HdpModel(corpusA, id2word=dictionaryA) >>> topics…

python nlp lda gensim

asked Jul 21 '15 at 15:34

Sam Weisenthal

1 answer

How does the Gensim Fasttext pre-trained model get vectors for out-of-vocabulary words?

I am using gensim to load a pre-trained fastText model. I downloaded the English Wikipedia trained model from the fastText website. Here is the code I wrote to load the pre-trained model: from gensim.models import FastText as…

python nlp gensim fasttext

asked Jun 13 '18 at 02:33

Baktaawar

4 answers

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's…

python tensorflow gensim tensorboard word-embedding

asked May 23 '18 at 15:50

I. Blum

2 answers

How to get word2index from gensim

According to the docs, we can read a word2vec model with gensim using model = KeyedVectors.load_word2vec_format('word2vec.50d.txt', binary=False). This gives an index-to-word mapping, e.g., model.index2word[2]. How can I derive an inverted mapping…
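As an illustration of the inversion being asked about, here is a small sketch that uses a plain Python list as a stand-in for model.index2word, so it runs without gensim or a model file. The example words are made up. Newer gensim releases appear to provide such a mapping directly (key_to_index), so treat this as a generic approach rather than the library's recommended one.

```python
# Stand-in for model.index2word: position i holds the word with index i.
# These four words are hypothetical; a real model would supply its own list.
index2word = ["the", "of", "and", "gensim"]

# Invert the list into a word -> index dictionary.
word2index = {word: index for index, word in enumerate(index2word)}

print(word2index["and"])
```

Looking up a word in the dictionary is a constant-time operation, whereas index2word.index(word) scans the list on every call.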

gensim

asked Nov 05 '17 at 02:21

GabrielChu

4 answers

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below. -HP-dx2280-MT-GR541AV:~$…

python character-encoding gensim word2vec kaggle

asked Dec 26 '14 at 17:25

user168983

3 answers

Gensim install in Python 3.11 fails because of missing longintrepr.h file

Operating System: macOS Monterey 12.6 Chip: Apple M1 Python version: 3.11.1 I try: pip3 install gensim The install process starts well but fatally fails towards the end while running 'clang'. The error message is: clang -Wsign-compare…

python-3.x cython gensim

asked Jan 02 '23 at 06:55

Halim Gurgenci

3 answers

Using pretrained gensim Word2vec embedding in keras

I have trained word2vec in gensim. In Keras, I want to use it to build a matrix for each sentence from that word embedding. Storing the matrix of all the sentences is very space and memory inefficient, so I want to make an embedding layer in Keras to…

python keras gensim word2vec word-embedding

asked Sep 01 '18 at 08:53

shivank01

2 answers

Doc2Vec.infer_vector keeps giving different result everytime on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb I modified the code in line 10 to determine best matching document for the given…

nlp word2vec gensim doc2vec

asked Jan 21 '18 at 00:31

Rohan

2 answers

Necessary to apply TF-IDF to new documents in gensim LDA model?

I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation where it explains that tf-idf is used during training (at least for LSA, not so clear with LDA). I expected to apply a…

gensim

asked Jun 27 '17 at 13:04

Luke W

2 answers

Gensim LDA topic assignment

I am hoping to assign each document to one topic using LDA. Now I realise that what you get is a distribution over topics from LDA. However as you see from the last line below I assign it to the most probable topic. My question is this. I have to…
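The "most probable topic" step described in this excerpt can be sketched without a trained model. This assumes the per-document distribution is a list of (topic id, probability) pairs, which is the shape gensim's LDA model returns for a document; the helper name `most_probable_topic` and the sample numbers are hypothetical.

```python
def most_probable_topic(topic_distribution):
    """Return the id of the highest-probability topic, or None if the list is empty."""
    if not topic_distribution:
        return None
    # Each entry is a (topic_id, probability) pair; compare by probability.
    return max(topic_distribution, key=lambda pair: pair[1])[0]


# Hypothetical distribution for a single document.
doc_topics = [(0, 0.12), (3, 0.71), (7, 0.17)]

print(most_probable_topic(doc_topics))
```

A hard assignment like this discards the rest of the distribution, so it is mainly useful when each document is expected to be dominated by a single topic.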

gensim lda topic-modeling

asked Oct 11 '16 at 03:07

sachinruk

1 answer

Understanding LDA / topic modelling -- too much topic overlap

I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach). I have a small number of literary texts (novels) and would like to extract some…

python nlp gensim lda topic-modeling

asked Sep 20 '17 at 15:30

zinfandel

2 answers

Gensim word2vec on predefined dictionary and word-indices data

I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen on gensim my data is not raw, but has already been preprocessed. I have a dictionary in a text document containing 65k words (incl. an…

python nlp gensim word2vec

asked Mar 01 '16 at 11:20

pir
