Questions tagged [gensim]

Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim is built for processing raw, unstructured digital texts (“plain text”). Its algorithms, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document can be succinctly expressed in the new semantic representation and queried for topical similarity against other documents.
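The workflow described above (turn documents into vectors, then query for similarity) can be sketched in plain Python. This is a rough illustration only, not gensim's implementation: it uses a simple tf-idf weighting as a stand-in for the topic models named above, and the helper names (bag_of_words, tfidf_vectors, cosine) and the toy documents are assumptions made up for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = raw term count * log(N / document frequency).
    This is a simplified weighting chosen for illustration.
    """
    bows = [bag_of_words(doc) for doc in docs]
    n_docs = len(bows)
    # Iterating a Counter yields each distinct term once per document,
    # so this counts how many documents contain each term.
    doc_freq = Counter(term for bow in bows for term in bow)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in bow.items()}
            for bow in bows]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, invented for this sketch.
docs = [
    "human computer interface",
    "computer system response time",
    "graph of trees",
    "graph minors and trees",
]
vectors = tfidf_vectors(docs)

# Compare the first document against every document in the corpus.
similarities = [cosine(vectors[0], vec) for vec in vectors]
for doc, sim in zip(docs, similarities):
    print(f"{sim:.3f}  {doc}")
```

With gensim itself, the vectorization step would be replaced by one of its models (for example LSA or LDA), so documents could match on topic even when they share no exact words; the sketch above only matches on shared terms.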

2433 questions
9
votes
1 answer

gensim LdaMulticore not multiprocessing?

When I run gensim's LdaMulticore model on a machine with 12 cores, using: lda = LdaMulticore(corpus, num_topics=64, workers=10) I get a logging message that says "using serial LDA version on this node". A few lines later, I see another logging…
Edward Newell
  • 17,203
  • 7
  • 34
  • 36
9
votes
4 answers

How to filter out words with low tf-idf in a corpus with gensim?

I am using gensim for some NLP task. I've created a corpus from dictionary.doc2bow where dictionary is an object of corpora.Dictionary. Now I want to filter out the terms with low tf-idf values before running an LDA model. I looked into the…
Ziyuan
  • 4,215
  • 6
  • 48
  • 77
9
votes
4 answers

Gensim: How to save LDA model's produced topics to a readable format (csv,txt,etc)?

last parts of the code: lda = LdaModel(corpus=corpus,id2word=dictionary, num_topics=2) print lda bash output: INFO : adding document #0 to Dictionary(0 unique tokens) INFO : built Dictionary(18 unique tokens) from 5 documents (total 20 corpus…
jeremy.ting
  • 155
  • 1
  • 1
  • 7
9
votes
1 answer

Can we use a self made corpus for training for LDA using gensim?

I have to apply LDA (Latent Dirichlet Allocation) to get the possible topics from a database of 20,000 documents that I collected. How can I use these documents rather than other available corpora like the Brown Corpus or English Wikipedia as…
Animesh Pandey
  • 5,900
  • 13
  • 64
  • 130
8
votes
2 answers

How to import gensim summarize

I got gensim to work in Google Colab by following this process: !pip install gensim from gensim.summarization import summarize Then I was able to call summarize(some_text). Now I'm trying to run the same thing in VS Code: I've installed…
Katie Melosto
  • 1,047
  • 2
  • 14
  • 35
8
votes
0 answers

ModuleNotFoundError: No module named 'numpy.testing.decorators'

I really need some help, as I have gone through all the posts and nothing has worked. I get this error when importing gensim and not numpy (numpy is before and works fine). All I want to do is import gensim and numpy to then run my analysis. Here is…
astampib
  • 91
  • 1
  • 4
8
votes
2 answers

Gensim LDA Coherence Score Nan

I created a Gensim LDA Model as shown in this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ lda_model = gensim.models.LdaMulticore(data_df['bow_corpus'], num_topics=10, id2word=dictionary, random_state=100,…
Ramsha Siddiqui
  • 460
  • 6
  • 20
8
votes
1 answer

Multi-core CPU not used efficiently when training Doc2Vec with gensim

I am using a 24-core virtual CPU and 100 GB of memory to train Doc2Vec with Gensim, but CPU usage always stays around 200% no matter how I change the number of cores. top, htop. The two pictures above show the percentage of CPU usage; this pointed…
Ivan Lee
  • 3,420
  • 4
  • 30
  • 45
8
votes
2 answers

Cosine similarity between 0 and 1

I am interested in calculating similarity between vectors, however this similarity has to be a number between 0 and 1. There are many questions concerning tf-idf and cosine similarity, all indicating that the value lies between 0 and 1. From…
Bram Vanroy
  • 27,032
  • 24
  • 137
  • 239
8
votes
2 answers

How to avoid decoding to str: need a bytes-like object error in pandas?

Here is my code: data = pd.read_csv('asscsv2.csv', encoding = "ISO-8859-1", error_bad_lines=False); data_text = data[['content']] data_text['index'] = data_text.index documents = data_text It looks like print(documents[:2]) …
wayne64001
  • 399
  • 1
  • 3
  • 13
8
votes
1 answer

Gensim (word2vec) retrieve n most frequent words

How is it possible to retrieve the n most frequent words from a Gensim word2vec model? As I understand, the frequency and count are not the same, and I therefore can't use the object.count() method. I need to produce a list of the n most frequent…
Phils19
  • 156
  • 2
  • 8
8
votes
1 answer

Python/Gensim - What is the meaning of syn0 and syn0norm?

I know that in gensim's KeyedVectors model, one can access the embedding matrix through the attribute model.syn0. There is also a syn0norm, which doesn't seem to work for the GloVe model I recently loaded. I think I have also seen syn1 somewhere…
MBT
  • 21,733
  • 19
  • 84
  • 102
8
votes
2 answers

Loss does not decrease during training (Word2Vec, Gensim)

What can cause the loss from model.get_latest_training_loss() to increase on each epoch? Code used for training: class EpochSaver(CallbackAny2Vec): '''Callback to save model after each epoch and show training parameters ''' def __init__(self,…
Dasha
  • 327
  • 2
  • 10
8
votes
2 answers

How to build a gensim dictionary that includes bigrams?

I'm trying to build a Tf-Idf model that can score bigrams as well as unigrams using gensim. To do this, I build a gensim dictionary and then use that dictionary to create bag-of-words representations of the corpus that I use to build the model. The…
fraxture
  • 5,113
  • 4
  • 43
  • 83
8
votes
3 answers

Gensim Word2Vec select minor set of word vectors from pretrained model

I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model. The problem is that the embedding size is enormous and I don't need most of the word vectors (because…
getaway22
  • 189
  • 1
  • 2
  • 9