Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm used to convert documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag Of Words" mode (PV-DBOW, which works somewhat analogously to skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to CBOW mode in Word2Vec).
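To make the difference between the two modes concrete, here is a simplified, pure-Python sketch of which inputs are used to predict which target word in each mode. It is an illustration only, not Gensim's implementation: the function names, the `DOC_0` tag, and the symmetric `window` are assumptions made for the example.

```python
from typing import List, Tuple

Example = Tuple[List[str], str]  # (input tokens, target word)


def dbow_examples(tag: str, words: List[str]) -> List[Example]:
    """PV-DBOW style: the document tag alone is the input for each target word."""
    return [([tag], target) for target in words]


def dm_examples(tag: str, words: List[str], window: int = 2) -> List[Example]:
    """PV-DM style: the tag plus nearby context words predict the target word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]        # up to `window` words before
        right = words[i + 1:i + 1 + window]       # up to `window` words after
        examples.append(([tag] + left + right, target))
    return examples


doc = ["the", "cat", "sat"]
print(dbow_examples("DOC_0", doc))
print(dm_examples("DOC_0", doc, window=1))
```

In the PV-DBOW output every example has only the tag as input, much as skip-gram uses a single word to predict its neighbours; in the PV-DM output the tag is combined with context words, much as CBOW combines a window of words.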

556 questions
2
votes
1 answer

Should I split sentences in a document for Doc2Vec?

I am building a Doc2Vec model with 1000 documents using Gensim. Each document consists of several sentences, each containing multiple words. Example: Doc1: [[word1, word2, word3], [word4, word5, word6, word7], [word8, word9, word10]] Doc2: [[word7,…
porororo
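One possible way to handle nested sentence lists like these is to flatten each document into a single token list before tagging it, since Gensim's `TaggedDocument` holds one `words` list and one `tags` list per example. The sketch below uses a stand-in namedtuple with the same field names so that it runs without Gensim; whether flattening or tagging each sentence separately works better for a given corpus is not something this sketch settles.

```python
from collections import namedtuple
from itertools import chain

# Stand-in with the same fields as gensim.models.doc2vec.TaggedDocument,
# used here only so the example runs without Gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Only Doc1 is reproduced, as given in the question excerpt.
docs = {
    "Doc1": [
        ["word1", "word2", "word3"],
        ["word4", "word5", "word6", "word7"],
        ["word8", "word9", "word10"],
    ],
}


def flatten(sentences):
    """Join a list of tokenized sentences into one flat token list."""
    return list(chain.from_iterable(sentences))


corpus = [TaggedDocument(words=flatten(sents), tags=[name])
          for name, sents in docs.items()]
print(corpus[0])
```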
2
votes
1 answer

Checking model overfit of doc2vec with infer_vector()

My aim is to create document embeddings from the column df["text"] as a first step and then, as a second step, plug them along with other variables into an XGBoost Regressor model in order to make predictions. This works very well for the train_df. I…
karabara
2
votes
1 answer

Why does a Gensim Doc2vec object return empty doctags?

How should I interpret my situation? I trained a Doc2Vec model following this tutorial: https://blog.griddynamics.com/customer2vec-representation-learning-and-automl-for-customer-analytics-and-personalization/. For some reason,…
Jeong Kim
2
votes
1 answer

Cannot load Doc2vec object using gensim

I am trying to load a pre-trained Doc2vec model using gensim and use it to map a paragraph to a vector. I am referring to https://github.com/jhlau/doc2vec and the pre-trained model I downloaded is the English Wikipedia DBOW, which is also in the…
user13584534
2
votes
1 answer

How to extract sentences that have a similar meaning/intent to an example list of sentences

I have chat interactions [utterances] between a customer and an advisor and want to know whether the advisor's interactions contain particular sentences, or sentences similar to those, in the list below: Example sentences I am looking for in the advisor interactions…
baskarmac
2
votes
1 answer

Gensim's Doc2Vec - How to use pre-trained word2vec (word similarities)

I don't have a large corpus of data to train word similarities, e.g. 'hot' is more similar to 'warm' than to 'cold'. However, I would like to train doc2vec on a relatively small corpus of ~100 docs so that it can classify my domain-specific documents. To…
KGhatak
2
votes
1 answer

Gensim Doc2Vec infer_vector on unseen words differs based on characters in these words

Gensim Doc2Vec infer_vector on paragraphs with unseen words generates vectors that differ based on the characters in the unseen words. for i in range(0, 2): print(model.infer_vector(["zz"])[0:2]) print(model.infer_vector(["zzz"])[0:2]) …
Stanley Kirdey
2
votes
1 answer

Default values of doc2vec for alpha and min_alpha

Can anybody tell me which default values are used in Doc2Vec() for alpha and min_alpha?
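For context, `alpha` is the starting learning rate and `min_alpha` is the value it decays towards over training. The sketch below illustrates a linear decay between the two. The default values in the signature (0.025 and 0.0001) are what recent Gensim releases use as far as I know, but treat them as an assumption and check the documentation for your installed version; `linear_alpha` is a hypothetical helper, not a Gensim function.

```python
def linear_alpha(progress: float,
                 alpha: float = 0.025,       # assumed Gensim default, verify locally
                 min_alpha: float = 0.0001   # assumed Gensim default, verify locally
                 ) -> float:
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return alpha - (alpha - min_alpha) * progress


for p in (0.0, 0.5, 1.0):
    print(p, linear_alpha(p))
```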
2
votes
1 answer

How to use doc2vec model in production?

I wonder how to deploy a doc2vec model in production to create word vectors as input features to a classifier. To be specific, let's say a doc2vec model is trained on a corpus as follows. dataset['tagged_descriptions'] = datasetf.apply(lambda x:…
user3000538
2
votes
1 answer

How to use Sklearn linear regression with doc2vec input

I have 250k text documents (tweets and newspaper articles) represented as vectors obtained with a doc2vec model. Now, I want to use a regressor (multiple linear regression) to predict continuous value outputs - in my case the UK Consumer Confidence…
2
votes
1 answer

How to combine vectors generated by PV-DM and PV-DBOW methods of doc2vec?

I have around 20k documents of 60 to 150 words each. Out of these 20k documents, there are 400 for which the similar documents are known. These 400 documents serve as my test data. I am trying to find similar documents for these 400 datasets…
Vikrant
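A common way to combine the two kinds of vectors is to concatenate, per document, the PV-DM vector and the PV-DBOW vector into one longer vector. The sketch below shows that with plain Python lists. Normalizing each half to unit length first is my own assumption (so that neither model dominates by scale), and the toy vectors are made up for the example.

```python
import math
from typing import List, Sequence


def unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(dm_vec: Sequence[float], dbow_vec: Sequence[float]) -> List[float]:
    """Concatenate the normalized PV-DM and PV-DBOW vectors of one document."""
    return unit(dm_vec) + unit(dbow_vec)


# Toy vectors standing in for the two models' outputs for the same document.
combined = combine([3.0, 4.0], [0.0, 2.0])
print(len(combined), combined)
```

The combined vectors can then be compared with the same similarity measure that was used for the individual models.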
2
votes
2 answers

AttributeError: module 'gensim.utils' has no attribute 'smart_open'

I am building the vocabulary table using Doc2vec, but there is an error "AttributeError: module 'gensim.utils' has no attribute 'smart_open'". How do I solve this? This is for a notebook on Databricks platform, running in Python 3. In the past, I've…
2
votes
1 answer

Tensorboard embedding visualization: what is cosine distance?

I'm a PhD student in digital humanities and quite new to programming languages. I have a problem that has been freaking me out since last month. I'm trying to visualize a doc2vec model (Python, Gensim library) on the Embedding Projector in TensorBoard but…
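As background for the title question, cosine distance is usually defined as one minus the cosine similarity of two vectors, so it depends only on the angle between them and not on their lengths. The sketch below implements that usual definition in plain Python; I am assuming, not asserting, that the projector uses the same convention.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b); ranges from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing the same way give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2.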
2
votes
2 answers

Convert a column in a dask dataframe to a TaggedDocument for Doc2Vec

Intro: Currently I am trying to use dask in concert with gensim to do NLP document computation, and I'm running into an issue when converting my corpus into a "TaggedDocument". Because I've tried so many different ways to wrangle this problem, I'll…
ZdWhite
2
votes
1 answer

Where is word2vec mapping coming from for DBOW doc2vec in gensim implementation?

I am trying to use gensim for doc2vec and word2vec. Since the PV-DM approach can generate word2vec and doc2vec at the same time, I thought PV-DM is the right model to use. So, I created a model using gensim by specifying dm=1 for PV-DM. My questions are…
Brandon Lee