Lately I have been doing research on unsupervised clustering of a huge text database. First I tried bag-of-words followed by several clustering algorithms, which gave me good results. Now I am trying to move to a doc2vec representation, and it is not working for me: I cannot load a pretrained model and work with it, and training my own model does not give any useful results.
I trained my own model on 10k texts (around 20-50 words each):
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100, workers=8)
but the similarity ranking that gensim provides, obtained with
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
works much worse than the same ranking on my bag-of-words model. By much worse I mean that identical or almost identical texts get a similarity score comparable to that of texts that have no connection I can think of. So I decided to use a pretrained model from "Is there pre-trained doc2vec model?", which might have learned more connections between words.
Sorry for the somewhat long preamble, but the question is: how do I plug it in? Can someone provide some ideas on how, using the loaded gensim model from https://github.com/jhlau/doc2vec, I can convert my own dataset of texts into vectors of the same length? My data is preprocessed (stemmed, no punctuation, lowercase, no nltk.corpus stopwords) and I can deliver it from a list, a dataframe, or a file if needed. The code question is: how do I pass my own data to a pretrained model? Any help would be appreciated.
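To make the question concrete, here is a minimal sketch of what "passing my own data to a loaded model" could look like. It assumes the saved model can be opened with `Doc2Vec.load` in the installed gensim version (models saved with a much older gensim may not load in a newer one), and that splitting on whitespace is the right tokenization for already preprocessed text. The names `load_pretrained`, `tokenize`, `embed_texts` and the file path are hypothetical. Note that `infer_vector` takes a list of tokens rather than a raw string, and that words missing from the pretrained vocabulary do not contribute to the vector, so stemmed input may match a model trained on unstemmed text poorly.

```python
from typing import Iterable, List


def load_pretrained(path: str):
    """Load a saved gensim Doc2Vec model from `path` (hypothetical location)."""
    # Imported lazily so the helpers below work with any object that
    # exposes infer_vector(), even when gensim is not installed.
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec.load(path)


def tokenize(text: str) -> List[str]:
    """Split an already preprocessed text on whitespace (an assumption)."""
    return text.split()


def embed_texts(model, texts: Iterable[str]) -> List[List[float]]:
    """Infer one vector per text; all vectors share the model's dimensionality."""
    vectors = []
    for text in texts:
        tokens = tokenize(text)
        # infer_vector expects a list of string tokens, not the raw string.
        vec = model.infer_vector(tokens)
        vectors.append([float(x) for x in vec])
    return vectors


# Hypothetical usage, assuming the model file has been downloaded:
# model = load_pretrained("path/to/pretrained_doc2vec.bin")
# vectors = embed_texts(model, ["use medium paper examination", "tang dynasty essay"])
```

The resulting list of equal-length vectors can then be fed to the same clustering algorithms that were used on the bag-of-words representation.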
UPD: outputs that make me feel bad
Train Document (6134): «use medium paper examination medium habit one week must chart daily use medium radio television newspaper magazine film video etc wake radio alarm listen traffic report commuting get news watch sport soap opera watch tv use internet work home read book see movie use data collect journal basis analysis examining information using us gratification model discussed textbook us gratification article provided perhaps carrying small notebook day inputting material evening help stay organized smartphone use note app track medium need turn diary trust tell tell immediately paper whether actually kept one begin medium diary soon possible order give ample time complete journal write paper completed diary need write page paper use medium functional analysis theory say something best understood understanding used us gratification model provides framework individual use medium basis analysis especially category discussed posted dominick article apply concept medium usage expected le medium use cognitive social utility affiliation withdrawal must draw conclusion use analyzing habit within framework idea discussed text article concept must clearly included articulated paper common mistake student make assignment tell medium habit fail analyze habit within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio television newspaper magazine film video etc wake radio alarm listen traffic report commuting get news watch sport soap opera watch tv use internet work home read book see movie use data collect journal basis analysis examining information using us gratification model discussed textbook us gratification article provided perhaps carrying small notebook day inputting material evening help stay organized smartphone use note app track medium need turn diary trust tell tell immediately paper whether actually kept one begin medium diary soon possible order give ample time complete journal write paper completed diary need write page paper use medium functional analysis theory say something best understood understanding used us gratification model provides framework individual use medium basis analysis especially category discussed posted dominick article apply concept medium usage expected le medium use cognitive social utility affiliation withdrawal must draw conclusion use analyzing habit within framework idea discussed text article concept must clearly included articulated paper common mistake student make assignment tell medium habit fail analyze habit within context us gratification model must include idea paper»
This looks perfectly OK, but looking at other outputs:
Train Document (1185): «photography garry winogrand would like paper life work garry winogrand famous street photographer also influenced street photography aim towards thoughtful imaginative treatment detail referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page essay tang dynasty essay discus buddhism tang dynasty name artifact tang dynasty discus them history put heading paragraph information tang dynasty discussed essay»
This shows that the similarity score between two exactly identical texts, which are the most similar pair in the system, is almost the same as the score between two completely unrelated texts, which makes it problematic to do anything with the data. To get the most similar documents I use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
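Since `most_similar` ranks documents by cosine similarity, one possible sanity check is to compute that score directly between two vectors. The sketch below is plain Python and makes no claim about gensim internals beyond that; the function name `cosine_similarity` is my own. As far as I understand, inference is not fully deterministic, so inferring the same tokens twice and comparing the two vectors gives a rough ceiling for the score identical texts can reach.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical usage with an already trained or loaded model:
# v1 = model.infer_vector(tokens)
# v2 = model.infer_vector(tokens)   # same tokens, second inference run
# print(cosine_similarity(v1, v2))
```

If the same tokens inferred twice already score well below 1, the low score for the identical pair above would point at the model or the inference settings rather than at the texts themselves.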