
I'd like to form a representation of users based on the last N documents they have liked.

So I'm planning on using Doc2Vec to form a representation of each document, but I'm trying to figure out a good way to place users in the same space.

Something as simple as averaging the vectors of the last 5 documents a user consumed springs to mind, but I'm not sure if that might be a bit naive. Maybe some sort of k-NN approach in the space might be possible.

Then I'm wondering: the same way we just use a doc ID in Doc2Vec, how crazy would it be to just add in a user ID token, and in that way get a representation of a user in much the same way as a document?

I've not been able to find much on using word2vec-style embeddings to come up with both document vectors and user vectors that can then be used in a sort of vector-space-model approach.

Does anyone have any pointers or suggestions?

andrewm4894

2 Answers


It's reasonable to try Doc2Vec for analyzing such user-to-document relationships.

You could potentially represent a user by the average of the last N docs they consumed, as you suggest. Or by the average of all docs they consumed. Or perhaps by M centroids chosen to minimize the distances to the last N documents consumed. But which works well for your data/goals can only be found by exploratory experimentation.
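For instance, the simple averaging approach might look like this minimal sketch, assuming a trained gensim 4.x Doc2Vec model (whose doc-vectors live in model.dv; older gensim versions used model.docvecs) and a hypothetical last_doc_ids list for one user:

```python
import numpy as np

def user_vector(model, last_doc_ids):
    """Represent a user as the mean of their last-N consumed doc-vectors."""
    vecs = [model.dv[doc_id] for doc_id in last_doc_ids]
    return np.mean(vecs, axis=0)

# Place the user in the doc-vector space, then find nearby documents:
user_vec = user_vector(model, ["doc123", "doc456", "doc789"])  # hypothetical doc IDs
print(model.dv.most_similar([user_vec], topn=10))
```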

You could try adding user-tags to whatever other doc-ID tags (or doc-category tags) are provided during bulk Doc2Vec training. But beware that adding more tags means a larger model, and in some rough sense "dilutes" the meaning that can be extracted from the corpus, or allows for overfitting based on idiosyncrasies of seldom-occurring tags (rather than the desired generalization that's forced when a model is smaller). So if you have lots of user-tags, and perhaps lots of user-tags that are only applied to a small subset of documents, the overall quality of the doc-vectors may suffer.
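If you do experiment with user-tags, gensim's TaggedDocument accepts multiple tags per document, so one hedged sketch (assuming a hypothetical corpus of (doc_id, user_ids, text) records) would be:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document carries its own ID plus a tag per user who consumed it,
# so user vectors get trained in the same space as doc vectors.
tagged_docs = [
    TaggedDocument(words=text.lower().split(),
                   tags=[doc_id] + [f"USER_{u}" for u in user_ids])
    for doc_id, user_ids, text in corpus  # `corpus` is a hypothetical iterable
]

model = Doc2Vec(tagged_docs, vector_size=100, epochs=20, min_count=2)

# After training, user and doc vectors are directly comparable:
# model.dv["USER_42"], model.dv.most_similar("USER_42"), etc.
```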

One other interesting (but expensive-to-calculate) technique in the Word2Vec space is "Word Mover's Distance" (WMD), which compares texts based on the cost to shift all of one text's meaning, represented as a series of piles-of-meaning at the vector positions of each word, to match another's piles. (Shifting a word's pile to a nearby word-vector is cheap; shifting it to a distant one is expensive. The calculation finds the optimal set of shifts and reports its cost, with lower costs indicating more-similar texts.)

It strikes me that sets of doc-vectors could be treated the same way: the bag-of-doc-vectors associated with one user need not be reduced to any single average vector, but could instead be compared, via WMD, to another user's bag-of-doc-vectors, or even to single doc-vectors. (There's support for WMD in the wmdistance() method of gensim's KeyedVectors, but not directly on Doc2Vec classes, so you'd need to do some manual object/array juggling or other code customization to adapt it.)
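One way to do that juggling, sketched under the assumption of gensim 4.x (where KeyedVectors has an add_vectors() method) and with hypothetical per-user doc-ID lists, is to copy the trained doc-vectors into a fresh KeyedVectors and treat each user's doc-ID list as a "text" whose "words" are doc IDs:

```python
from gensim.models import KeyedVectors

# Build a word-vector-like store whose "vocabulary" is the doc IDs.
doc_kv = KeyedVectors(vector_size=model.vector_size)
doc_kv.add_vectors(model.dv.index_to_key, model.dv.vectors)

user_a_docs = ["doc1", "doc7", "doc42"]  # hypothetical consumption histories
user_b_docs = ["doc7", "doc99"]

# WMD needs an optimal-transport backend installed (pyemd or POT,
# depending on gensim version). Lower cost = more-similar users.
distance = doc_kv.wmdistance(user_a_docs, user_b_docs)
print(distance)
```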

gojomo

Learning user embeddings using Doc2Vec is a well-known technique. There is a comprehensive article on user2vec models that describes both word2vec and doc2vec approaches, as well as many other useful techniques.

user1128016