
I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n = 100,000), I use a tool to count the number of words in each category and calculate a proportion score for each category by converting the raw word count into a percentage of the total words in the text, as sketched below the table.

|                | n-power | n-achieve | n-affiliation |
|----------------|---------|-----------|---------------|
| Document1      | 0.010   | 0.025     | 0.100         |
| Document2      | 0.045   | 0.010     | 0.050         |
| ⋮              | ⋮       | ⋮         | ⋮             |
| Document100000 | 0.100   | 0.020     | 0.010         |
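
(For concreteness, a minimal sketch of the proportion-score step; the category word lists here are hypothetical placeholders, not my actual dictionaries:)

```python
# Hypothetical category dictionaries; the real word lists are much larger.
CATEGORIES = {
    "n-power":       {"control", "dominate", "lead"},
    "n-achieve":     {"win", "succeed", "accomplish"},
    "n-affiliation": {"friend", "together", "belong"},
}

def proportion_scores(text: str) -> dict:
    """Fraction of tokens falling in each category, relative to total tokens."""
    tokens = text.lower().split()
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return {cat: sum(t in words for t in tokens) / total
            for cat, words in CATEGORIES.items()}
```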

For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?

SanMelkote
  • This is rather a general question, and it's not entirely clear what you want as output. My suggestion is to check the package **quanteda**, which you can find at quanteda.io – Francesco Grossetti May 27 '20 at 09:17

1 Answer


Essentially what you have is a clustering problem. You have already made a representation of each document with 3 numbers; let's call each one a vector (essentially you cooked up some embeddings). To do what you want, you can:

1. Calculate an average vector for the whole set: add up the numbers in each column and divide by the number of documents.
2. Pick a metric that reflects how well each document vector aligns with that average, such as Euclidean distance (`sklearn.metrics.pairwise.euclidean_distances`) or cosine distance (`sklearn.metrics.pairwise.cosine_distances`), where `X` is your list of document vectors and `Y` is a list containing the single average vector. A sketch of both steps follows below.

This is a good place to start.
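
A minimal sketch of the two steps, using the three example rows from the question's table in place of the real 100,000 × 3 matrix:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Toy stand-in for the real 100,000 x 3 matrix of proportion scores;
# rows = documents, columns = (n-power, n-achieve, n-affiliation).
scores = np.array([
    [0.010, 0.025, 0.100],
    [0.045, 0.010, 0.050],
    [0.100, 0.020, 0.010],
])

# 1) The "prototypical document": the column-wise average over all documents.
prototype = scores.mean(axis=0, keepdims=True)   # shape (1, 3)

# 2) Distinctiveness = distance of each document from the prototype.
euclid = euclidean_distances(scores, prototype).ravel()  # one score per document
cosine = cosine_distances(scores, prototype).ravel()
```

Larger values of either distance mean a document sits further from the prototypical document, i.e. is more distinctive.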

That said, if I were doing this myself, I would skip the average-vector approach, since you are in fact dealing with a clustering problem, and use KMeans instead; see the scikit-learn clustering guide for more. A sketch follows below.
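
A sketch of that alternative, reusing `scores` from the snippet above; the cluster count here is an arbitrary assumption you would tune on the real data (e.g. via silhouette scores):

```python
import numpy as np
from sklearn.cluster import KMeans

# n_clusters=2 is purely illustrative; tune it on the real data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)

labels = kmeans.labels_  # cluster assignment for each document
# The distance of each document to its own cluster centre can serve as a
# within-cluster distinctiveness measure.
dist_to_centre = np.linalg.norm(scores - kmeans.cluster_centers_[labels], axis=1)
```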

Hope this helps!

Aramakus
  • Is there a reason to choose cosine over Euclidean, or vice versa, in this specific case of 3-dimensional vectors? The reason I ask is that the two similarity metrics have very different correlations with other variables in my data. – SanMelkote Jun 22 '20 at 22:35