
I am working on a project that tries to extend the LIWC dictionary to fit our local language mix (English, Indonesian, Malay and Chinese). We use a word embedding model to find words similar to those already in the LIWC dictionary, then calculate scores based on the new dictionary.

The original output from the LIWC dictionary looks like this:

[53.2, 11.2,..., 85.01]

which represents the proportion of words belonging to each category. The categories include:

['Function', 'Pronoun', 'Ppron', 'I', 'We', 'You', ... ,'Netspeak', 'Assent', 'Nonflu', 'Filler']

After extending the LIWC dictionary, I want to test whether we get output similar to that of the original LIWC. However, because the extension only adds words to the dictionary, the proportion for each category will tend to increase. Therefore, instead of directly comparing the two sets of scores, I think it makes more sense to compare the relations between the variables.
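To make the scoring step concrete, here is a minimal sketch of how per-category proportions might be computed and why an extended dictionary shifts them upward. The function name `category_scores`, the two tiny dictionaries, and the token list are all made up for illustration; real LIWC dictionaries are much larger and also use wildcard stems, which this sketch ignores.

```python
def category_scores(tokens, dictionary):
    """Return, per category, the percentage of tokens found in that category."""
    total = len(tokens)
    scores = {}
    for category, words in dictionary.items():
        hits = sum(1 for tok in tokens if tok in words)
        scores[category] = 100.0 * hits / total if total else 0.0
    return scores


# Hypothetical toy dictionaries: the extended one only adds words.
original = {"I": {"i", "me"}, "Assent": {"yes", "ok"}}
extended = {"I": {"i", "me", "saya", "aku"},
            "Assent": {"yes", "ok", "ya", "boleh"}}

# Hypothetical mixed-language text, already tokenized and lowercased.
tokens = "saya ok i think ya boleh me too".split()

scores_original = category_scores(tokens, original)
scores_extended = category_scores(tokens, extended)
print(scores_original)
print(scores_extended)
```

Since the extended dictionary is a superset of the original, each category score can only stay the same or go up, which is why comparing raw values directly is not very informative.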

More precisely, say I have the original output dist1,

[d1v1, d1v2, ..., d1vp]

and the output from our extended dictionary, dist2,

[d2v1, d2v2, ..., d2vp] 

where p represents the number of categories. Is there a test that can help me show whether the relations between the variables in dist1 are similar to those in dist2?

Margies Lo
  • I am thinking of dividing the scores in each dist by its highest score to get normalized scores, then using a paired t-test to test whether there is a significant difference between the two dists. However, I'm not sure whether this makes any sense... – Margies Lo Oct 03 '17 at 06:25
  • Maybe you can look at cross entropy or something related to that such as Kullback-Leibler distance. You'll get more interest in this question at stats.stackexchange.com. – Robert Dodier Oct 03 '17 at 20:39
  • @RobertDodier Thank you so much! I'll try the two indices and take a look at the website. – Margies Lo Oct 04 '17 at 01:06
  • It is called mutual information: https://en.wikipedia.org/wiki/Mutual_information which leads to Kullback-Leibler divergence – Severin Pappadeux Oct 04 '17 at 17:40
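
The ideas raised in the comments (max-normalization followed by a paired t-test, and Kullback-Leibler divergence) can be sketched as follows. This is only an illustration under stated assumptions, not a recommendation of either approach: the function names are made up, the KL version rescales each score vector to sum to 1 (LIWC categories overlap, so the raw scores are not a probability distribution), and `eps` is an arbitrary floor to avoid dividing by zero.

```python
import math


def max_normalize(scores):
    """Divide every score by the largest score in the vector."""
    top = max(scores)
    return [s / top for s in scores]


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i].

    Assumes at least two pairs and non-constant differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


def kl_divergence(p_scores, q_scores, eps=1e-12):
    """KL(P || Q) in nats after rescaling both vectors to sum to 1."""
    p_total = sum(p_scores)
    q_total = sum(q_scores)
    kl = 0.0
    for p_raw, q_raw in zip(p_scores, q_scores):
        p = p_raw / p_total
        q = q_raw / q_total
        if p > 0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p / max(q, eps))
    return kl


# Made-up score vectors: dist2 is exactly twice dist1.
dist1 = [1.0, 2.0, 3.0]
dist2 = [2.0, 4.0, 6.0]
print(max_normalize(dist1), max_normalize(dist2))
print(kl_divergence(dist1, dist2))
```

In this toy case the extended scores are a uniform scaling of the original ones, so both normalizations map them to the same vector and the divergence is zero; a non-uniform shift across categories would produce a positive value.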
