I have a corpus of book publications split into different clusters. I have information about the nationality of the authors (variable A) and the nationality of the publishing company (variable B).

In the case of variable B, publishing companies are either US-based or Euro-based (2 categories). In the case of variable A, authors are either American, European or others (3 categories).

I want to know whether each cluster is more Euro-centered or more US-centered than the overall corpus (basically, to identify clusters in which EU/US identity is important) and to plot each cluster on two axes according to variables A and B.

A positive value on the Y-axis would mean the cluster has an over-representation of EU authors, and a negative value an over-representation of US authors. Similarly, the X-axis would have a positive value when there is an over-representation of EU publishing companies and a negative value for US companies. (Because variable A has a third category, simply comparing proportions can show both US and EU authors as over-represented in the same cluster.)

I initially subtracted the corpus-wide ratio from the cluster ratio for variable A and plotted the resulting value on the y-axis, according to the following formula:

share_europeans_authors_cluster/share_US_authors_cluster - share_europeans_authors/share_US_authors

I did a similar thing for variable B on the x-axis, and got the following plot:

[Figure: Plot I would like]

I would like a better measure of what I am trying to do because my intuition is that there is something wrong with my approach. I tried using the log ratio, but it led to other issues.

Homard

1 Answer

You could use the log of the odds ratio, for example for the authors:

log( (share_europeans_authors_cluster/share_US_authors_cluster) / (share_europeans_authors/share_US_authors) )

Then, a value of zero means that the odds are the same in the cluster and in the corpus. Values greater than zero mean that the odds of being from the EU (rather than the US) are larger in the cluster than in the corpus. Negative values mean that the odds of being from the US are larger in the cluster than in the corpus.
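
As a rough illustration, the log odds ratio could be computed along these lines. This is only a sketch: the function name `log_odds_ratio`, the `pseudocount` parameter, and all the counts are hypothetical and not taken from the question. The "other" author category is simply left out, since the odds compare EU against US only.

```python
import math


def log_odds_ratio(eu_cluster, us_cluster, eu_corpus, us_corpus, pseudocount=0.0):
    """Log of (EU:US odds in the cluster) / (EU:US odds in the corpus).

    `pseudocount` is an optional assumption: adding a small constant
    (0.5 is a common choice) to every count avoids log(0) or division
    by zero when a cluster has no EU or no US items.
    """
    cluster_odds = (eu_cluster + pseudocount) / (us_cluster + pseudocount)
    corpus_odds = (eu_corpus + pseudocount) / (us_corpus + pseudocount)
    return math.log(cluster_odds / corpus_odds)


# Hypothetical counts for the whole corpus and for one cluster.
corpus = {"authors": {"EU": 400, "US": 500}, "publishers": {"EU": 450, "US": 550}}
cluster = {"authors": {"EU": 40, "US": 20}, "publishers": {"EU": 25, "US": 35}}

# y-axis: authors (variable A); x-axis: publishing companies (variable B).
y = log_odds_ratio(cluster["authors"]["EU"], cluster["authors"]["US"],
                   corpus["authors"]["EU"], corpus["authors"]["US"])
x = log_odds_ratio(cluster["publishers"]["EU"], cluster["publishers"]["US"],
                   corpus["publishers"]["EU"], corpus["publishers"]["US"])

print(f"x (publishers) = {x:.3f}, y (authors) = {y:.3f}")
```

Unlike a difference of ratios, this measure is symmetric: swapping the EU and US counts flips the sign but keeps the magnitude, so over-representation in either direction is shown on the same scale.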

DaveArmstrong