I have a corpus of book publications split into different clusters. I have information about the nationality of the authors (variable A) and the nationality of the publishing company (variable B).
In the case of variable B, publishing companies are either US-based or Euro-based (2 categories). In the case of variable A, authors are either American, European or others (3 categories).
I want to know whether each cluster is more Euro-centered or more US-centered than the overall corpus (basically, to identify clusters in which EU/US identity is important), and to plot each cluster on two axes according to variables A and B.
A positive value on the Y-axis would mean the cluster has an over-representation of EU authors, and a negative value an over-representation of US authors. Similarly, the X-axis would be positive for an over-representation of EU publishing companies and negative for an over-representation of US companies. (Because variable A has three categories, simply comparing proportions with the corpus can show both US and EU authors as over-represented at the same time, for example when "others" is under-represented in the cluster.)
I initially computed the EU/US ratio for variable A within the cluster, subtracted the same ratio for the overall corpus, and plotted the resulting value on the Y-axis, according to the following formula:
(share_europeans_authors_cluster/share_US_authors_cluster - share_europeans_authors/share_US_authors)
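To make the formula concrete, here is a minimal sketch of the computation. The function name and all the shares are made up for illustration; they are not taken from my data.

```python
def ratio_difference(eu_cluster, us_cluster, eu_corpus, us_corpus):
    """EU/US ratio inside the cluster minus the EU/US ratio in the whole corpus."""
    return eu_cluster / us_cluster - eu_corpus / us_corpus

# Hypothetical author shares (EU, US); the remainder is "others".
corpus_eu, corpus_us = 0.40, 0.50      # corpus: 40% EU, 50% US, 10% others
cluster_eu, cluster_us = 0.50, 0.25    # cluster: 50% EU, 25% US, 25% others

y = ratio_difference(cluster_eu, cluster_us, corpus_eu, corpus_us)
print(round(y, 3))  # 0.5/0.25 - 0.4/0.5 = 2.0 - 0.8 = 1.2
```

Note that this quantity cannot go below minus the corpus ratio (reached when a cluster has no EU authors), has no upper bound, and is undefined when a cluster has no US authors.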
I did a similar thing for variable B and the X-axis, and got the following plot:
I would like a better measure of what I am trying to do because my intuition is that there is something wrong with my approach. I tried using the log ratio, but it led to other issues.
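For reference, the log-ratio variant could look like the sketch below. This is only one possible reading of "log ratio" (the log of the cluster's EU/US ratio divided by the corpus EU/US ratio); the function name and the shares are hypothetical.

```python
import math

def log_ratio(eu_cluster, us_cluster, eu_corpus, us_corpus):
    """Log of (cluster EU/US ratio) divided by (corpus EU/US ratio)."""
    return math.log((eu_cluster / us_cluster) / (eu_corpus / us_corpus))

# Hypothetical shares: the cluster leans EU relative to the corpus.
y = log_ratio(0.50, 0.25, 0.40, 0.50)

# Swapping the EU and US roles everywhere gives the same magnitude with the opposite sign.
y_mirror = log_ratio(0.25, 0.50, 0.50, 0.40)

print(round(y, 3), round(y_mirror, 3))
```

Unlike the difference of ratios, this value is symmetric around zero, but it is undefined whenever one of the shares is zero.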