
I want to draw a scatter plot of two categorical variables, as follows:

from matplotlib import pyplot as plt    
a=[1,1,1,1,2,2]
b=[2,2,2,2,1,1]
plt.scatter(a,b)

If I plot this, I see only two points (four overlapping at (1, 2) and two overlapping at (2, 1)), with no way to tell how many times each point occurs.


I would like a scatter plot in which the marker of the left point (1, 2) is twice as big as the marker of the right point (2, 1), so that the size reflects how often each point occurs. What is the correct way to do this (besides the trivial solution of counting the occurrences by hand and passing them to the size argument of plt.scatter)?

I have already searched other Stack Overflow questions, but they all suggest using alpha transparency, like here. I would rather use the marker size, because it shows the relative proportions of the occurrences more clearly.

One possible direction is a kernel density estimate, as suggested in this answer.

To give a bit more context: the two lists are the predictions of two classifiers, and I want to explore how their predictions differ in order to decide whether to ensemble them.

1 Answer


You can use the occurrence frequency of the x-values (or, for this particular data set, equally the y-values), which can be obtained with collections.Counter. The frequencies then serve as scaling factors for the marker sizes. The factor 200 is just an arbitrary large number that makes the markers clearly visible.

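from collections import Counter

from matplotlib import pyplot as plt

a = [1, 1, 1, 1, 2, 2]
b = [2, 2, 2, 2, 1, 1]

# Repeat each x-value's count once per occurrence, scaled by 200.
# Note: the sizes line up with the points only because equal values
# in `a` are grouped together, as they are in this data set.
weights = [200 * i for i in Counter(a).values() for j in range(i)]
plt.scatter(a, b, s=weights)
plt.show()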
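For data where equal values are not grouped together, or where x alone does not identify a point, the same idea can be applied to the (x, y) pairs instead. The following is only a minimal sketch under that assumption; the helper names weighted_points and plot_weighted_scatter and the scale parameter are made up for illustration and are not part of matplotlib.

```python
from collections import Counter


def weighted_points(x, y, scale=200):
    """Return unique (x, y) points and a marker size proportional to each count."""
    counts = Counter(zip(x, y))           # how often each (x, y) pair occurs
    xs = [p[0] for p in counts]           # unique points, in order of first appearance
    ys = [p[1] for p in counts]
    sizes = [scale * n for n in counts.values()]
    return xs, ys, sizes


def plot_weighted_scatter(x, y, scale=200):
    """Draw each unique point once, sized by its number of occurrences."""
    # Imported here so that weighted_points can be used without matplotlib.
    import matplotlib.pyplot as plt

    xs, ys, sizes = weighted_points(x, y, scale)
    plt.scatter(xs, ys, s=sizes)
    plt.show()
```

Calling plot_weighted_scatter(a, b) with the lists from the question draws two markers. Because the s argument of plt.scatter is an area in points squared, the marker at (1, 2) gets twice the area of the one at (2, 1), not twice the diameter.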


Another option for visualising the distribution is a bar chart:

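# Count how often each value of `a` occurs and draw one bar per value.
freqs = Counter(a)

# Convert the dict views to lists before passing them to matplotlib.
plt.bar(list(freqs.keys()), list(freqs.values()), width=0.5)
plt.xticks(list(freqs.keys()))
plt.show()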
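The bar chart above counts only the values of a. Since the question is about comparing the predictions of two classifiers, it may be more informative to count the joint (a, b) pairs. This is a rough sketch of that variation, not part of the original answer; pair_frequencies and plot_pair_bars are hypothetical helper names.

```python
from collections import Counter


def pair_frequencies(x, y):
    """Return a text label and an occurrence count for each unique (x, y) pair."""
    counts = Counter(zip(x, y))
    labels = [f"({px}, {py})" for px, py in counts]
    heights = list(counts.values())
    return labels, heights


def plot_pair_bars(x, y):
    """Draw one bar per unique (x, y) pair, with height equal to its count."""
    # Imported here so that pair_frequencies can be used without matplotlib.
    import matplotlib.pyplot as plt

    labels, heights = pair_frequencies(x, y)
    plt.bar(labels, heights, width=0.5)   # string labels give a categorical x-axis
    plt.ylabel("occurrences")
    plt.show()
```

With the data from the question, plot_pair_bars(a, b) would draw a bar of height 4 for the pair (1, 2) and a bar of height 2 for the pair (2, 1).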


Sheldore