How to debug a scatter plot in Matplotlib?

Question

I have the following df:

import pandas as pd

df = pd.DataFrame([
    ['A', 'X', '2020-10-01', 1],
    ['A', 'X', '2020-10-02', 2],
    ['A', 'X', '2020-10-03', 3],
    ['A', 'Y', '2020-10-01', 4],
    ['A', 'Y', '2020-10-02', 5],
    ['A', 'Y', '2020-10-03', 6],
    ['B', 'Z', '2020-10-01', 7],
    ['B', 'Z', '2020-10-02', 8],
    ['B', 'Z', '2020-10-03', 9],
    ['B', 'Z', '2020-10-01', 10],
    ['B', 'Z', '2020-10-02', 11],
    ['B', 'Z', '2020-10-03', 12],
],
    columns=['Q', 'W', 'DT', 'V']
)

I would like to create a scatter plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8), frameon=False)
fig.suptitle('Plotz', fontsize=16)
ax.set_title('DF Plot')
ax.scatter(x=df.DT, y=df.W, s=df.V)

This created the following chart:

I would like to figure out what actually happens, since there are 9 data points on the graph while there are 12 data points in the data. Annotating the chart does not help: it puts 2 values on each point in the top row.

for i, txt in enumerate(df.V):
    ax.annotate(txt, (df.DT[i], df.W[i]), fontsize=14)

Is there a way to figure out what really happens under the hood when there are multiple values for the x,y pair (like in this case)?

Update: Maybe I was not clear. What is the default behaviour of Matplotlib in this scenario? Does the last value win? How could I display the actual value on the plot? (That is, something that shows the real value, unlike the annotate code, which shows both values.)

After googling around some more, I think this is the answer:

Visualization of scatter plots with overlapping points in matplotlib

That's because your Z values are duplicated. So out of 6 Z values, you only get 3 — Sheldore, Feb 10 '20 at 16:27
Yes, what happens? Bigger value wins, last value wins? Mean is calculated? What is the default behaviour when there are overlapping values? Can I control the behaviour? — Istvan, Feb 10 '20 at 16:53

JohanC · Accepted Answer · 2020-02-11T13:37:17.397

What normally happens is that the dots are plotted in the order they are encountered, one over the other. If there is no transparency, the last one plotted will be visible, and the earlier ones will only show some border if they were larger.
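A quick way to check this for the data in the question is to inspect the collection that scatter returns. The sketch below assumes a headless Agg backend and rebuilds the question's DataFrame compactly. It suggests that all 12 points are kept, in row order, while only 9 distinct positions exist. Since each later Z value (10, 11, 12) is larger than the one it covers (7, 8, 9), the earlier dots end up completely hidden.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The question's data, rebuilt compactly: 12 rows, with Z appearing twice per date.
dates = ['2020-10-01', '2020-10-02', '2020-10-03']
df = pd.DataFrame({
    'Q': ['A'] * 6 + ['B'] * 6,
    'W': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 6,
    'DT': dates * 4,
    'V': range(1, 13),
})

fig, ax = plt.subplots()
dots = ax.scatter(x=df.DT, y=df.W, s=df.V)

# scatter stores one offset and one size per row; nothing is aggregated or dropped.
offsets = np.asarray(dots.get_offsets())
n_points = len(offsets)
n_positions = len(np.unique(offsets, axis=0))
print(n_points, n_positions)
```

So there is no "mean" or "bigger value wins" rule: every row is drawn, and visibility depends only on draw order and size.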

Therefore, one approach to debugging this kind of situation is to set an alpha value that makes the dots transparent. Multiple dots on top of each other will appear darker and show some border.

With the given test data, the code below blows up the size and sets an alpha. As the dot size becomes extremely large, the axis limits need to be adjusted. Using multiple colors would emphasize the overlap even more.

ax.scatter(x=df.DT, y=df.W, s=df.V*150, alpha=0.4)
plt.xlim(-1,3)
plt.ylim(-1,3)
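To try the multi-color idea, one option is to color each dot by its row index with a qualitative colormap. This is only a sketch: the 'tab20' colormap and the index-based coloring are assumptions chosen for illustration, and the DataFrame is the question's data rebuilt compactly.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = ['2020-10-01', '2020-10-02', '2020-10-03']
df = pd.DataFrame({
    'Q': ['A'] * 6 + ['B'] * 6,
    'W': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 6,
    'DT': dates * 4,
    'V': range(1, 13),
})

fig, ax = plt.subplots(figsize=(12, 8))
# One color value per row, so two dots at the same position differ in hue
# and the semi-transparent overlap shows up as a blend of both.
dots = ax.scatter(x=df.DT, y=df.W, s=df.V * 150, alpha=0.4,
                  c=np.arange(len(df)), cmap='tab20')
ax.set_xlim(-1, 3)
ax.set_ylim(-1, 3)
fig.canvas.draw()  # render once; use plt.show() or fig.savefig(...) to view it
```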

Another approach is adding jitter: some small random noise added to each dot position. For numerical data, one can add the jitter directly to the data. For categorical data, the positions can be modified after calling scatter:

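import numpy as np

dots = ax.scatter(x=df.DT, y=df.W, s=df.V)
# Categorical values are mapped to numeric positions; read those back.
offsets = dots.get_offsets()
# Shift each dot by a small random amount so coinciding dots separate.
jittered_offsets = offsets + np.random.uniform(-0.1, 0.1, offsets.shape)
dots.set_offsets(jittered_offsets)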

With the original colors and sizes, and without alpha, this would clearly draw attention to the dots that overlapped.
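For the numerical case mentioned earlier, the noise can go straight into the coordinates before plotting. A minimal sketch, assuming made-up data (x, y) and a jitter width of 0.1, neither of which comes from the question:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the jitter is reproducible

# Hypothetical numerical data with exact duplicates at (1, 1) and (2, 3).
x = np.array([1, 1, 2, 2, 3], dtype=float)
y = np.array([1, 1, 3, 3, 2], dtype=float)

# Jitter the data itself; keep the noise well below the spacing of real values.
x_jit = x + rng.uniform(-0.1, 0.1, size=x.shape)
y_jit = y + rng.uniform(-0.1, 0.1, size=y.shape)

fig, ax = plt.subplots()
ax.scatter(x_jit, y_jit, s=80, alpha=0.6)
```

Jitter distorts the plotted positions slightly, so it is best treated as a debugging aid rather than a final presentation.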

Still another approach, in case both axes are categorical, is to count the dots at each position and encircle the positions that appear multiple times:

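import collections

dots = ax.scatter(x=df.DT, y=df.W, s=df.V)
offsets = dots.get_offsets()
# Count how many dots share each (x, y) position.
counts = collections.Counter((x, y) for x, y in offsets)
# Positions holding two or more dots are the suspects.
suspects = [p for p in counts if counts[p] >= 2]
# Draw a hollow red circle around each suspect position.
ax.scatter([x for x, _ in suspects], [y for _, y in suspects], ec='crimson', lw=1, fc='none', s=50)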
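The same count can also be taken on the DataFrame before plotting, which keeps track of the values that collide. The sketch below groups the question's data by position and writes one combined label per position instead of one label per row. Joining the values with '+' is just an illustrative choice, not something from the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

dates = ['2020-10-01', '2020-10-02', '2020-10-03']
df = pd.DataFrame({
    'Q': ['A'] * 6 + ['B'] * 6,
    'W': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 6,
    'DT': dates * 4,
    'V': range(1, 13),
})

# Collect every V that shares an (x, y) position.
stacked = df.groupby(['DT', 'W'])['V'].agg(list)
# Keep only the positions that hold more than one value.
overlaps = stacked[stacked.map(len) > 1]
print(overlaps)

fig, ax = plt.subplots()
ax.scatter(x=df.DT, y=df.W, s=df.V)
# One label per position, listing all values drawn there.
for (dt, w), values in stacked.items():
    ax.annotate('+'.join(map(str, values)), (dt, w), fontsize=14)
```

If a single value per position is wanted instead, the same groupby could use an aggregate such as 'last' or 'max' in place of list.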

Of course, the different approaches (alpha, colors, jittering, encircling) can be combined depending on the specifics of the actual data.

Thanks for this amazing answer! I just implemented the jitter way and it works very well. — Istvan, Feb 13 '20 at 07:15
