I have the following df:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame([
    ['A', 'X', '2020-10-01', 1],
    ['A', 'X', '2020-10-02', 2],
    ['A', 'X', '2020-10-03', 3],
    ['A', 'Y', '2020-10-01', 4],
    ['A', 'Y', '2020-10-02', 5],
    ['A', 'Y', '2020-10-03', 6],
    ['B', 'Z', '2020-10-01', 7],
    ['B', 'Z', '2020-10-02', 8],
    ['B', 'Z', '2020-10-03', 9],
    ['B', 'Z', '2020-10-01', 10],
    ['B', 'Z', '2020-10-02', 11],
    ['B', 'Z', '2020-10-03', 12],
],
    columns=['Q', 'W', 'DT', 'V']
)

I would like to create a scatter plot:

fig, ax = plt.subplots(figsize=(12, 8), frameon=False)
fig.suptitle('Plotz', fontsize=16)
ax.set_title('DF Plot')
ax.scatter(x=df.DT, y=df.W, s=df.V)

This created the following chart:

[image: the resulting scatter plot]

I would like to figure out what actually happens, since the graph shows 9 data points while the data contains 12. Annotating the chart does not help: it puts two values on each point in the top row.

for i, txt in enumerate(df.V):
    ax.annotate(txt, (df.DT[i], df.W[i]), fontsize=14)

Is there a way to figure out what really happens under the hood when there are multiple values for the x,y pair (like in this case)?

Update: Maybe I was not clear. What is the default behaviour of Matplotlib in this scenario? Does the last value win? How could I display on the plot the value that is actually drawn (unlike the annotate code above, which shows both values)?

After googling around some more, I think this is the answer:

Visualization of scatter plots with overlapping points in matplotlib

Istvan
  • The `Z` points are overlapping. – CDJB Feb 10 '20 at 16:27
  • That's because your Z values are duplicated. So out of 6 Z values, you only get 3 – Sheldore Feb 10 '20 at 16:27
  • Yes, what happens? Bigger value wins, last value wins? Mean is calculated? What is the default behaviour when there are overlapping values? Can I control the behaviour? – Istvan Feb 10 '20 at 16:53

1 Answer

What normally happens is that the dots are plotted in the order they are encountered, one over the other. If there is no transparency, only the last one plotted is fully visible; earlier ones show at most a rim around it, and only if they were larger.

Therefore, one way to debug this kind of situation is to set an alpha value, which makes the dots transparent. Dots stacked on top of each other then look darker and show a visible rim.

With the given test data, the code below blows up the dot size and sets an alpha. As the dots become extremely large, the axes limits need to be adjusted. Using multiple colors would emphasize the overlap even more.

ax.scatter(x=df.DT, y=df.W, s=df.V*150, alpha=0.4)
plt.xlim(-1,3)
plt.ylim(-1,3)

[image: resulting plot]
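The remark about multiple colors could look roughly like the sketch below. It is only an illustration: coloring each dot by its row number and the choice of the `tab20` colormap are assumptions, not something the code above prescribes.

import matplotlib.pyplot as plt
import pandas as pd

# Same data as in the question, written compactly.
df = pd.DataFrame([
    ['A', 'X', '2020-10-01', 1], ['A', 'X', '2020-10-02', 2], ['A', 'X', '2020-10-03', 3],
    ['A', 'Y', '2020-10-01', 4], ['A', 'Y', '2020-10-02', 5], ['A', 'Y', '2020-10-03', 6],
    ['B', 'Z', '2020-10-01', 7], ['B', 'Z', '2020-10-02', 8], ['B', 'Z', '2020-10-03', 9],
    ['B', 'Z', '2020-10-01', 10], ['B', 'Z', '2020-10-02', 11], ['B', 'Z', '2020-10-03', 12],
], columns=['Q', 'W', 'DT', 'V'])

fig, ax = plt.subplots(figsize=(12, 8))
# Assumption: one color per row, so two dots at the same position differ in color.
row_numbers = list(range(len(df)))
dots = ax.scatter(x=df.DT, y=df.W, s=df.V * 150, c=row_numbers, cmap='tab20', alpha=0.4)
# The dots are very large, so widen the axes as in the code above.
ax.set_xlim(-1, 3)
ax.set_ylim(-1, 3)

With distinct colors, a position holding two dots shows a blend of both colors rather than just a darker shade of one.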

Another approach is to add jitter, i.e. some small random noise added to each dot position. With numerical data, the jitter can be added directly to the data. With categorical data, the positions can be modified after calling scatter:

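import numpy as np
dots = ax.scatter(x=df.DT, y=df.W, s=df.V)
# scatter has already converted the categories to numeric positions
offsets = dots.get_offsets()
# shift every dot by a small random amount in x and y
jittered_offsets = offsets + np.random.uniform(-0.1, 0.1, offsets.shape)
dots.set_offsets(jittered_offsets)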

With the original colors and sizes, and without alpha, this clearly draws attention to the dots that overlapped:

[image: jittered plot]
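For the numerical case mentioned above, the jitter can be applied to the data before plotting. A minimal sketch, assuming made-up numeric arrays `x` and `y` and an arbitrary noise amplitude of 0.1 (neither comes from the question's data):

import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Hypothetical numeric data with repeated (x, y) positions.
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 3.0])

jitter = 0.1  # assumed amplitude; keep it small relative to the spacing of the data
x_jittered = x + rng.uniform(-jitter, jitter, x.shape)
y_jittered = y + rng.uniform(-jitter, jitter, y.shape)

The jittered arrays would then be passed to `ax.scatter` in place of the original ones. Only the plotted positions change, so the noise should stay well below the distance between distinct values.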

Yet another approach, when both axes are categorical, is to count the dots at each position and encircle the positions that appear more than once:

import collections
dots = ax.scatter(x=df.DT, y=df.W, s=df.V)
offsets = dots.get_offsets()
# count how many dots share each (x, y) position
counts = collections.Counter((x, y) for x, y in offsets)
# positions holding two or more dots
suspects = [p for p in counts if counts[p] >= 2]
# draw an empty crimson circle around each of those positions
ax.scatter([x for x, _ in suspects], [y for _, y in suspects], ec='crimson', lw=1, fc='none', s=50)

[image: encircled plot]

Of course, the different approaches (alpha, colors, jittering, encircling) can be combined depending on the specifics of the actual data.
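To label each position with only the value that ends up on top, one option is to look at the data itself rather than the plot. The sketch below assumes the drawing order described above (rows are drawn in order, so the last row at a position covers the earlier ones); the names `overlapping` and `top` are introduced here for illustration.

import pandas as pd

# Same data as in the question, written compactly.
df = pd.DataFrame([
    ['A', 'X', '2020-10-01', 1], ['A', 'X', '2020-10-02', 2], ['A', 'X', '2020-10-03', 3],
    ['A', 'Y', '2020-10-01', 4], ['A', 'Y', '2020-10-02', 5], ['A', 'Y', '2020-10-03', 6],
    ['B', 'Z', '2020-10-01', 7], ['B', 'Z', '2020-10-02', 8], ['B', 'Z', '2020-10-03', 9],
    ['B', 'Z', '2020-10-01', 10], ['B', 'Z', '2020-10-02', 11], ['B', 'Z', '2020-10-03', 12],
], columns=['Q', 'W', 'DT', 'V'])

# Rows whose (DT, W) position is shared with at least one other row.
overlapping = df[df.duplicated(subset=['DT', 'W'], keep=False)]

# Assumption: later rows are drawn over earlier ones, so the last row
# per position is the one left on top.
top = df.drop_duplicates(subset=['DT', 'W'], keep='last')

print(overlapping)
print(top)

Annotating from `top` instead of `df`, for example with `for row in top.itertuples(): ax.annotate(row.V, (row.DT, row.W))`, would then place a single label per position.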

JohanC