I have created a boxplot from a pandas DataFrame and now want to mark specific values in the same plot with an "X" (hopefully in screaming red!).

Some data:

import pandas as pd
import seaborn as sns

df = pd.DataFrame(
[
[2, 4, 5, 6, 1],
[4, 5, 6, 7, 2],
[5, 4, 5, 5, 1],
[10, 4, 7, 8, 2],
[9, 3, 4, 6, 2],
[3, 3, 4, 4, 1]
], columns=['a1', 'a2', 'a3', 'a4', 'b'])

mark_values = pd.DataFrame(
[
[4,1],
[8.25,2]
], columns=['a1', 'b'])

df_long = pd.melt(df, "b", var_name="a", value_name="c")
g = sns.boxplot(x='c', y='a', hue='b', data=df_long, 
palette=sns.color_palette("Blues_d"), orient='h')
sns.despine(left=True)

This generates a boxplot. I would now like to add markers as red crosses, e.g. marking category a1, subgroup 1 with an X at 4 and subgroup 2 with an X at 8.25, and so on, while still keeping my nice boxplots.

The values to mark should be defined and stored in a DataFrame like mark_values above, for example:

mark_values

Out[1]: 
     a1  b
0  4.00  1
1  8.25  2

Any easy solution to this?

Thanks

MattR
gussilago

2 Answers

Since seaborn is built on matplotlib, sns.boxplot returns a matplotlib Axes and you can use its text method:

import pandas as pd
import seaborn as sns

df = pd.DataFrame(
[
[2, 4, 5, 6, 1],
[4, 5, 6, 7, 2],
[5, 4, 5, 5, 1],
[10, 4, 7, 8, 2],
[9, 3, 4, 6, 2],
[3, 3, 4, 4, 1]
], columns=['a1', 'a2', 'a3', 'a4', 'b'])

mark_values = pd.DataFrame(
[
[2,1],
[8.25,2]
], columns=['a1', 'b'])

df_long = pd.melt(df, "b", var_name="a", value_name="c")
g = sns.boxplot(x='c', y='a', hue='b', data=df_long, 
palette=sns.color_palette("Blues_d"), orient='h')
sns.despine(left=True)
g.text(4,0.1,'X', fontsize=50, color='red')
g.text(8.25,.5,'X', fontsize=50, color='red')


The x coordinate is simply the value from c. For the y coordinate, you can use get_ylim() to find the range of the axis, and np.linspace to get evenly spaced values across that range:

import numpy as np
print(g.get_ylim())
print(str(g.get_ylim()[0]) + ' is the low value')
print(str(g.get_ylim()[1]) + ' is the high value')
print(np.linspace(g.get_ylim()[0], g.get_ylim()[1], 4))

Please also note that the bottom-left corner of the 'X' is placed at the exact (x, y) coordinates you pass in, so with a font size of 50 the glyph is large enough that the X looks "off". You may need to play around with the coordinates so that the 'X' ends up in the right spot. From your question I'm unsure how big you want the X to be.

Compare the two markers here. Offsets of -0.08 and +0.1, applied as fractions of the x and y values, seemed to work well for a font size of 30. The green "X" uses these adjusted values.

g.text(4,2.1666,'X', fontsize=30, color='red')
g.text(4 - (4*.08) ,2.1666 + (2.1666 * .1),'X', fontsize=30, color='green')

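As an alternative to hand-tuned offsets, matplotlib's text method accepts alignment keywords that put the center of the glyph on the given point. The following is a minimal sketch on a bare Axes rather than on the boxplot itself; the axis limits are placeholders chosen for illustration, and the same keywords should apply to g.text, since g is a matplotlib Axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; omit when working interactively
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(0, 10)        # placeholder limits, not taken from the boxplot
ax.set_ylim(-0.5, 3.5)

# ha/va decide which point of the text sits on (x, y).
# The defaults are ha="left" and va="baseline", which is why a large "X" looks shifted.
marker = ax.text(4, 0.1, "X", fontsize=30, color="red",
                 ha="center", va="center")
```

With the glyph centered this way, the font size can be changed without re-tuning the coordinates.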

MattR
  • Yes, that would work, but you wouldn't be very specific about where to put the actual marker... Say I want a marker on 'a3', then I would need to guess what my y-value would be. Right? – gussilago Dec 14 '17 at 15:04
  • @gussilago, check my edit. You could get fancy with `get_ylim()`, such as dividing the `Y` range by the number of categories to estimate the spot on the graph. There are probably more elegant ways, but this works okay in my testing. – MattR Dec 14 '17 at 15:36

First, I guess it makes sense to give mark_values a column that specifies which "a" is to be marked, e.g. to mark "a1", put 1 in the a column.

      c  a  b
0  2.00  1  1
1  8.25  1  2

You may then draw a scatter plot with an "x" as marker, where the horizontal coordinate is column c and the vertical coordinate is given by

y = (a-1)+(b-1.5)*0.4

To explain that:

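  • a starts at 1, but the first category is plotted at y = 0,
  • the mean of all b values here is 1.5, so b - 1.5 is either -0.5 or +0.5,
  • the two boxes of one category together span a width of 0.8 (seaborn's default), so half of that width, 0.4, is the distance between the two box centers.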
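The three points above can be written as a small helper, shown here only as a sketch. The function name dodge_y and its parameters are hypothetical, and it assumes that seaborn splits a total group width of 0.8 evenly among the hue levels, which matches the two-level case described in this answer.

```python
def dodge_y(a, b, hue_levels=(1, 2), group_width=0.8):
    """Return the y position of the box for category number `a` (1-based) and hue value `b`.

    Hypothetical helper: assumes the hue boxes of one category split
    `group_width` evenly, as in the two-level case of this answer.
    """
    n = len(hue_levels)
    box_width = group_width / n          # 0.4 for two hue levels
    rank = list(hue_levels).index(b)     # 0-based position of the hue level
    # Center the hue boxes around the category position (a - 1).
    return (a - 1) + (rank - (n - 1) / 2) * box_width


# For two hue levels this reduces to (a - 1) + (b - 1.5) * 0.4.
positions = [dodge_y(1, 1), dodge_y(1, 2), dodge_y(3, 2)]
```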

In total this gives:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


df = pd.DataFrame( [[2, 4, 5, 6, 1],
                    [4, 5, 6, 7, 2],
                    [5, 4, 5, 5, 1],
                    [10, 4, 7, 8, 2],
                    [9, 3, 4, 6, 2],
                    [3, 3, 4, 4, 1]], 
                columns=['a1', 'a2', 'a3', 'a4', 'b'])

mark_values = pd.DataFrame( [ [2,1,1], [8.25,1,2], [4,3,2] ], columns=['c',"a",'b'])
print(mark_values)
df_long = pd.melt(df, "b", var_name="a", value_name="c")

ax = sns.boxplot(x='c', y='a', hue='b', data=df_long, 
                palette=sns.color_palette("Blues_d"), orient='h')
sns.despine(left=True)

y = (mark_values["a"].values - 1)+(mark_values["b"].values-1.5)*0.4
ax.scatter(mark_values["c"].values, y, marker="x", c="red", s=400, lw=6)

plt.show()

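If the marks should stay in the wide layout used in the question (one column per category plus b), they could be reshaped with the same pd.melt call as the data and the y positions looked up by name. This is a sketch under assumptions: the lists categories and hue_levels, the column name y, and the default box-group width of 0.8 are introduced here for illustration, and the value 4 for a1 follows the output printed in the question.

```python
import pandas as pd

# Wide layout from the question: one column per marked category, plus the hue column "b".
mark_values = pd.DataFrame([[4, 1], [8.25, 2]], columns=["a1", "b"])

# Same reshaping as for the plotted data: one row per (b, a, c) combination.
marks_long = pd.melt(mark_values, "b", var_name="a", value_name="c")

categories = ["a1", "a2", "a3", "a4"]   # assumed order of the boxes along the y axis
hue_levels = [1, 2]                     # assumed order of the hue groups
box_width = 0.8 / len(hue_levels)       # assumes seaborn's default group width of 0.8

# Look up the row of each category and the rank of each hue level by name.
row_of = {name: i for i, name in enumerate(categories)}
rank_of = {level: i for i, level in enumerate(hue_levels)}

row = marks_long["a"].map(row_of)
rank = marks_long["b"].map(rank_of)
marks_long["y"] = row + (rank - (len(hue_levels) - 1) / 2) * box_width
```

The crosses could then be drawn with ax.scatter(marks_long["c"], marks_long["y"], marker="x", c="red", s=400, lw=6, zorder=10), where zorder is only there to keep the markers above the boxes.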

ImportanceOfBeingErnest