I want to make a scatter plot in Python coloured by a categorical variable, where points with a missing value for that variable are still plotted.

Using the iris dataset as an example:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
iris = sns.load_dataset('iris')

Seaborn can plot by colour:

sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False)

With a little more work, so can matplotlib (taken from this answer):

colours = {'setosa':'skyblue', 'versicolor':'orangered', 'virginica':'forestgreen'}
plt.scatter(iris.sepal_length, iris.sepal_width, c=iris.species.apply(lambda x:colours[x]))

But neither will plot points whose colour variable is missing. If we set the species variable (which we use to colour the plot) to np.nan for one species, seaborn doesn't plot those points and matplotlib won't plot anything.

iris.loc[iris.species == 'setosa', 'species'] = np.nan

sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False)
iris.plot('sepal_length', 'sepal_width', kind="scatter", c=iris.species.apply(lambda x:colours[x]))
D A Wells

1 Answer

I haven't found a solution with seaborn, but you can tweak the lambda function to make it work in matplotlib. If the species is in your colour dictionary, the function looks the colour up there; if the species is null, it returns a separate fallback colour.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')

#colour dictionary
colours = {'setosa':'skyblue', 'versicolor':'orangered', 'virginica':'forestgreen'}

col_convert = np.vectorize(lambda x: 'grey' if pd.isnull(x) else colours[x])

plt.scatter(iris.sepal_length, iris.sepal_width, c=col_convert(iris.species))
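
The same lookup-with-fallback idea can probably also be written without np.vectorize, using pandas' own map and fillna. This is only a sketch: the small DataFrame below is a hypothetical stand-in for iris (so it runs without downloading anything), and 'grey' is simply the fallback colour used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data, with two missing species labels.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "sepal_width": [3.5, 3.2, 3.3, 3.0],
    "species": [np.nan, "versicolor", "virginica", np.nan],
})

# colour dictionary for the labels that are present
colours = {"versicolor": "orangered", "virginica": "forestgreen"}

# map() gives NaN for labels that are missing (or absent from the dict);
# fillna() then assigns those points the fallback colour.
point_colours = df["species"].map(colours).fillna("grey").tolist()
print(point_colours)
```

The resulting list has one colour per row, so it should be usable as the c argument, e.g. plt.scatter(df.sepal_length, df.sepal_width, c=point_colours).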
D A Wells
    Yes, `np.nan == np.nan` evaluates to `False` in Python. I think you'll need to do `fillna("missing")` or something similar on your actual dataset. – mwaskom May 29 '20 at 11:41
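
A minimal sketch of what the comment above suggests, assuming the species column holds plain strings (object dtype) rather than a pandas categorical. The tiny DataFrame and the palette dictionary are made up for illustration; the "missing" label comes from the comment.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself,
# so it cannot be matched like an ordinary category label.
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)  # False

# Hypothetical stand-in for the species column with missing labels.
df = pd.DataFrame({"species": ["versicolor", np.nan, "virginica", np.nan]})

# Replace NaN with an explicit label so it behaves like any other category.
df["species"] = df["species"].fillna("missing")

# Illustrative palette: every label, including "missing", gets a colour.
palette = {
    "versicolor": "orangered",
    "virginica": "forestgreen",
    "missing": "grey",
}

labels = df["species"].tolist()
print(labels)
```

With the NaNs replaced, the column can be passed as hue like any other, for example sns.scatterplot(data=df, x=..., y=..., hue='species', palette=palette), and the "missing" group is then drawn and listed in the legend.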