I want to make a scatter plot in Python, coloured by a categorical variable, that still plots the points whose value for that categorical variable is missing.
Using the iris dataset as an example:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
iris = sns.load_dataset('iris')
Seaborn can plot by colour:
sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False)
With a little more work, so can matplotlib (taken from this answer):
colours = {'setosa': 'skyblue', 'versicolor': 'orangered', 'virginica': 'forestgreen'}
plt.scatter(iris.sepal_length, iris.sepal_width, c=iris.species.apply(lambda x: colours[x]))
But neither will plot points whose colour category is missing. If we set the species variable (which we use to colour the plot) to np.nan for one species, seaborn doesn't plot those points and matplotlib won't plot anything:
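# use .loc rather than chained indexing so the assignment reliably modifies iris
iris.loc[iris.species == 'setosa', 'species'] = np.nan
sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False)
# the colour lookup raises a KeyError on the NaN values, so nothing is drawn
iris.plot(x='sepal_length', y='sepal_width', kind="scatter", c=iris.species.apply(lambda x: colours[x]))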
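As far as I can tell, the matplotlib/pandas failure comes from the dictionary lookup rather than from the plotting call itself. Here is a minimal sketch of that, using a plain list as a stand-in for iris.species. The `colour_for` helper and the `MISSING_COLOUR` fallback are names I made up for illustration, not part of any library:

```python
import math

# Same colour mapping as above.
colours = {'setosa': 'skyblue', 'versicolor': 'orangered', 'virginica': 'forestgreen'}

# Hypothetical fallback colour for missing species (my own choice, not a library default).
MISSING_COLOUR = 'lightgrey'


def colour_for(species):
    """Return the colour for a species, or the fallback if the value is missing."""
    if species is None or (isinstance(species, float) and math.isnan(species)):
        return MISSING_COLOUR
    return colours[species]


# Stand-in for iris.species after one species has been set to NaN.
species = [float('nan'), 'versicolor', 'virginica']

# The plain lookup fails on the missing value, because NaN is not a key in colours.
try:
    [colours[s] for s in species]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# With the fallback, every point gets a colour.
point_colours = [colour_for(s) for s in species]
```

Presumably something like iris.species.apply(colour_for) could then be passed as c= to plt.scatter, but that feels like a workaround, and I don't know of an equivalent for seaborn's hue.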