
I have a pandas DataFrame with multiple numeric columns; the first column holds categorical data. There are NaN values and zeros scattered across multiple rows and columns (though no row is entirely blank).

The rows have valuable data in other columns which are not NaN. And the columns have valuable data in other rows, which are also not NaN.

The problem is that sns.pairplot does not ignore NaN values when plotting the correlations and raises errors (such as division by zero, string-to-float conversion, etc.).

I have seen some people suggest the fillna() method, but I am hoping someone knows a more elegant way to do this, without having to go through that solution and then spend hours fixing the plot, axes, filters, etc. afterwards. I didn't like that workaround.

It is similar to what this person has reported:
https://github.com/mwaskom/seaborn/issues/1699

ZeroDivisionError: 0.0 cannot be raised to a negative power

Here is the sample dataset: [image of the sample dataset]

Gino Mempin
user736963
  • Remove nans from your data first before plotting. – ImportanceOfBeingErnest Aug 04 '19 at 17:44
  • That's not the point, because the rows have valuable data in other columns which are not nan. – user736963 Aug 04 '19 at 17:45
  • and the columns have valuable data in other rows, which are also not nan – user736963 Aug 04 '19 at 17:46
  • Seaborn requires long form datasets anyways, so a row with a nan in any column is not useful and can be removed. – ImportanceOfBeingErnest Aug 04 '19 at 17:49
  • absolutely not, I have important data there, I just need the pair plot to ignore any correlation which deals with nan values and pass it. It is a simple concept. – user736963 Aug 04 '19 at 17:53
  • I'm not saying you should delete those from your data, just from the dataframe you use for plotting. – ImportanceOfBeingErnest Aug 04 '19 at 17:55
  • If I do this and remove rows that have even one nan value, I lose a lot of possible correlation points with other columns that are not nan. This is not what I am asking. The output would be plots (within the pairplot), some with more data points and others with fewer. – user736963 Aug 04 '19 at 17:58
  • would be like a simple function concept: include point correlation in plot (row x column) if value != nan, else, pass. – user736963 Aug 04 '19 at 18:02
  • I see. That would render the pairgrid wrong though. Because different subplots are based on different datasets. You may still filter out the nans for each function you map onto the pairgrid though. – ImportanceOfBeingErnest Aug 04 '19 at 18:08
  • yes, you are right, theoretically, they are different subsets but practically is just a visualization function right? would have nothing to do with the datasets or source info. could you detail more your last statement please? how could I filter that out using sns.pairplot? – user736963 Aug 04 '19 at 18:11
  • feed in a function to [map_offdiag](https://seaborn.pydata.org/generated/seaborn.PairGrid.map_offdiag.html#seaborn.PairGrid.map_offdiag) that does whatever you need it to do before plotting the data. – ImportanceOfBeingErnest Aug 04 '19 at 18:12
  • because basically I'm getting an error and I can't use hue because of that issue.... – user736963 Aug 04 '19 at 18:12
  • or probably rather [map_diag](https://seaborn.pydata.org/generated/seaborn.PairGrid.map_diag.html#seaborn.PairGrid.map_diag) because that error is likely from the KDE curves, not some scatters, which handle nans just fine. – ImportanceOfBeingErnest Aug 04 '19 at 18:13
  • Nice, that might work, but I see is a bit advanced. I'm not sure I'm so skilled for this. Could you help me to write that down and apply ? – user736963 Aug 04 '19 at 18:15
  • it says it needs an Xarray. I imagine I would need to write a loop right ? – user736963 Aug 04 '19 at 18:16
  • Yes, I can **help** with that; meaning I first need an example dataset - the one from the issue seems not well suited because ***all*** elements from one hue are nan, right? – ImportanceOfBeingErnest Aug 04 '19 at 18:19
  • Thank you ! :) no, I have no columns or rows completely empty or with nan – user736963 Aug 04 '19 at 18:21
  • I added a sample dataset – user736963 Aug 04 '19 at 18:28
  • and one more comment: if I choose a hue column that has both valid categories and nan, the legend would need to show e.g. cat1 green, cat2 red, cat3 yellow and nan blue (which are valid points without a category). I'll try to replace nan in that particular hue column with a string 'empty' or 'none' and see what happens with the data – user736963 Aug 04 '19 at 19:29
  • Sorry, I cannot work with that dataset - too much stuff in it, not clear which information to show how... The general idea of using a custom function is along the lines of [this answer](https://stackoverflow.com/a/45865124/4124317) or [this one](https://stackoverflow.com/a/56387759/4124317). – ImportanceOfBeingErnest Aug 04 '19 at 19:37
  • ok, what I tried didn't work, and the solution in the links you provided is not that elementary. I'll have a better look tomorrow. I hope you can help me find a solution to this. Another approach would be fillna(0.0001), which will also create a 'category' of 0.0001 in hue. I could then create a standard pairplot, passing vars and saying: hue = category column except category 0.0001 (filter out values from this category), and plot only if values > 0.0001, which would also filter the correlations within the categories that are not nan. But I also don't know how to do that, and I found nothing on the web. – user736963 Aug 04 '19 at 21:22

1 Answer

Seaborn's PairGrid class will allow you to create your desired plot. PairGrid is much more flexible than sns.pairplot. Any PairGrid has three sections: the upper triangle, the lower triangle and the diagonal.

For each part, you can define a customized plotting function. The upper and lower triangle sections can take any plotting function that accepts two arrays of features (such as plt.scatter) as well as any associated keywords (e.g. marker). The diagonal section accepts a plotting function that has a single feature array as input (such as plt.hist) in addition to the relevant keywords.
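As a rough illustration of those two function shapes, here is a minimal sketch on made-up random data. The names toy, offdiag_func and diag_func are hypothetical and only serve the example; the Agg backend is selected just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data purely for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

# Off-diagonal panels: the function receives two feature vectors plus keywords
def offdiag_func(x, y, **kwargs):
    plt.scatter(x, y, **kwargs)

# Diagonal panels: the function receives a single feature vector plus keywords
def diag_func(x, **kwargs):
    plt.hist(x, **kwargs)

grid = sns.PairGrid(toy)
grid.map_offdiag(offdiag_func, s=10)   # s is forwarded to plt.scatter
grid.map_diag(diag_func, bins=8)       # bins is forwarded to plt.hist
```

PairGrid makes the relevant axes current before each call, which is why the functions can draw with plain plt calls.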

For your purpose, you can filter out the NaNs in your customized function(s):

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

data = datasets.load_iris()
iris = pd.DataFrame(data.data, columns=data.feature_names)

# break iris dataset to create NaNs
iris.iat[1, 0] = np.nan
iris.iat[4, 0] = np.nan
iris.iat[4, 2] = np.nan
iris.iat[5, 2] = np.nan

# customized scatter plot that first filters out NaNs in the feature pair
def scatterFilter(x, y, **kwargs):
    interimDf = pd.concat([x, y], axis=1)
    interimDf.columns = ['x', 'y']
    # keep only the rows where both features are present
    interimDf = interimDf[(~ pd.isnull(interimDf.x)) & (~ pd.isnull(interimDf.y))]
    # PairGrid makes the target axes current before calling this function
    plt.plot(interimDf.x.values, interimDf.y.values, 'o', **kwargs)

# Create an instance of the PairGrid class
# (height replaces the older size argument in seaborn >= 0.9)
grid = sns.PairGrid(data=iris, vars=list(iris.columns), height=4)

# Map the filtered scatter plot to the upper triangle
grid = grid.map_upper(scatterFilter, color='darkred')

# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins=10, edgecolor='k', color='darkred')

# Map the filtered scatter plot to the lower triangle
grid = grid.map_lower(scatterFilter, color='darkred')

This will yield the following plot:

Iris Seaborn PairPlot
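Because scatterFilter drops NaNs separately for each pair of features, different panels can end up with different numbers of points, which is the behaviour asked for in the comments. A small sketch on a made-up DataFrame (the column names and values are assumptions for illustration only) compares this with dropping every row that has a NaN anywhere:

```python
import itertools
import numpy as np
import pandas as pd

# Made-up data: each row is missing at most one value
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# Row-wise removal: a row survives only if it is complete in every column
complete_rows = len(df.dropna())

# Pairwise removal: a row survives in a panel if that pair of columns is complete
pair_counts = {
    (x, y): len(df[[x, y]].dropna())
    for x, y in itertools.combinations(df.columns, 2)
}

print(complete_rows)
print(pair_counts)
```

In this toy case only 2 rows are complete across all columns, while every pair of columns keeps 3 points.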

PairGrid allows you to plot contour plots, annotate the panels with descriptive statistics, etc. For more details, see here.
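The diagonal can be filtered in the same spirit as scatterFilter. The following is only a sketch: dropNaNs and histFilter are hypothetical helper names, the tiny DataFrame is made up, and the Agg backend is chosen just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical helper: return the values of one feature with NaNs removed
def dropNaNs(x):
    return pd.Series(x).dropna().to_numpy()

# Hypothetical diagonal function: histogram of the non-NaN values only
def histFilter(x, **kwargs):
    plt.hist(dropNaNs(x), **kwargs)

# Made-up data with a NaN in each column
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [2.0, 1.0, np.nan, 5.0],
})

grid = sns.PairGrid(df)
grid.map_diag(histFilter, bins=3)
```

The same wrapper idea should carry over to a density plot on the diagonal, where the comments suspect the original error comes from, by swapping the plt.hist call for the density function of your choice.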

Gino Mempin