18

I replaced the missing values with NaN using the following lambda function:

data = data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

where data is the dataframe I am working on.
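For readers on Python 3, where basestring no longer exists, a minimal sketch of the same cleanup could look like the following. The helper name blank_to_nan and the sample values are hypothetical and only serve as illustration. The final pd.to_numeric step assumes the column is meant to hold numbers.

```python
import numpy as np
import pandas as pd

def blank_to_nan(df):
    """Replace whitespace-only strings with NaN, column by column."""
    def clean(x):
        # Only strings made up entirely of whitespace count as missing
        return np.nan if isinstance(x, str) and x.isspace() else x
    return df.apply(lambda col: col.map(clean))

# Hypothetical sample: numbers stored as text, with blank cells
raw = pd.DataFrame({'alcconsumption': ['1.5', ' ', '2.0', '   ']})
cleaned = blank_to_nan(raw)

# Assumption: the column should be numeric, so convert the remaining strings
numeric = pd.to_numeric(cleaned['alcconsumption'], errors='coerce')
```

Converting to a numeric dtype matters here, because a column that still holds strings cannot be binned into a histogram.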

Afterwards, I tried to plot one of its columns, 'alcconsumption', using seaborn.distplot as follows:

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

It's giving me the following error:

AttributeError: max must be larger than min in range parameter.
doine
datavinci

4 Answers

5

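You can use the following line to select the non-NaN values for a distribution plot using seaborn. Note that notnull() on its own returns a boolean mask, so it has to be used to index the column:

seaborn.distplot(data.loc[data['alcconsumption'].notnull(), 'alcconsumption'],hist=True,bins=100)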
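To make the difference between a boolean mask and the filtered values concrete, here is a small sketch on a made-up Series (the values are hypothetical, not taken from the question's data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], name='alcconsumption')

mask = s.notnull()               # boolean Series: True where a value is present
values_via_mask = s[mask]        # the actual non-NaN values
values_via_dropna = s.dropna()   # same result, written more directly
```

Passing mask itself to a plotting function would plot True/False rather than the measurements, so one of the last two forms is what you want.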
mr_mo
ZicoNuna
  • [distplot] As of version 0.11, seaborn says: "This function is deprecated and will be removed in a future version." https://seaborn.pydata.org/generated/seaborn.distplot.html – Marc Maxmeister Apr 07 '21 at 18:24
  • shouldn't this be `seaborn.distplot(data[data['alcconsumption'].notnull()]['alcconsumption'],hist=True,bins=100)` ? I believe `data['alcconsumption'].notnull()` outputs boolean – user1442363 Oct 06 '21 at 15:50
4

This is a known issue with matplotlib/pylab histograms!

See e.g. https://github.com/matplotlib/matplotlib/issues/6483

where various workarounds are suggested, two favourites (for example from https://stackoverflow.com/a/19090183/1021819) being:

import numpy as np
nbins=100
A=data['alcconsumption']
Anan=A[~np.isnan(A)] # Remove the NaNs

seaborn.distplot(Anan,hist=True,bins=nbins)

Alternatively, specify the bin edges explicitly (here still using Anan to find the range):

Amin=min(Anan)
Amax=max(Anan)
seaborn.distplot(A,hist=True,bins=np.linspace(Amin,Amax,nbins+1)) # nbins+1 edges give nbins bins
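As a quick sanity check of the bin-edge approach without any plotting, the sketch below runs the same steps through np.histogram on a small made-up array. The sample values and the choice of 5 bins are assumptions for illustration only.

```python
import numpy as np

A = np.array([0.5, 2.0, np.nan, 7.5, 3.0, np.nan, 10.0])  # hypothetical data
Anan = A[~np.isnan(A)]            # remove the NaNs

nbins = 5
# nbins bins need nbins + 1 edges spanning the finite range
edges = np.linspace(Anan.min(), Anan.max(), nbins + 1)

counts, _ = np.histogram(Anan, bins=edges)
```

Because the edges are computed from the finite values only, the histogram range is always well defined, which is the condition the original error message complains about.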
jtlz2
3

I would definitely handle missing values before you plot your data. Whether or not to use dropna() depends entirely on the nature of your dataset. Is alcconsumption a single series or part of a dataframe? In the latter case, calling dropna() on the dataframe would remove the corresponding rows in the other columns as well. Are the missing values few or many? Are they spread around in your series, or do they tend to occur in groups? Is there perhaps reason to believe that there is a trend in your dataset?

If the missing values are few and scattered, you could easily use dropna(). In other cases I would choose to fill missing values with the previously observed value (1), or even fill them with interpolated values (2). But be careful! Replacing a lot of data with filled or interpolated observations could seriously distort your dataset and lead to very wrong conclusions.
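A small sketch of the dropna() point above, on a made-up two-column frame (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                        'B': [10, 20, 30, 40]})

share_missing = df_demo['A'].isna().mean()   # fraction of missing values in A

a_only = df_demo['A'].dropna()               # drops NaNs from the series only
whole = df_demo.dropna()                     # drops every row containing a NaN
```

Here a_only leaves column B untouched in df_demo, while whole also discards the B values 20 and 40 because they share a row with a missing A.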

Here are some examples that use your snippet...

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

... on a synthetic dataset:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def sample(rows, names):
    ''' Function to create data sample with random returns

    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets

    Example
    =======

    >>> sample(rows = 2, names = ['A', 'B'])

                  A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars) 
    df_temp = df_temp.set_index(rng)


    return df_temp

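df = sample(rows = 15, names = ['A', 'B'])
df['A'] = df['A'].astype(float)      # NaN needs a float column
df.loc[df.index[8:12], 'A'] = np.nan # avoids chained assignment
df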

Output:

            A   B
2017-01-01 -63.0  10
2017-01-02  49.0  79
2017-01-03 -55.0  59
2017-01-04  89.0  34
2017-01-05 -13.0 -80
2017-01-06  36.0  90
2017-01-07 -41.0  86
2017-01-08  10.0 -81
2017-01-09   NaN -61
2017-01-10   NaN -80
2017-01-11   NaN -39
2017-01-12   NaN  24
2017-01-13 -73.0 -25
2017-01-14 -40.0  86
2017-01-15  97.0  60

1. Using forward fill with pandas.DataFrame.fillna(method = ffill)

ffill will "fill values forward", meaning it will replace the nan's with the value of the row above.

A_ffill = df['A'].ffill()  # same as fillna(method='ffill'); df stays a DataFrame for the next example
sns.distplot(A_ffill, hist=True,bins=5)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

[Figure: distribution plot of column A after forward fill]

2. Using interpolation with pandas.DataFrame.interpolate()

Interpolate values according to different methods. Time interpolation works on daily and higher resolution data to interpolate given length of interval.

df['A'] = df['A'].interpolate(method = 'time')
sns.distplot(df['A'], hist=True,bins=5)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

[Figure: distribution plot of column A after time interpolation]

As you can see, the different methods render two very different results. I hope this will be useful to you. If not then let me know and I'll have a look at it again.
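To see numerically why the two methods differ, here is a minimal sketch on a short made-up daily series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2017-01-01', periods=6, freq='D')
s = pd.Series([10.0, np.nan, np.nan, np.nan, 50.0, 60.0], index=idx)

ffilled = s.ffill()                     # repeats the last observed value
interp = s.interpolate(method='time')   # straight line between known points
```

Forward fill piles three extra observations onto the value 10, while interpolation spreads them across 20, 30 and 40, so the resulting histograms put their mass in different places.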

vestland
2

This may not be a solution to the question asked, but I use the code below to check where values are missing:

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
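If a plot is more than you need, the same isnull() mask can be summarised as counts per column. A minimal sketch on a made-up frame (names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                         'B': [10.0, 20.0, np.nan, 40.0]})

missing_per_column = df_check.isnull().sum()   # number of NaNs in each column
rows_with_missing = df_check.isnull().any(axis=1).sum()
```

The heatmap shows where the gaps sit; these counts show how many there are.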
PlutoSenthil
  • Please add more context to this answer - code dumping is not encouraged here. https://meta.stackoverflow.com/questions/358727/are-there-any-guidelines-to-handle-one-line-correct-code-only-answers-in-vario – rayryeng Nov 06 '21 at 22:21