In a nutshell, what is my best option for distribution-type graphs (histogram or KDE) when my data is weighted?

import pandas as pd
import seaborn as sns

df = pd.DataFrame({'x': [1, 2, 3, 4], 'wt': [7, 5, 3, 1]})

df.x.plot(kind='hist', weights=df.wt.values)

That works fine but seaborn won't accept a weights kwarg, i.e.

sns.distplot( df.x, bins=4,              # doesn't work like this
              weights=df.wt.values )     # or with kde=False added

It would also be nice if kde would accept weights but neither pandas nor seaborn seems to allow it.

I realize, by the way, that the data could be expanded to fake the weighting. That's easy here, but not of much use with my real data, where the weights are in the hundreds or thousands, so I'm not looking for a workaround like that.

Anyway, that's all. I'm just trying to find out what (if anything) I can do with weighted data besides the basic pandas histogram. I haven't fooled around with bokeh yet, but bokeh suggestions are also welcome.

JohnE
  • Same questions & same answer here: https://stackoverflow.com/questions/31703149/weights-option-for-seaborn-distplot – JohnE Nov 14 '18 at 21:17

2 Answers

You have to understand that seaborn uses the same matplotlib plotting functions that pandas also uses.

As the documentation states, sns.distplot does not accept a weights argument; however, it does take a hist_kws argument, which is passed on to the underlying call to plt.hist. Thus, this should do what you want:

sns.distplot(df.x, bins=4, hist_kws={'weights':df.wt.values}) 
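
To see what the weights do to the bin counts, here is a small NumPy-only sketch using the toy data from the question. It relies on np.histogram, which, as far as I know, is what plt.hist uses for the binning. The comparison against np.repeat is my own sanity check, not something the answer relies on, and it only works because these weights are small integers.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
wt = np.array([7, 5, 3, 1])

# Weighted bin counts: each observation contributes its weight instead of 1.
weighted_counts, edges = np.histogram(x, bins=4, weights=wt)

# Reference: expand the data so each value appears `wt` times.
# Only practical for small integer weights, which is why weights= is preferable.
expanded = np.repeat(x, wt)
expanded_counts, _ = np.histogram(expanded, bins=edges)

print(weighted_counts)
print(expanded_counts)
```

Both arrays should agree, which suggests the weights kwarg gives the same picture as expanding the data, without the memory cost.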
hitzg
  • Yeah, thanks, that is helpful. I wasn't sure how to pass the kwarg to matplotlib. I'll upvote now but leave it open a little longer in case anyone has ideas about kde or such. – JohnE Apr 27 '15 at 12:24
  • Seaborn's KDE plots use the Python package statsmodels for the computations. The relevant functions take a weights argument, but it seems that seaborn does not forward it. The relevant source files: https://github.com/mwaskom/seaborn/blob/master/seaborn/distributions.py and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/nonparametric/kde.py – hitzg Apr 27 '15 at 12:36
  • OK, thanks. Looks like weights may not be implemented yet (I can't tell for sure from a quick skim). Anyway, I'll close this now and maybe ask a question more focused on kde at a later time. – JohnE Apr 27 '15 at 12:45
  • Ok. BTW: mwaskom is on SO too, and given that the question has the seaborn tag, he might look at the question. Then we'll know for sure. – hitzg Apr 27 '15 at 12:46
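
On the KDE point raised in these comments: a weighted Gaussian KDE is simple enough to sketch directly in NumPy, independent of what seaborn forwards. The helper name weighted_gaussian_kde and the fixed bandwidth of 0.5 are my own assumptions for illustration; real libraries pick the bandwidth automatically, so the curve will not match theirs exactly.

```python
import numpy as np

def weighted_gaussian_kde(x, weights, grid, bandwidth):
    """Evaluate a weighted Gaussian KDE on `grid` (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the density integrates to 1
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every observation.
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Each observation's kernel is scaled by its normalized weight.
    return kernels @ w

x = np.array([1, 2, 3, 4])
wt = np.array([7, 5, 3, 1])
grid = np.linspace(-4.0, 9.0, 1301)

# bandwidth=0.5 is an arbitrary choice for this sketch.
density = weighted_gaussian_kde(x, wt, grid, bandwidth=0.5)

# Riemann-sum check that the result is a proper density.
area = density.sum() * (grid[1] - grid[0])
print(area)
```

The result can be drawn with plt.plot(grid, density). As far as I know, later seaborn releases (0.11 onward) added a weights parameter to kdeplot and histplot, and scipy.stats.gaussian_kde also accepts weights, so check your installed versions before rolling your own.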
I solved this problem by resampling the data points based on their weight.

You can do it like this:

import numpy as np
import seaborn as sns
from random import random
from bisect import bisect

def weighted_choice(choices):
    # choices is a sequence of (value, weight) pairs
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    # pick a point in [0, total) and find the interval it falls into
    x = random() * total
    i = bisect(cum_weights, x)
    return values[i]

# each sample is ([x, y], weight)
samples = [([5, 0.5], 0.1), ([0, 10], 0.3), ([0, -4], 0.3)]
choices = np.array([weighted_choice(samples) for _ in range(1000)])
sns.distributions.kdeplot(choices[:, 0], choices[:, 1], shade=True)

[image: output of the kdeplot call above]
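
The resampling step can also be sketched without the hand-written cumulative-weights loop, using NumPy's Generator.choice to draw indices with probability proportional to the weights. This is an alternative formulation, not the code from the answer; the seed and the variable names values, weights and idx are my own choices for illustration.

```python
import numpy as np

# Same (value, weight) pairs as in the example above.
samples = [([5, 0.5], 0.1), ([0, 10], 0.3), ([0, -4], 0.3)]
values = np.array([v for v, _ in samples], dtype=float)
weights = np.array([w for _, w in samples], dtype=float)

# Seed chosen only to make the sketch reproducible.
rng = np.random.default_rng(0)

# Draw row indices with probability proportional to weight,
# then look up the corresponding (x, y) values.
idx = rng.choice(len(values), size=1000, p=weights / weights.sum())
choices = values[idx]

print(choices.shape)
```

The resulting choices array has the same layout as in the answer, so its two columns can be passed to the KDE plot in the same way.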

Mobina