The issue
I want to create a plot that shows how concentrated a certain variable is. Let's say I have a 1-dimensional array of prices.
- I want a plot that shows me that the 10 most expensive items account for 10% of the total price, the 100 most expensive items for 40% of the total price, etc.
- This is useful in all those situations where we want to understand how concentrated or not certain data is: e.g. few borrowers account for most of the exposure of a bank, few days account for most of the rainfall in a given period, etc.
What I have done so far
I manually sort by price, calculate a cumulative sum, divide by the total price and plot that.
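As a minimal sketch of those three steps on a toy array (the function name `concentration_curve` and the numbers are illustrative assumptions, not part of my actual code further down):

```python
import numpy as np

def concentration_curve(values):
    """Cumulative share of the total held by the top-k items, for k = 1..n."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]  # largest first
    return np.cumsum(ordered) / ordered.sum()

# Toy prices that sum to 100, so the shares are easy to read off
prices = [5, 50, 15, 30]
shares = concentration_curve(prices)
print(shares)  # the top item holds 50% of the total, the top two hold 80%, etc.
```

Plotting `shares` against `1..n` gives the curve I am after for a single category.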
Why it's not ideal
I would like to use seaborn's displot and FacetGrid to calculate this for multiple categories. Something like this:
The question
Is there a way to use ecdfplot or another function compatible with seaborn's displot?
My code (which works but is not ideal)
import numpy as np
from numpy.random import default_rng
import pandas as pd
import copy
import matplotlib
matplotlib.use('TkAgg', force = True)
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so
from matplotlib.ticker import FuncFormatter
sns.set_style("darkgrid")
rng = default_rng()
# I generate random samples from a truncated normal distr
# (I don't want negative values)
n = int(2e3)
n_red = int(n/3)
n_green = n - n_red
df = pd.DataFrame()
# Use the generator created above instead of the legacy np.random API
df['price'] = rng.standard_normal(n) * 100 + 20
df['colour'] = np.hstack([np.repeat('red', n_red),
                          np.repeat('green', n_green)])
df = copy.deepcopy(df.query('price > 0')).reset_index(drop=True)
num_cols = len(np.unique(df['colour']))
fig1, ax1 = plt.subplots(num_cols)
sub_dfs={}
for my_ax, c in enumerate(np.unique(df['colour'])):
    # Sort each category by descending price, then take the cumulative share
    sub_dfs[c] = copy.deepcopy(df.query('colour == @c'))
    sub_dfs[c] = sub_dfs[c].sort_values(by='price', ascending=False).reset_index()
    sub_dfs[c]['cum %'] = np.cumsum(sub_dfs[c]['price']) / sub_dfs[c]['price'].sum()
    sns.lineplot(sub_dfs[c]['cum %'], ax=ax1[my_ax])
    ax1[my_ax].set_title(c + ' - price concentration')
    ax1[my_ax].set_xlabel('# of items')
    ax1[my_ax].set_ylabel('% of total price')
What I have tried - but doesn't work
I have played around with displot and ecdf:
fig2 = sns.displot(kind='ecdf', data=df, y='price', col='colour', col_wrap=2, weights='price',
                   facet_kws=dict(sharey=False))
fig3 = sns.displot(kind='ecdf', data=df, x='price', col='colour', col_wrap=2, weights='price',
                   facet_kws=dict(sharey=False))
EDIT: mwaskom's answer (I still can't get it to work)
@mwaskom, thank you for your answer. However, I'm afraid I'm still doing something wrong, as I'm not getting the desired result.
If I run:
fig5 = sns.displot(kind='ecdf', data=df, x=df.index, col='colour', col_wrap =2, weights='price',
facet_kws=dict(sharey=False, sharex=False))
- I get two straight lines, whereas the plot I need is concave: it rises steeply at first and then flattens (see first plot at the top). A straight line means the price is equally distributed, i.e. 10% of the items account for 10% of the total price. A concave curve means that the top 10% of the items account for more than 10% of the total price (which is my case). What I get is this:
- In my toy example, I have a category with ca. 400 items and one with ca. 800. Since the x axis is the index of the whole dataframe, the second plot goes from 400 to 1,200, instead of going from 1 to 800.
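Both symptoms seem consistent with using the DataFrame's global index as x: the index is unrelated to the price ordering, and it is not restarted within each category. The sketch below is one possible way around this, under stated assumptions. It adds a hypothetical `rank` column (1 = most expensive item within its colour) and checks with plain numpy that a price-weighted ECDF over that rank reproduces the manual cumulative share. The `weighted_ecdf` helper and the toy data are illustrative only and are not seaborn's implementation.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: continuous prices, so ties are practically impossible)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=12),
    "colour": ["red"] * 4 + ["green"] * 8,
})

# Hypothetical helper column: rank within each colour, 1 = most expensive
df["rank"] = (
    df.groupby("colour")["price"]
      .rank(ascending=False, method="first")
      .astype(int)
)

def weighted_ecdf(x, w):
    """Cumulative weight share, accumulated in increasing order of x."""
    order = np.argsort(x)
    return np.cumsum(np.asarray(w)[order]) / np.sum(w)

# Compare the rank-based weighted ECDF with the manual sort-and-cumsum approach
checks = {}
for colour, sub in df.groupby("colour"):
    via_rank = weighted_ecdf(sub["rank"].to_numpy(), sub["price"].to_numpy())
    manual = np.cumsum(np.sort(sub["price"].to_numpy())[::-1]) / sub["price"].sum()
    checks[colour] = bool(np.allclose(via_rank, manual))

print(checks)
```

With such a column in place, a call like `sns.displot(kind='ecdf', data=df, x='rank', weights='price', col='colour', facet_kws=dict(sharex=False))` might then draw the concentration curve in each facet, with every x axis starting at 1. That call has not been verified here against any particular seaborn version.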