Can ecdfplot show the concentration of a variable? E.g. the top 10 items account for 20% of the total, etc

Question

The issue

I want to create a plot to show the concentration by a certain variable. Let's say I have a 1-dimensional array of prices.

I want a plot showing that, for example, the 10 most expensive items account for 10% of the total price, the 100 most expensive items account for 40% of the total price, etc.
This is useful whenever we want to understand how concentrated the data is: e.g. a few borrowers account for most of a bank's exposure, a few days account for most of the rainfall in a given period, etc.

What I have done so far

I manually sort by price, calculate a cumulative sum, divide by the total price and plot that.
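
As a minimal sketch of the manual calculation described above (the helper name `concentration_curve` and the toy prices are made up for illustration and are not part of my real code):

```python
import numpy as np

def concentration_curve(values):
    """Cumulative share of the total, counting the largest values first."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    return np.cumsum(ordered) / ordered.sum()

# Toy prices, invented for this example
prices = np.array([50.0, 10.0, 30.0, 10.0])
shares = concentration_curve(prices)

# shares[k - 1] is the fraction of the total price held by the k most expensive items
print(shares)
```

Plotting `shares` against `1, 2, ..., len(shares)` gives the concentration curve for a single category.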

Why it's not ideal

I would like to use seaborn's displot and FacetGrid to calculate this for multiple categories, with one panel per category.

The question

Is there a way to use ecdfplot or another function compatible with seaborn's displot?

My code (which works but is not ideal)

import numpy as np
from numpy.random import default_rng
import pandas as pd
import copy

import matplotlib
matplotlib.use('TkAgg', force = True)
import matplotlib.pyplot as plt

import seaborn as sns
import seaborn.objects as so
from matplotlib.ticker import FuncFormatter
sns.set_style("darkgrid")
rng = default_rng()

# Draw samples from a normal distribution; negative prices are dropped
# below, which leaves a truncated normal distribution.
n = int(2e3)
n_red = n // 3
n_green = n - n_red
df = pd.DataFrame()
df['price'] = rng.standard_normal(n) * 100 + 20  # use the Generator created above
df['colour'] = np.hstack([np.repeat('red', n_red),
                          np.repeat('green', n_green)])
df = df.query('price > 0').reset_index(drop=True)

num_cols = len(np.unique(df['colour']))
fig1, ax1 = plt.subplots(num_cols)

sub_dfs={}
for my_ax, c in enumerate(np.unique(df['colour'])):
    sub_dfs[c] = copy.deepcopy(df.query('colour == @c'))
    sub_dfs[c] = sub_dfs[c].sort_values(by='price', ascending=False).reset_index()
    sub_dfs[c]['cum %'] = np.cumsum(sub_dfs[c]['price']) / sub_dfs[c]['price'].sum()

    sns.lineplot(sub_dfs[c]['cum %'], ax = ax1[my_ax])
    ax1[my_ax].set_title(c + ' - price concentration')
    ax1[my_ax].set_xlabel('# of items')
    ax1[my_ax].set_ylabel('% of total price')

What I have tried - but doesn't work

I have played around with displot and kind='ecdf':

fig2 = sns.displot(kind='ecdf', data = df, y='price', col='colour', col_wrap =2, weights ='price',
                   facet_kws=dict(sharey=False))

fig3 = sns.displot(kind='ecdf', data = df, x='price', col='colour', col_wrap =2, weights='price',
                   facet_kws=dict(sharey=False))

EDIT: mwaskom's answer (I still can't get it to work)

@mwaskom, thank you for your answer. However, I'm afraid I'm still doing something wrong, as I'm not getting the desired result.

If I run:

fig5 = sns.displot(kind='ecdf', data=df, x=df.index, col='colour', col_wrap =2, weights='price',
                   facet_kws=dict(sharey=False, sharex=False))

I get two straight lines, whereas the plot I need is concave, like the curves produced by my code above. A straight line means the price is equally distributed, i.e. that 10% of the items account for 10% of the total price. A concave curve, lying above the diagonal, means that the top 10% of the items account for more than 10% of the total price (which is my case).
In my toy example, I have one category with ca. 400 items and one with ca. 800. Since the x axis is the index of the whole dataframe, the second plot goes from 400 to 1,200 instead of going from 1 to 800.

score 1 · Answer 1 · answered Apr 28 '23 at 11:26

If I'm understanding correctly, you can do this using an ECDF with weights:

df = (
    sns.load_dataset("diamonds")
    .sort_values("price", ascending=False)
    .reset_index(drop=True)
    .rename_axis("item")
)
sns.displot(df, x="item", weights="price", kind="ecdf")
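
For the per-category case in the question, one possible adaptation is sketched below. Two things appear to matter: the frame has to be sorted by price before the item number is created (otherwise the index order is unrelated to price and the weighted ECDF comes out roughly straight), and the item number has to restart within each category. The helper names `add_group_rank` and `plot_concentration` and the toy frame are hypothetical; the column names `price` and `colour` are taken from the question.

```python
import pandas as pd

def add_group_rank(df, value="price", group="colour"):
    """Sort by value (descending) and add a 1-based rank within each group."""
    out = df.sort_values(value, ascending=False).reset_index(drop=True)
    out["rank"] = out.groupby(group).cumcount() + 1
    return out

def plot_concentration(df, value="price", group="colour"):
    """Draw one weighted ECDF per group, with the within-group rank on x."""
    import seaborn as sns  # imported here so the ranking helper works without seaborn
    ranked = add_group_rank(df, value, group)
    return sns.displot(ranked, x="rank", weights=value, col=group,
                       kind="ecdf", facet_kws=dict(sharex=False))

# Toy data, invented for this example
toy = pd.DataFrame({
    "price": [5.0, 50.0, 20.0, 30.0, 10.0],
    "colour": ["red", "green", "red", "green", "red"],
})
ranked = add_group_rank(toy)
print(ranked)
```

Calling `plot_concentration(df)` on the frame from the question should then give one panel per colour, each with an x axis starting at 1.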

mwaskom


Thank you, but I can't get your answer to work. I have edited my initial question to explain. I must be doing something wrong. – Pythonista anonymous Apr 28 '23 at 12:06
You'll need to supply a reproducible dataset if you want more help. – mwaskom Apr 28 '23 at 18:08
I did. My code at the very top generates random samples from a truncated normal distribution, which is what I use in the example. My actual data is different but this is an easily reproducible example. One could argue I should have fixed the random seed, but in practice it doesn't make much of a difference from run to run. – Pythonista anonymous Apr 28 '23 at 18:43