How do I produce the CDF of trajectory's segments per user in this case?

Question

I have a dataframe containing users' trajectories and segments. A segment is the part of a trajectory between two stops. My df looks like this:

df = pd.DataFrame(
    {
        'trajectory': [1,1,1,2,2,2,3,3,3,4],
        'segment': [0,2,4,1,3,5,2,5,1,2],
        'user': ['A','A','A','B','B','B','A','A','A','C']
    }
)

df
   trajectory  segment user
0           1        0    A
1           1        2    A
2           1        4    A
3           2        1    B
4           2        3    B
5           2        5    B
6           3        2    A
7           3        5    A
8           3        1    A
9           4        2    C

The segment numbers within a user's trajectory are not sequential, e.g. trajectory 3 of user A has segments 2, 5, 1, so 3 segments.
Some users contribute more segments than others.

I want to plot the CDF of the number of segments per trajectory per user, to understand how many segments a user contributes per trajectory on average.

Is this all the data you have, or is there more? Trajectory `4` appears only once, so it has no variance and therefore no CDF to plot. - Nikita Shabankin, Oct 04 '22 at 12:47

Nikita Shabankin · Answer 1 · 2022-10-05T02:15:03.873

Let me first expand your data a little so that there is enough to illustrate with:

import pandas as pd
import seaborn as sns

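# draw 10x as many values per column as in your initial data (with replacement);
# each column is sampled independently, so the resulting rows are synthetic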
trajectory = pd.Series([1,1,1,2,2,2,3,3,3,4]).sample(frac=10, replace=True)
segment = pd.Series([0,2,4,1,3,5,2,5,1,2]).sample(frac=10, replace=True)
user = pd.Series(['A','A','A','B','B','B','A','A','A','C']).sample(frac=10, replace=True)

# rebuild the DataFrame with upsampled data
df = pd.DataFrame({'trajectory': trajectory.to_numpy(),
                   'segment': segment.to_numpy(),
                   'user': user.to_numpy()})

Now let's group df by trajectory and by user, and use count to find the number of segments each user has in every trajectory:

# reset index so 'trajectory' and 'user' become columns again
df_grouped = df.groupby(['trajectory', 'user'])['segment'].agg('count').reset_index()
# rename the columns to give the aggregated column a name
df_grouped.columns = ['trajectory', 'user', 'segment_count']
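
For comparison, here is a minimal sketch of the same grouping applied to the ten original rows from the question rather than the upsampled data. The names `df_orig`, `counts` and `mean_per_user` are only illustrative; `reset_index(name=...)` is used as a shortcut for the rename step, and the per-user mean is one possible way to answer the "on average" part of the question.

```python
import pandas as pd

# the ten rows from the question, unchanged
df_orig = pd.DataFrame({
    'trajectory': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    'segment': [0, 2, 4, 1, 3, 5, 2, 5, 1, 2],
    'user': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'C'],
})

# number of segment rows per (user, trajectory) pair
counts = (
    df_orig.groupby(['user', 'trajectory'])['segment']
    .count()
    .reset_index(name='segment_count')
)

# average number of segments per trajectory for each user
mean_per_user = counts.groupby('user')['segment_count'].mean()

print(counts)
print(mean_per_user)
```

With so few trajectories per user there is little to plot, which is why the data was upsampled above.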

And then plot df_grouped with seaborn's .kdeplot(), using cumulative=True to plot a kernel-smoothed estimate of the CDF:

sns.kdeplot(data=df_grouped, x='segment_count', hue='user', cumulative=True)
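
Since segment counts are small integers, the empirical (step-function) CDF may be easier to read than the smoothed curve. Below is a rough sketch of computing it by hand with pandas and NumPy; the helper `ecdf_per_user` and the tiny `example` frame are hypothetical and only stand in for df_grouped.

```python
import numpy as np
import pandas as pd

def ecdf_per_user(counts, value_col='segment_count', group_col='user'):
    """Return the empirical CDF of value_col for each group, in long format."""
    frames = []
    for name, grp in counts.groupby(group_col):
        values = grp[value_col].to_numpy()
        # distinct values (sorted) and how many observations equal each one
        uniq, cnt = np.unique(values, return_counts=True)
        frames.append(pd.DataFrame({
            group_col: name,
            value_col: uniq,
            # fraction of observations less than or equal to each value
            'cdf': np.cumsum(cnt) / len(values),
        }))
    return pd.concat(frames, ignore_index=True)

# made-up counts, used only to illustrate the helper
example = pd.DataFrame({
    'user': ['A', 'A', 'A', 'A', 'B', 'B'],
    'segment_count': [1, 2, 2, 4, 3, 5],
})
ecdf = ecdf_per_user(example)
print(ecdf)
```

If only the plot is needed, `sns.ecdfplot(data=df_grouped, x='segment_count', hue='user')` should draw the same step function directly.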

Output:
