I have a DataFrame containing users' trajectories and segments. A segment is the part of a trajectory between two stops. My df looks like this:

import pandas as pd

df = pd.DataFrame(
    {
        'trajectory': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
        'segment': [0, 2, 4, 1, 3, 5, 2, 5, 1, 2],
        'user': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'C']
    }
)

df
  trajectory segment user
0     1        0      A
1     1        2      A
2     1        4      A
3     2        1      B
4     2        3      B
5     2        5      B
6     3        2      A
7     3        5      A
8     3        1      A
9     4        2      C
  • the segment numbers within a user's trajectory are not sequential, e.g. trajectory 3 of user A has segments 2, 5, 1, so 3 segments.
  • some users contribute more segments than others.

I want to plot the CDF of the number of segments per trajectory per user. The goal is to understand how many segments a user contributes per trajectory on average.

Trenton McKinney
planar

1 Answer

Let me first expand your data a little so that there is enough of it to illustrate the plot:

import pandas as pd
import seaborn as sns

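# upsample each column to 10x its length, with replacement
# note: the columns are sampled independently, so the original row pairing
# is not preserved; this only generates more data to plot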
trajectory = pd.Series([1,1,1,2,2,2,3,3,3,4]).sample(frac=10, replace=True)
segment = pd.Series([0,2,4,1,3,5,2,5,1,2]).sample(frac=10, replace=True)
user = pd.Series(['A','A','A','B','B','B','A','A','A','C']).sample(frac=10, replace=True)

# rebuild the DataFrame with upsampled data
df = pd.DataFrame({'trajectory': trajectory.to_numpy(),
                   'segment': segment.to_numpy(),
                   'user': user.to_numpy()})

Now group df by trajectory and user, and use count to find the number of segments each user has in each trajectory:

# count the segment rows for each (trajectory, user) pair
df_grouped = df.groupby(['trajectory', 'user'])['segment'].agg('count').reset_index()
# reset_index() above turned 'trajectory' and 'user' back into columns;
# rename the columns so the aggregated column gets a descriptive name
df_grouped.columns = ['trajectory', 'user', 'segment_count']
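
As a sanity check, the same grouping can be run on the original ten rows from the question, where the counts are easy to verify by eye. This is only a sketch: the names df_small, counts_small and mean_per_user are mine, and the per-user mean is one possible reading of "how many segments a user contributes per trajectory on average".

```python
import pandas as pd

# the original, non-upsampled data from the question
df_small = pd.DataFrame({
    'trajectory': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    'segment': [0, 2, 4, 1, 3, 5, 2, 5, 1, 2],
    'user': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'C'],
})

# number of segment rows for each (trajectory, user) pair
counts_small = (
    df_small.groupby(['trajectory', 'user'])['segment']
    .count()
    .reset_index(name='segment_count')
)
print(counts_small)

# average number of segments per trajectory, for each user
mean_per_user = counts_small.groupby('user')['segment_count'].mean()
print(mean_per_user)
```

Here trajectories 1, 2 and 3 each have 3 segments and trajectory 4 has 1, so users A and B average 3 segments per trajectory and user C averages 1.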

Then plot df_grouped with seaborn's kdeplot(), passing cumulative=True to draw the CDF. Note that this is a kernel-smoothed estimate, not the exact step-shaped empirical CDF:

sns.kdeplot(data=df_grouped, x='segment_count', hue='user', cumulative=True)

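
Since segment counts are small integers, the exact empirical CDF may be easier to read than a smoothed curve. Below is a minimal sketch of computing it with pandas alone, using the per-trajectory counts from the question's original ten rows. The helper empirical_cdf and the variable names are my own, not part of any library.

```python
import pandas as pd

# per-(trajectory, user) segment counts for the question's original data
counts = pd.DataFrame({
    'trajectory': [1, 2, 3, 4],
    'user': ['A', 'B', 'A', 'C'],
    'segment_count': [3, 3, 3, 1],
})

def empirical_cdf(values: pd.Series) -> pd.Series:
    """Return P(X <= x) for each distinct x in values, indexed by x."""
    return values.value_counts(normalize=True).sort_index().cumsum()

# ECDF over all trajectories
ecdf_all = empirical_cdf(counts['segment_count'])
print(ecdf_all)

# one ECDF per user
ecdf_by_user = {
    user: empirical_cdf(group['segment_count'])
    for user, group in counts.groupby('user')
}
```

If you only need the picture, seaborn 0.11+ also has ecdfplot, which draws this step function directly: sns.ecdfplot(data=df_grouped, x='segment_count', hue='user').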

Nikita Shabankin