0

I am trying to create a dataframe for Sankey chart in Power BI which needs source and destination like this.

id Source Destination
1 Starting a next point b
1 next point b final point c
1 final point c end
2 Starting a next point b
2 next point b
3 Starting a next point b
3 next point b final point c
3 final point c end

I have a dataframe like this:

ID flow
1 Starting a
1 next point b
1 final point c
2 Starting a
2 next point b
3 Starting a
3 next point b
3 final point c

I tried doing by iterating over the dataframe twice like below:

for index, row in df.iterrows():
  for j, r in df.iterrows():
    if row['ID'] == r['ID']:
        if (index + 1 == j) & ("final point c" not in row['flow']):
            df['Destination'][index] = df['flow'][j]
        elif "final point c" in row['flow']:
            df['Destination'][index] = 'End of flow'

Since it is iterating over the same dataframe twice, when the records are huge, it is taking a lot of time to process.

Is there any better way to do this? I tried looking at the all similar questions, but couldn't find anything that relates to my question.

Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
Rach
  • 37
  • 5

1 Answers1

0

You could use groupby+shift and a bit of masking:

end = df['flow'].str.startswith('final point')
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1)
                               .mask(end, end.map({True: 'end'}))
                 )
         .rename(columns={'flow': 'source'})
       )

output:

   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b            NaN
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end

Alternative with combine_first to fill the NaNs:

end = df['flow'].str.startswith('final point').map({True: 'end', False: ''})
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1).combine_first(end))
         .rename(columns={'flow': 'source'})
       )

output:

   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b               
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end
mozway
  • 194,879
  • 13
  • 39
  • 75