Iterating over a dataframe twice: which is the ideal way?

Question

I am trying to create a dataframe for Sankey chart in Power BI which needs source and destination like this.

id	Source	Destination
1	Starting a	next point b
1	next point b	final point c
1	final point c	end
2	Starting a	next point b
2	next point b
3	Starting a	next point b
3	next point b	final point c
3	final point c	end

I have a dataframe like this:

ID	flow
1	Starting a
1	next point b
1	final point c
2	Starting a
2	next point b
3	Starting a
3	next point b
3	final point c

I tried doing by iterating over the dataframe twice like below:

for index, row in df.iterrows():
  for j, r in df.iterrows():
    if row['ID'] == r['ID']:
        if (index + 1 == j) & ("final point c" not in row['flow']):
            df['Destination'][index] = df['flow'][j]
        elif "final point c" in row['flow']:
            df['Destination'][index] = 'End of flow'

Since it is iterating over the same dataframe twice, when the records are huge, it is taking a lot of time to process.

Is there any better way to do this? I tried looking at the all similar questions, but couldn't find anything that relates to my question.

score 0 · Accepted Answer · answered Nov 18 '21 at 05:16

You could use groupby+shift and a bit of masking:

end = df['flow'].str.startswith('final point')
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1)
                               .mask(end, end.map({True: 'end'}))
                 )
         .rename(columns={'flow': 'source'})
       )

output:

   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b            NaN
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end

Alternative with combine_first to fill the NaNs:

end = df['flow'].str.startswith('final point').map({True: 'end', False: ''})
df2 = (df.assign(destination=df.groupby('ID')['flow'].shift(-1).combine_first(end))
         .rename(columns={'flow': 'source'})
       )

output:

   ID         source    destination
0   1     Starting a   next point b
1   1   next point b  final point c
2   1  final point c            end
3   2     Starting a   next point b
4   2   next point b               
5   3     Starting a   next point b
6   3   next point b  final point c
7   3  final point c            end

Iterating over a dataframe twice: which is the ideal way?

1 Answers1