
I have data as below in a dataframe

FID SID_START SID_END
404915 1 3

and this should be expanded as below

FID SID
404915 1
404915 2
404915 3

so that I can group by SID to get the count.

I have around 480 million rows, and I am using the explode function in pandas:

df['SID'] = [pd.Series(range(left, right + 1))
             for left, right in zip(df['SID_START'], df['SID_END'])]

df = df.explode('SID').drop(['SID_START', 'SID_END'], axis=1)

It's taking around 10 minutes for 10 million records. Is there a faster way in Python to handle this?

abcd_1234

1 Answer


Pandas is probably not a good choice for 480 million records. Having said that, here is one approach which utilizes NumPy's repeat function to explode the dataframe in an efficient manner:

# length of each range (end - start + 1)
n = df['SID_END'] - df['SID_START'] + 1

# repeat each index label n times to duplicate the rows
df1 = df.reindex(df.index.repeat(n))

# cumcount creates a sequential counter per group;
# adding it to SID_START yields the SID
df1['SID'] = df['SID_START'] + df1.groupby('FID').cumcount()

Input

      FID  SID_START  SID_END
0  404915          1        3
1  404916          5       10

Output

      FID  SID_START  SID_END  SID
0  404915          1        3    1
0  404915          1        3    2
0  404915          1        3    3
1  404916          5       10    5
1  404916          5       10    6
1  404916          5       10    7
1  404916          5       10    8
1  404916          5       10    9
1  404916          5       10   10
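
For completeness, a minimal self-contained sketch (using the two-row example frame above) that runs the same steps end to end and then does the groupby on SID that the question is ultimately after:

import pandas as pd

# example input from this answer
df = pd.DataFrame({'FID': [404915, 404916],
                   'SID_START': [1, 5],
                   'SID_END': [3, 10]})

# number of rows each range expands to
n = df['SID_END'] - df['SID_START'] + 1

# duplicate each row n times by repeating its index label
df1 = df.reindex(df.index.repeat(n))

# per-group counter added to the start value gives the SID
df1['SID'] = df['SID_START'] + df1.groupby('FID').cumcount()

# the count per SID the question is after
counts = df1.groupby('SID').size()
print(df1[['FID', 'SID']])
print(counts)
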
Shubham Sharma
  • Sharma, thank you, this worked very well and was able to process 10 million rows in 5 seconds :) but in my data FID is not unique and a few rows can be like this: FID SID_START SID_END 404915 1 3 404915 7 10; these rows are generating wrong ranges – abcd_1234 Jun 17 '23 at 05:40
  • Group by index would solve this problem: df1['SID'] = df['SID_START'] + df1.groupby(df1.index).cumcount() – abcd_1234 Jun 17 '23 at 06:43
  • @abcd_1234 Glad to help. Yes, if FID is not unique, then groupby on the index would be best. – Shubham Sharma Jun 17 '23 at 06:49
  • The statement df1['SID'] = df['SID_START'] + df1.groupby('FID').cumcount() is taking a lot of memory; even writing back to the same column, df1['SID_START'] = df['SID_START'] + df1.groupby('FID').cumcount(), does too. Is there a way to reduce the memory? Anything in place to alter an existing column with new values? – abcd_1234 Jun 27 '23 at 04:23
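
A minimal sketch of the index-based variant discussed in the comments above, assuming the same column layout; grouping by the original row index instead of FID keeps each start/end pair on its own counter even when an FID repeats:

import pandas as pd

# FID 404915 appears twice with two different ranges
df = pd.DataFrame({'FID': [404915, 404915],
                   'SID_START': [1, 7],
                   'SID_END': [3, 10]})

n = df['SID_END'] - df['SID_START'] + 1
df1 = df.reindex(df.index.repeat(n))

# group by the repeated index so each original row gets its own 0..n-1 counter
df1['SID'] = df['SID_START'] + df1.groupby(df1.index).cumcount()
# SID values per original row: 1, 2, 3 and 7, 8, 9, 10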