
I have data as below in a dataframe

FID SID_START SID_END
404915 1 3

and this should be expanded as below

FID SID
404915 1
404915 2
404915 3

so that I can group by SID to get the count.

I have around 480 million rows, and I am using the explode function in pandas:

df['SID'] = [pd.Series(range(left, right + 1))
             for left, right in zip(df['SID_START'], df['SID_END'])]

df = df.explode('SID').drop(['SID_START', 'SID_END'], axis=1)

It's taking around 10 minutes for 10 million records. Is there a faster way in Python to handle this?

abcd_1234

1 Answer


Pandas is probably not a good choice for 480 million records. Having said that, here is one approach which utilizes NumPy's repeat function to explode the dataframe in an efficient manner:

# length of each range (end - start + 1)
n = df['SID_END'] - df['SID_START'] + 1

# repeat each index label n times to duplicate the rows
df1 = df.reindex(df.index.repeat(n))

# cumcount creates a sequential counter per group;
# adding it to SID_START yields the SID
df1['SID'] = df['SID_START'] + df1.groupby('FID').cumcount()

Input

      FID  SID_START  SID_END
0  404915          1        3
1  404916          5       10

Output

      FID  SID_START  SID_END  SID
0  404915          1        3    1
0  404915          1        3    2
0  404915          1        3    3
1  404916          5       10    5
1  404916          5       10    6
1  404916          5       10    7
1  404916          5       10    8
1  404916          5       10    9
1  404916          5       10   10
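
For completeness, a minimal self-contained sketch (using the two-row example frame above) that runs the same steps end to end and then does the groupby on SID that the question is ultimately after:

import pandas as pd

# example input from this answer
df = pd.DataFrame({'FID': [404915, 404916],
                   'SID_START': [1, 5],
                   'SID_END': [3, 10]})

# number of rows each range expands to
n = df['SID_END'] - df['SID_START'] + 1

# duplicate each row n times by repeating its index label
df1 = df.reindex(df.index.repeat(n))

# per-group counter added to the start value gives the SID
df1['SID'] = df['SID_START'] + df1.groupby('FID').cumcount()

# the count per SID the question is after
counts = df1.groupby('SID').size()
print(df1[['FID', 'SID']])
print(counts)
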
Shubham Sharma
  • Sharma, thank you, this worked very well and was able to process 10 million rows in 5 seconds :) but in my data FID is not unique and a few rows can be like this: FID SID_START SID_END 404915 1 3 404915 7 10; these rows are generating wrong ranges – abcd_1234 Jun 17 '23 at 05:40
  • Group by index would solve this problem: df1['SID'] = df['SID_START'] + df1.groupby(df1.index).cumcount() – abcd_1234 Jun 17 '23 at 06:43
  • @abcd_1234 Glad to help. Yes, if FID is not unique, then groupby on the index would be best. – Shubham Sharma Jun 17 '23 at 06:49
  • The statement df1['SID'] = df['SID_START'] + df1.groupby('FID').cumcount() is taking a lot of memory; even writing back to the same column, df1['SID_START'] = df['SID_START'] + df1.groupby('FID').cumcount(), does too. Is there a way to reduce the memory? Anything in place to alter an existing column with new values? – abcd_1234 Jun 27 '23 at 04:23
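
A minimal sketch of the index-based variant discussed in the comments above, assuming the same column layout; grouping by the original row index instead of FID keeps each start/end pair on its own counter even when an FID repeats:

import pandas as pd

# FID 404915 appears twice with two different ranges
df = pd.DataFrame({'FID': [404915, 404915],
                   'SID_START': [1, 7],
                   'SID_END': [3, 10]})

n = df['SID_END'] - df['SID_START'] + 1
df1 = df.reindex(df.index.repeat(n))

# group by the repeated index so each original row gets its own 0..n-1 counter
df1['SID'] = df['SID_START'] + df1.groupby(df1.index).cumcount()
# SID values per original row: 1, 2, 3 and 7, 8, 9, 10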