I am building a repeat-orders report in an IPython notebook using GraphLab and SFrames. I have a CSV file with roughly 100k rows of data containing user_id, user_email, and user_phone. I added a new column called unique_identifier. For each row, I traverse all other rows to see whether user_id, user_email, or user_phone matches the current record. If unique_identifier is still empty and there is a match, I assign the user_id from the current record to the unique_identifier slot of each matching record.

At the end, I get an SFrame with 4 columns, where unique_identifier contains the user_id of the oldest order among all matching orders. I do this via the .apply method with a lambda function. The whole process takes a few seconds on my laptop. However, once it is done, the SFrame becomes extremely slow and unmanageable, to the point where SFrame.save seems to take forever.

It seems like my process of adding unique_identifier clogs up the memory or something like that. However, the problem does not depend on the SFrame size: even if I limit it to just 10 rows, the problem persists. What am I doing wrong?

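Here is my method:

def set_unique_identifier():
  # Start with an empty identifier for every order.
  orders['unique_identifier'] = ''
  # Keep an identifier that is already set; otherwise take the user_id of
  # the first order whose email or phone matches the current order.
  orders['unique_identifier'] = orders.apply(lambda order:
       order['unique_identifier'] if order['unique_identifier'] else
       orders[(orders['user_email'] == order['user_email']) |
       (orders['user_phone'] == order['user_phone'])][0]['user_id'])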
Shami

1 Answer

Don't call apply on the entire SFrame. Call it on a single SArray (one column) instead; that should speed things up a little.
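
Going a step further, the per-row scan over the whole frame could probably be avoided altogether. Below is a rough sketch in plain Python, not a GraphLab API: `first_match_ids` is a hypothetical helper, and it assumes the rows are available as a list of dicts ordered from oldest to newest. It records the first row seen for each email and each phone in one pass, then looks those up for every row, which should give the same result as the lambda in the question (first row matching on email or phone).

```python
def first_match_ids(rows):
    """Return, for each row, the user_id of the earliest row that shares
    its user_email or user_phone.

    `rows` is assumed to be a list of dicts with the keys user_id,
    user_email and user_phone, ordered from oldest to newest.
    """
    first_by_email = {}  # email -> index of the first row with that email
    first_by_phone = {}  # phone -> index of the first row with that phone

    # Pass 1: remember where each email and phone first appears.
    for index, row in enumerate(rows):
        first_by_email.setdefault(row['user_email'], index)
        first_by_phone.setdefault(row['user_phone'], index)

    # Pass 2: the earliest matching row is the smaller of the two indices.
    ids = []
    for row in rows:
        earliest = min(first_by_email[row['user_email']],
                       first_by_phone[row['user_phone']])
        ids.append(rows[earliest]['user_id'])
    return ids


orders_rows = [
    {'user_id': 1, 'user_email': 'a@x.com', 'user_phone': '111'},
    {'user_id': 2, 'user_email': 'b@x.com', 'user_phone': '222'},
    {'user_id': 3, 'user_email': 'a@x.com', 'user_phone': '333'},
]
print(first_match_ids(orders_rows))
```

Like the original lambda, this does not follow chains of matches (A matches B by email, B matches C by phone), and rows with empty emails or phones would all match each other. Assigning the resulting list back to the SFrame as a new column is left out here because I have not tested it on your data.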

ikel