I am building a repeat-orders report in an IPython notebook using GraphLab and SFrames. I have a CSV file with roughly 100k rows of data containing user_id, user_email, and user_phone. I added a new column called unique_identifier. For each row, I traverse all other rows to see whether user_id, user_email, or user_phone matches the current record. If there is a match and unique_identifier is still empty, I assign the user_id from the current record into the unique_identifier slot of each matching record.
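
For context, this is roughly how the data gets loaded into the notebook (the file name here is illustrative):

import graphlab as gl

# ~100k rows with user_id, user_email and user_phone columns
orders = gl.SFrame.read_csv('orders.csv')
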
At the end, I get an SFrame with 4 columns, where unique_identifier contains the user_id of the oldest order for all matching orders. I am doing this via the .apply method with a lambda function. The whole process takes a few seconds on my laptop. However, after the process is done, the SFrame becomes extremely slow and unmanageable, to the point where SFrame.save seems to take forever.

It seems like my process of adding unique_identifier clogs up the memory or something like that. However, the problem is independent of the SFrame's size: if I limit it to just 10 rows, the problem persists. What am I doing wrong?
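
For reference, a condensed version of what I run when I limit it to ten rows (output path illustrative); the save still crawls:

# tiny slice to rule out data volume
small = orders.head(10)
small['unique_identifier'] = small.apply(
    lambda order: small[(small['user_email'] == order['user_email']) |
                        (small['user_phone'] == order['user_phone'])][0]['user_id'])

# takes far longer than ten rows should
small.save('small_orders_test')
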
Here is my method:
def set_unique_identifier():
    # start every order with an empty identifier
    orders['unique_identifier'] = ''

    # keep the identifier if it is already set; otherwise take the user_id
    # of the first (oldest) order sharing the same email or phone
    orders['unique_identifier'] = orders.apply(
        lambda order: order['unique_identifier'] if order['unique_identifier']
        else orders[(orders['user_email'] == order['user_email']) |
                    (orders['user_phone'] == order['user_phone'])][0]['user_id'])
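
And this is how I invoke it in the notebook; the apply itself finishes in a few seconds, and it is only afterwards that everything stalls (output path illustrative):

set_unique_identifier()   # finishes in a few seconds

# any subsequent operation on orders crawls; this save in particular
orders.save('orders_with_unique_identifier')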