I want to find the difference in days between two columns of a dataframe (specifically, a GraphLab SFrame).
I have written a couple of functions to do this, but neither is fast enough. Speed is the issue because I have ~80 million rows to process.
Both attempts are shown below. In each, the t2_colname_str and t1_colname_str arguments are the names of the columns I want to compare, and both columns contain datetime.datetime objects.
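To make the target value concrete, here is a minimal sketch of the per-row computation on plain datetime.datetime values. The dates are made up for illustration and are not taken from my data.

```python
from datetime import datetime

# Hypothetical values standing in for one row of the two columns
t1 = datetime(2015, 1, 1, 12, 0)
t2 = datetime(2015, 1, 11, 18, 30)

# Subtracting two datetimes gives a timedelta; .days keeps whole days only
days_forward = (t2 - t1).days

# When t2 is earlier than t1, .days is floored, not truncated toward zero
days_backward = (t1 - t2).days

print(days_forward, days_backward)
```

Note that a gap of 10 days and 6.5 hours gives 10 in one direction and -11 in the other, because timedelta normalizes negative values by flooring the day count.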
For Loop
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl

    # Build the new column name; [:-9] drops the last 9 characters of each input name
    new_colname = t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9]
    diff_days_list = []
    for i in range(len(sframe_obj[t2_colname_str])):
        t2 = sframe_obj[t2_colname_str][i]
        t1 = sframe_obj[t1_colname_str][i]
        try:
            diff = t2 - t1
            # timedelta.days keeps whole days only
            diff_days_list.append(diff.days)
        except TypeError:
            # Raised when either value is missing (None)
            diff_days_list.append(None)
    # Add the result to the SFrame in place as a new column
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
List Comprehension
I know this is not the intended purpose of list comprehensions, but I just tried it to see if it was faster.
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl

    # Build the new column name; [:-9] drops the last 9 characters of each input name
    new_colname = t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9]
    t2_col = sframe_obj[t2_colname_str]
    t1_col = sframe_obj[t1_colname_str]
    # Check both values explicitly for None before subtracting
    diff_days_list = [
        (t2_col[i] - t1_col[i]).days if t2_col[i] is not None and t1_col[i] is not None else None
        for i in range(len(t2_col))
    ]
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
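Both versions compute the same value one row at a time. For comparison, the same result can be phrased as arithmetic on whole columns of integer timestamps. The following is only a sketch on plain Python lists using the standard library; the helper names to_epoch_seconds and diff_days_columnwise are hypothetical. Whether an SFrame datetime column can be converted to integer seconds directly (for example with SArray.astype(int)) so that the subtraction runs inside the library is an assumption that is not verified here.

```python
import calendar
from datetime import datetime

SECONDS_PER_DAY = 86400


def to_epoch_seconds(dt):
    """Convert a naive datetime to whole seconds since 1970-01-01, or None if missing."""
    if dt is None:
        return None
    # timegm treats the naive datetime as UTC and ignores microseconds
    return calendar.timegm(dt.timetuple())


def diff_days_columnwise(t2_values, t1_values):
    """Day difference for two equal-length sequences of datetimes (None allowed)."""
    t2_secs = [to_epoch_seconds(v) for v in t2_values]
    t1_secs = [to_epoch_seconds(v) for v in t1_values]
    # Floor division matches timedelta.days for whole-second inputs,
    # including the flooring of negative differences
    return [
        None if a is None or b is None else (a - b) // SECONDS_PER_DAY
        for a, b in zip(t2_secs, t1_secs)
    ]


# Hypothetical sample columns, including a missing value
t2_column = [datetime(2015, 1, 11, 18, 30), None, datetime(2015, 1, 1, 12, 0)]
t1_column = [datetime(2015, 1, 1, 12, 0), datetime(2015, 1, 1, 12, 0), datetime(2015, 1, 11, 18, 30)]
result = diff_days_columnwise(t2_column, t1_column)
print(result)
```

The point of this formulation is that the subtraction and division no longer depend on datetime objects, so they are the kind of operation a column-oriented library can usually apply to an entire column at once instead of element by element.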
Additional Notes
I have been using GraphLab Create by Dato and its SFrame data structure, mainly because it parallelizes the computation, which makes my analysis very fast, and because it has a great library for machine learning applications. It's a great product if you haven't checked it out already.
The GraphLab user guide can be found here: https://dato.com/learn/userguide/index.html