In pandas, I have two data frames. One contains the holidays of a particular country, taken from http://www.timeanddate.com/holidays/austria, and the other contains a date column. I want to calculate the number of days after a holiday.
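import numpy as np

def compute_date_diff(x, y):
    # x: a date from secondDF.dateColumn, y: the holiday date passed via args
    difference = y - x  # holiday date minus row date
    differenceAsNumber = difference / np.timedelta64(1, 'D')
    return int(differenceAsNumber)

for index, row in holidays.iterrows():
    # one column per holiday name; a later row with the same name overwrites the earlier one
    secondDF[row['name'] + '_daysAfter'] = secondDF.dateColumn.apply(compute_date_diff, args=(row.day,))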
However, this
- calculates the wrong difference, e.g. more than a year, when holidays contains data for more than one year.
- is pretty slow.
How could I fix the flaw and increase performance? Is there a parallel apply? Or would the holiday calendars described at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#holidays-holiday-calendars help here?
As I am new to pandas, I am unsure how to obtain the current date/index of the date object while iterating in apply. As far as I know, I cannot loop the other way round, e.g. over all the rows in secondDF, because I could not find a way to generate the feature columns while iterating via apply.
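To make the goal concrete, here is a minimal vectorized sketch on made-up data. It assumes that "days after a holiday" means the number of days since the most recent occurrence of each named holiday, so the value should stay below a year once a previous occurrence exists. The sample rows and the helper name add_days_after are hypothetical (it is not a pandas function); only the column names name, day and dateColumn come from the code above. Dates that fall before the first occurrence of a holiday are left as NaN.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only the column names come from the question.
holidays = pd.DataFrame({
    "name": ["National Day", "Christmas Day", "National Day", "Christmas Day"],
    "day": pd.to_datetime(["2014-10-26", "2014-12-25", "2015-10-26", "2015-12-25"]),
})
secondDF = pd.DataFrame({
    "dateColumn": pd.to_datetime(["2014-10-28", "2015-01-02", "2015-10-27", "2015-12-31"]),
})

def add_days_after(df, holidays, date_col="dateColumn"):
    """Hypothetical helper: add one '<name>_daysAfter' column per holiday name."""
    out = df.copy()
    dates = out[date_col].to_numpy()
    for name, group in holidays.groupby("name"):
        # All occurrences of this holiday, sorted so a binary search is valid.
        days = np.sort(group["day"].to_numpy())
        # Position of the last occurrence on or before each date (-1 if none).
        pos = np.searchsorted(days, dates, side="right") - 1
        has_prev = pos >= 0
        diff = np.full(len(dates), np.nan)
        diff[has_prev] = (dates[has_prev] - days[pos[has_prev]]) / np.timedelta64(1, "D")
        out[name + "_daysAfter"] = diff
    return out

result = add_days_after(secondDF, holidays)
print(result)
```

The loop runs once per distinct holiday name rather than once per row, and each step works on whole arrays, so it avoids both iterrows and apply. Whether this matches the intended definition of the feature is exactly what I am unsure about.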