In pandas, I have two data frames: one containing the holidays of a particular country (from http://www.timeanddate.com/holidays/austria) and another containing a date column. I want to calculate the number of days after a holiday.

def compute_date_diff(x, y):
    difference = y - x
    differenceAsNumber = (difference/ np.timedelta64(1, 'D'))
    return differenceAsNumber.astype(int)

for index, row in holidays.iterrows():
    secondDF[row['name']+ '_daysAfter'] = secondDF.dateColumn.apply(compute_date_diff, args=(row.day,))

However, this

  • calculates the wrong difference, e.g. more than a year when holidays contains data for more than one year.
  • is pretty slow.

How can I fix the flaw and improve performance? Is there a parallel apply? Or what about holiday calendars (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#holidays-holiday-calendars)?

As I am new to pandas, I am unsure how to obtain the current date/index of the date object while iterating in apply. As far as I know, I cannot loop the other way round, i.e. over all rows in secondDF, because I was unable to generate feature columns while iterating via apply.

Georg Heiler

2 Answers

To do this, join both data frames on a common column, then compute the difference between the two date columns like this:

import pandas
import numpy as np

# Build a small example frame with two datetime columns.
df = pandas.DataFrame({
    'to': [pandas.Timestamp('2014-01-24'), pandas.Timestamp('2014-01-27'), pandas.Timestamp('2014-01-23')],
    'fr': [pandas.Timestamp('2014-01-26'), pandas.Timestamp('2014-01-27'), pandas.Timestamp('2014-01-24')],
})
# Dividing a timedelta by one day gives the difference in days as a float.
df['ans'] = (df.fr - df.to) / np.timedelta64(1, 'D')
print(df)

output

          to         fr  ans
0 2014-01-24 2014-01-26  2.0
1 2014-01-27 2014-01-27  0.0
2 2014-01-23 2014-01-24  1.0
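
Since the question asks for integer day counts, a minimal variation on the example above might use the `.dt.days` accessor on the timedelta column instead of dividing by `np.timedelta64(1, 'D')`. This is only a sketch on the same sample data; note that `.dt.days` truncates any sub-day remainder, which does not matter here because all timestamps fall on midnight.

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    "to": pd.to_datetime(["2014-01-24", "2014-01-27", "2014-01-23"]),
    "fr": pd.to_datetime(["2014-01-26", "2014-01-27", "2014-01-24"]),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.days extracts the whole-day component as integers.
df["ans"] = (df["fr"] - df["to"]).dt.days
print(df)
```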
Shijo

I settled for something entirely different: now only the number of days to the nearest holiday is calculated.

My function:

def get_nearest_holiday(holidays, pivot):
    # Return the holiday closest to `pivot`, whether before or after it.
    # The result still needs to be converted to an int (a day count),
    # but at least the nearest holiday is found.
    return min(holidays, key=lambda x: abs(x - pivot))

It is called from a lambda expression on a per-row basis.
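
As a possible vectorized alternative to the per-row call, the lookup could be sketched with `pandas.merge_asof`, which matches each date to a holiday in a sorted as-of join. The sample data below is hypothetical; the column names `name`, `day`, and `dateColumn` follow the question, while `days_after_holiday` is a name introduced here for illustration. With `direction="backward"` each date is matched to the most recent holiday on or before it, which also avoids differences spanning more than a year when `holidays` covers several years; `direction="nearest"` would mirror the nearest-holiday function above.

```python
import pandas as pd

# Hypothetical sample data; both frames must be sorted by their date key.
holidays = pd.DataFrame({
    "name": ["New Year", "Epiphany", "New Year"],
    "day": pd.to_datetime(["2014-01-01", "2014-01-06", "2015-01-01"]),
}).sort_values("day")

secondDF = pd.DataFrame({
    "dateColumn": pd.to_datetime(
        ["2014-01-03", "2014-01-06", "2014-12-31", "2015-01-02"]
    ),
}).sort_values("dateColumn")

# For each date, pick the last holiday on or before it.
merged = pd.merge_asof(
    secondDF,
    holidays,
    left_on="dateColumn",
    right_on="day",
    direction="backward",
)

# Whole days elapsed since the matched holiday.
merged["days_after_holiday"] = (merged["dateColumn"] - merged["day"]).dt.days
print(merged)
```

Dates earlier than the first holiday would get no match (`NaT`), so the day count for those rows would be missing rather than an integer.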

Georg Heiler