0

I read in a DataFrame with around 80.000 observations taken every 15 mins (= 96 data records per day). The objective is to create a new column which indicates (with 1 or 0) that the records are taken at a holiday.

df = pd.read_csv('data_prepared.csv', sep=',', index_col=10, parse_dates=True)

Out[49]: 
                                datetime    H1     H2  ...     H7     H8    Hges
MEZ/MESZ                                               ...                      
2018-04-01 00:15:00  2018-03-31 22:15:00  25.0   47.0  ...   47.0   45.0   241.0
2018-04-01 00:30:00  2018-03-31 22:30:00  42.0   45.0  ...   37.0   35.0   228.0
2018-04-01 00:45:00  2018-03-31 22:45:00  25.0   50.0  ...   42.0   35.0   205.0
2018-04-01 01:00:00  2018-03-31 23:00:00  34.0   68.0  ...   46.0   88.0   301.0
2018-04-01 01:15:00  2018-03-31 23:15:00  27.0   45.0  ...   44.0  168.0   352.0
...

In Pandas: Checking if a date is a holiday and assigning boolean value I saw how to assign a True/False value for every observation when it was taken at a holiday. So I tried to recreate this:

dr = df.index
df = pd.DataFrame()
df['Date'] = dr

I created a csv, where the Bavarian holidays are collected:

b_hol = pd.read_csv(r'Data/Dt_Feiertage.csv', encoding= 'unicode_escape', sep=';', index_col=0,    parse_dates=True)
holidays = b_hol.index
holidays = holidays.strftime('%Y-%m-%d')
holidays[0:20]

Out[52]: 
Index(['2018-01-01', '2018-01-06', '2018-03-30', '2018-04-02', '2018-05-01',
       '2018-05-10', '2018-05-21', '2018-05-31', '2018-08-15', '2018-10-03',
       '2018-11-01', '2018-12-25', '2018-12-26', '2019-01-01', '2019-01-06',
       '2019-04-19', '2019-04-22', '2019-05-01', '2019-05-30', '2019-06-10'],
      dtype='object', name='Datum')

With .isin I ask the DataFrame whether the observation was taken at a holiday:

df['Holiday'] = df['Date'].isin(holidays)

The code works fine so far, but my problem is, that e.g. for the '2018-04-02' I want that every observation is marked with True and not only the first timestamp, as you can see here:

df[90:100]
Out[56]: 
                  Date  Holiday
90 2018-04-01 22:45:00    False
91 2018-04-01 23:00:00    False
92 2018-04-01 23:15:00    False
93 2018-04-01 23:30:00    False
94 2018-04-01 23:45:00    False
95 2018-04-02 00:00:00     True
96 2018-04-02 00:15:00    False
97 2018-04-02 00:30:00    False
98 2018-04-02 00:45:00    False
99 2018-04-02 01:00:00    False

Instead I want to have the following:

                  Date  Holiday
90 2018-04-01 22:45:00    False
91 2018-04-01 23:00:00    False
92 2018-04-01 23:15:00    False
93 2018-04-01 23:30:00    False
94 2018-04-01 23:45:00    False
95 2018-04-02 00:00:00     True
96 2018-04-02 00:15:00     True
97 2018-04-02 00:30:00     True
98 2018-04-02 00:45:00     True
99 2018-04-02 01:00:00     True
christianbauer1
  • 467
  • 5
  • 16

2 Answers2

2

In you code, only the Dates where the HH:MM:SS is 00:00:00 will be set to True. You need to consider only the %Y-%m-%d part of your datetime objects in order to achieve the desired result.

    df['Date'].apply(lambda x: x.strftime("%Y-%m-%d")).isin(holidays)

Hope this helps!

Muntasir Wahed
  • 297
  • 2
  • 11
1

this is due to the fact that the date match only when hour is 00:00:00. you should create a new array that takes only the date withouth hour and minutes.

a very dirty soultion could be something like this:

df["Date_day"]=pd.to_datetime(df["Date"].apply(lambda x: x.strftime("%Y-%m-%d")))
df['Holiday'] = df['Date_day'].isin(holidays)

df[['Holiday',"Date_day"]]

gabriele
  • 11
  • 1