0

I have a half-hourly dataframe with two columns. I would like to take all the hours of a day, then do some calculation which returns one number and assign that to all half-hours of that day. Below is an example code:

dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
print(df)
                        DATA1      DATA2
2003-01-01 08:30:00       NaN  79.990866
2003-01-01 09:00:00       NaN   5.461791
2003-01-01 09:30:00       NaN  68.892447
2003-01-01 10:00:00       NaN  44.823338
2003-01-01 10:30:00       NaN  57.860309
...                       ...        ...
2003-01-04 22:00:00  0.394574  31.943657
2003-01-04 22:30:00  0.140950  78.275981

Then I would like to apply the following function which returns one numbre:

def my_f(data1,data2):
    y = data1[data2>20]
    return np.median(y) 

This function selects all data in DATA1 based on a condition (DATA2>20) then takes the median of all these data. How can I create a third column (let's say result) and assign back this fixed number (y) for all half-hours data of that day?

My guess is I should use something like this:

daily_tmp = df.resample('D').apply(my_f)
df['results'] = b.reindex(df.index,method='ffill')

If this approach is correct, how can I pass my_f with two arguments to resample.apply()? Or is there any other way to do the similar task?

Ress
  • 667
  • 1
  • 7
  • 24

2 Answers2

1

My solution assumes that you have a fairly small dataset. Please let me know if it is not the case.

I would decompose your goal as follows: (1) group data by day (2) for each day, compute some complicated function (3) assign the resulted value in to half-hours.

# specify the day for each datapoint
df['day'] = df.index.map(lambda x: x.strftime('%Y-%m-%d'))
# compute a complicated function for each day and store the result
mapping = {}
for day, data_for_the_day in df.groupby(by='day'):
    # assign to mapping[day] the result of a complicated function
    mapping[day] = np.mean(data_for_the_day[data_for_the_day['Data2'] > 20]['Data1'])

# assign the values to half-hours
df['result'] = df.index.map(lambda x: mapping.get(x.strftime('%Y-%m-%d'), np.nan) if x.strftime('%M')=='30' else np.nan)

That's not the neatest solution, but it is straight-forward, easy-to-understand, and works well on small datasets.

Kate Melnykova
  • 1,863
  • 1
  • 5
  • 17
  • Thanks. This perfectly answered my question. My actual dataset is about 250,000 points and this code took few seconds. – Ress Jan 29 '21 at 04:08
1

Here is a fast way to do it.

First, import libraries :

import time
import pandas as pd
import numpy as np
import datetime as dt

Second, the code to achieve it:

%%time
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])

#### Create an unique marker per hour

df['Date'] = df.index
df['Date'] = df['Date'].dt.strftime(date_format='%Y-%m-%d %H')

#### Then Stipulate some conditions

_condition_1 = df.Date == df.Date.shift(-1) # if full hour
_condition_2 = df.DATA2 > 20 # yours
_condition_3 = df.Date == df.Date.shift(1) # if half an hour

#### Now, report median where condition 1 and 2 are fullfilled

df['result'] = np.where(_condition_1 & _condition_2,(df.DATA1+df.DATA1.shift(-1)/2),0)

#### Fill the hours with median

df['result'] = np.where(_condition_3,df.result.shift(1),df.result)

#### Drop useless column
df = df.drop(['Date'],axis=1)

df[df.DATA2>20].tail(20)

Third: the output

output

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Lumber Jack
  • 602
  • 3
  • 9
  • Thanks, I am not sure if the code does what I intended to do. The outputs should be similar to the @kate-melnykova code. – Ress Jan 29 '21 at 04:10
  • My bad, I've omitted some brackets in the calculation of the mean (because for two values, median is mean). this code was mostly to show step by step how to get the result without going through a loop. – Lumber Jack Jan 29 '21 at 09:04
  • Gotcha! That makes sense. Thanks – Ress Jan 29 '21 at 16:03