I have a time series with hourly frequency and one label per day. I would like to fix the class imbalance by oversampling while preserving the hourly sequence within each one-day period. Ideally I could use ADASYN or another method better than random oversampling. Here is what the data looks like:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(seed=1111)
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(45), freq='H')
data = np.random.random(size=len(days))
data2 = np.random.random(size=len(days))
df = pd.DataFrame({'DateTime': days, 'col1': data, 'col_2' : data2})
df['Date'] = df['DateTime'].dt.floor('D')  # midnight of each row's day
class_labels = []
for i in df['Date'].unique():
    # draw one class label per day
    class_labels.append([i, np.random.choice((1,2,3,4,5,6,7,8,9,10), size=1,
                         p=(.175,.035,.016,.025,.2,.253,.064,.044,.072,.116))[0]])
class_labels = pd.DataFrame(class_labels)
df['class_label'] = [class_labels[class_labels.loc[:,0] == df.loc[i,'Date']].loc[:,1].values[0] for i in range(len(df))]
df = df.set_index('DateTime')
df.drop('Date',axis=1,inplace=True)
print(df['class_label'].value_counts())
df.head(15)
Out[209]:
5 264
1 240
6 145
9 120
7 120
10 72
8 72
4 24
2 24
Out[213]:
                                col1     col_2  class_label
DateTime
2019-02-01 18:28:29.214935  0.095549  0.307041            6
2019-02-01 19:28:29.214935  0.925004  0.981620            6
2019-02-01 20:28:29.214935  0.343573  0.610662            6
2019-02-01 21:28:29.214935  0.310477  0.482961            6
2019-02-01 22:28:29.214935  0.002010  0.242208            6
2019-02-01 23:28:29.214935  0.235595  0.355516            6
2019-02-02 00:28:29.214935  0.237792  0.028726            5
2019-02-02 01:28:29.214935  0.735916  0.221198            5
2019-02-02 02:28:29.214935  0.495468  0.712723            5
2019-02-02 03:28:29.214935  0.784425  0.818065            5
2019-02-02 04:28:29.214935  0.126506  0.414326            5
2019-02-02 05:28:29.214935  0.606649  0.264835            5
2019-02-02 06:28:29.214935  0.466121  0.244843            5
2019-02-02 07:28:29.214935  0.237132  0.298100            5
2019-02-02 08:28:29.214935  0.435159  0.621991            5
ADASYN or SMOTE would be ideal, but even random oversampling would be acceptable as long as it fixes the class imbalance.
The desired result keeps the hourly increments of the original, has one label per day, and has balanced classes:
print(df['class_label'].value_counts())
Out[211]:
5 264
1 264
6 264
9 264
7 264
10 264
8 264
4 264
2 264
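To make the target concrete, here is a minimal sketch of the random-oversampling fallback applied to whole days rather than to individual rows. It is only an illustration under stated assumptions: the function name `oversample_days` and the `block_id` column are hypothetical, the index is assumed to be a `DatetimeIndex`, and every row of a calendar day is assumed to share one label. It balances the number of days per class, so row counts come out equal only when every day has the same number of hours (the partial first and last days in the data above would break that).

```python
import numpy as np
import pandas as pd


def oversample_days(df, label_col="class_label", seed=0):
    """Duplicate whole days at random until every class has as many days
    as the most frequent class. Rows inside a day keep their order."""
    rng = np.random.default_rng(seed)
    day_key = df.index.floor("D")                        # calendar day of each row
    day_labels = df[label_col].groupby(day_key).first()  # one label per day
    target = day_labels.value_counts().max()             # days in the largest class

    pieces = []
    block_id = 0
    for label, days in day_labels.groupby(day_labels):
        n_extra = target - len(days)
        # positions of the days to repeat, drawn with replacement
        picks = rng.integers(0, len(days), size=n_extra)
        chosen = days.index.append(days.index[picks])
        for day in chosen:
            block = df.loc[day_key == day].copy()
            # hypothetical helper column: repeated days share timestamps,
            # so block_id is what tells the copies apart
            block["block_id"] = block_id
            block_id += 1
            pieces.append(block)
    return pd.concat(pieces)


# Small demo with four complete days: three of class 1, one of class 2.
idx = pd.date_range("2019-02-02", periods=4 * 24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame(
    {
        "col1": np.arange(len(idx), dtype=float),
        "class_label": np.repeat([1, 1, 1, 2], 24),
    },
    index=idx,
)
balanced = oversample_days(demo)
print(balanced["class_label"].value_counts())
```

Because repeated days reuse the original timestamps, the result has duplicate index values; grouping by `block_id` recovers each 24-hour sequence. What I am really after is the ADASYN/SMOTE equivalent of this, where the synthetic unit is an entire day instead of a single hourly row.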