
I have an hourly time series with one label per day. I would like to fix the class imbalance by oversampling while preserving the sequence within each one-day period. Ideally I would be able to use ADASYN or another method better than random oversampling. Here is what the data looks like:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(seed=1111)

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(45), freq='H')

data = np.random.random(size=len(days))
data2 = np.random.random(size=len(days))
df = pd.DataFrame({'DateTime': days, 'col1': data, 'col_2' : data2})
df['Date'] = df['DateTime'].dt.floor('D')  # calendar day of each hourly row

class_labels = []
for i in df['Date'].unique():
    class_labels.append([i,np.random.choice((1,2,3,4,5,6,7,8,9,10),size=1,
                                           p=(.175,.035,.016,.025,.2,.253,.064,.044,.072,.116))[0]])
class_labels = pd.DataFrame(class_labels)

df['class_label'] = df['Date'].map(class_labels.set_index(0)[1])  # look up each row's daily label
df = df.set_index('DateTime')
df.drop('Date',axis=1,inplace=True)

print(df['class_label'].value_counts())
df.head(15)

Out[209]: 
5     264
1     240
6     145
9     120
7     120
10     72
8      72
4      24
2      24

Out[213]: 
                                col1     col_2  class_label
DateTime                                                   
2019-02-01 18:28:29.214935  0.095549  0.307041            6
2019-02-01 19:28:29.214935  0.925004  0.981620            6
2019-02-01 20:28:29.214935  0.343573  0.610662            6
2019-02-01 21:28:29.214935  0.310477  0.482961            6
2019-02-01 22:28:29.214935  0.002010  0.242208            6
2019-02-01 23:28:29.214935  0.235595  0.355516            6
2019-02-02 00:28:29.214935  0.237792  0.028726            5
2019-02-02 01:28:29.214935  0.735916  0.221198            5
2019-02-02 02:28:29.214935  0.495468  0.712723            5
2019-02-02 03:28:29.214935  0.784425  0.818065            5
2019-02-02 04:28:29.214935  0.126506  0.414326            5
2019-02-02 05:28:29.214935  0.606649  0.264835            5
2019-02-02 06:28:29.214935  0.466121  0.244843            5
2019-02-02 07:28:29.214935  0.237132  0.298100            5
2019-02-02 08:28:29.214935  0.435159  0.621991            5

I would like to use ADASYN or SMOTE, but even random oversampling to fix the class imbalance would be good.

The desired result is in hourly increments like the original, has one label per day, and has balanced classes:

print(df['class_label'].value_counts())

Out[211]: 
5     264
1     264
6     264
9     264
7     264
10    264
8     264
4     264
2     264
JHall651

2 Answers


Use groupby in a loop and sample each class subset with replacement:

newdf=pd.concat([y.sample(264,replace=True) for _, y in df.groupby('class_label')])
newdf.class_label.value_counts()
9     264
7     264
5     264
1     264
10    264
8     264
6     264
4     264
2     264
Name: class_label, dtype: int64
BENY
    This fixes the class imbalance, but the result is no longer a sequential time series. Each one-day period would need to be taken as a sample to preserve the sequence, at least from midnight to midnight. – JHall651 Feb 02 '19 at 01:48
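One way to act on this comment is to resample whole calendar days instead of individual rows. Below is a minimal sketch, assuming the frame has a DatetimeIndex and a `class_label` column as in the question. `oversample_days` and the `sample_id` column are hypothetical names for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def oversample_days(df, label_col="class_label", seed=0):
    """Randomly duplicate whole days until every class has as many days
    as the most frequent class. Rows inside a day keep their order."""
    rng = np.random.default_rng(seed)
    day = df.index.floor("D")

    # One block of rows per calendar day, and one label per day.
    groups = {d: g for d, g in df.groupby(day)}
    day_labels = pd.Series({d: g[label_col].iloc[0] for d, g in groups.items()})
    target = day_labels.value_counts().max()

    blocks = []
    for label, members in day_labels.groupby(day_labels):
        # Draw the missing number of days for this class, with replacement.
        n_extra = target - len(members)
        picks = rng.integers(0, len(members), size=n_extra)
        chosen = list(members.index) + [members.index[i] for i in picks]
        for d in chosen:
            # sample_id tells duplicated days apart (they share timestamps).
            blocks.append(groups[d].assign(sample_id=len(blocks)))
    return pd.concat(blocks)
```

This balances the number of days per class, so the row counts only match exactly when every day has the same number of rows. In the question's data the first and last days are partial, so the counts would differ slightly unless those days are dropped first.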

You really can't "oversample" time series data, at least not in the same sense that you can oversample unordered data. It wouldn't be possible to have 264 examples of every class: that would mean inserting new data into the time series between existing points, which would throw all of the time-sensitive patterns out of whack.

The best option (as far as oversampling goes) is to synthetically generate one or more new time series based on your original data. One option: for each point, pick a random class, then interpolate between the closest data points of that class from the original time series. Another option: randomly sample 24 points from each class (which will always include all of classes 2 and 4) and interpolate the rest of the time series, repeating a few times until you have a set of balanced time series.
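As a rough illustration of the synthetic route, here is a sketch of a simpler variant than the two options above: each complete day is treated as one sample, and a new day is a random blend of two existing days of the same class. This is in the spirit of SMOTE but without any nearest-neighbour search. The function name `synthesize_days` and the assumed input shape `(n_days, 24, n_features)` are mine, not from any library.

```python
import numpy as np


def synthesize_days(days, labels, seed=0):
    """Balance classes by adding synthetic days.

    days:   array of shape (n_days, 24, n_features), one slice per day
    labels: array of shape (n_days,), one class label per day
    Each synthetic day is a + w * (b - a) for two random same-class
    days a and b and a random weight w in [0, 1).
    """
    rng = np.random.default_rng(seed)
    days = np.asarray(days, dtype=float)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()

    new_days, new_labels = [days], [labels]
    for cls, count in zip(classes, counts):
        n_new = target - count
        if n_new == 0:
            continue
        members = days[labels == cls]
        a = members[rng.integers(0, count, size=n_new)]
        b = members[rng.integers(0, count, size=n_new)]
        w = rng.random(size=(n_new, 1, 1))  # one blend weight per new day
        new_days.append(a + w * (b - a))
        new_labels.append(np.full(n_new, cls))
    return np.concatenate(new_days), np.concatenate(new_labels)
```

Note that a class with only one day (classes 2 and 4 in the question) can only produce exact copies this way, since both blended days are the same day. Blending hour by hour also assumes the days are aligned, for example midnight to midnight.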

A much better option is to address class imbalance some other way, say by changing your loss/error function.
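To make the loss-function idea concrete, here is a small sketch that weights each class inversely to its frequency, using the heuristic `n_samples / (n_classes * count)` (the same formula scikit-learn uses for `class_weight="balanced"`), and then applies those weights in a cross-entropy loss. The helper names `balanced_class_weights` and `weighted_log_loss` are hypothetical, and the loss assumes the probability columns are ordered by sorted class label.

```python
import numpy as np


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def weighted_log_loss(probs, labels, weights):
    """Cross-entropy where each sample counts in proportion to its class weight.

    probs:  array (n_samples, n_classes), columns ordered by sorted class label
    labels: array (n_samples,) of true class labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    classes = np.array(sorted(weights))
    cols = np.searchsorted(classes, labels)       # column of the true class
    w = np.array([weights[c] for c in labels.tolist()])
    p_true = probs[np.arange(len(labels)), cols]
    return float(np.sum(-w * np.log(p_true)) / np.sum(w))
```

With this weighting, a mistake on a rare class costs more than a mistake on a common one, and the time series itself is left untouched. Most training libraries accept per-class or per-sample weights directly, so in practice the weights dictionary is usually all that is needed.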

French