I have an imbalanced dataset and I want to downsample it.
This is the dataset:
import pandas as pd
testframe = pd.DataFrame()
testframe['id_unique'] = [0,0,0,1,1,1,2,2,2,3,3,3]
testframe['t'] = [1,2,3,1,2,3,1,2,3,1,2,3]
testframe['value'] = [10,11,12,21,22,23,31,32,33,41,42,43]
testframe['class'] = [1,1,1,2,2,2,1,1,1,1,1,1]
where id_unique identifies each time series, t is the position of a value within its series, value is the measured value, and class is the class that the time series belongs to.
Class 1 has three time series and class 2 has only one, so I want to downsample it to something like the following (one complete class 1 series kept, chosen at random):
final_frame = pd.DataFrame()
final_frame['id_unique'] = [0,0,0,1,1,1]
final_frame['t'] = [1,2,3,1,2,3]
final_frame['value'] = [10,11,12,21,22,23]
final_frame['class'] = [1,1,1,2,2,2]
I have tried:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='all')
X_rus, y_rus = rus.fit_resample(testframe.drop(['class'], axis=1), testframe['class'])
but this just dropped randomly selected rows of the majority class from the original dataframe. As a result, I got only snippets of the time series, and the information contained in the complete time series was lost.
I am looking for a way to downsample the dataset so that whole time series (id_unique) are dropped from the majority class and I end up with an equal number of complete time series for each class. The selection should be random.
I have tried a few groupby approaches, but they all resulted in errors.
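To make the goal concrete, here is a rough sketch of the behaviour I am after. It assumes that every id_unique belongs to exactly one class and that the installed pandas provides GroupBy.sample (version 1.1 or later). The helper name downsample_series and its parameters are hypothetical, not part of pandas or imblearn, and I am not sure this is the idiomatic way to do it.

```python
import pandas as pd

testframe = pd.DataFrame({
    'id_unique': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    't':         [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'value':     [10, 11, 12, 21, 22, 23, 31, 32, 33, 41, 42, 43],
    'class':     [1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1],
})

def downsample_series(df, id_col='id_unique', class_col='class', random_state=None):
    """Hypothetical helper: keep the same number of whole series per class."""
    # One row per time series, together with its class
    # (assumes a series never changes class).
    series = df[[id_col, class_col]].drop_duplicates()
    # Number of series in the smallest class.
    n_min = int(series[class_col].value_counts().min())
    # Randomly pick n_min series ids from every class.
    kept_ids = series.groupby(class_col)[id_col].sample(n=n_min, random_state=random_state)
    # Keep all rows of the selected series, in their original order.
    return df[df[id_col].isin(kept_ids)].reset_index(drop=True)

balanced = downsample_series(testframe, random_state=0)
print(balanced)
```

With the example data this should keep series 1 (the only class 2 series) and one of the class 1 series (0, 2, or 3, chosen at random), each with all three of its rows.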
Thanks for any hints on this!