Downsampling problems with complex dataset

Question

I have an imbalanced dataset and I want to downsample it.

This is the dataset:

import pandas as pd

testframe = pd.DataFrame()
testframe['id_unique'] = [0,0,0,1,1,1,2,2,2,3,3,3]
testframe['t'] = [1,2,3,1,2,3,1,2,3,1,2,3]
testframe['value'] = [10,11,12,21,22,23,31,32,33,41,42,43]
testframe['class'] = [1,1,1,2,2,2,1,1,1,1,1,1]

where id_unique identifies a time series, t is the position of each value within its series, value is the measured value, and class is the class the time series belongs to.

It is an imbalanced dataset and I want to downsample it to the following:

final_frame = pd.DataFrame()
final_frame['id_unique'] = [0,0,0,1,1,1]
final_frame['t'] = [1,2,3,1,2,3]
final_frame['value'] = [10,11,12,21,22,23]
final_frame['class'] = [1,1,1,2,2,2]

I have tried:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy = 'all')

X_rus, y_rus = rus.fit_resample(testframe.drop(['class'], axis=1), testframe['class'])

but this only dropped individual rows of the majority class from the original dataframe. As a result, I got fragments of each time series in each class, and the information contained in the complete time series was lost.

I am looking for a way to downsample the dataset so that whole time series (id_unique) are dropped from the majority class and I end up with an equal number of complete time series for each class. The selection should be random.

I have tried some groupby lines, but they all resulted in errors.

Thanks for any hints on this!

By resampling, do you mean just pick out the same number of id's in the majority class as in the minority class? — Quang Hoang, Apr 24 '20 at 14:30
Exactly. I want a balanced data set and upsampling is not an option for now. Downsampling with imblearn is quite easy, when you are not dealing with time series data. — nopact, Apr 24 '20 at 15:14

score 0 · Accepted Answer · answered Apr 24 '20 at 15:22

Here's a possible solution:

import numpy as np

# unique series ids per class
majority_ids = testframe.loc[testframe['class']==1, 'id_unique'].unique()
minority_ids = testframe.loc[testframe['class']==2, 'id_unique'].unique()

# keep as many majority ids as there are minority ids;
# np.concatenate joins the two arrays (+ would add them element-wise)
all_ids = np.concatenate([majority_ids[:len(minority_ids)], minority_ids])

final_df = testframe[testframe.id_unique.isin(all_ids)]

Output:

   id_unique  t  value  class
0          0  1     10      1
1          0  2     11      1
2          0  3     12      1
3          1  1     21      2
4          1  2     22      2
5          1  3     23      2
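
The snippet above always takes the first ids of the majority class, while the question asks for a random pick. A minimal sketch of a random variant follows, assuming each id_unique belongs to exactly one class. The helper name downsample_series and its seed parameter are hypothetical, introduced here only for illustration. It samples whole series ids per class, so no time series is split.

```python
import numpy as np
import pandas as pd

testframe = pd.DataFrame({
    'id_unique': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    't':         [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'value':     [10, 11, 12, 21, 22, 23, 31, 32, 33, 41, 42, 43],
    'class':     [1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1],
})

def downsample_series(df, id_col='id_unique', class_col='class', seed=None):
    """Hypothetical helper: keep an equal number of whole series per class."""
    rng = np.random.default_rng(seed)
    # One entry per series: id -> class (assumes one class per id).
    series_class = df.drop_duplicates(id_col).set_index(id_col)[class_col]
    # Number of series in the smallest class.
    n_keep = series_class.value_counts().min()
    keep_ids = []
    for cls in series_class.unique():
        ids = series_class.index[series_class == cls].to_numpy()
        # Draw n_keep distinct series ids from this class.
        keep_ids.extend(rng.choice(ids, size=n_keep, replace=False))
    # Keep every row of the selected series.
    return df[df[id_col].isin(keep_ids)]

balanced = downsample_series(testframe, seed=0)
print(balanced)
```

With the example data, the result contains series 1 (the only class 2 series) plus one randomly chosen class 1 series, each with all three rows. Passing a seed makes the draw reproducible.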

Thanks, that worked, except that I needed randomly picked ids. I achieved this with random.sample(population, k). This was what I was looking for! final_df = testframe[testframe.id_unique.isin(all_ids)] — nopact, Apr 28 '20 at 09:07
