I have an imbalanced dataset and I want to downsample it.

This is the dataset:

import pandas as pd

testframe = pd.DataFrame()
testframe['id_unique'] = [0,0,0,1,1,1,2,2,2,3,3,3]
testframe['t'] = [1,2,3,1,2,3,1,2,3,1,2,3]
testframe['value'] = [10,11,12,21,22,23,31,32,33,41,42,43]
testframe['class'] = [1,1,1,2,2,2,1,1,1,1,1,1]

where id_unique identifies each time series, t is the order of the values within a series, value is the measured value, and class is the class that the time series belongs to.

It is an imbalanced dataset and I want to downsample it to the following:

final_frame = pd.DataFrame()
final_frame['id_unique'] = [0,0,0,1,1,1]
final_frame['t'] = [1,2,3,1,2,3]
final_frame['value'] = [10,11,12,21,22,23]
final_frame['class'] = [1,1,1,2,2,2]

I have tried:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy = 'all')

X_rus, y_rus = rus.fit_resample(testframe.drop(['class'], axis=1), testframe['class'])

but this just selects individual rows of the majority class from the original dataframe and drops them. As a result, I got snippets of each time series in each class, and the information contained in the complete time series was lost.

I am looking for a way to downsample the dataset so that whole time series (id_unique) are dropped from the majority class and I end up with an equal number of full time series for each class. The selection should be random.

I have tried some groupby lines, but they all resulted in errors.

Thanks for any hints on this!

nopact
  • By resampling, do you mean just pick out the same number of id's in the majority class as in the minority class? – Quang Hoang Apr 24 '20 at 14:30
  • Exactly. I want a balanced data set and upsampling is not an option for now. Downsampling with imblearn is quite easy, when you are not dealing with time series data. – nopact Apr 24 '20 at 15:14

1 Answer

Here's a possible solution:

import numpy as np

# unique series ids per class
majority_ids = testframe.loc[testframe['class']==1, 'id_unique'].unique()
minority_ids = testframe.loc[testframe['class']==2, 'id_unique'].unique()

# keep as many majority ids as there are minority ids, then join the two arrays
# (np.concatenate is needed because + on NumPy arrays adds element-wise)
all_ids = np.concatenate([majority_ids[:len(minority_ids)], minority_ids])

final_df = testframe[testframe.id_unique.isin(all_ids)]

Output:

   id_unique  t  value  class
0          0  1     10      1
1          0  2     11      1
2          0  3     12      1
3          1  1     21      2
4          1  2     22      2
5          1  3     23      2
Quang Hoang
  • Thanks, that worked, except that I needed randomly picked ids. I achieved this with np.random.choice(population, k, replace=False). This was what I was looking for! final_df = testframe[testframe.id_unique.isin(all_ids)] – nopact Apr 28 '20 at 09:07
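
The random selection discussed in the comment above could be sketched roughly as follows. This is only one possible approach, not necessarily what either poster used: the helper name downsample_series and its parameters are made up for illustration, it assumes every id_unique belongs to exactly one class, and it relies on DataFrameGroupBy.sample, which needs pandas 1.1 or newer.

```python
import pandas as pd


def downsample_series(df, id_col="id_unique", class_col="class", seed=None):
    """Keep the same number of whole time series per class, chosen at random.

    Hypothetical helper; assumes each series id maps to a single class.
    """
    # One row per time series: its id and its class label
    labels = df[[id_col, class_col]].drop_duplicates()
    # Number of series in the rarest class
    n_keep = labels[class_col].value_counts().min()
    # Draw n_keep series ids from every class, without replacement
    kept_ids = labels.groupby(class_col).sample(n=n_keep, random_state=seed)[id_col]
    # Keep every row of the selected series, so no series is cut into pieces
    return df[df[id_col].isin(kept_ids)]


testframe = pd.DataFrame({
    "id_unique": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "t": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "value": [10, 11, 12, 21, 22, 23, 31, 32, 33, 41, 42, 43],
    "class": [1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1],
})

balanced = downsample_series(testframe, seed=0)
print(balanced)
```

Because the sampling is done on ids rather than on rows, the result always contains series 1 (the only class-2 series) plus one randomly chosen class-1 series, each with all three of its rows. Passing a seed makes the choice reproducible; leaving it as None gives a different draw on each call.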