I have a dataset consisting of 21 unique data records. To benchmark the performance of certain algorithms, such as kNN and SVM, as the number of samples per class increases, I would like to test on data with at least 20 unique records for each class (the Predict Conc. values are the classes).

I don't want to generate random data. I would like to use the 21 unique data records I have as the base dataset and generate the remaining data so that it is similar to the existing data.

How can I do this using Python?

Here's the sample data

Index  OD600AV  Cell Count  Predict Conc            
1     0.059625  800000        1
2     0.063125  442000        1
3     0.067375  544000        1 
4     0.060125  728000        2
5     0.062500  616000        2
6     0.063000  688000        2
7     0.061125  532000        3
8     0.059875  470000        3
9     0.059250  556000        3
10    0.060250  466000        4
11    0.056000  222000        4
12    0.056000  390000        4
13    0.055125  112000        5
14    0.049625  105000        5
15    0.050875  120000        5
16    0.047875  56000         6
17    0.058000  44000         6
18    0.048500  140000        6
19    0.052500  62000         7
20    0.061125  52000         7
21    0.047125  64000         7  

This question is quite similar to "Generate data by using existing dataset as the base dataset", which seems to have been answered using R, but I could not get that answer to work.

Thanks

Rob Alamgir

1 Answer

I think these 3 approaches might be helpful (in order of increasing complexity):

  1. Pick from your own sample with replacement (bootstrapping). In this case the new rows are exact copies of the original rows; only the sample size increases. You could use something like this:
import numpy as np

def sampleDF(df, k):
    # Draw k rows at random, with replacement, and return them with a fresh index
    return df.iloc[np.random.randint(0, len(df), size=k)].reset_index(drop=True)

new_df = sampleDF(df, k=300)
  2. Pick with replacement, then add some random noise to your independent variables. You can choose how much noise to introduce for your testing purposes.
new_df = sampleDF(df, k=300)

# Add noise from a normal distribution and a uniform distribution.
# scale=0.1 would exceed the OD600AV values themselves (about 0.05 to 0.07),
# so a smaller scale is used here; tune it to your data.
new_df['OD600AV'] = new_df['OD600AV'] + np.random.normal(loc=0, scale=0.001, size=new_df.shape[0])
new_df['Cell Count'] = new_df['Cell Count'] + np.random.uniform(low=0, high=10000, size=new_df.shape[0])

  3. Use a more sophisticated approach to create a new synthetic dataset, such as SMOTE. With a package such as imbalanced-learn, you can also choose to resample only some classes.
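Note that sampling from the whole DataFrame does not guarantee a fixed number of rows per class, which is what the question asks for. Below is a minimal sketch of the approaches above applied class by class, assuming the sample data from the question. The helper names `bootstrap_with_noise` and `interpolate_within_class`, the `noise_frac` parameter, and the choice to scale noise by each class's own standard deviation are illustrative assumptions, not part of any library. The second helper is a simplified SMOTE-style interpolation written by hand in NumPy; it is not the imbalanced-learn implementation.

```python
import numpy as np
import pandas as pd

# The 21 records from the question (3 per class).
df = pd.DataFrame({
    "OD600AV": [0.059625, 0.063125, 0.067375, 0.060125, 0.062500, 0.063000,
                0.061125, 0.059875, 0.059250, 0.060250, 0.056000, 0.056000,
                0.055125, 0.049625, 0.050875, 0.047875, 0.058000, 0.048500,
                0.052500, 0.061125, 0.047125],
    "Cell Count": [800000, 442000, 544000, 728000, 616000, 688000,
                   532000, 470000, 556000, 466000, 222000, 390000,
                   112000, 105000, 120000, 56000, 44000, 140000,
                   62000, 52000, 64000],
    "Predict Conc": np.repeat(np.arange(1, 8), 3),
})

FEATURES = ["OD600AV", "Cell Count"]
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def bootstrap_with_noise(df, label, n_per_class, noise_frac=0.1):
    """Hypothetical helper: per class, draw n_per_class rows with replacement
    and jitter each feature with Gaussian noise whose standard deviation is
    noise_frac times that class's own standard deviation for the feature."""
    parts = []
    for cls, group in df.groupby(label):
        values = group[FEATURES].to_numpy(dtype=float)
        idx = rng.integers(0, len(values), size=n_per_class)
        spread = values.std(axis=0, ddof=1)  # per-feature std within the class
        noise = rng.normal(size=(n_per_class, len(FEATURES))) * noise_frac * spread
        block = pd.DataFrame(values[idx] + noise, columns=FEATURES)
        block[label] = cls
        parts.append(block)
    return pd.concat(parts, ignore_index=True)


def interpolate_within_class(df, label, n_per_class):
    """Hypothetical helper: keep the original rows and add synthetic rows that
    lie on the line segment between two different rows of the same class
    (the core idea of SMOTE). Assumes every class has at least 2 rows and
    that n_per_class is not smaller than any class's current size."""
    parts = []
    for cls, group in df.groupby(label):
        values = group[FEATURES].to_numpy(dtype=float)
        n_new = n_per_class - len(values)
        a = rng.integers(0, len(values), size=n_new)
        # Offset of 1..len-1 guarantees the partner row differs from row a.
        b = (a + rng.integers(1, len(values), size=n_new)) % len(values)
        lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
        synthetic = values[a] + lam * (values[b] - values[a])
        block = pd.DataFrame(np.vstack([values, synthetic]), columns=FEATURES)
        block[label] = cls
        parts.append(block)
    return pd.concat(parts, ignore_index=True)


noisy_df = bootstrap_with_noise(df, "Predict Conc", n_per_class=20)
smote_like_df = interpolate_within_class(df, "Predict Conc", n_per_class=20)
```

Both results contain 20 rows for each of the 7 classes. With only 3 original records per class, the synthetic rows carry very little new information, so benchmark results on them should be read with caution. The two features also differ in scale by several orders of magnitude, so it is probably worth standardizing them before running kNN or SVM.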

Hope this helps!

Dudelstein