1

I have a data set of around 10^6 entries, but the data is imbalanced.

I am building a linear classifier using AdaBoost, but because of the imbalanced data my accuracy is very poor. How can I cope with imbalanced data? I am using GraphLab.

Here is the simple code I use to balance the data:

safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Undersample the safe loans so both classes end up roughly the same size.
percentage = len(risky_loans_raw) / float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed=1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

But the accuracy is still not improving. Can anyone suggest an effective approach for this?

user6250837

3 Answers

1

Handling imbalanced data is one of the most challenging problems in data mining and machine learning, so you will not find a simple, straight answer to your question right away.

In my experience, using penalized (or weighted) evaluation metrics is one of the best approaches (short answer). However (there is always a but!), you can consult the following resources to find an effective approach. Your problem is more of a scientific issue than an issue with the tool.

This should handle the situation but make sure that you know the background before using it.

Free

Not Free but more valuable
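As a rough illustration of the weighted-metric idea mentioned above, here is a minimal sketch of balanced accuracy (the mean of per-class recall) in plain Python. The function name and the toy labels are made up for this example and are not part of GraphLab; the point is only that plain accuracy can look good on imbalanced data while a class-weighted metric exposes the problem.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / float(total[c]) for c in total]
    return sum(recalls) / len(recalls)

# Toy data (hypothetical): 9 safe loans (+1) and 1 risky loan (-1).
# The "model" simply predicts safe for every loan.
y_true = [1] * 9 + [-1]
y_pred = [1] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))
balanced = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)  # high, because the majority class dominates
print(balanced)        # much lower, because the risky class is never detected
```

A model that ignores the minority class scores well on plain accuracy but poorly here, which is why the choice of metric matters before any resampling is judged.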

Mohsen Kamrani
  • Could you please answer this question: https://datascience.stackexchange.com/questions/32812/categorization-of-approaches-to-deal-with-imbalanced-classes Thanks. @mok – ebrahimi Jun 12 '18 at 11:25
1

How did you come to the conclusion that the poor accuracy is caused by the imbalance in the data? Based on the code you have provided, loans_data should already be balanced (approximately 50% risky loans and 50% safe loans). Please check the number of risky loans and safe loans after creating loans_data to confirm.

The poor accuracy could be because of the features that you have selected for training your model or the data itself.
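The check suggested above could look something like the following sketch. It is plain Python rather than GraphLab code, and it assumes the target column can be turned into an ordinary list of labels; the `labels` list here is a made-up stand-in for the values in loans_data[target].

```python
from collections import Counter

def class_balance(labels):
    """Return {label: (count, fraction of all rows)} for a sequence of labels."""
    counts = Counter(labels)
    n = float(len(labels))
    return {label: (count, count / n) for label, count in counts.items()}

# Hypothetical stand-in for the target column of loans_data.
labels = [1, 1, 1, -1, -1, 1, -1, -1]

balance = class_balance(labels)
for label, (count, fraction) in sorted(balance.items()):
    print(label, count, fraction)
```

If both fractions come out close to 0.5, the undersampling worked and the cause of the poor accuracy lies elsewhere.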

Praveen
0

You can also use the parameter class_weights="auto" in boosted trees, which takes care of imbalanced data to a certain extent. For more information, have a look at this: default parameters
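To give a feel for what automatic class weighting usually means, here is a small plain-Python sketch of inverse-frequency weights. This is a common heuristic and only an assumption about the general idea; whether GraphLab computes its "auto" weights with exactly this formula is not confirmed here, and the function name is hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    This is a common heuristic, not necessarily GraphLab's exact formula.
    """
    counts = Counter(labels)
    n_samples = float(len(labels))
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical imbalanced target: 8 safe loans (+1), 2 risky loans (-1).
labels = [1] * 8 + [-1] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

With weights like these, each misclassified risky loan costs the learner more than a misclassified safe loan, so the full data set can be used without discarding rows through undersampling.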

Dreams