I have a data set of around 10^6 entries, but the problem is that the classes are imbalanced.
I am building a linear classifier using AdaBoost in GraphLab, but because of the imbalanced data my accuracy is very poor. How can I cope with imbalanced data?
Here is the simple code I use to balance the data by undersampling the majority class:
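# Split the data by class label (+1 = safe, -1 = risky).
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]
# Undersample the safe loans so both classes end up roughly the same size.
percentage = len(risky_loans_raw) / float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed=1)
risky_loans = risky_loans_raw
# Combine all risky loans with the sampled safe loans.
loans_data = risky_loans.append(safe_loans)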
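To make the intent of the balancing step concrete, here is a minimal plain-Python sketch of the same undersampling idea on made-up data. The function name `undersample` and the toy records are hypothetical and are not part of GraphLab. One difference to note: `sample` above keeps roughly the requested fraction of rows, while this sketch keeps an exact count.

```python
import random


def undersample(rows, label_key, majority, minority, seed=1):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    major = [r for r in rows if r[label_key] == majority]
    minor = [r for r in rows if r[label_key] == minority]
    # Keep exactly as many majority rows as there are minority rows.
    kept = rng.sample(major, min(len(minor), len(major)))
    balanced = minor + kept
    # Shuffle so the classes are not grouped together in the result.
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 90 safe loans (+1) and 10 risky loans (-1).
toy = [{"id": i, "label": 1} for i in range(90)]
toy += [{"id": 90 + i, "label": -1} for i in range(10)]

balanced = undersample(toy, "label", majority=1, minority=-1)
counts = {lab: sum(1 for r in balanced if r["label"] == lab) for lab in (1, -1)}
print(counts)  # {1: 10, -1: 10}
```

Only the majority class is reduced; every minority row is kept, which is the same choice the GraphLab code makes for the risky loans.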
But the accuracy is still not improving. Can anyone suggest an efficient approach for this?