I need to calculate the mutual information between various features in order to design a classification model using logistic regression. I am facing the following problems:

  1. I need to divide my data into n bins, each with an approximately equal number of samples. How can I achieve this in Matlab?

  2. Should I perform the above discretization on raw data or normalized data?

Thanks.

Neel Shah

1 Answer

I guess what you want to do is similar to cross-validation. In Matlab you can use the function crossvalind, which lets you split your dataset.

Below is the example shown on its documentation page, which splits the data into 10 bins (this is called 10-fold cross-validation).

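% Load Fisher's iris data: meas (measurements) and species (class labels)
load fisheriris
% Assign each sample at random to one of 10 folds
indices = crossvalind('Kfold',species,10);
% Object that accumulates classification performance across folds
cp = classperf(species);
for i = 1:10
    % Fold i is the test set; the remaining folds form the training set
    test = (indices == i); train = ~test;
    class = classify(meas(test,:),meas(train,:),species(train,:));
    classperf(cp,class,test);
end
cp.ErrorRate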

ans =

    0.0200

You should do this operation after pre-processing your data (normalisation / standardisation).

R.Falque
  • Do I need to sort the data in a particular feature (column) before applying cross-validation? I want n equal-density bins with approximately equal numbers of samples in each bin. I normalized the data, sorted it, and divided the vector into 10 bins, but the values for a feature are not equally divided among the bins. – Neel Shah Feb 17 '16 at 05:57
  • I would randomize the order of the samples rather than sort them. And yes, the bins will be of similar size. – R.Falque Feb 17 '16 at 06:01
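
For the equal-frequency bins described in the question (as opposed to cross-validation folds), a minimal sketch using only base Matlab functions might look like the following. It is an illustration, not part of the answer above: the names x, nBins and binIdx are hypothetical, and x is random stand-in data for a single feature column. It also checks that z-scoring the feature leaves the rank-based bin assignment unchanged, which bears on the raw-versus-normalized question.

```matlab
% Equal-frequency (quantile) binning of one feature, base Matlab only.
% x, nBins and binIdx are illustrative names, not from the original post.
rng(0);                           % fix the seed so the run is repeatable
x = randn(1000, 1);               % stand-in for one feature column
nBins = 10;

% Rank each sample: order(k) is the index of the k-th smallest value.
[~, order] = sort(x);
ranks = zeros(size(x));
ranks(order) = (1:numel(x))';

% Map ranks 1..N onto bins 1..nBins, about N/nBins samples per bin.
binIdx = ceil(ranks * nBins / numel(x));
counts = accumarray(binIdx, 1)';  % number of samples in each bin

% Z-scoring preserves the ordering, so rank-based bins should not change.
z = (x - mean(x)) / std(x);
[~, orderZ] = sort(z);
ranksZ = zeros(size(z));
ranksZ(orderZ) = (1:numel(z))';
binIdxZ = ceil(ranksZ * nBins / numel(z));
sameBins = isequal(binIdx, binIdxZ);
```

Here binIdx(k) is the bin of sample k in its original order, so the data itself never has to be reordered. One caveat: if a feature has many tied values, this rank-based split assigns identical values to different bins, which may be one reason values appear unevenly divided.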