Let's say I have a dataset with 9 continuous columns and 4 categorical columns. In MATLAB, I split the columns into these two groups, train and test a naive Bayes classifier on each group separately, and find that the continuous columns give an error rate of 0.45 and the categorical columns give an error rate of 0.33. My question is: how do I determine the combined error?

EDIT - Simple pseudocode overview added:

% Accumulators must start at zero before the loop.
NB1_cumulLoss = 0;
NB2_cumulLoss = 0;
for x = 1:num_iterations
  % One model on the continuous columns, one on the categorical columns.
  Mdl_NB1 = fitcnb(TrainingSet_Con,TrainingTargets,'Distribution','normal');
  Mdl_NB2 = fitcnb(TrainingSet_Dis,TrainingTargets,'Distribution','mn');
  [NB1_label,NB1_Posterior,NB1_Cost] = predict(Mdl_NB1,TestPoint_Con);
  [NB2_label,NB2_Posterior,NB2_Cost] = predict(Mdl_NB2,TestPoint_Dis);
  % resubLoss is the resubstitution (training-set) loss of each model.
  NB1_cumulLoss = NB1_cumulLoss + resubLoss(Mdl_NB1);
  NB2_cumulLoss = NB2_cumulLoss + resubLoss(Mdl_NB2);
end
NB1_avg_score = NB1_cumulLoss/num_iterations
NB2_avg_score = NB2_cumulLoss/num_iterations
total_avg_score = ???

The three obvious choices, in principle, are (with A and B being the two average error rates):

  • (A+B) / 2
  • A * B
  • (A*(CountA/TotalCount)) + (B*(CountB/TotalCount))

But I'm not sure whether any of these is right in this case.
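For concreteness, here is what the three candidates evaluate to with A = 0.45 and B = 0.33. This assumes that CountA and CountB are the numbers of columns in each group (9 and 4, so TotalCount = 13); that reading is an assumption, and the arithmetic only shows how far apart the candidates are, not which one (if any) is valid.

```latex
\begin{align*}
\frac{A + B}{2} &= \frac{0.45 + 0.33}{2} = 0.39 \\
A \cdot B &= 0.45 \times 0.33 = 0.1485 \\
A \cdot \frac{9}{13} + B \cdot \frac{4}{13} &= \frac{4.05 + 1.32}{13} \approx 0.413
\end{align*}
```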

swabygw

1 Answer

A combined error does not make sense here, because you are effectively building two separate models. Either build one model with all columns (maybe with 'Distribution','mvmn'), or combine both models into one with something like

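% Concatenate the posteriors column-wise (one row per training observation);
% they must be computed on the training set so the rows match TrainingTargets.
Mdl_Ens = fitcnb([NB1_Posterior, NB2_Posterior],TrainingTargets,'Distribution','normal');
NEns_cumulLoss = NEns_cumulLoss + resubLoss(Mdl_Ens);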

This builds a single model from the outputs of the two models, each of which was trained on a subset of the columns.
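As an alternative to stacking, the two posteriors can be multiplied directly. Under the naive Bayes independence assumption, P(y | all columns) is proportional to P(y | continuous) * P(y | categorical) / P(y). Below is a minimal sketch of that idea, measuring the error on held-out rows. The synthetic data and the names X_con, X_dis, tr, te and combined_err are hypothetical, 'mvmn' is used for the categorical columns as suggested above, and the element-wise operations rely on implicit expansion (R2016b or later).

```matlab
% Hypothetical synthetic data: 2 classes, 2 continuous and 2 categorical columns.
rng(0);                                  % reproducible random numbers
n = 400;
Y = randi(2, n, 1);                      % class labels, 1 or 2
X_con = randn(n, 2) + Y;                 % continuous columns, mean shifted by class
X_dis = randi(3, n, 2) + (Y - 1);        % categorical levels 1..4, class dependent

% Hold out the last 100 rows for testing.
tr = 1:300;
te = 301:n;

% One naive Bayes model per column group, as in the question.
Mdl_NB1 = fitcnb(X_con(tr,:), Y(tr), 'DistributionNames', 'normal');
Mdl_NB2 = fitcnb(X_dis(tr,:), Y(tr), 'DistributionNames', 'mvmn');

% Posteriors on the held-out rows; columns follow Mdl.ClassNames.
[~, P1] = predict(Mdl_NB1, X_con(te,:));
[~, P2] = predict(Mdl_NB2, X_dis(te,:));

% Combine: multiply the posteriors and divide out the prior that both share.
joint = P1 .* P2 ./ Mdl_NB1.Prior;
joint = joint ./ sum(joint, 2);          % renormalise each row to sum to 1

% Predicted class is the one with the largest combined posterior.
[~, idx] = max(joint, [], 2);
label = Mdl_NB1.ClassNames(idx);

% Misclassification rate of the combined model on the held-out rows.
combined_err = mean(label ~= Y(te))
```

Both models are trained on the same targets, so they share the same class order and prior, which is what makes the division by Mdl_NB1.Prior valid. The resulting combined_err is a test-set error for the single combined classifier, rather than a formula applied to the two separate error rates.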

damienfrancois