
I haven't found a question that covers this; in one similar question the positive class turned out not to be specified. I asked a related question myself, but that one was about a Zero-R model on the same data set; I seem to be hitting the same issue with One-R, and this question may state it more clearly. My question is why my results differ from what I expected and whether my One Rule model is working correctly. There's a warning message that I'm not sure I need to address, but the main problem is two conflicting confusion matrices that don't agree: my manual sensitivity and specificity calculations don't match the values from the confusionMatrix() function in the caret package. It looks like something was inverted, but I'll keep checking. Any advice is greatly appreciated!

For context, the One Rule model tests each attribute (column) of the cancer data set: for example, did texture yield the most accurate predictions of benign (B) versus malignant (M) in the confusion matrix, or was it smoothness, area, or one of the other factors stored as raw values in each column?

There's this warning, and my assumption is that I could've added more parameters, but I didn't fully understand them:

oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose,  :
#>   data contains unused factor levels
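
If it matters, my guess at one way to address the warning (untested, and I'm not sure it's the right fix) is to drop unused factor levels before fitting, since the subsample may have lost some levels:

# drop factor levels with no remaining observations after subsampling;
# droplevels() is base R and works on a whole data frame
cancersamp <- droplevels(cancersamp)
oneRModel  <- OneR(as.factor(Diagnosis) ~ ., data = cancersamp)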

Here are the two separate confusion matrices, which may have inverted labels and which give different specificity and sensitivity results; one I tabulated manually with table() and the other comes from the confusionMatrix() function in the caret package:

table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#>     B  M
#>  B 28  1
#>  M  5 12
 
 #OneR(formula, data, subset, na.action,
 #     control = Weka_control(), options = NULL)
 
 
confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B") 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  B  M
#>          B 28  5
#>          M  1 12
#>                                          
#>                Accuracy : 0.8696          
#>                  95% CI : (0.7374, 0.9506)
#>     No Information Rate : 0.6304          
#>     P-Value [Acc > NIR] : 0.0003023       
#>                                          
#>                   Kappa : 0.7058          
#>                                          
#>  Mcnemar's Test P-Value : 0.2206714       
#>                                           
#>             Sensitivity : 0.9655          
#>             Specificity : 0.7059          
#>          Pos Pred Value : 0.8485          
#>          Neg Pred Value : 0.9231          
#>              Prevalence : 0.6304          
#>          Detection Rate : 0.6087          
#>    Detection Prevalence : 0.7174          
#>       Balanced Accuracy : 0.8357          
#>                                          
#>        'Positive' Class : B               
#>                                          
 
sensitivity1 = 28/(28+5)
specificity1 = 12/(12+1)
specificity1
#> [1] 0.9230769
sensitivity1
#> [1] 0.8484848

Here's the pseudo-code; my assumption was that this is what the OneR function already does and that I'm not supposed to implement it manually (see the sketch after the pseudo-code):

For each attribute, 
  For each value of the attribute, make a rule as follows:
      count how often each class appears 
      find the most frequent class 
      make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
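
Just to check my understanding, here's a rough sketch of that idea in R (my own toy version, not how the OneR() function is actually implemented; it assumes the numeric columns have already been binned into factors):

# toy 1R: for every predictor, build a one-attribute rule and keep the one
# with the fewest training errors; assumes all predictors are factors and
# `target` names a factor column
one_r <- function(data, target) {
  y <- data[[target]]
  predictors <- setdiff(names(data), target)
  best <- NULL
  for (p in predictors) {
    # for each value of the attribute, predict its most frequent class
    tab  <- table(data[[p]], y)
    rule <- colnames(tab)[apply(tab, 1, which.max)]
    names(rule) <- rownames(tab)
    # error rate of this attribute's rules on the training data
    errors <- sum(y != rule[as.character(data[[p]])])
    if (is.null(best) || errors < best$errors) {
      best <- list(attribute = p, rule = rule, errors = errors)
    }
  }
  best
}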

Here's the rest of my code for the One R Model:

 #--------------------------------------------------
 #     One R Model
 #--------------------------------------------------
 
 # packages assumed from the functions used below
 library(OneR)     # OneR()
 library(caTools)  # sample.split()
 library(caret)    # confusionMatrix()
 
 set.seed(23)
 randsamp <- sample(nrow(cancerdata), 150, replace=FALSE)
 #randsamp
 
 cancersamp <- cancerdata[randsamp,]
 #cancersamp
 
 #?sample.split
 
 spl = sample.split(cancersamp$Diagnosis, SplitRatio = 0.7)
 #spl

dataTrain = subset(cancersamp, spl==TRUE)
dataTest = subset(cancersamp, spl==FALSE)
 
 oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose,  :
#>   data contains unused factor levels
summary(oneRModel)

#> Call:
#> OneR.formula(formula = as.factor(Diagnosis) ~ ., data = cancersamp)

#> Rules:
#> If perimeter = (53.2,75.7] then as.factor(Diagnosis) = B
#> If perimeter = (75.7,98.2] then as.factor(Diagnosis) = B
#> If perimeter = (98.2,121]  then as.factor(Diagnosis) = M
#> If perimeter = (121,143]   then as.factor(Diagnosis) = M
#> If perimeter = (143,166]   then as.factor(Diagnosis) = M

#> Accuracy:
#> 134 of 150 instances classified correctly (89.33%)

#> Contingency table:
#>                     perimeter
#> as.factor(Diagnosis) (53.2,75.7] (75.7,98.2] (98.2,121] (121,143] (143,166] Sum
#>                 B          * 31        * 63          1         0         0  95
#>                 M             1          14       * 19      * 18       * 3  55
#>                 Sum          32          77         20        18         3 150
#> ---
#> Maximum in each column: '*'

#> Pearson's Chi-squared test:
#> X-squared = 92.412, df = 4, p-value < 2.2e-16

dataTest.pred <- predict(oneRModel, newdata = dataTest)
table(dataTest$Diagnosis, dataTest.pred)
#>   dataTest.pred
#>      B  M
#>   B 28  1
#>   M  5 12

Here's a small snippet of the data set. As you can see, perimeter is the one-rule factor that was selected, but I was expecting the results to correlate with the study's finding that texture, area, and smoothness gave the best results. I don't know all of the variables surrounding that in the study, though, and these are randomized samples, so I can always keep testing.

head(cancerdata)
  PatientID radius texture perimeter   area smoothness compactness concavity concavePoints symmetry  fractalDimension Diagnosis
1    842302  17.99   10.38    122.80 1001.0    0.11840     0.27760    0.3001       0.14710   0.2419          0.07871         M
2    842517  20.57   17.77    132.90 1326.0    0.08474     0.07864    0.0869       0.07017   0.1812          0.05667         M
3  84300903  19.69   21.25    130.00 1203.0    0.10960     0.15990    0.1974       0.12790   0.2069          0.05999         M
4  84348301  11.42   20.38     77.58  386.1    0.14250     0.28390    0.2414       0.10520   0.2597          0.09744         M
5  84358402  20.29   14.34    135.10 1297.0    0.10030     0.13280    0.1980       0.10430   0.1809          0.05883         M
6    843786  12.45   15.70     82.57  477.1    0.12780     0.17000    0.1578       0.08089   0.2087          0.07613         M
  • What `OneR` function is being used? What package? What documentation? – IRTFM Sep 20 '21 at 00:35
  • Hi @IRTFM! I actually specified the package in the first three sentences, go ahead and comment back when you've read it and let me know what package you see written. In that same sentence you'll find the function being used, which is then specified in the first two code snippets including the warning message, let me know when you find it. Can you clarify your question, what documentation would you like? – cocoakrispies93 Sep 20 '21 at 00:48
  • I do not see any package mentioned and I see no `library` or `require` calls. Voting to close as needing further details. – IRTFM Sep 20 '21 at 00:54
  • @IRTFM In the third sentence the package clarified is caret and in that same sentence it says the confusionMatrix() function is the problem. It also specifies the OneR function that's just called OneR(). But the question was already answered because I ran into the same thing twice with two separate models and deduced what the common denominator was. – cocoakrispies93 Sep 20 '21 at 01:14

2 Answers


As per https://topepo.github.io/caret/measuring-performance.html

Sensitivity is the true positive rate (correctly predicted positives / total actual positives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 28/(28 + 1) = 0.9655

Specificity is the true negative rate (correctly predicted negatives / total actual negatives); in this case, when you tell confusionMatrix() that the "positive" class is "B": 12/(12 + 5) = 0.7059

It looks like the inconsistency arises because the two tables are transposed relative to each other: table(dataTest$Diagnosis, dataTest.pred) puts the actual diagnosis in the rows and the predictions in the columns, whereas confusionMatrix() prints predictions in the rows and the reference in the columns. Your manual calculations are also dividing by the total predicted positives/negatives (the rows of the confusionMatrix() output) rather than by the total actual positives/negatives.
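
For example, a quick way to check this with the objects from the question (an untested sketch using the same dataTest and dataTest.pred):

# table(truth, prediction): actual classes in rows, predictions in columns
tab <- table(dataTest$Diagnosis, dataTest.pred)

# with "B" as the positive class, sensitivity and specificity are row-wise proportions
sens_B <- tab["B", "B"] / sum(tab["B", ])   # 28 / (28 + 1) = 0.9655
spec_B <- tab["M", "M"] / sum(tab["M", ])   # 12 / (12 + 5) = 0.7059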

– huttoncp

This website gave some information, but for the OneR model it was hard to figure out which matrix to use: both had similar specificity and sensitivity calculations and both had similar confusion-matrix tables.

However, my Zero-R question had the same confusion-matrix problem, and it ended up clearing up which matrix is correct. The Zero-R confusionMatrix() output looked wrong because it reported a sensitivity of 1.00 and a specificity of 0.00, while my own calculations across multiple trials gave a sensitivity along the lines of 0.6246334 with 0.00 for specificity. But this website actually clears it up: because the Zero-R model uses zero attributes and always predicts the majority class, sensitivity for that class really is 1.00 and specificity really is 0.00.
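
To illustrate with a made-up toy example (not my actual data): if every prediction is the majority class "B", every actual "B" is caught and every actual "M" is missed, which is exactly a sensitivity of 1.00 and a specificity of 0.00 when "B" is the positive class:

# toy labels only, to show why Zero-R gives sensitivity 1 and specificity 0
truth <- factor(c("B", "B", "B", "M", "M"), levels = c("B", "M"))
pred  <- factor(rep("B", 5), levels = c("B", "M"))  # Zero-R: always predict the majority class
confusionMatrix(pred, truth, positive = "B")
#> Sensitivity : 1.00   (every actual B is predicted as B)
#> Specificity : 0.00   (no actual M is predicted as M)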

Cross-applying which table is correct from the Zero-R model to the One-R model, the correct one comes from the same confusionMatrix() call, done the same way:

> confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B") 
Confusion Matrix and Statistics

          Reference
Prediction  B  M
         B 28  5
         M  1 12

And these are the correct calculations, consistent with the 1.00 sensitivity and 0.00 specificity on the Zero-R model:

Sensitivity : 0.9655          
Specificity : 0.7059

This one came out differently in both of my questions, for Zero-R and One-R, presumably because the arguments to table() aren't passed in the right order:

> dataTest.pred <- predict(oneRModel, newdata = dataTest)
> table(dataTest$Diagnosis, dataTest.pred)
   dataTest.pred
     B  M
  B 28  1
  M  5 12
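
If the argument order is the issue, swapping the arguments to table() should reproduce the layout that confusionMatrix() prints (predictions in rows, reference in columns); a quick, untested check:

# predictions in rows, actual diagnosis in columns, matching confusionMatrix()
table(dataTest.pred, dataTest$Diagnosis)
#>      B  M
#>   B 28  5
#>   M  1 12
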
– cocoakrispies93
  • The first link appears to be broken. You should fix it. You should also explain what context this is being used in. Is this classwork? Are you affiliated with the developers? – IRTFM Sep 20 '21 at 00:40
  • Hi @IRTFM, thanks for letting me know about that broken link, it's actually the same website as the other link that also says "this website" just on a different page, sorry for the confusion. The context is a cancer data set based on a study, the point of this is to determine two very basic classification models, the ZeroR and OneR classification model. This is for educational purposes, I'm in class and couldn't figure out why one confusion matrix said one thing and another said another thing haha – cocoakrispies93 Sep 20 '21 at 00:57