I haven't found a question that matches mine: in this question the positive class turned out not to be specified, and this question is similar but was asked by me about a different problem (a Zero-R model); I seem to be running into the same issue with One-R, so hopefully this version is clearer. My question is why my results differ from what I expected and whether my One Rule model is working correctly. There is a warning message that I'm not sure I need to address, but more importantly there are two conflicting confusion matrices: my manual calculations of sensitivity and specificity don't match the sensitivity and specificity reported by the confusionMatrix() function in the caret package. It looks like something got inverted, but I'll keep checking. Any advice is greatly appreciated!
For context, the One Rule model tests each attribute (column) of the cancer data set: for example, did texture yield the most accurate predictions of benign (B) versus malignant (M) in the confusion matrix, or was it smoothness, area, or some other factor? Each of these attributes is a column of raw values in the data set.
There's this warning; my assumption is that I could have passed more parameters, but I didn't fully understand them:
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, :
#>   data contains unused factor levels
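My guess is that the warning is about factor levels present in the full cancerdata but not in my 150-row sample; if that's the case, I assume dropping the unused levels before fitting would make it go away, something like:

cancersamp$Diagnosis <- factor(cancersamp$Diagnosis)  # convert once, up front
cancersamp <- droplevels(cancersamp)                  # drop factor levels with no observations
oneRModel <- OneR(Diagnosis ~ ., data = cancersamp)   # same call, without as.factor() in the formula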
Here are the two confusion matrices that may have inverted labels and that give different specificity and sensitivity results; one I built manually with table() and the other comes from the confusionMatrix() function in the caret package:
table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#> B M
#> B 28 1
#> M 5 12
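For reference, table() puts its first argument on the rows, so in the table above the rows are the actual Diagnosis values and the columns are the predictions, while confusionMatrix() prints Prediction on the rows and Reference on the columns. Naming the dimensions makes the orientation explicit:

# same counts as above, but with the dimensions labelled
table(Actual = dataTest$Diagnosis, Predicted = dataTest.pred)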
confusionMatrix(dataTest.pred, as.factor(dataTest$Diagnosis), positive="B")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction B M
#> B 28 5
#> M 1 12
#>
#> Accuracy : 0.8696
#> 95% CI : (0.7374, 0.9506)
#> No Information Rate : 0.6304
#> P-Value [Acc > NIR] : 0.0003023
#>
#> Kappa : 0.7058
#>
#> Mcnemar's Test P-Value : 0.2206714
#>
#> Sensitivity : 0.9655
#> Specificity : 0.7059
#> Pos Pred Value : 0.8485
#> Neg Pred Value : 0.9231
#> Prevalence : 0.6304
#> Detection Rate : 0.6087
#> Detection Prevalence : 0.7174
#> Balanced Accuracy : 0.8357
#>
#> 'Positive' Class : B
#>
sensitivity1 = 28/(28+5)
specificity1 = 12/(12+1)
specificity1
#> [1] 0.9230769
sensitivity1
#> [1] 0.8484848
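To try to narrow down whether this is just the table orientation rather than the model, the same counts can be read off by name instead of by position, using the Prediction-by-Reference layout that confusionMatrix() prints (positive class assumed to be B, as above):

# build the table the way confusionMatrix() prints it: predictions on rows, actuals on columns
tab <- table(Prediction = dataTest.pred, Reference = dataTest$Diagnosis)
tab["B", "B"] / sum(tab[, "B"])   # sensitivity for positive class B: true B / all actual B
tab["M", "M"] / sum(tab[, "M"])   # specificity: true M / all actual M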
Here's the pseudo-code; my assumption is that this is what the OneR() function already does internally and that I'm not supposed to implement it manually (a rough manual version follows the pseudo-code, just for comparison):
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
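In case it's relevant, here is a rough manual sketch of that pseudo-code that I am only using for comparison; the equal-width binning into 5 bins is my own arbitrary choice, since I assume OneR() does its own discretization internally:

# rough sketch of the pseudo-code above; not meant to reproduce OneR() exactly
one_rule <- function(data, target) {
  y <- as.factor(data[[target]])
  attrs <- setdiff(names(data), target)
  errors <- sapply(attrs, function(a) {
    x <- cut(data[[a]], breaks = 5)              # bin the numeric attribute
    counts <- table(x, y)                        # how often each class appears per bin
    rule <- colnames(counts)[max.col(counts)]    # most frequent class in each bin
    names(rule) <- rownames(counts)
    pred <- rule[as.character(x)]                # assign that class to each attribute-value
    mean(pred != as.character(y))                # error rate of this attribute's rules
  })
  sort(errors)                                   # smallest error rate = chosen attribute
}
# e.g. one_rule(cancersamp[, -1], "Diagnosis")   # dropping column 1 (PatientID)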
Here's the rest of my code for the One R Model:
#--------------------------------------------------
# One R Model
#--------------------------------------------------
set.seed(23)
randsamp <- sample(nrow(cancerdata), 150, replace=FALSE)
#randsamp
cancersamp <- cancerdata[randsamp,]
#cancersamp
#?sample.split
spl = sample.split(cancersamp$Diagnosis, SplitRatio = 0.7)
#spl
dataTrain = subset(cancersamp, spl==TRUE)
dataTest = subset(cancersamp, spl==FALSE)
oneRModel <- OneR(as.factor(Diagnosis)~., cancersamp)
#> Warning message:
#> In OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, :
#> data contains unused factor levels
summary(oneRModel)
#> Call:
#> OneR.formula(formula = as.factor(Diagnosis) ~ ., data = cancersamp)
#> Rules:
#> If perimeter = (53.2,75.7] then as.factor(Diagnosis) = B
#> If perimeter = (75.7,98.2] then as.factor(Diagnosis) = B
#> If perimeter = (98.2,121] then as.factor(Diagnosis) = M
#> If perimeter = (121,143] then as.factor(Diagnosis) = M
#> If perimeter = (143,166] then as.factor(Diagnosis) = M
#> Accuracy:
#> 134 of 150 instances classified correctly (89.33%)
#> Contingency table:
#>                      perimeter
#> as.factor(Diagnosis) (53.2,75.7] (75.7,98.2] (98.2,121] (121,143] (143,166] Sum
#>   B                         * 31        * 63          1         0         0  95
#>   M                            1          14       * 19      * 18       * 3  55
#>   Sum                         32          77         20        18         3 150
#> ---
#> Maximum in each column: '*'
#> Pearson's Chi-squared test:
#> X-squared = 92.412, df = 4, p-value < 2.2e-16
dataTest.pred <- predict(oneRModel, newdata = dataTest)
table(dataTest$Diagnosis, dataTest.pred)
#> dataTest.pred
#> B M
#> B 28 1
#> M 5 12
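As a separate sanity check on the test-set numbers (apart from the sensitivity/specificity question), overall accuracy can be computed directly; I think the OneR package also has an eval_model() helper for this, but I'm only showing the base-R version:

# overall test-set accuracy from the predictions above
mean(dataTest.pred == dataTest$Diagnosis)
#> expected: (28 + 12) / 46 = 0.8696, the same Accuracy that confusionMatrix() reports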
Here's a small snippet of the data set. As you can see above, perimeter is the attribute the one rule was built on, but I was expecting the results to line up with the study's finding that texture, area, and smoothness gave the best results. I don't know all of the variables surrounding that in the study, though, and these are randomized samples, so I can always keep testing.
head(cancerdata)
PatientID radius texture perimeter area smoothness compactness concavity concavePoints symmetry fractalDimension Diagnosis
1 842302 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 M
2 842517 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 M
3 84300903 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 M
4 84348301 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 M
5 84358402 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 M
6 843786 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 0.2087 0.07613 M
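If it helps, my understanding from the warning text is that OneR() accepts a verbose argument; my assumption is that setting verbose = TRUE would print the accuracy of each candidate attribute, so I could see how texture, area, and smoothness ranked against perimeter on this sample:

# assumption: verbose = TRUE reports each attribute's accuracy, not just the winning rule
oneRModelVerbose <- OneR(as.factor(Diagnosis) ~ ., data = cancersamp, verbose = TRUE)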