How to create Naive Bayes in R for numerical and categorical variables

Question

I am trying to implement a Naive Bayes model in R based on known information:

Age group, e.g. "18-24" and "25-34", etc.
Gender, "male" and "female"
Region, "London" and "Wales", etc.
Income, "£10,000 - £15,000", etc.
Job, "Full Time" and "Part Time", etc.

I get errors when I run my implementation. My code is below:

library(readxl)
iphone <- read_excel("~/Documents/iPhone_1k.xlsx")
View(iphone)

summary(iphone)
iphone

library(caTools)
library(e1071)

set.seed(101) 
sample = sample.split(iphone$Gender, SplitRatio = .7)
train = subset(iphone, sample == TRUE)
test  = subset(iphone, sample == FALSE)

nB_model <- naiveBayes(Gender ~ Region + Retailer, data = train)
pred <- predict(nB_model, test, type="raw")

In the above scenario, I have an Excel file called iPhone_1k (1,000 entries relating to people who have visited a website to buy an iPhone). Each row is a person visiting the website, and the above demographics are known for each of them.

I have been trying to make the model work by following the tutorial linked below, which uses only two variables (I would like to use at least four, and more if possible):

https://rpubs.com/dvorakt/144238

I want to be able to use these demographics to predict which retailer they will go to (also known for each instance in the iPhone_1k file). There are only 3 options. Can you please advise how to complete this?

P.S. Below is a screenshot of a simplified version of the data I have used to keep it simple in R. Once I get some code to work, I'll expand the number of variables and entries.

[Screenshot: simplified version of the data]

Cihan · Accepted Answer · 2017-11-27T16:09:08.507

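You are setting up the problem incorrectly. The variable you want to predict, Retailer, goes on the left-hand side of the formula, with the predictors on the right: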

naiveBayes(Retailer ~  Gender + Region + AgeGroup, data = train)

or, in short, using every other column as a predictor:

naiveBayes(Retailer ~ ., data = train)

You might also need to convert the columns to factors if they are character vectors. You can do this for all columns, right after reading from Excel, with:

iphone[] <- lapply(iphone, factor)

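Note that if you add numeric variables in the future, you should not apply this step to them, because converting a numeric column to a factor makes the model treat every distinct value as a separate category.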
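As a minimal sketch of that caveat, the conversion can be restricted to character columns so that numeric ones are left alone. The toy data frame and the Spend column below are hypothetical stand-ins, not the asker's actual data:

```r
# Hypothetical toy data: two character columns and one numeric column
iphone <- data.frame(
  Gender = c("male", "female", "female"),
  Region = c("London", "Wales", "London"),
  Spend  = c(650, 720, 980),
  stringsAsFactors = FALSE
)

# Flag the character columns and convert only those to factors;
# numeric columns such as Spend are left untouched
is_chr <- vapply(iphone, is.character, logical(1))
iphone[is_chr] <- lapply(iphone[is_chr], factor)

str(iphone)
```

Selecting columns by type rather than by name means the same two lines should keep working when more variables are added later.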

edited Nov 27 '17 at 16:09

answered Nov 27 '17 at 16:04

Thanks a lot for your help - it now works! Am I right in saying that a confusion matrix is not applicable in this setting, since you are calculating the probability that a consumer with given attributes will visit a retailer, rather than predicting which retailer they visited and comparing that with what actually happened? – Christopher Loynes Nov 27 '17 at 20:18
You get probabilities because you specify type="raw" in the call to predict(). If you call predict(nB_model, test, type="class") instead, you get the actual prediction of which retailer will be visited (the retailer with the highest probability). You can then build a confusion matrix from these predictions. For instance, confusionMatrix(pred, test$Retailer) shows the confusion matrix of your predictions on the test data (this function requires the caret library). – Cihan Nov 27 '17 at 20:47
Thanks again for your help - It's greatly appreciated! – Christopher Loynes Nov 27 '17 at 22:08
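
The workflow discussed in the comments above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is randomly generated toy data (so accuracy will be near chance), the retailer names "A", "B", "C" and the numeric Spend column are hypothetical, the split is a plain random 70/30 split in base R rather than caTools::sample.split, and base R's table() stands in for caret's confusionMatrix().

```r
library(e1071)

set.seed(101)
n <- 200

# Hypothetical toy data standing in for the iPhone_1k file
iphone <- data.frame(
  Gender   = factor(sample(c("male", "female"), n, replace = TRUE)),
  Region   = factor(sample(c("London", "Wales", "Scotland"), n, replace = TRUE)),
  AgeGroup = factor(sample(c("18-24", "25-34", "35-44"), n, replace = TRUE)),
  Spend    = round(runif(n, 300, 1200)),  # numeric predictor, kept numeric
  Retailer = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Plain random 70/30 train/test split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- iphone[train_idx, ]
test  <- iphone[-train_idx, ]

# Retailer is the response; every other column is a predictor
nB_model <- naiveBayes(Retailer ~ ., data = train)

# type = "class" returns the most probable retailer for each row,
# type = "raw" returns the per-retailer probabilities
pred_class <- predict(nB_model, test, type = "class")
pred_raw   <- predict(nB_model, test, type = "raw")

# Confusion matrix: predicted retailer against actual retailer
conf_mat <- table(Predicted = pred_class, Actual = test$Retailer)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

With e1071's naiveBayes, factor predictors are modelled with per-class frequency tables and numeric predictors with a per-class Gaussian, which is why both kinds of column can appear in the same formula.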
