
I'm trying to build a decision tree model in R, and when I run the rpart() function, RStudio freezes. I've provided below a link to the dataset I use, along with the code to process it up to the decision tree model building. Any help is appreciated.

https://github.com/ArcanePersona/files/blob/main/vgsales.csv

#Libraries used:
library(tidyverse)
library(Hmisc)
library(mctest)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(missForest)
library(VIM)
library(caret)
library(fmsb)




#Phase One: Data Preprocessing:
#Loading in the "vgsales.csv" data:
game_sales <- read.csv("vgsales.csv", header = T, stringsAsFactors = F)

#turning the structure of the data to tibble for ease of use:
game_sales <- as_tibble(game_sales)

#Replacing the "N/A" character values in Year_of_Release with real NA values:
game_sales %>% filter(game_sales$Year_of_Release == "N/A")
game_sales <- game_sales %>% mutate( Year_of_Release = gsub("N/A","", Year_of_Release))

#Changing the data type of column Year_of_release from "chr" to "int":
game_sales$Year_of_Release <- as.integer(game_sales$Year_of_Release)
str(game_sales$Year_of_Release)
#Imputing Year_of_Release variable and inserting the imputed values:
imputeyear <- with(game_sales,Hmisc::impute(game_sales$Year_of_Release, 'mean'))
game_sales <- game_sales %>% mutate (Year_of_Release = imputeyear)

#Filtering the data to Year_of_Release between 1991 and 2010:
game_sales <- game_sales %>% filter(Year_of_Release >= 1991) %>% filter(Year_of_Release <=2010)


#Creating a subset of non-NA values in the Rating variable
#because there is too much missing data (about 50%) to impute.
#This subset is for machine learning purposes only:
ml_subset_x <- subset(game_sales, !is.na(game_sales$Critic_Score) | !is.na(game_sales$Critic_Count))
ml_subset_y <- ml_subset_x %>% filter( Rating == "E"| Rating == "M" | Rating =="T" |
                Rating == "E10+"| Rating == "AO" | Rating =="K-A" | Rating =="RP")



#Phase Four: Machine Learning:
#Decision Tree:
ml_subset_y$Publisher <- as.factor(ml_subset_y$Publisher)
ml_subset_y$Platform <- as.factor(ml_subset_y$Platform)
ml_subset_y$Genre <- as.factor(ml_subset_y$Genre)
ml_subset_y$Rating <- as.factor(ml_subset_y$Rating)

#Splitting data into train (70%) and test (30%):
set.seed(1234)
index <- sample(nrow(ml_subset_y), 0.7 * nrow(ml_subset_y)) 
ml_subset_ytrain <- ml_subset_y[index,] 
ml_subset_ytest <- ml_subset_y[-index,]

#Modelling the train data using decision tree algorithm:
treemodel <- rpart(Rating~., data=ml_subset_ytrain)
plot(treemodel, margin=0.25)
text(treemodel, use.n=T)
fancyRpartPlot(treemodel)

#Testing the model using the test data and using confusion matrix 
#to check Accuracy:
prediction <- predict(treemodel, newdata=ml_subset_ytest, type='class')
accuracy_test <- table(prediction, ml_subset_ytest$Rating)
confusionMatrix(accuracy_test)
1 Answer


You have multiple problems in your code. I'll try to be as clear as possible.

First, you convert the column "Year_of_Release" to integer, which is good. However, your variable imputeyear is a character vector, and mutate(Year_of_Release = imputeyear) converts the column back when you assign it.
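A quick way to see what happened, using the objects from your own script (just a diagnostic sketch):

class(imputeyear)                #what Hmisc::impute() actually returned
typeof(imputeyear)               #the storage mode that mutate() wrote into the column
str(game_sales$Year_of_Release)  #confirms the column is no longer a plain integer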

I rewrote the first part of the code below; there you can see that you need to be careful with some variables (the 'chr' ones).
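If you want to list the columns that are still character after loading, a one-line sketch:

game_sales %>% select(where(is.character)) %>% names()  #columns that will need attention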

In the end, in your set of independent variables, the variable Name should be removed: it makes no sense to use it and the function can't deal with it. I also think that the 252 levels of the Publisher variable are a bit much for the rpart algorithm. Remove both and the function works fine. You could also try filtering your data down to maybe 20-30 different publishers before converting to factor, and then see if the function works with a smaller number of levels, as in the sketch below.
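Something like this keeps only the most frequent publishers (the cutoff of 25 and the name ml_subset_small are just for illustration, tune them to your data):

top_publishers <- ml_subset_y %>%
  count(Publisher, sort = TRUE) %>%   #frequency of each publisher, descending
  slice_head(n = 25) %>%              #keep the 25 most frequent ones
  pull(Publisher)

ml_subset_small <- ml_subset_y %>%
  filter(Publisher %in% top_publishers) %>%
  mutate(Publisher = as.factor(Publisher))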

Hope it helps ;)

#Libraries used:
library(tidyverse)
library(Hmisc)
library(mctest)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(missForest)
library(VIM)
library(caret)
library(fmsb)

#Phase One: Data Preprocessing:
#Loading in the "vgsales.csv" data:
#I first converted the "N/A" strings to NA to mirror what you did, but
#you could also use recode() or mutate() with an ifelse()
#to do it in one step.
game_sales <- read.csv("vgsales.csv", header = T, stringsAsFactors = F) %>%
              as_tibble() %>%
              mutate(Year_of_Release = as.integer(na_if(Year_of_Release, "N/A"))) %>%
              mutate(Year_of_Release = replace_na(Year_of_Release,
                                                  round(mean(.$Year_of_Release[!is.na(.$Year_of_Release)]), 0))) %>%
              filter(between(Year_of_Release, 1991, 2010))

str(game_sales) #--> Be careful with Name, Platform, Genre, Publisher and Rating.

#Creating a subset of non-NA values in the Rating variable
#because there is too much missing data (about 50%) to impute.
#This subset is for machine learning purposes only:
ml_subset_x <- subset(game_sales, !is.na(game_sales$Critic_Score) | !is.na(game_sales$Critic_Count))
ml_subset_y <- ml_subset_x %>% filter( Rating == "E"| Rating == "M" | Rating =="T" |
                                          Rating == "E10+"| Rating == "AO" | Rating =="K-A" | Rating =="RP")
#Phase Four: Machine Learning:
#Decision Tree:
ml_subset_y$Publisher <- as.factor(ml_subset_y$Publisher)
ml_subset_y$Platform <- as.factor(ml_subset_y$Platform)
ml_subset_y$Genre <- as.factor(ml_subset_y$Genre)
ml_subset_y$Rating <- as.factor(ml_subset_y$Rating)

#Splitting data into train (70%) and test (30%):
set.seed(1234)
index <- sample(nrow(ml_subset_y), 0.7 * nrow(ml_subset_y))
ml_subset_ytrain <- ml_subset_y[index,]
ml_subset_ytest <- ml_subset_y[-index,]

str(ml_subset_ytrain) # --> Name should be removed from this table or the formula made explicit, your choice.
                      # --> The 252 levels of the Publisher variable are problematic.
ml_subset_ytrain <- select(ml_subset_ytrain, -c(Name, Publisher))
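#Alternative (just a sketch): keep the table as-is and spell the formula out instead,
#listing only the predictors you want, e.g.:
#treemodel <- rpart(Rating ~ Platform + Genre + Year_of_Release +
#                     Critic_Score + Critic_Count,
#                   data = ml_subset_ytrain)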

#Modelling the train data using decision tree algorithm:
treemodel <- rpart(Rating ~ ., data = ml_subset_ytrain) # Now it works ;)
plot(treemodel, margin=0.25)
text(treemodel, use.n=T)
fancyRpartPlot(treemodel)

#Testing the model using the test data and using confusion matrix
#to check Accuracy:
prediction <- predict(treemodel, newdata=ml_subset_ytest, type='class')
accuracy_test <- table(prediction, ml_subset_ytest$Rating)
confusionMatrix(accuracy_test)
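Note that the caret function is confusionMatrix() with a capital M. If you just want the accuracy number out of it, a short sketch (cm is only an illustrative name):

cm <- confusionMatrix(accuracy_test)
cm$overall["Accuracy"]   #overall accuracy on the test set
cm$table                 #the confusion matrix itself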
    Hi Ignatu, good point about the impute. I always noticed, whenever I wanted to use correlation on the dataset, that the year was chr even though I thought I had changed it to int :D Thanks for pointing it out. I'll follow your approach for the machine learning and come back with feedback, thanks a lot – Arcane Persona Feb 25 '22 at 11:31