I'm working on this dataset (vgsales.csv) and trying to build a model that predicts Japan sales (JP_Sales) from every other predictor, except Rank, Name, and Global_Sales, which is too strongly correlated with the outcome variable. So I did:
library(tidyverse)
library(tidymodels)

# Read the data, parsing Year as a date and turning the categorical columns into factors
vgames <- read_csv('data/vgsales.csv', show_col_types = FALSE, col_types = list(
  Year = col_date("%Y")
)) %>%
  mutate(
    Platform = factor(Platform),
    Genre = factor(Genre),
    Publisher = factor(Publisher)
  )
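Just as a sanity check on the parse (a generic sketch, nothing specific to the problem below), I can look at the column types and the per-column NA counts:

# Column types and how many NAs each column ended up with after parsing
glimpse(vgames)
vgames %>% summarise(across(everything(), ~ sum(is.na(.x))))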
vgames_model <- vgames %>%
  select(-c(Rank, Name, Global_Sales))

# Train/test split
vgames_split <- vgames_model %>% initial_split()
vgames_training <- vgames_split %>% training()
vgames_testing <- vgames_split %>% testing()

# Folds for CV
vgames_folds <- vgames_training %>% vfold_cv(v = 10)
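The folds aren't used in the code below yet; the idea is to resample the workflow over them later, roughly like this (a sketch that assumes the nnwf workflow defined further down and yardstick's rmse/mae metrics):

# Hypothetical later step: estimate performance across the 10 folds
nn_resamples <- nnwf %>%
  fit_resamples(resamples = vgames_folds, metrics = metric_set(rmse, mae))
collect_metrics(nn_resamples)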
# Recipe
vgames_recipe <- vgames_training %>%
  recipe(formula = JP_Sales ~ .) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_date(Year, features = c("year"), keep_original_cols = FALSE) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric_predictors())
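To see what the recipe produces, I can prep it on the training data and bake it back (a minimal sketch; bake(new_data = NULL) returns the processed training set):

vgames_recipe %>%
  prep() %>%
  bake(new_data = NULL)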
The output of this recipe is something like this:
# A tibble: 12,448 × 570
NA_Sales EU_Sales Other_…¹ JP_Sa…² Year_…³ Platf…⁴ Platf…⁵ Platf…⁶ Platf…⁷ Platf…⁸ Platf…⁹ Platf…˟ Platf…˟ Platf…˟
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.272 -0.279 -0.240 0 2006 0 0 0 0 0 0 0 0 0
2 0.145 0.258 0.0629 0 2012 0 0 0 0 0 0 0 0 0
3 -0.198 -0.241 -0.189 0.07 2008 0 0 0 1 0 0 0 0 0
4 -0.149 -0.260 -0.189 0 2010 0 0 0 1 0 0 0 0 0
5 -0.149 -0.0679 -0.0380 0 2006 0 0 0 0 0 0 0 0 0
6 -0.296 -0.183 -0.189 0 2015 0 1 0 0 0 0 0 0 0
7 3.32 1.05 0.315 1.81 1988 0 0 0 0 0 0 0 0 0
8 -0.308 -0.260 -0.240 0 2016 0 0 0 0 0 0 0 0 0
9 -0.321 -0.202 -0.240 0 2015 0 0 0 0 0 0 0 0 0
10 -0.112 -0.145 -0.139 0 2010 0 0 0 0 0 0 0 0 0
# … with 12,438 more rows, 556 more variables: Platform_N64 <dbl>, Platform_NES <dbl>, Platform_NG <dbl>,
# Platform_PC <dbl>, Platform_PCFX <dbl>, Platform_PS <dbl>, Platform_PS2 <dbl>, Platform_PS3 <dbl>,
# Platform_PS4 <dbl>, Platform_PSP <dbl>, Platform_PSV <dbl>, Platform_SAT <dbl>, Platform_SCD <dbl>,
# Platform_SNES <dbl>, Platform_TG16 <dbl>, Platform_Wii <dbl>, Platform_WiiU <dbl>, Platform_WS <dbl>,
# Platform_X360 <dbl>, Platform_XB <dbl>, Platform_XOne <dbl>, Genre_Adventure <dbl>, Genre_Fighting <dbl>,
# Genre_Misc <dbl>, Genre_Platform <dbl>, Genre_Puzzle <dbl>, Genre_Racing <dbl>, Genre_Role.Playing <dbl>,
# Genre_Shooter <dbl>, Genre_Simulation <dbl>, Genre_Sports <dbl>, Genre_Strategy <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Now, here comes the problem: when I define and fit the MLP, every epoch reports NaN for the loss and for every other metric. That is:
nn <- mlp(epochs = 20) %>%
  set_engine('keras', verbose = 1, metrics = c("mae"), optimizer = 'adam', loss = 'mean_absolute_error') %>%
  set_mode('regression')

nnwf <- workflow() %>%
  add_model(nn) %>%
  add_recipe(vgames_recipe)

nnwf %>% fit(data = vgames_training)
yields
...
Epoch 16/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
Epoch 17/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
Epoch 18/20
389/389 [==============================] - 1s 2ms/step - loss: nan - mae: nan
Epoch 19/20
389/389 [==============================] - 1s 2ms/step - loss: nan - mae: nan
Epoch 20/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
I already looked around and tried normalizing at other points in the recipe, lowering the learning rate (both in the mlp() function and in the set_engine() specification), and removing the date column altogether. None of that worked, and I'm having a hard time figuring out what's going wrong. Has anybody run into this issue before?
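For reference, the set_engine() side of the learning-rate attempt looked roughly like this (just a sketch: 1e-4 is an example value, and older versions of the keras R package name the argument lr instead of learning_rate):

# Same model, but with an explicitly configured Adam optimizer at a lower learning rate
nn_low_lr <- mlp(epochs = 20) %>%
  set_engine(
    'keras',
    verbose = 1,
    metrics = c("mae"),
    loss = 'mean_absolute_error',
    optimizer = keras::optimizer_adam(learning_rate = 1e-4)
  ) %>%
  set_mode('regression')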