I'm working on a problem with this dataset. I am trying to build a model that predicts Japan sales from every other predictor except Rank, Name, and Global Sales (the last of which is too correlated with the outcome variable). So, I did:

vgames <- read_csv('data/vgsales.csv', show_col_types = FALSE, col_types = list(
    Year = col_date("%Y")
)) %>%
    mutate(
        Platform = factor(Platform),
        Genre = factor(Genre),
        Publisher = factor(Publisher)
    )

vgames_model <- vgames %>%
    select(-c(Rank, Name, Global_Sales))

# Train test split
vgames_split <- vgames_model %>% initial_split()
vgames_training <- vgames_split %>% training()
vgames_testing <- vgames_split %>% testing()

# Folds for CV
vgames_folds <- vgames_training %>% vfold_cv(v = 10)

# Recipe
vgames_recipe <- vgames_training %>%
    recipe(formula = JP_Sales ~ .) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_date(Year, features = c("year"), keep_original_cols = FALSE) %>%
    step_dummy(all_nominal()) %>%
    step_zv(all_numeric_predictors())

The output of this recipe is something like this:

# A tibble: 12,448 × 570
   NA_Sales EU_Sales Other_…¹ JP_Sa…² Year_…³ Platf…⁴ Platf…⁵ Platf…⁶ Platf…⁷ Platf…⁸ Platf…⁹ Platf…˟ Platf…˟ Platf…˟
      <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1   -0.272  -0.279   -0.240     0       2006       0       0       0       0       0       0       0       0       0
 2    0.145   0.258    0.0629    0       2012       0       0       0       0       0       0       0       0       0
 3   -0.198  -0.241   -0.189     0.07    2008       0       0       0       1       0       0       0       0       0
 4   -0.149  -0.260   -0.189     0       2010       0       0       0       1       0       0       0       0       0
 5   -0.149  -0.0679  -0.0380    0       2006       0       0       0       0       0       0       0       0       0
 6   -0.296  -0.183   -0.189     0       2015       0       1       0       0       0       0       0       0       0
 7    3.32    1.05     0.315     1.81    1988       0       0       0       0       0       0       0       0       0
 8   -0.308  -0.260   -0.240     0       2016       0       0       0       0       0       0       0       0       0
 9   -0.321  -0.202   -0.240     0       2015       0       0       0       0       0       0       0       0       0
10   -0.112  -0.145   -0.139     0       2010       0       0       0       0       0       0       0       0       0
# … with 12,438 more rows, 556 more variables: Platform_N64 <dbl>, Platform_NES <dbl>, Platform_NG <dbl>,
#   Platform_PC <dbl>, Platform_PCFX <dbl>, Platform_PS <dbl>, Platform_PS2 <dbl>, Platform_PS3 <dbl>,
#   Platform_PS4 <dbl>, Platform_PSP <dbl>, Platform_PSV <dbl>, Platform_SAT <dbl>, Platform_SCD <dbl>,
#   Platform_SNES <dbl>, Platform_TG16 <dbl>, Platform_Wii <dbl>, Platform_WiiU <dbl>, Platform_WS <dbl>,
#   Platform_X360 <dbl>, Platform_XB <dbl>, Platform_XOne <dbl>, Genre_Adventure <dbl>, Genre_Fighting <dbl>,
#   Genre_Misc <dbl>, Genre_Platform <dbl>, Genre_Puzzle <dbl>, Genre_Racing <dbl>, Genre_Role.Playing <dbl>,
#   Genre_Shooter <dbl>, Genre_Simulation <dbl>, Genre_Sports <dbl>, Genre_Strategy <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Now, here is the problem: when I define and fit the MLP, every epoch reports nan for the loss and for every other metric. That is:

nn <- mlp(epochs = 20) %>%
    set_engine('keras', verbose = 1, metrics = c("mae"), optimizer = 'adam', loss = 'mean_absolute_error') %>%
    set_mode('regression')

nnwf <- workflow() %>%
    add_model(nn) %>%
    add_recipe(vgames_recipe)

nnwf %>% fit(vgames_training)

yields

...
Epoch 16/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
Epoch 17/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan
Epoch 18/20
389/389 [==============================] - 1s 2ms/step - loss: nan - mae: nan
Epoch 19/20
389/389 [==============================] - 1s 2ms/step - loss: nan - mae: nan
Epoch 20/20
389/389 [==============================] - 1s 1ms/step - loss: nan - mae: nan

I have already looked around and tried normalizing at other points in the recipe, lowering the learning rate (both in the mlp() function and in the set_engine() specification), and removing the date column altogether. None of that worked, and I'm having a hard time figuring out why. Has anybody run into this issue before?

1 Answer

The original Year column contains missing values, and missing data generate missing statistics: any NA that reaches the network's inputs turns the loss and every metric into nan.
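One way to confirm and work around this is sketched below. It is only an illustration under assumptions: `toy` is a made-up stand-in for `vgames_training`, and median imputation via `step_impute_median()` is one possible choice, not necessarily the right one for this dataset. The sketch counts the missing values per column, then imputes after `step_date()` so that the derived `Year_year` column has no NA before it is normalized.

```r
library(recipes)
library(tibble)

# Hypothetical toy data standing in for vgames_training;
# Year deliberately contains one missing value.
toy <- tibble(
  JP_Sales = c(0, 0.07, 1.81, 0),
  NA_Sales = c(0.1, 0.3, 3.2, 0.2),
  Year     = as.Date(c("2006-01-01", NA, "1988-01-01", "2010-01-01")),
  Genre    = factor(c("Action", "Sports", "Action", "Sports"))
)

# Count missing values per column to locate the problem
na_counts <- colSums(is.na(toy))
print(na_counts)

# Extract the year first, then impute, then normalize,
# so no NA is left when the data reach the model.
rec <- recipe(JP_Sales ~ ., data = toy) %>%
  step_date(Year, features = "year", keep_original_cols = FALSE) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

baked <- rec %>% prep() %>% bake(new_data = NULL)

# Should report FALSE once the missing year has been imputed
print(anyNA(baked))
```

If imputing a release year is not appropriate, dropping the affected rows before the split (for example with `tidyr::drop_na(Year)`) is a simpler alternative.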

topepo