I'm new to data science and want to build a neural network model in R. I've read about one-hot encoding categorical data prior to training. I tried to implement this, but I receive the following error when training the model:

Error in model.frame.default(formula = nndf$class ~ ., data = train) : 
  invalid type (list) for variable 'nndf$class'

I've read the nnet documentation which explains that the formula should be passed as:

class ~ x1 + x2

But I'm still unsure of how to pass the data correctly.

Here is the code:

nndf$al <- one_hot(as.data.table(nndf$al))
nndf$su <- one_hot(as.data.table(nndf$su))
nndf$rbc <- one_hot(as.data.table(nndf$rbc))
nndf$pc <- one_hot(as.data.table(nndf$pc))
nndf$pcc <- one_hot(as.data.table(nndf$pcc))
nndf$ba <- one_hot(as.data.table(nndf$ba))
nndf$htn <- one_hot(as.data.table(nndf$htn))
nndf$dm <- one_hot(as.data.table(nndf$dm))
nndf$cad <- one_hot(as.data.table(nndf$cad))
nndf$appet <- one_hot(as.data.table(nndf$appet))
nndf$pe <- one_hot(as.data.table(nndf$pe))
nndf$ane <- one_hot(as.data.table(nndf$ane))
nndf$class <- one_hot(as.data.table(nndf$class))

class(nndf$class)

# view the dataframe to ensure one hot encoding is correct
summary(nndf)

# randomly sample rows for tt split
train_idx <- sample(1:nrow(nndf), 0.8 * nrow(nndf))
test_idx <- setdiff(1:nrow(nndf), train_idx)

# prepare training set and corresponding labels
train <- nndf[train_idx,]

# prepare testing set and corresponding labels
X_test <- nndf[test_idx,]
y_test <- nndf[test_idx, "class"]

# create model with a single hidden layer containing 10 neurons
model <- nnet(nndf$class~., train, maxit=150, size=10)

# prediction
X_pred <- predict(train, type="raw")
mrrain
  • I'm not sure what your function `one_hot` does, but `model.matrix` handles factors this way: `model.matrix(~ factor(gear) + 0, mtcars)` – rawr Apr 23 '20 at 02:03
  • I should've mentioned, I'm using the one_hot function from the mltools library – mrrain Apr 23 '20 at 02:13
  • @mrrain you don't need `one_hot` unless it's mandatory; as @rawr suggested, you can use `model.matrix( ~ . - 1, df)` to transform **all categorical variables to one-hot encoding at once**. *Note:* `df` must contain only categorical variables. – nikn8 Apr 23 '20 at 02:30

1 Answer

Assumption

All the variables in your dataset (nndf) are categorical.

Steps

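  1. Convert all variables except the response variable (i.e. class) to one-hot encoding (i.e. 0/1 format), using either of the following methods.

one_hot method

  # 13 is the index of the `class` variable; one_hot() from mltools expects a data.table with factor columns
  one_hot_df <- one_hot(as.data.table(nndf[, -13]))

model.matrix method

  model_mat_df <- model.matrix( ~ . - 1, nndf[, -13])

  2. Convert class to a factor and add it to either of the data frames above. Use data.frame() rather than cbind() here, because cbind() on a matrix would turn the factor into integer codes.

    class <- as.factor(nndf$class)
    final_df <- data.frame(model_mat_df, class = class)

  3. Split final_df into train and test and use the training set in the model. Refer to the response by its column name, not as nndf$class.

    nnet(class ~ ., data = train, maxit = 150, size = 10)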
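
For reference, here is a minimal end-to-end sketch of the steps above. It is only an illustration: the `toy` data frame and its values are made up to stand in for `nndf`, and only the column names `htn`, `appet` and `class` are borrowed from the question. It uses the `model.matrix` route so that nothing beyond `nnet` is needed.

```r
library(nnet)  # recommended package, shipped with standard R installations

set.seed(42)

# Hypothetical toy data standing in for nndf: two categorical predictors and a class label
toy <- data.frame(
  htn   = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  appet = factor(sample(c("good", "poor"), 200, replace = TRUE)),
  class = factor(sample(c("ckd", "notckd"), 200, replace = TRUE))
)

# Step 1: one-hot encode the predictors only; the response is left out
predictors <- toy[, names(toy) != "class", drop = FALSE]
encoded    <- model.matrix(~ . - 1, data = predictors)

# Step 2: attach the response as a factor column
# (data.frame() keeps it a factor; cbind() on the matrix would not)
final_df <- data.frame(encoded, class = toy$class)

# Step 3: 80/20 train/test split
train_idx <- sample(nrow(final_df), floor(0.8 * nrow(final_df)))
train <- final_df[train_idx, ]
test  <- final_df[-train_idx, ]

# Refer to the response by its column name inside the formula
model <- nnet(class ~ ., data = train, size = 10, maxit = 150, trace = FALSE)

# predict() takes the fitted model first, then the new data
pred <- predict(model, newdata = test, type = "class")
table(predicted = pred, actual = test$class)
```

Because the toy labels are random, the confusion table will not show a useful model; the point is only that `class` stays a single factor column and the formula refers to it by name.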

nikn8
  • Is it possible to pass both dummy categorical variables and normalized numeric variables into nnet in R? This is what I'm trying to achieve overall, but I haven't been able to find any examples. – mrrain Apr 24 '20 at 00:23
  • Yes, you can. Most algorithms take input in the same format: numeric variables normalised and categorical variables one-hot encoded. Just separate the numeric and categorical columns into two data frames, apply the operations, and combine them back using `cbind`/`bind_cols`. Finally, split the combined df into test and train and pass it to the algorithm to build the model. – nikn8 Apr 24 '20 at 03:19
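
A rough sketch of what the last comment describes, under some assumptions: the `mixed` data frame, its numeric columns `age` and `bp`, and the helper `scale01` are hypothetical and only for illustration, and min-max scaling is just one possible normalisation. For simplicity the scaling uses the whole data set; in practice you may prefer to compute the min and max on the training rows only.

```r
library(nnet)

set.seed(1)

# Hypothetical mixed data: two numeric predictors, one categorical predictor, one class label
mixed <- data.frame(
  age   = rnorm(200, mean = 50, sd = 15),
  bp    = rnorm(200, mean = 80, sd = 10),
  htn   = factor(sample(c("yes", "no"), 200, replace = TRUE)),
  class = factor(sample(c("ckd", "notckd"), 200, replace = TRUE))
)

# Numeric part: min-max scale each column to [0, 1] (scale01 is an illustrative helper)
scale01  <- function(x) (x - min(x)) / (max(x) - min(x))
num_part <- as.data.frame(lapply(mixed[c("age", "bp")], scale01))

# Categorical part: one-hot encode with model.matrix
cat_part <- model.matrix(~ . - 1, data = mixed["htn"])

# Combine both parts with the response, which stays a factor
combined <- data.frame(num_part, cat_part, class = mixed$class)

# Split and fit as before
train_idx <- sample(nrow(combined), floor(0.8 * nrow(combined)))
model <- nnet(class ~ ., data = combined[train_idx, ],
              size = 5, maxit = 150, trace = FALSE)
pred <- predict(model, newdata = combined[-train_idx, ], type = "class")
```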