One hot coding a data-frame in R

Question

Consider a data-frame df1 similar to the one shown

ID EDUCATION   OCCUPATION      BINARY_VAR
1  Undergrad   Student              1
2  Grad        Business Owner       1
3  Undergrad   Unemployed           0
4  PhD         Other                1

You may create your own random df1 using R-code below

ID <- c(1:4)
EDUCATION <- sample (c('Undergrad', 'Grad', 'PhD'), 4, rep = TRUE)
OCCUPATION <- sample (c('Student', 'Business Owner', 'Unemployed', 'Other'), 4, rep = FALSE)
BINARY_VAR <- sample(c(0, 1), 4, rep = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

From this, I need to derive another data-frame df2 that would look like this

ID Undergrad Grad PhD Student Business Owner Unemployed Other BINARY_VAR
1      1      0    0     1           0           0        0       1
2      1      1    0     0           1           0        0       1
3      1      0    0     0           0           1        0       0
4      1      1    1     0           0           0        1       1

You must have noticed how for level PhD, the other factor levels under EDUCATION also hold true since EDUCATION is the highest education level for that ID. That, however, is the secondary objective.

I can't seem to figure out a way to obtain a data-frame with each column giving the truth value corresponding to individual factor levels in its parent data-frame. Is there a package in R that could help? Or maybe a way to code this?

Can I do this using melt?

I read through previously asked question(s) that looked similar, but they deal with frequencies of occurrence.

Edit:

As recommended by Sumedh, one way to do this is using dummyVars from caret.

dummies <- dummyVars(ID ~ ., data = df1)
df2 <- data.frame(predict(dummies, newdata = df1))
df2 <- df2 [1:7]

For your primary objective, you could use `dummyVars` from `caret` http://topepo.github.io/caret/preprocess.html — Sumedh, Jun 21 '16 at 13:25
For PhD, you could use `df$Grad[df$PhD == 1] <- 1` and `df$Undergrad[df$PhD == 1] <- 1`. Maybe there is a better way. — Sumedh, Jun 21 '16 at 13:48

Bryan Goggin · Accepted Answer · 2016-06-21T14:10:00.613

tidyr and dplyr combined with that base table() function should work:

ID <- c(1:4)
EDUCATION <- c('Undergrad', 'Grad', 'PhD', 'Undergrad')
OCCUPATION <- c('Student', 'Business Owner', 'Unemployed', 'Other')
BINARY_VAR <- sample(c(0, 1), 4, rep = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

library(dplyr)
library(tidyr)

edu<-as.data.frame(table(df1[,1:2])) %>% spread(EDUCATION, Freq)

for(i in 1:nrow(edu))
  if(edu[i,]$PhD == 1) 
    edu[i,]$Undergrad <-edu[i,]$Grad <-1

truth_table<-merge(edu,
      as.data.frame(table(df1[,c(1,3)])) %>% spread(OCCUPATION, Freq),
      by = "ID")

truth_table$BINARY_VAR<-df1$BINARY_VAR
truth_table

ID Grad PhD Undergrad Business Owner Other Student Unemployed BINARY_VAR
1    0   0         1              0     0       1          0          1
2    1   0         0              1     0       0          0          1
3    1   1         1              0     0       0          1          0
4    0   0         1              0     1       0          0          1

Edit: added a for loop to update the education levels beneath PhD inspired by @ Sumedh's earlier suggestion.

@Sumedh's solution works, except - the original data has 12 factor levels, so maybe I'd have to create a function do it in (somewhat) an iterative fashion over the levels. Also, there are some 160,000 rows ... — , Jun 21 '16 at 13:57

One hot coding a data-frame in R

1 Answers1