One Hot Encoding for top categories, NA, and remaining subsumed as 'others' in R

Question

I want to one hot encode my variables only for the top categories and NA and 'others'.

So in this simplified example, hot encoding b where freq > 1 and NA:

id <- c(1, 2, 3, 4, 5, 6)
b <- c(NA, "A", "C", "A", "B", "C")
c <- c(2, 3, 6, NA, 4, 7)
df <- data.frame(id, b, c)

  id    b  c
1  1 <NA>  2
2  2    A  3
3  3    C  6
4  4    A NA
5  5    B  4
6  6    C  7

table <- as.data.frame(table(df$b))

  Var1 Freq
1    A    2
2    B    1
3    C    2

table_top <- table[table$Freq > 1,]

  Var1 Freq
1    A    2
3    C    2

Now, I would like to have something like this

  id  b_NA  c b_A b_C b_Others
    1    1  2   0   0        0
    2    0  3   1   0        0
    3    0  6   0   1        0
    4    0 NA   1   0        0
    5    0  4   0   0        1
    6    0  7   0   1        0

I have tried with subsetting df

table_top <- as.vector(table_top$Var1)
table_only_top <- subset(df, b %in% table_top)
table_only_top

  a b  c
2 1 A  3
3 2 C  6
4 2 A NA
6 3 C  7

However, now I am stuck how to get to the output. In my real data I have many more categories than here, so using the names from the output is not an option. Also the others category in my real output exists of many categories.

Any hint is highly appreciated :)

Roman · Accepted Answer · 2018-10-21T11:53:38.393

4

Fast and sexy with data.table and mltools:

> one_hot(dt, naCols = TRUE, sparsifyNAs = TRUE)

   id cat_NA cat_A cat_C cat_Others freq
1:  1      1     0     0          0    2
2:  2      0     1     0          0    3
3:  3      0     0     1          0    6
4:  4      0     1     0          0   NA
5:  5      0     0     0          1    4
6:  6      0     0     1          0    7

Code

Load libraries

library(dplyr)
library(data.table)
library(mltools)

Transform data

# Kick out all with freq == 1 and below
df <- df %>%
    # Group by variables that will be onehotted
    group_by(cat) %>%
    # Add a count per group item column 
    mutate(count = n()) %>%
    # Ungroup for next steps
    ungroup() %>%
    # Change all that have a count of 1 or below to "Others".
    # If cat was a factor, we would get numeric results at this step.
    mutate(cat = ifelse(!is.na(cat) & count <= 1, "Others", cat),
    # Only now we turn it into a factor for the one_hot function 
                        cat = as.factor(cat)) %>%
    # Drop the count column
    select(id, cat, freq)

# Turn into data.table
dt <- as.data.table(df)

Check intermediate result

> dt
       id    cat freq
1:  1   <NA>    2
2:  2      A    3
3:  3      C    6
4:  4      A   NA
5:  5 Others    4
6:  6      C    7

Data

id <- c(1, 2, 3, 4, 5, 6)
cat <- c(NA, "A", "C", "A", "B", "C")
freq <- c(2, 3, 6, NA, 4, 7)
# It is important to have no other factor variables other
# than the variable(s) you one want to one hot. For that reason
# the automatic factoring is turned off.
df <- data.frame(id, cat, freq, 
                 stringsAsFactors = FALSE)          

> df
  id  cat freq
1  1 <NA>    2
2  2    A    3
3  3    C    6
4  4    A   NA
5  5    B    4
6  6    C    7

edited Oct 21 '18 at 11:53

answered Oct 20 '18 at 17:38

Roman

4,744
2
16
58

This is genius @Roman, thanks a lot! On the dummy data it works perfectly. When I applied your logic on my real data something weird is happening - I wanted it to only include those activities with a freq > 66 but it now has in row 4 (next comment) a count of 19 included - any clue what I might have done wrong? `test_2 <- test %>% group_by(activity_new) %>% mutate(count = n()) %>% ungroup() %>% mutate(activity_new = ifelse(!is.na(activity_new) & count <= 66, "Others", activity_new), activity_new = as.factor(activity_new)) %>% select(id, activity_new, cons)` – Sarah Oct 21 '18 at 11:42
`test_2` `# A tibble: 16,096 x 3 id activity_new cons 1 A1 NA 22034 2 A2 NA 4060 3 A3 Others 7440 4 B2 19 4199490` – Sarah Oct 21 '18 at 11:42
@Sarah, you're welcome. What are those weird things? Please consider that the `one_hot()` will only work with factor variables (see `mutate(..., cat = as.factor(cat))`). There are also different ways to treat NAs. – Roman Oct 21 '18 at 11:42
Hi @Sarah, I added some comments for clarification. It would be helpful if you could add a `str(test)` to your original post. It really seems to me that something went wrong in the construction of your data. Is `activity_new` maybe already a factor variable **before** the mutation? – Roman Oct 21 '18 at 11:59
1

Thank you so much - your additional comments cleared my mistake: I had two factors and also activity_new was a factor before. Converting both to a character beforehand resolved the problem and your code worked perfectly! :) @Roman – Sarah Oct 21 '18 at 12:22

tmfmnk · Answer 2 · 2018-10-20T16:35:57.737

Definitely not an elegant solution but it should work:

library(tideverse)
library(reshape2)

df %>% 
  gather(var, val, -id) %>%
  add_count(var, val) %>% 
  mutate(res = ifelse(var == "b" & n > 1, 1, 0),
         val = paste("b_", val, sep = "")) %>% 
  filter(var == "b" & n != 1) %>% 
  dcast(id ~ val, value.var = "res") %>% 
  full_join(df, by = c("id" = "id")) %>%
  mutate(b_NA = ifelse(is.na(b), 1, 0)) %>%
  mutate_at(vars(contains("b_")), funs(replace(., is.na(.), 0))) %>%
  mutate(b_OTHERS = ifelse(rowSums(.[grep("b_", names(.))]) != 0, 0, 1))

  id b_A b_C    b  c b_NA b_OTHERS
1  2   1   0    A  3    0        0
2  3   0   1    C  6    0        0
3  4   1   0    A NA    0        0
4  6   0   1    C  7    0        0
5  1   0   0 <NA>  2    1        0
6  5   0   0    B  4    0        1

gaut · Answer 3 · 2018-10-20T14:48:56.723

You could cbind data.frames based on your different criteria.

# simple conditions -------------------------------------------------------
df <-  df_orig[,-1]
df_na <- is.na(df)
colnames(df_na) <- paste0(colnames(df),"_NA")
df_A <- df=="A"
colnames(df_A) <- paste0(colnames(df),"_A")
df_C <- df=="C"
colnames(df_C) <- paste0(colnames(df),"_C")

# for counts you can use sapply with one loop -----------------------------
df_counts <- df
for(j in 1:ncol(df)) {
  counts <- sapply(1:nrow(df), function(x) sum(df[x,j]==df[,j], na.rm=T) )
 df_counts[,j] <- counts
}

df_counts <- df
# or avoid explicit loops altogether --------------------------------------
df_counts2 <- sapply(1:ncol(df), function(y) sapply(1:nrow(df), function(x) sum(df[x,y]==df[,y], na.rm=T) ) )
colnames(df_counts2 ) <- paste0(colnames(df),"_counts")

# cbind df's  -------------------------------------------------------------
df_full <- cbind(df_orig, df_na, df_A, df_C, df_counts2)
# check if frequency greater then 1 or NA ---------------------------------
df_full$result <- df_full[,10:11] >=2 | df_full[,4:5]
df_full

The harder part is I suppose to compute the frequencies, here I included two ways. the result is:

  id    b  c  b_NA  c_NA   b_A   c_A   b_C   c_C b_counts c_counts result.b_NA result.c_NA
1  1 <NA>  2 FALSE FALSE FALSE FALSE FALSE FALSE        1        1       FALSE       FALSE
2  2    A  3 FALSE FALSE  TRUE FALSE FALSE FALSE        2        1        TRUE       FALSE
3  3    C  6 FALSE FALSE FALSE FALSE  TRUE FALSE        2        1        TRUE       FALSE
4  4    A NA FALSE  TRUE  TRUE    NA FALSE    NA        2        0        TRUE        TRUE
5  5    B  4 FALSE FALSE FALSE FALSE FALSE FALSE        1        1       FALSE       FALSE
6  6    C  7 FALSE FALSE FALSE FALSE  TRUE FALSE        2        1        TRUE       FALSE

You can modify the columns based on your conditions. Hope that helps

thanks a lot for your answer. Unfortunately this is not quite what I meant, as I only want to have those categories in my new df which fulfill the if-clause (freq > 1). Since I have a lot of categories it is not an option for me to do it this way and then delete the not needed columns. — Sarah, Oct 20 '18 at 15:21
ok, what about `df_counts` and `df_counts2`? these only have this criterium — gaut, Oct 20 '18 at 16:50

One Hot Encoding for top categories, NA, and remaining subsumed as 'others' in R

3 Answers3

Code

Data