Creating dummy variables from cells with multiple character values

Question

I'm trying to create multiple dummy variables based on one column called 'Tags' in my df (3 rows, 2 columns: Tags and Score). The problem is that each cell of the Tags column can hold any number of chr values (up to about 30). I want to create a new dummy variable for each unique chr value, telling me whether a case has that specific value or not (1/0). To show the problem I'm including dput(df):

structure(list(Tags = structure(c(27L, 16L, 4L), .Label = c("\"aan het water\", \"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"er even tussenuit\", \"gebruik streekproducten\", \"iens topper 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", frans, glutenvrij, romantisch, wijnbar, zakelijk", 
"\"aan het water\", \"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"iens topper 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", glutenvrij, kindvriendelijk, romantisch, wereldkeuken, zakelijk", 
"\"aan het water\", \"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", frans, glutenvrij, romantisch, zakelijk", 
"\"aan het water\", \"biologische gerechten\", \"gebruik streekproducten\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", frans, glutenvrij, romantisch, wijnbar, zakelijk", 
"\"aan het water\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"er even tussenuit\", \"iens topper 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", glutenvrij, grieks, romantisch", 
"\"aan het water\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", italiaans, kindvriendelijk, romantisch, zakelijk", 
"\"aan het water\", \"high tea\", brasserie, frans, kindvriendelijk, romantisch, zakelijk", 
"\"aan het water\", \"high tea\", kindvriendelijk, romantisch, wereldkeuken", 
"\"aan het water\", \"iens topper 2016\", italiaans, kindvriendelijk, romantisch, zakelijk", 
"\"aan het water\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", glutenvrij, kindvriendelijk, romantisch, wereldkeuken, zakelijk", 
"\"aan het water\", \"lactose intolerantie\", frans, glutenvrij, zakelijk", 
"\"aan het water\", frans", "\"all you can eat buffet\", \"er even tussenuit\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", glutenvrij, kindvriendelijk, romantisch, wereldkeuken, zakelijk", 
"\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"er even tussenuit\", \"gebruik streekproducten\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", glutenvrij, kindvriendelijk, romantisch, wereldkeuken", 
"\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"high tea\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", frans, glutenvrij, kindvriendelijk, romantisch, zakelijk", 
"\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"iens topper 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", glutenvrij, kindvriendelijk, romantisch, zakelijk", 
"\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolerantie\", \"met familie\", \"met vrienden\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", chinees, gastronomisch, glutenvrij, kindvriendelijk, romantisch, traditioneel, trendy, verjaardag, zakelijk", 
"\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"vegetarische gerechten\", italiaans, kindvriendelijk", 
"\"biologische gerechten\", \"gebruik streekproducten\", \"iens topper 2016\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", bbq/grill, glutenvrij, kindvriendelijk, romantisch, wijnbar", 
"\"biologische gerechten\", \"gebruik streekproducten\", \"lactose intolerantie\", \"vegetarische gerechten\", glutenvrij, romantisch, wereldkeuken", 
"\"biologische gerechten\", \"gebruik streekproducten\", frans, romantisch", 
"\"certificaat van uitmuntendheid tripadvisor 2016\", \"high tea\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", glutenvrij, romantisch, wereldkeuken, zakelijk", 
"\"er even tussenuit\", \"met familie\", \"met vrienden\", amerikaans, romantisch, trendy, verjaardag, wijnbar, zakelijk", 
"\"gebruik streekproducten\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", frans, glutenvrij, romantisch, zakelijk", 
"\"high tea\", \"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", frans, glutenvrij, romantisch, zakelijk", 
"\"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", frans, glutenvrij, kindvriendelijk, romantisch, wijnbar, zakelijk", 
"\"lactose intolerantie\", \"noten allergie\", \"pinda allergie\", glutenvrij, kindvriendelijk, spaans", 
"\"lactose intolerantie\", frans, glutenvrij, romantisch, zakelijk", "grieks", "spaans"), class = "factor"), Score = c(8, 9, 8.8)), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Tags", 
"Score"))

and df$Tags[1] returns:

[1] "lactose intolerantie", "noten allergie", "pinda allergie", glutenvrij, kindvriendelijk, spaans
30 Levels: "aan het water", "biologische gerechten", "certificaat van uitmuntendheid tripadvisor 2016", "er even tussenuit", "gebruik streekproducten", "iens topper 2016", "lactose intolerantie", "noten allergie", "pinda allergie", "vegetarische gerechten", frans, glutenvrij, romantisch, wijnbar, zakelijk ...

Manually I can run the following for example and it works:

df = mutate(df, lactose_intolerantie = ifelse(grepl("lactose intolerantie", Tags), 1, 0))

It created a new column containing a 1 when the value "lactose intolerantie" was present and zero when it's absent.

I'm looking for a way to have this done faster, for each possible chr value. Hope someone can help. Many thanks for giving a thought.

Do you have a list of all possible character values you want to check for? — aosmith, Nov 18 '16 at 20:17

joel.wilson · Accepted Answer · 2016-11-18T20:42:32.560

Just a starting step:

# split each cell on commas and strip the double quotes
x1 = gsub("\"", "", unlist(strsplit(as.character(df$Tags[1]), ",")))
x2 = gsub("\"", "", unlist(strsplit(as.character(df$Tags[2]), ",")))
x3 = gsub("\"", "", unlist(strsplit(as.character(df$Tags[3]), ",")))

# remove only the space occurring at the start of each tag
x11 = gsub("^ ", "", x1)
x22 = gsub("^ ", "", x2)
x33 = gsub("^ ", "", x3)

# get the unique tags
x = unique(c(x11, x22, x33))

# one 0/1 column per tag; fixed = TRUE matches the tag literally rather than as a regex
df1 = as.data.frame(lapply(as.list(x), function(tag) as.numeric(grepl(tag, df$Tags, fixed = TRUE))))
colnames(df1) = x

> df1
  lactose intolerantie noten allergie pinda allergie glutenvrij kindvriendelijk spaans biologische gerechten
1                    1              1              1          1               1      1                     0
2                    1              1              1          1               1      0                     1
3                    1              1              1          1               0      0                     1
  certificaat van uitmuntendheid tripadvisor 2016 gebruik streekproducten iens topper 2016 vegetarische gerechten
1                                               0                       0                0                      0
2                                               1                       1                1                      1
3                                               0                       1                0                      1
  romantisch zakelijk aan het water frans wijnbar
1          0        0             0     0       0
2          1        1             0     0       0
3          1        1             1     1       1
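
For more than a handful of rows, the per-row x1/x2/x3 steps above could be folded into a single lapply over all cells. The following is only a sketch under assumptions: the small df is made up to mimic the shape of the question's data, and the names tag_list, all_tags, dummies and result are illustrative. It also swaps grepl for %in%, so a tag is only counted when it matches a whole tag rather than part of a longer one.

```r
# Made-up data in the same shape as the question's df (values are illustrative).
df <- data.frame(
  Tags = c('"lactose intolerantie", glutenvrij, spaans',
           '"biologische gerechten", glutenvrij, romantisch',
           '"aan het water", frans, romantisch'),
  Score = c(8, 9, 8.8),
  stringsAsFactors = TRUE
)

# Split every cell on commas, then strip quotes and surrounding whitespace.
tag_list <- lapply(strsplit(as.character(df$Tags), ","), function(tags) {
  trimws(gsub('"', "", tags, fixed = TRUE))
})

# All distinct tags, in order of first appearance.
all_tags <- unique(unlist(tag_list))

# One 0/1 column per tag; %in% compares whole tags, not substrings.
dummies <- as.data.frame(
  lapply(all_tags, function(tag) {
    as.integer(vapply(tag_list, function(tags) tag %in% tags, logical(1)))
  })
)
colnames(dummies) <- all_tags

# Attach the dummy columns to the original data.
result <- cbind(df, dummies)
```

Because the tags are split once up front, the cost grows with the number of rows times the number of distinct tags, which should stay manageable for roughly 10,000 rows.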

yup, this makes me go! officially I have more than 10,000 rows in my original df, but there won't be more than 10,000 tags for sure, since it's about restaurants. Great! — Benjamin Telkamp, Nov 19 '16 at 09:19

aosmith · Answer 2 · 2016-11-18T20:57:33.767

A possibility with dplyr and tidyr, although using separate_rows means I didn't keep the original column. You could join back together based on row numbers or make a duplicate column of "Tags" to use for separate_rows.

If there is only one instance of a tag within each cell:

library(dplyr)
library(tidyr)
library(tibble)

df %>%
    rownames_to_column() %>% 
    separate_rows(Tags, sep = ", ") %>%
    mutate(Tags = gsub('"', "", Tags), n = 1) %>%
    spread(Tags, n, fill = 0)

I added the row names to the dataset, separated the "Tags" to be on separate rows instead of in a single column, removed the extra quotes around some of the tag names, then made a dummy column of 1's for each row prior to spreading into a wide format.

If each row could have multiple values of one of the character strings:

df %>%
    rownames_to_column() %>% 
    separate_rows(Tags, sep = ", ") %>%
    mutate(Tags = gsub('"', "", Tags), n = 1) %>%
    distinct(rowname, Tags, .keep_all = TRUE) %>%
    spread(Tags, n, fill = 0)
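
A variant of the pipeline above that keeps the original "Tags" column by splitting a duplicate of it, and uses pivot_wider (the newer tidyr replacement for spread). This is a sketch rather than a tested drop-in: it assumes tidyr 1.1.0 or later for the scalar values_fill, the small df is made up, and the column names row_id, tag and present are illustrative.

```r
library(dplyr)
library(tidyr)

# Made-up data in the same shape as the question's df (values are illustrative).
df <- tibble(
  Tags = c('"lactose intolerantie", glutenvrij, spaans',
           '"biologische gerechten", glutenvrij, romantisch',
           '"aan het water", frans, romantisch'),
  Score = c(8, 9, 8.8)
)

wide <- df %>%
  # keep Tags intact; split a copy of it instead
  mutate(row_id = row_number(), tag = Tags) %>%
  separate_rows(tag, sep = ",\\s*") %>%
  # drop the extra quotes and add the indicator column
  mutate(tag = gsub('"', "", tag, fixed = TRUE), present = 1) %>%
  # guard against a tag appearing twice within one cell
  distinct(row_id, tag, .keep_all = TRUE) %>%
  pivot_wider(names_from = tag, values_from = present, values_fill = 0)
```

Here row_id plays the role of rownames_to_column(): it keeps rows with identical Tags and Score from being merged when the data is widened.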
