
I have a dataset that looks like the following:

name   ingredient  allergic
prod1     ing1        yes
prod1     ing2        yes
prod2     ing1        no
prod2     ing3        no
prod3     ing3        yes

I want to convert the ingredient variable to dummies and format my data such that it looks like:

name   ing1    ing2    ing3   allergic
prod1     1        1       0        yes
prod2     1        0       1        no
prod3     0        0       1        yes

Does anyone know how I can go about doing this? I was able to convert my variables to dummies using

model.matrix(allergic ~ ingredient, data)

But I don't think it is doing what I want it to. Any help would be greatly appreciated!

Joey B
  • You usually don't have to worry about converting your variables to dummies; R's modelling functions create dummy values for factors automatically. You can check the factor's categories by looking at `levels()`. – Matt Jun 09 '17 at 17:34
  • If I were to try to use this data to predict the allergic feature, wouldn't I need the data formatted like the second data frame in my example? Otherwise, it would predict using only 1 feature at a time. – Joey B Jun 09 '17 at 17:35
  • No, predict methods take care of this for you. – Roland Jun 09 '17 at 17:50
  • Hi @Joey B you've received a couple of answers below. Please consider accepting one as an answer (clicking check mark) if it helped you to solve your issue. This lets the community know the answer worked for you. – CPak Sep 09 '17 at 02:25
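
For reference, `model.matrix(allergic ~ ingredient, data)` returns one row per row of the input, so it cannot by itself collapse the data to one row per product. A minimal base R sketch of the wide layout asked for in the question follows. It assumes the data frame is called `data`, as in the question, that each name/ingredient pair appears at most once (otherwise the cells hold counts rather than 0/1), and that `allergic` is constant within each product; `counts`, `wide` and `labels` are names made up for this illustration.

```r
data <- data.frame(
  name = c("prod1", "prod1", "prod2", "prod2", "prod3"),
  ingredient = c("ing1", "ing2", "ing1", "ing3", "ing3"),
  allergic = c("yes", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Cross-tabulate name by ingredient: one row per product, one column per ingredient
counts <- table(data$name, data$ingredient)
wide <- as.data.frame.matrix(counts)
wide$name <- rownames(wide)

# Attach the allergic label (assumption: it is the same for every row of a product)
labels <- unique(data[, c("name", "allergic")])
wide <- merge(wide, labels, by = "name")

print(wide)
```

If a product had conflicting `allergic` values, `unique()` would keep both and the merge would produce more than one row for that product, so it is worth checking `nrow(wide)` against the number of distinct names.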

2 Answers

(Posting this as an answer since I don't have enough reputation to comment.)

Using the tibble created by Robertmc, use:

library(dplyr)  # provides %>% and group_by()

df <- df %>%
  group_by(name, allergic) %>%
  tidyr::spread(ingredient, value = dummy, fill = 0)

This should give you the output posted.

# A tibble: 3 x 5
   name allergic  ing1  ing2  ing3
* <chr>    <chr> <dbl> <dbl> <dbl>
1 prod1      yes     1     1     0
2 prod2       no     1     0     1
3 prod3      yes     0     0     1
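
In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. As a sketch only, the same reshaping could be written as below; it rebuilds the example tibble (including the helper `dummy` column) so that it runs on its own, and `wide` is a name chosen for this illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble::tibble(
  name = c("prod1", "prod1", "prod2", "prod2", "prod3"),
  ingredient = c("ing1", "ing2", "ing1", "ing3", "ing3"),
  allergic = c("yes", "yes", "no", "no", "yes"),
  dummy = 1
)

# One row per name/allergic pair, one column per ingredient;
# combinations that do not occur are filled with 0
wide <- df %>%
  pivot_wider(
    names_from = ingredient,
    values_from = dummy,
    values_fill = list(dummy = 0)
  )

print(wide)
```

The named-list form of `values_fill` is used because it is accepted by both older and newer tidyr releases.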
CPak

You can achieve this with tools from the tidyverse packages:

df <- tibble::tibble(
  name = c("prod1", "prod1", "prod2", "prod2", "prod3"),
  ingredient = c("ing1", "ing2", "ing1", "ing3", "ing3"),
  allergic = c("yes", "yes", "no", "no", "yes"), 
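  dummy = 1)

# %>% and slice() come from dplyr
library(dplyr)

# slice() removes the unwanted rows (1, 4 and 5) from the spread output
tidyr::spread(df, ingredient, value = dummy, fill = 0, drop = FALSE) %>% slice(c(-1, -4, -5))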

 # A tibble: 3 x 5
   name allergic  ing1  ing2  ing3
  <chr>    <chr> <dbl> <dbl> <dbl>
1 prod1      yes     1     1     0
2 prod2       no     1     0     1
3 prod3      yes     0     0     1
RobertMyles
  • Adding `%>% dplyr::slice(c(-1, -4, -5))` to the spread line will take out the lines it appears you don't want, or you can use some type of logical condition with `dplyr::filter()` to get rid of those. – RobertMyles Jun 09 '17 at 17:48
  • Thanks, but I need a way for the Name column to have only unique values. I want to predict, for each unique product, whether allergic is 'yes' or 'no'. The way you have it formatted does not allow me to do that. – Joey B Jun 09 '17 at 17:50
  • I added the comment into the answer. Should do what you're looking for now. – RobertMyles Jun 09 '17 at 17:52
  • Thanks for that. I am very close, but I am getting an error saying: Error: Duplicate identifiers for rows (61, 76), (189, 197), (181, 196). Any idea what that means? – Joey B Jun 09 '17 at 18:00
  • Is your data exactly like what you posted above? It works perfectly for me with the code I've written. This error happens when there are rows that can't be separately identified. – RobertMyles Jun 09 '17 at 18:04
  • It essentially is. It is data scraped off Sephora's website, so it's very unclean and probably causing some problems. If I slice it so it only does the first 50 rows it works fine. I may just have to revisit this at another time. – Joey B Jun 09 '17 at 18:13
  • @JoeyB try Chi Pak's answer above, looks like it should work without using `slice`, which is hacky anyway, and only works for the specific data you provided above. – RobertMyles Jun 09 '17 at 18:27
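
The "Duplicate identifiers" error discussed in the comments arises when the same name/allergic/ingredient combination occurs in more than one row, which is plausible for scraped data. A minimal sketch of one way to handle it, assuming the duplicates are exact repeats that can safely be dropped: list the repeated combinations first, then keep one row per combination before spreading. The tibble `df_dup` and the names `dups` and `wide` are made up for this illustration and are not the asker's data.

```r
library(dplyr)

# Hypothetical data with one repeated row (prod1 / ing1)
df_dup <- tibble::tibble(
  name = c("prod1", "prod1", "prod1", "prod2"),
  ingredient = c("ing1", "ing1", "ing2", "ing1"),
  allergic = c("yes", "yes", "yes", "no"),
  dummy = 1
)

# Inspect which combinations occur more than once
dups <- df_dup %>%
  count(name, ingredient, allergic) %>%
  filter(n > 1)

# Keep one row per combination, then reshape as in the answers above
wide <- df_dup %>%
  distinct(name, ingredient, allergic, .keep_all = TRUE) %>%
  tidyr::spread(ingredient, value = dummy, fill = 0)

print(dups)
print(wide)
```

This does not resolve products whose `allergic` value is inconsistent across rows; those would still end up with more than one row per name and need separate cleaning.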