r - Duplicated rows in dummyVars

Question

I have a data frame in R. Here is an example:

asdf <- data.frame(id = c(2345, 7323, 2345, 4533),
               place = c("Home", "Home", "Office", "Office"),
               sex = c("Male", "Male", "Male", "Female"),
               consumed = c(1000, 800, 1000, 500))

As you can see, one id is duplicated because that person has two locations, Home and Office. I want to convert every character variable to dummy variables and end up with one row per id, with no duplicated ids. I am sure that "place" is the only variable whose values can differ between rows with the same id.

When I apply dummyVars from caret I can't get this, and the behavior does not make sense to me. For example, when I apply the following

dummy <- dummyVars( ~ ., data = asdf, fullRank = FALSE, levelsOnly = TRUE)
predict(dummy, asdf)

I get the following data frame, with duplicated ids

result <- data.frame(id = c(2345, 7323, 2345, 4533),
                 placeHome = c(1, 1, 0, 0),
                 placeOffice = c(0, 0, 1, 1),
                 sexFemale = c(0, 0, 0, 1),
                 sexMale = c(1, 1, 1, 0),
                 consumed = c(1000,  800, 1000,  500))

but I want this

sexy_result <- data.frame(id = c(2345, 7323, 4533),
                 placeHome = c(1, 1, 0),
                 placeOffice = c(1, 0, 1),
                 sexFemale = c(0, 0, 1),
                 sexMale = c(1, 1, 0),
                 consumed = c(1000,  800, 500))

Why don't you remove the duplicates before you create your dummy variables? Use: `asdf <- asdf[!duplicated(asdf$id),]` — Mankind_008, Dec 06 '18 at 21:23
You can try `tidyr::spread()` https://tidyr.tidyverse.org/reference/spread.html — phili_b, Dec 06 '18 at 23:24
If I delete duplicates I would lose information about some variables :( — Gabriel Gajardo, Dec 07 '18 at 04:43
@GabrielGajardo No, you will get all the information as per your expected output. `duplicated` checks the values in sequence and marks a value as a duplicate if it was observed before. Try: `sexy_result <- result[!duplicated(result$id),]` — Mankind_008, Dec 07 '18 at 15:46
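
The de-duplication suggested in the comments can be checked directly against the question's data. The sketch below assumes `result` is a data frame exactly as written in the question, and `dedup` is a name introduced here for illustration. `duplicated()` keeps only the first row seen for each id, so it seems worth checking whether the Office flag for id 2345 survives.

```r
# The dummy-coded data frame as written in the question.
result <- data.frame(id = c(2345, 7323, 2345, 4533),
                     placeHome = c(1, 1, 0, 0),
                     placeOffice = c(0, 0, 1, 1),
                     sexFemale = c(0, 0, 0, 1),
                     sexMale = c(1, 1, 1, 0),
                     consumed = c(1000, 800, 1000, 500))

# Keep only the first row seen for each id (the approach from the comments).
dedup <- result[!duplicated(result$id), ]
print(dedup)

# Inspect the place flags that remain for the id that had two rows.
print(dedup[dedup$id == 2345, c("placeHome", "placeOffice")])
```

For id 2345 only the first row (the Home one) is kept, so `placeOffice` stays 0 there, whereas the desired output has 1. Under that reading, de-duplicating alone does not reproduce the expected result.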

score 1 · Answer 1 · answered Dec 06 '18 at 23:41

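You could transform your result data frame using the dplyr package. Summing every column within each id collapses the duplicates, but note that it also adds up sexMale and consumed for the duplicated id: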

library(dplyr)
sexy_result <- result %>% group_by(id) %>% summarise_all(sum)
data.frame(sexy_result)

    id placeHome placeOffice sexFemale sexMale consumed
1 2345         1           1         0       2     2000
2 4533         0           1         1       0      500
3 7323         1           0         0       1      800

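If you want to sum only placeHome and placeOffice and leave the other columns unchanged, you could take the mean of the remaining columns, since their values are identical within an id: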

sexy_result <- result %>%
  group_by(id) %>%
  summarise(placeHome = sum(placeHome),
            placeOffice = sum(placeOffice),
            sexFemale = mean(sexFemale),
            sexMale = mean(sexMale),
            consumed = mean(consumed))
data.frame(sexy_result)

    id placeHome placeOffice sexFemale sexMale consumed
1 2345         1           1         0       1     1000
2 4533         0           1         1       0      500
3 7323         1           0         0       1      800
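
As a possible alternative that avoids choosing sum or mean column by column, the per-id maximum of every column could be taken with base R's `aggregate()`. This is only a sketch and it rests on the asker's statement that "place" is the only variable that differs between rows with the same id: under that assumption the maximum of a 0/1 dummy acts as a logical OR, and the maximum of a column that is constant within an id returns that constant. The name `collapsed` is introduced here for illustration.

```r
# The dummy-coded data frame as written in the question.
result <- data.frame(id = c(2345, 7323, 2345, 4533),
                     placeHome = c(1, 1, 0, 0),
                     placeOffice = c(0, 0, 1, 1),
                     sexFemale = c(0, 0, 0, 1),
                     sexMale = c(1, 1, 1, 0),
                     consumed = c(1000, 800, 1000, 500))

# One row per id: take the maximum of every other column within each id.
# Assumes non-place columns are constant within an id (stated in the question).
collapsed <- aggregate(. ~ id, data = result, FUN = max)
print(collapsed)
```

Like the dplyr versions, this returns the rows ordered by id rather than in the order of first appearance. If a column such as consumed could differ between rows of the same id, the maximum would silently pick one value, so the per-column `summarise()` call is the safer choice in that case.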
