0

Here is dput() of a structure I currently have.

structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",  "Germany", "Germany", "USA"), USA = c(0, 0, 0, 0, 0), Germany = c(0,  0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,  -5L))

I want to edit this dataframe to get the below results in order to apply this approach to a dataset with 100k+ observations. Specifically, I want to use information from (df$country) that describes a country assigned to a particular ID (e.g., id == 1 and country == Japan), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1. Note that IDs are not unique!

This is what I'd like to end up with:

structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",  "Germany", "Germany", "USA"), USA = c(1, 1, 0, 1, 1), Germany = c(0,  0, 1, 1, 1), Japan = c(1, 1, 0, 0, 0)), class = "data.frame", row.names = c(NA,  -5L))

The following code gives a close result:

df[levels(factor(df$country))] = model.matrix(~country - 1, df)

But ends up giving me the following, erroneous structure:

structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", 
"Germany", "Germany", "USA"), USA = c(1, 0, 0, 0, 1), Germany = c(0, 
0, 1, 1, 0), Japan = c(0, 1, 0, 0, 0)), row.names = c(NA, -5L
), class = "data.frame") 

How can I edit the above command in order to yield my desired result? I cannot use pivot because, in actuality, I'm working with many datasets that have different values in the "country" column that, once pivoted, will yield datasets with non-uniform columns/structures, which will impede data analysis later on.

Thank you for any help!

ZZ Top
  • 93
  • 5
  • I'm not sure why a pivot and rejoin wouldn't work, e.g. in *data.table* `setDT(df)` and then `df[,.(id,country)][dcast(id ~ country, data=cbind(df, value=1), fill=0), on=.(id)]` – thelatemail Jul 07 '22 at 23:21
  • This code seems to work, but deletes all columns not represented in (id, country) – ZZ Top Jul 08 '22 at 00:22
  • Are the `USA`/`Japan` etc columns all just place holders full of 0's - could they be deleted first, e.g. `df[, unique(df$country) := .(NULL)]` then added back? Or just not made in the first place? – thelatemail Jul 08 '22 at 03:39
  • They're placeholders full of zeroes that I want to turn into dummy variables = 1 if country == column name – ZZ Top Jul 08 '22 at 13:25
  • It'd been helpful for all if you'd posted a link to or updated your question yesterday here: https://stackoverflow.com/questions/72900665/assigning-column-values-in-for-loops-too-slow where we discussed most of these issues. – harre Jul 08 '22 at 15:08
  • You could probably use: `df |> group_by(id) |> mutate(across(-c(country), ~ max(.)))` after `model.matrix` (or the similar `pivoting`). – harre Jul 08 '22 at 15:08

3 Answers3

2

Perhaps this helps

library(dplyr)
df %>% 
    mutate(across(USA:Japan, ~  +(country == cur_column()))) %>% 
    group_by(id) %>% 
    mutate(across(USA:Japan, max)) %>% 
    ungroup

-output

# A tibble: 5 × 5
     id country   USA Germany Japan
  <dbl> <chr>   <int>   <int> <int>
1     1 USA         1       0     1
2     1 Japan       1       0     1
3     2 Germany     0       1     0
4     4 Germany     1       1     0
5     4 USA         1       1     0

Or modifying the model.matrix as

m1 <- model.matrix(~country - 1, df)
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I appreciate your help. My actual structure is: `structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA",), USA = c(1, 1, 0, 1, 1), Germany = c(0, 0, 1, 1, 1), Japan = c(1, 1, 0, 0, 0)), Korea = (0,0,0,0,0), class = "data.frame", row.names = c(NA, -5L))`. The suggestion you posted doesn't account for the fact that I have other columns not represented in "country" – ZZ Top Jul 08 '22 at 00:34
  • @ZZTop it was based on your example – akrun Jul 09 '22 at 20:10
1

You can use base R

re <- rle(df$id)
for(j in re$values){
    y <- which(j == df$id)
        df[y , match(df$country[y] , colnames(df))] <- 1
}
  • Output
  id country USA Germany Japan
1  1     USA   1       0     1
2  1   Japan   1       0     1
3  2 Germany   0       1     0
4  4 Germany   1       1     0
5  4     USA   1       1     0
Mohamed Desouky
  • 4,340
  • 2
  • 4
  • 19
1

Are you looking for such a solution (in combination) to your closed question here CRAN R - Assign the value '1' to many dummy variables at once

The solution provided by @akrun solves the question here. But you may look for something like this:

library(dplyr)

df %>% 
  group_by(id) %>% 
  mutate(across(-country, ~case_when(country == cur_column() ~ 1))) %>% 
  fill(-country, .direction = "updown") %>% 
  mutate(across(-country, ~ifelse(is.na(.), 0, .))) %>% 
  ungroup()
     id country   USA Germany Japan
  <dbl> <chr>   <dbl>   <dbl> <dbl>
1     1 USA         1       0     1
2     1 Japan       1       0     1
3     2 Germany     0       1     0
4     4 Germany     1       1     0
5     4 USA         1       1     0
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • 1
    And in combination with the question from yesterday: https://stackoverflow.com/questions/72900665/assigning-column-values-in-for-loops-too-slow/72901079#comment128783286_72901079 – harre Jul 08 '22 at 15:23