0

I have a dataset (df) similar to this one:

df <- data.frame("ID"=c(1, 1, 1, 2, 2), 
                 "Method of payment"=c("cash","liabilities", 
                                       "shares", "cash", NA), 
                 "USD"=c(110, 130, 200, 100, NA), 
                 "var3"=c(1500, NA, NA, "ab", "bc"))
df
#   ID Method.of.payment USD var3
# 1  1              cash 110 1500
# 2  1       liabilities 130 <NA>
# 3  1            shares 200 <NA>
# 4  2              cash 100   ab
# 5  2              <NA>  NA   bc

And I would like to have:

data.frame("ID"=c(1, 2), 
           "Method of payment"=c("cash, liabilities, shares", "cash"), 
           "Cash"=c(110, 100), 
           "Liabilities"=c(130, 0),
           "Shares"=c(200, 0),
           "var3"=c(1500, "ab,bc"))

#   ID         Method.of.payment Cash Liabilities Shares  var3
# 1  1 cash, liabilities, shares  110         130    200  1500
# 2  2                      cash  100           0      0 ab,bc

Therefore, I would like to

  1. collapse the variables "method of payment" and "var3" in order to have only one row per ID, having all the levels/values linked to an ID pasted in one row;
  2. generate new variables from existing levels of the factor variable method of payment ("cash", "liabilities", "shares") whose values should be the corresponding values of the variable USD or 0 if there is no corresponding value.

I am working with a very large dataset, so I am looking for something that works also on large data. I hope it is clear.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
Esperanta
  • 83
  • 7
  • @A.Suliman Why post an answer as a comment...? – duckmayr May 26 '19 at 20:20
  • @A.Suliman Thanks for the answer. On the dataframe that I showed here it works, however on my real dataframe it gives me the following error "Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 17732 rows". I searched a bit on internet and it seems this is a common error for the spread function, however I did not really understand how to fix it. Thanks. – Esperanta May 27 '19 at 09:44
  • @A.Suliman This df generates the same error: `df <- data.frame("ID"=c(1, 1, 1, 2, 2,3,3), "Method"=c("cash","liabilities", "shares", "cash", NA, NA, NA), "USD"=c(110, 130, 200, 100, NA,NA,NA), "var3"=c(1500, NA, NA, "ab", "bc",10,NA)) `. Error: "Each row of output must be identified by a unique combination of keys. Keys are shared for 2 rows: * 6, 7 Do you need to create unique ID with tibble::rowid_to_column()?" I guess it has to do with the NAs in Method? – Esperanta May 27 '19 at 10:38

3 Answers3

0

Here is a solution via data.table that avoids replicating the collapsed fields over multiple rows. Performance bottleneck will be paste:

library(data.table)

df <- data.table(ID = c(1, 1, 1, 2, 2), 
                 Method = c("cash","liabilities", "shares", "cash", NA), 
                 USD = c(110, 130, 200, 100, NA), 
                 var3 = c(1500, NA, NA, "ab", "bc"))

nice_paste <- function(z) {
  paste(z[!is.na(z)], collapse = ", ")
}

# Compress 
part_1 <- df[, .(Method = .(nice_paste(Method)),
                 var3 = .(nice_paste(var3))), by = ID]

# Reshape "Method" to wide
part_2 <- dcast(df[!is.na(Method)], 
                ID ~ Method, 
                value.var = "USD", 
                fill = 0, 
                fun.aggregate = sum)

part_1[part_2, on = "ID"]

Gives

   ID                    Method   var3 cash liabilities shares
1:  1 cash, liabilities, shares   1500  110         130    200
2:  2                      cash ab, bc  100           0      0
Michael M
  • 880
  • 7
  • 10
0

You could use by and reshape to get the USD into wide format, dropping the var3 column.

b <- by(df1, df1$id, reshape, direction="wide", timevar="method.of.payment", drop="var3")
b <- Reduce(function(x, y) merge(x, y, all=TRUE), b)  # merge the list resulting from `by`

To collapse the levels, you could use toString in another by and cbind it to the merged

res <- cbind(b, do.call(rbind, by(df1[c(2, 4)], df1$id, function(X) 
  lapply(X, function(x) toString(na.omit(x))))))[c(1, 6, 2:4, 7)]  # some column sorting 
res
#   id         method.of.payment usd.cash usd.liabilities usd.shares   var3
# 1  1 cash, liabilities, shares      110             130        200   1500
# 2  2                      cash      100              NA         NA ab, bc

Edit: Perhaps it is more elegant to do this in just one by.

b <- by(df1, df1$id, function(X) {
  r <- reshape(X, direction="wide", timevar="method.of.payment", drop="var3")
  s <- lapply(X[c(2, 4)], function(x) toString(na.omit(x)))
  return(merge(r, s))
  })

res <- Reduce(function(x, y) merge(x, y, all=TRUE), b)[c(1, 3, 2, 5:6, 4)]
res
#   id         method.of.payment usd.cash usd.liabilities usd.shares   var3
# 1  1 cash, liabilities, shares      110             130        200   1500
# 2  2                      cash      100              NA         NA ab, bc

Data

Note: Better avoid spaces in column names in R.

df1 <- structure(list(id = c(1, 1, 1, 2, 2), method.of.payment = structure(c(1L, 
2L, 3L, 1L, NA), .Label = c("cash", "liabilities", "shares"), class = "factor"), 
    usd = c(110, 130, 200, 100, NA), var3 = structure(c(1L, NA, 
    NA, 2L, 3L), .Label = c("1500", "ab", "bc"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
0

Error message says

Keys are shared for 2 rows: * 6, 7

which can be seen in row 6 and 7 in the following output, i.e. Method is the same in row 6 and 7, so how spread will assign them to different column

df %>% group_by(ID) %>% 
       mutate(MofP=paste(Method, collapse = ','),var3=paste(var3[!is.na(var3)], collapse = ','))

# A tibble: 7 x 5
# Groups:   ID [3]
     ID Method        USD var3  MofP                   
   <dbl> <fct>       <dbl> <chr> <chr>                  
1     1 cash          110 1500  cash,liabilities,shares
2     1 liabilities   130 1500  cash,liabilities,shares
3     1 shares        200 1500  cash,liabilities,shares
4     2 cash          100 ab,bc cash,NA                
5     2 NA             NA ab,bc cash,NA                
6     3 NA             NA 10    NA,NA                  
7     3 NA             NA 10    NA,NA

To solve this we can do:

#Option 1. Select groups where Method is unique among the group, then do spread as usual
df %>% group_by(ID) %>% 
       mutate(MofP=paste(Method, collapse = ','),var3=paste(var3[!is.na(var3)], collapse = ',')) %>% 
       filter(n_distinct(Method)==n())

#Option 2. Follow dplyr erro msg Do you need to create unique ID with tibble::rowid_to_column()?, 
#but create our own id
df %>% group_by(ID) %>% mutate(MofP=paste(Method, collapse = ','),var3=paste(var3[!is.na(var3)], collapse = ','), 
                               rid=if(n_distinct(Method)!=n()) row_number() else 1) %>% 
      spread(key=Method, value=USD, fill = 0)

# A tibble: 4 x 8
# Groups:   ID [3]
     ID var3  MofP                      rid  cash liabilities shares `<NA>`
   <dbl> <chr> <chr>                   <dbl> <dbl>       <dbl>  <dbl>  <dbl>
1     1 1500  cash,liabilities,shares     1   110         130    200      0
2     2 ab,bc cash,NA                     1   100           0      0      0
3     3 10    NA,NA                       1     0           0      0      0
4     3 10    NA,NA                       2     0           0      0      0
A. Suliman
  • 12,923
  • 5
  • 24
  • 37
  • Thanks again! Option one does the job but eliminates all the rows in which Method = NA, which I do not want to do. While option 2 does not group the rows with the same id, and creates one column for each combination of the collapsed Method (i.e. one column= "Libabilities", another="Liabilities, cash" and so on. I am also trying to find a wat to make it work. – Esperanta May 27 '19 at 13:25
  • @Esperanta, try `library(purrr);df %>% group_by(ID) %>% mutate(MofP=paste(Method, collapse = ','),var3=paste(var3[!is.na(var3)], collapse = ',')) %>% split(.$ID) %>% map_dfr(~if(n_distinct(.x$Method)==nrow(.x)) mutate(.x, rid=1) else mutate(.x, rid=1:nrow(.x)))` – A. Suliman May 27 '19 at 14:49