5

I have explored various options using quosures, symbols, and evaluation, but I can't seem to get the right syntax. Here is an example dataframe.

data.frame("A" = letters[1:4], "B" = letters[26:23], "C" = letters[c(1,3,5,7)], "D" = letters[c(2,4,6,8)], "pastecols" = c("B, C","B, D", "B, C, D", NA))
  A B C D pastecols
1 a z a b      B, C
2 b y c d      B, D
3 c x e f   B, C, D
4 d w g h      <NA>

Now suppose I want to paste values from different columns based on the lookup string in pastecols, and I always want to include column A. This is my desired result:

  A B C D pastecols  result
1 a z a b      B, C   a z a
2 b y c d      B, D   b y d
3 c x e f   B, C, D c x e f
4 d w g h      <NA>       d

Ideally this could be done in dplyr. This is the closest I have gotten:

x %>% mutate(result = lapply(lapply(str_split(pastecols, ", "), c, "A"), na.omit))
  A B C D pastecols     result
1 a z a b      B, C    B, C, A
2 b y c d      B, D    B, D, A
3 c x e f   B, C, D B, C, D, A
4 d w g h      <NA>          A

3 Answers3

2

Here's one way using pmap to do a similar thing. pmap can be used to effectively work on dataframes by row by capturing each row as a named vector; you can then get the desired column names for indexing as cols by selecting them with ["pastecols"].

Most of the anonymous function syntax is not tidyverse stuff, but just basic R stuff. To walk through it:

  1. Pass the dataframe as the list to the .l argument of pmap_chr. Remember that dataframes are lists of columns!
  2. Capture all the ... arguments with c(...). Basically we are calling each row of the dataframe as arguments to the function; now row is a named vector containing the row. Note that if you have list-columns this will break, (but so will a lot of other things here so I assume there aren't any...)
  3. We can get the values of row that we want from row["pastecols"], but we need to turn (say) "B, C" into c("A", "B", "C") to do that. This next line just adds the "A", replaces missing values with "A", splits into pieces if there are any, and then indexes back down into the list. The [[ part is just how you do list[[1]]" in a pipe chain, it's the prefix form of the operator. You need this because str_split returns a list and we just want the vector.
  4. Use this cols vector to get the desired values from row and return it, collapsed into a length 1 character vector!
library(tidyverse)
tbl <- tibble("A" = letters[1:4], "B" = letters[26:23], "C" = letters[c(1,3,5,7)], "D" = letters[c(2,4,6,8)], "pastecols" = c("B, C","B, D", "B, C, D", NA))

tbl %>%
  mutate(result = pmap_chr(
    .l = .,
    .f = function(...){
      row <-  c(...)
      cols <- row["pastecols"] %>% str_c("A, ", .) %>% replace_na("A") %>% str_split(", ") %>% `[[`(1)
      vals <- row[cols] %>% str_c(collapse = ", ")
      return(vals)
    }
  ))
#> # A tibble: 4 x 6
#>   A     B     C     D     pastecols result    
#>   <chr> <chr> <chr> <chr> <chr>     <chr>     
#> 1 a     z     a     b     B, C      a, z, a   
#> 2 b     y     c     d     B, D      b, y, d   
#> 3 c     x     e     f     B, C, D   c, x, e, f
#> 4 d     w     g     h     <NA>      d

Created on 2018-12-03 by the reprex package (v0.2.0).

Calum You
  • 14,687
  • 4
  • 23
  • 42
  • This approach appears to be almost 5X faster than your suggestion using gather. Bonus points because I have been trying to learn new functionality in purrr! Do you mind explaining what is happening in your function definition within pmap_chr? Specifically, the function(...), c(...) and the %>% `[[`(1). – tcuthbertson Dec 04 '18 at 03:56
  • Added some explanation. I suspect that this could be made faster if it is currently still too slow by doing the `cols` part outside of `pmap` since `str_c` and `str_split` are vectorised, but it would require some shuffling around of code. – Calum You Dec 04 '18 at 04:30
  • Thanks for the update. I was familiar with the list operator, I had just never seen it in a pipe chain like that. As for the ellipsis, that is something I have only seen in function documentation, glad I know how to use it effectively now! – tcuthbertson Dec 04 '18 at 15:17
1

Not the most elegant solution but gets the job done with just base R. If column A never shows up in pastecols you can remove unique() from the code.

for(r in seq_len(nrow(df))) {
  df$result[r] <- paste(
                    df[r, na.omit(unique(c("A", unlist(strsplit(df$pastecols[r], ", ")))))],
                    collapse = " "
                  )
}
df

  A B C D pastecols  result
1 a z a b      B, C   a z a
2 b y c d      B, D   b y d
3 c x e f   B, C, D c x e f
4 d w g h      <NA>       d

Data -

df <- data.frame(
  "A" = letters[1:4], 
  "B" = letters[26:23], 
  "C" = letters[c(1,3,5,7)], 
  "D" = letters[c(2,4,6,8)], 
  "pastecols" = c("B, C","B, D", "B, C, D", NA), stringsAsFactors = F
)
Shree
  • 10,835
  • 1
  • 14
  • 36
1

Here's a different way that doesn't rely on iterating functions in the apply or map families, if you would prefer to avoid them, and tries to leverage the tidyr side of the tidyverse. The approach is basically to expand the dataframe with gather and separate_rows into each combination of pastecols and actual columns, and then filter so we only keep the ones that match for each rowid. Once we have that, we can group_by and summarise to bring it back to one row per rowid. There is a bunch of housekeeping to deal with the fact that you always have column A, and note that I leave A in the output pastecols, but you can remove that if you want to.

library(tidyverse)
tbl <- tibble("A" = letters[1:4], "B" = letters[26:23], "C" = letters[c(1,3,5,7)], "D" = letters[c(2,4,6,8)], "pastecols" = c("B, C","B, D", "B, C, D", NA))

tbl %>%
  rowid_to_column() %>%
  mutate(
    pastecols = str_c("A, ", pastecols),
    pastecols = if_else(is.na(pastecols), "A", pastecols)
  ) %>%
  gather(colname, value, -pastecols, -rowid) %>%
  separate_rows(pastecols) %>%
  filter(pastecols == colname) %>%
  group_by(rowid) %>%
  summarise(
    pastecols = str_c(pastecols, collapse = ", "),
    result = str_c(value, collapse = ", ")
  )
#> # A tibble: 4 x 3
#>   rowid pastecols  result    
#>   <int> <chr>      <chr>     
#> 1     1 A, B, C    a, z, a   
#> 2     2 A, B, D    b, y, d   
#> 3     3 A, B, C, D c, x, e, f
#> 4     4 A          d

Created on 2018-12-03 by the reprex package (v0.2.0).

Calum You
  • 14,687
  • 4
  • 23
  • 42
  • I like that this example sticks to the tidyverse side of things. I think I was trying to get too creative with quosures and such when gather and separate_rows are perfect for the job. I do worry about computation time for how I will be applying this in practice. Simple for the toy example, but the data I will be applying this to has 15 "pastecols" and ~6M rows. I'll have to try it out, but I imagine I may run into some memory overflow errors even with the beefy server I am using. – tcuthbertson Dec 04 '18 at 03:45