Given the following data frame:

df <-
  data.frame(one_letter = rep("a", 5),
             other_letters = letters[2:6])

df
#>   one_letter other_letters
#> 1          a             b
#> 2          a             c
#> 3          a             d
#> 4          a             e
#> 5          a             f

I want to combine both columns into one, to get:

#>   all_letters_combined
#> 1                    a
#> 2                    b
#> 3                    c
#> 4                    d
#> 5                    e
#> 6                    f

Although I could use dplyr and tidyr and do the following:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# yes, it gets the job done
df %>%
  pivot_longer(everything()) %>%
  select(value) %>%
  unique()
#> # A tibble: 6 x 1
#>   value
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 c    
#> 4 d    
#> 5 e    
#> 6 f

I'm nevertheless looking for a faster/more direct way to do it, because speed becomes an issue when the data frame is a tibble with list-columns that contain data frames. Here's an example, although still pretty minimal:

library(nycflights13)
library(babynames)
library(tictoc)


bigger_tib <- 
  tibble(one_df = rep(list(babynames), 10),
         other_dfs = list(starwars, flights, mtcars, trees, women, PlantGrowth, ToothGrowth, co2, Titanic, USArrests))

tic()
bigger_tib %>%
  pivot_longer(everything()) %>%
  select(value) %>%
  unique()
#> # A tibble: 11 x 1
#>    value                   
#>    <list>                  
#>  1 <tibble [1,924,665 x 5]>
#>  2 <tibble [87 x 14]>      
#>  3 <tibble [336,776 x 19]> 
#>  4 <df [32 x 11]>          
#>  5 <df [31 x 3]>           
#>  6 <df [15 x 2]>           
#>  7 <df [30 x 2]>           
#>  8 <df [60 x 3]>           
#>  9 <ts [468]>              
#> 10 <table [4 x 2 x 2 x 2]> 
#> 11 <df [50 x 4]>
toc()
#> 0.97 sec elapsed

I know the example isn't great because it doesn't demonstrate problematic run time, but in my real data this procedure gets pretty slow and I want to speed it up.

Emman
  • If it is just 2 columns then you could try to column bind (unique(column1) and unique(column2)) that should be faster than pivoting. If you know column1 is just the same value then just take the first value and attached to column2. – Dave2e Jan 22 '22 at 14:33
  • @Dave2e thanks. It seems very straightforward, but still: can you please demonstrate with some code? – Emman Jan 22 '22 at 15:14
  • @Henrik following your idea we could even go with `tibble(value = c(bigger_tib$one_df[1], bigger_tib$other_dfs))`. Problem with both is that they don't stem from `bigger_tib` if we want to use the `%>%` pipe. – Emman Jan 22 '22 at 15:52
  • Maybe `bigger_tib %>% summarise(value = c(one_df[1], other_dfs))` (my previous comment originated from some `base` code; forgot to squeeze it into a pipe...;)) – Henrik Jan 22 '22 at 15:55
  • @Henrik this is perfect. Could you please post as an answer? – Emman Jan 22 '22 at 16:10
  • If anyone, I think @Dave2e may want to write it up. It was their idea ("_If you know column1 is just the same value then just take the first value and attached to column2_"). Cheers – Henrik Jan 22 '22 at 16:12

2 Answers


The bottleneck is `unique`, which becomes extremely costly when applied to a list of data frames. `distinct` would be faster. On the other hand, if you already know that the data frames are unique before pivoting them, giving each of them a unique id to preserve this relationship is an even better approach. That said, consider the following benchmark.

library(dplyr)
library(tidyr)

f1 <- . %>% pivot_longer(everything()) %>% select(value) %>% unique()
f2 <- . %>% pivot_longer(everything()) %>% select(value) %>% distinct()
f3 <- . %>% 
  rename(other_df = other_dfs) %>%  # align suffixes so names_sep splits cleanly
  mutate(one_id = 0L, other_id = row_number()) %>% 
  pivot_longer(starts_with(c("one", "other")), names_to = c(NA, ".value"), names_sep = "_") %>% 
  distinct(id, .keep_all = TRUE)

microbenchmark::microbenchmark(f1(bigger_tib), f2(bigger_tib), f3(bigger_tib), times = 10L)

Output

> f3(bigger_tib)
# A tibble: 11 x 2
   df                          id
   <list>                   <int>
 1 <tibble [1,924,665 x 5]>     0
 2 <tibble [87 x 14]>           1
 3 <df [50 x 2]>                2
 4 <df [32 x 11]>               3
 5 <df [31 x 3]>                4
 6 <df [15 x 2]>                5
 7 <df [30 x 2]>                6
 8 <df [60 x 3]>                7
 9 <ts [468]>                   8
10 <table [4 x 2 x 2 x 2]>      9
11 <df [50 x 4]>               10

Benchmark

Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
 f1(bigger_tib) 619.5852 623.8327 638.0796 634.4866 644.9060 687.6760    10
 f2(bigger_tib) 230.6140 231.6163 234.4957 234.1330 237.1576 238.6012    10
 f3(bigger_tib)   4.0693   5.2220   5.5078   5.2996   5.4089   8.6592    10

A note on that `pivot_longer` line: `names_to = c(NA, ".value")` with `names_sep = "_"` splits each column name at the underscore, discards the part before it (the `NA`), and uses the part after it as the name of the output column (the `".value"`). Columns that share the same suffix are therefore stacked into the same column.
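As a hedged illustration of that mechanism, here is a minimal toy example (the column names `one_x`/`other_x` are made up for demonstration):

```r
library(tidyr)

toy <- data.frame(one_x = 1, other_x = 2)

# NA discards the "one"/"other" prefix; ".value" uses the "x" suffix
# as the output column name, so both columns stack into a single `x`.
pivot_longer(toy, everything(),
             names_to = c(NA, ".value"), names_sep = "_")
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1     1
#> 2     2
```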

ekoam

The problem as stated is merging the unique value in the first column with the second column. If the first column is just a repeated value and the second column contains all unique values, then a simple solution is:

data.frame(all_letters_combined=c(df[1,1], df[,2]))

If you need to remove duplicates from the resulting column (duplicates within column 2, or the column 1 value appearing again in column 2), then, based on ekoam's observation that `dplyr::distinct()` is faster than `unique()`, here is an option:

distinct(data.frame(all_letters_combined=c(df[1,1], df[,2])))
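As suggested in the comments, the same first-value-plus-other-column idea can also stay inside a `%>%` pipe via `summarise()`. Here is a minimal sketch with a small made-up tibble of list-columns (built-in data sets stand in for the real data frames):

```r
library(dplyr, warn.conflicts = FALSE)

small_tib <- tibble(one_df    = rep(list(mtcars), 3),
                    other_dfs = list(iris, trees, women))

# Keep one copy of the repeated list-column, then append the other column.
small_tib %>%
  summarise(value = c(one_df[1], other_dfs))
```

Note that recent dplyr versions (>= 1.1) warn about multi-row `summarise()` results; swapping in `reframe()` gives the same result without the warning.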

Of course, if there are more columns and more varied values, then a more complex solution would be required.
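For that many-column case, one hedged base-R sketch (assuming all columns hold atomic values of the same type; `unlist()` would unwrap data frames stored in list-columns, so this does not apply to those):

```r
df <- data.frame(one_letter = rep("a", 5),
                 other_letters = letters[2:6])

# Flatten every column into one vector, then drop duplicates.
data.frame(all_letters_combined = unique(unlist(df, use.names = FALSE)))
#>   all_letters_combined
#> 1                    a
#> 2                    b
#> 3                    c
#> 4                    d
#> 5                    e
#> 6                    f
```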

Dave2e