I have a dataset with some duplicate entries that I want to change to include only unique combinations of values, with a dup_num column to indicate the number of duplicate entries, and a dup_rows column to indicate which rows contain duplicate data.

I implemented a solution based on "Finding duplicate observations of selected variables in a tibble", but it throws a string of warnings when the list column of row numbers is coerced to a character vector. That is not a problem right now, but I want to show this data with DT and Shiny, and the warnings are a problem for that application.

library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c(
               "Moe", "Larry", "Curly", "Shemp", "extra"
             ), 6))

chr_dups <- as_mapper( ~ str_c(.x) %>%
                         str_remove_all("[c\\(\\)]"))

df %>%
  nest(episode, .key = "dups") %>%
  mutate(dup_num = map_dbl(dups, nrow),
         dup_rows = map_chr(dups, chr_dups))
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing

#> (the same warning is emitted 15 times in total; the 14 identical repeats are omitted here)
#> # A tibble: 15 x 5
#>    day   name  dups             dup_num dup_rows
#>    <chr> <chr> <list>             <dbl> <chr>   
#>  1 Mon   Moe   <tibble [2 x 1]>       2 1, 16   
#>  2 Wed   Larry <tibble [2 x 1]>       2 2, 17   
#>  3 Fri   Curly <tibble [2 x 1]>       2 3, 18   
#>  4 Mon   Shemp <tibble [2 x 1]>       2 4, 19   
#>  5 Wed   extra <tibble [2 x 1]>       2 5, 20   
#>  6 Fri   Moe   <tibble [2 x 1]>       2 6, 21   
#>  7 Mon   Larry <tibble [2 x 1]>       2 7, 22   
#>  8 Wed   Curly <tibble [2 x 1]>       2 8, 23   
#>  9 Fri   Shemp <tibble [2 x 1]>       2 9, 24   
#> 10 Mon   extra <tibble [2 x 1]>       2 10, 25  
#> 11 Wed   Moe   <tibble [2 x 1]>       2 11, 26  
#> 12 Fri   Larry <tibble [2 x 1]>       2 12, 27  
#> 13 Mon   Curly <tibble [2 x 1]>       2 13, 28  
#> 14 Wed   Shemp <tibble [2 x 1]>       2 14, 29  
#> 15 Fri   extra <tibble [2 x 1]>       2 15, 30

Created on 2019-09-19 by the reprex package (v0.3.0)

I am pretty sure that the problem is in as_mapper().

The reprex above uses representative toy data. The tibble describes some episodes from the Three Stooges, the day the episode ran, and the character who was the protagonist for the episode.

Thanks!

M. Wood

3 Answers

The warning appears because the list elements are not atomic vectors: the 'dups' column is a list of tibbles, which we can see if we pull the column

df %>%
  nest(dups = episode)  %>% 
  pull(dups)
#<list_of<tbl_df<episode:integer>>[15]>
#[[1]]
# A tibble: 2 x 1
#  episode
#    <int>
#1       1
#2      16

#[[2]]
# A tibble: 2 x 1
#  episode
#    <int>
#1       2
#2      17
# ...
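
As a minimal sketch of why the coercion warning is triggered, the check below compares one nested tibble with the column inside it. The object name `one_group` is only an illustration and is not part of the original code.

```r
library(tibble)

# One element of the nested 'dups' list column, rebuilt by hand
# (hypothetical name, for illustration only)
one_group <- tibble(episode = c(1L, 16L))

# A tibble is a list of columns, so it is not an atomic vector
is.atomic(one_group)

# The column inside it is an ordinary integer vector
is.atomic(one_group$episode)

# An atomic vector can be collapsed to a single string without coercion
toString(one_group$episode)
```

String functions expect the second kind of object, which is why passing the tibble itself forces a coercion.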

So each element is a tibble, not a vector. We can either extract the column from each tibble with pull, or flatten each tibble and then apply the function:

library(purrr)
df %>%
   nest(dups = episode) %>%
   mutate(dup_num = map_dbl(dups, nrow), 
         dup_rows = map(dups, ~ flatten_int(.x) %>% 
                                     chr_dups))
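
The other option mentioned above, extracting the column with pull, might look like the sketch below. It assumes that a comma-separated string is the desired result and uses `toString()` in place of 'chr_dups'; the name `res` is only for illustration.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

res <- df %>%
  nest(dups = episode) %>%
  mutate(dup_num  = map_int(dups, nrow),
         # pull() returns the atomic 'episode' vector from each nested tibble,
         # so no list-to-character coercion is needed
         dup_rows = map_chr(dups, ~ toString(pull(.x, episode))))
```

Because `toString()` returns a single string per group, `map_chr()` can be used and 'dup_rows' becomes a plain character column rather than a list column.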

NOTE: It is not clear why the function 'chr_dups' is applied to the 'episode' column, which is numeric. The transformations it performs do not make sense for that column either.


If we just need to paste the elements of 'episode' grouped by the other columns, a base R single line approach is

aggregate(episode~ day + name, df, toString)
#   day  name episode
#1  Fri Curly   3, 18
#2  Mon Curly  13, 28
#3  Wed Curly   8, 23
#4  Fri extra  15, 30
#5  Mon extra  10, 25
#6  Wed extra   5, 20
#7  Fri Larry  12, 27
#8  Mon Larry   7, 22
#9  Wed Larry   2, 17
#10 Fri   Moe   6, 21
#11 Mon   Moe   1, 16
#12 Wed   Moe  11, 26
#13 Fri Shemp   9, 24
#14 Mon Shemp   4, 19
#15 Wed Shemp  14, 29
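
If the count of duplicates is wanted as well, one possible base R extension (a sketch, not part of the one-liner above) is to aggregate twice and merge the results. The intermediate names `rows`, `counts` and `res` are illustrative.

```r
df <- data.frame(episode = 1:30,
                 day = rep(c("Mon", "Wed", "Fri"), 10),
                 name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

# Episode numbers per day/name combination, collapsed to one string
rows <- aggregate(episode ~ day + name, df, toString)
names(rows)[3] <- "dup_rows"

# Number of entries per day/name combination
counts <- aggregate(episode ~ day + name, df, length)
names(counts)[3] <- "dup_num"

# Join the two summaries on the grouping columns
res <- merge(counts, rows, by = c("day", "name"))
```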
akrun
  • Thanks! I was using chr_dups to list the values in the cell so I can display them with a Shiny app or write to xls later. Elegant baseR solution! – M. Wood Sep 20 '19 at 12:56

I think the source of the warning has already been addressed. I'll add that you can do this without mapping, using just vectorised functions.

library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c(
               "Moe", "Larry", "Curly", "Shemp", "extra"
             ), 6))

df %>%
  group_by(day, name) %>%
  summarise(
    dup_num = n(),
    dup_rows = str_c(episode, collapse = ", ")
  )
#> # A tibble: 15 x 4
#> # Groups:   day [3]
#>    day   name  dup_num dup_rows
#>    <chr> <chr>   <int> <chr>   
#>  1 Fri   Curly       2 3, 18   
#>  2 Fri   extra       2 15, 30  
#>  3 Fri   Larry       2 12, 27  
#>  4 Fri   Moe         2 6, 21   
#>  5 Fri   Shemp       2 9, 24   
#>  6 Mon   Curly       2 13, 28  
#>  7 Mon   extra       2 10, 25  
#>  8 Mon   Larry       2 7, 22   
#>  9 Mon   Moe         2 1, 16   
#> 10 Mon   Shemp       2 4, 19   
#> 11 Wed   Curly       2 8, 23   
#> 12 Wed   extra       2 5, 20   
#> 13 Wed   Larry       2 2, 17   
#> 14 Wed   Moe         2 11, 26  
#> 15 Wed   Shemp       2 14, 29

Created on 2019-09-19 by the reprex package (v0.3.0)

Calum You
  • Much simpler approach that does not require purrr! My real dataset is ~50k entries of similar length. I can load just `dplyr` and `stringr` and save myself some overhead for rendering the table in `DT` & `Shiny` later. – M. Wood Sep 20 '19 at 13:02

Just adding to the other answers: you don't need purrr to achieve what you want. Base R's sapply will do for the mapping step.

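df <- df %>%
  # nest the episode numbers of each day/name combination into a list column
  nest(dups = episode) %>%
  mutate(dup_num = sapply(dups, nrow),
         # collapse each group's episode numbers into one string
         dup_rows = sapply(dups, function(x) paste0(x$episode, collapse = ",")))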
slava-kohut