Finding duplicate observations of selected variables in a tibble

Question

I have a rather large tibble (called df.tbl with ~ 26k rows and 22 columns) and I want to find the "twins" of each object, i.e. each row that has the same values in column 2:7 (date:Pos).

If I use:

inner_join(df.tbl, ~ df.tbl[i,], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))

with i being the row I want to check for "twins", everything is working as expected, spitting out a 2 x 22 tibble, and I can expand this using:

x <- NULL
for (i in 1:nrow(df.tbl)) {
x[[i]] <- as_vector(inner_join(df.tbl[,], 
                        df.tbl[i,], 
                        by = c("date", 
                               "forge", 
                               "serNum", 
                               "PinMain", 
                               "PinMainNumber", 
                               "Pos")) %>% 
               select(rowNum.x) 
}

to create a list containing the row numbers for each twin for each object (row).

I cannot, however I try, use map to produce a similar result:

twins <- map(df.tbl, ~ inner_join(df.tbl, 
                                     ., 
                                     by = c("date", 
                                            "forge", 
                                            "serNum", 
                                            "PinMain", 
                                            "PinMainNumber", 
                                            "Pos")) %>% 
         select(rowNum.x) )

All I get is the following error:

Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"

How would I go about to convert the for loop into an equivalent using map?

My original data look like this:

>head(df.tbl, 3)
# A tibble: 3 x 22
  rowNum date       forge serNum PinMain PinMainNumber Pos   FrontBack flow  Sharped SV    OP      max   min  mean
   <dbl> <date>     <chr> <fct>  <fct>   <fct>         <fct> <fct>     <chr> <fct>   <fct> <chr> <dbl> <dbl> <dbl>
1      1 2017-10-18 NA    179    Pin     1             W     F         NA    3       36237 235    77.7  55.3  64.7
2      2 2017-10-18 NA    179    Pin     2             W     F         NA    3       36237 235    77.5  52.1  67.4
3      3 2017-10-18 NA    179    Pin     3             W     F         NA    3       36237 235    79.5  58.6  69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>, dataSize <int>,
#   fileName <chr>

and I would like a list with a length the same as nrow(df.tbl) looking like this:

> twins
[[1]]
[1] 1 7

[[2]]
[1] 2 8

[[3]]
[1] 3 9

Almost all objects have one twin / duplicate (as above) but a few have two or even three duplicates (as defined above, i.e. column 2:7 are the same)

Could you give a sample of your data and show the desired output? It would make it easier for others to help. — Joe, Nov 29 '18 at 21:55
I have added the requested information. I hope it is more clear now. — Daniel Bäckström, Nov 30 '18 at 07:53

Bertil Baron · Answer 1 · 2018-11-30T12:38:24.380

0

do you really need to solve it with map?

I would solve through combining duplicated and semi_join from the package dplyr like this

defining_columns <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")

dplyr::semi_join(
  df.tbl,
  df.tbl[duplicated(df.tbl[defining_columns]),],
  by = defining_columns
) %>% 
  group_by_at(defining_columns) %>% 
  arrange(.by_group = TRUE) %>%
  summarise(twins = paste0(rowNum,collapse = ",")) %>% 
  pull(twins) %>% 
  strsplit(",")

the duplicated gives us which rows are duplicated and the semi_join only keeps rows in x that are present in y

Hope this helps!!

edited Nov 30 '18 at 12:38

answered Nov 29 '18 at 10:54

Bertil Baron

4,923
1
15
24

Thanks and it does help a lot. Not all the way, though. If I understand your solution correct, I could do an anti_join() on the full vs the duplicated data set, to get a subset that should match the duplicates row by row, hence get the list containing matching "twin" row numbers I am after (row 1 in set 1 map to row 1 in set 2 etc.) However, the anti_join() results in a slightly smaller data set, so a few rows in the original has more than one duplicate... – Daniel Bäckström Nov 29 '18 at 12:29
From your example I thought you just had one data set, and you want to get all rows that have duplicates in a certein subset of the columns - thats correct,isn't it? I think you could replace the semi_join with an inner_join in my example and it should give the same answer. An anti_join is the opposite - that would give you the rows that have *no* duplicates – Bertil Baron Nov 29 '18 at 12:55
You thought right, I do only have one data set. I want to get, for each row in my data set, the rows that we call twins / duplicates so I can add that as a column in my data frame. Most (almost all) rows have only one duplicate but a few have more (3-4). Have I misunderstood your proposed method or does it not assume that each row has one twin, no less and no more? – Daniel Bäckström Nov 29 '18 at 13:16
No, my method should also find duplicates with more than one duplicate. the function duplicated scans through a data set from top or bottom dependent of the parameters set. and marks each row that is identical to a previous row as duplicated. With the semi_ or inner_join we then make sure to get also the first version of every row. – Bertil Baron Nov 29 '18 at 13:24
I have added information about how my data look like and how I want my result to look like. Perhaps it will make it more clear as to why I feel your proposed solution is good and interesting, but not going all the way. – Daniel Bäckström Nov 30 '18 at 10:07
Hi, I have updated my answer to give the return format you wanted. I tested it on the iris data set and it seems to work – Bertil Baron Nov 30 '18 at 11:06
I can't get the code to work. When I copy-paste it and run the Iris data set, I get the same error at the same position as when I run my own data on your code - *"Evaluation error: cannot coerce type 'closure' to vector of type 'character'."* at the summarise line. – Daniel Bäckström Nov 30 '18 at 12:17
sorry I thought you had a column named id. It should be rowNum instead. I have corrected my example – Bertil Baron Nov 30 '18 at 12:39
Thanks. You have not only helped me but shown me a few cute tricks. Thanks a bunch for all the help! How do I mark this question as answered? – Daniel Bäckström Nov 30 '18 at 12:53

Joe · Accepted Answer · 2018-12-03T12:35:43.577

0

A bit late to the party, but you can do it much more neatly with nest().

tbl.df1 <- tbl.df %>% group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>% nest(rowNum)

The twins will be in the list of tibbles created by nest.

tbl.df1$data

# [[1]]
# A tibble: 2 x 1
#   rowNum
#    <dbl>
# 1      1
# 2      7

#[[2]]
# A tibble: 2 x 1
#   rowNum
#    <dbl>
# 1      2
# 2      8

# etc

edited Dec 03 '18 at 12:35

answered Dec 02 '18 at 11:27

Joe

8,073
1
52
58

Nothing wrong with adding your input when you have something tasty to add to the pot. It is indeed a very neat way of solving the issue (and very, very fast too). It also has the additional benefit of solving my problem using a different approach. Much appreciated! Thanks! – Daniel Bäckström Dec 03 '18 at 12:23

Finding duplicate observations of selected variables in a tibble

2 Answers2

Linked