1

I have 2 data frames that I would like to merge. In df1observation were recorded on 2 different days. Each record has an index , id1 person identification number and id2 refers the number of the day that recording was made (day had to be different ).There is also a Day variable that records the week day when the recording was made.

In df2 observations were recorded just based on serial number and id1 person identification number. There is just one observation per person. Similarly here there is also a Day variable that records when the recordings were made.

I would like to identify the observations from df2 that were recorded on the same day as in df1.

How can I do this?

Sample data

df1:

    structure(list(index = c(11011202, 11011202, 11011202, 11011202, 
11011203, 11011203, 11011207, 11011207, 11011207, 11011207, 11011209, 
11011209, 11011209, 11011209, 11011210, 11011210, 11011210, 11011210, 
11011211, 11011211, 11011211, 11011211, 11011212, 11011212, 11011212, 
11011212, 11011212, 11011212, 11011212, 11011212, 11011213, 11011213, 
11011213, 11011213, 11011213, 11011213, 11011217, 11011217, 11011219, 
11011219, 11011220, 11011220, 11011220, 11011220, 11011220, 11011220, 
11020202, 11020202, 11020202, 11020202), id1 = c(1, 1, 4, 4, 
1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 
2, 3, 3, 4, 4, 1, 1, 3, 3, 4, 4, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 
1, 1, 2, 2), id2 = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), Day = c(5, 1, 5, 
1, 1, 3, 4, 7, 4, 7, 4, 1, 4, 1, 5, 7, 5, 7, 1, 2, 1, 2, 7, 2, 
7, 2, 7, 2, 7, 2, 7, 4, 7, 4, 7, 4, 4, 1, 3, 1, 1, 2, 1, 2, 1, 
2, 4, 7, 4, 7)), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

df2:

    structure(list(Day = c(3, 3, 4, 6, 6, 6, 7, 7, 7, 7, 4, 4, 6, 
6, 6, 4, 3, 7, 7, 5, 5, 7, 5, 6, 6), index = c(11011209, 11011209, 
11011210, 11011212, 11011212, 11011213, 11011213, 11011220, 11011220, 
11020208, 11020212, 11020212, 11020301, 11020301, 11020301, 11020305, 
11020310, 11020315, 11020315, 11020316, 11020316, 11020320, 11020606, 
11020611, 11020611), id1 = c(1, 2, 2, 1, 2, 1, 4, 1, 2, 2, 1, 
2, 1, 2, 3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2)), row.names = c(NA, 
-25L), class = c("tbl_df", "tbl", "data.frame"))
Rstudent
  • 887
  • 4
  • 12
  • Hi there, all of your "outcome" is wholly present in `df1`. Is this a merging question or a subset question? Can you clarify? – Ian Campbell Apr 10 '20 at 16:26
  • @Ian Campbell I tried to merge the 2 data frames. I want to identify the observations from df2 that are recorded in df1. The difference between the observation is that in df1 with the same index and id1 2 different observations were made but on different date. Many thanks – Rstudent Apr 10 '20 at 16:31
  • I edited my answer based on your updated question. – Ian Campbell Apr 11 '20 at 14:18

2 Answers2

2

Both the datasets have duplicate by variables, resulting in duplication of rows after the join. One option is to nest by those variables, and then do a join

library(dplyr)
library(tidyr)
df2 %>%
      group_by(Day, index) %>%
      nest %>%
      left_join(df1 %>% 
                   rename(idnew = id1)) %>% 
      unnest(data)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • many thanks I updated my data sample - basically I would like to merge df2 (25 observation) with df1 (50 observation) aiming for 25 observations. But I don't know why after merging I receive 50 obs. – Rstudent Apr 10 '20 at 20:30
  • 1
    @user11964604 i checked your `dput`, you have both datasets have duplicates for 'Day', 'index', resulting in duplication after the merge Check `count(df2, Day, index)` and `count(df1, Day, index)` – akrun Apr 10 '20 at 20:40
  • 1
    @user11964604 i updated the post. Please check if that helps – akrun Apr 10 '20 at 20:46
  • i receive an error suggesting Error in rename(., idnew = id1) : unused argument (idnew = id1) > – Rstudent Apr 10 '20 at 20:56
  • 1
    @user11964604 it could be that `rename` is masked from other package function of the same name. can you chhange it to `dplyr::rename(idnew = id1))` – akrun Apr 10 '20 at 20:58
  • @user11964604 iff you can update with an expected output, it would be easier to crosscheck – akrun Apr 11 '20 at 19:34
  • apologise and many thanks for your continuous help I am literally stuck on https://stackoverflow.com/questions/61187493/identify-consecutive-sequences-based-on-a-given-variable question - I was thinking to aggregate the data on day level, but still I don't know how to set a criteria for a sequence to start with a specific day. If you have time could you please help me. Many thanks – Rstudent Apr 13 '20 at 17:13
1

Edit: Based on your clarifications, I think this is what you wanted. I switched to using tidyverse since your data is stored in tibbles.

library(dplyr)
inner_join(df1,df2,by=c("id1","Day"),suffix=c(".df1",".df2"))
#   index.df1   id1   id2   Day index.df2
#       <dbl> <dbl> <dbl> <dbl>     <dbl>
# 1  11011202     1     1     5  11020316
# 2  11011202     1     1     5  11020606
# 3  11011203     1     2     3  11011209
# 4  11011203     1     2     3  11020310
# 5  11011207     1     1     4  11020212
# 6  11011207     1     1     4  11020305
# 7  11011207     1     2     7  11011220
# 8  11011207     1     2     7  11020315
# 9  11011207     2     1     4  11011210
#10  11011207     2     1     4  11020212
## … with 43 more rows
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • i tried this it reduces the number of observations. For example in my case df1 has 2942 observations and df2 16533 observations. After merging I receive 169 observations and not 2942. Where is my mistake? – Rstudent Apr 10 '20 at 16:42
  • @user11964604 If that's not what you want, I would say the mistake is in your question. Your question shows `df1` with 6 rows, `df2` with 2 rows, and an output with 2 rows. This answer produces an output with 2 rows, just like your question asks for... if that's not what you want, please edit your question to make it clearer what you do want. – Gregor Thomas Apr 10 '20 at 16:53
  • Hi My suggestion : (1).You can Check your Day column like remove extra space by ```gsub()``` or letter small or capital. (2). convert the day column in character (3). convert Index column to numeric ... then try it ```merge()``` with ```all.x=T``` – Tushar Lad Apr 10 '20 at 17:12
  • @Ian Campbell many thanks I updated the sample data with df1 (50obs) and df2 (25obs). I would like to have 25 (obs) but it gives me 50. – Rstudent Apr 10 '20 at 20:32
  • @GregorThomas I updated the sample data: basically I would like to have a df3 with 25 observation but I don't know how to merge the data frames. – Rstudent Apr 10 '20 at 20:32
  • @Ian Campbell if you have time I updated my question – Rstudent Apr 11 '20 at 11:01
  • @GregorThomas if you have time I updated my question – Rstudent Apr 11 '20 at 11:01