I have below-mentioned two dataframe in R, and I have tried various method but couldn't achieve the required output yet.
DF:
ID Date city code uid
I-1 2020-01-01 10:12:15 New York 123 K-1
I-1 2020-01-01 10:12:15 Utha 103 K-1
I-2 2020-01-02 10:12:15 Washington 122 K-1
I-3 2020-02-01 10:12:15 Tokyo 123 K-2
I-3 2020-02-01 10:12:15 Osaka 193 K-2
I-4 2020-02-02 10:12:15 London 144 K-3
I-5 2020-02-04 10:12:15 Dubai 101 K-4
I-6 2019-11-01 10:12:15 Dubai 101 K-4
I-7 2019-11-01 10:12:15 London 144 K-3
I-8 2018-12-13 10:12:15 Tokyo 143 K-5
I-9 2019-05-17 10:12:15 Dubai 101 K-4
I-19 2020-03-11 10:12:15 Dubai 150 K-7
Dput:
structure(list(ID = c("I-1", "I-1",
"I-2", "I-3", "I-3", "I-4",
"I-5", "I-6", "I-7", "I-8", "I-9","I-19"
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607",
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952",
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536",
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York",
"UTAH", "Washington", "Tokyo",
"Osaka", "London", "Dubai",
"Dubai", "London", "Tokyo", "Dubai",
"Dubai"), CODE = c("221010",
"411017", "638007", "583101", "560029", "643102", "363001", "452001",
"560024", "509208"), UID = c("K-1",
"K-1", "K-1", "K-2", "K-2",
"K-3", "K-4", "K-4", "K-3",
"K-5","K-4","K-7")), .Names = c("ID", "DATE",
"CITY", "CODE", "UID"), row.names = c(NA,
10L), class = "data.fram)
Using the above-mentioned two dataframe, I want to fetch records between 1st Jan 2020 to 29th Feb 2002 and compare those ID in entire database to check whether both city and code together match with other ID and categorize it further to check how many have the same uid and how many have different.
Where,
- Match - combination of city and code match with other ID in database
- Same_uid - classification of Match ids to identify how many ID have similar uid
- different_uid - classification of Match ids to identify how many ID doesn't have similar uid
- uid_count - count of similar uid of that particular ID in entire database
Note - I have more than 10M records in the dataframe.
Required Output
ID Date city code uid Match Same_uid different_uid uid_count
I-1 2020-01-01 10:12:15 New York 123 K-1 No 0 0 2
I-2 2020-01-02 10:12:15 Washington 122 K-1 No 0 0 2
I-3 2020-02-01 10:12:15 Tokyo 123 K-2 No 0 0 1
I-4 2020-02-02 10:12:15 London 144 K-3 Yes 1 0 2
I-5 2020-02-04 10:12:15 Dubai 101 K-4 Yes 2 0 3