6

I have below-mentioned two dataframe in R, and I have tried various method but couldn't achieve the required output yet.

DF:

ID     Date                 city        code    uid
I-1    2020-01-01 10:12:15  New York     123    K-1
I-1    2020-01-01 10:12:15  Utha         103    K-1
I-2    2020-01-02 10:12:15  Washington   122    K-1
I-3    2020-02-01 10:12:15  Tokyo        123    K-2
I-3    2020-02-01 10:12:15  Osaka        193    K-2
I-4    2020-02-02 10:12:15  London       144    K-3
I-5    2020-02-04 10:12:15  Dubai        101    K-4
I-6    2019-11-01 10:12:15  Dubai        101    K-4
I-7    2019-11-01 10:12:15  London       144    K-3
I-8    2018-12-13 10:12:15  Tokyo        143    K-5
I-9    2019-05-17 10:12:15  Dubai        101    K-4
I-19   2020-03-11 10:12:15  Dubai        150    K-7

Dput:

structure(list(ID = c("I-1", "I-1", 
"I-2", "I-3", "I-3", "I-4", 
"I-5", "I-6", "I-7", "I-8", "I-9","I-19" 
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607", 
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952", 
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536", 
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York", 
"UTAH", "Washington", "Tokyo", 
"Osaka", "London", "Dubai", 
"Dubai", "London", "Tokyo", "Dubai", 
"Dubai"), CODE = c("221010", 
"411017", "638007", "583101", "560029", "643102", "363001", "452001", 
"560024", "509208"), UID = c("K-1", 
"K-1", "K-1", "K-2", "K-2", 
"K-3", "K-4", "K-4", "K-3", 
"K-5","K-4","K-7")), .Names = c("ID", "DATE", 
"CITY", "CODE", "UID"), row.names = c(NA, 
10L), class = "data.fram)

Using the above-mentioned two dataframe, I want to fetch records between 1st Jan 2020 to 29th Feb 2002 and compare those ID in entire database to check whether both city and code together match with other ID and categorize it further to check how many have the same uid and how many have different.

Where,

  • Match - combination of city and code match with other ID in database
  • Same_uid - classification of Match ids to identify how many ID have similar uid
  • different_uid - classification of Match ids to identify how many ID doesn't have similar uid
  • uid_count - count of similar uid of that particular ID in entire database

Note - I have more than 10M records in the dataframe.

Required Output

ID      Date                  city         code   uid   Match   Same_uid   different_uid  uid_count
I-1     2020-01-01 10:12:15   New York     123    K-1    No      0          0              2
I-2     2020-01-02 10:12:15   Washington   122    K-1    No      0          0              2
I-3     2020-02-01 10:12:15   Tokyo        123    K-2    No      0          0              1   
I-4     2020-02-02 10:12:15   London       144    K-3    Yes     1          0              2
I-5     2020-02-04 10:12:15   Dubai        101    K-4    Yes     2          0              3 
Sophia Wilson
  • 581
  • 3
  • 16

1 Answers1

1

An approach,

Load in the dataset

    library(tidyverse)
    library(lubridate)

   

   mydata <- tibble(
   ID = c("I-1","I-1",
          "I-2","I-3",
          "I-3","I-4",
          "I-5","I-6",
          "I-7","I-8",
          "I-9","I-19"),
   Date = c("2020-01-01", "2020-01-01",
            "2020-01-02", "2020-02-01",
            "2020-02-01", "2020-02-02",
            "2020-02-04", "2019-11-01", 
            "2019-11-01", "2018-12-13", 
            "2019-05-17", "2020-03-11"),
   city = c("New York", "Utha", 
            "Washington", "Tokyo", 
            "Osaka", "London", 
            "Dubai", "Dubai", 
            "London", "Tokyo", 
            "Dubai", "Dubai"),
   code = c("123", "103", "122", "123", "193, "144",
            "101", "101", "144", "143", "101", "150"),
   uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
           "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))

   mydata <- mydata %>% 
     mutate(Date = ymd(str_remove(Date, " .*")),
            code = as.character(code))

Where clause number 1

I use count from dplyr to count the codes by cities. Then case_when to further identify with a "Yes" or "No" as requested.

# This counts city and code, and fullfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <-   "2020-03-29"
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%   
  mutate(Match = case_when(
   count_samecode > 1 ~ "Yes",
                    T ~ "No")) %>%
  head()
# # A tibble: 6 x 4
#  city     code  count_samecode Match
# <chr>    <chr>          <int> <chr>
# 1 Dubai    101                3 Yes  
# 2 Dubai    150                1 No   
# 3 London   144                2 Yes  
# 4 New York 123                1 No   
# 5 Osaka    193                1 No   
# 6 Tokyo    123                1 No  

Where clause number 2

I will do the same with UID

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()

# # A tibble: 6 x 3
# city     uid   `UIDs_#_filtered`
# <chr>    <chr>             <int>
# 1 Dubai    K-4                   3
# 2 Dubai    K-7                   1
# 3 London   K-3                   2
# 4 New York K-1                   1
# 5 Osaka    K-2                   1
# 6 Tokyo    K-2                   1

Where clause number 3

I can repeat the count of clause number 2 to find how many of these cities have a different UID, where > 1 signals a different UID.

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>% 
  count(city, name = "UIDs_#_different") %>% 
  head()
# # A tibble: 6 x 2
# city     `UIDs_#_different`
# <chr>                 <int>
# 1 Dubai                     2
# 2 London                    1
# 3 New York                  1
# 4 Osaka                     1
# 5 Tokyo                     2
# 6 Utha                      1

Where clause number 4

Taking the same code from #2, I can eliminate the filter to find the entire dataset

mydata %>% 
  count(city, uid, name = "UIDs_#_all") %>% 
  head()

Putting it all together

Using several left_join's we can get closer to your desired output. EDIT: Now will bring the first instance of the ID from the first City / Code combination

check_duplicates_filterview.f <- function( df, startdate, enddate ){
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string   "yyyy-mm-dd"
  cityfilter <- df %>% filter(Date >= startdate,
                              Date <= enddate) %>% distinct(city) %>% pull(1)
  df <- df %>% 
    filter(city %in% cityfilter) %>% 
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  entire.db.countcodes <- df %>%              # Finds count of code in entire DB
    count(city, code) 
  where.1 <- df %>% filter(Date >= startdate, 
                     Date <= enddate) %>% 
    distinct(city, code, .keep_all = T) %>%
    left_join(entire.db.countcodes)  %>% 
    rename("count_samecode" = n) %>% 
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      T ~ "No"))
  
  where.2 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    distinct(city, uid) %>% 
    count(city, name = "UIDs_#_distinct") 
  where.4 <- df %>% 
    filter(city %in% cityfilter) %>% 
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}


# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city     code  count_samecode Match uid   `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr>    <chr>          <int> <chr> <chr>             <int>        <int>             <int>
# 1 Dubai    101                2 Yes   K-4                   2            3                 1
# 2 London   144                1 No    K-3                   1            2                 1
# 3 New York 123                1 No    K-1                   1            1                 1
# 4 Tokyo    143                1 No    K-5                   1            1                 1
# 5 Utha     103                1 No    K-1                   1            1                 1
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • `ID` column is not coming. – Sophia Wilson Aug 05 '20 at 07:51
  • Now I am confused,can you explain the `ID` column a bit more clearly before I can go on. In your example output - you have chosen certain `ID`s over others, even though the output appears to be a summary output, and there are several IDs for some of these cities. The same occurs with `Date`. If it is simply the most recent `ID` that is interesting to you, I will update my answer as that. – Bryan Wammack Aug 05 '20 at 14:15
  • For Example, suppose We have data set of 5 year (2015 to 2020) and we want to match the `ID` from Mar-20 to May-20 with entire dataset. – Sophia Wilson Aug 05 '20 at 14:50
  • Okay, I have it so that it matches the `code` to the entire dataset. This is because that is how your example currently shows the `code` to be the one that is matched. There are not any repeat `ID`s in the example above. Please see the edits to my OP. – Bryan Wammack Aug 05 '20 at 19:28