0

Let's say I have a data.frame called countDF:

> countDF date count complete 1 20180124 16 FALSE 2 20180123 24 TRUE 3 20180122 24 TRUE 4 20180121 24 TRUE 5 20180120 23 FALSE 6 20180119 23 FALSE 7 20180118 24 TRUE

Which looks like this under the hood:

> dput(countDF)
structure(list(date = c("20180124", "20180123", "20180122", "20180121", 
"20180120", "20180119", "20180118"), count = c(16L, 24L, 24L, 
24L, 23L, 23L, 24L), complete = c(FALSE, TRUE, TRUE, TRUE, FALSE, 
FALSE, TRUE)), class = "data.frame", row.names = c(NA, -7L), .Names = c("date", 
"count", "complete"))

And this list:

> last7D_missingHours
$`20180124`
[1]  3 17 18 19 20 21 22 23

$`20180120`
[1] 18

$`20180119`
[1] 7

Which looks like this:

> dput(last7D_missingHours)
structure(list(`20180124` = c(3L, 17L, 18L, 19L, 20L, 21L, 22L, 
23L), `20180120` = 18L, `20180119` = 7L), .Names = c("20180124", 
"20180120", "20180119"))

I would like to make a data.frame (or, perhaps data_frame) that joins the latter to the former with left_join(countDF, last7D_missingHours, by = c('date' = names(last7D_missingHours))) and have NA in the date rows that do not match, like this:

> countDF date count complete missingHour 1 20180124 16 FALSE 3 17 18 19 20 21 22 23 2 20180123 24 TRUE NA 3 20180122 24 TRUE NA 4 20180121 24 TRUE NA 5 20180120 23 FALSE 18 6 20180119 23 FALSE 7 7 20180118 24 TRUE NA

I could probably hack through this with recursive subsetting I'm guessing, but wanted to see if anyone had any suggestions for a more optimal approach, since I know tibbles have come quite a long way recently...

d8aninja
  • 3,233
  • 4
  • 36
  • 60

1 Answers1

1

Make the missing hours into a list-column in a tibble with the other variable as the dates, and then just left_join.


library(tidyverse)

countDF <- structure(list(date = c("20180124", "20180123", "20180122", "20180121", 
                                   "20180120", "20180119", "20180118"), 
                          count = c(16L, 24L, 24L, 24L, 23L, 23L, 24L), 
                          complete = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)), 
                     class = "data.frame", row.names = c(NA, -7L), .Names = c("date", "count", "complete"))

last7D_missingHours <- structure(list(`20180124` = c(3L, 17L, 18L, 19L, 20L, 21L, 22L, 
                                                     23L), `20180120` = 18L, `20180119` = 7L), .Names = c("20180124", 
                                                                                                          "20180120", "20180119"))

lst_tbl <- tibble(date = c("20180124", "20180120", "20180119"),
                  missingHour = last7D_missingHours)

left_join(countDF, lst_tbl)
#> Joining, by = "date"
#>       date count complete                   missingHour
#> 1 20180124    16    FALSE 3, 17, 18, 19, 20, 21, 22, 23
#> 2 20180123    24     TRUE                          NULL
#> 3 20180122    24     TRUE                          NULL
#> 4 20180121    24     TRUE                          NULL
#> 5 20180120    23    FALSE                            18
#> 6 20180119    23    FALSE                             7
#> 7 20180118    24     TRUE                          NULL

I end up with NULL rather than NA, which I think makes more sense, so I didn't try to change them just to get what you asked for.

ngm
  • 2,539
  • 8
  • 18
  • Works as advertised, and I think you're right about the `NULL`s, too. Thank you! I'll see if anyone else has any thoughts but this will likely do it. – d8aninja Jan 24 '18 at 20:32
  • You can simplify `date = c("20180124", "20180120", "20180119")` to `date = names(last7D_missingHours)` which is more scaleable. – Mako212 Jan 24 '18 at 21:08