0

Thank you in advance for answering this question, and I will accept tips on improving my question asking on here as this is my first time!

I have am pulling data into an r-Notebook from our SQL Server into a dataframe which is attached to clients who are receiving quarterly assessments, which have been happening since Q4 2015 up until Q2 2018. The problem is, assessments aren't always conducted, therefore there are gaps the data. For example, I currently would be creating a dataframe that looks like so:

client name | assessment date | assessment value
client 1    | 2015 Q4         | Green
client 1    | 2018 Q1         | Green
client 2    | 2015 Q4         | Yellow
client 2    | 2016 Q2         | Green
client 2    | 2016 Q4         | Green
client 2    | 2017 Q1         | Yellow

Basically I need each client name to have a record for each of the quarters between Q4 2015 and Q2 2018. I will be assuming that is an assessment hasn't been made, then the assessment value from the previous assessment will be the default. The dataframe should end up looking as such:

client name | assessment date | assessment value
client 1    | 2015 Q4         | Green
client 1    | 2016 Q1         | Green
client 1    | 2016 Q2         | Green
client 1    | 2016 Q3         | Green
client 1    | 2016 Q4         | Green
client 1    | 2017 Q1         | Green
client 1    | 2017 Q2         | Green
client 1    | 2017 Q3         | Green
client 1    | 2017 Q4         | Green
client 1    | 2018 Q1         | Green
client 1    | 2018 Q2         | Green
client 2    | 2015 Q4         | Yellow
client 2    | 2016 Q1         | Yellow
client 2    | 2016 Q2         | Green
client 2    | 2016 Q3         | Green
client 2    | 2016 Q4         | Green
client 2    | 2017 Q1         | Yellow
client 2    | 2017 Q2         | Yellow
client 2    | 2017 Q3         | Yellow
client 2    | 2017 Q4         | Yellow
client 2    | 2018 Q1         | Yellow
client 2    | 2018 Q2         | Yellow

Thanks!

  • 1
    Use `tidyr::expand` to create rows for all possible combinations and then use one of the many "last observation carried forward" (LOCF) solutions on this site to update the missing values. – MrFlick Jun 20 '18 at 20:46
  • I'd recommend, also, that you consider just using `NA` for the missing values, and only use a LOCF solution when you need to do something with the data that requires that kind of solution. – De Novo Jun 20 '18 at 21:06

1 Answers1

0

Per @MrFlick's suggestion, I wanted to give it a shot because I haven't used expand before.

library(tidyr)
library(dplyr)
library(zoo)

df <- data.table::fread("client name | assessment date | assessment value
                        client 1    | 2015 Q4         | Green
                        client 1    | 2018 Q1         | Green
                        client 2    | 2015 Q4         | Yellow
                        client 2    | 2016 Q2         | Green
                        client 2    | 2016 Q4         | Green
                        client 2    | 2017 Q1         | Yellow")

df <- df %>% 
  mutate(qtr = as.yearqtr(`assessment date`))

df2 <- expand(df,  client = `client name`,
                   qtr = seq(min(qtr), max(qtr), by = 0.25)) %>%
  arrange(client, qtr)

df2 %>% 
  mutate(qtr = as.character(qtr)) %>%
  left_join(df %>% mutate(qtr = as.character(qtr)),
            by = c('client' = 'client name', 'qtr' = 'qtr')) %>%
  group_by(client) %>%
  fill(`assessment value`) %>%
  select(-`assessment date`)

# A tibble: 20 x 3
# Groups:   client [2]
   client   qtr     `assessment value`
   <chr>    <chr>   <chr>             
 1 client 1 2015 Q4 Green             
 2 client 1 2016 Q1 Green             
 3 client 1 2016 Q2 Green             
 4 client 1 2016 Q3 Green             
 5 client 1 2016 Q4 Green             
 6 client 1 2017 Q1 Green             
 7 client 1 2017 Q2 Green             
 8 client 1 2017 Q3 Green             
 9 client 1 2017 Q4 Green             
10 client 1 2018 Q1 Green             
11 client 2 2015 Q4 Yellow            
12 client 2 2016 Q1 Yellow            
13 client 2 2016 Q2 Green             
14 client 2 2016 Q3 Green             
15 client 2 2016 Q4 Green             
16 client 2 2017 Q1 Yellow            
17 client 2 2017 Q2 Yellow            
18 client 2 2017 Q3 Yellow            
19 client 2 2017 Q4 Yellow            
20 client 2 2018 Q1 Yellow     

I had to finagle with the yearqtr type and convert it to character to retain all the info during the join. There's likely a much cleaner way to do this, but hope it points you in the right direction.

zack
  • 5,205
  • 1
  • 19
  • 25