
I have a table that looks like so (1 example - total of 2 million rows):

tweet_id | id         | group    | created_at                     | tweet                                             | response_tweet_id
1        | sprintcare | Support  | Tue Oct 31 22:10:47 +0000 2017 | @115712 I understand. I would like to assist you. | 2
2        | 115712     | Customer | Tue Oct 31 22:11:45 +0000 2017 | @sprintcare and how do you propose we do that?    | NA
3        | 115712     | Customer | Tue Oct 31 22:08:27 +0000 2017 | @sprintcare I have sent several private messages. | 1
4        | sprintcare | Support  | Tue Oct 31 21:54:49 +0000 2017 | @115712 Please send us a Private Message.         | 3
5        | 115712     | Customer | Tue Oct 31 21:49:35 +0000 2017 | @sprintcare I did.                                | 4
6        | sprintcare | Support  | Tue Oct 31 21:46:24 +0000 2017 | @115712 Can you please send us a private message. | 5,7
8        | 115712     | Customer | Tue Oct 31 21:45:10 +0000 2017 | @sprintcare is the worst customer service         | 9,6,10
11       | apple      | Support  | Tue Oct 31 22:10:35 +0000 2017 | @115713 This is saddening to hear.                | NA
12       | 115713     | Customer | Tue Oct 31 22:04:47 +0000 2017 | @apple My phone is not working. Help!             | 11

Looking at the first row, tweet_id 1 is a support tweet, and its response_tweet_id (labeled 2) identifies the tweet that came in response to it.

Ideally, I would like to calculate how long it takes for each response tweet to come in, ordered from support to customer.

I would like to calculate two values:

  1. the response time for each Support tweet, using tweet_id, response_tweet_id, and created_at. For the first row, the time difference would be: Tue Oct 31 22:10:47 +0000 2017 − Tue Oct 31 22:08:27 +0000 2017 = 00:02:20.

  2. the total response time from the first tweet a customer makes to Support until the last tweet in that conversation. In the example above, this is the span from the first tweet made to Support until the last tweet before the next unique id comes into play.

Desired output would look like this for each group (there are 100+ unique companies):

id         | group   | Avg_response_time_per_tweet (in minutes) | Total_avg_response_time (in minutes)
sprintcare | Support | ####                                     | ####
apple      | Support | ####                                     | ####
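For reference, here is a minimal sketch of value 1, assuming the table above is in a data frame `df` with character columns named as shown: each Customer tweet is joined to the Support tweet that answers it via response_tweet_id, and the difference in created_at gives the per-tweet response time. Multi-valued cells like "9,6,10" are split first; the format string relies on `%z` parsing, which works on most platforms.

```r
library(dplyr)
library(tidyr)

fmt <- "%a %b %d %H:%M:%S %z %Y"

per_tweet <- df %>%
  mutate(created_at = as.POSIXct(created_at, format = fmt, tz = "UTC")) %>%
  filter(group == "Customer", !is.na(response_tweet_id)) %>%
  separate_rows(response_tweet_id, sep = ",") %>%          # split cells like "9,6,10"
  mutate(response_tweet_id = as.integer(response_tweet_id)) %>%
  inner_join(
    df %>%
      transmute(tweet_id,
                responded_at = as.POSIXct(created_at, format = fmt, tz = "UTC")),
    by = c("response_tweet_id" = "tweet_id")
  ) %>%
  mutate(response_time = difftime(responded_at, created_at, units = "mins"))
```

For tweet_id 3 this yields 22:10:47 − 22:08:27, i.e. the 00:02:20 from point 1; averaging response_time per id and group then gives the first output column.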
Dinho
    When you say average response time, how do you want to calculate it? Is it based on the difference in created_at for sprintcare with respect to the 115712 values, taking the average over each occurrence of sprintcare? – akrun Sep 15 '21 at 20:03
    Okay, what is considered the average response time? Is it the time from the first tweet to the second? Or from the first tweet to the last tweet within a group? – akash87 Sep 15 '21 at 20:06
    It is not entirely clear about the expected. Maybe `df1 %>% mutate(created_at = as.POSIXct(created_at, format = '%a %b %d %H:%M:%S +0000 %Y')) %>% group_by(group2 = cumsum(group == "Support")) %>% summarise(id = first(id), group = first(group), avg = mean(difftime(first(created_at), created_at[-1], units = "sec")))` – akrun Sep 15 '21 at 20:10
  • Sorry - I updated the desired output based on both your questions. – Dinho Sep 15 '21 at 20:13

2 Answers

library(lubridate)
library(dplyr)
library(tidyr)
df %>% 
    # split the datetime string into its components
    separate(created_at, c("Day_name", "Month", "Day", "Hour", "Minute", "Second", "X", "Year")) %>% 
    type.convert(as.is = TRUE) %>% 
    # map the month abbreviation ("Oct") to its month number (10)
    mutate(Month = match(Month, month.abb)) %>% 
    # rebuild a proper datetime and drop the helper columns
    mutate(created_at = make_datetime(Year, Month, Day, Hour, Minute, Second), .keep = "unused") %>% 
    group_by(id, group) %>% 
    summarise(Avg_response_time = mean(difftime(max(created_at), min(created_at))))
  id         group    Avg_response_time
  <chr>      <chr>    <drtn>           
1 115712     Customer 26.58333 mins    
2 sprintcare Support  24.38333 mins  
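As a side note, the separate()/make_datetime() round trip can likely be replaced by a single parse, since created_at follows one fixed strptime format; a sketch of the same summary, assuming the same `df` and that `%z` parsing is available on the platform:

```r
library(dplyr)

df %>%
  mutate(created_at = as.POSIXct(created_at,
                                 format = "%a %b %d %H:%M:%S %z %Y",
                                 tz = "UTC")) %>%
  group_by(id, group) %>%
  summarise(Avg_response_time = mean(difftime(max(created_at), min(created_at))),
            .groups = "drop")
```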
TarJae
  • thank you for responding - this is helpful - I think my question was a little vague before so I added more datapoints to make it more clear to calculate the time respective to each company. – Dinho Sep 15 '21 at 23:57

It is not a particularly performant approach, but it should yield the desired results.

library(dplyr)
library(stringr)

dat %>% 
  # recover the conversation key: Support tweets mention the customer id
  mutate(id_at = str_extract(tweet, "\\d+"),
         global_id = ifelse(is.na(id_at), id, id_at)) %>% 
  group_by(global_id) %>% 
  group_modify(~ .x %>%
                 rowwise() %>% 
                 mutate(response_time = if (group == "Support") {
                   # latest Customer tweet that precedes this Support tweet
                   prev <- filter(.x, created_at < .env$created_at, group == "Customer") %>% 
                     slice_max(created_at, n = 1) %>% 
                     pull(created_at)
                   if (length(prev) == 0) NA else prev  # guard: no earlier Customer tweet
                 } else NA)
  ) %>% 
  summarise(id = first(id),
            Total_avg_response_time = difftime(max(created_at), min(created_at)),
            Avg_response_time_per_tweet = mean(difftime(created_at, response_time), na.rm = TRUE))

#> # A tibble: 1 x 4
#>   global_id id         Total_avg_response_time Avg_response_time_per_tweet
#>   <chr>     <chr>      <drtn>                  <drtn>                     
#> 1 115712    sprintcare 26.58333 mins           2.933333 mins

# data used
dat <- read.table(text = "tweet_id   id             group                  created_at            tweet
                  1           sprintcare    Support      'Tue Oct 31 22:10:47 +0000 2017'   '@115712 I understand. I would like to assist you.'
                  2           115712        Customer     'Tue Oct 31 22:11:45 +0000 2017'   '@sprintcare and how do you propose we do that'
                  3           115712        Customer     'Tue Oct 31 22:08:27 +0000 2017'   '@sprintcare I have sent several private messages and no one is responding as usual'
                  4           sprintcare    Support      'Tue Oct 31 21:54:49 +0000 2017'   '@115712 Please send us a Private Message so that we can further assist you.'
                  5           115712        Customer     'Tue Oct 31 21:49:35 +0000 2017'   '@sprintcare I did.'
                  6           sprintcare    Support      'Tue Oct 31 21:46:24 +0000 2017'   '@115712 Can you please send us a private message, so that I can gain further details about your account?'
                  7           115712        Customer     'Tue Oct 31 21:45:10 +0000 2017'   '@sprintcare is the worst customer service'",
                  header = TRUE)

dat <- dat %>% 
  # as.POSIXct() because strptime() returns POSIXlt, which tibbles do not support
  mutate(created_at = as.POSIXct(strptime(created_at, "%a %b %d %H:%M:%S %z %Y")))

Created on 2021-09-15 by the reprex package (v0.3.0)
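For 2 million rows, the rowwise scan above could probably be replaced by a single vectorized pass: sort each conversation by time, carry the most recent Customer timestamp forward with cummax(), and take the difference on Support rows. A sketch reusing the global_id construction from the answer above; the -Inf sentinel marks "no Customer tweet seen yet":

```r
library(dplyr)
library(stringr)

dat %>%
  mutate(created_at = as.POSIXct(created_at),
         id_at = str_extract(tweet, "\\d+"),
         global_id = ifelse(is.na(id_at), id, id_at)) %>%
  group_by(global_id) %>%
  arrange(created_at, .by_group = TRUE) %>%
  mutate(
    # running maximum of Customer timestamps seen so far in this conversation
    last_cust = cummax(if_else(group == "Customer", as.numeric(created_at), -Inf)),
    resp_mins = if_else(group == "Support" & is.finite(last_cust),
                        (as.numeric(created_at) - last_cust) / 60,
                        NA_real_)
  ) %>%
  summarise(id = first(id),
            Avg_response_time_per_tweet = mean(resp_mins, na.rm = TRUE),
            Total_avg_response_time = as.numeric(difftime(max(created_at),
                                                          min(created_at),
                                                          units = "mins")))
```

On the example data this matches the 2.933333-minute per-tweet average shown above.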

TimTeaFan
  • Thank you! I added more context to the data to make it easier to compute @TimTeaFan. – Dinho Sep 15 '21 at 23:56
  • @Dinho: My approach is not really performant for 2 million rows, but it should yield the results you're after, even for your updated data. Try it on a subset of your data to verify. – TimTeaFan Sep 16 '21 at 10:07
  • @TimTeaFan - I did run your code on my dataset - sample_n(15000) - and I got this error. Error: Problem with `mutate()` column `response_time`. i `response_time = if (...) NULL`. i `response_time` must be size 1, not 0. i Did you mean: `response_time = list(if (...) NULL)` ? i The error occurred in row 1. – Dinho Sep 16 '21 at 16:16
  • @Dinho: does my example above work for you? If not then it might be an R version / package version problem. – TimTeaFan Sep 16 '21 at 16:31