
This may be a basic question, but I've been searching through many lists to find what I am looking for. Basically, let me describe the data set. I have a set of data from a ticketing system that represents resolved tickets. I'm looking at resolve time of tickets over time to see if my resolution time is going up, going down or leveling off. There are many tickets each day (somewhere around 200 or so). I pull all the data each day and calculate the time in days it took for the ticket to be resolved. I am doing this with R, so I end up with a data frame that looks like:

1/1/2015  INC00001   1.23
1/1/2015  INC00002   .089

and so forth (many tickets each day with a resolution time for each). What I am trying to do is determine some sort of running average for this. What I'd really like is a line plot over time that shows the trend. I've played around with moving averages and such, but my chart is still very choppy. I'm sure there is something built into R that gives a moving / running / cumulative average over time, but I'm still unable to find exactly what I am looking for.

The chart I'd like to see would resemble something like this:

[Chart1]

But smoother, sort of like a stock ticker so I kind of know what the overall average is leading up to the current day. Can any one point me in a direction of what this would be called and how one would go about doing this in R? Thanks so much!!


1 Answer

This can be done quickly and easily with dplyr. Here's some example data:

library(dplyr)

x <- 1:3000
y <- 10 + 1:3000 / 300 + 2 * sin(pi * x / 150) + rnorm(3000, sd = 0.1)

df <- data.frame(x = x, y = y)

Here's a plot of the data. And here's how you can use dplyr to create a rolling average and a cumulative average:

df.avgs <-
   df %>%
   mutate(
      # trailing 10-point average: the current value plus the previous nine
      roll.avg = ( y + lag(y, 1) + lag(y, 2) + lag(y, 3) +
                      lag(y, 4) + lag(y, 5) + lag(y, 6) + lag(y, 7) +
                      lag(y, 8) + lag(y, 9) ) / 10,
      # running average of all values up to and including the current row
      cum.avg = cummean(y)
   )
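As an aside (a sketch, not part of the original answer), the same trailing average can be written more compactly in base R with `stats::filter()`, and `cumsum(y) / seq_along(y)` reproduces `cummean(y)` without dplyr:

```r
# Same example data as above
x <- 1:3000
y <- 10 + 1:3000 / 300 + 2 * sin(pi * x / 150) + rnorm(3000, sd = 0.1)
df <- data.frame(x = x, y = y)

df.avgs <- transform(df,
   # sides = 1 averages the current value and the previous 9,
   # matching the lag-based version; the first 9 rows come out NA
   roll.avg = as.numeric(stats::filter(y, rep(1 / 10, 10), sides = 1)),
   # running average of all values up to and including the current row
   cum.avg = cumsum(y) / seq_along(y)
)
```

Note the explicit `stats::` prefix: if dplyr is loaded, plain `filter()` would resolve to `dplyr::filter` instead.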

Here's a plot of the rolling average. Sadly, I don't have enough reputation to add a third link to a plot of the cumulative average. Note that this method of calculating the rolling average is not tolerant of NA values. One NA in the original data will create ten NA values in the rolling average data. You could get around this by first replacing NA values with interpolated values.
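For example (a sketch, not from the original answer), base R's `approx()` can fill isolated NA values by linear interpolation before the rolling average is computed:

```r
# Hypothetical series with one missing value
y <- c(1, 2, NA, 4, 5)
idx <- seq_along(y)

# Linearly interpolate the missing entries from the observed ones
y.filled <- y
y.filled[is.na(y)] <- approx(idx[!is.na(y)], y[!is.na(y)], xout = idx[is.na(y)])$y

y.filled  # c(1, 2, 3, 4, 5)
```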

Cameron Bieganek
  • clbieganek - Thanks for the response. I believe this is exactly what I am looking for. Not to turn this into a math lesson, but what is the difference between a rolling average and a cumulative average? Also, I have around 56k incidents total - with around 200 being created per day. How do you suggest I handle groupings on this data when calculating a rolling/cumulative average so the averages come out roughly per day? Thanks! – azdatasci Nov 20 '15 at 19:56
  • Also - based on the format of the data frame I put in the original post, how would I feed that into your example, say if the column name is dataframe$resolutiontime? Thanks! – azdatasci Nov 20 '15 at 20:04
  • Say you are at row 1010 in your data frame. The rolling average as calculated above gives the average resolution time for rows 1001 through 1010. The cumulative average gives you the average resolution time of rows 1 through 1010. In other words, the cumulative average always gives you the average of all the rows up to the current row. Try typing `cummean(1:10)` into your interpreter (if you have `dplyr` loaded). – Cameron Bieganek Nov 20 '15 at 20:37
  • If you want to get a daily average of the resolution times, I would use dplyr like this: `df %>% group_by(date) %>% summarize(daily.avg = mean(resolutiontime, na.rm = TRUE))`. This approach is robust to NA's, due to the `na.rm = TRUE` argument. Note that the above pipe assumes that your date column above is called "date". In this case, the dates could be either strings or Date objects, or they could be POSIXct/POSIXlt date-time objects if you're careful to ensure that all the time-stamps for a day have the same hours-minutes-seconds (00:00:00 is convenient). – Cameron Bieganek Nov 20 '15 at 20:50
  • clbieganek - Thanks for the feedback. In your first reply to my note on the 20th - the scenario you mentioned where cumulative average would take into account 1 through 1010 rows is what I am looking for. Then the next day, I'd want it to take into account 1 through 1011 rows, etc.. I won't have any NAs in my data - ever, the resolution times and dates are automatically populated by the ticketing system. It sounds like the suggestion above: `df %>% group_by(date) %>% summarize(daily.avg = mean(resolutiontime, na.rm = TRUE))` would do that. Is that correct? – azdatasci Nov 24 '15 at 15:47
  • The pipe `df %>% group_by(date) %>% summarize(daily.avg = mean(resolutiontime, na.rm = TRUE))` doesn't do a cumulative average, it just averages all the ticket resolution times for a given day. In other words, the output will have two columns: one for the date, and one that contains the average resolution time for each date. If you're looking for the cumulative mean, then the `cummean` function as used in the answer should do the trick. – Cameron Bieganek Nov 25 '15 at 19:25
  • clbieganek - Awesome. I'll play around with that and see if I can get it to work. I accepted your answer (clicked on the check to make it green). If I need to do something else, let me know (I'm still new to StackOverflow, so I'm getting my bearings). Thanks! – azdatasci Dec 07 '15 at 22:14
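Putting the comment thread together: a sketch (the column names `date` and `resolutiontime` and the values are made up to match the question's description) of a per-day average followed by the day-over-day cumulative average the asker describes:

```r
library(dplyr)

# Made-up stand-in for the ticketing export: three tickets per day
df <- data.frame(
   date = as.Date("2015-01-01") + rep(0:2, each = 3),
   resolutiontime = c(1.2, 0.9, 1.5, 2.0, 1.1, 0.7, 1.3, 1.8, 0.6)
)

df.daily <- df %>%
   group_by(date) %>%
   summarize(daily.avg = mean(resolutiontime, na.rm = TRUE)) %>%
   mutate(cum.avg = cummean(daily.avg))  # running average of the daily averages
```

Each row of `df.daily` then gives one date, that day's average resolution time, and the average of all daily averages up to and including that day, which is the smooth "stock ticker" style line the question asks for.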