1

I have a data frame with 1,000,000 rows. I would like to calculate mean and variance of Tor overtime for each SID to see if I can predict when Tor is starting to go out of limits. The Low limit is 0.4 and the high limit is 0.7. Below is a small example of my data.

dat <- structure(list(timestamp = c("29-06-2021-06:00", "29-06-2021-06:01", 
"29-06-2021-06:02", "29-06-2021-06:03", "29-06-2021-06:04", "29-06-2021-06:05", 
"29-06-2021-06:06", "29-06-2021-06:07", "29-06-2021-06:08", "29-06-2021-06:09", 
"29-06-2021-06:10", "29-06-2021-06:11", "29-06-2021-06:12", "29-06-2021-06:13", 
"29-06-2021-06:14", "29-06-2021-06:15", "29-06-2021-06:16", "29-06-2021-06:17", 
"29-06-2021-06:18", "29-06-2021-06:19", "29-06-2021-06:20", "29-06-2021-06:21", 
"29-06-2021-06:22", "29-06-2021-06:23", "29-06-2021-06:24", "29-06-2021-06:25", 
"29-06-2021-06:26"), SID = c(301L, 351L, 304L, 357L, 358L, 302L, 
303L, 309L, 356L, 304L, 308L, 351L, 304L, 357L, 358L, 302L, 303L, 
352L, 307L, 353L, 304L, 308L, 352L, 307L, 304L, 354L, 356L), 
    Tor = c(0.70161919, 0.639416295, 0.288282073, 0.932362166, 
    0.368616626, 0.42175565, 0.409735918, 0.942170196, 0.381396521, 
    0.818102394, 0.659391671, 0.246387978, 0.196001777, 0.632630259, 
    0.66618385, 0.440625167, 0.639759498, 0.050001835, 0.775660271, 
    0.762934189, 0.516830196, 0.244674975, 0.38620466, 0.970792903, 
    0.752674581, 0.190366737, 0.56596405), Lowt = c(0L, 0L, 1L, 
    0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 
    0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L), Hit = c(1L, 0L, 0L, 
    1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-27L))
head(dat)
#         timestamp SID       Tor Lowt Hit
#1 29-06-2021-06:00 301 0.7016192    0   1
#2 29-06-2021-06:01 351 0.6394163    0   0
#3 29-06-2021-06:02 304 0.2882821    1   0
#4 29-06-2021-06:03 357 0.9323622    0   1
#5 29-06-2021-06:04 358 0.3686166    1   0
#6 29-06-2021-06:05 302 0.4217556    0   0
  • Timestamp is when sample is recorded

  • SID is the ID of the part taking the reading. These values can be 301 - 310 and 351 to 360

  • Tor is the actual reading, and its data type is <dbl>.

  • Lowt is a binary variable showing that the Tor reading is below the lower limit.

  • Hit is a binary variable showing that the Tor reading is below the upper limit.

I have read up about variance but I can't seem to get my head around it. Any help would be great.

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
thewal
  • 75
  • 6

1 Answers1

2

This is a very good question. You want to compute cumulative mean and cumulative variance of Tor (over time) per SID. Given the volume of your actual dataset, it is appropriate to use online algorithms. See my answer and Benjamin's answer on this topic back in 2018 for algorithmic details. In brief, my contribution is:

cummean <- function (x) cumsum(x) / seq_along(x)

cumvar <- function (x, sd = FALSE) {
  x <- x - x[sample.int(length(x), 1)]
  n <- seq_along(x)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}

The extra work required here, is to apply these functions per SID.

## sort data entries
sorted_dat <- dat[order(dat$SID, dat$timestamp), ]

## split Tor by SID
lst <- split(sorted_dat$Tor, sorted_dat$SID)

## apply cummean() and cumvar()
runmean <- unlist(lapply(lst, cummean), use.names = FALSE)
runvar <- unlist(lapply(lst, cumvar), use.names = FALSE)

## add back
sorted_dat$runmean <- runmean
sorted_dat$runvar <- runvar

Here is the result. Don't be surprised by the NaN in variance. The first value is always NaN in each SID. This is just normal (we can only compute variance when there are 2+ data).

## inspection
sorted_dat
#          timestamp SID        Tor Lowt Hit    runmean       runvar
#1  29-06-2021-06:00 301 0.70161919    0   1 0.70161919          NaN
#6  29-06-2021-06:05 302 0.42175565    0   0 0.42175565          NaN
#16 29-06-2021-06:15 302 0.44062517    0   0 0.43119041 0.0001780293
#7  29-06-2021-06:06 303 0.40973592    1   0 0.40973592          NaN
#17 29-06-2021-06:16 303 0.63975950    0   0 0.52474771 0.0264554237
#3  29-06-2021-06:02 304 0.28828207    1   0 0.28828207          NaN
#10 29-06-2021-06:09 304 0.81810239    0   1 0.55319223 0.1403547863
#13 29-06-2021-06:12 304 0.19600178    1   0 0.43412875 0.1127057339
#21 29-06-2021-06:20 304 0.51683020    0   0 0.45480411 0.0768470383
#25 29-06-2021-06:24 304 0.75267458    0   1 0.51437820 0.0753806422
#19 29-06-2021-06:18 307 0.77566027    0   1 0.77566027          NaN
#24 29-06-2021-06:23 307 0.97079290    0   1 0.87322659 0.0190383720
#11 29-06-2021-06:10 308 0.65939167    0   0 0.65939167          NaN
#22 29-06-2021-06:21 308 0.24467497    1   0 0.45203332 0.0859949690
#8  29-06-2021-06:07 309 0.94217020    0   1 0.94217020          NaN
#2  29-06-2021-06:01 351 0.63941629    0   0 0.63941629          NaN
#12 29-06-2021-06:11 351 0.24638798    1   0 0.44290214 0.0772356290
#18 29-06-2021-06:17 352 0.05000184    1   0 0.05000184          NaN
#23 29-06-2021-06:22 352 0.38620466    1   0 0.21810325 0.0565161698
#20 29-06-2021-06:19 353 0.76293419    0   1 0.76293419          NaN
#26 29-06-2021-06:25 354 0.19036674    1   0 0.19036674          NaN
#9  29-06-2021-06:08 356 0.38139652    1   0 0.38139652          NaN
#27 29-06-2021-06:26 356 0.56596405    0   0 0.47368029 0.0170325864
#4  29-06-2021-06:03 357 0.93236217    0   1 0.93236217          NaN
#14 29-06-2021-06:13 357 0.63263026    0   0 0.78249621 0.0449196080
#5  29-06-2021-06:04 358 0.36861663    1   0 0.36861663          NaN
#15 29-06-2021-06:14 358 0.66618385    0   0 0.51740024 0.0442731264
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • https://stackoverflow.com/users/4891738/zheyuan-li Is there away to get rid of the NaN. In my one only the first is NaN. The rest have values – thewal Jun 30 '22 at 18:50
  • @thewal The first value for each SID will be NaN. This happens because the variance of a single sample is `NA` or `NaN`. Try `var(rnorm(1))`. If you don't like it, you can replace them by 0 with `sorted_dat$runvar[is.na(sorted_dat$runvar)] <- 0`. – Zheyuan Li Jun 30 '22 at 19:04