
I have a data frame of 50 columns by 2.5 million rows in R, representing a time series. The time column is of class POSIXct. For analysis, I repeatedly need to find the state of the system for a given class at a particular time.

My current approach is the following (simplified and reproducible):

set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N),N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time=.time,
  distance1=sort(sample(1:(100*N),N)),
  distance2=sort(sample(1:(100*N),N)),
  letter=sample(letters,N,replace=TRUE)
)

# state search function
time.state <- function(df,searchtime,searchclass){
  # find all rows in between the searchtime and a while (here 10k seconds)
  # before that
  rows <- which(findInterval(df$time,c(searchtime-10000,searchtime))==1)
  # find the latest state of the given class within the search interval
  return(rev(rows)[match(T,rev(df[rows,"letter"]==searchclass))])
}  

# evaluate the function to retrieve the latest known state of the system
# at time 500,000.
df[time.state(df,500000,"a"),]

However, the call to which is very costly. Alternatively, I could filter by class first and then search by time, but that doesn't change the evaluation time much. According to Rprof, which and == account for the majority of the time.
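
For reference, a minimal profiling sketch that reproduces this picture (the file name and repetition count here are arbitrary illustrative choices, not from the original measurement):

Rprof("time.state.out")
for(i in 1:100) time.state(df,500000,"a")
Rprof(NULL)
# self time per function; which and == dominate
summaryRprof("time.state.out")$by.self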

Is there a more efficient solution? The time points are sorted in weakly increasing order.
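
For concreteness, a sketch of one way the sortedness could be exploited (this variant is an illustration, not part of the original post, and its window boundaries differ slightly from time.state at the endpoints): swapping the arguments to findInterval makes it binary-search the sorted time vector instead of scanning it.

time.state2 <- function(df,searchtime,searchclass){
  # binary-search the sorted time vector: two O(log n) lookups
  # instead of a full O(n) scan
  idx <- findInterval(c(searchtime-10000,searchtime),df$time)
  if(idx[1]==idx[2]) return(NA)  # no rows in the search window
  # rows with time in (searchtime-10000, searchtime]
  rows <- (idx[1]+1):idx[2]
  # latest row of the requested class within the window
  rev(rows)[match(TRUE,rev(df$letter[rows]==searchclass))]
}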

roelandvanbeek
  • I think this is pretty efficient already. The complexity of which and == is linear. – Aditya Sihag Jan 31 '13 at 15:27
  • With 1M rows and 1000 unique `letter` values this takes only a few ms on my system. Why do you need to optimize it? – Roland Jan 31 '13 at 15:36
  • The call to findInterval takes about 70 ms for 2.5M rows (linear in the number of rows). I have to call this function 100k-1M times, which makes any optimization very welcome. If you say there is none, I'll start looking for workarounds. – roelandvanbeek Feb 01 '13 at 10:26
  • I managed to find a workaround that works quite well; see my answer. – roelandvanbeek Feb 01 '13 at 12:16

1 Answer


Because which, == and [ are all linear in the size of the data frame, the solution is to generate subset data frames for bulk operations, as follows:

# function that applies time.state to a series of time/class combinations;
# assumes `times` is sorted in increasing order so subsets can be reused
time.states <- function(df,times,classes,day.length=24){
  result <- vector("list",length(times))
  day.end <- 0
  for(i in seq_along(times)){
    if(times[i] > day.end){
      # create a subset covering 1 hour before to day.length hours after
      day.begin <- times[i]-60*60
      day.end <- times[i]+day.length*60*60
      df.subset <- df[findInterval(df$time,c(day.begin,day.end))==1,]
    }
    # save the resulting row from the subset
    result[[i]] <- df.subset[time.state(df.subset,times[i],classes[i]),]
  }
  return(do.call("rbind",result))
}
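
For illustration, a hypothetical bulk call could look like this (the search times below are made up and must be sorted for the subset reuse to work):

# hypothetical usage: sorted search times with matching classes
searchtimes <- sort(sample(1:(100*N),1000))
searchclasses <- sample(letters,1000,replace=TRUE)
states <- time.states(df,searchtimes,searchclasses)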

With dT = diff(range(df$time)) expressed in hours and dT/day.length large, this reduces the evaluation time by a factor of roughly dT/(day.length+1): each call now scans a window of day.length+1 hours instead of the full span dT.
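
As a rough sanity check of that factor (a sketch, with both quantities in seconds and day.length left at its default of 24):

# approximate speed-up: full time span vs. subset window
dT <- as.numeric(diff(range(df$time)),units="secs")
window <- (24+1)*60*60  # 1 h before plus day.length=24 h after
dT/window               # approximate speed-up factor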

roelandvanbeek