1

i have a question how to select certain values from a table. I have a table with times and values and i want to get the row below and after a certain time.

Example-Data.Frame.

Time   Value
02:51  0.08033405 
05:30  0.43456738 
09:45  0.36052075 
14:02  0.45013807 
18:55  0.05745870
....# and so on

Time is coded as character, but can be formatted. Now i have for example the time "6:15" and want to get the values of the time before and after this time from the table (0.43456738 and 0.36052075). The database is in fact quite huge and i have a lot of time values. Anyone has a nice suggestion how to accomplish this?

thanks Curlew

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Curlew
  • 1,022
  • 18
  • 39
  • If, as you mention in your question, you're getting this data.frame from a database, the most efficient way to get those numbers would be at the database level with a carefully constructed query. – Justin Jun 08 '12 at 15:56
  • @Justin Using a NoSQL database do you mean? – Matt Dowle Jun 08 '12 at 16:21
  • sql or nosql writing a function or stored procedure that takes a list of times and returns the values nearest those times is probably the most efficient, rather than reading the entire time list into R and doing the munging. Huge is relative, but if the op really means huge, potentially the data will be much to big for R to hold in memory. – Justin Jun 08 '12 at 16:27
  • As Justin mentioned huge is indeed relative and i would prefer to do it in r. It is not a everyday routine procedure and i only need to calculate it once. Therefore time shouldn't be this much a problem. – Curlew Jun 08 '12 at 16:34
  • Huge data (as @Justin points out) is very subjective. So it would be helpful if you gave us some idea of the magnitude. – Maiasaura Jun 08 '12 at 16:47
  • @Justin Yes you're right. I wasn't including the time to load the entire database into RAM, and if it's too big or not for that. Generally you'd do that once (if possible) and then research on it in RAM, say 128GB RAM server. Iff that's possible then that's fastest, if you're making many queries on the same data. But not just for a one off query. – Matt Dowle Jun 08 '12 at 16:48
  • Have you looked at the `lead()` and `lag()` functions from `dplyr`? – naught101 Oct 02 '14 at 05:09

1 Answers1

1
value_before <- example_df[which(example_df$time=="09:45")-1, ]$value
value_after <- example_df[which(example_df$time=="09:45")+1, ]$value

# This could become a function

return_values <- function(df,cutoff) {
value_before <- df[which(df$time==cutoff)-1, ]$value
value_after <- df[which(df$time==cutoff)+1, ]$value
return(list(value_before, value_after))
}

return_values(exmaple_df, "09:15")

# A solution for a large dataset.

library(data.table)
df <- data.frame(time = 1:1000000, value = rnorm(1000000))
# create a couple of offsets
df$nvalue <- c(df$value[2:dim(df)[1]],NA)
df$pvalue <- c(NA,df$value[2:dim(df)[1]])
new_df <- data.table(df)
setkey(new_df,"time")

new_df[time==10]
 time      value     pvalue     nvalue
[1,]   10 -0.8488881 -0.1281219 -0.5741059


> new_df[time==1234]
     time      value   pvalue     nvalue
[1,] 1234 -0.3045015 0.708884 -0.5049194
Maiasaura
  • 32,226
  • 27
  • 104
  • 108
  • -1 because OP said his data is huge. This answer will be very slow, and reinvents the wheel of several packages that do this op. – Matt Dowle Jun 08 '12 at 16:18
  • tested it and it doesn't work like this. "which(df$time==cutoff)" assumes that the exact time is in the database which is not the case. – Curlew Jun 08 '12 at 16:30
  • @Curlew `xts` and `data.table` both have binary search which is what you want for this. You're looking up a missing value in an ordered key and you have large data => `xts` or `data.table`, afaik. Search for LOCF (last observation carried forward). In `data.table` you need `roll=TRUE` and `which=TRUE` to get the prevailing row, then +1 for the row afterwards. Beware I'm biased as I wrote `data.table`. – Matt Dowle Jun 08 '12 at 16:30
  • I've added a solution that will work with very large datasets. You don't have to do an exact match (I just used that since your example was a `chr`. In the current case, you can test any condition. – Maiasaura Jun 08 '12 at 16:44
  • @MatthewDowle I love `data.table` and use it all the time. Thanks for writing it. – Maiasaura Jun 08 '12 at 16:48
  • Great. But the new solution vector scans (doesn't use binary search), theres no `roll=TRUE` and it doesn't answer the OPs question afaics. – Matt Dowle Jun 08 '12 at 16:53
  • Then why don't you leave the peanut gallery and post a solution? – Maiasaura Jun 08 '12 at 16:54
  • @Maiasaura Since OP didn't know either `xts` or `data.table` I think it's enough to point to them, which I did. The difference between vector scan and binary search is fully explained in `vignette("datatable-intro")`. – Matt Dowle Jun 08 '12 at 18:59
  • thanks to you both! I'll develop the code you provided a litte bit more to fit it to my needs. Package xts sounds also very interesting. – Curlew Jun 08 '12 at 22:25