Use start and end time to extract values from separate dataframe

Question

I have worked through a lot of issues in stats and database issues in R and am a new user, but I have not figured out loops and programming yet. I am stuck with this one and just can't figure it out. I have a dataframe that has a start date and an end date and a value TP (total phosphorus).

begin = c("2015/11/16 17:45",   "2015/11/17 17:45", "2015/11/18 17:45", "2015/11/19 17:45", "2015/11/20 17:45", "2015/11/21 17:45")
end = c("2015/11/17 17:45", "2015/11/18 17:45", "2015/11/19 17:45", "2015/11/20 17:45", "2015/11/21 17:45", "2015/11/22 17:45")
bottle =    c(1,    2,  3,  4,  5,  6)
tp =    c(10,   200,    100,    73, 38, 50)
data=data.frame(begin, end, bottle, tp)

I need to multiply TP by the sum of values from another data frame based on the begin and end date for each of the rows. The other data frame is in a shortened version below....

set.seed(1)
time = seq.POSIXt(from=as.POSIXct('2015-11-15',tz=''),
              to=as.POSIXct('2016-11-25',tz=''),
              by = as.difftime(0.25,units="hours"))
level_m= runif(n = length(time), min = .01, max = .06)
time <- as.data.frame(time)
level_m <- as.data.frame(level_m)
# as dataframe
water_level <- cbind(time, level_m)

I would like to get this to do it for each of the rows of the first data frame called data. What I have tried is using a database approach where the two data frames are merged and aggregated but its not really satisfying as the last value would sum too much data not just the range I need... and it is doing more than I really need.

rng <- cut(water_levle$time, 
       breaks=c(data$begin, max(data$end)),
                include.lowest=T)
test <- aggregate(cbind(count=1, sum=water_level$level_m)~rng, FUN=sum)

I hope I have done this right as it is the first question I have asked here. Thanks, Bill

Chris P · Accepted Answer · 2016-03-29T05:30:15.067

1

Good first post, you made it easy to reproduce your data.

I think the following is what you are looking for. It sums the m_level in the date range specified, and multiplied t by tp. the level.sum is an intermediary, I left it in the dataframe for clarity.

for( n in 1:nrow(data)){
#Defined here for clarify in the subset
  begin_d <- as.Date(data[n,'begin'])
  end_d <- as.Date(data[n,'end'])
  time_d <- as.Date(water_level$time)

  data[n,'level.sum'] <- sum(water_level[time_d>=begin_d & time_d<=end_d, "level_m"])
  data[n,'tp.mult.sum']<- (data[n,"tp"]*data[n,"level.sum"])
}

I edited it not to use subset(), because it is intended for interactive use only.

General note: Selecting elements of a dataframe using code like dataframe['selection criteria','column name'] is a useful tool.

I ran it instead of the final code block you had, and it seemed to run correctly.

edited Mar 29 '16 at 05:30

answered Mar 28 '16 at 21:47

Chris P

225
2
10

1

it might be cleaner to use `subset`. – bouncyball Mar 28 '16 at 21:57
1

@bouncyball You are right, I edited it to use subset, also defined some variables so the subset criteria looks like less of a mess. Thanks. – Chris P Mar 28 '16 at 22:30
Wow, I have to find a course to learn how to do this coding. Currently I do mostly stats, graphing and database merging and manipulation but this is the direction I am moving. The way I was doing it was very cumbersome and did not really work. I am going to play around with this to understand it better but it totally worked. Thank you Chris P. Cheers, Bill – Bill Perry Mar 29 '16 at 01:10
I still am stunned how simple that was. The only thing I modified was to use as.POSIXct rather than as.Date. As an excel user switch over, I ran it back through excel to check and as.Date loses the time interval which is key in this case. I can modify this in so many ways to use on other examples Thank you again, Bill – Bill Perry Mar 29 '16 at 01:35
1

You should heed the *Warning* section in `?subset` and avoid using it non-interactively. Using `subset` is no "cleaner" than using regular subsetting in this case, and really provides no benefit since you're not subsetting by columns in `water_level`. All the "cleanliness" comes from defining variables outside the subset call: `sum(water_level[time_d>=begin_d & time_d<=end_d, "level_m"])`. – Joshua Ulrich Mar 29 '16 at 03:11
@BillPerry, note I edited this not to use subset, based on Joshua Ulrich 's advice (I _really_ need to pay more attention to the documentation, thanks for the comment.) Good to know about the as.POSIXct vs as.Date, it didn't occur to me to preserve the time interval. This looks like BIology related stuff, in another life, I earned my degree in Biology... good luck with the coding. – Chris P Mar 29 '16 at 05:37
A great alternative to above and I was not aware of the subset warning. I am starting to learn sapply, lapply, mapply. This is another great method with a different approach. From a learning perspective it is cool to see how these work. I added a few commands to fix this to do what I need:
('data$sum.level <- mapply(function(begin, end, level)
sum(level[begin:end]),
from,
to,
MoreArgs = list(level = water_level$level))
data$tp.load <- data$sum.level * data$tp')

sorry for the multiple edits - trying to do this right – Bill Perry Mar 29 '16 at 13:04
@ChrisP Thanks and yes this is Biology or Biogeocheistry. I am trying to save my mind and a grad students. We are looking at nutrients flowing into drinking water reservoirs and how our experiments reduce nutrients. Cheers and thanks, Bill – Bill Perry Mar 29 '16 at 13:06

score 0 · Answer 2 · answered Mar 29 '16 at 10:04

Instead of locating water_level$time in data$begin, doing the opposite, will give the location in water_level$time that each data$begin (and data$end) appeared:

#convert to date-time
data$begin = as.POSIXct(data$begin)
data$end = as.POSIXct(data$end)

from = findInterval(data$begin, water_level$time)
to = findInterval(data$end, water_level$time)
from
#[1] 168 264 360 456 552 648
to
#[1] 264 360 456 552 648 744
#e.g. `water_level$time[168:264]` is ` >= data$begin[1]` and ` <= data$end[1]` etc

And -assuming that nrow(data) and/or to - from are not very large- we could simply loop over each from:to to find the respective sum of water_level$level:

mapply(function(begin, end, level) sum(level[begin:end]), 
       from,
       to, 
       MoreArgs = list(level = water_level$level))
#[1] 3.495112 3.182623 3.419295 3.466763 3.504436 3.450281

#and multiply
#... * data$tp

A great alternative to above and I was not aware of the warning for subset command. I am starting to learn sapply, lapply and mapply and am just beginning with this. This is another great method with a different approach. From a learning perspective it is cool to see how these work. I added a few commands to fix this to do what I need: ('data$sum.level <- mapply(function(begin, end, level) sum(level[begin:end]), from, to, MoreArgs = list(level = water_level$level)) data$tp.load <- data$sum.level * data$tp') — Bill Perry, Mar 29 '16 at 12:56

Use start and end time to extract values from separate dataframe

2 Answers2