I’m looking for a way to employ a lookup algorithm on a dataframe that, for a given element, examines corresponding variables within a range and returns the max such variable. The general gist is that I want the function to (1) look at a given element, (2) find all other elements of the same name, (3) among all elements of the same name, see if a corresponding variable is within +- X of any others, and (4) if so, return the max of those; if not, just return whatever that variable is.
A concrete example is with some time stamp data. Say I have orders for 2 businesses that are classified by date, hour, and minute. I want to look at daily orders, but the problem is that if orders come within 2 minutes of each other, they’re double-counted, so I only want to look at the max value in such cases.
*EDIT: I should say that if orders are logged consecutively within a couple minutes of each other, we assume they're duplicated and only want the max value. So if 4 orders came in, each a minute apart, but then there were no other orders +2 minutes from the last and -2 from the first, we can assume that group of 4 should only be counted once, and it should be the max value that's counted
Here's some data:
data <- structure(list(date = structure(c(16090, 16090, 16090, 16090,
16090, 16090, 16090, 16090, 16090, 16090, 16090, 16090, 16091,
16091, 16091, 16091, 16091, 16091, 16091), class = "Date"), company = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("ABCo", "Zyco"), class = "factor"), hour = c(5L,
5L, 5L, 7L, 7L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 6L, 6L, 6L, 7L, 7L,
7L, 8L), minute = c(21L, 22L, 50L, 13L, 20L, 34L, 47L, 34L, 35L,
20L, 44L, 19L, 14L, 16L, 37L, 24L, 26L, 49L, 50L), orders = c(59L,
46L, 31L, 15L, 86L, 23L, 8L, 71L, 86L, 44L, 23L, 47L, 6L, 53L,
21L, 54L, 73L, 63L, 4L)), .Names = c("date", "company", "hour",
"minute", "orders"), row.names = c(NA, -19L), class = "data.frame")
What I care about here is, for each company, on a given date, within a given hour, if there are any entries that fall +- 2 minutes of each other, I want to take the max value for "orders". If a given entry doesn't have anything within +- 2 minutes of it, then just keep the "orders" value given. (In this case, the first two rows of "data", ABCo on 2014-01-20 at hour=5, since minute 21 and 22 are within +-2 of eachother, we'd return the max value for orders, so 59. The third row, ABCo on 1-20 at hour=5 and minute = 50 has no other minute +-2 from it, so we'd just keep the value for orders, 31)
A starting point to look at the data for minutes and orders in terms of company+date+hour could be to concatenate these 3 terms together and reorganize the data frame:
data$biztime <- do.call(paste, c(data[c("company","date","hour")], sep = "_"))
data2 <- ddply(data, .(biztime, minute), summarise, orders = sum(orders))
But from here I'm lost. Is there any easy way to add another column to this dataframe using an ifelse statement or something else along these lines that does the sort of conditional operation above?