
I’m looking for a way to apply a lookup to a data frame that, for a given element, examines a corresponding variable within a range and returns the max of such variables. The general gist is that I want the function to (1) look at a given element, (2) find all other elements of the same name, (3) among all elements of the same name, see whether a corresponding variable is within +-X of any of the others, and (4) if so, return the max of those values; if not, just return whatever that variable is.

A concrete example is with some time stamp data. Say I have orders for 2 businesses that are classified by date, hour, and minute. I want to look at daily orders, but the problem is that if orders come within 2 minutes of each other, they’re double-counted, so I only want to look at the max value in such cases.

*EDIT: I should say that if orders are logged consecutively within a couple of minutes of each other, we assume they're duplicates and only want the max value. So if 4 orders came in, each a minute apart, with no other orders within +2 minutes of the last or -2 minutes of the first, that group of 4 should be counted only once, and it's the max value that should be counted.

Here's some data:

data <- structure(list(date = structure(c(16090, 16090, 16090, 16090, 
16090, 16090, 16090, 16090, 16090, 16090, 16090, 16090, 16091, 
16091, 16091, 16091, 16091, 16091, 16091), class = "Date"), company = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("ABCo", "Zyco"), class = "factor"), hour = c(5L, 
5L, 5L, 7L, 7L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 6L, 6L, 6L, 7L, 7L, 
7L, 8L), minute = c(21L, 22L, 50L, 13L, 20L, 34L, 47L, 34L, 35L, 
20L, 44L, 19L, 14L, 16L, 37L, 24L, 26L, 49L, 50L), orders = c(59L, 
46L, 31L, 15L, 86L, 23L, 8L, 71L, 86L, 44L, 23L, 47L, 6L, 53L, 
21L, 54L, 73L, 63L, 4L)), .Names = c("date", "company", "hour", 
"minute", "orders"), row.names = c(NA, -19L), class = "data.frame")

What I care about here is: for each company, on a given date, within a given hour, if any entries fall within +-2 minutes of each other, I want to take the max value of "orders". If a given entry has nothing within +-2 minutes of it, then just keep its "orders" value. (In this case, for the first two rows of "data", ABCo on 2014-01-20 at hour 5: since minutes 21 and 22 are within +-2 of each other, we'd return the max value for orders, so 59. The third row, ABCo on 2014-01-20 at hour 5 and minute 50, has no other entry within +-2 minutes of it, so we'd just keep its orders value, 31.)

A starting point to look at the data for minutes and orders in terms of company+date+hour could be to concatenate these 3 terms together and reorganize the data frame:

data$biztime <- do.call(paste, c(data[c("company","date","hour")], sep = "_"))

library(plyr)
data2 <- ddply(data, .(biztime, minute), summarise, orders = sum(orders))

But from here I'm lost. Is there an easy way to add another column to this data frame, using an ifelse statement or something else along these lines, that does the sort of conditional operation described above?

Marc Tulla
  • What if 4 transactions take place, each a minute apart? Would you want the max of all 4 or the max of the first 2 and then the max of the second 2? – rnorberg Feb 01 '14 at 23:37
  • rnorberg, great question -- see my edits. In short, yeah, if orders are consecutive 1-2 minutes apart, we want the max value of that string of orders that's separated by + or - 2 minutes from any others. Does that make sense? – Marc Tulla Feb 01 '14 at 23:54
  • Just a quick follow-up here that had me scratching my head for a while: when working with actual data, where I had to turn character variables into dates before working with them, the whole plyr part of this process stalled because my dates were POSIXlt. It took some hunting, but I found this thread http://stackoverflow.com/questions/14153092/meaning-of-ddply-error-names-attribute-9-must-be-the-same-length-as-the-vec where Hadley mentions in a comment that POSIXct dates should be used in data frames, NOT POSIXlt. – Marc Tulla Feb 03 '14 at 22:34
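For anyone who hits the same snag, a minimal illustration of the conversion (the column names here are made up for the example, not taken from the data above):

# strptime() returns POSIXlt; coerce to POSIXct before putting it in a data frame
data$when <- as.POSIXct(strptime(data$when_chr, format = "%Y-%m-%d %H:%M"))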

2 Answers


Add a column of datetime objects:

# build a proper datetime; coerce the POSIXlt from strptime() to POSIXct,
# since POSIXlt columns don't play well inside data frames
data <- transform(data,
   datetime = as.POSIXct(strptime(sprintf("%s %s:%s", date, hour, minute),
                                  format = "%Y-%m-%d %H:%M")))

Add a column of group indices so that consecutive rows within two minutes of each other share the same index:

# compare gaps in seconds: diff() on datetimes picks its units automatically
# (and silently ignores a units argument), so convert to numeric explicitly.
# Assumes rows are already sorted by datetime within each company, as here.
data <- ddply(data, .(company), transform, timegroup =
              cumsum(c(TRUE, diff(as.numeric(datetime)) > 120)))
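
To see why this works, here is a toy illustration of the diff/cumsum trick on plain minute values (taken from the first three ABCo rows of the sample data; this snippet is just for intuition, not part of the solution):

mins <- c(21, 22, 50)
cumsum(c(TRUE, diff(mins) > 2))
# [1] 1 1 2

A gap of at most 2 keeps extending the current group; a larger gap starts a new one, which is exactly the chained grouping described in the question's edit.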

Finally, summarize:

ddply(data, .(company, timegroup), summarise,
      orders = max(orders),
      datetime = datetime[1])

#    company timegroup orders            datetime
# 1     ABCo         1     59 2014-01-20 05:21:00
# 2     ABCo         2     31 2014-01-20 05:50:00
# 3     ABCo         3     15 2014-01-20 07:13:00
# 4     ABCo         4     86 2014-01-20 07:20:00
# 5     ABCo         5     53 2014-01-21 06:14:00
# 6     ABCo         6     21 2014-01-21 06:37:00
# 7     ABCo         7     73 2014-01-21 07:24:00
# 8     ABCo         8     63 2014-01-21 07:49:00
# 9     ABCo         9      4 2014-01-21 08:50:00
# 10    Zyco         1     23 2014-01-20 05:34:00
# 11    Zyco         2      8 2014-01-20 05:47:00
# 12    Zyco         3     86 2014-01-20 06:34:00
# 13    Zyco         4     44 2014-01-20 07:20:00
# 14    Zyco         5     23 2014-01-20 07:44:00
# 15    Zyco         6     47 2014-01-20 08:19:00
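
Since the stated goal is daily order totals, a possible final step is to sum the de-duplicated values per company and day, along these lines (a sketch, assuming the summarized result above is saved as `res`):

res$date <- as.Date(format(res$datetime, "%Y-%m-%d"))  # via format() to avoid as.Date()'s UTC default on POSIXct
ddply(res, .(company, date), summarise, orders = sum(orders))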
flodel
  • This looks like it could be the fastest, most efficient way... thanks Flodel! – Marc Tulla Feb 02 '14 at 00:20
  • Well, definitely not the fastest if you compare with other tools like `data.table` or `dplyr`, but at least it shows a good approach (use date times, etc.). I used `plyr` because that's what you were already using. Nothing wrong with it if you are happy with the speeds you are getting. – flodel Feb 02 '14 at 00:52
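
For reference, a rough `dplyr` equivalent of the same approach (an untested sketch, assuming `data` already has the `datetime` column built above):

library(dplyr)

data %>%
  arrange(company, datetime) %>%                 # sort so gaps are between consecutive rows
  group_by(company) %>%
  mutate(timegroup = cumsum(c(TRUE, diff(as.numeric(datetime)) > 120))) %>%
  group_by(company, timegroup) %>%
  summarise(orders = max(orders), datetime = first(datetime))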

Unless I've misunderstood something, perhaps this is helpful; it's probably slow, though.

# one group id per company/date/hour combination
data$gr = as.numeric(interaction(data$company, data$date, data$hour))

# for each minute, the max "orders" among entries within +-2 minutes of it
ff = function(mins, ords) {
  unlist(lapply(mins, function(x) max(ords[abs(x - mins) <= 2])))
}

do.call(rbind,
        lapply(split(data, data$gr),
               function(x) transform(x, new_val = ff(x$minute, x$orders))))

#            date company hour minute orders gr new_val
#1.1   2014-01-20    ABCo    5     21     59  1      59
#1.2   2014-01-20    ABCo    5     22     46  1      59
#1.3   2014-01-20    ABCo    5     50     31  1      31
#2.6   2014-01-20    Zyco    5     34     23  2      23
#2.7   2014-01-20    Zyco    5     47      8  2       8
#6.8   2014-01-20    Zyco    6     34     71  6      86
#6.9   2014-01-20    Zyco    6     35     86  6      86
#7.13  2014-01-21    ABCo    6     14      6  7      53
#7.14  2014-01-21    ABCo    6     16     53  7      53
#7.15  2014-01-21    ABCo    6     37     21  7      21
#9.4   2014-01-20    ABCo    7     13     15  9      15
#9.5   2014-01-20    ABCo    7     20     86  9      86
#10.10 2014-01-20    Zyco    7     20     44 10      44
#10.11 2014-01-20    Zyco    7     44     23 10      23
#11.16 2014-01-21    ABCo    7     24     54 11      73
#11.17 2014-01-21    ABCo    7     26     73 11      73
#11.18 2014-01-21    ABCo    7     49     63 11      63
#14    2014-01-20    Zyco    8     19     47 14      47
#15    2014-01-21    ABCo    8     50      4 15       4
alexis_laz
  • This looks like a great start (but, as you mentioned, could be slow when scaled). But, the "new_val" variable... I want to simply be able to sum it up to get the orders, and in this case, I'd be double counting, if I'm understanding this output correctly – Marc Tulla Feb 01 '14 at 23:56
  • @MarcTulla: Hm, what I understood was something `ave`-like for `data$orders`. What you're saying is that you want to keep only the maximum from the rows that are +-2 minutes apart and not both? – alexis_laz Feb 02 '14 at 00:05
  • Exactly, just the max value – Marc Tulla Feb 02 '14 at 00:19
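
Following up on that exchange, one way to collapse the result above to a single row per +-2-minute cluster is to keep only the rows where `orders` equals `new_val`. A sketch (it assumes the max is unique within each cluster, and note that this pairwise +-2 window can differ from the chained grouping in the accepted answer):

res <- do.call(rbind,
               lapply(split(data, data$gr),
                      function(x) transform(x, new_val = ff(x$minute, x$orders))))
res[res$orders == res$new_val, ]  # one row per cluster when the max is unique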