
Let's say you have a data frame generated by the following commands:

date <- seq(as.Date("2012-09-01"), Sys.Date(), 1)
id <- rep(c("a","b","c","d"), 8)
bdate <- seq(as.Date("2012-08-01"), as.Date("2012-11-01"), 1)[sample(1:32, 32)]

# The end date should be random but greater than the begin date. However, I set it to 15 days later for simplicity.
edate <- bdate + 15

value <- seq(1, 1000, 1)[sample(1:1000, 32)]
dfa <- data.frame(id, value, bdate, edate)
names(dfa) <- c("ID", "Value", "Begin.Date", "End.Date")

The goal is to sum all the observations by ID (i.e., "a", "b", "c", or "d") in the following way:

Date        a   b   c
2012-08-01  XXX YYY ZZZ
2012-08-02  XXX YYY ZZZ
2012-08-03  XXX YYY ZZZ

The values XXX, YYY, and ZZZ represent the sum of all the observations where the date on column "Date" falls between dfa$Begin.Date and dfa$End.Date on the original data frame, for each ID.

My current solution is practically useless for large datasets, so I was wondering if there is a faster way to do it.

My current script:

# Data frame with one row per date for the result
dfb <- data.frame(Date = seq(as.Date("2012-08-01"), as.Date("2012-11-01"), 1))

# Variable for unique IDs
nid <- unique(dfa$ID)

# Number of total IDs
tid <- length(nid)

for (i in seq_len(tid)) {
  sums <- vapply(dfb$Date, function(x) {
    temp <- subset(dfa, ID == nid[i] & Begin.Date < x & End.Date > x)
    sum(temp$Value)
  }, FUN.VALUE = numeric(1))
  dfb[1 + i] <- sums
}

# Change column names to ID
names(dfb) <- c("Date", as.character(nid))

EDIT: I posted an answer below with a more efficient way to do this. However, I accepted Matthew's answer because it set me on the right path.

Fael

4 Answers


Thanks to @Matthew Dowle I found out how to do this more efficiently with the data.table package.

# Fire up the bad boy
library(data.table)

# Create the data table with original data
value <- seq(1, 1000, 1)[sample(1:1000, 32)]
dt <- data.table(id, value, bdate, edate)
setnames(dt, names(dt), c("id", "value", "begin", "end"))

# For each pair of id and value, create a row for each day (i.e., the first line,
#   a  928  2012-08-11  2012-08-26
# becomes 15 rows, with the first two columns repeated over 15 consecutive dates).
dt <- dt[, seq(begin[1], (last(end) - 1), by="days"), by = list(id, value)]
setnames(dt, names(dt), c(names(dt)[1:2], "date"))
setkey(dt)

# Sum each pair of id and value over the dates column
dt <- dt[, sum(value), by = list(id, date)]
setnames(dt, names(dt), c(names(dt)[1:2], "value"))
setkey(dt, date, id)

# Define the time span you would like on your final table
timespan <- dt[, seq(as.Date("2012-07-25"), max(date), by = "day")]

# Now just cross reference the time span with your actual data
setkey(dt, id, date)
dt <- dt[CJ(unique(id), timespan), ]
setnames(dt, names(dt), c(names(dt)[1:2], "value"))
setkey(dt, date)

Ta-da!!

Now, to rearrange the table in the order I originally wanted:

Date        a   b   c
2012-08-01  XXX YYY ZZZ
2012-08-02  XXX YYY ZZZ
2012-08-03  XXX YYY ZZZ

Just use dcast from the reshape2 package.
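For reference, here is a minimal sketch of that reshaping step. It assumes the long table dt ends up with columns named id, date, and value, as produced above; the fill = 0 argument is my own choice for (date, id) combinations that are absent from the long table.

library(reshape2)

# One row per date, one column per id, cell values taken from "value"
wide <- dcast(as.data.frame(dt), date ~ id, value.var = "value", fill = 0)

Note that combinations which are present in the long table but carry NA (from the CJ cross join) still show NA in the wide table, so they may need a separate replacement if zeros are wanted there.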

So what do you all think? Awesome, right?

Fael
  • Hi. :) That's good, glad you like it. But the `seq(begin[1], (last(end) - 1), by="days")` expanding part isn't needed, iiuc. That's what `roll=TRUE` is for. It's the same thing really, without actually expanding out, for speed. `roll=TRUE` is used in the other linked question and is really important for date range joins like this. – Matt Dowle Oct 12 '12 at 22:59
  • This is a little bit awkward but that part is on the first line of your final answer in the other question... I'll try to use the roll=TRUE part (though I don't know how yet). Hmmm maybe I haven't tried hard enough but, would you mind sharing how you would do it with the roll=TRUE option? – Fael Oct 12 '12 at 23:26
  • Apols, I meant to say it's that `seq` with `by = list(id, value)` that makes it expand out each row (each bill), doesn't it? I do that by `acct` only in the other question. – Matt Dowle Oct 13 '12 at 00:15

Interesting. It seems very similar to this question:

Splitting irregular time series into regular monthly averages

Does that help? There, as in your question, one trick can be to join to the prevailing begin using roll=TRUE in package data.table, especially since you stated you have large datasets.
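To make the roll=TRUE idea concrete, here is a small, self-contained sketch of a rolling join (an illustration of the mechanism only, not the full solution to this question; the table and column names are invented for the example):

library(data.table)

# Irregular events: each value takes effect on 'begin' and prevails
# until the next event for the same id
events <- data.table(id    = c("a", "a", "b"),
                     begin = as.Date(c("2012-08-01", "2012-08-10", "2012-08-05")),
                     value = c(100, 250, 40))
setkey(events, id, begin)

# A regular daily grid for every id
grid <- CJ(id   = unique(events$id),
           date = seq(as.Date("2012-08-01"), as.Date("2012-08-15"), by = "day"))

# roll = TRUE rolls each grid date back to the prevailing 'begin',
# so the events table never has to be expanded day by day
daily <- events[grid, roll = TRUE]

Each (id, date) row of daily then carries the value that was in force on that date (NA before the first begin for that id); the rolled join column keeps the name begin but holds the grid dates.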

Matt Dowle
  • Interesting... very interesting. I don't have time right now but I'll try it in a couple of days and report back. Thanks for helping! – Fael Oct 08 '12 at 18:18
  • Thank you, thank you, thank you, thank you! I just figured out how to do it but I'll post it as a different answer to improve readability. – Fael Oct 12 '12 at 18:49

I would do the following. First, subset the original data set by checking whether the desired date falls between Begin.Date and End.Date. Then use the table function to get the frequencies for 'a' through 'd'.

mydate <- as.Date("2012-08-25")  # take Aug 25, 2012 as an example
ind <- (dfa$Begin.Date <= mydate) & (dfa$End.Date >= mydate)
temp <- subset(dfa, ind)
out <- table(temp$ID)
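If what is needed is the summed Value per ID rather than a count (see the comments below), one option is a tapply over the same subset; this sketch reuses the temp data frame from the snippet above:

# Sum Value by ID within the date window; if ID is a factor, levels with
# no matching rows come back as NA and can be set to 0
sums <- tapply(temp$Value, temp$ID, sum)
sums[is.na(sums)] <- 0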
JACKY88
  • This solution works for particular dates and I'd have to loop over all needed dates in order to get the results data frame. I haven't tried it on a large data set so it might work, but it looks similar to my vapply approach. Thank you very much for your suggestion. – Fael Oct 02 '12 at 20:22
  • @Fael Yes, the basic idea is the same. There may be some packages on this, otherwise, I think a loop over all desired dates is unavoidable. Good luck. – JACKY88 Oct 02 '12 at 20:53
  • I just tried this approach and I must say it feels considerably slower than mine. Additionally, the table command only _counts_ the observations; it doesn't actually sum them. Thank you very much for your help but I think my original approach might be more suitable. – Fael Oct 02 '12 at 21:27
  • @Fael What do you mean it doesn't actually sum them? – JACKY88 Oct 02 '12 at 21:30
  • Suppose that on 2012-08-25 the desired values to sum for each id are: 100, 125 (for a); 40 (for b); 0 (for c); 120, 100, 400 (for d). The output should be a = 225, b = 40, c = 0, and d = 620. However, your code only _counts_ the relevant values. It outputs a = 2, b = 1, c = 0, d = 3. – Fael Oct 02 '12 at 23:04
  • @Fael I misunderstood your meaning. I thought you want to count the frequencies of a, b, c, d. Sorry about that. – JACKY88 Oct 02 '12 at 23:50

I don't know if this is any faster (I have not benchmarked it), and for particularly large data it may create too large an intermediate data set, but I'll present it anyway.

The range of dates considered can be set as well (request based on comment to this answer).

library("plyr")
library("reshape2")

earliest.date <- as.Date("2007-01-01")
latest.date <- as.Date("2012-11-01")

dfa.long <- adply(dfa, 1, function(DF) {
  if(DF$End.Date >= earliest.date & DF$Begin.Date <= latest.date) {
    data.frame(Date=seq(max(DF$Begin.Date, earliest.date), 
                        min(DF$End.Date, latest.date), 
                        1))
  }
})

dfb <- ddply(dfa.long, .(Date, ID), summarise, sum=sum(Value))
dfb <- dcast(dfb, Date~ID, value.var="sum", fill=0)

dfa.long is a data set with each row of dfa repeated once for each date within its start/end range (constrained to lie within the earliest/latest date range). From there it is straightforward to aggregate by Date and ID and use dcast from reshape2 to turn the result into the wide format you wanted.

Brian Diggs
  • Your approach is similar to vapply but, in my limited experience, the plyr and reshape2 combination is at least as fast as the *pply functions. I'll try it and report back later. – Fael Oct 02 '12 at 20:24
  • I just tried this approach on a subset of my whole data set and, while it is possible that I implemented it in a wrong way, it seems slower. Plus, I can't really understand the dfb$Date column. For example, one data point has a value of 100 with a begin date of "1999-01-01" and an end date of "2050-01-01". However, I want the dfb$Date column to include results from "2007-01-01" onwards. Since some of the original data points have dates previous to "2007-01-01", my dfb$Date does not look the way I expected. There must be a way to correct this but, right now, it escapes me. – Fael Oct 02 '12 at 21:09
  • -1 For suggesting `plyr` when question states "large datasets so I was wondering if there are any faster ways". – Matt Dowle Oct 05 '12 at 13:38
  • @MatthewDowle It's a fair cop. – Brian Diggs Oct 05 '12 at 16:52