0

I need to calculate the number of active instances in each month for an ID. I'm able to achieve it with a for loop but I have a very large dataset with 12k ID's and it takes long to complete. Any suggestions for a better solution.

A sample from my data is as shown below

ID  instances   start_month end_month
key1    x1  1397    1400
key1    x2  1395    1402
key1    x3  1399    1402
key1    x4  1398    1401
key2    x5  1396    1401
key2    x6  1398    1402
key2    x7  1398    1402

I want my output to be like below

key1    1395    1
key1    1396    1
key1    1397    2
key1    1398    3
key1    1399    4
key1    1400    4
key1    1401    3
key1    1402    2
key2    1396    1
key2    1397    1
key2    1398    3
key2    1399    3
key2    1400    3
key2    1401    3
key2    1402    2
user1946217
  • 1,733
  • 6
  • 31
  • 40
  • Related: https://stackoverflow.com/questions/35752214/obtain-number-of-days-by-different-months-using-2-dates/35753264#35753264 – Jaap Jun 21 '17 at 06:36
  • @akrun Yes, they are the same. When I tried to tag it as a duplicate of the link you provided, somehow SO gave me a different weblink. I don't know why. – www Jun 21 '17 at 09:12
  • @ycw It's okay, anyway, it got safely identified as dupe. – akrun Jun 21 '17 at 09:52

1 Answers1

3

Using dplyr:

DF %>%
  group_by(ID, instances) %>%
  do(data.frame(out=.$start_month:.$end_month)) %>%
  ungroup() %>%
  count(ID, out)

# # A tibble: 15 x 3
#       ID   out     n
#    <chr> <int> <int>
#  1  key1  1395     1
#  2  key1  1396     1
#  3  key1  1397     2
#  4  key1  1398     3
#  5  key1  1399     4
#  6  key1  1400     4
#  7  key1  1401     3
#  8  key1  1402     2
#  9  key2  1396     1
# 10  key2  1397     1
# 11  key2  1398     3
# 12  key2  1399     3
# 13  key2  1400     3
# 14  key2  1401     3
# 15  key2  1402     2

Data

DF <- structure(list(ID = c("key1", "key1", "key1", "key1", "key2", 
"key2", "key2"), instances = c("x1", "x2", "x3", "x4", "x5", 
"x6", "x7"), start_month = c(1397L, 1395L, 1399L, 1398L, 1396L, 
1398L, 1398L), end_month = c(1400L, 1402L, 1402L, 1401L, 1401L, 
1402L, 1402L)), .Names = c("ID", "instances", "start_month", 
"end_month"), class = "data.frame", row.names = c(NA, -7L))
JasonWang
  • 2,414
  • 11
  • 12