1

I am reposting this question as I thought I needed a cluster type analysis but what is required is more of a 'sliding window' analysis. I have a dataset that has 59k entries recorded over 63 years, I need to identify 'clusters' of events with the criteria being:

A sequence of 6 or more events, with no more than a 6 hour period between consecutive events in the sequence.

Each event has a unique ID and a date/time stamp an output would ideally assign a cluster ID to events that fulfil the above criteria. I have been directed towards a sliding window appraoch, is the best option to use? maybe using rollapply from the zoo package?

I have added a sample one years worth of data if anyone is feeling in a very helpful mood. https://dl.dropboxusercontent.com/u/16400709/StackOverflow/DataStack.csv

I have seen the output of such an analysis in R but have not been able to replicate it as of yet, the results of this analysis can be seen in this paper --> https://dl.dropboxusercontent.com/u/16400709/StackOverflow/fuhrmann_etal_waf2014.pdf

Thanks for your time!

Methexis
  • 2,739
  • 5
  • 24
  • 34

1 Answers1

3

Here's a solution using rollapply from the zoo package.

require(chron)
require(zoo)

d <- read.csv("path/to/file/DataStack.csv")
d[] <- lapply(d, as.character)

d$time <- chron(d$Date, d$Time, format = c(dates = "d/m/y", times = "h:m:s"))
d <- d[order(d$time), ]
d$diffHours <- c(0, diff(d$time)) * 24
d$withinSixHr <- ifelse(d$diffHours < 6, 1, 0)
d$streak <- c(rep(NA, 5), rollapply(zoo(d$withinSixHr), width = 6, FUN = sum))
d$cluster <- ifelse(d$streak == 6, 1, 0)

d$clusterParticipant <- 0

for (i in 6:nrow(d)) {

  if (d[i, "cluster"] == 1) {

    d[i - 5, "clusterParticipant"] <- 1
    d[i - 4, "clusterParticipant"] <- 1
    d[i - 3, "clusterParticipant"] <- 1
    d[i - 2, "clusterParticipant"] <- 1
    d[i - 1, "clusterParticipant"] <- 1
    d[i - 0, "clusterParticipant"] <- 1

  }
}

And here's the result:

> head(d[c(1, 5:10)], n = 20)
   EventID                time   diffHours withinSixHr streak cluster clusterParticipant
2   272481 (01/01/11 00:02:00)   0.0000000           1     NA      NA                  1
3   272666 (01/01/11 00:40:00)   0.6333333           1     NA      NA                  1
4   272674 (01/01/11 00:46:00)   0.1000000           1     NA      NA                  1
5   272701 (01/01/11 01:15:00)   0.4833333           1     NA      NA                  1
6   272715 (01/01/11 02:00:00)   0.7500000           1     NA      NA                  1
7   272720 (01/01/11 02:25:00)   0.4166667           1      6       1                  1
8   272723 (01/01/11 02:56:00)   0.5166667           1      6       1                  1
21  278829 (09/01/11 03:25:00) 192.4833333           0      5       0                  0
1   268346 (17/01/11 10:03:00) 198.6333333           0      4       0                  0
43  280736 (25/01/11 15:35:00) 197.5333333           0      3       0                  0
26  279417 (25/01/11 17:15:00)   1.6666667           1      3       0                  1
44  280739 (25/01/11 17:41:00)   0.4333333           1      3       0                  1
45  280740 (25/01/11 18:08:00)   0.4500000           1      3       0                  1
46  280751 (25/01/11 18:40:00)   0.5333333           1      4       0                  1
47  280806 (25/01/11 19:09:00)   0.4833333           1      5       0                  1
48  281559 (25/01/11 21:50:00)   2.6833333           1      6       1                  1
14  276331 (01/02/11 06:10:00) 152.3333333           0      5       0                  0
15  276336 (01/02/11 08:24:00)   2.2333333           1      5       0                  0
50  281741 (01/02/11 20:06:00)  11.7000000           0      4       0                  0
11  275388 (24/02/11 15:53:00) 547.7833333           0      3       0                  0

EDIT: The code below is for giving each cluster (or supercluster) an ID number. It creates the variable clusterDiff and uses it as a switchboard to determine whether there's a change in cluster status. It'll be pretty slow on a large data set, but it'll do the trick.

d$clusterDiff <- c(d[1, "clusterParticipant"], diff(d$clusterParticipant))
d$clusterID <- as.numeric(NA)

count <- 1
inCluster <- FALSE

for (i in 1:nrow(d)) {

  if (d[i, "clusterDiff"] == 1) { 
    d[i, "clusterID"] <- count
    inCluster <- TRUE 

  } else if (d[i, "clusterDiff"] == -1) { 
    inCluster <- FALSE
    count <- count + 1

  } else if (inCluster == TRUE & d[i, "clusterDiff"] == 0) {
    d[i, "clusterID"] <- count

  } else { next }

}
rsoren
  • 4,036
  • 3
  • 26
  • 37
  • Looks great, just going to give it a whirl over the full dataset and see how we get on. The above results look promising! – Methexis Aug 05 '14 at 22:53
  • Hello, I have had a go at running the code, couple of quick q's, why do you have 'd[]' why would you not just call it 'd' never seen this before, when setting the data to d[] I get an error 'number of items to replace is not a multiple of replacement length' Also d <- d[order(d$time), ] throws an 'incorrect number of dimensions' error. If I lapply just to d and do not order then I can get results down to d$cluster but the head() command does not print the same as above! Hope that makes sense?! – Methexis Aug 06 '14 at 14:41
  • 1
    I've made a change that should read in the data for you correctly, and should take care of the errors you mentioned. Just change the file path to where you're keeping ```DataStack.csv```. I use ```d[]``` so that it preserves the data frame structure of ```d```, otherwise the result of ```lapply``` will be a list. – rsoren Aug 06 '14 at 14:58
  • Reed thanks again, will give it another go. I have learnt more from stackoverflow than any book or online tutorial! – Methexis Aug 06 '14 at 14:59
  • 1
    I learned that particular nugget from Hadley Wickham's excellent online (free) book, "Advanced R" (http://adv-r.had.co.nz/). Highly recommended if you have time to work through it. – rsoren Aug 06 '14 at 15:04
  • I will take a look, okay got the code up and running time to run it through the full dataset and see how we go! – Methexis Aug 06 '14 at 15:07
  • 1
    Just curious, what sort of "events" are we talking about here? – rsoren Aug 06 '14 at 15:14
  • Tornadoes! I am looking at outbreaks, their cumulative energy looking for any correlation with the ENSO signal. The calculating energy expelled per outbreak and relationship with ENSO value was the easy bit. I have run and exported down to clusterpaticipant, will give this second step a shot now – Methexis Aug 06 '14 at 15:24
  • Almost there, for ClusterParticipant, can the first cluster be 1, the second 2, 3, 4, 5.....off the top of my head define a loop say x and the if (d[i, "cluster"] == x) something like that! – Methexis Aug 06 '14 at 15:49
  • 1
    Do you want adjacent clusters to be considered one big cluster, as in lines 7 and 8? – rsoren Aug 06 '14 at 15:55
  • Correct....if the last tornado in the cluster occurs =>6 since the second last then they should be recorded as occurring within the same cluster. To think my supervisor was going to get me to do this in excel manually! – Methexis Aug 06 '14 at 16:02
  • in other words a cluster is defined with 0s at the start and the end of a sequence of 1's in cluster participant, but ideally that sequence would then go 1, 2, 3 etc. That way I can generate unique cluster ID's – Methexis Aug 06 '14 at 16:06
  • 1
    Ok I added a ```for``` loop that assigns cluster IDs. Your supervisor can buy me a beer next time I'm in London :D – rsoren Aug 06 '14 at 16:41
  • IT WORKS...thats brilliant, yep a beer and might even get a thanks too in the write up! – Methexis Aug 06 '14 at 17:01
  • Great :) if you wouldn't mind, post a copy of the report here if you decide to cite this post – rsoren Aug 06 '14 at 17:14