
I am working with a large dataset and I am trying to first identify clusters of values that meet specific threshold values. My aim then is to only keep clusters of a minimum length. Below is some example data and my progress thus far:

Test = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
Sequence = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
Value = c(3,2,3,4,3,4,4,5,5,2,2,4,5,6,4,4,6,2,3,2)
Data <- data.frame(Test, Sequence, Value)

Using the evd package, I have identified clusters of values > 3:

library(evd)
C1 <- clusters(Data$Value, u = 3, r = 1, cmax = FALSE, plot = TRUE)

which produces:

C1
$cluster1
4 
4 

$cluster2
6 7 8 9 
4 4 5 5 

$cluster3
12 13 14 15 16 17 
 4  5  6  4  4  6 

My problem is twofold: 1) I don't know how to relate this back to the original data frame (for example, to Test A and B); 2) how can I keep only clusters with a minimum size of 3 (thus excluding cluster 1)?

I have looked into various filtering options; however, they do not cluster data according to a desired threshold, and they offer no option for a minimum cluster size either.

Any help is much appreciated.

Andy

1 Answer


Q1: relate back to the original data frame: have a look at Carl Witthoft's answer to "detect intervals of the consequent integer sequences". He wrote seqle(), a variant of rle() that looks for runs of consecutive integers rather than repeated values.
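The printed output in the question also suggests a more direct route: the names attached to each cluster's values (6 7 8 9 for cluster2) appear to be positions in Data$Value, so they can be used as row indices. The sketch below rests on that assumption. To run without evd, it rebuilds C1 by hand from the output shown in the question; idx, cluster_id and Clustered are illustrative names only.

```r
# Example data from the question
Data <- data.frame(
  Test     = rep(c("A", "B"), each = 10),
  Sequence = rep(1:10, times = 2),
  Value    = c(3, 2, 3, 4, 3, 4, 4, 5, 5, 2, 2, 4, 5, 6, 4, 4, 6, 2, 3, 2)
)

# C1 copied by hand from the printed output in the question
# (assumption: element names are positions in Data$Value)
C1 <- list(
  cluster1 = c("4" = 4),
  cluster2 = c("6" = 4, "7" = 4, "8" = 5, "9" = 5),
  cluster3 = c("12" = 4, "13" = 5, "14" = 6, "15" = 4, "16" = 4, "17" = 6)
)

# Turn the element names into row indices of Data
idx <- as.integer(unlist(lapply(C1, names)))

# Repeat each cluster label once per element so it lines up with idx
cluster_id <- rep(names(C1), times = lengths(C1))

# Rows of the original data frame that belong to a cluster, with their label
Clustered <- cbind(Data[idx, ], Cluster = cluster_id)
print(Clustered)
```

Because Clustered keeps the Test and Sequence columns, each cluster can be read off against Test A or B directly.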

Q2: only keep clusters of certain length:

C1[sapply(C1, length) >= 3]

yields the 2 clusters that are long enough:

$cluster2
6 7 8 9 
4 4 5 5 

$cluster3
12 13 14 15 16 17 
 4  5  6  4  4  6 
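If the goal is simply to keep the rows of the original data frame, a base R alternative that does not use evd may also be worth considering. This is a different technique from clusters(): it uses rle() to find runs of consecutive values above the threshold, separately within each Test group, and keeps only runs of at least a minimum length. keep_long_runs is a hypothetical helper written for this sketch, and it assumes there are no missing values.

```r
# Example data from the question
Data <- data.frame(
  Test     = rep(c("A", "B"), each = 10),
  Sequence = rep(1:10, times = 2),
  Value    = c(3, 2, 3, 4, 3, 4, 4, 5, 5, 2, 2, 4, 5, 6, 4, 4, 6, 2, 3, 2)
)

# Hypothetical helper: TRUE for every element that sits in a run of
# consecutive values > u whose length is at least min_len
keep_long_runs <- function(x, u = 3, min_len = 3) {
  runs <- rle(x > u)                               # runs of TRUE/FALSE
  ok   <- runs$values & runs$lengths >= min_len    # long runs above u
  rep(ok, times = runs$lengths)                    # expand back to length(x)
}

# Apply the helper within each Test group so a run cannot span A and B;
# ave() returns 0/1 here, hence the as.logical()
keep <- as.logical(ave(Data$Value, Data$Test, FUN = keep_long_runs))

Kept <- Data[keep, ]
print(Kept)
```

For the example data this keeps the same rows as cluster2 and cluster3 above, with the Test and Sequence columns still attached.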
Martin