0

I have a data set which, when plotted, produces a graph that looks like this:

Plot

The head of this data is:

> head(data_frame)
      score position
73860    10    43000
73859    10    43001
73858    10    43002
73857    10    43003
73856    10    43004
73855    10    43005

I've uploaded the whole file as a tab delimited text file here.

As you can see, the plot has regions which have a score of around 10, but there's one region in the middle that "dips". I would like to identify these dips.

Defining a dip as:

  • Starting when the score is below 7
  • Ending when the score rises to 7 or above and stays at 7 or above for at least 500 positions

I would like to identify all the regions which meet the above definition, and output their start and end positions. In this case that would only be the one region.

However, I'm at a bit of a loss as to how to do this. Looks like the rle() function could be useful, but I'm not too sure how to implement it.

Expected output for the data frame would be something like:

[1] 44561 46568

(I haven't actually checked that everything in between these would qualify under the definition, but from the plot this looks about right)

I would be very grateful for any suggestions!

Andrei

2 Answers2

0

So I've come up with one solution that uses a series of loops. I do realise this is inefficient, though, so if you have a better answer, please let me know.

results <- data.frame(matrix(ncol=2,nrow=1))
colnames(results) <- c("start","end")

state <- "out"
count <- 1
for (i in 1:dim(data_frame)[1]){
        print(i/dim(data_frame)[1])
        if (data_frame[i,3] < 7 & state=="out") {
                results[count,1] <- data_frame[i,2]
                state <- "in"
                next
        }
        if (data_frame[i,3] >= 7 & state=="in") {
                if ((i+500)>dim(data_frame)[1]){
                        results[count,2] <- data_frame[dim(data_frame)[1],2]
                        state <- "out"
                        break
                }
                if (any(data_frame[(i+1):(i+500),3] < 7)) {
                        next
                } else {
                        results[count,2] <- data_frame[i-1,2]
                        count <- count+1
                        state <- "out"
                        next
                }
        }
        if ((i+500)>dim(data_frame)[1] & state == "out") {
                break
        }
}
0

Something like this is a tidyverse solution and uses rle as OP suggested...

below7 <- data_frame$score < 7
x <- rle(below7)
runs <- tibble(
          RunLength=x$lengths, 
          Below7=x$values,
          RunStart=df$position[1]
        ) %>% 
        mutate(
          RunStart=ifelse(
                     row_number() == 1, 
                     data_frame$position[1], 
                     data_frame$position[1] + cumsum(RunLength)-RunLength+1
          ),
          RunEnd=RunStart + RunLength-1,
          Dip=Below7,
          Dip=Dip | Below7 | (!Below7 & RunLength < 500)
        )
as.data.frame(runs)

Giving

   RunLength Below7 RunStart RunEnd   Dip
1       1393  FALSE    43000  44392 FALSE
2         84   TRUE    44394  44477  TRUE
3         84  FALSE    44478  44561  TRUE
...
19        60  FALSE    46338  46397  TRUE
20       171   TRUE    46398  46568  TRUE
21      2433  FALSE    46569  49001 FALSE

So to get OP's final answer

runs %>% 
  filter(Dip) %>% 
  summarise(
    DipStart=min(RunStart), 
    DipEnd=max(RunEnd)
  )
# A tibble: 1 x 2
  DipStart DipEnd
     <dbl>  <dbl>
1    44394  46568

If the original data.frame might contain more than one dip, you'd have to do a little more work when creating the runs tibble: having indentified each individual run, you'd need to create an additional column, DipIndex say, which indexed each individual Dip.

Limey
  • 10,234
  • 2
  • 12
  • 32