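set.seed(12345)

# sample() already returns a character vector, so no paste() is needed
data <- sample(c("A","C","G","T"), 100000, replace=TRUE, prob=rep(0.25,4))
# 1 where the character is "A", 0 otherwise
data <- ifelse(data=="A",1,0)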

Suppose I convert the data into 1 (the desired character) and 0 (anything else), then take a running sum at each position. If the sum up to one position equals the sum up to the next, the run has ended and we stop; otherwise we carry the sum forward and store it at each position. The maximum stored sum, at its corresponding position, then gives the length of the longest run.

I have the algorithm but can't code it. Please help.

madmathguy

1 Answer


The rle function is what you want here:

set.seed(12345)
data = sample(c('A', 'C', 'G', 'T'), 100000, replace = TRUE, prob = rep(0.25, 4))

run_lengths = rle(data == 'A')
(result = max(run_lengths$lengths[run_lengths$values]))
# [1] 10
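
To see what rle returns, here is a minimal sketch on a small toy vector. The names x and runs are made up for illustration and are not part of the question's data:

```r
# Toy input, made up for illustration (not the question's data)
x = c('A', 'A', 'C', 'A', 'A', 'A', 'G')

# rle() collapses consecutive equal values into (value, length) pairs
runs = rle(x == 'A')
runs$values   # TRUE FALSE TRUE FALSE
runs$lengths  # 2 1 3 1

# Keep only the lengths of the TRUE runs (runs of 'A'), then take the largest
toy_result = max(runs$lengths[runs$values])
toy_result    # 3
```

Indexing runs$lengths by the logical vector runs$values is what restricts the maximum to runs of 'A'.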

Getting the position of the longest run is a bit harder. You can use which.max for that, but we’ve previously filtered out all non-A runs, so the index it returns would refer to the filtered vector rather than to run_lengths. Instead, we can set the lengths of all non-A runs to 0: that way, they’ll still be there, but won’t be the maximum:

only_a = ifelse(run_lengths$values, run_lengths$lengths, 0)
longest_run_index = which.max(only_a)
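
The difference between the two indices is easiest to see on a toy vector. This is only an illustrative sketch; x, runs, filtered_index and masked_index are hypothetical names:

```r
# Toy input, made up for illustration
x = c('C', 'A', 'A', 'G', 'A', 'A', 'A', 'T')
runs = rle(x == 'A')   # lengths 1 2 1 3 1, values F T F T F

# Index among the A runs only: the longest is the 2nd A run
filtered_index = which.max(runs$lengths[runs$values])

# Index among all runs: the longest A run is the 4th run overall
masked_index = which.max(ifelse(runs$values, runs$lengths, 0))
```

Only masked_index can be used to look up positions in runs$lengths, because it still counts the non-A runs in between.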

Now we need to calculate back from longest_run_index to the index inside data. We do this by adding up the lengths of all the runs before this index; the longest run starts one position after that:

index = sum(run_lengths$lengths[seq_len(longest_run_index - 1)]) + 1
data[index : (index + result - 1)]
# [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
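
For reuse, the steps above could be wrapped in a small helper. This is a sketch under stated assumptions, not a definitive implementation: the name longest_run and its arguments are hypothetical, the input is assumed to contain no NA values, and on ties which.max picks the first of the longest runs.

```r
# Hypothetical helper: length and start position of the longest run of
# `target` in the vector `x` (assumes `x` contains no NA values).
longest_run = function (x, target) {
  runs = rle(x == target)
  # No occurrence of `target` at all: report a run of length 0
  if (! any(runs$values, na.rm = TRUE)) {
    return(list(length = 0L, start = NA_integer_))
  }
  # Zero out the runs that are not `target` so they cannot be the maximum
  lengths = ifelse(runs$values, runs$lengths, 0L)
  i = which.max(lengths)
  # cumsum() gives the end position of every run in `x`
  ends = cumsum(runs$lengths)
  list(length = lengths[i], start = ends[i] - runs$lengths[i] + 1L)
}

# Usage on a toy vector, made up for illustration
toy = c('C', 'A', 'A', 'G', 'A', 'A', 'A', 'T')
res = longest_run(toy, 'A')
toy[res$start : (res$start + res$length - 1)]
```

Using cumsum for the run end positions is equivalent to summing the lengths of all earlier runs and adding one, as done above.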
Konrad Rudolph