Finding contiguity by comparing kmers in R

Question

Hello, I have a data frame that looks like this:

LR ID    Kmer        ProcID
1        GTACGTAT    10
1        TACGTATC    10
1        ACGTATCG    2
1        GTATCGTT    3
2        GTTACGTA    16
2        TTACGTAC    16
2        TACGTACT    16
2        ACGTACTT    11

The expected output is something like:

LR1 max length: 16 #(as 2 kmers are consecutively going to proc 10)
LR1 min length: 8
LR2 max length: 24 #(as 3 kmers are consecutively going to proc 16)

There are 800 LR IDs like these, each with kmers going to different processes. My objective is to find the longest uninterrupted sequence of kmers that belongs to one LR ID and goes to the same destination ProcID. To do this, I need to compare the overlapping (k-1) characters of one row with those of the next row, and so on.

I know about the function

str_detect()

(from the stringr package), which checks whether a pattern occurs in a string. Is there a better way to do this?

@RonakShah I want to find what is the max and min continuous uninterrupted sequence for a particular LR ID going to the same process ID. — Ashi, Feb 06 '21 at 05:13
@RonakShah I have added some new rows and also gave a sample output. Let me know if that is understandable. — Ashi, Feb 06 '21 at 06:46

score 1 · Answer 1 · answered Feb 06 '21 at 07:38

We can count the consecutive occurrences of each ProcID within each LRID, then take the min and max of those counts. Multiplying by 8 (the kmer length in the example) gives the lengths.

library(dplyr)

df %>%
  count(LRID, grp = data.table::rleid(ProcID)) %>%
  group_by(LRID) %>%
  summarise(max = max(n) * 8, 
            min = min(n) * 8)

#   LRID   max   min
#* <int> <dbl> <dbl>
#1     1    16     8
#2     2    24     8

Or using data.table :

library(data.table)
setDT(df)[, .(n = .N), .(LRID, rleid(ProcID))][, .(max = max(n) * 8, min = min(n) * 8), LRID]
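
For comparison, the same run-length idea can be sketched in base R with rle(), without dplyr or data.table. This is only an illustration on the sample rows from the question. The variable names k, runs and res are assumptions made for this sketch, and k = 8 is taken from the example kmers.

```r
# Sample rows from the question (Kmer column omitted; it is not used here).
df <- data.frame(
  LRID   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L)
)
k <- 8  # kmer length in the example (assumption: all kmers have this length)

# For each LRID, rle() gives the lengths of consecutive runs of equal ProcID.
runs <- lapply(split(df$ProcID, df$LRID), function(p) rle(p)$lengths)

# Longest and shortest run per LRID, scaled by k as in the question.
res <- data.frame(
  LRID = as.integer(names(runs)),
  max  = vapply(runs, max, numeric(1)) * k,
  min  = vapply(runs, min, numeric(1)) * k,
  row.names = NULL
)
res
```

Splitting by LRID before calling rle() keeps a run from spanning two LR IDs when the last ProcID of one equals the first ProcID of the next.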

score 1 · Accepted Answer · answered Feb 06 '21 at 18:17

We can build the run identifier with cumsum() and lag() instead of rleid():

library(dplyr)
df1 %>% 
    count(LRID, grp = cumsum(ProcID != lag(ProcID, default = first(ProcID)))) %>%
    group_by(LRID) %>% 
    summarise(max = max(n) * 8, 
             min = min(n) * 8, .groups = 'drop')
# A tibble: 2 x 3
#   LRID   max   min
#  <int> <dbl> <dbl>
#1     1    16     8
#2     2    24     8
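
To see what the grouping expression in this answer does, here is a small base R sketch of the same idea applied to the ProcID column from the question. The names proc, changed and run_id are made up for the illustration.

```r
# ProcID values from the question, in row order.
proc <- c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L)

# TRUE wherever a value differs from the one before it; the first row never counts as a change.
changed <- c(FALSE, proc[-1] != proc[-length(proc)])

# The running total of changes gives every run of equal values its own id.
run_id <- cumsum(changed)
run_id
# 0 0 1 2 3 3 3 4

# Number of rows in each run.
run_len <- as.vector(table(run_id))
run_len
# 2 1 1 3 1
```

Counting rows per (LRID, run id) pair, as count() does in the answer, then gives the run lengths within each LR ID.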

data

df1 <- structure(list(LRID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Kmer = c("GTACGTAT", 
"TACGTATC", "ACGTATCG", "GTATCGTT", "GTTACGTA", "TTACGTAC", "TACGTACT", 
"ACGTACTT"), ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L)),
class = "data.frame", row.names = c(NA, 
-8L))
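
Both answers group rows by ProcID only. The question also mentions comparing the (k-1) overlapping characters of neighbouring kmers, so here is a hedged base R sketch that additionally requires each kmer to overlap the previous one. It assumes "overlap" means that the last k-1 characters of one kmer equal the first k-1 characters of the next, and that all kmers have length k = 8. The names continues, run_id, run_len, run_lrid and out are made up for this sketch.

```r
df1 <- data.frame(
  LRID   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Kmer   = c("GTACGTAT", "TACGTATC", "ACGTATCG", "GTATCGTT",
             "GTTACGTA", "TTACGTAC", "TACGTACT", "ACGTACTT"),
  ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L),
  stringsAsFactors = FALSE
)
k <- 8            # kmer length (assumption: constant across rows)
n <- nrow(df1)

# A row continues the previous row's run only if LRID and ProcID match
# and the previous kmer's suffix equals this kmer's prefix.
continues <- c(
  FALSE,
  df1$LRID[-1]   == df1$LRID[-n] &
  df1$ProcID[-1] == df1$ProcID[-n] &
  substr(df1$Kmer[-n], 2, k) == substr(df1$Kmer[-1], 1, k - 1)
)

run_id   <- cumsum(!continues)         # a new id starts at every break
run_len  <- as.vector(table(run_id))   # rows per run, in run order
run_lrid <- df1$LRID[!continues]       # LRID of the first row of each run

# Lengths are reported as (rows in run) * k, following the question's convention.
out <- data.frame(
  LRID = sort(unique(run_lrid)),
  max  = as.vector(tapply(run_len, run_lrid, max)) * k,
  min  = as.vector(tapply(run_len, run_lrid, min)) * k
)
out
```

On the sample data this gives the same result as the answers above, because every pair of neighbouring rows that shares a ProcID also overlaps by k-1 characters. On data where that does not hold, the overlap check would split such runs.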
