Sliding window algorithm to analyze values of fasta segments

Question

I have two segments of a random fasta file

1 Segment1 AAGGTTCC
2 Segment2 CCTTGGAA

I have another random data set containing dinucleotides' energy values as

 AA -1.0
 AG -2.0
 GG -1.5
 GT -1.7
 TT -1.2
 TC -1.8
 CC -1.4
 CT -2.5
 TG -2.1
 GA -2.3

Here, I want to analyze and compare the nucleotides of the two fasta segments against the given energy values using a 'sliding window algorithm'. The energy output value for Segment1 should be the average over all the possible dinucleotide combinations, taken in an overlapping sliding-window manner, which will give the answer as -10.6, i.e. {(-1.0) + (-2.0) + (-1.5) + (-1.7) + (-1.2) + (-1.8) + (-1.4)}/7. The same computation should be performed for Segment2, preferably using for and if/else constructs.
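Since the question asks for a for loop with if/else, here is a minimal base R sketch of that approach. The helper name `window_energy` and the variables `energy` and `segments` are made up for illustration; the values are copied from the question, and each segment is assumed to be a non-empty string. It returns both the sum and the mean, because the question mentions both -10.6 and a division by 7.

```r
# Dinucleotide energies as a named vector (values copied from the question)
energy <- c(AA = -1.0, AG = -2.0, GG = -1.5, GT = -1.7, TT = -1.2,
            TC = -1.8, CC = -1.4, CT = -2.5, TG = -2.1, GA = -2.3)

segments <- c(Segment1 = "AAGGTTCC", Segment2 = "CCTTGGAA")

# window_energy() is a hypothetical helper name, not part of any package
window_energy <- function(segment, energy) {
  total <- 0
  n_found <- 0
  # Slide a window of width 2 along the segment, one position at a time
  for (i in seq_len(nchar(segment) - 1)) {
    dinu <- substr(segment, i, i + 1)
    if (dinu %in% names(energy)) {
      total <- total + energy[[dinu]]
      n_found <- n_found + 1
    } else {
      # Dinucleotides without an energy value are skipped
      warning("no energy value for ", dinu)
    }
  }
  c(sum = total, mean = total / n_found)
}

# One row per segment, with columns "sum" and "mean"
result <- t(sapply(segments, window_energy, energy = energy))
print(round(result, 2))
```

For the two segments above, the sums should come out as -10.6 and -12, and the means as roughly -1.51 and -1.71.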

score 1 · Answer 1 · 2022-05-10T15:51:23.290

Here is another way using tidytext. We are using the "character shingles" tokenizer, which breaks the sequence up the way you are looking for.

library(tidytext)
library(dplyr)

df <- df1 %>% 
  unnest_character_shingles("Dinu", "Segment", n = 2L, to_lower = FALSE, drop = FALSE) %>%  
  left_join(df2, by = "Dinu") %>% 
  group_by(ID, Segment) %>% 
  summarize(mean = mean(Value))

Which gives the result:

> df
# A tibble: 2 x 3
# Groups:   ID [2]
  ID       Segment   mean
  <chr>    <chr>    <dbl>
1 Segment1 AAGGTTCC -1.51
2 Segment2 CCTTGGAA -1.71

To access the new columns after this analysis, use the df object. For example, mean(df$mean) will provide the average of the mean column.

edited May 10 '22 at 15:51

answered Apr 11 '22 at 16:51

Exactly what I was looking for! I have another query: how do I access a specific column of that tibble? For example, if I have to get the mean of the entire column titled 'mean'. – 08BKS09 Apr 27 '22 at 12:37
In order to assign the tibble to a vector I would need the name of that specific tibble. How do I get the name of the tibble that was formed? – 08BKS09 Apr 28 '22 at 12:30
@08BKS09 I suggest looking at some "getting started" material for R. There is plenty out there. In this case, you would do `df <- df1 %>% unnest_character_shingles(...)` to assign to a variable named `df`. You can name that anything you want. This is fundamental R. – Apr 28 '22 at 13:27
While executing my file today I am getting the error message `Error in tf(col, lowercase = to_lower, ...) : formal argument "lowercase" matched by multiple actual arguments`. It was working fine until yesterday. Is there any way to solve this issue? – 08BKS09 May 10 '22 at 15:42
@08BKS09 the new version might have fixed the problem I mentioned before. Try removing the `lowercase = FALSE` part and keep the `to_lower = FALSE` part only instead. Let me know if that works so I can update the answer. – May 10 '22 at 15:46
Removing `lowercase = FALSE` solves the mentioned issue! Thanks – 08BKS09 May 10 '22 at 15:50

Maël · Answer 2 · 2022-04-11T13:51:29.200

Here's a way to do it in the tidyverse. First, create a vector of two consecutive characters in the string (using f). Then, with some pivoting, merge with the second dataset and compute the sum by group.

library(tidyverse)

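# f() returns the overlapping two-character windows of each string, one column per position.
# It takes the window count from the first string, so it assumes all segments have equal length.
f <- function(string) sapply(seq(nchar(string[1]) - 1), \(i) substr(string, i, i+1))

df1 %>% 
  mutate(data.frame(f(df1$Segment))) %>% 
  pivot_longer(-c(ID, Segment), values_to = "Dinu") %>% 
  inner_join(df2, by = "Dinu") %>%  # explicit key avoids the join-key message
  group_by(ID, Segment) %>% 
  summarise(sum = sum(Value))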

  ID       Segment    sum
  <chr>    <chr>    <dbl>
1 Segment1 AAGGTTCC -10.6
2 Segment2 CCTTGGAA -12

data

df1 <- read.table(header = F, text = "1 Segment1 AAGGTTCC
2 Segment2 CCTTGGAA")[, 2:3]
colnames(df1) <- c("ID", "Segment")

df2 <- read.table(header = F, text = " AA -1.0
 AG -2.0
 GG -1.5
 GT -1.7
 TT -1.2
 TC -1.8
 CC -1.4
 CT -2.5
 TG -2.1
 GA -2.3")
colnames(df2) <- c("Dinu", "Value")

score 0 · Answer 3 · answered Apr 11 '22 at 16:23

Here is a data.table approach (returning the sum and the mean; it was unclear in your post which one you wanted, but in case of different sequence lengths the mean might make more sense).

The idea is to vectorize the sequence, transpose and shift by one, then combine and retrieve the values from the named vector of energy values. It would be interesting to compare performance on more and/or longer sequences, but it seems to be faster than the proposed tidyverse approach. I am sure this can still be improved, though.

library(data.table)
dt1 <- data.table(df1, key=c("ID", "Segment"))
dt2 <- with(df2, setNames(Value, Dinu))
dt1[, e:= lapply(.SD, \(x) strsplit(x, "")), by="ID", .SDcols="Segment"]
dt1[, e2 := data.table::shift(e, 1, type="lead")]
dt1 <- dt1[, lapply(.(e, e2), unlist), by = list(Segment, ID)]
dt1[, .(sum = sum(dt2[paste0(V1, V2)], na.rm=TRUE),
    mean = mean(dt2[paste0(V1, V2)], na.rm=TRUE)), by=.(Segment, ID)][]
#>     Segment       ID   sum      mean
#> 1: AAGGTTCC Segment1 -10.6 -1.514286
#> 2: CCTTGGAA Segment2 -12.0 -1.714286
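The named-vector lookup used above can also be sketched in base R without data.table. This is only an illustration under stated assumptions: the helper name `dinucleotides` and the variables `energy`, `segments` and `scores` are made up here, the values are copied from the question, and every segment is assumed to have at least two characters. Because each segment is processed on its own, segments of different lengths would work as well.

```r
# Named vector of dinucleotide energies (values copied from the question)
energy <- c(AA = -1.0, AG = -2.0, GG = -1.5, GT = -1.7, TT = -1.2,
            TC = -1.8, CC = -1.4, CT = -2.5, TG = -2.1, GA = -2.3)

segments <- c(Segment1 = "AAGGTTCC", Segment2 = "CCTTGGAA")

# dinucleotides() is a hypothetical helper: all overlapping 2-character
# windows of one segment, extracted in a single vectorized substring() call
dinucleotides <- function(segment) {
  starts <- seq_len(nchar(segment) - 1)
  substring(segment, starts, starts + 1)
}

# Look the windows up by name; dinucleotides without a value become NA
# and are dropped by na.rm = TRUE
scores <- t(sapply(segments, function(s) {
  values <- energy[dinucleotides(s)]
  c(sum = sum(values, na.rm = TRUE), mean = mean(values, na.rm = TRUE))
}))
print(scores)
```

The result is a matrix with one row per segment and the columns `sum` and `mean`, which should agree with the data.table output above.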
