Since I am really new to R, I am not sure I can express my problem correctly, so sorry in advance. I have some letters, each with a given value. I created a data frame for those, and I also have a string made up of the same set of letters. I want to map the values from the data frame onto each letter of my string and then calculate the mean over a window of length L. I can't find a way to do the first part: I don't know how to compare the characters of the string with the column names of the data frame and then assign the values to the string's characters so that I can compute the window mean. Any tips?

A = data.frame(A = 0.429, C = -0.051, D = -2.024, E = -2.181, F = 0.836, 
     G = 0.158, H = -1.056, I = 0.959, K = -2.398, L = 0.658, 
     M = 0.470, N = -1.099, P = -0.675, Q = -1.564, R = -2.501, 
     S = -0.292, T = -0.182, V = 0.634, W = 0.463, Y = 0.163)
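(a <- "MASEFKKKLFWRAVVAEF")
a_split = strsplit(a, "")
# readline() returns a character string, so convert it to a number
L = as.integer(readline(prompt = "Enter window length: \n"))
x = nchar(a)
# parentheses are needed here: 1:x-L is evaluated as (1:x) - L
for(i in 1:(x - L + 1))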
{
  for(j in a_split)
  {
     
      
  }
 
}

Edit 1: Okay, so after your help I think I am making some progress. Sorry for the late thank you and response. I want to iterate N (sequence length) - L (window length) + 1 times, which gives N - L + 1 window means. Then I want to assign the mean of each window to the most central amino acid of that window: for example, the mean of the first window (amino acids 1-10) is assigned to amino acid 5, the mean of window 2-11 to amino acid 6, etc.

A = c(A = 0.429, C = -0.051, D = -2.024, E = -2.181, F = 0.836,
     G = 0.158, H = -1.056, I = 0.959, K = -2.398, L = 0.658,
     M = 0.470, N = -1.099, P = -0.675, Q = -1.564, R = -2.501,
     S = -0.292, T = -0.182, V = 0.634, W = 0.463, Y = 0.163)
cnt = 0

(a <- "MASEFKKKLFWRAVVAEFLATTLFVFISIGSALGFKYPVGNNQTAVQDNV")
a_split = strsplit(a, "")
unlist(A)[ a_split[[1]] ]
values <- A[ a_split[[1]] ]
L=5
N = nchar(a)
print(N)

for(i in 1:N-L)
{
    print(convolve(values, rep(i,i + L-1) / L, type ="filter"))
    print(i/2)
    cnt = cnt + 1
}
print(cnt)

Since I am not familiar with R I do not completely understand how convolve works and that is my main issue.

Edit 2: I think you understood my question correctly, and I thank you for that. I have a sequence of N elements, and I want to see whether any parts of that sequence meet certain criteria. For that reason, I want to slide a window of length 10 along the sequence. For every window, the mean value will be assigned to the "central" element (I know 5.5 is mathematically the center, but rounding down here is perfect).

After all the iterations are finished, I want to look at the value of each window and check whether the results list contains at least L/2 consecutive elements with a positive value. For example, if the results contain a subsequence like ["5" = 0.5, "6" = 2.35, "7" = 0.15, "8" = 0.35, "9" = 0.5], i.e. at least 5 consecutive elements with a positive value, then this part of the sequence (5-9) is possibly a transmembrane region. Of course, if there are more consecutive positive values, the criterion still applies. My goal is to find the regions that could possibly be transmembrane regions.

I hope I will be able to do the last part since it doesn't include convolve, which for some reason really gave me a hard time.

I am really grateful for your help!

Greggs
  • The second argument to `convolve()` is like a weight, and `convolve` calculates the weighted sum of `values` over the window. Choose the weight so that each element in the window is weighted as `1 / window_size`, i.e., `rep(1, window_size) / window_size`, so that the weighted sum is just the average value. I think the simple result `convolve(values, rep(1, window_size) / window_size, type = "filter")` is exactly what you want (the average in each window), but with the first letter of the window, rather than the middle letter, used to identify the location of the window. – Martin Morgan Nov 10 '21 at 20:12
  • When I use rep(1, window_size) / window_size I get the mean values for every element. I want to print and calculate the means for the elements of each window. For example 1-10, 2-11, 3-12 etc and the first window's mean value will be for element 5. I tried changing the parameters in rep but it didn't work as expected. I typed: rep(i, window_size) / window_size and also I changed the values parameter into values[i:i+L] . – Greggs Nov 13 '21 at 13:16
  • I updated my answer to more completely show my understanding of your question. If I am not understanding, then perhaps you could illustrate 'by hand' in your question what you are expecting, at least for a few amino acids. – Martin Morgan Nov 14 '21 at 16:08
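
A side note on the `for(i in 1:N-L)` loop and the `values[i:i+L]` attempt mentioned above: in R the `:` operator binds more tightly than `+` and `-`, so both expressions probably do something other than what was intended. The sketch below only illustrates the precedence rule; `shifted`, `intended`, and `window_starts` are names made up for this example.

```r
N <- 50
L <- 5

# 1:N-L is parsed as (1:N) - L, so the sequence starts at 1 - L
shifted <- 1:N - L
head(shifted)        # -4 -3 -2 -1  0  1

# Parentheses give the intended upper bound
intended <- 1:(N - L)
head(intended)       # 1 2 3 4 5 6

# The same rule applies when indexing: i:i+L is (i:i) + L, a single number
i <- 2
i:i + L              # 7
i:(i + L - 1)        # 2 3 4 5 6, a window of length L starting at i

# Start positions of all N - L + 1 windows
window_starts <- seq_len(N - L + 1)
length(window_starts)  # 46
```

Using `seq_len()` for the loop bounds also avoids the surprise of `1:0` counting downwards when the window is longer than the sequence.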

2 Answers

For the original data.frame, you could write `unlist(A)[ a_split[[1]] ]`.

But instead of using a data.frame, use a named numeric vector,

A = c(A = 0.429, C = -0.051, D = -2.024, E = -2.181, F = 0.836, 
     G = 0.158, H = -1.056, I = 0.959, K = -2.398, L = 0.658, 
     M = 0.470, N = -1.099, P = -0.675, Q = -1.564, R = -2.501, 
     S = -0.292, T = -0.182, V = 0.634, W = 0.463, Y = 0.163)

Then use this as a 'map' between the letters and the values

values <- A[ a_split[[1]] ]
values
#      M      A      S      E      F      K      K      K      L      F      W
#  0.470  0.429 -0.292 -2.181  0.836 -2.398 -2.398 -2.398  0.658  0.836  0.463
#      R      A      V      V      A      E      F
# -2.501  0.429  0.634  0.634  0.429 -2.181  0.836

Use `convolve()` to calculate the sliding window average (the output below is for the longer, 50-letter sequence from Edit 1 of the question)

> window_size = 10
> result <- convolve(values, rep(1, window_size) / window_size, type = "filter")
> result
      M       A       S       E       F       K       K       K       L       F
-0.6438 -0.6445 -0.9375 -0.8654 -0.5839 -0.6041 -0.3214 -0.2997  0.0237  0.0237
      W       R       A       V       V       A       E       F       L       A
-0.0170 -0.0815  0.1504  0.1733  0.1935  0.1935  0.2342  0.5482  0.4354  0.4655
      T       T       L       F       V       F       I       S       I       G
 0.4384  0.4274  0.4885  0.4885  0.4207  0.4409  0.1175  0.0379 -0.0004 -0.0329
      S       A       L       G       F       K       Y       P       V       G
-0.0329 -0.1136 -0.2664 -0.4886 -0.5226 -0.5633 -0.2601 -0.4328 -0.5677 -0.7410
      N
-0.6934

Note that the first element of the result is the mean value of elements 1:10, the second the mean value of elements 2:11, etc.

> mean(values[1:10])
[1] -0.6438
> mean(values[2:11])
[1] -0.6445
> mean(values[3:12])
[1] -0.9375
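
If `convolve()` still feels opaque, one way to convince yourself is to compute every window mean with an explicit loop and compare. This is only a cross-check sketch using the 50-letter sequence from the question; the names `manual` and `filtered` are made up for this example.

```r
# Scores and sequence from the question
A <- c(A = 0.429, C = -0.051, D = -2.024, E = -2.181, F = 0.836,
       G = 0.158, H = -1.056, I = 0.959, K = -2.398, L = 0.658,
       M = 0.470, N = -1.099, P = -0.675, Q = -1.564, R = -2.501,
       S = -0.292, T = -0.182, V = 0.634, W = 0.463, Y = 0.163)
a <- "MASEFKKKLFWRAVVAEFLATTLFVFISIGSALGFKYPVGNNQTAVQDNV"
values <- A[strsplit(a, "")[[1]]]

window_size <- 10
n_windows <- length(values) - window_size + 1   # N - L + 1

# Plain mean() over each window i, i + 1, ..., i + window_size - 1
manual <- vapply(
  seq_len(n_windows),
  function(i) mean(values[i:(i + window_size - 1)]),
  numeric(1)
)

# The same thing via convolve() with equal weights of 1 / window_size
filtered <- convolve(values, rep(1, window_size) / window_size, type = "filter")

# convolve() works via the FFT, so compare with a tolerance rather than ==
all.equal(manual, as.numeric(filtered))
```

The loop is slower but easier to read; for a sequence of this size either approach is fine.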

I believe you are saying that you would like the windows named differently, using the 5th, 6th, ... names instead of the first, second, ...; those names are

> names(values)[5:(length(values) - 5)]
 [1] "F" "K" "K" "K" "L" "F" "W" "R" "A" "V" "V" "A" "E" "F" "L" "A" "T" "T" "L"
[20] "F" "V" "F" "I" "S" "I" "G" "S" "A" "L" "G" "F" "K" "Y" "P" "V" "G" "N" "N"
[39] "Q" "T" "A"

and assigning them to the result gives

> names(result) <- names(values)[5:(length(values) - 5)]
> result
      F       K       K       K       L       F       W       R       A       V
-0.6438 -0.6445 -0.9375 -0.8654 -0.5839 -0.6041 -0.3214 -0.2997  0.0237  0.0237
      V       A       E       F       L       A       T       T       L       F
-0.0170 -0.0815  0.1504  0.1733  0.1935  0.1935  0.2342  0.5482  0.4354  0.4655
      V       F       I       S       I       G       S       A       L       G
 0.4384  0.4274  0.4885  0.4885  0.4207  0.4409  0.1175  0.0379 -0.0004 -0.0329
      F       K       Y       P       V       G       N       N       Q       T
-0.0329 -0.1136 -0.2664 -0.4886 -0.5226 -0.5633 -0.2601 -0.4328 -0.5677 -0.7410
      A
-0.6934
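
Because the same letter occurs many times, letter names alone do not identify a window. A possible variation, sketched here under the assumption that the "central" position of a window of length L is (L + 1) / 2 rounded down (5 for L = 10, 3 for L = 5), is to keep the position next to the residue in a data frame. The names `centre_offset`, `centre_pos`, and `centred` are made up for this example.

```r
# Scores and sequence from the question
A <- c(A = 0.429, C = -0.051, D = -2.024, E = -2.181, F = 0.836,
       G = 0.158, H = -1.056, I = 0.959, K = -2.398, L = 0.658,
       M = 0.470, N = -1.099, P = -0.675, Q = -1.564, R = -2.501,
       S = -0.292, T = -0.182, V = 0.634, W = 0.463, Y = 0.163)
a <- "MASEFKKKLFWRAVVAEFLATTLFVFISIGSALGFKYPVGNNQTAVQDNV"
values <- A[strsplit(a, "")[[1]]]

window_size <- 10
n_windows <- length(values) - window_size + 1
result <- as.numeric(
  convolve(values, rep(1, window_size) / window_size, type = "filter")
)

# Assumed convention: centre of a window = floor((L + 1) / 2) within the window
centre_offset <- floor((window_size + 1) / 2)
centre_pos <- seq(from = centre_offset, length.out = n_windows)

centred <- data.frame(
  position    = centre_pos,                  # position in the full sequence
  residue     = names(values)[centre_pos],   # letter at that position
  window_mean = round(result, 4)
)
head(centred, 3)
```

For `window_size = 10` this reproduces the `5:(length(values) - 5)` indexing used above, but it also works unchanged for other window lengths.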

Maybe if you mean something else you could edit your original question to include a 'hand-calculated' example.

One small point is that '5' is not in the middle of the sequence 1-10, the middle is 5.5...
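
For the last step described in Edit 2 of the question (finding at least L/2 consecutive positive window means), `rle()` is one option. This is a rough sketch, not a validated transmembrane predictor: the threshold `min_run`, the centre convention, and the names `runs`, `keep`, and `regions` are assumptions made for this example.

```r
# Scores and sequence from the question
A <- c(A = 0.429, C = -0.051, D = -2.024, E = -2.181, F = 0.836,
       G = 0.158, H = -1.056, I = 0.959, K = -2.398, L = 0.658,
       M = 0.470, N = -1.099, P = -0.675, Q = -1.564, R = -2.501,
       S = -0.292, T = -0.182, V = 0.634, W = 0.463, Y = 0.163)
a <- "MASEFKKKLFWRAVVAEFLATTLFVFISIGSALGFKYPVGNNQTAVQDNV"
values <- A[strsplit(a, "")[[1]]]

window_size <- 10
n_windows <- length(values) - window_size + 1
result <- as.numeric(
  convolve(values, rep(1, window_size) / window_size, type = "filter")
)

# Sequence position that each window mean is assigned to (rounded-down centre)
centre_pos <- seq(from = floor((window_size + 1) / 2), length.out = n_windows)

# Run-length encode the sign pattern: consecutive TRUEs are candidate regions
runs <- rle(result > 0)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only positive runs that are at least L/2 windows long
min_run <- window_size / 2
keep <- runs$values & runs$lengths >= min_run

regions <- data.frame(
  start = centre_pos[starts[keep]],
  end   = centre_pos[ends[keep]]
)
regions
```

Each row of `regions` gives the first and last central position of a run of positive window means; shorter positive runs are dropped.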

Martin Morgan

You can do this in a one-liner using your original data format:

sapply(unlist(strsplit(a, "")), \(i) A[[i]])
#>      M      A      S      E      F      K      K      K      L 
#>  0.470  0.429 -0.292 -2.181  0.836 -2.398 -2.398 -2.398  0.658 
#>      F      W      R      A      V      V      A      E      F 
#>  0.836  0.463 -2.501  0.429  0.634  0.634  0.429 -2.181  0.836 

Or, if you don't want the letter names, the one-liner is:

as.numeric(sapply(unlist(strsplit(a, "")), \(i) A[[i]]))
#>  [1]  0.470  0.429 -0.292 -2.181  0.836 -2.398 -2.398 -2.398  0.658
#> [10]  0.836  0.463 -2.501  0.429  0.634  0.634  0.429 -2.181  0.836
Allan Cameron