Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

When I submit my answer to Rosalind, it is marked wrong. I don't know whether the problem is the formatting or the actual values I am submitting.

My Solution in R:

library(Biostrings)  # readDNAStringSet(), width(), consensusMatrix()
library(tidyverse)   # provides the %>% pipe used below

input <- readDNAStringSet("./Rosalind-Input/rosalind_cons.txt")
# Biostrings reads the FASTA sequences into a DNAStringSet object

length <- width(input)
# width() returns the length of every sequence in the set; they are all equal, so length[1] is used below

consensus <- consensusMatrix(input) 
# creates consensus matrix from biostring dnastring object

consensusAGCT <- consensus[1:4,]
# keeps only the first four rows (A, C, G, T) and drops the other letters we aren't looking for

answer <- c()
for(i in 1:length[1]) {
  answer[i] <- which(consensusAGCT[, i] == max(consensusAGCT[, i]), arr.ind = TRUE) %>%
    names()
}
# gives the name of each letter that is the max of the column
  # to be added to answer vector

paste(answer, collapse = "")
# collapses answer into a character string instead of vector 

cat("A: ", paste(consensusAGCT[1,], collapse = ""));
cat("C: ", paste(consensusAGCT[2,], collapse = ""));
cat("G: ", paste(consensusAGCT[3,], collapse = ""));
cat("T: ", paste(consensusAGCT[4,], collapse = ""))

I am getting the following warnings when I run this code:

50: In answer[i] <- which(consensusAGCT[, i] == max(consensusAGCT[,  ... :
  number of items to replace is not a multiple of replacement length

These warnings occur because some columns (positions i) have more than one base tied for the maximum count, so which() returns several names at once. Rosalind accepts any one of the valid consensus strings. I created a smaller dataset to confirm that the ties are the cause of the warning: R keeps only the first tied base when assigning to answer[i], and the answer vector ends up the correct length.
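As a minimal sketch of the tie behaviour described above (the counts in `col` are made up for illustration, not taken from my dataset), `which()` returns every tied position, while base R's `which.max()` returns only the first one and so should avoid the warning:

```r
# Hypothetical column of counts with a tie between A and C
col <- c(A = 3L, C = 3L, G = 1L, T = 0L)

# which() keeps every position equal to the maximum, so two names come back
all_max <- names(which(col == max(col)))

# which.max() keeps only the first maximum, so exactly one name comes back
first_max <- names(which.max(col))

print(all_max)
print(first_max)
```

Assigning `first_max` to `answer[i]` always replaces one item with one item, which is why the length mismatch would not arise.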

To verify the consensus matrix, I checked the first few positions by hand against the input sequences, and they matched. I couldn't check every position in my longer dataset because it has 998 bases.
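One way the check could be done for every position at once, sketched here in base R on the sample dataset rather than on my real input: count the bases per column independently and confirm that each column sums to the number of sequences. The names `seqs`, `chars`, `bases` and `profile` are just illustrative.

```r
# Sample sequences from the problem statement (stand-in for the real input)
seqs <- c("ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
          "TTGGAACT", "ATGCCATT", "ATGGCACT")

# One row per sequence, one column per position
chars <- do.call(rbind, strsplit(seqs, ""))

# Count each base in every column; the factor levels fix the row order
bases <- c("A", "C", "G", "T")
profile <- vapply(
  seq_len(ncol(chars)),
  function(j) as.integer(table(factor(chars[, j], levels = bases))),
  integer(4)
)
rownames(profile) <- bases

# Every column should add up to the number of sequences
all_columns_ok <- all(colSums(profile) == length(seqs))
print(all_columns_ok)
print(profile)
```

If the input contains only A, C, G and T, the same `colSums()` comparison against the number of sequences should also hold for `consensusAGCT`, and the two matrices could be compared directly.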

This leads me to believe that the issue may be a formatting error. The code below is my attempt to put my answer into the format that Rosalind expects.

I'm not sure how to create line breaks in R so that the output prints on separate lines, so I'm currently adding them by hand. I don't think that is the problem, though, precisely because I'm doing it by hand.

paste(answer, collapse = "")
# collapses answer into a character string instead of vector 

cat("A: ", paste(consensusAGCT[1,], collapse = ""));
cat("C: ", paste(consensusAGCT[2,], collapse = ""));
cat("G: ", paste(consensusAGCT[3,], collapse = ""));
cat("T: ", paste(consensusAGCT[4,], collapse = ""))
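For comparison, here is a rough sketch of output that matches the sample format line for line. It uses a hand-typed matrix of the sample counts (`profile` is a stand-in for `consensusAGCT`, not my real data). Two details differ from the code above and may matter: `collapse = " "` keeps a space between counts, whereas `collapse = ""` glues them into one number, and `sep = "\n"` in `cat()` produces the line breaks.

```r
# Hand-typed 4 x 8 count matrix matching the sample output
profile <- rbind(
  A = c(5, 1, 0, 0, 5, 5, 0, 0),
  C = c(0, 0, 1, 4, 2, 0, 6, 1),
  G = c(1, 1, 6, 3, 0, 1, 0, 0),
  T = c(1, 5, 0, 0, 0, 1, 1, 6)
)

# The first maximum in each column gives one valid consensus string
consensus_string <- paste(
  rownames(profile)[apply(profile, 2, which.max)],
  collapse = ""
)

# Build "A: 5 1 0 ..." with single spaces between the counts
profile_lines <- paste0(
  rownames(profile), ": ",
  apply(profile, 1, paste, collapse = " ")
)

# sep = "\n" prints each element on its own line
cat(c(consensus_string, profile_lines), sep = "\n")
```

Note also that `cat("A: ", x)` inserts its own space between arguments, so a label that already ends in a space yields two spaces after the colon. Building the whole line with `paste0()` first avoids that.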

Any insights would be highly appreciated! I've not used Stack Overflow before, so if there's anything I'm not formatting optimally, please let me know. I'm very new to programming and I want to improve :)