9

(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)

The task I'm trying to achieve is to align several sequences.

I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I have had missing values introduced to them at random points.

Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe.

My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same)

                     Real_sequence           The_sequence_we_see
1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG

Here is an example code to reproduce the above example:

ATCG <- c("A","T","C","G")
set.seed(40)
original.seq <- sample(ATCG, 30, T)
seqS <- matrix(original.seq,200,30, T)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG) 
{
    number.of.changes <- sample(seq_len(number.of.changes), 1)
    new.letters <- sample(letters.to.change.with , number.of.changes, T)
    where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
    x[where.to.change.the.letters] <- new.letters
    return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-") 
insert.missing.values(original.seq)

seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))

seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
require(stringr)
# library(help=stringr)
all.seqS <- str_replace(seqS4,"-" , "")

# how do we allign this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)

I understand that if all I had was a string and a pattern I would be able to use

library(Biostrings)
pairwiseAlignment(...)

But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).

Is there a known method for doing this in R?

zx8754
  • 52,746
  • 12
  • 114
  • 209
Tal Galili
  • 24,605
  • 44
  • 129
  • 187

4 Answers4

9

Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.

UBod
  • 825
  • 7
  • 11
  • 2
    Thx for updating the information. I lost the accepted answer, but I don't mind. After 7 years that answer was well overdue anyway. – Joris Meys Jun 13 '17 at 08:03
9

Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware of the fact that you have to install this algorithm first.

Alternatively, you can use any of the available algorithms (eg ClustalW, MAFFT, T-COFFEE) and import the multiple sequence alignemts in R using bioconductor functionality. See eg here..

Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • Hi Joris. I am glad to know that such an interface exists (I wasn't able to find it from the searches I performed). I'm not sure what you mean with "you have to install this algorithm first." (did you mean package?) - but I guess I'll find out soon :) Thank you for your help :) Tal – Tal Galili Dec 21 '10 at 18:04
  • 1
    it's not a package, it's a separate executable: http://www.drive5.com/muscle/downloads.htm – Ben Bolker Dec 21 '10 at 19:38
  • As Ben said. MUSCLE is a well-known and well performing algorithm for multiple alignments. So you'll have to install the executable first and then play around with the bio3d package. It's a bit an odd place to look for it, as the bio3d is mostly for protein modelling. But MUSCLE can align multiple DNA sequences as well. – Joris Meys Dec 21 '10 at 20:47
  • Note that there are new packages for this, incl. `msa` which can handle MUSCLE, CLUSTALW and CLUSTALOMEGA algorithms. On the plus side, it does not depend on any outside software. – Roman Luštrik Dec 30 '16 at 10:31
4

You can perform multiple alignment in R with the DECIPHER package.

Following your example, it would look something like:

library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)

It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!

Erik Wright
  • 135
  • 1
  • 5
  • doesn't DECIPHER assume the data being read in is in FASTA format? I did not see any examples for when the strings are raw characters – Vass Aug 24 '20 at 00:18
0

You are looking for a global alignment algorithm on multiple sequences. Did you look at Wikipedia before asking ?

First learn what global alignment is, then look for multiple sequence alignment.

Wikipedia doesn't give a lot of details about algorithms, but this paper is better.

Joris Meys
  • 106,551
  • 31
  • 221
  • 263
Jules Olléon
  • 6,733
  • 6
  • 37
  • 47
  • 3
    @Jules This doesn't tell him how to do it in R. I believe Tal is not really helped by sending him some links he could easily google himself. And probably did long before he asked this question. – Joris Meys Dec 21 '10 at 10:46
  • Hello Jules. Thank you for the good intentions and the link to the presentation. Cheers, Tal – Tal Galili Dec 21 '10 at 18:02
  • @Joris Given the way he presented his question, it looked like Tal wasn't aware that this is a standard bio-informatics problem, that's why I thought some background knowledge about this field might help him... but you're right, I should have posted that as a comment and not as an answer. – Jules Olléon Dec 21 '10 at 23:30
  • @Jules: No harm meant. You can't know Tal is a biostatistician and has a quite profound knowledge of R, so I might have been a bit harsh in my comment. – Joris Meys Dec 21 '10 at 23:53
  • 1
    Hi Jules. You are correct in that it's my first time trying to do such a task with R, hence my query. And as said, suggesting a background reading was kind of you. Joris - both you and I know that very few people poses a profound knowledge of R, and as I am today, I would not presume to be one of them. I still very much appreciate your sentiment :) Yours, Tal – Tal Galili Dec 22 '10 at 10:38
  • @Jules: In the light of Tals kind comments, I removed my downvote. I had to add a space to your post in order to be able to do so. – Joris Meys Dec 24 '10 at 00:16