
I have a set S of strings generated from DNA sequencing using a specific adapter fragment. This means that all the strings in S contain a suffix that approximately matches (due to sequencing errors) a prefix of the adapter sequence. How can I, given only the set S, infer the most likely adapter sequence used to generate S?

The set S is very large: roughly 1 million fragments, each 50 characters long. I know that building a generalized suffix tree over S would greatly help with this problem, but I am unsure how to use it to find the most likely adapter sequence.

zx8754
Wims
  • What kinds of sequencing errors can the strings contain? In particular, are there only (or mostly) substitution errors, or can there be insertions and/or deletions too? – Ilmari Karonen Oct 29 '16 at 00:14
    The errors are limited to just substitution errors. – Wims Oct 29 '16 at 00:17
  • keywords are `blast de novo assembly` GIYF – wildplasser Oct 29 '16 at 11:12
  • Is the adapter a subsequence of each string of length 50, i.e. can the adapter be located at a different position in each string? Or is it simpler than that: the adapter is the entire sequence of 50, and you want to infer the consensus 50 nt sequence across the 1 million sequences of S? Also, if the adapter is a subsequence, is its length known? – Vince Oct 29 '16 at 13:00
  • Thanks for the blast de novo assembly tip, I'll look into that. I want to infer several possible adapter sequences of different length. Finding a consensus sequence across the million sequences would be a good approach, as there will be some wrong reads in the sequencing. – Wims Oct 30 '16 at 17:48

1 Answer


Maybe this will suit your needs:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164228

Vince
  • Exactly what I need, thank you very much! The logic in the algorithm is also very straightforward: identify frequent k-mers across the set, sort them by how frequent they are, and align them into an output sequence. – Wims Oct 31 '16 at 10:36
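
For readers who want to see the idea from the last comment in code, here is a rough Python sketch. It is not the implementation from the linked paper, only an illustration of "count frequent k-mers, then chain them by overlap". The function names (`count_kmers`, `infer_adapter`) and the parameters `k`, `min_ratio` and `max_len` are made up for this example. It assumes substitution errors only (as stated in the comments), and that adapter k-mers are clearly more frequent than k-mers from the random inserts.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def infer_adapter(reads, k=8, min_ratio=0.5, max_len=60):
    """Greedy sketch: seed with the most frequent k-mer, then extend it
    one base at a time using the most frequent overlapping k-mer.

    min_ratio and max_len are hypothetical tuning knobs: extension stops
    when the best candidate is seen fewer than min_ratio * (seed count)
    times, or when the sequence reaches max_len.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counts = count_kmers(reads, k)
    if not counts:
        return ""

    seed, seed_count = counts.most_common(1)[0]
    threshold = seed_count * min_ratio
    seq = seed
    seen = {seed}  # guards against looping on repetitive sequence

    # Extend to the right: next k-mer shares its first k-1 bases with our tail.
    while len(seq) < max_len:
        tail = seq[-(k - 1):]
        best = max((tail + b for b in "ACGT"), key=lambda km: counts[km])
        if counts[best] < threshold or best in seen:
            break
        seen.add(best)
        seq += best[-1]

    # Extend to the left: previous k-mer shares its last k-1 bases with our head.
    while len(seq) < max_len:
        head = seq[:k - 1]
        best = max((b + head for b in "ACGT"), key=lambda km: counts[km])
        if counts[best] < threshold or best in seen:
            break
        seen.add(best)
        seq = best[0] + seq

    return seq
```

Because reads carry adapter prefixes of varying length, k-mers near the start of the adapter are seen most often and counts fall off toward its end, so `min_ratio` controls how far the extension reaches. Read-level substitution errors mostly produce rare k-mers that lose to the correct one at each extension step.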