cluster short, homogeneous strings (DNA) according to common sub-patterns and extract consensus of classes

Question

Task:
to cluster a large pool of short DNA fragments in classes that share common sub-sequence-patterns and find the consensus sequence of each class.

Pool: ca. 300 sequence fragments
8 - 20 letters per fragment
4 possible letters: a,g,t,c
each fragment is structured in three regions:
1. 5 generic letters
2. 8 or more positions of g's and c's
3. 5 generic letters
  (As regex that would be [gcta]{5}[gc]{8,}[gcta]{5})
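
A minimal sketch of how that pattern could be used to split a fragment into its three regions, assuming lowercase-normalised input; the function name `split_fragment` is made up for illustration:

```python
import re

# Pattern from the question: 5 generic bases, 8 or more g/c bases, 5 generic bases.
FRAGMENT_RE = re.compile(r"([gcta]{5})([gc]{8,})([gcta]{5})")

def split_fragment(fragment):
    """Return (left flank, GC region, right flank), or None if the fragment does not fit."""
    match = FRAGMENT_RE.fullmatch(fragment.lower())
    return match.groups() if match else None

print(split_fragment("atata" + "gcgcgcgc" + "tatat"))
```

Note that when the flanks themselves contain g or c, the boundary between regions is ambiguous; the regex engine simply picks one valid split.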

Plan:
to perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2 and their consensus sequences.

Questions:

Are my fragments too short, and would it help to increase their size?
Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
Which alternative methods or tools can you suggest for this task?

Best regards,

Simon

This is a very interesting insight into the kinds of things *bioinformatics* does with DNA sequences. I'd upvote it, but the arrow says 'this question is useful and clear', not 'this is an interesting question'. — pavium, Oct 02 '09 at 13:02
Where are your DNA fragments coming from, and what are you trying to represent? It's hard to know how short is "too short" without more information. Also, what are you trying to represent, and what do you mean by "showing patterns in the sequence?" — James Thompson, Oct 05 '09 at 05:20
I want to find out if there exists a consensus within the GC regions among the fragments. So that I can say: The fragments not only contain a GC repeat, but the GC repeat also shows a common pattern (if it actually does). The fragments are just randomly picked GC repeats (plus a frame of their 10 closest neighbor bases; this can be of course changed or removed) from the human genome. — SimonSalman, Oct 15 '09 at 08:42

score 2 · Answer 1 · answered Nov 16 '09 at 05:59

Yes, 300 is FAR TOO FEW, considering that this is the human genome and you're essentially just looking for a particular kind of 8-mer. There are 4^8 = 65,536 possible 8-mers and roughly 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). You'll find G/C-only 8-mers about 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably many more, since the genome is full of CpG islands). Why only choose 300?
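
The back-of-the-envelope estimate can be checked in a few lines. This is only a sketch: it assumes all four bases are equally likely and independent, which is not true of a real genome, and the constant names are made up for illustration.

```python
GENOME_BASES = 3_000_000_000  # approximate genome size used in the estimate
ALL_8MERS = 4 ** 8            # 65,536 possible 8-mers over a, c, g, t
GC_8MERS = 2 ** 8             # 256 of those consist only of g and c

# Expected number of positions where a g/c-only 8-mer starts,
# under the (assumed) uniform, independent base model.
expected_gc_8mers = GENOME_BASES / ALL_8MERS * GC_8MERS
print(round(expected_gc_8mers))  # 11718750, i.e. roughly 12 million
```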

You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC and extend until you get your first non-G-or-C. Then take that sequence and its context and save them (in a DB). Rinse and repeat.
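
A rough sketch of such a scan, without regexes. It collects every maximal run of G/C bases of at least `min_len`, which is a slight generalisation of "start at the first CG or GC". The function name `gc_runs` and the defaults `min_len=8` and `flank=5` are assumptions taken from the pattern in the question, and saving to a DB is left out.

```python
def gc_runs(sequence, min_len=8, flank=5):
    """Yield (start, left flank, GC run, right flank) for each maximal g/c run."""
    seq = sequence.lower()
    i, n = 0, len(seq)
    while i < n:
        if seq[i] in "gc":
            # Extend until the first base that is neither g nor c.
            j = i
            while j < n and seq[j] in "gc":
                j += 1
            if j - i >= min_len:
                yield i, seq[max(0, i - flank):i], seq[i:j], seq[j:j + flank]
            i = j
        else:
            i += 1

for hit in gc_runs("atatagcgcgcgctatatggat"):
    print(hit)
```

Because it is a generator, the same function can be fed one chromosome at a time and the hits written out as they are found.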

For this project, Clustal may be overkill -- but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering like so:

Make a database entry for each G/C 8-mer (2^8 = 256 in all).
Take each GC-region and walk it to see which 8-mers it contains.
Tag each GC-region with the sequences it contains.

Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.
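
The three steps above can be sketched with an in-memory dictionary standing in for the database. This is an illustration only; `index_by_kmer` is a made-up name, and regions are identified by their position in the input list.

```python
from collections import defaultdict

def index_by_kmer(gc_regions, k=8):
    """Map each k-mer to the set of indices of the GC regions that contain it."""
    index = defaultdict(set)
    for region_id, region in enumerate(gc_regions):
        # Walk the region with a sliding window of width k.
        for start in range(len(region) - k + 1):
            index[region[start:start + k]].add(region_id)
    return index

regions = ["gcgcgcgcg", "cgcgcgcg", "gggggggg"]
index = index_by_kmer(regions)
for kmer in sorted(index):
    print(kmer, sorted(index[kmer]))
```

Regions that share an 8-mer end up under the same key, which gives a crude clustering to start the analysis from.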

that sounds like an approach I should try :) — SimonSalman, Nov 30 '09 at 08:34
What exactly are you trying to find out? — Ron Gejman, Nov 30 '09 at 16:16

score 1 · Answer 2 · answered Oct 02 '09 at 13:17

Your region two, with only 2 letters, may end up a bit too similar; increasing the length or the variability (e.g. more letters) could help.

Calyth
