(This is sort of a follow-up question to one I asked before.)

So I have a DNA string

e.g.

acagatgaaggaggacgcgcccccgccgctgtcctgcgcctcagccatcctatgagacgg

and I have 20 different three-letter patterns (each combination corresponds to an amino acid) that I want to match against the data. My Java program looks at three letters at a time and tries to match them to one of the patterns. I eventually want to count the number of times each amino acid appears, so when I find a match I need to increment the corresponding counter. I want to implement this using Java regex, so I have:

Pattern A = Pattern.compile("(gct)|(gcc)|(gca)|(gcg)");
Pattern C = Pattern.compile("(tgt)|(tgc)");
Pattern D = Pattern.compile("(gat)|(gac)");

etc.

However, I now realise that you need to make a matcher for EACH pattern and you can't use one matcher to search for ALL patterns; what is the best way for me to achieve what I am trying to do?

user1058210
  • Do you know where the boundaries are in your string? Is it every third letter? If you don't know, I think the regex will be complicated by the fact that some of the patterns will overlap. For example, starting at the 17th character in the string above, you have 'gcgcc' - this will match 'gcg' starting at the first position, but will also match 'gcc' starting at the third position. These matches share the second g in the string, so I doubt this actually signifies two different acids in those positions. – Mike C Dec 09 '11 at 13:14
  • user1058210, it's really not cool to get these guys to do your GC04 coursework; you should really do it yourself. Peace! – StrangeLondoner Dec 12 '11 at 18:42

4 Answers

I wouldn't use a regex here. You have four letters, hence 4^3 = 64 possible triplets. Just loop over the string, translate every triplet to its number (a -> 0, c -> 1, g -> 2, t -> 3, so gcc -> 2*4^2 + 1*4^1 + 1*4^0 = 37), increment counter[number], and at the end ignore the ones you didn't want. (If you also want the positions, it is probably worth checking whether the current triplet is one of the wanted ones before inserting it into the appropriate list, to save some space.)
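A minimal Java sketch of the counting loop described above. The class and method names (TripletCounter, digit, encode, countTriplets) are made up for illustration, and it assumes lowercase input that is read in steps of three from index 0:

```java
class TripletCounter {
    // Map a base to its base-4 digit: a -> 0, c -> 1, g -> 2, t -> 3.
    static int digit(char base) {
        switch (base) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default: throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    // Encode the triplet starting at index i as a number in 0..63.
    static int encode(String dna, int i) {
        return digit(dna.charAt(i)) * 16
                + digit(dna.charAt(i + 1)) * 4
                + digit(dna.charAt(i + 2));
    }

    // Count every complete triplet; a trailing partial triplet is ignored.
    static int[] countTriplets(String dna) {
        int[] counter = new int[64];
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            counter[encode(dna, i)]++;
        }
        return counter;
    }

    public static void main(String[] args) {
        String dna = "acagatgaaggaggacgcgcccccgccgctgtcctgcgcctcagccatcctatgagacgg";
        int[] counter = countTriplets(dna);

        // Alanine is gct, gcc, gca or gcg: sum the four slots at the end.
        int alanine = counter[encode("gct", 0)] + counter[encode("gcc", 0)]
                + counter[encode("gca", 0)] + counter[encode("gcg", 0)];

        System.out.println(encode("gcc", 0)); // prints 37
        System.out.println(alanine);          // prints 3
    }
}
```

Summing the slots per amino acid only at the end keeps the inner loop free of any per-pattern branching.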

Daniel Fischer

You can use an expression like

(?<gc>gc[tcag])|(?<tg>tg[tc])|(?<ga>ga[tc])

However, you cannot run the regex directly against the whole string: it ignores the three-letter boundaries and will give you a lot of false positives.

The only valid triplets in your string (for these three patterns) are GAT, GCC and GCT, but if you run a regex against the whole string you can find up to 11 results.

So you first need to split the string into three-character groups and then match the pattern against each group.

Try the regex:

(?<gc>gc(?=[tcag]))|(?<tg>tg(?=[tc]))|(?<ga>ga(?=[tc]))

against your string and it will come up with 11 results. Not what you want.

Be careful what you wish for: you might get it (or much more than you wanted).

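' Requires: Imports System.Collections and Imports System.Text.RegularExpressions
Dim dna As String = "acagatgaaggaggacgcgcccccgccgctgtcctgcgcctcagccatcctatgagacgg"
Dim pattern As New ArrayList
' Step through the string one triplet at a time; stop before a trailing partial triplet
For x = 0 To dna.Length - 3 Step 3
    Dim match As Match = Regex.Match(dna.Substring(x, 3), "(?<gc>gc[tcag])|(?<tg>tg[tc])|(?<ga>ga[tc])")
    If match.Success Then
        pattern.Add(match.Value)
    End If
Next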

This will give you 4 results

gat - gcc - gcc - gct

matching all three valid triplets (gat, gcc and gct).
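Since the question is about Java, the same split-then-match idea might look roughly like this. It is only a sketch: the class name CodonRegexCounter and the GROUPS array are made up for illustration, and the group names are the ones from the regex above. The named group that is non-null after a match tells you which counter to increment:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodonRegexCounter {
    // One named group per amino acid, as in the expression above.
    static final String[] GROUPS = {"gc", "tg", "ga"};
    static final Pattern CODON =
            Pattern.compile("(?<gc>gc[tcag])|(?<tg>tg[tc])|(?<ga>ga[tc])");

    static Map<String, Integer> count(String dna) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String g : GROUPS) {
            counts.put(g, 0);
        }
        Matcher m = CODON.matcher("");
        // Walk the string one complete triplet at a time.
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            m.reset(dna.substring(i, i + 3));
            if (!m.matches()) {
                continue; // not one of the wanted triplets
            }
            // Exactly one alternative matched; its group is the non-null one.
            for (String g : GROUPS) {
                if (m.group(g) != null) {
                    counts.merge(g, 1, Integer::sum);
                    break;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String dna = "acagatgaaggaggacgcgcccccgccgctgtcctgcgcctcagccatcctatgagacgg";
        System.out.println(count(dna)); // prints {gc=3, tg=0, ga=1}
    }
}
```

A single Matcher is reused via reset(), so only one Pattern and one Matcher are needed for all the alternatives.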

Sedecimdies

Factorize your regex:

(gc[tcag]|(tg|ga)[tc])
FailedDev

Give this a try:

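import java.util.regex.Matcher;
import java.util.regex.Pattern;

String dna = "acagatgaaggaggacgcgcccccgccgctgtcctgcgcctcagccatcctatgagacgg";

// Alanine triplets only; find() counts non-overlapping matches anywhere
// in the string, regardless of the three-letter boundaries.
Pattern p = Pattern.compile("(gct)|(gcc)|(gca)|(gcg)");
Matcher m = p.matcher(dna);

int count = 0;
while (m.find()) {
  count++;
}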
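The find() loop can also be kept on triplet boundaries and extended to several amino acids with one Matcher. The sketch below is one possible way, not part of the original answer: it anchors each match with \G (the end of the previous match) and adds a catch-all [acgt]{3} alternative so that every match consumes exactly one triplet. The class name FrameAlignedCounter and the group names A, C and D are assumptions chosen to mirror the pattern names in the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FrameAlignedCounter {
    // \G forces each match to start where the previous one ended, and the
    // last alternative swallows any other triplet so the scan stays aligned.
    static final Pattern CODON = Pattern.compile(
            "\\G(?:(?<A>gc[acgt])|(?<C>tg[ct])|(?<D>ga[ct])|[acgt]{3})");

    // Returns the counts in the order A, C, D.
    static int[] count(String dna) {
        int[] counts = new int[3];
        Matcher m = CODON.matcher(dna);
        while (m.find()) {
            if (m.group("A") != null) {
                counts[0]++;
            } else if (m.group("C") != null) {
                counts[1]++;
            } else if (m.group("D") != null) {
                counts[2]++;
            }
            // Otherwise the catch-all matched: a triplet we are not counting.
        }
        return counts;
    }

    public static void main(String[] args) {
        String dna = "acagatgaaggaggacgcgcccccgccgctgtcctgcgcctcagccatcctatgagacgg";
        int[] counts = count(dna);
        System.out.println("A=" + counts[0] + " C=" + counts[1] + " D=" + counts[2]);
        // prints A=3 C=0 D=1
    }
}
```

Scanning stops at the first character that is not a, c, g or t, or when fewer than three letters remain, because \G does not allow the matcher to skip ahead.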
Taha