I'm trying to extract DNA data from a file. The data, made up of the bases G, C, A and T, is preceded by the word ORIGIN and followed by //. How do I write a regular expression that gets the bases between these markers?

I have tried the following but it doesn't work.

[ORIGIN(GCATgcat)////]

Sample data:

ORIGIN      
  1 acagatgaag acagatgaag acagatgaag acagatgaag
  2 acagatgaag acagatgaag acagatgaag acagatgaag
//
Gilles 'SO- stop being evil'
user1044585

2 Answers

2

Try the pattern "\\b([GCATgcat]+)\\b", which matches any run of the letters G, C, A and T (upper or lower case) bounded by word boundaries, so it won't match those letters embedded in a longer word such as "catalog". Scanning your sample file repeatedly with this regex extracts each sequence. Note that ordinary short words made only of those letters (such as "a" or "cat") also match, so the search should be limited to the text between the markers.

Here's a working example for your sample file:

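import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Locate the substring between "ORIGIN" and "//" in the file.
String fileContents = getSampleFileContents();
int indexOfOrigin = fileContents.indexOf("ORIGIN");
int indexOfEnd = fileContents.indexOf("//", indexOfOrigin);
if (indexOfOrigin < 0 || indexOfEnd < 0) {
  // Without this check, substring() would throw on a file lacking a marker.
  throw new IllegalStateException("Missing ORIGIN or // marker");
}
// Start just past the word "ORIGIN" so only the data lines are searched.
String pertinentSection = fileContents.substring(
    indexOfOrigin + "ORIGIN".length(), indexOfEnd);

// Search for sequences within the pertinent substring.
Pattern p = Pattern.compile("\\b([GCATgcat]+)\\b");
Matcher m = p.matcher(pertinentSection);
List<String> sequences = new ArrayList<String>();
while (m.find()) {
  sequences.add(m.group(1));
}
sequences.toString(); // => ["acagatgaag", "acagatgaag", ..., "acagatgaag"]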
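
As an alternative, the two markers can be folded into a single regular expression. The following is only a sketch, not part of the approach above: it assumes the whole file fits in one String, and the class name OriginExtractor and the method extractBases are made up for illustration. Pattern.DOTALL lets "." match newlines, and the reluctant ".*?" stops at the first "//" after ORIGIN.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OriginExtractor {
    // DOTALL lets '.' span line breaks; '.*?' stops at the first "//".
    private static final Pattern RECORD =
        Pattern.compile("ORIGIN(.*?)//", Pattern.DOTALL);

    /** Returns the bases between ORIGIN and //, or "" if a marker is missing. */
    public static String extractBases(String text) {
        Matcher m = RECORD.matcher(text);
        if (!m.find()) {
            return "";
        }
        // Drop the line numbers and all whitespace, keeping only base letters.
        return m.group(1).replaceAll("[\\s\\d]+", "");
    }

    public static void main(String[] args) {
        String sample = "header text\nORIGIN      \n"
            + "  1 acagatgaag acagatgaag\n"
            + "  2 acagatgaag acagatgaag\n"
            + "//\na trailing sentence";
        System.out.println(extractBases(sample));
    }
}
```

This returns the bases as one continuous string rather than a list of ten-letter groups, which may or may not be what you need.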
maerics
  • OK, I just did that but it's still not working. Would it matter if there was a new line between ORIGIN and the data, and between the data and //? – user1044585 Dec 07 '11 at 15:33
  • @user1044585: yes, any character in the matching string, including whitespace and newlines, will affect the regular expression. Please update your question with some sample data strings exactly as they are, since that is the crux of the issue. – maerics Dec 07 '11 at 15:35
  • Dude, thanks so much! That worked really well. One problem, though: there are sentences after the // that follows the DNA sequences, so I get a few a's and other stray matches at the end that shouldn't be there. The same happens at the start. – user1044585 Dec 07 '11 at 17:30
  • @user1044585: see my updated answer with an example of how to identify the pertinent substring between "ORIGIN" and "//". – maerics Dec 07 '11 at 17:53
0

For all of us who aren't regex super-wizards, I'd suggest a two-step approach: first remove the obvious cruft such as the digits and newlines, then do the match. For example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    static String NL = "\n";
    static String INPUT = "stuff at beginning ORIGIN" + NL +
        "1 acagatgaag acagatgaag acagatgaag acagatgaag" + NL + NL +
        "2 acagatgaag acagatgaag acagatgaag acagatgaag" + NL +
        "// I added stuff here at the end that should be ignored";

    public static void main(String[] args) {
        // Step 1: strip whitespace and digits so the bases form one run.
        Pattern removePattern = Pattern.compile("[\\r\\n \\t\\d]+");
        // Step 2: capture only the bases between the two markers.
        Pattern findPattern = Pattern.compile("ORIGIN([GCATgcat]+)//");

        Matcher removeMatcher = removePattern.matcher(INPUT);
        String clean = removeMatcher.replaceAll("");

        Matcher findMatcher = findPattern.matcher(clean);
        if (findMatcher.find()) {
            // group(1) excludes the ORIGIN and // markers themselves.
            System.out.println(findMatcher.group(1));
        }
    }
}
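
Many sequence files hold more than one record, and the same two-step idea extends to that case by replacing the if with a while loop. The sketch below is an illustration under assumptions: the class name MultiRecordRegex and the method findAll are invented here, and the character class includes 'U' and 'u' on the assumption that RNA records should match as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiRecordRegex {
    // Step 1 pattern: whitespace and digits to strip out.
    private static final Pattern REMOVE = Pattern.compile("[\\s\\d]+");
    // Step 2 pattern: bases between the markers ('U'/'u' assumed for RNA).
    private static final Pattern FIND =
        Pattern.compile("ORIGIN([GCATUgcatu]+)//");

    /** Returns the bases of every ORIGIN ... // record, in file order. */
    public static List<String> findAll(String input) {
        String clean = REMOVE.matcher(input).replaceAll("");
        List<String> sequences = new ArrayList<>();
        Matcher m = FIND.matcher(clean);
        while (m.find()) {          // one iteration per record found
            sequences.add(m.group(1));
        }
        return sequences;
    }

    public static void main(String[] args) {
        String input = "first ORIGIN\n1 acgt acgt\n//\n"
            + "second ORIGIN\n1 ggcc\n2 uuaa\n//\nend";
        System.out.println(findAll(input));
    }
}
```

In real code the input string would come from the file, for example via Files.readString on Java 11 or later; an input with no complete record simply yields an empty list.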
user949300
  • P.S. You might want to add 'U' and 'u' to the possible bases to cover RNA. – user949300 Dec 07 '11 at 16:31
  • This isn't the ideal solution as I'm reading my data from a file, but thanks. – user1044585 Dec 07 '11 at 17:35
  • I have to get the String from somewhere for the demo code. Obviously you'd read it from the file in the real code. The code you accepted also just uses a String! This code is more robust to errors: if the file doesn't contain "ORIGIN", @maerics's code will blow up. Also, with minor work (a while loop) my code could find multiple sequences in the file. Many DNA database files contain more than one sequence. – user949300 Dec 07 '11 at 19:29