Perl regular expression to match '>'

Question

This is how data is arranged in my file.

>Contig1
TGGCACCTTCGACAGTTGCTCCCTCCTGGGTGGGGGCCGTCTGACCTCGCTGTACTCCT
>Contig2
GGGCCTTGGGAAGCGCAGGTGCCGAGAACTTGGCTAGAGCGGTAGACAATGCGGTTCGTG
AAAAGAGCAACTTTAAATACTTGTACGACCTCAACCAGCCAGTCAAAGAGAAAATCGAG
>NODE_105957_length_443_cov_1.000000
TCAGAAGTTAATGCAATCTGGTCCATTAAGTAAATGGGTATCATGGTACATAAACTAAAA
GCACAGAACATGGATTATTTTCCCAATTTTAACTTTCCTAACCATTTTTATCTCTCTCAA
TAACTTCCACAGTAGTTTTTATTCGTCTCAATAACTTTATTAAAAGGGATCCCTCTATCC
CCAGAATTCAGTAGCTGCATACGACTTTCCTGTCACTAGAGATCCCTCAGATGTCGGTAG
TGCATTCATCTTAAGTGATAAATCAAATGTTAGTCAAGTTAGGAAGTGAGAATTGATACA
GAATTTCTACTTCAATACTAGCTATCCCAAAATGGTCATTGACGATTTATTTTTTTCCTA
CCAGCATATTCTTTTCTAGTATTTCAGATCTAGTGACTCAGAACTAGGACAATCATAAAT
TTGAAGGGAACCTTAAGTCTTTTTTCATGCTGAGACTGCCAAG
>NODE_105950_length_95_cov_1.000000
TCAGGTCCTACTTCATTTGTAAGGAAAACTGACAGGTAATTCAGTGGGACAGAATACCAT
GTGAAGAGTTTCCTCTCACCTGAGAGGAGACTTTTTGATGATGATGATGATCAAT

Can you please advise me on how to extract the sequences, i.e. the lines containing only A, T, G and C, with a newline between each successive set of sequences? This is the code I have thus far:

#!/usr/bin/perl

print "Enter the first filename\n";
$filename = <>;

print "Enter the output file for ids\n";
$filename1 = <>;

print "Enter the output file for sequences\n";
$filename2 = <>;
my $first = ">";
open(FILE, $filename) or die "Could not read from $filename, program halting.";
open(FIL, '>', $filename1) or die "Could not read from $filename1, program halting.";
open(FILES, '>', $filename2) or die "Could not read from $filename2, program halting.";
while(my $line = <FILE>)
{
    if ($line =~ m//s) 
        {
            print FILES $line, "\n";
        } 
    if ($line =~ m/^>/)
        {
            print FIL $line;
        }
}
close FILE;
close FIL;
close FILES;

It is just a basic Perl program to match patterns. Any help is appreciated.

I'm not sure I understand what you're asking. The title says you want a regexp to match `>`, and it looks like you have that with `/^>/`. What do you really need help with? — Barmar, Mar 27 '15 at 08:56
@Barmar I could extract the sequences using else statement but while uploading them, each sequence line shows up as a separate cell on the database. As you can see some sequences are multiple lines long. I can't seem to get individual sequences into a single line. Hope that helps. — The Last Word, Mar 27 '15 at 09:02
Remove the newline with `chomp`, and don't add a newline when you write them to the file. Add a newline when you see one of the `>` lines that starts a new sequence. — Barmar, Mar 27 '15 at 09:04
It looks like `if ($line =~ m//s)` is always true. Swap the conditions and use "if else": `if ($line =~ m/^>/) { ... } else { ... }` — Wiktor Stribiżew, Mar 27 '15 at 09:46
@stribizhev I tried the else statement but it again does not join individual sequence lines into one. — The Last Word, Mar 27 '15 at 09:50
If this is a fasta file, perhaps you should look up using software specifically developed for parsing it, like perhaps BioPerl? — TLP, Mar 27 '15 at 10:21

score 2 · Answer 1 · answered Mar 27 '15 at 08:57

You can use this regex:

/^[ATGC]+$/gm

Demo here: https://regex101.com/r/rQ9gN4/2

If you want to extract

NODE_105957_length_443_cov_1.000000 NODE_105950_length_95_cov_1.000000

negate the above regex:

/^([^ATGC]+)$/gm
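
As a rough sketch of how the first pattern could be applied in Perl when reading the file one line at a time: the helper name `is_sequence_line` and the sample lines are made up for illustration. The `/g` and `/m` flags are dropped here because each line is tested on its own.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): returns 1 when the line
# consists only of the letters A, T, G and C, otherwise 0.
sub is_sequence_line {
    my ($line) = @_;
    return $line =~ /^[ATGC]+$/ ? 1 : 0;
}

# Made-up sample lines in the same shape as the question's file.
my @lines = (
    '>Contig1',
    'TGGCACCTTC',
    '>NODE_105950_length_95_cov_1.000000',
    'TCAGGTCC',
);

for my $line (@lines) {
    printf "%-40s %s\n", $line, is_sequence_line($line) ? 'sequence' : 'other';
}
```

Note that a line containing any other character, such as an ambiguous base written as N, would be reported as 'other' by this pattern.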

Vladu Ionut

some sequences are multiple lines long. Could you please put them in one line? The code you have given splits multiple sequences into new lines eg: Contig2 has a 2 line sequence which has to be made into a single line. – The Last Word Mar 27 '15 at 09:06
see this regex https://regex101.com/r/rQ9gN4/3 /((?<!\>)[ATGC\s]+)/gs check if it's ok for your case and after i will update my answer – Vladu Ionut Mar 27 '15 at 09:14
Actually, I also think `((?<!\>)[ATGC]+)` works for the input you have on regex101. And no need of `s` option. – Wiktor Stribiżew Mar 27 '15 at 09:16
@stribizhev works, but each line becomes a newline. I would like to concatenate individual sequence lines into a single line. Eg: Contig2 has two lines which I would like to make into one. There need not be newlines in that case. – The Last Word Mar 27 '15 at 09:32
@VladuIonut I am sorry but your regex doesn't work on Perl. I don't know how it is working on regex101. – The Last Word Mar 27 '15 at 09:33

score 1 · Accepted Answer · answered Mar 27 '15 at 10:17

Have a try with:

#!/usr/bin/perl

# ALWAYS
use strict;
use warnings;

print "Enter the first filename\n";
chomp (my $filename = <>); # remove the line break

print "Enter the output file for ids\n";
chomp (my $filename1 = <>); # remove the line break

print "Enter the output file for sequences\n";
chomp (my $filename2 = <>); # remove the line break

# use three args open and show the reason when it fails
open(my $FILE,  '<', $filename)  or die "Unable to open '$filename', $!";
open(my $FILE1, '>', $filename1) or die "Unable to open '$filename1', $!";
open(my $FILE2, '>', $filename2) or die "Unable to open '$filename2', $!";

while(my $line = <$FILE>) {
    chomp($line);   # remove line break
    if ($line =~ /^>/) {
        print $FILE1 $line,"\n";
        # add a line break to filename2 unless we are at first line.
        print $FILE2 "\n" unless $. < 2;
    }
    else {
        print $FILE2 $line;
    }
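}

# end the last sequence with a line break, then close the handles
print $FILE2 "\n";
close $FILE;
close $FILE1;
close $FILE2;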
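
If the ids and sequences need to stay paired (for example, one database row per contig), a variation on the same idea is to collect them into records first. This is only a sketch under that assumption: the subroutine name `read_fasta` and the sample data are hypothetical, and it reads from an in-memory filehandle so it can be run as is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA-style records from a filehandle and
# return a list of [id, sequence] pairs, each sequence joined on one line.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';              # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];      # a header starts a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;     # append to the current sequence
        }
    }
    return @records;
}

# Demonstration on made-up data held in a string.
my $sample = ">Contig1\nTGGCA\n>Contig2\nGGGCC\nAAAAG\n";
open(my $in, '<', \$sample) or die "Unable to open in-memory sample, $!";
my @records = read_fasta($in);
close $in;

# One line per record: id, a tab, then the joined sequence.
print "$_->[0]\t$_->[1]\n" for @records;
```

With a real file, the same subroutine could be given the handle returned by the three-argument `open` above, and the ids and sequences written to the two output files from `@records`.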
