3

I have a question about parsing a vector of strings like this:

"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"

I want to get:

"M92R R236G"
"G98K"
"M34* G87K M389L"

When I use

while ($info1=~s/^(.*)\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,//) 
{
    $pos=$2; 
}

the result in $pos is only the last item of each row, that is:

"R236G"
"G98K"
"M389L"

How should I correct the script?

Jonathan Leffler
user2917442
  • Welcome to Stack Overflow. Please read the About page soon. Note that it helps to format code (and often data too) as 'code', by selecting the lines, then using the **`{}`** button above the edit box to indent it. – Jonathan Leffler Oct 24 '13 at 20:28
  • The `{1}` notation is not necessary; it can be omitted without changing the regex (as the prior pattern will be required once without it). – Jonathan Leffler Oct 24 '13 at 20:31

3 Answers

2

Using a one-liner:

$ perl -ne 'print q/"/ . join(" ", m/:([^,]+),/g) . qq/"\n/' file
"M92R R236G"
"G98K" 
"M34* G87K M389L"

In a script (B::Deparse shows what the one-liner expands to):

$ perl -MO=Deparse -ne 'print "\042" . join(" ", m/:([^,]+),/g) . "\042\n"' file

Resulting script:

LINE: while (defined($_ = <ARGV>)) {
    print '"' . join(' ', /:([^,]+),/g) . qq["\n];
}
Gilles Quénot
2

The reason your code isn't working is that you have a greedy ^(.*) at the start of the regular expression. That will take up as much of the target string as possible as long as the rest of the pattern matches, so the first substitution consumes everything up to and including the last occurrence of the substring, and that last occurrence is the only one you capture. You can fix the match by changing it to a non-greedy pattern ^(.*?).
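To make the difference concrete, here is a minimal sketch that applies a greedy and a non-greedy prefix to the first record from the question. The helper names greedy_hit and lazy_hit are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: greedy prefix, so .* backtracks only as far as it must
# and the LAST code in the record is captured.
sub greedy_hit {
    my ($record) = @_;
    my ($hit) = $record =~ /^.*:([A-Z*]\d+[A-Z*]),/;
    return $hit;
}

# Hypothetical helper: non-greedy prefix, so .*? grows only as far as it must
# and the FIRST code in the record is captured.
sub lazy_hit {
    my ($record) = @_;
    my ($hit) = $record =~ /^.*?:([A-Z*]\d+[A-Z*]),/;
    return $hit;
}

# First record from the question.
my $record = 'chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,';

print "greedy:     ", greedy_hit($record), "\n";   # R236G
print "non-greedy: ", lazy_hit($record),   "\n";   # M92R
```

With s/// in a loop, the greedy form therefore deletes the whole record on the first pass, while the non-greedy form removes one item per pass.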

A few other notes on your regular expression:

  • There is no need to escape : or , anywhere in a pattern, nor * when it is inside a character class [...]

  • There is never a need for the quantifier {1} as that is the effect of a pattern without a quantifier

  • There is no need to put \d inside a character class [\d], as it works fine on its own

  • There is no need to enclose subpatterns in parentheses unless you need access to whatever substring matched that subpattern when the match succeeds. So, for instance ^.* is fine without the parentheses
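As a quick check of the notes above, the following sketch compares the pattern exactly as posted with a stripped-down version and confirms that both capture the same code. Both patterns are deliberately left greedy, so this only tests the notation, not the greediness fix. The helper name same_capture is hypothetical.

```perl
use strict;
use warnings;

# Pattern exactly as posted in the question.
my $verbose = qr/^(.*)\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,/;

# Same pattern without the redundant escapes, {1}, [\d] and extra groups.
my $concise = qr/^.*:([A-Z*]\d+[A-Z*]),/;

# Hypothetical helper: true when both patterns capture the same code.
sub same_capture {
    my ($record) = @_;
    my $from_verbose   = ( $record =~ $verbose )[1];   # group 2 of the original
    my ($from_concise) = $record =~ $concise;          # group 1 of the rewrite
    return defined $from_verbose
        && defined $from_concise
        && $from_verbose eq $from_concise;
}

# Records from the question.
for my $record (
    'chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,',
    'chr1-247951785-G-.:G98K,',
    'chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,',
) {
    print same_capture($record) ? "same\n" : "different\n";
}
```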

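This modification of your code applies the non-greedy fix and the simplifications above, and is much more concise. Because $pos is reassigned on every pass, use it inside the loop body or save each value as you go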

while ($info1 =~ s/^.*?:([A-Z*]\d+[A-Z*]),// ) {
  my $pos = $1;
  ...
}
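If the loop still seems to return only the last item, the likely cause is that $pos is read after the loop has finished, by which time it holds the final match. A minimal sketch of one way to keep every value, assuming the records are available as plain strings; the sub name collect_codes and the array @pos are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: strip one item per pass and remember each captured code.
sub collect_codes {
    my ($info1) = @_;    # a copy, because s/// consumes the string
    my @pos;
    while ( $info1 =~ s/^.*?:([A-Z*]\d+[A-Z*]),// ) {
        push @pos, $1;   # save the code before the next pass replaces $1
    }
    return join ' ', @pos;
}

# Records from the question.
for my $record (
    'chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,',
    'chr1-247951785-G-.:G98K,',
    'chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,',
) {
    print '"', collect_codes($record), qq{"\n};
}
```

Passing the record into a sub keeps the caller's string intact, since the substitution works on the copy.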

But the best solution is to use a global match that finds all occurrences of a pattern within a string, and doesn't need to modify the string in the process.

This program does what you describe. It just looks for every run of uppercase letters, digits, or asterisks that follows a colon in each record.

use strict;
use warnings;

while (<DATA>) {
  my @fields = /:([A-Z0-9*]+)/g;
  print "@fields\n";
}

__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"

output

M92R R236G
G98K
M34* G87K M389L
Borodin
  • Hi, I tried the one you corrected based on my script, but it still shows the same thing as I get before. Can you help me for fixing that. Since I am pretty new to perl, I do not quite understand how to put the new one you provided into my script. Thank you a lot. – user2917442 Oct 24 '13 at 21:24
0

You can use a regex that matches a colon followed by some alphanumeric characters (or asterisks), save the matches in an array, and print them at the end of each loop iteration. Here is an example:

#!/usr/bin/env perl

use strict;
use warnings;

my (@data);

while ( <DATA> ) { 
    while ( m/:([[:alnum:]*]+)/g ) { 
        push @data, $1; 
    }   
    printf qq|"%s"\n|, join q| |, @data;
    undef @data;
}

__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"

Run it like:

perl script.pl

That yields:

"M92R R236G"
"G98K"
"M34* G87K M389L"
Birei