open(IN_FILE, $id_file) or die "Cant open $id_file file";

while (my $id_list= <IN_FILE>) {
    chomp $id_list;

    if ($id_list =~ m/^#|^$/g) {
        next;
    }
        # This Works WELL
        # if the file comes in QIIME format 
    elsif($otus_tag){
        if ($id_list =~ m/^$otus_tag\t/g) {
            @list_id = split /\t/, $id_list;

        }

    }
        # This is the section that I want to FIX !!!!
        # if the format are in space, tab, semicolon, comma or in new line.

    elsif(!$otus_tag){
        if ($id_list =~ m/\s|\t|\,|\;/g) {
             @list_id = split /\s|\t|\,|\;/, $id_list;
        }


    } 
}

I have this section of a Perl script that extracts a list of IDs from files in 6 different formats:

    Tab_delimited file:
    Y4.SW08.DCM.X4a_1386    Y4.SW08.DCM.X4a_1457    Y4.SW08.DCM.X4a_1590

    Tab_delimited_QIIME file:
    A100B1      Y4.SW08.DCM.X4a_1386    Y4.SW08.DCM.X4a_1457    Y4.SW08.DCM.X4a_1590

    Space_delimited file:
    Y4.SW08.DCM.X4a_1386 Y4.SW08.DCM.X4a_1457 Y4.SW08.DCM.X4a_1590

    Comma_delimited file:
    Y4.SW08.DCM.X4a_1386,Y4.SW08.DCM.X4a_1457,Y4.SW08.DCM.X4a_1590

    Semicolon_delimited file:
    Y4.SW08.DCM.X4a_1386;Y4.SW08.DCM.X4a_1457;Y4.SW08.DCM.X4a_1590

    List_delimited file:
    Y4.SW08.DCM.X4a_1386
    Y4.SW08.DCM.X4a_1457
    Y4.SW08.DCM.X4a_1590

The code currently works well for adding the IDs to an array, except with the last format, the list-delimited file. I have tried adding a \n to the following 2 lines:

if ($id_list =~ m/\s|\t|\,|\;|\n/g)
@list_id = split /\s|\t|\,|\;|\n/, $id_list;

But it does not add the IDs to the array when the file format is a list. Any ideas?

Thanks So Much

abraham
  • `\s` includes `\n` – toolic Feb 23 '17 at 19:10
  • Never use `if (/.../g)`*. Makes absolutely no sense, and behaves subtly different than `if (/.../)`. Use `if (/.../)`! (* - Unless you're unrolling a while loop.) – ikegami Feb 23 '17 at 19:18
  • I don't know what your question is, but I suspect you're wondering why `$id_list` doesn't contain more than one line even though you've only put one line in it? – ikegami Feb 23 '17 at 19:19
  • @toolic And `\t`! – Matt Jacob Feb 23 '17 at 19:19
  • What is the question? – Matt Jacob Feb 23 '17 at 19:20
  • I used without \n (if ($id_list =~ m/\s|\t|\,|\;/g)) but the only file that don't work with, is the list file !!!! ... Thanks !! – abraham Feb 23 '17 at 19:20
  • Does that mean your question is answered, or...? What exactly is the question? – Matt Jacob Feb 23 '17 at 19:22
  • The question is how to add the list of IDs to an array (@list_id) using 6 different formats; the only one that doesn't work is the last file: List_delimited file – abraham Feb 23 '17 at 19:23
  • You are reading the file _line by line_. What you call "_list delimited_" format is broken up into multiple lines and thus can't be parsed that way -- you aren't going to get the whole "_list_" per filehandle read, but just one line of it. Best do that format separately, since it is substantially different. – zdim Feb 23 '17 at 19:29
  • Is the only time you get more than one line when is when you have a so-called "list-delimited" file? – ikegami Feb 23 '17 at 19:34

1 Answer


I think you can simplify your code a bit, since there are some redundant regexes in there. You really just need to run split on each line of the file with a character class matching the possible delimiters. I might simplify it to this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $file = shift;

my @list_ids;
open(my $fh, "<", $file);
while (<$fh>) {
    # skip comment lines and blank lines
    next if /^#/ or /^\s*$/;
    # split on any run of whitespace (including the newline), commas, or semicolons
    my @elems = split /[\s,;]+/;
    # maybe another next regex / string comparison filter if $otus_tag?
    #next if $otus_tag and ! /.../;  ?
    #next unless $otus_tag eq $elems[0]; ?
    push @list_ids, @elems;
}
close $fh;

print "$_\n" for @list_ids;

This outputs the following after running through the 6 different file types:

$ for f in files/*file; do echo $f; ./parse_file.pl $f; echo; done
files/comma.file
Y4.SW08.DCM.X4a_1386
Y4.SW08.DCM.X4a_1457
Y4.SW08.DCM.X4a_1590

files/list.file
Y4.SW08.DCM.X4a_1386
Y4.SW08.DCM.X4a_1457
Y4.SW08.DCM.X4a_1590

files/semicolon.file
Y4.SW08.DCM.X4a_1386
Y4.SW08.DCM.X4a_1457
Y4.SW08.DCM.X4a_159

files/space.file
Y4.SW08.DCM.X4a_1386
Y4.SW08.DCM.X4a_1457
Y4.SW08.DCM.X4a_1590

files/tab.file
Y4.SW08.DCM.X4a_1386
Y4.SW08.DCM.X4a_1457
Y4.SW08.DCM.X4a_15

files/tab2.file
A100B1
Y4.SW08.DCM.X4a_1386
Y4.SW08.DCM.X4a_1457
Y4.SW08.DCM.X4a_159

I don't know what an otus_tag is or what you're trying to do with that variable, but I put a couple of ideas for filtering on it in the commented-out lines, if that's what you're after. The file I called 'tab2.file' is what I think is the otus_tag (QIIME) file requiring extra filtering, but your code suggests keeping the tag string in the output, so I don't know what you want to do there.
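If the goal is to pick out only the line that starts with a given tag and drop the tag itself, something like the following might work. This is only a sketch under my own assumptions: `ids_for_tag` is a hypothetical helper, the second sample line (`B200C2`) is made up for illustration, and I'm assuming the tag is always the first tab-separated field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): return the IDs from
# the first line whose leading tab-separated field equals $otus_tag.
# The tag itself is dropped from the returned list.
sub ids_for_tag {
    my ($otus_tag, @lines) = @_;
    for my $line (@lines) {
        chomp $line;
        # skip comment lines and blank lines
        next if $line =~ /^#/ or $line =~ /^\s*$/;
        my ($tag, @ids) = split /\t+/, $line;
        return @ids if $tag eq $otus_tag;
    }
    return;    # no matching line: empty list
}

# Sample data: the first data line mirrors the QIIME example in the
# question; the second one is invented for illustration.
my @lines = (
    "# comment line\n",
    "A100B1\tY4.SW08.DCM.X4a_1386\tY4.SW08.DCM.X4a_1457\tY4.SW08.DCM.X4a_1590\n",
    "B200C2\tY4.SW08.DCM.X4a_2001\n",
);

print "$_\n" for ids_for_tag('A100B1', @lines);
```

Comparing the first field with `eq` avoids interpolating `$otus_tag` into a regex, so a tag containing characters such as `.` or `+` is matched literally.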

When I run your script, after putting in a dummy value for $otus_tag (since we don't know what it is), I get the same answer as with my script. So I'm not entirely sure what's going wrong for you. Some example output of what you're getting, along with a sample of what you really want, would be helpful.
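One possible explanation, in line with the comments above: the file is read one line at a time, so in the list format each line holds a single ID. After chomp that line contains no delimiter, so the `if` guard in front of your split is false, and even when it is true `@list_id = split ...` replaces the array rather than adding to it. A minimal sketch of accumulating instead, where `split_ids` is a hypothetical helper of mine and the input lines are hard-coded rather than read from a file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one line on runs of whitespace, commas,
# or semicolons, discarding any empty leading field.
sub split_ids {
    my ($line) = @_;
    chomp $line;
    return grep { length } split /[\s,;]+/, $line;
}

# The list-delimited format as the read loop sees it: one ID per line.
my @list_lines = (
    "Y4.SW08.DCM.X4a_1386\n",
    "Y4.SW08.DCM.X4a_1457\n",
    "Y4.SW08.DCM.X4a_1590\n",
);

my @list_id;
for my $line (@list_lines) {
    # No guard here: a lone ID has no delimiter once the newline is
    # chomped, so a test like m/\s|\t|\,|\;/ would skip the line.
    # push accumulates across lines instead of overwriting the array.
    push @list_id, split_ids($line);
}

print scalar(@list_id), " ids collected\n";
```

The same loop handles the other delimiters unchanged, since a line with several IDs simply contributes several elements to the push.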

drmrgd