
I have a piece of code which opens a file and parses it. The text document has a repetitive structure with multiple entries. Within my loop I need to peek ahead at the next line to see whether it starts a new entry; if it does, I can finish parsing all of the data my program has extracted for the current entry. Let me first show my implementation so far:

use strict;
open(my $fileHandler, "<", "test.txt") or die "Can't open test.txt: $!";

while(my $line = <$fileHandler>) {
    ## right here I want to look at the next line to see if 
    ## $line =~ m/>/ where > denotes a new entry
}
user1876508
  • If you are on a *nix environment why not use `uniq` first to remove duplicates? – squiguy Jan 15 '13 at 18:06
  • How big are the files? If you can load it into an array, then use a `for` loop, you can peek ahead by just adding 1 to the index. – friedo Jan 15 '13 at 18:07
  • how do you use a for loop to look through it? – user1876508 Jan 15 '13 at 18:21
  • @squiguy I am in windows – user1876508 Jan 15 '13 at 18:21
  • you can always use `seek` and `tell` to move all around the files, however you want to. http://perldoc.perl.org/functions/seek.html – asf107 Jan 15 '13 at 18:44
  • @asf107 Mixing seek and tell with readline() is usually just complicated. – TLP Jan 15 '13 at 18:59
  • are there any better tutorials online? this is pretty confusing – user1876508 Jan 15 '13 at 19:09
  • @user1876508 Tutorials for what? perldoc perlopentut? Try `Tie::File`, it is very easy to understand. – TLP Jan 15 '13 at 19:11
  • tutorials for seek() and tell() – user1876508 Jan 15 '13 at 19:17
  • @user1876508 Don't use seek and tell. Almost always when you think that you have to rewind or fast forward a file, you're doing something wrong. The documentation is in perldoc, though, if you feel like reading it. http://perldoc.perl.org/functions/seek.html – TLP Jan 15 '13 at 19:19
  • They're not called "file handlers", they're called "file handles" – ikegami Jan 15 '13 at 19:29
  • If you are parsing a FASTA formatted file, then you can use **[BioPerl](http://www.bioperl.org/wiki/Main_Page)**. Here is an example **[retrieving a sequence from a file](http://www.bioperl.org/wiki/HOWTO:Beginners#Retrieving_a_sequence_from_a_file)**. – Chris Charley Jan 15 '13 at 19:32
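For completeness, the `seek`/`tell` approach mentioned in the comments can be sketched like this (a minimal, untested sketch; the file name and the `/^>/` entry marker are taken from the question):

```perl
use strict;
use warnings;

open(my $fh, "<", "test.txt") or die "Can't open test.txt: $!";

while (my $line = <$fh>) {
    my $pos  = tell($fh);    # remember the current position
    my $next = <$fh>;        # peek at the next line
    seek($fh, $pos, 0);      # rewind so the loop reads it again normally

    if (defined $next && $next =~ /^>/) {
        # the next line starts a new entry; finish processing $line here
    }
}
close $fh;
```

As the comments note, a read-ahead buffer (see the accepted answer) is usually simpler than mixing `seek`/`tell` with `readline`.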

3 Answers


Try handling the iteration yourself:

my $line = <$fileHandler>;
while(1) { # keep looping until I say so
    my $nextLine = <$fileHandler>;

    if ($line =~ m/>/ || !defined $nextLine) {
        ### Do the stuff
    }
    ### Do any other stuff;

    last unless defined $nextLine;
    $line = $nextLine;
}

I added the extra check in the if statement under the assumption that you will also want to process what you have when you reach the end of the file.

Alternatively, as suggested by friedo, if the file can fit into memory, you can load the whole thing into an array at once:

my @lines = <$fileHandler>;
for (my $i = 0; $i <= $#lines; $i++) {
    if ($i == $#lines || $lines[$i+1] =~ />/) {
        ### Do the stuff
    }
}

This is more flexible in that you can access any arbitrary line of the file, in any order, but as mentioned the file does have to be small enough to fit into memory.

reo katoa
  • Instead of "last unless defined $nextLine;" you could simply do: while ( $line ) { my $next = <$fh>; ...; $line = $next; } – preaction Jan 16 '13 at 20:48
  • Yes, I thought of that, but I decided to leave it as is to clearly show the logic. If you really want to get compact, you could do `my $next = <$fh>; while(my $line = $next) { $next = <$fh>; ... }` – reo katoa Jan 17 '13 at 13:14

A nice way to handle these problems is using Tie::File, which allows you to treat a file like an array, without the performance penalty of actually loading the file into memory. It is also a core module since perl v5.7.3.

use Tie::File;
tie my @file, 'Tie::File', "test.txt" or die $!;

for my $linenr (0 .. $#file) {             # loop over line numbers
    if ($file[$linenr] =~ /foo/) {         # this is the current line
        if ($linenr < $#file &&            # don't read past end of file
            $file[$linenr + 1] =~ /^>/) {  # this is the next line
             # do stuff
        }
    }
}
untie @file;   # all done
TLP
  • Tie::File adds a huge performance penalty. What it saves is memory, but even then, for the typical file, it will use up more memory. – ikegami Jan 15 '13 at 19:28
  • That warns because the check for whether you're on the last line happens after you've already read the next line. And you can't just swap the checks without slowing everything down – ikegami Jan 15 '13 at 19:31
  • @ikegami If you're worried that a simple `<=` check will slow things down, you can change the loop end condition to `$#file - 1`. Either way, the benefit in performance is compared to loading the file into memory, and the benefit from using a line-by-line solution is readability. – TLP Jan 15 '13 at 19:39
  • `$#file` reads the entire file the first time it's used. I guess you'll end up using it anyway, so swapping the order probably won't cost much. – ikegami Jan 15 '13 at 19:40
  • The benefit in performance vs loading the file into memory is what I was talking about too. Loading the file into memory will be soooo much faster. I wouldn't be surprised if it was 100% faster. Even when Tie::File ends up loading the entire file into memory because it's less than 2MB. – ikegami Jan 15 '13 at 19:42
  • @ikegami So if you do not know which size file people would be reading, would you recommend Tie::File or reading the whole file into memory? Because that is the question here. – TLP Jan 15 '13 at 19:48
  • The advantage of Tie::File is rapid development. Talking about how it saves resources is oh so wrong. That's all. – ikegami Jan 15 '13 at 19:52
  • @ikegami Rapid development? If that was all, then there would be no drawback to simply loading the file into memory. – TLP Jan 15 '13 at 19:54
  • There aren't just two alternatives. – ikegami Jan 15 '13 at 20:07
  • @ikegami What alternatives do you propose? – TLP Jan 15 '13 at 22:06
  • I wasn't proposing alternatives; I was proposing that your claims were wrong. Accessible alternatives include loading the entire file into memory and maintaining a one-line buffer. – ikegami Jan 15 '13 at 23:43

I just used the Tie::File code from TLP's answer above, for great justice. My input file had a hostname on one line, and the next line held either another hostname or the host's crit level. If there was a crit level, I built a line with the hostname and crit for output to a CSV; if no crit was assigned, I used 0.

(I had to split the lines because each line was `name:servername` or `critlevel:99`, along with cleaning up leading/trailing spaces.)

for my $linenumber (0 .. $#file) {
    #print "$file[$linenumber]\n";
    if ($file[$linenumber] =~ /name/) {
        ($crap, $server) = split(/:/, $file[$linenumber], 2);
        $server =~ s/^\s+|\s+$//g;
        #print "$server\n";
        if ($linenumber < $#file && $file[$linenumber + 1] =~ /crit/) {
            ($crap, $crit) = split(/:/, $file[$linenumber + 1], 2);
            $crit =~ s/^\s+|\s+$//g;
            #print "$crit\n";
        }
        else { $crit = "0"; }
        $outstr = "$server,$crit\n";
        print $outstr;
        print OUTFILE $outstr;
    }
}