0

I want to compare two files so I wrote the following code:

while($line1 = <FH1>){
     while($line2 = <FH2>){
         next if $line1 > $line2;
         last if $line1 < $line2;
     }
     next;
}

My question: when the outer loop advances to the next line of file1 and re-enters the inner loop, will the inner while statement read from the first line of file2 again, or continue where it left off on the previous iteration of the outer loop?

Thanks

DavidO
lolibility

3 Answers

7

You should always use strict and use warnings at the start of all your programs and declare all variables at their point of first use. This applies especially when you are asking for help with your code.

Is all the data in your files numeric? If not, then enabling warnings would have told you that the < and > operators are for comparing numeric values rather than general strings.

Once a file has been read through completely - i.e. the second loop's while condition terminates - you can read no more data from the file unless you open it again or use seek to rewind to the beginning.
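As a minimal sketch of that behaviour, the snippet below uses an in-memory file in place of file2 so that it is self-contained. The helper `remaining_lines` and the sample data are made up for illustration; only `open`, `seek`, and `SEEK_SET` (from the core `Fcntl` module) are real Perl.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: reads every remaining line from a handle.
sub remaining_lines {
    my ($fh) = @_;
    my @lines;
    while ( defined( my $line = <$fh> ) ) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# An in-memory file stands in for file2 (assumed sample data).
my $data = "1\n2\n3\n";
open( my $fh2, '<', \$data ) or die "Cannot open in-memory file: $!";

my $first_line = <$fh2>;                  # reads "1\n"
my @rest       = remaining_lines($fh2);   # continues at "2", not at "1"
my @after_eof  = remaining_lines($fh2);   # handle is exhausted: empty list

seek( $fh2, 0, SEEK_SET ) or die "Cannot seek: $!";
my @rewound = remaining_lines($fh2);      # back at the start of the data

print "rest: @rest\n";
print "after EOF: ", scalar(@after_eof), " lines\n";
print "rewound: @rewound\n";
```

The handle keeps its position between reads, so a second loop over the same handle picks up where the first one stopped, and yields nothing once the end of the file has been reached.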

In general it is better in these circumstances to read the smaller of the two files into an array and use the data from there. If both files are very large then something special must be done.
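A rough sketch of that idea, assuming the goal is to list the lines of the larger file that also occur in the smaller one: the smaller file is read once into a hash (a swap for the array mentioned above, chosen for faster lookups) and the larger file is streamed a single time. The sub name `common_lines` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of $big_fh that also occur in $small_fh.
sub common_lines {
    my ( $small_fh, $big_fh ) = @_;

    # Read the smaller file once and remember each line as a hash key.
    my %seen;
    while ( defined( my $line = <$small_fh> ) ) {
        chomp $line;
        $seen{$line} = 1;
    }

    # Stream the larger file; each check is a hash lookup, not a re-read.
    my @common;
    while ( defined( my $line = <$big_fh> ) ) {
        chomp $line;
        push @common, $line if $seen{$line};
    }
    return @common;
}

# In-memory files keep the sketch self-contained (assumed sample data).
my $small = "apple\ncherry\n";
my $big   = "apple\napple\nbanana\ncherry\n";
open( my $fh1, '<', \$small ) or die "Cannot open small file: $!";
open( my $fh2, '<', \$big )   or die "Cannot open big file: $!";

print "$_\n" for common_lines( $fh1, $fh2 );
```

Neither file needs to be sorted for this to work, and duplicates in the larger file are each reported.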

What sort of file comparison are you trying to do? Are you making sure that the two files are identical, or that all data in the second file appears in the first, or something else? Please give an example of your two data files so that we can help you better.

Borodin
  • Sorry for the inconvenience, I just put some pseudocode up there. Of course, in my real code all those declarations are included. In fact I just want to find the same lines in two files; some statements have been omitted, and I'm putting the smaller file in the outer loop, so there is no need to worry about the second file being used up. I just want to know whether the inner loop is going to continue reading or start from the very beginning. This is for optimization purposes: since the two files are already sorted, I do not want to go backwards when reading. If the inner loop starts from the beginning of the file, I will use a for loop – lolibility Jul 13 '12 at 16:56
  • In general you should read the *smaller* file into memory and put it in the *inner* loop. Depending on your data it is likely to be best to use a hash for the smaller file. But again, if you don't show your data files then we can't help you. When you say you are using a `for` loop, *don't* use `for my $line2 (<FH2>) { ... }` as this has the memory overhead of reading all of the file into memory without the advantage of being able to access it a second time. – Borodin Jul 13 '12 at 17:06
2

The inner while loop will consume all the content of the FH2 filehandle when you have read the first line from the FH1 handle. If I can intuit what you want to accomplish, one way to go about it would be to read from both handles in the same statement:

while ( defined($line1 = <FH1>) && defined($line2 = <FH2>) ) {
    # 'lt' is for string comparison, '<' is for numbers
    if ($line1 lt $line2) {
        # print a warning?
        last;
    }
}
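
If both files are already sorted and the aim is to find the lines they share, a single pass over each handle may be enough. The following is only a sketch under those assumptions: both files are sorted in the string order used by `cmp`, and the second file may contain duplicates. The subs `next_line` and `sorted_matches` and the sample data are hypothetical; for numeric data, `<=>` would take the place of `cmp`.

```perl
use strict;
use warnings;

# Hypothetical helper: reads one line and strips the newline (undef at EOF).
sub next_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Hypothetical helper: lines of $fh2 that also occur in $fh1.
# Assumes both inputs are sorted in the same string order.
sub sorted_matches {
    my ( $fh1, $fh2 ) = @_;
    my @matches;
    my $line1 = next_line($fh1);
    my $line2 = next_line($fh2);

    while ( defined($line1) && defined($line2) ) {
        my $order = $line1 cmp $line2;
        if ( $order < 0 ) {
            $line1 = next_line($fh1);    # file 1 is behind: advance it
        }
        elsif ( $order > 0 ) {
            $line2 = next_line($fh2);    # file 2 is behind: advance it
        }
        else {
            push @matches, $line2;
            $line2 = next_line($fh2);    # keep $line1 so duplicates in file 2 still match
        }
    }
    return @matches;
}

# In-memory files keep the sketch self-contained (assumed sample data).
my $file1 = "apple\ncherry\nfig\n";
my $file2 = "apple\napple\nbanana\ncherry\n";
open( my $fh1, '<', \$file1 ) or die "Cannot open file 1: $!";
open( my $fh2, '<', \$file2 ) or die "Cannot open file 2: $!";

print "$_\n" for sorted_matches( $fh1, $fh2 );
```

Neither handle is ever rewound here, which relies on the fact that a handle keeps its read position from one iteration to the next.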
mob
  • Thanks, but it won't work, as I want to find the same lines in these two files and file two can have duplicates. I'm not simply comparing the corresponding lines of each file. But anyway, I found a way out based on the other answers. Thanks. – lolibility Jul 13 '12 at 17:11
0

The inner loop will continue from its last known position in FH2. If you want it to restart from the beginning of the file, you need to put:

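use Fcntl qw(SEEK_SET);   # provides the SEEK_SET constant
seek(FH2, 0, SEEK_SET);   # arguments are FILEHANDLE, POSITION, WHENCE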

before the inner while loop.

The documentation for seek is available in perldoc (perldoc -f seek).

beresfordt
  • 1
    Using `seek` is generally a bad solution. Depending on the nature of the data, reading the file into memory or using `Tie::File` is likely to be preferable. – Borodin Jul 13 '12 at 15:58
  • Fair enough. Could you link me to something explaining why seek isn't a good option, please? – beresfordt Jul 13 '12 at 16:25
  • 2
    The problem isn't with `seek` per se, it's with reading the same data from mass storage over and over again. Files need to be read once to establish their contents, but disk storage is so *vastly slower* than memory that it is exceedingly wasteful to keep throwing the memory contents away and reading it from the file again. The situation is different with huge files, which cannot be read in their entirety. In this case the `Tie::File` module is useful for caching the file data and hiding all the `seek` operations for you. – Borodin Jul 13 '12 at 17:02
  • Ah, I thought Tie::File was just a nicer interface for interacting with files by line etc; didn't know it did any caching.. Time for me to go read the documentation! – beresfordt Jul 13 '12 at 19:00