I have a text file of multi-line records that I'd like to dedupe using Perl.

Records are delimited by the string "#end-of-record" and look like this:

    CAPTAIN GIBLET'S NEWT CORRAL
    555 RANDOM ST
    TARDIS, CT 99999

    We regret to inform you that we must repossess your pants in part due to your being 6 months late on payments. But mostly it's maliciousness. :)

    TOTAL DUE: $30.00

    #end-of-record

Here is my initial attempt:

    #!/usr/bin/perl -w

    use strict;

    {
            local $/ = "#end-of-record";

            my %seen;
            while ( my $record = <> ) {

                    if (not exists $seen{$record}) {
                            print $record;
                            $seen{$record} = 1;
                    }
            }

    }

This prints every record, duplicates included. Where did I go wrong?

UPDATE: The code above seems to work.

Bubnoff
  • That is one way of doing it. You'll need `$seen{$record} = 1;` within your `if` statement though. Also, you might want to do some processing, such as removing leading and trailing whitespace. Remember, you are essentially matching each record character for character, so whitespace will affect whether a record counts as seen or not. – hmatt1 Nov 21 '14 at 03:34
  • You never store the record in the hash; you only check for its existence. So the check is always false and every record gets printed. – xtreak Nov 21 '14 at 05:53
  • I added the suggested code, but it's still printing all records, duplicates included. – Bubnoff Nov 21 '14 at 17:22

1 Answer

    gawk 'BEGIN { ORS = RS = "#end-of-record\n" } !seen[$0]++' yourfile
Kaz