
I have a Perl script that manages the conversion of a specific file format into CSV files that I can process later.

I need this script to avoid generating duplicate lines:

  # get timestamp
  if ((rindex $l, "ZZZZ,") > -1) {
      (my $t1, my $t2, my $timestamptmp1, my $timestamptmp2) = split(",", $l);
      $timestamp = $timestamptmp2 . " " . $timestamptmp1;
  }

  if (((rindex $l, "TOP,") > -1) && (length($timestamp) > 0)) {
      my @top = split(",", $l);
      my $aecrire = $SerialNumber . "," . $hostnameT . "," . $timestamp . "," . $virtual_cpus . "," . $logical_cpus . "," . $smt_threads . "," . $top[1];
      # append the remaining TOP fields, starting at the 4th column
      my $i = 3;
      while ($i <= $#top) {
          $aecrire = $aecrire . ',' . $top[$i];
          $i = $i + 1;
      }
      print (FIC2 $aecrire . "\n");
  }

My source file is FIC1 and the destination file is FIC2; the unique key is $timestamp.

I want the script to check whether $timestamp already exists in FIC1 (which is opened at the beginning of the process), and if it does, exclude the line from being written to FIC2. If $timestamp is not present, then write as normal.

Currently, if I rerun the script over an already processed file, each line is written again and duplicated.

My goal is to be able to run this script periodically over a file without duplicating events.

I'm quite new to Perl. From what I've seen, this should be achievable simply by using a %seen hash within the while loop, but I haven't managed to get it working yet...

Thank you very much in advance for any help :-)

Guilmxm

2 Answers


What you are describing is a hash.

You would define a hash in your code:

my %seen = ();

Then, when you read a line - before you decide to write it - you could do something like:

#Check the hash to see if we have seen this line before we write it out

if ($seen{$aecrire} eq 1) {
    # Do nothing - skip the line
} else {
    $seen{$aecrire} = 1;
    print (FIC2 $aecrire."\n");
}

I haven't checked this code but that is the gist.
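
For the question's case, where the unique key is $timestamp rather than the whole output line, a minimal untested sketch (using exists so an unseen key does not trigger an uninitialized-value warning) could look like:

my %seen;    # declare once, near the top of the script

if (((rindex $l, "TOP,") > -1) && (length($timestamp) > 0)) {
    # ... build $aecrire as before ...
    # Only write the line if this timestamp has not been written yet
    unless (exists $seen{$timestamp}) {
        $seen{$timestamp} = 1;
        print (FIC2 $aecrire."\n");
    }
}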

Jeef
  • It is not necessary to assign the empty list (or any value) when declaring a hash (or any type of variable). `my %seen` is all you need. +1 for suggesting a hash, though. – TLP Feb 18 '14 at 01:34
  • Hi, I tried adding the condition. It almost works: it no longer generates all the duplicate lines, BUT it still generates one duplicate line per execution... (the last line). The unique key should be the timestamp: if the timestamp exists in FIC1, the script should not write. I added your code (replacing $seen{$aecrire} with $seen{$timestamp}), replaced the print line with that condition, and added my %seen; at the top of the script (I also tried my %seen = ();). What am I doing wrong? Also, the script outputs the message "Use of uninitialized value within %seen in string eq at..." – Guilmxm Feb 18 '14 at 08:54
  • I'm not 100% sure, but if you aren't going to check whether it eq 1, you might need to do if (exists $seen{$timestamp}) ... Also, are you using strict? – Jeef Feb 18 '14 at 10:46
  • If you can post some sample data I can probably fix the program for ya. – Jeef Feb 18 '14 at 10:47
  • Jeef, yes, I'm using strict. That would be very cool :-) I will as soon as I can get access to pastebin. One more thing, perhaps I haven't been clear enough: at the beginning of the script I open FIC1 (source file) and FIC2 (dest file). What I need is: if $timestamp exists in FIC1, do nothing; if not, write to FIC2. So I think I first have to collect the timestamps from FIC1 to check against (a rough sketch of that seeding step follows these comments). – Guilmxm Feb 18 '14 at 13:33
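
A rough, untested sketch of that seeding step (the filehandle FIC_CHECK and the variable $fichier_existant are placeholders, not part of the original script; it assumes the timestamp is the third comma-separated field of each already-written line, as in the question's $aecrire):

my %seen;

# Pre-load %seen with every timestamp already present in the existing file,
# so that a rerun of the script can skip lines written by a previous run.
if (open FIC_CHECK, "<", $fichier_existant) {
    while (my $ligne = <FIC_CHECK>) {
        chomp $ligne;
        my @champs = split(",", $ligne);
        # SerialNumber,hostname,timestamp,... => timestamp is the third field
        $seen{$champs[2]} = 1 if defined $champs[2];
    }
    close FIC_CHECK;
}

# Later, before writing:
# print (FIC2 $aecrire."\n") unless exists $seen{$timestamp};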

I ended up adding the following code at the end of my process:

# Deduplicate each destination file in place: keep only the first
# occurrence of every line, then rewrite the file.
my (@final, %hash);

foreach my $file ($dstfile_CPU_ALL, $dstfile_MEM, $dstfile_VM, $dstfile_PROC, $dstfile_TOP) {

        if (!open FILE, "+<$file") {
                print "Nothing to dedup, '$file' $!\n";
                next;
        }

        while (<FILE>) {
                if (not exists $hash{$_}) {
                        push @final, $_;
                        $hash{$_} = 1;
                }
        }

        # Rewrite the file with the unique lines only
        truncate FILE, 0;
        seek FILE, 0, 0;
        print FILE @final;
        close FILE;

        @final = ();
        %hash  = ();
}
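
This deduplicates on the whole line. Since the unique key is really the timestamp (the third comma-separated field of each written line), the inner loop could instead key the hash on that field alone; a rough, untested variation of the while loop (the $key variable is a placeholder):

while (<FILE>) {
        # SerialNumber,hostname,timestamp,... => timestamp is the third field
        my $key = (split(",", $_))[2];
        next unless defined $key;
        if (not exists $hash{$key}) {
                push @final, $_;
                $hash{$key} = 1;
        }
}
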
Guilmxm