
I have 2 files.

  1. Obfuscated file called input.txt
  2. A second file called mapping.txt consisting of key value pairs.

I want to find every occurrence of the key from mapping.txt in input.txt and replace it with the value corresponding to the key.

Please note that I want to overwrite the contents of the line in input.txt every time a successful match occurs.

I have written the following code:

#! /usr/bin/perl

use strict;
use warnings;

(my $mapping,my $input)=@ARGV;

open(MAPPING,'<',$mapping) || die("couldn't read from the file, $mapping with error: $!\n");

while(<MAPPING>)
{
    chomp $_;
    my $line=$_;
    (my $key,my $value)=split("=",$line);
    open(INPUT,'+<',$input);
    while(<INPUT>)
    {
        chomp $_;
        if(index($_,$key)!=-1)
        {
            $_=~s/\Q$key/$value/g;
            # move pointer to beginning of line
           print INPUT $_."\n";
        }
    }
    close INPUT;
}
close MAPPING;

Brief Overview of the code:

  1. Opens the mapping.txt file in read mode.
  2. Since each line is a key value pair, it splits it into key and value.
  3. Opens the input.txt file in read/write mode ('+<').
  4. Checks if the key is found in the current line.
  5. If the key is found, then substitute the key with the value ignoring any meta characters in the key (by prefixing \Q)
  6. At this point, the file pointer would be at the end of the line since the previous statement would scan the entire line to find the key and replace it.
  7. If I could move the file pointer to the start of the line, then I can overwrite with:

    print INPUT $_,"\n"

  8. I tried looking up the seek function, but I was unable to figure out how to use it for this purpose.

Once this is done, then the code will close the file. It will pick the next key value pair from mapping.txt and again scan the input file from beginning looking for matches and replacing them.

The most important point is, each time the inner while loop will be operating on the input.txt which was modified in the previous iteration of inner while loop. This way, any successful Find and Replace operations would keep on getting saved in the input.txt file.

How do I do this?

Thanks.

matthias krull
Neon Flash

2 Answers


First of all you should use lexical file handles, the three-parameter form of open, and always check the status to make sure that an open has succeeded (as you do with the mapping file but not the input file).

The solution you suggest, rewinding to the start of the line before using print, will not work, because you cannot update part of a file unless your replacement data is exactly the same size as the data it is replacing. This will not generally be true in your situation.

There are a number of solutions to this. The first and simplest is to invert the loops and put the read loop for the mapping file inside the read loop for the input file. Your code would look like this:

use strict;
use warnings;

my ($mapping, $input) = @ARGV;

open my $infh, '<', $input or die "Unable to open '$input': $!";

while (my $line = <$infh>) {

  open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";

  while (<$mapfh>) {
    chomp;
    my ($key, $value) = split /=/;
    $line =~ s/\Q$key/$value/g;
  }
  print $line;
}

However, the output is sent to STDOUT, so you will have to arrange for it to be saved to a file and renamed appropriately.
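Saving the output to a file and renaming it could be sketched roughly as below. This is only an illustration: `rewrite_file` and its `$transform` callback are hypothetical names, not something from your code, and it assumes the temporary file can be created in the same directory as the input so that `rename` stays on one filesystem.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: pass every line of $path through $transform,
# write the result to a temporary file, then rename it over the original.
sub rewrite_file {
    my ($path, $transform) = @_;

    open my $in, '<', $path or die "Unable to open '$path': $!";

    # Create the temporary file next to the original (assumption: that
    # directory is writable), so the rename does not cross filesystems.
    my ($out, $tmp) = tempfile(DIR => dirname($path), UNLINK => 0);

    while (my $line = <$in>) {
        print {$out} $transform->($line);
    }

    close $in;
    close $out or die "Unable to close '$tmp': $!";

    # The original is only replaced once the whole pass has succeeded.
    rename $tmp, $path or die "Unable to rename '$tmp' to '$path': $!";
    return;
}
```

The callback would hold the substitution loop shown above. Note that the new file gets the permissions File::Temp gives it, not those of the original, so you may want to adjust them afterwards.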

An alternative here is to use the -i command-line option, which forces a file to be renamed automatically, and a backup saved if required. Using a bare -i will modify the file in-place by deleting the old file and renaming the new output, while giving the parameter a value like -i.bak will rename the old file by appending .bak instead of deleting it. The -i option applies only to files read from ARGV using an empty <> operator, and setting the built-in variable $^I to a value (or to the empty string '') has the same effect. The code looks like this:

use strict;
use warnings;

my $mapping = shift @ARGV;
$^I = '.bak';

while (my $line = <>) {

  open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";

  while (<$mapfh>) {
    chomp;
    my ($key, $value) = split /=/;
    $line =~ s/\Q$key/$value/g;
  }
  print $line;
}

A third, and neater, alternative is to use Tie::File, which maps a Perl array to the file contents and reflects all modifications of the array back to the original file. This is an example:

use strict;
use warnings;

use Tie::File;

my ($mapping, $input) = @ARGV;
tie my @input, 'Tie::File', $input or die "Unable to open '$input': $!";

for my $line (@input) {

  open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";

  while (<$mapfh>) {
    chomp;
    my ($key, $value) = split /=/;
    $line =~ s/\Q$key/$value/g;
  }
}

Finally, it is highly inefficient to keep opening and reading the mapping file for every line of input, and it is best to build a regex from its contents and use it throughout the program. This version first builds a hash %mapping from the mapping file and then creates a regex by applying quotemeta to each hash key to escape any regex metacharacters, and then joining them with the regex alternation operator |. The keys are sorted by descending length so that the longest matches are found and replaced in priority over the shorter ones.

use strict;
use warnings;

use Tie::File;

my ($mapping, $input) = @ARGV;

open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";
my %mapping = map { chomp; /\S/ ? split /=/ : () } <$mapfh>;
my $regex = join '|', map quotemeta, sort { length $b <=> length $a } keys %mapping;

tie my @input, 'Tie::File', $input or die "Unable to open '$input': $!";

for my $line (@input) {
  $line =~ s/($regex)/$mapping{$1}/g;
}
Borodin

If I could move the file pointer to the start of the line, then I can overwrite with:

print INPUT $_,"\n"

Your premise is wrong: given the byte sequence 00 01 02 and the rule 01 = A1 A2, overwriting in place would produce the byte sequence 00 A1 A2, not the intended 00 A1 A2 02, because the longer replacement clobbers the byte that follows. Ways around this include:

  • Use the Tie::File module.
  • Write to another file, and rename the second file to the original, once your pass is complete. This is probably most efficient and scalable.
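To see the clobbering for yourself, here is a small sketch. `overwrite_at` and `slurp` are hypothetical helpers written just for this demonstration, and the three-byte file "abc" stands in for the 00 01 02 sequence above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $replacement into $path starting at byte $offset,
# without shifting anything that follows.
sub overwrite_at {
    my ($path, $offset, $replacement) = @_;
    open my $fh, '+<', $path or die "Unable to open '$path': $!";
    seek $fh, $offset, 0 or die "Unable to seek in '$path': $!";
    print {$fh} $replacement;
    close $fh or die "Unable to close '$path': $!";
    return;
}

# Hypothetical helper: read the whole file back as one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Unable to open '$path': $!";
    local $/;
    return scalar <$fh>;
}

my ($tmp_fh, $demo) = tempfile();
print {$tmp_fh} "abc";
close $tmp_fh;

# Replace the one-byte "b" with the two-byte "XY".
overwrite_at($demo, 1, "XY");
print slurp($demo), "\n";   # prints "aXY": the "c" has been overwritten
unlink $demo;
```

The file never grows to make room for the extra byte, so everything after the match is at risk whenever the value is longer than the key.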

Seeking is not a good idea: you would be constrained to fixed-length substitutions, and seek and tell operate on bytes, not characters. If you really have to use in-place editing, you could use this loop:

my $beginning_of_line = tell $fh;
while (<$fh>) {
  # do processing
  seek $fh, $beginning_of_line, 0;
  # do update
} continue {$beginning_of_line = tell $fh}

Also, you make several passes over the input file. Assuming the token sequence a b c and the rules b = d e and d = f, you would produce the sequences a f e c or a d e c depending on the order of the rules! This may not be what you want.
Also, consider the ambiguity between the rules a = c and a b = d over the input a b. Does this produce c b or d?
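
Both effects can be reproduced with a short sketch. `apply_sequentially` and `apply_single_pass` are hypothetical names used only for this illustration, and the single-pass version assumes a non-empty rule set and picks "longest key first", which is just one possible way to resolve the ambiguity.

```perl
use strict;
use warnings;

# Apply each [key, value] rule in turn, as the multi-pass approach does.
# Later rules see the output of earlier ones.
sub apply_sequentially {
    my ($text, @rules) = @_;
    for my $rule (@rules) {
        my ($key, $value) = @$rule;
        $text =~ s/\Q$key\E/$value/g;
    }
    return $text;
}

# Apply all rules in a single pass. Longer keys are tried first, and
# replaced text is never rescanned. Assumes %map is not empty.
sub apply_single_pass {
    my ($text, %map) = @_;
    my $regex = join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a } keys %map;
    $text =~ s/($regex)/$map{$1}/g;
    return $text;
}

print apply_sequentially('a b c', ['b', 'd e'], ['d', 'f']), "\n";  # a f e c
print apply_sequentially('a b c', ['d', 'f'], ['b', 'd e']), "\n";  # a d e c
print apply_single_pass('a b', 'a' => 'c', 'a b' => 'd'), "\n";     # d
```

Whether the single-pass behaviour is the right one depends on what your mapping file is meant to express, so treat this as a way to explore the question rather than as the answer.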

amon
  • From the [`Tie::File` documentation](https://metacpan.org/module/Tie%3a%3aFile#DESCRIPTION): *The file is not loaded into memory, so this will work even for gigantic files* – Borodin Oct 08 '12 at 10:43