SED or AWK script to replace multiple text

Question

I am trying to do the following with a sed script but it's taking too much time. Looks like something I'm doing wrongly.

Scenario: I've student records (> 1 million) in students.txt. In This file (each line) 1st 10 characters are student ID and next 10 characters are contact number and so on

students.txt

10000000019234567890XXX...
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ...

I have another file (encrypted_contact_numbers.txt) which has all the phone but numbers and corresponding encrypted phone numbers as below

encrypted_contact_numbers.txt

Phone_Number, Encrypted_Phone_Number

9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747

I wanted to replace all the contact numbers (11th–20th position) in students.txt with the corresponding encrypted phone number from encrypted_contact_numbers.txt.

Expected Output:

10000000011122334455XXX...
10000000024466742178YYY...
.
.
.
10010000002964267747ZZZZ...

I am using the below sed script to do this operation. It is working fine but too slowly.

Approach 1:

while read -r pattern replacement; do   
    sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt

Approach 2:

sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt

Is there any way to process this huge file quickly?

Update: 9-Feb-2018

Solutions given in AWK and Perl is working fine if the phone number is in specified position (column 10-20), If I try to do global replacement it took too much time to process. Is there any best way to achieve this?

students.txt : Updated version

10000000019234567890XXX...9234567890
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ9234567890...

In addition to what John mentioned please add your encrypted_contact_numbers.txt too in code tags. — RavinderSingh13, Jan 23 '18 at 18:38
For correct formatting, please use 4 spaces at the front of each line of code/data/error msgs OR highlight a block of text and use the `{}` format tool at the top left of the edit box to format as `code/data/output/errorMsgs`. For more info see [editing-help](https://stackoverflow.com/editing-help) and [formatting](https://stackoverflow.com/help/formatting). Good luck. — shellter, Jan 23 '18 at 18:45

karakfa · Answer 1 · 2018-01-23T19:42:58.097

5

awk to the rescue!

if you have enough memory to keep the phone_map file in memory

awk -F', *' 'NR==FNR{a[$1]=$2; next}
                    {key=substr($0,11,20)}
           key in a {$0=substr($0,1,10) a[key] substr($0,21)}1' phone_map data_file

not tested since you're missing the data file. It should speed up since both files will be scanned only once.

edited Jan 23 '18 at 19:42

answered Jan 23 '18 at 19:04

karakfa

66,216
7
41
56

I have added the sample data file **(Students.txt)** – Dhanabalan Jan 23 '18 at 19:11
@Dhanabalan, please try to attach all details in a single shot to your post, try Karakfa and mine code and let us know how it goes then. – RavinderSingh13 Jan 23 '18 at 19:23
Am I missing something? Won't the `next` in the first action block trigger on every line? If so the other two blocks would never run. Should it be `NF>1 {a[$1]=$2]; next}`? Also, I think the `1` after the third action block is misplaced. – cxw Jan 23 '18 at 19:41
you're right of course. "first file" condition was missing, now added. I typed directly here without testing first. – karakfa Jan 23 '18 at 19:44
That `NR==FNR` is a nice trick I haven't seen before - thanks! (For future readers, it is true if and only if the line you are on comes from the first file on the command line.) – cxw Jan 23 '18 at 19:50

score 2 · Accepted Answer · answered Jan 23 '18 at 19:23

2

Following awk may help you on same.

awk '
FNR==NR{
  sub(/ +$/,"");
  a[$1]=$2;
  next
}
(substr($0,11,10) in a){
  print substr($0,1,10) a[substr($0,11,10)] substr($0,21)
}
' FS=", " encrypted_contact_number.txt students.txt

Output will be as follows. Will add explanation too shortly.

10000000011122334455XXX...
10000000024466742178YYY...

answered Jan 23 '18 at 19:23

RavinderSingh13

130,504
14
57
93

This is working fine for sample. Cold you please explain the code so that it will be implement into my actual scenario with huge file. Thanks again for your help! – Dhanabalan Jan 23 '18 at 20:14
I was able to understand and tried to implemented the same in actual scenario. Its working fine, Thanks for your timely help! – Dhanabalan Jan 23 '18 at 20:57
Hi Ravinder. Is it possible to replace all matching number with encrypted number without mentioning position? like replace all – Dhanabalan Feb 08 '18 at 04:20
@Dhanabalan, please show example of it, not clear. If it is a new question then post it in a new thread too. – RavinderSingh13 Feb 10 '18 at 09:25
Initially I told 1st 10 characters are student ID and next 10 characters are contact number. So we used substr($0,11,10) to do replacement. Now I getting the file which has contact number in other positions also, so I trying to do replace all globally instead of particular position. – Dhanabalan Feb 10 '18 at 09:30
@Dhanabalan, why don't you post a new thread for it as answers were given in this thread are as per your 1st question, changing questions later will not be good as many users try to help you. Better choose a right answer(out of any answers feasible in here) and open a new post, thanks. – RavinderSingh13 Feb 10 '18 at 09:32
Sure. I have created another post https://stackoverflow.com/questions/48719330/sed-or-awk-replace-all-with-patterns-from-another-file Thanks! – Dhanabalan Feb 10 '18 at 09:41
@Dhanabalan, choose a correct answer too, out of any. To reward people's effort. Thanks. – RavinderSingh13 Feb 10 '18 at 09:42
Sorry I unknowingly removed it. Marked again – Dhanabalan Feb 10 '18 at 09:43

cxw · Answer 3 · 2018-01-23T20:44:31.620

2

What question would be complete without a Perl answer? :) Adapted from various answers in the Perl Monks' discussion of this topic.

Edited source

Edited per @Borodin's comment. With some inline comments for explanation, in hopes that they are helpful.

#!/usr/bin/env perl

use strict;     # keep out of trouble
use warnings;   # ditto

my %numbers;    # map from real phone number to encrypted phone number

open(my $enc, '<', 'encrypted_contact_numbers.txt') or die("Can't open map file");
while(<$enc>) {
    s{\s+}{}g;                               #remove all whitespace
    my ($regular, $encrypted) = split ',';
    $numbers{$regular} = $encrypted;
}

# Make a regex that will match any of the numbers of interest
my $number_pattern = join '|', map quotemeta, keys %numbers;
$number_pattern = qr{$number_pattern}o;
    # Compile the regex - we no longer need the string representation

while(<>) {     # process each line of the input
    next unless length > 1;     # Skip empty lines (don't need this line if there aren't any in your input file)
    substr($_, 10, 10) =~ s{($number_pattern)}{$numbers{$1}}e;
    # substr: replace only in columns 11--20
    # Replacement (s{}{}e): the 'e' means the replacement text is perl code.
    print;  # output the modified line
}

Test

Tested on Perl v5.22.4.

encrypted_contact_numbers.txt:

9234567890, 1122334455
9325788532, 4466742178

students.txt:

aaaaaaaaaa9234567890XXX...
bbbbbbbbbb9325788532YYY...
cccccccccc8766443367ZZZZ...
dddddddddd5432112345Nonexistent phone number

(modified for ease of reading)

Output of ./process.pl students.txt:

aaaaaaaaaa1122334455XXX...
bbbbbbbbbb4466742178YYY...
cccccccccc8766443367ZZZZ...
dddddddddd5432112345Nonexistent phone number

The change has been made on the first two lines, but not the second two, which is correct for this input.

edited Jan 23 '18 at 20:44

answered Jan 23 '18 at 19:38

cxw

16,685
2
45
81

1

It diminishes our efforts to encourage best practices when you publish an *answer* without `use strict` or `use warnings`, using the two-parameter `open`, no information in the `die` string, and global file handles. In addition -- Why are you using `quotemeta` on strings of digits? -- Your `$numbers{$1} || $1` is pointless—you just built the regex pattern out of the hash keys. You're also holding on to a hash of over a million records long after you've finished using it to build a regex. – Borodin Jan 23 '18 at 19:58
1

@Borodin because I'm still learning, and trying to give back while doing so. Thanks for the feedback! **quotemeta** because I don't want to assume the OP's file is only digits. I agree it should be, but better safe than sorry. **`|| $1`** if I leave this out, I get `Use of uninitialized value within %numbers in substitution iterator` on lines that match. Why? – cxw Jan 23 '18 at 20:16
@Borodin I can no longer repro the problem I was having - not sure what the deal was. Anyway, fixed version is above. I do need to keep the hash for the replacement values in the main loop, but don't need to keep the `join` result, so I changed that. – cxw Jan 23 '18 at 20:42
*"It's much shorter without the comments!"* You should leave them out. It's not your place to teach people Perl in every program you write. *"if I leave this out, I get `Use of uninitialized value within %numbers in substitution iterator`"* I can't get that behaviour. // Also, you shouldn't have the `/g` modifier on the substitution, and you have no reason to avoid the standard slash delimiters for `s///` and `m//`. – Borodin Jan 23 '18 at 20:43
@Borodin Yep, I got rid of the `/g` modifier. `s{}{}` is only because, in my browser, the syntax highlighter can't properly parse `s///`. Re. comments, the OP specifically asked another answerer "Co[u]ld you please explain the code". Whether or not it's my place, I think it's not unreasonable to comment in anticipation of a similar request. – cxw Jan 23 '18 at 20:46
As for explaining the code, I always write a separate narrative before the code itself. And the familiarity of the `s///` delimiters is more important than the highlighting. It's usually because it's expecting an even number of slashes and you can fix it by adding a comment `#/`. Note that you can write `s/($number_pattern)/{ $numbers{$1} }e` which you may prefer as it emphasizes that the replacement is Perl code. – Borodin Jan 23 '18 at 21:17
Remember that you can post your code on [*Code Review*](https://codereview.stackexchange.com) to get this sort of assessment. – Borodin Jan 23 '18 at 21:19
Thanks everyone, This is the first time I am trying Perl. – Dhanabalan Feb 09 '18 at 23:33

SED or AWK script to replace multiple text

3 Answers3

Edited source

Test