
I am trying to do pattern replacement using a sed script, but it's not working properly.

sample_content.txt

288Y2RZDBPX1000000001dhana
JP2F64EI1000000002d
EU9V3IXI1000000003dfg1000000001dfdfds
XATSSSSFOO4dhanaUXIBB7TF71000000004adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN1000000005egw

patterns.txt

1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001

Expected output

288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw

I am able to do it with a single sed replacement, like:

sed 's/1000000001/9000000003/g' sample_content.txt
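
For a handful of patterns this could be chained together with multiple `-e` expressions, e.g.:

sed -e 's/1000000001/9000000003/g' -e 's/1000000002/2000000001/g' sample_content.txt

But writing a million expressions by hand is obviously not workable.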

Note:

  • The matching pattern is not in a fixed position.
  • A single line may have multiple matching values to replace in sample_content.txt.
  • sample_content.txt and patterns.txt each have > 1 million records.

File attachment link: https://drive.google.com/open?id=1dVzivKMirEQU3yk9KfPM6iE7tTzVRdt_

Could anyone suggest how I can achieve this without hurting performance?

Updated on 11-Feb-2018

After analyzing the real file, I noticed that there is a grade value at positions 30 and 31, which tells us where the replacements need to be applied:
If the grade is AB, replace the 10-digit phone numbers at positions 41-50 and 101-110.
If the grade is BC, replace the 10-digit phone numbers at positions 11-20, 61-70 and 151-160.
If the grade is DE, replace the 10-digit phone numbers at positions 1-10, 71-80, 151-160 and 181-190.

I am seeing about 50 unique grades like this across 2 million sample records.

{ grade = substr($0, 30, 2) }   # identify the grade at positions 30-31
{
    if (grade == "AB") {
        print substr($0,41,10) ORS substr($0,101,10)
    } else if (grade == "BC") {
        print substr($0,11,10) ORS substr($0,61,10) ORS substr($0,151,10)
    }
    # ...and likewise for the remaining grade conditions, 50 in all
}
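
One way to avoid writing out 50 branches would be to make this table-driven. A minimal sketch (the offsets table below is hypothetical and only encodes the AB and BC rules above; the grade is assumed to be at positions 30-31 as described):

awk '
BEGIN {
    # offsets[grade] lists the 1-based start positions of the 10-digit
    # phone numbers for that grade; extend with the remaining grades
    offsets["AB"] = "41 101"
    offsets["BC"] = "11 61 151"
}
{
    grade = substr($0, 30, 2)               # grade at positions 30-31
    n = split(offsets[grade], pos, " ")     # n is 0 for unknown grades
    for (i = 1; i <= n; i++)
        print substr($0, pos[i], 10)
}' sample_content.txt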

May I know whether this approach is advisable, or is there a better approach?

Dhanabalan
    re: `> 1 Million records` && `without affecting performance` good luck – mpapec Feb 10 '18 at 10:23
  • @Сухой27, I completely second you on same. – RavinderSingh13 Feb 10 '18 at 10:50
  • Both solutions you've got so far apply each substitution to the whole of each input line rather than to the remainder of the input line after the previous substitutions are made, so if, for example, sample_content.txt contained `xay` and patterns.txt included `a b` and `b c`, then the tools would output `xcy` - **is** that the expected output or should it be `xby`? Whichever is correct, you should include a case that tests that in your sample input. – Ed Morton Feb 10 '18 at 11:37
  • Dhanabalan, see George's NICE testing reports; it seems my 2nd solution is MUCH FASTER than my previous solution. Please check all solutions and let us know your experience so that we all could learn here too, cheers :) – RavinderSingh13 Feb 11 '18 at 02:14
  • @RavinderSingh13 the OP specifically said `Single line may have multiple matching value to replace in sample_content.txt`, so jumping out of the loop after the first match, while faster, wouldn't produce the expected output. I also think that doing just 1 replacement for each matching string instead of replacing all occurrences of each string on each line (as both your scripts do by using sub() instead of gsub()) is incorrect. – Ed Morton Feb 11 '18 at 12:32
  • @EdMorton, sure Ed sir I got it, you are right on this. – RavinderSingh13 Feb 11 '18 at 12:43
  • Maybe it would be better to unaccept the answer and allow more answers to appear. In my opinion none of the answers performs really well; some are faster than others, but even the fastest one will need hours for a 50K-line file. – George Vasiliou Feb 11 '18 at 16:02
  • @EdMorton Sorry for the delay, Ed sir. We can assume that the real scenario will not have this pattern: if `a b` appears, then `b` will not appear on the LHS again. So once we have replaced `xay` with `xby`, we can consider the remainder of the line for further replacement, as a single line may have multiple matching values to replace. – Dhanabalan Feb 11 '18 at 19:49

4 Answers


Give this one a try. It should be fast.

$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt

This formats the data of `patterns.txt` as below, without actually changing the real contents of patterns.txt:

$ printf 's/%s/%s/g\n' $(<patterns.txt)
s/1000000001/9000000003/g
s/1000000002/2000000001/g
s/1000000003/3000000001/g
s/1000000004/4000000001/g
s/1000000005/5000000001/g

All of the above is then fed via process substitution `<(...)` to a plain sed as a script file, using the `-f` switch (read sed commands from a file):

$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
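
Equivalently, since GNU sed accepts `-` as standard input for `-f`, the process substitution can be replaced by a plain pipe:

$ printf 's/%s/%s/g\n' $(<patterns.txt) | sed -f - contents.txt
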
George Vasiliou
  • @Dhanabalan: Maybe that's faster: `sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) sample_content.txt` – Cyrus Feb 10 '18 at 15:34

Benchmarks for future reference

Test environment:

Using your sample files patterns.txt with 50,000 lines and contents.txt also with 50,000 lines.

All lines from patterns.txt are loaded in all solutions but only the first 1000 lines of contents.txt are examined.

The testing laptop is equipped with a dual-core 64-bit Intel(R) Celeron(R) CPU N3050 @ 2.16GHz and 4 GB RAM, running 64-bit Debian 9 with GNU sed 4.4 and GNU awk 4.1.4.

In all cases the output is sent to a new file to avoid the overhead of printing data on the screen.

Results:

1. RavinderSingh13 1st awk solution

$ time awk 'FNR==NR{a[$1]=$2;next}   {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt  <(head -n 1000 contents.txt) >newcontents.txt

real    19m54.408s
user    19m44.097s
sys 0m1.981s

2. EdMorton 1st awk Solution

$ time awk 'NR==FNR{map[$1]=$2;next}{for (old in map) {gsub(old,map[old])}print}' patterns.txt <(head -n1000 contents.txt) >newcontents.txt

real    20m3.420s
user    19m16.559s
sys 0m2.325s

3. Sed (my sed) solution

$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 1000 contents.txt) >newcontents.txt

real    1m1.070s
user    0m59.562s
sys 0m1.443s

4. Cyrus sed solution

$ time sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) <(head -n1000 contents.txt) >newcontents.txt

real    1m0.506s
user    0m59.871s
sys 0m1.209s

5. RavinderSingh13 2nd awk solution

$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt  <(head -n 1000 contents.txt) >newcontents.txt

real    0m25.572s
user    0m25.204s
sys     0m0.040s

For a small amount of input data like 1000 lines, the awk solution seems good. Let's make another test with 9000 lines this time to compare performance.

6. RavinderSingh13 2nd awk solution with 9000 lines

$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt  <(head -9000 contents.txt) >newcontents.txt

real    22m25.222s
user    22m19.567s
sys      0m2.091s

7. Sed Solution with 9000 lines

$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -9000 contents.txt) >newcontents.txt

real    9m7.443s
user    9m0.552s
sys     0m2.650s

8. Parallel Seds Solution with 9000 lines

$ cat sedpar.sh
s=$SECONDS
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -3000 contents.txt) >newcontents1.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +3001 contents.txt |head -3000) >newcontents2.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +6001 contents.txt |head -3000) >newcontents3.txt &
wait
cat newcontents1.txt newcontents2.txt newcontents3.txt >newcontents.txt && rm -f newcontents1.txt newcontents2.txt newcontents3.txt
echo "seconds elapsed: $(($SECONDS-$s))"

$ time ./sedpar.sh
seconds elapsed: 309

real    5m16.594s
user    9m43.331s
sys     0m4.232s

Splitting the task across multiple commands, like three parallel seds, seems to speed things up.
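
The hard-coded head/tail offsets above can be generalized with a loop; a sketch assuming GNU split (its chunk files xaa, xab, ... sort alphabetically in input order, so the final cat preserves the original line order):

$ cat sedpar2.sh
#!/bin/bash
# split contents.txt into 3 chunks without breaking lines: xaa, xab, xac
split -n l/3 contents.txt
# run one sed per chunk in parallel
for f in x??; do
    sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) "$f" >"$f.out" &
done
wait
# glob expansion is sorted, so this concatenates the chunks in order
cat x??.out >newcontents.txt && rm -f x?? x??.out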

For those who would like to repeat the benchmarks on their own PC: you can download contents.txt and patterns.txt either via the OP's links or from my GitHub:

contents.txt

patterns.txt

George Vasiliou
  • Thanks, George, for sharing this. Could you please test my 2nd solution too and post the results here? Since I don't have files with that many lines I couldn't test it myself, sir; will be grateful to you. – RavinderSingh13 Feb 10 '18 at 17:00
  • cool, actually that link is somehow not working for me. Thanks again. – RavinderSingh13 Feb 10 '18 at 17:28
  • Thanks George, it is because as soon as I find a match inside array a I come out of the loop, so it will NOT traverse the whole array :) I am glad I could make it somewhat more efficient. THANKS a LOT for checking it, sir. – RavinderSingh13 Feb 11 '18 at 02:13
  • Thanks a lot everyone. This is the first time I'm learning these scripts. Lots of knowledge gained from these. Thanks again. – Dhanabalan Feb 11 '18 at 03:01
  • @RavinderSingh13 It seems that your awk is fast for small input files. If 9000 lines of input files are examined then the results are different. See updated benchmark. – George Vasiliou Feb 11 '18 at 09:59
  • @Dhanabalan Check out the parallel seds solution – George Vasiliou Feb 11 '18 at 09:59
  • Did all scripts produce the same output as each other? Why didn't you try the 2nd solution in my answer - the one I claimed would be faster? Also, it's extremely hard to believe that Ravinder's script, which does 3 function calls per line, would be faster than even my first script, which does one - did you take third-execution timings to remove caching issues? – Ed Morton Feb 11 '18 at 12:09
  • @EdMorton I tried your second script but for some reason my laptop hung badly... I had to hard-reset my machine (power off). I will try one more time. I usually run all scripts three times. – George Vasiliou Feb 11 '18 at 12:56
  • It's possible that the regexp string just gets too big; I was concerned about that. Did you compare the output files with each other to make sure all the scripts are producing the same output? Also, for your parallel sed scripts, you should add the step that concatenates the output files into 1 and tidies up by removing the temp output files, so its output is the same as the others. It'd also make sense to do it in a loop rather than hard-coding several scripts, for it to be a realistic approach. Thanks for providing the timings. – Ed Morton Feb 11 '18 at 12:59
  • P.S. I asked about the 3rd timing because I just can't see any way that `match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}` could be faster than `sub(val,a[i])` alone, but I can't think of anything other than caching that'd explain the results you got! – Ed Morton Feb 11 '18 at 13:02
  • @EdMorton Using cat for such small files (3000 lines each) does not really affect the parallel sed script's performance. I updated the script. I did not verify the results (file newcontents.txt) of any of the solutions. – George Vasiliou Feb 11 '18 at 13:48
  • That `cat` won't do the job though, since it'll cat the files in alphabetic instead of numeric order. What I'm getting at is that for a general, usable solution there's more work to do than just running sed multiple times. – Ed Morton Feb 11 '18 at 14:17
  • @EdMorton It is true that your second script turns my laptop into a brick. Sorry. I have uploaded the OP's files to GitHub (see links in this post); maybe you can run a test on your own. – George Vasiliou Feb 11 '18 at 14:18
  • Guess I'd have to carve that RE up into an array of still-large but smaller REs. Oh well. The OP's already selected an answer and isn't responding to questions (see my comment under the question), so it's time to move on like he apparently has... thanks again for running the timing tests. – Ed Morton Feb 11 '18 at 14:20

Could you please try the following awk solutions and let me know if they help you.

First solution:

awk 'FNR==NR{a[$1]=$2;next}   {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt  sample_content.txt

Output will be as follows.

288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw

Explanation of the first solution:

awk '
FNR==NR{                           ##FNR==NR is a condition which is TRUE only while the first Input_file, patterns.txt, is being read.
                                   ##FNR and NR both hold the current line number, but FNR is RESET for each new Input_file, while NR keeps increasing until all Input_file(s) are read.
  a[$1]=$2;                        ##Create an array a whose index is the first field of the line and whose value is the 2nd field.
  next                             ##next skips all further statements for this line.
}
{
for(i in a){                       ##Loop over all the elements of array a.
  match($0,i);                     ##Use awk's match function to look for the array index (held in variable i) in the current line.
  val=substr($0,RSTART,RLENGTH);   ##Create a variable val holding the matched substring, starting at position RSTART for RLENGTH characters.
  if(val){                         ##If val is NOT NULL (a match was found), then:
    sub(val,a[i])}                 ##use awk's sub function to substitute val with the array value a[i].
};
  print                            ##Print the current line, whether changed or not.
}
' patterns.txt  sample_content.txt ##Mention the Input_file(s) names here.

Second solution: instead of traversing the whole array every time as in the first solution, this one comes out of the loop as soon as a match is found:

awk '
FNR==NR{                           ##FNR==NR is a condition which is TRUE only while the first Input_file, patterns.txt, is being read.
                                   ##FNR and NR both hold the current line number, but FNR is RESET for each new Input_file, while NR keeps increasing until all Input_file(s) are read.
  a[$1]=$2;                        ##Create an array a whose index is the first field of the line and whose value is the 2nd field.
  next                             ##next skips all further statements for this line.
}
{
for(i in a){                       ##Loop over all the elements of array a.
  match($0,i);                     ##Use awk's match function to look for the array index (held in variable i) in the current line.
  val=substr($0,RSTART,RLENGTH);   ##Create a variable val holding the matched substring, starting at position RSTART for RLENGTH characters.
  if(val){                         ##If val is NOT NULL (a match was found), then:
    sub(val,a[i]);print;next}      ##use awk's sub function to substitute val with a[i], print the line and move on to the next one.
};
}
1                                  ##1 is a true condition, so lines that had no match are printed here unchanged.
' patterns.txt  sample_content.txt ##Mention the Input_file(s) names here.
RavinderSingh13
  • Expected output is coming fine, but it's taking too much time to complete. – Dhanabalan Feb 10 '18 at 10:09
  • I have attached the sample input file https://drive.google.com/open?id=1dVzivKMirEQU3yk9KfPM6iE7tTzVRdt_ – Dhanabalan Feb 10 '18 at 10:46
  • @Dhanabalan, let's first measure how much time it is taking, and then we could see what more we could do; let us know. – RavinderSingh13 Feb 10 '18 at 10:49
  • It's still running :( – Dhanabalan Feb 10 '18 at 10:54
  • @Dhanabalan, please be patient and test both of the solutions, and note down their timings by adding the `time` command before them. Let us know how it goes. We can't expect millions of lines to be processed in seconds, buddy. – RavinderSingh13 Feb 10 '18 at 11:04
  • That's much more complicated than necessary (you start with `i` and then you use `match()`+`substr()` to get `val` when `val` will always be identical to `i`, and then you do a `sub()` to look for the same string you just did a `match()` on!). See the first script I posted at https://stackoverflow.com/a/48720400/1745001 for the simple implementation of this approach. Yours will also fail if the same string appears multiple times on 1 line, as it'd only replace the first occurrence of each string. – Ed Morton Feb 10 '18 at 12:16

The simple approach is:

$ cat tst.awk
NR==FNR {
    map[$1] = $2
    next
}
{
    for (old in map) {
        gsub(old,map[old])
    }
    print
}

$ awk -f tst.awk patterns.txt sample_content.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw

Just like the other solutions posted so far, this applies every substitution to the whole line, so given a sample_content.txt containing `xay` and a patterns.txt including `a b` and `b c`, this would output `xcy` rather than `xby`.

Alternatively you could try this:

$ cat tst.awk
NR==FNR {
    map[$1] = $2
    re = re sep $1
    sep = "|"
    next
}
{
    head = ""
    tail = $0
    while ( match(tail,re) ) {
        head = head substr(tail,1,RSTART-1) map[substr(tail,RSTART,RLENGTH)]
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}

$ awk -f tst.awk patterns.txt sample_content.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw

That approach has several advantages:

  1. It would output `xby` (which is what I suspect you'd really want if that situation arose) in the case mentioned above
  2. It only does as many regexp comparisons per line of sample_content.txt as can actually match, instead of 1 per line of patterns.txt for every line of sample_content.txt
  3. It only operates on what's left of the line after the previous replacement, so the string being tested keeps shrinking
  4. It doesn't change $0, so awk doesn't have to recompile and re-split that record with every substitution.

So it should be much faster than the original script, assuming the regexp constructed from patterns.txt isn't so huge that it causes performance degradation just by its size.
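
If the combined regexp does turn out to be too big (a concern raised in the comments on the benchmark answer), one possible mitigation - a rough sketch, not benchmarked, and assuming (per the OP's comment under the question) that replacement values never reappear as patterns - is to carve the alternation up into several smaller regexps and apply them in turn:

$ cat tst2.awk
NR==FNR {
    map[$1] = $2
    grp = int((++cnt - 1) / 1000)              # at most 1000 alternatives per regexp
    res[grp] = (grp in res ? res[grp] "|" : "") $1
    next
}
{
    for (g = 0; g * 1000 < cnt; g++) {         # apply each smaller regexp in turn
        head = ""
        tail = $0
        while ( match(tail, res[g]) ) {
            head = head substr(tail,1,RSTART-1) map[substr(tail,RSTART,RLENGTH)]
            tail = substr(tail, RSTART+RLENGTH)
        }
        $0 = head tail                         # reassigning $0 gives up advantage 4 above
    }
    print
}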

Ed Morton