
I need to apply a series of substitutions on a text file, using a filter file with the same number of lines: line n of the filter should apply to line n of the original file.

E.g. original file:

foo
bar
foobar

Filter file:

s/oo/uu/
s/a/i/
s/b/l/

Expected result:

fuu
bir
foolar

Since sed applies every filter expression to every line, using sed -f filterfile is particularly inefficient (the number of lines n is fairly large, so n² is quite large as well…). Furthermore, although in my particular case I can modify the filters to avoid this issue, this command will produce wrong results on the example, because later filters would also hit earlier lines.

I'm currently implementing the following approach (still trying to fix an issue with tabs being mangled…):

paste -d'@' filterA filterB infile \
  | while IFS='@' read -r AA BB LINE; do
      echo "$LINE" | sed "s/$AA/$BB/g"   # quoting "$LINE" preserves tabs
    done > outfile

But I'm wondering whether there is a more elegant solution, e.g. some sed option? (Preferably with standard GNU/Linux tools.)
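For reference, a self-contained sketch of this per-line approach that avoids splitting the filter into two files: each whole filter line is passed to sed as its expression. It assumes @ occurs in neither file; the filenames infile and filterfile simply mirror the example above.

```shell
# Recreate the sample files from the question.
printf 'foo\nbar\nfoobar\n' > infile
printf 's/oo/uu/\ns/a/i/\ns/b/l/\n' > filterfile

# Pair line n of the filter with line n of the input,
# then apply each expression to its own line only.
paste -d'@' filterfile infile \
  | while IFS='@' read -r expr line; do
      printf '%s\n' "$line" | sed "$expr"
    done
# fuu
# bir
# foolar
```

This fixes correctness but not speed: it still spawns one sed process per input line.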

Benjamin W.
Skippy le Grand Gourou

2 Answers


You could modify your filter file by adding the proper line address in front of each line:

$ nl filter
     1  s/oo/uu/
     2  s/a/i/
     3  s/b/l/

and then pipe this to sed:

$ nl filter | sed -f- infile
fuu
bir
foolar

If the substitutions need to be global, append g first:

$ sed 's/$/g/' filter
s/oo/uu/g
s/a/i/g
s/b/l/g

resulting in

sed 's/$/g/' filter | nl | sed -f- infile

A small optimization is to append a b (branch) command after each substitution, so sed skips the rest of the script once a line's own filter has run:

sed 's/.*/{&g;b}/' filter | nl | sed -f- infile
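To see what this pipeline actually feeds to the final sed, one can inspect the generated script (sample filter recreated inline): each filter line becomes an addressed block that substitutes globally and then branches.

```shell
printf 's/oo/uu/\ns/a/i/\ns/b/l/\n' > filter
sed 's/.*/{&g;b}/' filter | nl
#      1  {s/oo/uu/g;b}
#      2  {s/a/i/g;b}
#      3  {s/b/l/g;b}
```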

This starts the next cycle immediately. The effect for a ~33,000-line version of the input and filter files from the question is about a 20% time saving:

$ wc -l filter infile
 33033 filter
 33033 infile
 66066 total
$ time sed 's/$/g/' filter | nl | sed -f- infile >/dev/null

real    0m15.868s
user    0m15.522s
sys     0m0.296s
$ time sed 's/.*/{&g;b}/' filter | nl | sed -f- infile >/dev/null

real    0m12.238s
user    0m11.901s
sys     0m0.271s

If your file is large, awk is a lot faster (code courtesy of Ed Morton):

$ time awk 'NR==FNR{o[NR]=$2;n[NR]=$3;next} {gsub(o[FNR],n[FNR])} 1' filter infile >/dev/null

real    0m0.073s
user    0m0.061s
sys     0m0.007s
Benjamin W.
  • That should solve the problem of getting wrong results and should improve the efficiency but isn't sed still going to test every line of the filter file once for every line of input? I could be wrong but I'd expect it just won't execute the substitution because the line number won't match but it'll still go through every line of the script file testing each line number. So if your filter file is 1000 lines long then sed will do 1000 line number comparisons for each line of infile. – Ed Morton May 16 '19 at 17:12
  • @EdMorton Yes, I think it'll try matching everything with everything for every line. I'll add a small optimization. – Benjamin W. May 16 '19 at 17:25
  • Would you mind posting the timing for `awk 'NR==FNR{o[NR]=$2;n[NR]=$3;next} {gsub(o[FNR],n[FNR])} 1' filter infile >/dev/null` when run on those same files to see if there's a noticeable difference? – Ed Morton May 16 '19 at 17:31
  • If you don't mind I'd also be interested to see if `awk 'BEGIN{ while ( (getline line < "filter") > 0 ) { split(line,f,"/"); o[NR]=f[2]; n[NR]=f[3]} } {gsub(o[FNR],n[FNR])} 1' infile >/dev/null` makes a difference to the timing (removed the NR==FNR test from each line and turned off field splitting by not mentioning any fields) – Ed Morton May 16 '19 at 17:36
  • @EdMorton The last one is a bit slower: around 0.1 s. – Benjamin W. May 16 '19 at 17:37
  • Really? That is surprising! Good to know, thanks. I'd much rather stick with the first approach as it's easier to write and enhance and now I know it's faster too it's a clear favorite. – Ed Morton May 16 '19 at 17:39
  • I was performing the same tests on my files and came to the same conclusions. It's worth noting that, although less efficient, the first `sed` solution is still several orders of magnitude faster than the brute-force `sed` mentioned in the OP (it went from almost 5 min to about 2 s… while `awk` took less than half a second). Though I'm fond of `awk` in general, I find the `sed` solution more elegant here, but given the performance I'll accept Ed's answer. – Skippy le Grand Gourou May 16 '19 at 17:45
awk -F'/' '
NR==FNR {                  # first file: the filter
    old[NR] = $2           # pattern (between first and second /)
    new[NR] = $3           # replacement (between second and third /)
    next
}
{ gsub(old[FNR], new[FNR]) }   # apply filter n to input line n
1' filterfile originalfile
fuu
bir
foolar

The above will work using any awk in any shell on any UNIX box.
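One detail worth noting: gsub makes each substitution global, matching the `sed 's/$/g/'` variant from the other answer. A minimal check with hypothetical one-line files:

```shell
printf 's/oo/uu/\n' > filterfile
printf 'foofoo\n' > originalfile
awk -F'/' '
NR==FNR { old[NR]=$2; new[NR]=$3; next }
{ gsub(old[FNR], new[FNR]) }   # gsub: every match on the line, not just the first
1' filterfile originalfile
# fuufuu
```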

Ed Morton