2

I basically have the following 2 files:

$ cat file1.txt
AB,12 34 56,2.4,256,,
CD,23 45 67,10.8,257,,
EF,34 56 78,0.6,258,,
GH,45 67 89,58.3,259,,
...
$ cat file2.txt
AB,12 34 56,2.4,36
XY,56 99 11,3.6,15
ZQ,12 36 89,5.9,0
EF,34 56 78,0.6,99
GH,45 67 89,58.3,79
...

And for every line in file1.txt, I'd like to use the first 3 fields as an index in file2.txt, grab the corresponding last field, and place it into file1.txt like so:

cat newfile.txt
AB,12 34 56,2.4,256,36,
CD,23 45 67,10.8,257,,
EF,34 56 78,0.6,258,99,
GH,45 67 89,58.3,259,79,

There is no guarantee that each line in file1 will appear in file2, and vice versa, and for such cases empty fields shown above in newfile.txt are fine.

In my first attempt, I was reading in each line from file1 in a while read loop, then grepping for the appropriate line in file2, and it worked but it was just way too slow. file1 and file2 have hundreds of thousands of lines each.

Is there any way I can use sed to use the first 3 fields of each line from file1 as an index into file2, lookup the value I need, and append it to that line in file1? And do so without reading file1 line by line?

Any help is appreciated.

Marc Ryan
  • 23
  • 2

3 Answers3

2

Using join and sed (for some pre and post processing), and assuming the | character doesn't appear in either file

join -a1 -t'|' \
    <(sort file1.txt | sed 's/,/|/3') \
    <(sort file2.txt | sed 's/,/|/3') |
    sed 's/,|//; s/|/,/; s/[^,]$/&,/' > newfile.txt

(Tested with the input given in the question)

It could be done in plain bash using associative arrays, but I doubt if it would be efficient. For example:

#!/bin/bash

declare -A tail

while IFS= read -r line; do
    if [[ $line =~ ([^,]*,){3} ]]; then
        tail[${BASH_REMATCH[0]}]=${line#"${BASH_REMATCH[0]}"}
    fi
done < file2.txt

while IFS= read -r line; do
    if [[ $line =~ ([^,]*,){3} ]] && [[ -n ${tail[${BASH_REMATCH[0]}]} ]]; then
        printf '%s%s\n' "${line%?}" "${tail[${BASH_REMATCH[0]}]},"
    else
        printf '%s\n' "$line"
    fi
done < file1.txt > newfile.txt
M. Nejat Aydin
  • 9,597
  • 1
  • 7
  • 17
  • 1
    Join seems to be the appropriate command for this task. Thanks for the help! – Marc Ryan Feb 11 '21 at 00:20
  • @MarcRyan `join` is the appropriate command for joining 2 sorted files on an individual field but that's not the task at hand here. Using `join` on the output of 2 `sort`s piped to 2 `sed`s within 2 `bash process substitutions` with it's output piped to another `sed` and with assumptions about what characters might be present in the input isn't so obviously the appropriate command for this task vs simply calling awk once. – Ed Morton Feb 12 '21 at 01:23
  • @M.NejatAydin with respect to "It could be done in plain bash using associative arrays, but I doubt if it would be efficient." - you're right, it would be extremely inefficient, orders of magnitude slower than an awk script or even the join+sorts+seds+bash script. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for the main reasons plus bash accessing associative arrays is slow in general. – Ed Morton Feb 12 '21 at 01:29
1

Using awk:

awk -F, 'FNR==NR { map[$1","$2","$3]=$4;next } { print $1","$2","$3","$4","map[$1","$2","$3]"," }' file2.txt file1.txt

Process file1.txt first (FNR==NR) Create an array map with the first, second and third comma delimited fields as the index and the 4th field as the value. Then for the second file, print the 1st,2nd,3rd and 4th field along with the contents of the map array for the first index, separated with commas.

Raman Sailopal
  • 12,320
  • 2
  • 11
  • 18
0

This will work very efficiently using any awk in any shell on every Unix box and doesn't rely on any characters not being present in the input:

$ awk '
    BEGIN { FS=OFS="," }
    { key = $1 FS $2 FS $3 }
    NR==FNR { map[key] = $4; next }
    { $5 = map[key] }
1' file2 file1
AB,12 34 56,2.4,256,36,
CD,23 45 67,10.8,257,,
EF,34 56 78,0.6,258,99,
GH,45 67 89,58.3,259,79,
Ed Morton
  • 188,023
  • 17
  • 78
  • 185