
I have two files, file1.csv

3 1009
7 1012
2 1013
8 1014

and file2.csv

5 1009
3 1010
1 1013

In the shell, I want to subtract the count in the first column of the second file from the count in the first file, matching lines on the identifier in the second column. If an identifier is missing from one of the files, its count is assumed to be 0.

The result would be

-2 1009
-3 1010 
7 1012
1 1013
8 1014

The files are huge (several GB). The second columns are sorted.

How would I do this efficiently in the shell?

user32849

4 Answers


Assuming that both files are sorted on the second column:

$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014

join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1, but for file2.
-oauto is in this case the same as -o 0,1.1,2.1, which will print the join column, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
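
To make that expansion concrete, the same join can be spelled out by hand (0 denotes the join field itself) and should produce the identical three-column output shown below:

$ join -j2 -a1 -a2 -o 0,1.1,2.1 -e0 file1 file2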

The output from join is three columns like:

1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0

This is piped to awk, which subtracts column three from column two and prints the difference followed by the identifier.
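
To sanity-check that awk step on its own, you can feed it a couple of joined lines by hand:

$ printf '1009 3 5\n1010 0 3\n' | awk '{print $2 - $3, $1}'
-2 1009
-3 1010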

Andreas Louv
  • Elegant solution, nice! I wonder if there's any significant time or memory difference between `join` and merge (`sort -m`) for large files like these - @user... could you test with your input and let us know? – Ed Morton Dec 14 '16 at 15:22
  • I've just written (independently) an answer that's very similar, but I went for `dc` rather than `awk`. I think your answer is a bit better than mine, with only the two processes rather than three. I must reach for awk more often! – Toby Speight Dec 14 '16 at 16:02
  • Very nice indeed. – James Brown Dec 14 '16 at 17:41
$ awk 'NR==FNR { a[$2]=$1; next }
               { a[$2]-=$1 }
           END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010

It reads the first file into memory, so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:

$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2  # previous $2 == current $2 -> subtract
3 1010 2  # previous $2 != current $2 and $3 == 2 -> print -$1
7 1012 1
2 1013 1  # previous $2 != current $2 and previous $3 == 1 -> print previous line as-is
1 1013 2
8 1014 1

(I'm out of time for now, maybe I'll finish it later)

EDIT by Ed Morton: Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer; feel free to modify or delete it:

$ cat tst.awk
{ split(prev,p) }                   # p[] holds the previous, still-unmatched line
$2 == p[2] {                        # same id in both files (the file1 line came first)
    print p[1] - $1, p[2]
    prev = ""
    next
}
p[2] != "" {                        # previous id had no pair: sign depends on its source file
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {                               # flush the last line only if it was left unmatched
    if (prev != "") {
        split(prev,p)
        print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
    }
}

$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Ed Morton
James Brown
  • Beaten by 3 seconds! I was about to post the same! :) – Inian Dec 14 '16 at 13:57
  • @Inian Victory is better served cold, or something like that... :D Been there many times myself. I'm now thinking, if he really has gigs and gigs of data, there should be some more efficient way. This solution is crappy that way. – James Brown Dec 14 '16 at 13:59
  • add a `| sort -k2` to be fully compliant :-) – NeronLeVelu Dec 14 '16 at 14:00
  • It still holds the first table whole in memory. – James Brown Dec 14 '16 at 14:01
  • @JamesBrown With huge data files, memory is often a problem if you need to remember info and are not working with SQL-like tools/environments. – NeronLeVelu Dec 14 '16 at 14:02
  • For memory optimization, we need to prepare the two inputs a bit and serve them as an "ordered" stream, ideally. It is still time-consuming to sort or prepare the data :-( – NeronLeVelu Dec 14 '16 at 14:05
  • Would there perhaps be a way to use 'join', since the second column is sorted? – user32849 Dec 14 '16 at 14:44
  • I was going the `sort -m` route too but hit a wall when I discovered I couldn't tell a file1-only value from a file2-only value in the output and of course the solution for that is exactly what you're doing with appending a file number. You don't need the `|sort -k2`s though since the input files are already sorted. – Ed Morton Dec 14 '16 at 15:18
  • Hope you don't mind but I added my awk script to the end of your answer rather than posting my own answer that would've also been based on the `sort -m`, which is the key part of this solution. You don't need the `-k3`, btw, since `-k2` means "start at key 2" not "only use key 2". – Ed Morton Dec 14 '16 at 15:35
  • Not at all, thank you sir. I hate it when work bothers my hobby. :D – James Brown Dec 14 '16 at 17:40

Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:

$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0

All those options are required:

  • -j2 says to join based on the second column of each file
  • -o auto says to make every row have the same format, beginning with the join key
  • -e 0 says that missing values should be substituted with zero
  • -a 1 and -a 2 include rows that are absent from one file or the other
  • the filenames (I've used names based on the question number here)

Now that we have a stream of output in that format, we can do the subtraction on each line. I used this sed command to transform the above output into a dc program:

sed -e 's/.*/c& -n[ ]np/'

This takes the three values on each line and rearranges them into a dc program that performs the subtraction and prints the result. For example, the first line becomes (with spaces added for clarity)

c 1009 3 5 -n [ ]n p

which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving

-2 1009

as required.
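
You can try a single generated line against dc directly:

$ echo 'c1009 3 5 -n[ ]np' | dc
-2 1009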

We can then pipe all these lines into dc, giving us the output file that we want:

$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
>   | sed -e 's/.*/c& -n[ ]np/' \
>   | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014

¹ The sorting needs to be consistent with the LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.


TL;DR

The full command is:

join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc

It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.

Toby Speight

Assuming the files are blank-separated, as shown; if the separator is actually ",", add the argument -F ','.

awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
             {Discounts[$2]=$1; ids[$2]++}
     END     { for (id in ids) print Inits[id] - Discounts[id] " " id }
    ' file1.csv file2.csv
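
Note that for (id in ids) visits the identifiers in an unspecified order; to get output ordered by identifier as in the question, append a numeric sort on the second column:

awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
             {Discounts[$2]=$1; ids[$2]++}
     END     { for (id in ids) print Inits[id] - Discounts[id] " " id }
    ' file1.csv file2.csv | sort -k2,2n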

To reduce memory use (this could be a single pipeline, as sketched below, but I prefer to use a temporary file):

awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
 | sort -k2 \
 > file.tmp
awk 'Last != $2 {
        if (NR != 1) print Result " " Last
        Last = $2; Result = $1
        next
        }
     Last == $2 { Result += $1 }
     END { if (NR) print Result " " Last }
    ' file.tmp
rm file.tmp
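
The single pipeline mentioned above would look like this (same logic, no temporary file):

awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
 | sort -k2 \
 | awk 'Last != $2 {
           if (NR != 1) print Result " " Last
           Last = $2; Result = $1
           next
           }
        Last == $2 { Result += $1 }
        END { if (NR) print Result " " Last }'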
NeronLeVelu
  • Thanks a lot! This seems to work, but uses quite a substantial amount of memory (killed it at 67GB). I guess it uses some kind of associative array to hold the values? I've changed the question to emphasize that the second column is sorted. – user32849 Dec 14 '16 at 13:58