40

I have a large file containing data like this:

a 23
b 8
a 22
b 1

I want to be able to get this:

a 45
b 9

I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?


7 Answers

44

Edit: the modern (GNU/Linux) solution, as mentioned in the comments years ago ;-).

awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", key, arr[key])
   }' file \
   | sort -k1,1
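As a quick demo (assuming the sample data from the question, saved as `file`), the pipeline above produces the expected sums:

```shell
# Create the sample input from the question.
printf 'a 23\nb 8\na 22\nb 1\n' > file

# Sum the second column per key in an associative array,
# then sort the (unordered) output by key.
awk '{ arr[$1] += $2 }
     END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' file |
  sort -k1,1
# a	45
# b	9
```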

The originally posted solution, based on old Unix sort options:

awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", key, arr[key])
   }' file \
   | sort +0n -1

I hope this helps.

shellter
  • what exactly do those arguments do to sort? I don't see them in the man page and the invocation page has left me confused. – EricR Apr 23 '12 at 19:11
  • 1
    Modern versions of sort prefer the `-k` syntax for specifying sort keys: `sort -nk1,1` instead of `sort +0n -1`. But since the keys are letters, why are you specifying `-n` anyway? – Mark Reed Apr 23 '12 at 19:11
  • @EricR: `+0n -1` is old-fashioned for `-n -k1,1`: sort numerically by the first (whitespace-separated) field. – Mark Reed Apr 23 '12 at 19:12
  • You could also let awk do the sorting: `n = asorti(arr, keys); for (i = 1; i <= n; i++) printf "%s\t%s\n", keys[i], arr[keys[i]]` – Mark Reed Apr 23 '12 at 19:14
  • @MarkReed you _could_, but GNU sort would be the more efficient approach if one wants to scale to really big inputs (particularly if those inputs will be larger than available memory; GNU sort will partition into temporary files in that case) – Charles Duffy Apr 23 '12 at 19:17
  • @CharlesDuffy Point, but the associative array would be pretty huge at that point, too; I think awk would have trouble even without the sorting in that case. – Mark Reed Apr 23 '12 at 19:19
  • @MarkReed ...hence the second version given in my pure-bash answer, which doesn't store more than two lines at a time and thus scales to very large inputs nicely. :) – Charles Duffy Apr 23 '12 at 19:21
  • Thanks folks, I was going to cover some of those points in a detailed answer. But 2 lines at a time, can't beat that! How is the performance? Good luck to all. – shellter Apr 23 '12 at 19:34
  • You can make it a real one-liner saying `awk '{sum[$1]+=$2} END {for (val in sum) print val, sum[val]}' <(sort file)`. – fedorqui Aug 24 '16 at 08:16
8

No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:

#!/bin/bash
declare -A values
while read key value; do
  values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
  printf "%s %s\n" "$key" "${values[$key]}"
done
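For example, saved under a hypothetical name like `sum.sh`, the script reads stdin and prints one line per key. Since the iteration order of a Bash associative array is unspecified, pipe through `sort` if you want deterministic output:

```shell
# Hypothetical demo: write the script above to sum.sh
# (filename is an assumption, not from the original answer).
cat > sum.sh <<'EOF'
declare -A values
while read key value; do
  values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
  printf "%s %s\n" "$key" "${values[$key]}"
done
EOF

# Feed it the question's sample data; sort for stable output order.
printf 'a 23\nb 8\na 22\nb 1\n' | bash sum.sh | sort -k1,1
# a 45
# b 9
```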

...or, if you sort the file first (which will be more memory-efficient; GNU sort is able to do tricks to sort files larger than memory, which a naive script -- whether in awk, python or shell -- typically won't), you can do this in a way which will work in older versions (I expect the following to work through bash 2.0):

#!/bin/bash
read cur_key cur_value
while read key value; do
  if [[ $key = "$cur_key" ]] ; then
    cur_value=$(( cur_value + value ))
  else
    printf "%s %s\n" "$cur_key" "$cur_value"
    cur_key="$key"
    cur_value="$value"
  fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
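This second version assumes its input is already grouped by key, so in practice you would run it downstream of `sort`. A sketch (the filename `sum_sorted.sh` is hypothetical):

```shell
# Hypothetical demo: write the streaming script above to sum_sorted.sh.
cat > sum_sorted.sh <<'EOF'
read cur_key cur_value
while read key value; do
  if [[ $key = "$cur_key" ]] ; then
    cur_value=$(( cur_value + value ))
  else
    printf "%s %s\n" "$cur_key" "$cur_value"
    cur_key="$key"
    cur_value="$value"
  fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
EOF

# Sort first so equal keys are adjacent, then stream two lines at a time.
printf 'a 23\nb 8\na 22\nb 1\n' | sort -k1,1 | bash sum_sorted.sh
# a 45
# b 9
```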
Charles Duffy
  • 3
    Heck, with some minimal munging the above would work in vanilla Bourne, no bash required.`while read key value; do if [ "$key" = "$cur_key" ]; then cur_value=\`expr $cur_value + $value\`; else echo "$cur_key $cur_value"; cur_key="$key"; cur_value="$value"; fi; done; echo "$cur_key $cur_value"` – Mark Reed Apr 23 '12 at 19:26
  • 2
    @MarkReed, definitely so, though the performance impact of the subshell running `expr` is sufficient that the POSIX sh `$(( ))` expansion would be better; while `(( ))` is a bash extension, `$(( ))` is standard-compliant; it's only pre-1991-POSIX-standard Bourne sh where `expr` is needed. – Charles Duffy Mar 03 '16 at 17:09
  • 1
    Another (Bash-specific, more concise) way to express the value addition in your first example: `(( values["$key"] += value ))` Note: the default-0 isn't necessary. – Dennis Williamson Oct 15 '18 at 18:45
8

This Perl one-liner seems to do the job:

perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile
Dennis Williamson
5

This can be easily achieved with the following one-liner:

cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"

Or:

termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"

Here a Python package, termsql, is used, which is a wrapper around SQLite. Note that it's currently not uploaded to PyPI and can only be installed system-wide (its setup.py is a little broken), like:

pip install --user https://github.com/tobimensch/termsql/archive/master.zip

Update

In 2020 version 1.0 was finally uploaded to PyPI, so pip install --user termsql can be used.

saaj
2

One way using perl:

perl -ane '
    next unless @F == 2; 
    $h{ $F[0] } += $F[1]; 
    END { 
        printf qq[%s %d\n], $_, $h{ $_ } for sort keys %h;
    }
' infile

Content of infile:

a 23
b 8
a 22
b 1

Output:

a 45
b 9
Birei
2

With GNU awk (versions less than 4):

WHINY_USERS= awk 'END {
  for (E in a)
    print E, a[E]
}
{ a[$1] += $2 }' infile

With GNU awk >= 4:

awk 'END {
  PROCINFO["sorted_in"] = "@ind_str_asc"
  for (E in a)
    print E, a[E]
}
{ a[$1] += $2 }' infile
Dimitre Radoulov
2

With a sort + awk combination, one could try the following, which avoids creating an array.

sort -k1 Input_file | 
awk '
  prev!=$1 && prev{
    print prev,(prevSum?prevSum:"N/A")
    prev=prevSum=""
  }
  {
    prev=$1
    prevSum+=$2
  }
  END{
    if(prev){
       print prev,(prevSum?prevSum:"N/A")
    }
}'

Explanation: a detailed explanation of the above.

sort -k1 Input_file |                     ##Using sort command to sort Input_file by 1st field and sending output to awk as an input.
awk '                                     ##Starting awk program from here.
  prev!=$1 && prev{                       ##Checking condition prev is NOT equal to first field and prev is NOT NULL.
    print prev,(prevSum?prevSum:"N/A")    ##Printing prev and prevSum(if its NULL then print N/A).
    prev=prevSum=""                       ##Nullify prev and prevSum here.
  }
  {
    prev=$1                               ##Assigning 1st field to prev here.
    prevSum+=$2                           ##Adding 2nd field to prevSum.
  }
  END{                                    ##Starting END block of this awk program from here.
    if(prev){                             ##Checking condition if prev is NOT NULL then do following.
       print prev,(prevSum?prevSum:"N/A") ##Printing prev and prevSum(if its NULL then print N/A).
    }
}'
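Condensed into one line and run on the question's sample data (saved as `Input_file`), it behaves like this:

```shell
# Create the sample input from the question.
printf 'a 23\nb 8\na 22\nb 1\n' > Input_file

# Sort so equal keys are adjacent, then emit each group's sum
# when the key changes (and once more at END for the last group).
sort -k1 Input_file |
awk 'prev!=$1 && prev{ print prev,(prevSum?prevSum:"N/A"); prev=prevSum="" }
     { prev=$1; prevSum+=$2 }
     END{ if(prev){ print prev,(prevSum?prevSum:"N/A") } }'
# a 45
# b 9
```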
RavinderSingh13