40

I have a large file containing data like this:

a 23
b 8
a 22
b 1

I want to be able to get this:

a 45
b 9

I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?


7 Answers

44

Edit: the modern (GNU/Linux) solution, as mentioned in the comments years ago ;-).

awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", key, arr[key])
   }' file \
   | sort -k1,1
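As a quick demo (assuming the sample data from the question, saved as `file`), the pipeline above produces the expected sums:

```shell
# Create the sample input from the question.
printf 'a 23\nb 8\na 22\nb 1\n' > file

# Sum the second column per key in an associative array,
# then sort the (unordered) output by key.
awk '{ arr[$1] += $2 }
     END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' file |
  sort -k1,1
# a	45
# b	9
```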

The originally posted solution, based on old Unix sort options:

awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", key, arr[key])
   }' file \
   | sort +0n -1

I hope this helps.

shellter
  • what exactly do those arguments do to sort? I don't see them in the man page and the invocation page has left me confused. – EricR Apr 23 '12 at 19:11
  • 1
    Modern versions of sort prefer the `-k` syntax for specifying sort keys: `sort -nk1,1` instead of `sort +0n -1`. But since the keys are letters, why are you specifying `-n` anyway? – Mark Reed Apr 23 '12 at 19:11
  • @EricR: `+0n -1` is old-fashioned for `-n -k1,1`: sort numerically by the first (whitespace-separated) field. – Mark Reed Apr 23 '12 at 19:12
  • You could also let awk do the sorting: `n = asorti(arr, keys); for (i = 1; i <= n; i++) printf "%s\t%s\n", keys[i], arr[keys[i]]` – Mark Reed Apr 23 '12 at 19:14
  • @MarkReed you _could_, but GNU sort would be the more efficient approach if one wants to scale to really big inputs (particularly if those inputs will be larger than available memory; GNU sort will partition into temporary files in that case) – Charles Duffy Apr 23 '12 at 19:17
  • @CharlesDuffy Point, but the associative array would be pretty huge at that point, too; I think awk would have trouble even without the sorting in that case. – Mark Reed Apr 23 '12 at 19:19
  • @MarkReed ...hence the second version given in my pure-bash answer, which doesn't store more than two lines at a time and thus scales to very large inputs nicely. :) – Charles Duffy Apr 23 '12 at 19:21
  • Thanks folks, I was going to cover some of those points in a detailed answer. But 2 lines at a time, can't beat that! How is the performance? Good luck to all. – shellter Apr 23 '12 at 19:34
  • You can make it a real one-liner saying `awk '{sum[$1]+=$2} END {for (val in sum) print val, sum[val]}' <(sort file)`. – fedorqui Aug 24 '16 at 08:16
8

No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:

#!/bin/bash
declare -A values
while read key value; do
  values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
  printf "%s %s\n" "$key" "${values[$key]}"
done
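For example, saved under a hypothetical name like `sum.sh`, the script reads stdin and prints one line per key. Since the iteration order of a Bash associative array is unspecified, pipe through `sort` if you want deterministic output:

```shell
# Hypothetical demo: write the script above to sum.sh
# (filename is an assumption, not from the original answer).
cat > sum.sh <<'EOF'
declare -A values
while read key value; do
  values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
  printf "%s %s\n" "$key" "${values[$key]}"
done
EOF

# Feed it the question's sample data; sort for stable output order.
printf 'a 23\nb 8\na 22\nb 1\n' | bash sum.sh | sort -k1,1
# a 45
# b 9
```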

...or, if you sort the file first (which will be more memory-efficient; GNU sort is able to do tricks to sort files larger than memory, which a naive script -- whether in awk, python or shell -- typically won't), you can do this in a way which will work in older versions (I expect the following to work through bash 2.0):

#!/bin/bash
read cur_key cur_value
while read key value; do
  if [[ $key = "$cur_key" ]] ; then
    cur_value=$(( cur_value + value ))
  else
    printf "%s %s\n" "$cur_key" "$cur_value"
    cur_key="$key"
    cur_value="$value"
  fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
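This second version assumes its input is already grouped by key, so in practice you would run it downstream of `sort`. A sketch (the filename `sum_sorted.sh` is hypothetical):

```shell
# Hypothetical demo: write the streaming script above to sum_sorted.sh.
cat > sum_sorted.sh <<'EOF'
read cur_key cur_value
while read key value; do
  if [[ $key = "$cur_key" ]] ; then
    cur_value=$(( cur_value + value ))
  else
    printf "%s %s\n" "$cur_key" "$cur_value"
    cur_key="$key"
    cur_value="$value"
  fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
EOF

# Sort first so equal keys are adjacent, then stream two lines at a time.
printf 'a 23\nb 8\na 22\nb 1\n' | sort -k1,1 | bash sum_sorted.sh
# a 45
# b 9
```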
Charles Duffy
  • 3
    Heck, with some minimal munging the above would work in vanilla Bourne, no bash required.`while read key value; do if [ "$key" = "$cur_key" ]; then cur_value=\`expr $cur_value + $value\`; else echo "$cur_key $cur_value"; cur_key="$key"; cur_value="$value"; fi; done; echo "$cur_key $cur_value"` – Mark Reed Apr 23 '12 at 19:26
  • 2
    @MarkReed, definitely so, though the performance impact of the subshell running `expr` is sufficient that the POSIX sh `$(( ))` expansion would be better; while `(( ))` is a bash extension, `$(( ))` is standard-compliant; it's only pre-1991-POSIX-standard Bourne sh where `expr` is needed. – Charles Duffy Mar 03 '16 at 17:09
  • 1
    Another (Bash-specific, more concise) way to express the value addition in your first example: `(( values["$key"] += value ))` Note: the default-0 isn't necessary. – Dennis Williamson Oct 15 '18 at 18:45
8

This Perl one-liner seems to do the job:

perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile
Dennis Williamson
5

This can be easily achieved with the following one-liner:

cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"

Or:

termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"

Here a Python package, termsql, is used, which is a wrapper around SQLite. Note that it's currently not uploaded to PyPI and can only be installed system-wide (its setup.py is a little broken), like:

pip install --user https://github.com/tobimensch/termsql/archive/master.zip

Update

In 2020 version 1.0 was finally uploaded to PyPI, so pip install --user termsql can be used.

saaj
2

One way using perl:

perl -ane '
    next unless @F == 2; 
    $h{ $F[0] } += $F[1]; 
    END { 
        printf qq[%s %d\n], $_, $h{ $_ } for sort keys %h;
    }
' infile

Content of infile:

a 23
b 8
a 22
b 1

Output:

a 45
b 9
Birei
2

With GNU awk (versions less than 4):

WHINY_USERS= awk 'END {
  for (E in a)
    print E, a[E]
}
{ a[$1] += $2 }' infile

With GNU awk >= 4:

awk 'END {
  PROCINFO["sorted_in"] = "@ind_str_asc"
  for (E in a)
    print E, a[E]
}
{ a[$1] += $2 }' infile
Dimitre Radoulov
2

With a sort + awk combination, one could try the following, which avoids creating an array.

sort -k1 Input_file | 
awk '
  prev!=$1 && prev{
    print prev,(prevSum?prevSum:"N/A")
    prev=prevSum=""
  }
  {
    prev=$1
    prevSum+=$2
  }
  END{
    if(prev){
       print prev,(prevSum?prevSum:"N/A")
    }
}'

Explanation: a detailed explanation of the above.

sort -k1 Input_file |                     ##Using sort command to sort Input_file by 1st field and sending output to awk as an input.
awk '                                     ##Starting awk program from here.
  prev!=$1 && prev{                       ##Checking condition prev is NOT equal to first field and prev is NOT NULL.
    print prev,(prevSum?prevSum:"N/A")    ##Printing prev and prevSum(if its NULL then print N/A).
    prev=prevSum=""                       ##Nullify prev and prevSum here.
  }
  {
    prev=$1                               ##Assigning 1st field to prev here.
    prevSum+=$2                           ##Adding 2nd field to prevSum.
  }
  END{                                    ##Starting END block of this awk program from here.
    if(prev){                             ##Checking condition if prev is NOT NULL then do following.
       print prev,(prevSum?prevSum:"N/A") ##Printing prev and prevSum(if its NULL then print N/A).
    }
}'
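Condensed into one line and run on the question's sample data (saved as `Input_file`), it behaves like this:

```shell
# Create the sample input from the question.
printf 'a 23\nb 8\na 22\nb 1\n' > Input_file

# Sort so equal keys are adjacent, then emit each group's sum
# when the key changes (and once more at END for the last group).
sort -k1 Input_file |
awk 'prev!=$1 && prev{ print prev,(prevSum?prevSum:"N/A"); prev=prevSum="" }
     { prev=$1; prevSum+=$2 }
     END{ if(prev){ print prev,(prevSum?prevSum:"N/A") } }'
# a 45
# b 9
```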
RavinderSingh13