
I have a file of strings:

string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123

How do I retrieve the most common line in bash (string-string-123)?

Gilles 'SO- stop being evil'
Alex

3 Answers


You can use sort combined with uniq:

sort file | uniq -c | sort -n -r
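If you want just the single most common line itself, without its count, the pipeline can be extended. A minimal sketch using the question's sample data; the final `awk '{print $2}'` assumes the lines themselves contain no whitespace, as here:

```shell
# Sample data from the question:
cat > file <<'EOF'
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
EOF

# Count duplicates, sort by count (highest first), keep the top entry,
# then strip the leading count column:
sort file | uniq -c | sort -n -r | head -n 1 | awk '{print $2}'
# prints: string-string-123
```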
glenn jackman
Grzegorz Żur

You could use awk to do this:

awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file

The array a keeps a count of each line. Once the file has been read, we loop through the array and find the line with the maximum count.

Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:

awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file

Thanks to glenn jackman for this useful suggestion.
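As a quick sanity check, both one-liners agree on the sample input from the question:

```shell
cat > file <<'EOF'
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
EOF

# Version with the loop in the END block:
awk '{++a[$0]} END {for (i in a) if (a[i] > max) {max = a[i]; k = i} print k}' file
# prints: string-string-123

# Version that tracks the maximum while reading the file:
awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file
# prints: string-string-123
```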


It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:

awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
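For example, on a hypothetical input (ties.txt) where two lines are tied for the highest count, this version prints both of them. Note that the order in which they appear is unspecified, since awk's for (i in c) iteration order is implementation-defined:

```shell
cat > ties.txt <<'EOF'
apple
apple
banana
banana
cherry
EOF

# Both "apple" and "banana" occur twice, so both are printed
# (in an unspecified order):
awk 'max<++c[$0] {max=c[$0]} END {for (i in c) if (c[i] == max) print i}' ties.txt
```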
Tom Fenech
  • Move the "max" logic out of the END block for simplicity: `awk '{if (max < ++c[$0]) {max = c[$0]; line = $0}} END {print line}'` – glenn jackman Mar 28 '15 at 18:54
  • That's an elegant solution, but note that if there are _ties_ for the most frequently occurring line, your solution will only print _one_ of them, and it won't be obvious which. – mklement0 Mar 28 '15 at 20:07
  • 1
    @mklement0 it's a fair point. I've added another version which prints all of them. – Tom Fenech Mar 29 '15 at 12:33
  • Thanks for updating - nicely done. The only remaining concern is that this approach may not work with large input files, given that you read _all distinct_ input lines into memory, which could be a problem with a large number of distinct lines in the input. On the plus side, your approach is _much faster_ than the `sort` / `uniq` approach. – mklement0 Mar 29 '15 at 13:52
  • Another way is to use asort `awk '{b[a[$0]++]=$0}END{asort(b);print b[1]}'` –  Mar 30 '15 at 12:30
  • Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
    However, it may not be suitable for large files, because all distinct input lines are stored in an associative array in memory, which could be a problem if there are many non-duplicate lines; that said, it's much faster than the approaches discussed below.

  • Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:

    • all distinct lines are printed (highest-frequency count first)
    • output lines are prefixed by their occurrence count (which may actually be desirable).

While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.

Building on Grzegorz's answer, here's a generic solution that shows all most-frequently-occurring lines - however many there are - and only them:

sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'

If you don't want the output lines prefixed with the occurrence count:

sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' | 
  sed 's/^ *[0-9]\{1,\} //'

Explanation of Grzegorz Żur's answer:

  • sort sorts the input so that duplicate lines become adjacent (a prerequisite for uniq); uniq -c then outputs the set of unique input lines, each prefixed with its occurrence count (-c), followed by a single space.
  • sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
    • Note that sort, if -k is not specified, will generally try to sort by the entire input line, but -n causes only the longest prefix that is recognized as an integer to be used for sorting, which is exactly what's needed here.
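A small standalone illustration of that numeric-prefix behavior (not part of the answer's pipeline): a plain lexical sort compares character by character and orders "10" before "2", while -n compares the leading integers:

```shell
# Lexical sort: '1' < '2', so "10 b" sorts first:
printf '10 b\n2 a\n' | sort
# prints:
# 10 b
# 2 a

# Numeric sort (-n): 2 < 10, so "2 a" sorts first:
printf '10 b\n2 a\n' | sort -n
# prints:
# 2 a
# 10 b
```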

Explanation of my awk command:

  • NR==1 {prev=$1} stores the 1st whitespace-separated field ($1) in variable prev for the first input line (NR==1)
  • $1!=prev {exit} terminates processing if the 1st whitespace-separated field differs from the previous line's - this means that a non-topmost line has been reached, and no more lines need printing.
  • 1 is shorthand for { print } meaning that the input line at hand should be printed as is.
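Putting the pieces together on a hypothetical input with a tie, the awk filter keeps every line that shares the top count (the relative order of the tied lines may vary between sort implementations):

```shell
cat > file <<'EOF'
apple
apple
banana
banana
cherry
EOF

# Both tied lines survive the filter, each still prefixed by its count;
# "cherry" (count 1) is cut off by the exit:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
```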

Explanation of my sed command:

  • ^ *[0-9]\{1,\} matches the numeric prefix (denoting the occurrence count) of each output line, as (originally) produced by uniq -c
  • applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.
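For instance, applied to a single line in the format that uniq -c produces:

```shell
# uniq -c output has the form "<spaces><count><space><line>";
# the sed command strips everything up to and including the count:
printf '   3 string-string-123\n' | sed 's/^ *[0-9]\{1,\} //'
# prints: string-string-123
```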
mklement0