47

Given an input file containing a single number per line, how can I get a count of how many times each item occurs in that file?

cat input.txt
1
2
1
3
1
0

desired output (=>[1,3,1,1]):

cat output.txt
0 1
1 3
2 1
3 1

It would be great if the solution could also be extended to floating-point numbers.

tommy.carstensen
Javier
    This kind of output is simple and useful, but it's not a histogram. See, for example, http://quarknet.fnal.gov/toolkits/ati/histograms.html – Mike Sherrill 'Cat Recall' May 20 '11 at 00:07
  • I agree you are not asking for a histogram. That can however also be accomplished with `bash`, which is what I came looking for. See this question and its answers: https://unix.stackexchange.com/questions/177777/drawing-a-histogram-from-a-bash-command-output – tommy.carstensen May 21 '19 at 14:20

7 Answers

87

You mean you want a count of how many times each item appears in the input file? First sort it (using `-n` if the input is always numbers, as in your example), then count the unique results:

sort -n input.txt | uniq -c
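
For the sample input this prints the count before each value (the exact leading whitespace from `uniq -c` may vary):

      1 0
      3 1
      1 2
      1 3

To get the value first, swap the columns with awk as shown in the comments below.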
Caleb
    I didn't know about the `uniq` command. I changed it to `cat input.txt | sort -n | uniq -c | awk '{print $2 " " $1}'`, now I'm obtaining the desired output. – Javier May 18 '11 at 12:29
    Your use of awk to get the ordering is fine, but you don't need to use cat there. You should learn about the `<` operator to input files into programs and even things like loop constructions. For humor value, see [the useless use of cat awards](http://partmaps.org/era/unix/award.html#cat) – Caleb May 18 '11 at 12:34
12

Another option:

awk '{n[$1]++} END {for (i in n) print i,n[i]}' input.txt | sort -n > output.txt
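
With the sample input, output.txt then contains exactly the desired result:

0 1
1 3
2 1
3 1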
glenn jackman
  • thanks for illustrating an `awk`-based solution. From what I understood, in the first part you store the `histogram` into the `n` array, considering the elements in column `$1`. The `END` part means that it's going to be done `after` the histogram is built, right? Is it not necessary to initialize the variable `i` for loops in `awk`? Then, the `sort -n` is going to be applied only to the first column of the output, `i, n[i]`, right? i.e. not to `n[i]`? Furthermore, would this solution only work for `integer` numbers (because of the indexing of the array)? – Javier May 18 '11 at 14:12
  • @Javier, the `n` array simply keeps a count of the strings it sees in the input file. It can be int, float or any arbitrary string. Yes, the `END` part is executed after the input file is completely read. You don't need to initialize variables in awk: an uninitialized variable is considered to be zero or the empty string (depending on the context). In this case `i` is a loop variable. I think the default `sort` behaviour is to consider the whole line. This solution will work for anything in the input file: awk arrays are associative arrays. – glenn jackman May 18 '11 at 14:04
  • The `awk` solution has the distinct advantage of not requiring `sort`! To get sorted output, just keep track of the max and min values seen and iterate over them, checking if each is in the array. (This will only work for integers, though, and not with floats.) – William Pursell Sep 25 '12 at 23:04
  • Works great with strings too! Just need to change `$1`, the first word, to `$0`, the whole line: `awk '{n[$0]++} END {for (i in n) print i,n[i]}'` lets you easily find and count duplicate lines in input. Awesome. – Ahmed Fasih Nov 04 '14 at 13:43
  • I just deleted a comment in which I said I encountered a bug with the awk-based solution. Actually, it was a bug in my code. Since others might do it too, I thought it might be useful to share my experience here: My problem was that, probably influenced by the shell syntax for the for loop, I had added a ";" between the "for" and the "print" in the "END" part of the `awk` command. As a result, the for loop did nothing, and the print action used the last value of `i` only. – bli Mar 01 '16 at 16:59
3

Using `maphimbu` from the Debian `stda` package:

# use 'jot' to generate 100 random numbers between 1 and 5
# and 'maphimbu' to print sorted "histogram":
jot -r 100 1 5 | maphimbu -s 1

Output:

             1                20
             2                21
             3                20
             4                21
             5                18

maphimbu also works with floating point:

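# use 'numprocess' to divide each of the 100 random values by 10,
# yielding floating-point input for 'maphimbu':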
jot -r 100.0 10 15 | numprocess /%10/ | maphimbu -s 1

Output:

             1                21
           1.1                17
           1.2                14
           1.3                18
           1.4                11
           1.5                19
agc
3

At least some of that can be done with:

sort input.txt | uniq -c

But that prints the count before the number. This fixes the ordering:

sort input.txt | uniq -c | awk '{print $2, $1}'
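
Note that plain `sort` orders lines lexically; if the input can contain multi-digit numbers, use `sort -n` here as well, as in the answer above.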
Mike Sherrill 'Cat Recall'
pavium
    If the items in column one are different lengths, this will mess up the alignment a bit so you could use a tab instead of the default space when you reorder the columns:`sort test.dat | uniq -c | awk '{print $2"\t"$1}'` – PeterVermont Dec 04 '13 at 20:13
1

perl -lne '$h{$_}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' input.txt

Loop over each line with `-n`.
Each number `$_` increments the hash `%h`.
Once the `END` of input.txt has been reached,
sort the hash keys numerically with `{$a <=> $b}`
and print each number `$n` and its frequency `$h{$n}`.

Similar code which works on floating point, binning each value by its integer part with `int`:

perl -lne '$h{int($_)}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' float.txt

float.txt

1.732
2.236
1.442
3.162
1.260
0.707

output:

0       1
1       3
2       1
3       1
Chris Koknat
1

In addition to the other answers, you can use awk to make a simple graph. (But, again, it's not a histogram.)
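
A minimal sketch of that idea (an illustration, not the exact script from this answer): feed the counts from `uniq -c` into awk and print one asterisk per occurrence:

sort -n input.txt | uniq -c | awk '{printf "%s ", $2; for (i = 0; i < $1; i++) printf "*"; print ""}'

For the sample input this draws:

0 *
1 ***
2 *
3 *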

Mike Sherrill 'Cat Recall'
0

I had a problem similar to the one described, but across gigabytes of gzipped log files. Because many of these solutions required waiting until all the data was parsed, I opted to write rare to quickly parse and aggregate data based on a regexp.

In the case above, it's as simple as passing in the data to the histogram function:

rare histo input.txt
# OR
cat input.txt | rare histo

# Outputs:
1                   3         
0                   1         
2                   1         
3                   1

But it can also handle more complex cases via regex/expressions, such as:

rare histo --match "(\d+)" --extract "{1}" input.txt
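
On the sample input the capture group `(\d+)` matches each whole line, so this produces the same histogram as the plain `rare histo` run above.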
zix99