47

Given an input file containing a single number per line, how can I get a count of how many times each item occurs in that file?

cat input.txt
1
2
1
3
1
0

desired output (=>[1,3,1,1]):

cat output.txt
0 1
1 3
2 1
3 1

It would be great if the solution could also be extended to floating-point numbers.

tommy.carstensen
Javier
    This kind of output is simple and useful, but it's not a histogram. See, for example, http://quarknet.fnal.gov/toolkits/ati/histograms.html – Mike Sherrill 'Cat Recall' May 20 '11 at 00:07
  • I agree you are not asking for a histogram. That can however also be accomplished with `bash`, which is what I came looking for. See this question and its answers: https://unix.stackexchange.com/questions/177777/drawing-a-histogram-from-a-bash-command-output – tommy.carstensen May 21 '19 at 14:20

7 Answers

87

You mean you want a count of how many times each item appears in the input file? First sort it (using `-n` if the input is always numbers, as in your example), then count the unique results:

sort -n input.txt | uniq -c
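
For the sample input this prints the count before each value (the exact leading whitespace from `uniq -c` may vary):

      1 0
      3 1
      1 2
      1 3

To get the value first, swap the columns with awk as shown in the comments below.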
Caleb
    I didn't know about the `uniq` command. I changed it to `cat input.txt | sort -n | uniq -c | awk '{print $2 " " $1}'`, now I'm obtaining the desired output. – Javier May 18 '11 at 12:29
    Your use of awk to get the ordering is fine, but you don't need to use cat there. You should learn about the `<` operator to input files into programs and even things like loop constructions. For humor value, see [the useless use of cat awards](http://partmaps.org/era/unix/award.html#cat) – Caleb May 18 '11 at 12:34
12

Another option:

awk '{n[$1]++} END {for (i in n) print i,n[i]}' input.txt | sort -n > output.txt
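
With the sample input, output.txt then contains exactly the desired result:

0 1
1 3
2 1
3 1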
glenn jackman
  • thanks for illustrating an `awk`-based solution. From what I understood, in the first part you store the `histogram` into the `n` array, considering the elements in column `$1`. The `END` part means that it's going to be done `after` the histogram is built, right? Is it not necessary to initialize the variable `i` for loops in `awk`? Then, the `sort -n` is going to be applied only to the first column of the output, `i, n[i]`, right? i.e. not to `n[i]`? Furthermore, would this solution only work for `integer` numbers (because of the indexing of the array)? – Javier May 18 '11 at 14:12
  • @Javier, the `n` array simply keeps a count of the strings it sees in the input file. It can be int, float or any arbitrary string. Yes, the `END` part is executed after the input file is completely read. You don't need to initialize variables in awk: an uninitialized variable is considered to be zero or the empty string (depending on the context). In this case `i` is a loop variable. I think the default `sort` behaviour is to consider the whole line. This solution will work for anything in the input file: awk arrays are associative arrays. – glenn jackman May 18 '11 at 14:04
  • The `awk` solution has the distinct advantage of not requiring `sort`! To get sorted output, just keep track of the max and min values seen and iterate over them, checking if each is in the array. (This will only work for integers, though, and not with floats.) – William Pursell Sep 25 '12 at 23:04
  • Works great with strings too! Just need to change `$1`, the first word, to `$0`, the whole line: `awk '{n[$0]++} END {for (i in n) print i,n[i]}'` lets you easily find and count duplicate lines in input. Awesome. – Ahmed Fasih Nov 04 '14 at 13:43
  • I just deleted a comment in which I said I encountered a bug with the awk-based solution. Actually, it was a bug in my code. Since others might do it too, I thought it might be useful to share my experience here: My problem was that, probably influenced by the shell syntax for the for loop, I had added a ";" between the "for" and the "print" in the "END" part of the `awk` command. As a result, the for loop did nothing, and the print action used the last value of `i` only. – bli Mar 01 '16 at 16:59
3

Using `maphimbu` from the Debian `stda` package:

# use 'jot' to generate 100 random numbers between 1 and 5
# and 'maphimbu' to print sorted "histogram":
jot -r 100 1 5 | maphimbu -s 1

Output:

             1                20
             2                21
             3                20
             4                21
             5                18

maphimbu also works with floating point:

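# use 'numprocess' to divide each of the 100 random values by 10,
# yielding floating-point input for 'maphimbu':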
jot -r 100.0 10 15 | numprocess /%10/ | maphimbu -s 1

Output:

             1                21
           1.1                17
           1.2                14
           1.3                18
           1.4                11
           1.5                19
agc
3

At least some of that can be done with:

sort input.txt | uniq -c

But that prints the count before the number. This fixes the ordering:

sort input.txt | uniq -c | awk '{print $2, $1}'
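
Note that plain `sort` orders lines lexically; if the input can contain multi-digit numbers, use `sort -n` here as well, as in the answer above.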
Mike Sherrill 'Cat Recall'
pavium
    If the items in column one are different lengths, this will mess up the alignment a bit so you could use a tab instead of the default space when you reorder the columns:`sort test.dat | uniq -c | awk '{print $2"\t"$1}'` – PeterVermont Dec 04 '13 at 20:13
1

perl -lne '$h{$_}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' input.txt

Loop over each line with `-n`.
Each number `$_` increments the hash `%h`.
Once the `END` of input.txt has been reached,
sort the hash keys numerically with `{$a <=> $b}`
and print each number `$n` and its frequency `$h{$n}`.

Similar code which works on floating point, binning each value by its integer part with `int`:

perl -lne '$h{int($_)}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' float.txt

float.txt

1.732
2.236
1.442
3.162
1.260
0.707

output:

0       1
1       3
2       1
3       1
Chris Koknat
1

In addition to the other answers, you can use awk to make a simple graph. (But, again, it's not a histogram.)
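
A minimal sketch of that idea (an illustration, not the exact script from this answer): feed the counts from `uniq -c` into awk and print one asterisk per occurrence:

sort -n input.txt | uniq -c | awk '{printf "%s ", $2; for (i = 0; i < $1; i++) printf "*"; print ""}'

For the sample input this draws:

0 *
1 ***
2 *
3 *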

Mike Sherrill 'Cat Recall'
0

I had a problem similar to the one described, but across gigabytes of gzipped log files. Because many of these solutions required waiting until all the data was parsed, I opted to write rare to quickly parse and aggregate data based on a regexp.

In the case above, it's as simple as passing in the data to the histogram function:

rare histo input.txt
# OR
cat input.txt | rare histo

# Outputs:
1                   3         
0                   1         
2                   1         
3                   1

But it can also handle more complex cases via regex/expressions, such as:

rare histo --match "(\d+)" --extract "{1}" input.txt
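
On the sample input the capture group `(\d+)` matches each whole line, so this produces the same histogram as the plain `rare histo` run above.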
zix99