3

Part of my data file looks like this:

ifile.txt
1
1
3
0
6
3
0
3
3
5

I would like to find the probability of each number, excluding zeros, e.g. P(1) = 2/8, P(3) = 4/8, and so on.

Desired output:

ofile.txt
1  0.250
3  0.500
5  0.125
6  0.125

The first column shows the unique numbers except 0 and the second column shows the probability. I was trying the following, but it seems a very lengthy approach, and I am facing a problem in the for loop because there are so many unique numbers:

n=$(awk '$1 > 0' ifile.txt | wc -l)
for i in 1 3 5 6 .....
do
    n1=$(awk -v i="$i" '$1 == i' ifile.txt | wc -l)
    p=$(echo "$n1/$n" | bc -l)
    printf "%d %.3f\n" "$i" "$p" >> ofile.txt
done
– Kay
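A minimal fix for the loop itself, assuming the goal is just to avoid hard-coding the list of values, would be to generate the unique non-zero numbers with sort -nu. This is only a sketch of the asker's approach and still rescans the file once per value:

n=$(awk '$1 > 0' ifile.txt | wc -l)
# Enumerate the unique non-zero values instead of listing them by hand.
for i in $(awk '$1 > 0' ifile.txt | sort -nu)
do
    n1=$(awk -v i="$i" '$1 == i' ifile.txt | wc -l)
    p=$(echo "$n1/$n" | bc -l)
    printf "%d %.3f\n" "$i" "$p"
done > ofile.txt

The answers below avoid those repeated passes entirely.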

3 Answers

5

Use an associative array in awk to get the count of each unique number in one pass.

awk '$0 != "0" { count[$0]++; total++ } 
     END { for(i in count) printf("%d %.3f\n", i, count[i]/total) }' ifile.txt | sort -n > ofile.txt
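On the sample input above this should produce the desired ofile.txt; the trailing sort -n matters because for(i in count) visits the keys in an arbitrary order:

1 0.250
3 0.500
5 0.125
6 0.125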
– Barmar
3

How about a sort | uniq -c to get the distinct number counts in ~n log n instead of n^2 time, and then run that through division by your total non-zero count from wc -l?
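A sketch of what that pipeline could look like; the grep-based zero filter and the awk formatting step are my own choices, not spelled out in this answer:

# Count the non-zero lines, then divide each uniq -c count by that total.
n=$(grep -vc '^0$' ifile.txt)
grep -v '^0$' ifile.txt | sort -n | uniq -c |
    awk -v n="$n" '{printf "%d %.3f\n", $2, $1/n}' > ofile.txt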

– Phil Miller
  • Thank you @Novelocrat for your suggestion, but I wasn't able to solve it until snd's answer. – Kay Jul 17 '15 at 06:21
3

Here's a way using Novelocrat's sort|uniq -c suggestion:

sed '/^0/ d' ifile.txt|sort|uniq -c >i
awk 'FNR==NR{n+=$1;next;}{print $2,$1/n}' i i

Short explanation:

sed '/^0/ d' ifile.txt removes the lines that start with 0.

sort|uniq -c >i gives you i:

   2 1
   4 3
   1 5
   1 6

In the awk command, the file i is read twice. On the first pass (FNR==NR), n+=$1 totals column 1 of i in n, and next skips the remaining commands and moves on to the next input line. On the second pass, print $2,$1/n prints column 2 of i and the quotient of column 1 over n.
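If the three-decimal format from the question is wanted, a small variation (my addition, not part of the original answer) swaps print for printf and writes the result to ofile.txt:

awk 'FNR==NR{n+=$1;next} {printf "%d %.3f\n",$2,$1/n}' i i > ofile.txt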

– userABC123