3

Part of my data file looks like this:

ifile.txt
1
1
3
0
6
3
0
3
3
5

I would like to find the probability of each number, excluding zeros, e.g. P(1) = 2/8, P(3) = 4/8, and so on.

Desired output:

ofile.txt
1  0.250
3  0.500
5  0.125
6  0.125

The first column shows the unique numbers except 0 and the second column shows the probability. I was trying the following, but it seems a very lengthy approach, and I am facing a problem in the for loop because there are so many unique numbers:

n=$(awk '$1 > 0' ifile.txt | wc -l)
for i in 1 3 5 6 .....
do
    n1=$(awk -v i="$i" '$1 == i' ifile.txt | wc -l)
    p=$(echo "$n1/$n" | bc -l)
    printf "%d %.3f\n" "$i" "$p" >> ofile.txt
done
– Kay
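A minimal fix for the loop itself, assuming the goal is just to avoid hard-coding the list of values, would be to generate the unique non-zero numbers with sort -nu. This is only a sketch of the asker's approach and still rescans the file once per value:

n=$(awk '$1 > 0' ifile.txt | wc -l)
# Enumerate the unique non-zero values instead of listing them by hand.
for i in $(awk '$1 > 0' ifile.txt | sort -nu)
do
    n1=$(awk -v i="$i" '$1 == i' ifile.txt | wc -l)
    p=$(echo "$n1/$n" | bc -l)
    printf "%d %.3f\n" "$i" "$p"
done > ofile.txt

The answers below avoid those repeated passes entirely.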

3 Answers

5

Use an associative array in awk to get the count of each unique number in one pass.

awk '$0 != "0" { count[$0]++; total++ } 
     END { for(i in count) printf("%d %.3f\n", i, count[i]/total) }' ifile.txt | sort -n > ofile.txt
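On the sample input above this should produce the desired ofile.txt; the trailing sort -n matters because for(i in count) visits the keys in an arbitrary order:

1 0.250
3 0.500
5 0.125
6 0.125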
– Barmar
3

How about a sort | uniq -c to get the distinct number counts in ~n log n instead of n^2 time, and then run that through division by your total non-zero count from wc -l?
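A sketch of what that pipeline could look like; the grep-based zero filter and the awk formatting step are my own choices, not spelled out in this answer:

# Count the non-zero lines, then divide each uniq -c count by that total.
n=$(grep -vc '^0$' ifile.txt)
grep -v '^0$' ifile.txt | sort -n | uniq -c |
    awk -v n="$n" '{printf "%d %.3f\n", $2, $1/n}' > ofile.txt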

– Phil Miller
  • Thank you @Novelocrat for your suggestion, but I wasn't able to solve it until snd's answer. – Kay Jul 17 '15 at 06:21
3

Here's a way using Novelocrat's sort|uniq -c suggestion:

sed '/^0/ d' ifile.txt|sort|uniq -c >i
awk 'FNR==NR{n+=$1;next;}{print $2,$1/n}' i i

Short explanation:

sed '/^0/ d' ifile.txt removes the lines that start with 0.

sort|uniq -c >i gives you i:

   2 1
   4 3
   1 5
   1 6

In the awk command, the file i is read twice. On the first pass (FNR==NR), n+=$1 totals column 1 of i in n, and next skips the remaining commands and moves on to the next input line. On the second pass, print $2,$1/n prints column 2 of i and the quotient of column 1 over n.
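If the three-decimal format from the question is wanted, a small variation (my addition, not part of the original answer) swaps print for printf and writes the result to ofile.txt:

awk 'FNR==NR{n+=$1;next} {printf "%d %.3f\n",$2,$1/n}' i i > ofile.txt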

– userABC123