I've got two files formatted in this way:
File1:
word token occurrence
File2:
token occurrence
What I want is a third file with this output:
word token occurrence1/occurrence2
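This is my code ($1 is File1 and $2 is File2; the variable token holds the word and pos holds the token, i.e. the POS tag):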
while read token pos count
do
    # get pos counts
    poscount=$(grep "^$pos" $2 | cut -f 2)
    # calculate probability
    prob=$(echo "scale=5;$count / $poscount" | bc -l)
    # print token, pos-tag & probability
    echo -e "$token\t$pos\t$prob"
done < $1
The problem is that my output is something like this:
- : .25000
: : .75000
' '' 1.00000
0 CD .00396
1000 CD .00793
13 CD .00793
13th JJ .00073
36
29
16 CD .00396
17 CD .00396
Some lines contain only a number (36 and 29 in the sample above). I don't know where they come from; they are not lines in either input file.
Why do these numbers appear, and is there a way to get rid of those lines? Thanks in advance!
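One possible explanation, sketched under assumptions: grep "^$pos" is a prefix match, so if File2 contains tags that share a prefix (for example JJ, JJR and JJS), the lookup for JJ returns several counts. bc then divides by the first one and prints the remaining ones on lines of their own, which would look like the stray 36 and 29. The sketch below does an exact lookup with awk instead. The function name tag_probs is hypothetical, both files are assumed to be tab-separated, and the result has a leading zero and is rounded rather than truncated, so it may differ from the bc output in the last digit.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): tag_probs FILE1 FILE2
# Assumed layout, tab-separated:
#   FILE1: word<TAB>tag<TAB>count
#   FILE2: tag<TAB>count
tag_probs() {
    awk -F '\t' '
        # First file on the command line (FILE2): remember each tag total.
        NR == FNR { total[$1] = $2; next }
        # Second file (FILE1): exact tag lookup, so JJ cannot also match JJR.
        # Lines whose tag is missing or has a zero total are skipped.
        ($2 in total) && total[$2] > 0 {
            printf "%s\t%s\t%.5f\n", $1, $2, $3 / total[$2]
        }
    ' "$2" "$1"
}
```

It could be called as tag_probs File1 File2 > File3. If the loop has to stay, matching the whole first field would avoid the prefix problem in the same way.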