Piped input for `bc` division generates random numbers

Question

I've got two files formatted in this way:

File1:

word token occurrence

File2:

token occurrence

What I want is a third file with this output:

word token occurrence1/occurrence2

This is my code:

while read token pos count
do
    #get pos counts
    poscount=$(grep "^$pos" $2 | cut -f 2)
    #calculate probability
    prob=$(echo "scale=5;$count / $poscount" | bc -l)
    #print token, pos-tag & probability
    echo -e "$token\t$pos\t$prob"
done < $1

The problem is that my output is something like this:

-   :   .25000
:   :   .75000
'   ''  1.00000
0   CD  .00396
1000    CD  .00793
13  CD  .00793
13th    JJ  .00073
36
29
16  CD  .00396
17  CD  .00396

Some lines contain only a number, and I don't know where these numbers come from; they do not appear in either input file.

Why do these numbers appear? Is there a way to remove those lines? Thanks in advance!

What are the real file names? Double quote the variables for safety: `"$2"`. — choroba, Mar 31 '17 at 12:15
Why not post some lines of the real files? Have you tried debugging your script with the `-x` option or with something like `echo -e ">$token<\t>$count<\t>$poscount<"`? In other words, how can you be so sure that it is the division which "generates random numbers" when you have not inspected the `count` and `poscount` values? — Jdamian, Mar 31 '17 at 12:52
Probably the culprit: `grep "^$pos" $2 | cut -f 2`; if several lines begin with a particular value of `$pos`, then `grep` would find all of them. — agc, Mar 31 '17 at 14:25

agc · Answer 1 · 2017-03-31T14:02:47.890

Method using paste, cut, & dc:

echo "5 k $(paste file[12] | cut -f 3,5) / p" | dc | \
paste file1 - | cut --complement -f 3

Method using bash, paste & dc:

paste <(join -1 2 file1 -2 1 file2 -o 1.1,1.2)  \
  <(echo "5 k $(join -1 2 file1 -2 1 file2 -o 1.3,2.2) / p" | dc)
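
An alternative sketch, not part of the methods above, for the case where the two files are neither sorted nor line-aligned (`paste` pairs lines by position, and `join` expects input sorted on the join field). It assumes tab-separated fields and uses made-up sample data. A single `awk` pass stores the totals from the second file and then divides each count in the first file by the total for its tag:

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the file names and contents are for illustration only.
file1=$(mktemp); file2=$(mktemp)
printf '13th\tJJ\t1\nbigger\tJJR\t9\n16\tCD\t1\n' > "$file1"
printf 'CD\t252\nJJ\t1370\nJJR\t36\n' > "$file2"

# While reading the first argument (NR == FNR), remember each tag's total.
# While reading the second, print word, tag, and count / total for known, non-zero totals.
result=$(awk -F '\t' '
    NR == FNR                          { total[$1] = $2; next }
    ($2 in total) && total[$2] != 0    { printf "%s\t%s\t%.5f\n", $1, $2, $3 / total[$2] }
' "$file2" "$file1")

printf '%s\n' "$result"
rm -f "$file1" "$file2"
```

Note that `printf "%.5f"` rounds and prints a leading zero, whereas `bc` and `dc` truncate to the requested scale, so the last digit can differ from the output in the question.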

edited Mar 31 '17 at 14:02

answered Mar 31 '17 at 13:26

