
I've got two files formatted in this way:

File1:

word token occurrence

File2:

token occurrence

What I want is a third file with this output:

word token occurrence1/occurrence2

This is my code:

    while read token pos count
    do
        #get pos counts
        poscount=$(grep "^$pos" $2 | cut -f 2)
        #calculate probability
        prob=$(echo "scale=5;$count / $poscount" | bc -l)
        #print token, pos-tag & probability
        echo -e "$token\t$pos\t$prob"
    done < $1

The problem is that my output is something like this:

    -   :   .25000
    :   :   .75000
    '   ''  1.00000
    0   CD  .00396
    1000    CD  .00793
    13  CD  .00793
    13th    JJ  .00073
    36
    29
    16  CD  .00396
    17  CD  .00396

Some lines contain only a number, and I don't know where they come from: they do not appear as lines in either input file.

Why do these numbers appear? Is there a way to remove those lines? Thanks in advance!

agc
  • What are the real file names? Double quote the variables for safety: `"$2"`. – choroba Mar 31 '17 at 12:15
  • Why do you not post some lines of the real files? Have you tried to debug your own script using tricks such as the `-x` option or `echo -e ">$token<\t>$count<\t>$poscount<"`? In other words, how can you be so sure that it is the division which "generates random numbers" when you have not inspected the `count` and `poscount` values? – Jdamian Mar 31 '17 at 12:52
  • Probably the culprit: `grep "^$pos" $2 | cut -f 2`; if several lines begin with a particular value of `$pos`, then `grep` would find all of them. – agc Mar 31 '17 at 14:25
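
The last comment can be reproduced with a small sketch. The file name `tagcounts.tsv` and its contents are made-up sample data, not the asker's real files, and the `awk` lookup is just one possible way to get an exact match. With this data, `grep "^JJ"` also matches `JJR` and `JJS`, so the lookup returns three counts. `bc` then evaluates each extra line as its own expression and prints it back, which would explain the stray `36` and `29` lines.

```shell
# Hypothetical tag-count file (tab-separated) standing in for File2.
tmpdir=$(mktemp -d)
printf 'JJ\t1369\nJJR\t36\nJJS\t29\n' > "$tmpdir/tagcounts.tsv"

pos=JJ
count=1

# Prefix match: "^JJ" also matches JJR and JJS, so three counts come back.
loose_counts=$(grep "^$pos" "$tmpdir/tagcounts.tsv" | cut -f 2)

# Exact comparison on the first field: only the JJ line is selected.
exact_count=$(awk -F '\t' -v tag="$pos" '$1 == tag { print $2 }' "$tmpdir/tagcounts.tsv")

printf 'loose lookup returned:\n%s\n' "$loose_counts"
printf 'exact lookup returned:\n%s\n' "$exact_count"

# With the loose lookup, bc receives "1 / 1369", "36" and "29" as three
# separate expressions and prints a result for each of them.
# (Guarded because bc is not installed on every system.)
if command -v bc >/dev/null 2>&1; then
    echo "scale=5;$count / $loose_counts" | bc -l
fi

rm -rf "$tmpdir"
```

Double-quoting `"$2"` and `"$1"`, as suggested in the first comment, does not change this behaviour, but it protects against file names containing spaces.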

1 Answer

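  1. Method using paste, cut, sed, & dc (one division per line; assumes the lines of both files are already paired up in the same order):

    { echo "5 k"; paste file[12] | cut -f 3,5 | sed 's|$| / p|'; } | dc | \
    paste file1 - | cut --complement -f 3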
    
  2. Method using bash, join, paste, sed, & dc (assumes both files are sorted on the token field):

    paste <(join -1 2 -2 1 -o 1.1,1.2 file1 file2) \
      <({ echo "5 k"; join -1 2 -2 1 -o 1.3,2.2 file1 file2 | sed 's|$| / p|'; } | dc)
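
The loop from the question can also be kept, with only the lookup changed to an exact match. The following is a minimal sketch, not a tested drop-in replacement. It assumes tab-separated input and at most one line per tag in the second file; the function name `tag_probs` and the demo data are hypothetical. It uses `awk` instead of `bc` for the division, so results are rounded rather than truncated and are printed with a leading zero (`0.25000` instead of `.25000`).

```shell
# tag_probs FILE1 FILE2
#   FILE1: word<TAB>tag<TAB>count
#   FILE2: tag<TAB>total
tag_probs() {
    tab=$(printf '\t')
    while IFS="$tab" read -r token pos count; do
        # Exact comparison on field 1, so "JJ" no longer matches "JJR".
        poscount=$(awk -F '\t' -v tag="$pos" '$1 == tag { print $2; exit }' "$2")
        # Skip tags that are missing from FILE2 instead of dividing by nothing.
        [ -n "$poscount" ] || continue
        prob=$(awk -v c="$count" -v t="$poscount" 'BEGIN { printf "%.5f", c / t }')
        printf '%s\t%s\t%s\n' "$token" "$pos" "$prob"
    done < "$1"
}

# Small demo with made-up data.
demo=$(mktemp -d)
printf 'big\tJJ\t1\nbigger\tJJR\t9\n' > "$demo/file1"
printf 'JJ\t4\nJJR\t36\n' > "$demo/file2"
tag_probs "$demo/file1" "$demo/file2"
rm -rf "$demo"
```

Calling `awk` once per input line is slow on large files; it is used here only to stay close to the structure of the original loop.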
    
agc