I have two documents. Doc1 is in the below format:

TOPIC:  0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094

TOPIC:  1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398

...and so on till Topic 99 in the same pattern.

And Doc2 is in the format:

0 0.566667 0 0.0333333 0 0 0 0.133333 ..........

and so on. There are 100 values in total, one for each topic.

Now I have to find the weighted average probability for each word, that is:

P(w) = alpha_0*P_0(w) + alpha_1*P_1(w) + ... + alpha_n*P_n(w)

where alpha_k is the value at the k-th position in Doc2 and P_k(w) is the probability of word w under topic k.

For example, for the word "say" the probability should be

P(say) = 0*0.0159 + 0.5666*0.045 + ...

I have to calculate this probability for each and every word.

For the multiplication, if the word is taken from topic 0, then the 0th value from Doc2 must be used, and so on.
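
To make the arithmetic concrete, here is a minimal sketch that uses only the two topics shown above and the first two Doc2 values, so the result is a partial sum rather than the final probability. The variable names are illustrative.

```python
# Probability of "say" under topic 0 and topic 1, copied from Doc1 above.
p_say = [0.0159538357094, 0.0451237263464]
# The first two weights from Doc2 above.
alpha = [0.0, 0.566667]

# Multiply each probability by the weight at the same position and add up.
partial = sum(a * p for a, p in zip(alpha, p_say))
print(partial)
```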

So far I have only counted the occurrences of words with the code below; I have never read the values attached to them, so I am confused about how to proceed.

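from collections import defaultdict

with open(doc2, "r") as f, open(doc3, "w") as f1:
    words = " ".join(line.strip() for line in f)
    d = defaultdict(int)
    # count how often each word occurs
    for word in words.split():
        d[word] += 1
    # write the counts once, after the counting has finished
    for key, value in d.iteritems():
        f1.write(key + ' ' + str(value) + ' ')
    print '\n'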

My output should look like:

 say = "prob of this word calculated by above formula"
 site = "
 internet = " 

and so on.

What am I doing wrong?

AJF
Rudhra
1 Answer

Presuming you are ignoring TOPIC lines, use a defaultdict to group the values and then do the calculation at the end:

from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # one weight per topic, in topic order
    values = [float(x) for x in f2.read().split()]
    for line in f:
        # skip blank lines and the TOPIC headers
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

# zip pairs the i-th value of a word with the i-th weight, so this
# relies on every word being listed under every topic
for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))

Another way would be to do the calculation as you go, increasing an index each time you hit a new section (i.e. a line starting with TOPIC) and using that index to pick the matching value from values:

from collections import defaultdict

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    # create a list of all floats from doc2
    values = [float(x) for x in f2.read().split()]
    ind = -1
    for line in f:
        # each new TOPIC line moves ind on to the matching index in values
        if line.startswith("TOPIC"):
            ind += 1
            continue
        # ignore empty lines
        if line.strip():
            # split into word and float, then weight the float by the topic's value
            name, val = line.split()
            d[name] += float(val) * values[ind]

for k, v in d.items():
    print("Prob for {} is {}".format(k, v))

Using your two doc1 topics and 0 0.566667 0 0.0333333 0 inside doc2, both approaches output the following:

Prob for web is 0.085187930859
Prob for say is 0.0255701266375
Prob for online is 0.0076985327511
Prob for site is 0.0293277438137
Prob for Internet is 0.00870667394471
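
As a rough Python 3 sketch of the running-index approach, the same calculation can be wrapped in a function that works on strings instead of files, which makes it easy to check against the numbers above. The function name `weighted_probs` and the inline sample strings are assumptions made for illustration only.

```python
from collections import defaultdict


def weighted_probs(doc1_text, doc2_text):
    """Return {word: sum over topics of weight * probability of word}."""
    weights = [float(x) for x in doc2_text.split()]
    probs = defaultdict(float)
    topic = -1  # index of the topic currently being read
    for line in doc1_text.splitlines():
        if line.startswith("TOPIC"):
            topic += 1  # a new section starts, move to the next weight
        elif line.strip():
            word, val = line.split()
            probs[word] += float(val) * weights[topic]
    return dict(probs)


# Sample data copied from the question.
doc1_text = """TOPIC:  0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094

TOPIC:  1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398
"""
doc2_text = "0 0.566667 0 0.0333333 0"

result = weighted_probs(doc1_text, doc2_text)
for word, prob in sorted(result.items()):
    print("Prob for {} is {}".format(word, prob))
```

Unlike pairing values by position with zip, indexing by topic does not require every word to appear under every topic.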

You could also use itertools groupby:

from collections import defaultdict
from itertools import groupby, imap

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    values = imap(float, f2.read().split())
    # lambda x: not x.strip() splits the file into groups on the empty lines
    for blank, group in groupby(f, key=lambda x: not x.strip()):
        if not blank:
            # skip the TOPIC header line
            next(group)
            # get the matching float from values
            weight = next(values)
            # iterate over the words in the group
            for s in group:
                name, val = s.split()
                d[name] += float(val) * weight

for k, v in d.items():
    print("Prob for {} is {}".format(k, v))
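
To see how groupby separates the sections, here is a small standalone sketch (Python 3) that runs it on a hand-written list of lines instead of a file. The `lines` and `sections` names are illustrative.

```python
from itertools import groupby

lines = [
    "TOPIC:  0 5892.0",
    "site 0.0371690427699",
    "",
    "TOPIC:  1 12366.0",
    "web 0.150331554262",
]

# The key is True for blank lines and False otherwise, so each run of
# consecutive non-blank lines ends up together in one group.
sections = [
    list(group)
    for blank, group in groupby(lines, key=lambda s: not s.strip())
    if not blank
]
print(sections)
```

Each inner list starts with its TOPIC header, which is why the header has to be skipped before the word lines are read.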

For Python 3, replace every itertools.imap with the built-in map, which also returns an iterator there.
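
A short sketch of what that means in practice: in Python 3 the object returned by map can be read only once, so code that needs to index into the values or loop over them more than once should wrap the call in list(). The variable names here are illustrative.

```python
values = map(float, "0 0.566667 0".split())

first = next(values)     # consumes the first item
rest = list(values)      # consumes whatever is left
leftover = list(values)  # the iterator is exhausted by now

print(first, rest, leftover)
```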

Padraic Cunningham