I have two documents. Doc1 is in the following format:
TOPIC: 0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
TOPIC: 1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398
...and so on up to Topic 99, in the same pattern.
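One way to read this format, as a rough sketch only: treat every line starting with "TOPIC:" as the start of a new topic, and every following two-token line as a word and its value. The function name parse_topics is hypothetical, and the sketch assumes the fields are whitespace-separated and ignores the number after the topic id, since its meaning is not stated here.

```python
def parse_topics(lines):
    """Return {topic_id: {word: value}} from Doc1-style lines (assumed layout)."""
    topics = {}
    current = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "TOPIC:":
            # Header line: "TOPIC: <id> <number>"; only the id is used here
            current = int(parts[1])
            topics[current] = {}
        elif current is not None and len(parts) == 2:
            # Body line: "<word> <value>"
            topics[current][parts[0]] = float(parts[1])
    return topics

# The two topics shown above, used as sample input
sample = """TOPIC: 0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
TOPIC: 1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398
"""
topics = parse_topics(sample.splitlines())
print(topics[1]["say"])
```

For a real file, the same function could be called on the open file object, since iterating over a file yields its lines.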
And Doc2 is in the format:
0 0.566667 0 0.0333333 0 0 0 0.133333 ..........
and so on. There are 100 values in total, one for each topic.
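Reading Doc2 could be as simple as splitting on whitespace and converting each token to a float, so that the value for topic n ends up at index n. This is a sketch under the assumption that the file holds a single row of whitespace-separated numbers; parse_weights is a hypothetical name, and the input below is only the truncated prefix shown above, not the full 100 values.

```python
def parse_weights(text):
    """Return the per-topic values as a list; index n belongs to topic n."""
    return [float(token) for token in text.split()]

# Only the first eight values are visible in the example above
weights = parse_weights("0 0.566667 0 0.0333333 0 0 0 0.133333")
print(len(weights), weights[1])
```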
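Now I have to find the weighted average probability of each word, that is:
P(w) = alpha_0*P(w | topic 0) + alpha_1*P(w | topic 1) + ... + alpha_99*P(w | topic 99)
where alpha_n is the value at the nth position in Doc2, corresponding to the nth topic, and P(w | topic n) is the value listed for the word w under topic n in Doc1.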
For example, for the word "say", the probability should be
P(say) = 0*0.0159 + 0.5666*0.045 + ...
I have to calculate the probability this way for every word.
For each multiplication, if the word's value is taken from topic 0, then the 0th value from Doc2 must be used, and so on.
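The calculation described above can be sketched as a loop over topics that adds weight * value into a running total per word. The names weighted_word_probs, topics and weights are hypothetical, the data is just the two topics and the first two Doc2 values shown earlier, and a word that is not listed under a topic is assumed to contribute nothing for that topic.

```python
from collections import defaultdict

def weighted_word_probs(topics, weights):
    """Sum weights[n] * value over every topic n in which the word is listed."""
    probs = defaultdict(float)
    for topic_id, word_values in topics.items():
        alpha = weights[topic_id]  # nth Doc2 value goes with topic n
        for word, value in word_values.items():
            probs[word] += alpha * value
    return dict(probs)

# Topics 0 and 1 from Doc1, and the first two values from Doc2
topics = {
    0: {"site": 0.0371690427699, "Internet": 0.0261371350984,
        "online": 0.0229124236253, "web": 0.0218940936864,
        "say": 0.0159538357094},
    1: {"web": 0.150331554262, "site": 0.0517548115801,
        "say": 0.0451237263464, "Internet": 0.0153647096879,
        "online": 0.0135856380398},
}
weights = [0.0, 0.566667]

probs = weighted_word_probs(topics, weights)
print(probs["say"])  # 0*0.0159... + 0.566667*0.0451...
```

Using a defaultdict(float) here mirrors the counting code's defaultdict(int): the only change is that each word adds a weighted value instead of 1.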
So far I have only counted the occurrences of words with the code below, and I have never used the values attached to them, so I am confused.
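from collections import defaultdict

# doc2 and doc3 are path strings defined elsewhere
with open(doc2, "r") as f:
    with open(doc3, "w") as f1:
        # Join all lines into one whitespace-separated string
        words = " ".join(line.strip() for line in f)
        # Count how many times each token occurs
        d = defaultdict(int)
        for word in words.split():
            d[word] += 1
        for key, value in d.items():
            f1.write(key + ' ' + str(value) + ' ')
            print('\n')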
My output should look like:
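say = "probability of this word, calculated by the formula above"
site = "probability of this word, calculated by the formula above"
Internet = "probability of this word, calculated by the formula above"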
and so on.
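Producing that layout from a dictionary of results might look like the sketch below. write_probs is a hypothetical name, the numbers are placeholders rather than real results, and io.StringIO stands in for the output file so the example runs on its own under Python 3.

```python
import io

def write_probs(probs, out):
    """Write one 'word = value' line per word, in alphabetical order."""
    for word in sorted(probs):
        out.write("{} = {}\n".format(word, probs[word]))

# Placeholder values, only to show the output layout
example = {"site": 0.25, "say": 0.5}
buffer = io.StringIO()
write_probs(example, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```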
What am I doing wrong?