How to determine probability of words?

Question

I have two documents. Doc1 is in the below format:

TOPIC:  0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094

TOPIC:  1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398

...and so on till Topic 99 in the same pattern.

And Doc2 is in the format:

0 0.566667 0 0.0333333 0 0 0 0.133333 ..........

and so on. There are 100 values in total, one for each topic.

Now, I have to find the weighted average probability for each word, that is:

P(w) = alpha_0*P(w | topic 0) + alpha_1*P(w | topic 1) + ... + alpha_99*P(w | topic 99)

where alpha_n is the value in the nth position of Doc2, corresponding to the nth topic.

That is, for the word "say", the probability should be

P(say) = 0*0.0159 + 0.5666*0.045 + ...

Likewise for each and every word, I have to calculate the probability.

For the multiplication, if the word is taken from topic 0, then the 0th value from Doc2 must be used, and so on.

So far I have only counted the occurrences of words with the code below; I have never used their values, so I am confused.

 with open(doc2, "r") as f:
    with open(doc3, "w") as f1:

         words = " ".join(line.strip() for line in f)
         d = defaultdict(int)
         for word in words.split():  
              d[word] += 1
              for key, value in d.iteritems() :
                  f1.write(key+ ' ' + str(value) + ' ')
              print '\n'

My output should look like:

 say = "prob of this word calculated by above formula"
 site = "
 internet = "

and so on.

What am I doing wrong?

Padraic Cunningham · Accepted Answer · 2015-07-19T23:06:08.883

2

Presuming you are ignoring TOPIC lines, use a defaultdict to group the values and then do the calculation at the end:

from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # one weight per topic, in topic order
    values = list(map(float, f2.read().split()))
    for line in f:
        # skip blank lines and TOPIC headers
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

# pair each word's values with the weights by position and sum the products
for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))
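One caveat with the grouping approach above: zip pairs each word's values with the weights purely by position, so the result lines up only if every word appears in every topic, in topic order. A minimal sketch, assuming a word may be missing from some topics, is to store the topic index next to each probability. The function name weighted_probs and the in-memory lines/weights arguments are illustrative and not part of the original code:

```python
from collections import defaultdict


def weighted_probs(lines, weights):
    """Sum weights[topic] * P(word | topic) over the topics that list the word."""
    pairs = defaultdict(list)  # word -> [(topic index, probability), ...]
    topic = -1
    for line in lines:
        if line.startswith("TOPIC"):
            topic += 1  # each TOPIC header starts the next topic
        elif line.strip():
            name, val = line.split()
            pairs[name].append((topic, float(val)))
    return {
        name: sum(weights[t] * p for t, p in items)
        for name, items in pairs.items()
    }


# "web" is absent from topic 0, so its only value must use weights[1].
sample = [
    "TOPIC:  0 5892.0",
    "site 0.5",
    "",
    "TOPIC:  1 12366.0",
    "web 0.25",
    "site 0.1",
]
result = weighted_probs(sample, [0.2, 0.8])
print(result)
```

With purely positional pairing, the single value for "web" would have been multiplied by the first weight instead of the second.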

Another way would be to do the calculation as you go, incrementing an index each time you hit a new section (i.e. a line starting with TOPIC) so that you can look up the matching weight in values:

from collections import defaultdict

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    # create a list of all floats from doc2
    values = list(map(float, f2.read().split()))
    ind = -1
    for line in f:
        # on a new TOPIC line, advance ind to the matching index in values
        if line.startswith("TOPIC"):
            ind += 1
            continue
        # ignore empty lines
        if line.strip():
            # split into word and probability, then weight it by this topic's value
            name, val = line.split()
            d[name] += float(val) * values[ind]

for k,v in d.items():
    print("Prob for {} is {}".format(k ,v) )

Using your two doc1 topics and 0 0.566667 0 0.0333333 0 inside doc2, both versions output the following:

Prob for web is 0.085187930859
Prob for say is 0.0255701266375
Prob for online is 0.0076985327511
Prob for site is 0.0293277438137
Prob for Internet is 0.00870667394471
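As a sanity check, the figure for "say" can be reproduced by hand from the two topics shown in the question and the first two weights (0 and 0.566667). The variable names below are only for illustration:

```python
# P(say | topic 0) and P(say | topic 1), copied from doc1 in the question
p_say = [0.0159538357094, 0.0451237263464]
# the first two weights from doc2
weights = [0.0, 0.566667]

# weighted sum: 0 * 0.0159538357094 + 0.566667 * 0.0451237263464
prob_say = sum(w * p for w, p in zip(weights, p_say))
print("Prob for say is {}".format(prob_say))
```

Topic 0 contributes nothing because its weight is 0, so the whole value comes from topic 1.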

You could also use itertools.groupby:

from collections import defaultdict
from itertools import groupby, imap

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    values = imap(float, f2.read().split())
    # keying on "is this line blank?" splits the file into groups at the empty lines
    for blank, group in groupby(f, key=lambda x: not x.strip()):
        if not blank:
            # the first line of each group is the TOPIC header, so skip it
            next(group)
            # get the matching float from values
            weight = next(values)
            # iterate over the rest of the group
            for s in group:
                name, val = s.split()
                d[name] += float(val) * weight

for k, v in d.items():
    print("Prob for {} is {}".format(k, v))

For Python 3, replace every itertools.imap with the built-in map, which returns an iterator there.
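A rough Python 3 sketch of the groupby version follows. To keep it self-contained it reads from in-memory StringIO objects holding a shortened copy of the question's data; that substitution is an assumption for illustration, and real code would use open("doc1") and open("doc2") instead:

```python
from collections import defaultdict
from io import StringIO
from itertools import groupby

# In-memory stand-ins for doc1 and doc2 (assumption: real code would use open()).
doc1 = StringIO(
    "TOPIC:  0 5892.0\n"
    "site 0.0371690427699\n"
    "say 0.0159538357094\n"
    "\n"
    "TOPIC:  1 12366.0\n"
    "site 0.0517548115801\n"
    "say 0.0451237263464\n"
)
doc2 = StringIO("0 0.566667 0 0.0333333 0")

d = defaultdict(float)
# map() is already lazy in Python 3, so it takes the place of itertools.imap
values = map(float, doc2.read().split())
for is_blank, group in groupby(doc1, key=lambda line: not line.strip()):
    if is_blank:
        continue
    next(group)  # skip the TOPIC header line
    weight = next(values)  # weight for this topic
    for line in group:
        name, val = line.split()
        d[name] += float(val) * weight

for name, prob in d.items():
    print("Prob for {} is {}".format(name, prob))
```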

edited Jul 19 '15 at 23:06

answered Jul 19 '15 at 22:11


so `say` 0.0159538357094 is multiplied by 5892.0? I only see you multiplying by the corresponding element from your doc2 file in your question – Padraic Cunningham Jul 19 '15 at 22:23
No worries, you're welcome. You had me a little confused :) – Padraic Cunningham Jul 19 '15 at 22:33
No prob, at least you made a good effort to solve it yourself which is more than a lot of people do on here ;) – Padraic Cunningham Jul 19 '15 at 22:37
I will have a look in the a.m but basically we just need to replace .read().split() with splitting each line then iterating over each, opening a file for each and writing the output. – Padraic Cunningham Jul 19 '15 at 23:35
Are you using the code exactly as posted and using python2? – Padraic Cunningham Jul 23 '15 at 21:39
@Rudhra, I have tried all the versions and they all work, add the traceback from the errors to pastebin after running the code exactly as above – Padraic Cunningham Jul 24 '15 at 00:55
I am heading out for the afternoon and evening but it should not be much of an issue to change the code, just add a sample input to pastebin and the expected output and I will throw it together when I get back – Padraic Cunningham Jul 24 '15 at 12:41
Thank you so much Padraic. You are the best :) I don't know what a pastebin is :P I will read on it and try my best. Sorry to bother you again and again. – Rudhra Jul 24 '15 at 12:52
http://pastebin.com/, your problem is you seem to want to use a new dict for each section/TOPIC not actually get a total count – Padraic Cunningham Jul 24 '15 at 12:56
Add the full expected output to pastebin – Padraic Cunningham Nov 27 '15 at 18:38
Just add a small sample and what you expect as output for all – Padraic Cunningham Nov 27 '15 at 18:47
Why are you expecting 1? Are all the topics independent or what exactly? – Padraic Cunningham Nov 27 '15 at 19:36
Running your code I get nothing close to what is mentioned in the question, are you saying each topic should sum to 1? – Padraic Cunningham Nov 27 '15 at 19:46
Upload it to Google Drive and share the folder. Do the number of topics and the number of floats in the second file match up? – Padraic Cunningham Nov 27 '15 at 19:49
Also how could topic_1 sum to 1 when all the values are 0? – Padraic Cunningham Nov 27 '15 at 19:52
what does the last number after each topic name signify? – Padraic Cunningham Nov 27 '15 at 20:35
Ok, one other thing should the sum of all the topic values equal 1 exactly? – Padraic Cunningham Nov 27 '15 at 20:50
Are you saying your code gives you .99999999999999 or > using `"assigned0_lda25_100.txt"` `"Buy_local_food_call_to_schools_normalized.txt"`? – Padraic Cunningham Nov 27 '15 at 21:13
You realise you only have 15 topics and 400 float values pairing `"assigned0_lda25_100.txt"` `"Buy_local_food_call_to_schools_normalized.txt"`? – Padraic Cunningham Nov 27 '15 at 21:33
Ok that returns `0.999999999999949` – Padraic Cunningham Nov 27 '15 at 21:42
Is there some logic that can verify that the data in the files is actually correct,should each topic and the txt file with the decimals all sum to ~1? – Padraic Cunningham Nov 27 '15 at 21:48
OK because I cleaned your data down to only numbers >0.0 which makes it significantly smaller and I cannot see anything obvious – Padraic Cunningham Nov 27 '15 at 22:24
No worries, I am going to sign off for the night but I will have a look tomorrow again so if you discover anything in the meantime let me know. – Padraic Cunningham Nov 27 '15 at 22:31
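The comments above ask whether the values should sum to exactly 1 and report a total of 0.999999999999949. Floating-point sums rarely land on exactly 1.0, so a check like that is usually written with a tolerance. A minimal sketch, where the helper name sums_to_one and the 1e-9 tolerance are assumptions rather than anything taken from the thread:

```python
import math


def sums_to_one(values, tol=1e-9):
    """Return True if the values add up to 1 within an absolute tolerance."""
    # math.fsum reduces the rounding error of naive left-to-right summation
    return math.isclose(math.fsum(values), 1.0, abs_tol=tol)


print(sums_to_one([0.1] * 10))
print(sums_to_one([0.999999999999949]))
print(sums_to_one([0.5, 0.4]))
```

The first two calls pass the check; the last one fails because 0.9 is nowhere near 1.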
