
This is my first ever post on Stack Overflow and I am a total beginner at coding, so please bear with me.

I am working on an experiment which has two sets of data documents. Doc1 is as follows:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464

TOPIC:topic_2 ....
.....
.....

TOPIC:topic_3 1066.0
say 0.062
word 0.182

and so on till 100 topics.

In this document, some words are present in all the topics and others in only a few. If a word is not present in a topic, I would like the word's value in that topic to be 0. For example, the word BBC is present in topic 2 but not in topic 1, so I would like my list to be:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
Mr 0
s 0
president 0
tell 0
BBC 0

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
site 0
Internet 0
online 0
web 0
image 0

I have to multiply these values with another set of values present in another document. For that, I use the following code:

from collections import defaultdict
from itertools import groupby, imap

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # one weight per topic, read from doc2
    values = list(map(float, f2.read().split()))
    for line in f:
        # skip blank lines and TOPIC headers, keep "word value" lines
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i*j for i, j in zip(v, values))))

My doc2 is of the format:

  0.566667 0.0333333 0.133333 0 0 0  2.43333 0 0.13333......... till 100 values. 

The above code considers the word "say". It finds that the word is in 3 topics and gathers their values in a list like [0.015, 0.045, 0.062]. This list is then multiplied element-wise with the values in doc2: 0.015 by the 0th value in doc2, 0.045 by the 1st value, and 0.062 by the 2nd value. But this is not what I want. There is no word "say" in topic_2, so the list must be [0.015, 0.045, 0, 0.062]. When these values are multiplied with the values at the same positions in doc2, they give

P(SAY) = (0.566667*0.015) + (0.0333333*0.045) + (0.133333 *0) + (0*0.062)

So the code is otherwise fine; only this modification is required.
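The arithmetic above can be checked with a few lines of Python 3. This is only a sketch using the rounded numbers quoted in the text; the names `topic_weights`, `say_padded` and `say_unpadded` are made up for illustration.

```python
# Rounded values taken from the example in the question.
topic_weights = [0.566667, 0.0333333, 0.133333, 0.0]  # first four values of doc2

say_padded = [0.015, 0.045, 0.0, 0.062]   # 0 inserted for topic_2, where "say" is absent
say_unpadded = [0.015, 0.045, 0.062]      # what the current code collects

# zip pairs values by position, so a missing 0 shifts every later value
# onto the wrong topic weight.
p_say = sum(w * v for w, v in zip(topic_weights, say_padded))
p_unpadded = sum(w * v for w, v in zip(topic_weights, say_unpadded))

print("padded:  ", p_say)
print("unpadded:", p_unpadded)
```

The two sums differ because, without the 0, the value for topic_3 is multiplied by the weight of topic_2.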

Ana_Sam
  • 1
    If you're new here, you may want to read http://stackoverflow.com/help/asking, especially http://stackoverflow.com/help/mcve – boardrider Jul 24 '15 at 12:42
  • I'll keep that in mind for my next questions. Thanks for the guidance! – Ana_Sam Jul 24 '15 at 12:45
  • Am I understanding correctly that, for example, you have the word 'site' in 40 topics, each with a number, and you want to calculate something, but you only use the 40 numbers instead of the 40 numbers and 60 zeroes? – KameeCoding Jul 24 '15 at 12:46
  • As in I just want the words to have value 0 if they are not present in the topics. My list has 40 values. But it should have zeros in places where the words are not present. The word "site" is not present in the first successive 40 topics. They are present in random 40 topics. – Ana_Sam Jul 24 '15 at 12:49
  • 2
    @Ana_Sam does the position of the values matter? – KameeCoding Jul 24 '15 at 12:56
  • Your question could be separated on smaller parts. I'm sure that you could find answers on these smaller question on stackoverflow. Just use your browser properly. – Karol Król Aug 10 '15 at 20:19

3 Answers


The issue is that you are treating the TOPICs as one block. If you want individual sections, use the groupby code from the original answer: first get a set of all names, then compare that set against the defaultdict keys to find the words missing from each section:

from collections import defaultdict
d = defaultdict(float)
from itertools import groupby, imap

with open("doc1") as f,open("doc2") as f2:
    values = imap(float, f2.read().split())
    # find every word in every TOPIC
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0) # reset pointer to the start of the file
    # lambda x: not(x.strip()) will split into groups on the empty lines
    for ind, (k, v) in enumerate(groupby(f, key=lambda x: not(x.strip()))):
        if not k:
            topic = next(v)
            #  get matching float from values
            f = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += (float(val) * f)
            # get difference in all_words vs words in current TOPIC
            # giving 0 as default for missing values
            for word in all_words - d.viewkeys():
                d[word] = 0
            for k,v in d.iteritems():
                print("Prob for {} is {}".format(k,v))
            d = defaultdict(float)
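To see what the groupby key is doing, here is a small Python 3 sketch on an in-memory list of lines (the list `lines` and the name `blocks` are made up for the demonstration; the real code iterates over the open file instead). Runs of blank lines and runs of non-blank lines form alternating groups, and only the non-blank groups are kept.

```python
from itertools import groupby

# A tiny stand-in for doc1: two topics separated by a blank line.
lines = [
    "TOPIC:topic_0 5892.0\n",
    "site 0.0371690427699\n",
    "\n",
    "TOPIC:topic_1 12366.0\n",
    "Mr 0.150331554262\n",
]

blocks = []
# The key is True for blank lines and False otherwise, so consecutive
# non-blank lines end up in one group per topic.
for is_blank, group in groupby(lines, key=lambda x: not x.strip()):
    if not is_blank:
        blocks.append([line.strip() for line in group])

for block in blocks:
    print(block)
```

The first element of each block is the TOPIC header, which is why the answer's code calls `next(v)` before iterating over the remaining word lines.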

To store all the output you can add the dicts to a list:

from collections import defaultdict
d = defaultdict(float)
from itertools import groupby, imap
with open("doc1") as f,open("doc2") as f2:
    values = imap(float, f2.read().split())
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = []
    # lambda x: not(x.strip()) will split into groups on the empty lines
    for ind, (k, v) in enumerate(groupby(f, key=lambda x: not(x.strip()))):
        if not k:
            topic = next(v)
            #  get matching float from values
            f = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += (float(val) * f)
            for word in all_words - d.viewkeys():
                d[word] = 0
            out.append(d)
            d = defaultdict(float)

Then iterate over the list:

for top in out:
    for k,v in top.iteritems():
        print("Prob for {} is {}".format(k,v))

Or forget the defaultdict and use dict.fromkeys:

from itertools import groupby, imap

with open("doc1") as f,open("doc2") as f2:
    values = imap(float, f2.read().split())
    all_words = [line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")]
    f.seek(0)
    out, d = [], dict.fromkeys(all_words ,0.0)
    # lambda x: not(x.strip()) will split into groups on the empty lines
    for ind, (k, v) in enumerate(groupby(f, key=lambda x: not(x.strip()))):
        if not k:
            topic = next(v)
            #  get matching float from values
            f = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += (float(val) * f)
            out.append(d)
            d = dict.fromkeys(all_words ,0)

If you always want the missing words at the end, use a collections.OrderedDict with the first approach, adding the missing values at the end of the dict:

from collections import OrderedDict

from itertools import groupby, imap
with open("doc1") as f,open("doc2") as f2:
    values = imap(float, f2.read().split())
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    # d must exist before the first topic is processed
    out, d = [], OrderedDict()
    # lambda x: not(x.strip()) will split into groups on the empty lines
    for  (k, v) in groupby(f, key=lambda x: not(x.strip())):
        if not k:
            topic = next(v)
            #  get matching float from values
            f = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d.setdefault(name, (float(val) * f))
            for word in all_words.difference(d):
                    d[word] = 0
            out.append(d)
            d = OrderedDict()

for top in out:
    for k,v in top.iteritems():
         print("Prob for {} is {}".format(k,v))

Finally to store in order and by topic:

from collections import OrderedDict

from itertools import groupby, imap

with open("doc1") as f,open("doc2") as f2:
    values = imap(float, f2.read().split())
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = OrderedDict()
    # lambda x: not(x.strip()) will split into groups on the empty lines
    for (k, v) in groupby(f, key=lambda x: not(x.strip())):
        if not k:
            topic = next(v).rstrip()
            # create OrderedDict for each topic
            out[topic] = OrderedDict()
            #  get matching float from values
            f = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                out[topic].setdefault(name, (float(val) * f))
            # find words missing from TOPIC and  set to 0
            for word in  all_words.difference(out[topic]):
                    out[topic][word] = 0

for topic, words in out.items():
    print(topic) # each TOPIC
    for k, v in words.iteritems():
        print("Prob for {} is {}".format(k,v)) # the OrderedDict items
    print("\n")

doc1:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398

doc2:

0.345 0.566667

Output:

TOPIC:topic_0 5892.0
Prob for site is 0.0128233197556
Prob for Internet is 0.00901731160895
Prob for online is 0.00790478615073
Prob for web is 0.00755346232181
Prob for say is 0.00550407331974
Prob for image is 0.00521130346231
Prob for BBC is 0
Prob for Mr is 0
Prob for s is 0
Prob for president is 0
Prob for tell is 0


TOPIC:topic_1 12366.0
Prob for Mr is 0.085187930859
Prob for s is 0.0293277438137
Prob for say is 0.0255701266375
Prob for president is 0.00870667394471
Prob for tell is 0.0076985327511
Prob for BBC is 0.0076985327511
Prob for web is 0
Prob for image is 0
Prob for online is 0
Prob for site is 0
Prob for Internet is 0

You can apply the exact same logic using a regular for loop, the groupby just does all the grouping work for you.
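As a rough illustration of the regular for loop version, here is a Python 3 sketch. The function name `weighted_topics` and its parameters are hypothetical, and it works on lists already read into memory rather than on the open files used above.

```python
def weighted_topics(lines, weights):
    """Return [(topic_header, {word: weighted_value})], with 0.0 for absent words."""
    lines = [line.strip() for line in lines]
    # First pass: every word that appears in any topic.
    all_words = {l.split()[0] for l in lines if l and not l.startswith("TOPIC")}
    weights = iter(weights)
    out = []
    current, weight = None, None
    # Second pass: a TOPIC header starts a new section and takes the next weight.
    for line in lines:
        if not line:
            continue
        if line.startswith("TOPIC"):
            weight = next(weights)
            current = dict.fromkeys(all_words, 0.0)  # missing words default to 0.0
            out.append((line, current))
        else:
            name, val = line.split()
            current[name] += float(val) * weight
    return out


doc1 = [
    "TOPIC:topic_0 5892.0",
    "site 0.0371690427699",
    "",
    "TOPIC:topic_1 12366.0",
    "Mr 0.150331554262",
]
result = weighted_topics(doc1, [0.345, 0.566667])
for header, probs in result:
    print(header)
    for word, value in probs.items():
        print("Prob for {} is {}".format(word, value))
```

Unlike the OrderedDict variants, this sketch does not put the missing words at the end of each section; the word order follows the set of all words.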

If you actually just want to write to a file, then the code is even simpler:

from itertools import groupby, imap
with open("doc1") as f,open("doc2") as f2,open("prob.txt","w") as f3:
    values = imap(float, f2.read().split())
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    for (k, v) in groupby(f, key=lambda x: not(x.strip())):
        if not k:
            topic, words  = next(v), []
            flt = next(values)
            f3.write(topic)    
            for s in v:
                name, val = s.split()
                words.append(name)
                f3.write("{} {}\n".format(name, (float(val) * flt)))
            for word in all_words.difference(words):
                  f3.write("{} {}\n".format(word, 0))
            f3.write("\n")

prob.txt:

TOPIC:topic_0 5892.0
site 0.0128233197556
Internet 0.00901731160895
online 0.00790478615073
web 0.00755346232181
say 0.00550407331974
image 0.00521130346231
BBC 0
Mr 0
s 0
president 0
tell 0

TOPIC:topic_1 12366.0
Mr 0.085187930859
s 0.0293277438137
say 0.0255701266375
president 0.00870667394471
tell 0.0076985327511
BBC 0.0076985327511
web 0
image 0
online 0
site 0
Internet 0
Padraic Cunningham
  • Padriac so nice of your gesture. I really appreciate it. I have a problem. All the values are 0 in my output file. I would like to assign 0 to words only that are not available in that topic. The rest I want them to be intact. I want to perform the same operation as per the question asked before yesterday but with only assigning 0 to words that are not present in the topic. All the codes are giving me 0 for all the words :( – Rudhra Jul 24 '15 at 14:15
  • Please don't strain. It is so much of valuable information you have shared. I would never forget the help you did the last 3 days :) It is a pleasure to see great minds like you online. It inspires ppl like me and Ana to make ourselves better qualified. Hats off!! – Rudhra Jul 24 '15 at 14:53
  • Padraic you have helped us a great deal. This code is kinda giving us 0s for all the words. However for the help you have done to us, it is really such an honour knowing you. Hats off to your dedication and knowledge. As Rudhra said we will really work hard and take ppl like you as an inspiration! – Ana_Sam Jul 24 '15 at 14:56
  • The only way it would give a zeros would be if doc2 contained zeros. – Padraic Cunningham Jul 24 '15 at 15:09
  • Doc 2 contains 0s. But not all the places. I have given a sample of my doc 2 in the question. Only the topics where the words are not there must be 0 in the list that is gathered by the code. Then that result will be multiplied with doc 2. – Rudhra Jul 24 '15 at 15:21
  • 1
    Once the text files are in the same format the code should work. Add some prints in the code to debug – Padraic Cunningham Jul 24 '15 at 15:36
  • Thank you so much for the inputs Padraic. Really learnt things I never knew before in coding :) – Ana_Sam Jul 25 '15 at 11:22

As an alternative, more concise way of rewriting the blocks, you can store all the names in a set, create an OrderedDict of your blocks, get the missing names for each block using set.difference against the set of all words, and then write them at the end of the block:

from itertools import tee
from collections import OrderedDict

d=OrderedDict()
with open('input.txt') as f,open('new','w') as new:
    f2,f3,f=tee(f,3)
    next(f3)
    words={line.split()[0] for line in f if not line.startswith('TOPIC') and line.strip()}

    for line in f2:
        if line.startswith('TOPIC'):
           key=line
           next_line=next(f3)
           try:
               while not next_line.startswith('TOPIC'):
                  d.setdefault(key,[]).append(next_line)
                  next_line=next(f3)
           except StopIteration:
                # reached the end of the file
                pass

    for k,v in d.items():
        block_words={line.split()[0] for line in v if line.strip()}
        insec=words.difference(block_words)
        new.writelines([k]+v+['{} {}\n'.format(i,0) for i in insec])

Result :

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
president 0
s 0
BBC 0
tell 0
Mr 0
TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
web 0
image 0
online 0
site 0
Internet 0
Mazdak
  • Thank you so much for the immediate response. But this is totally screwing up the order. I want the topic orders to be the same from 0 to 99. It is giving Topic 48 first then 83. – Rudhra Jul 24 '15 at 14:16
  • @Rudhra Welcome, So you need to use `OrderedDict` instead of `defaultdict` and simulate the behavior of defaultdict by using `dict.setdefault` method ,check the edit. – Mazdak Jul 24 '15 at 14:21
  • Thank you so much for the timely help :) It is all perfectly fine now :) – Ana_Sam Jul 25 '15 at 11:21

I would first read Doc1 into a list of mappings {word: value}, with each topic building one element of the list.

with open('Doc1') as f:
    maps = []
    for line in f:
        line = line.strip()
        if line.startswith('TOPIC'):
            mapping = {}
            maps.append(mapping)
        elif len(line) == 0:
            pass
        else:
            k, v = line.split()
            mapping[k] = v

Then I will build a set of all words by taking the union of keys from all mappings

words = set()
for mapping in maps:
    words = words.union(mapping.keys())

Then I will iterate on each mapping and add a 0 value for all keys in the set of words not already present in the dict.

for mapping in maps:
    for k in words.difference(mapping.keys()):
        mapping[k] = 0

That way, all words are present in all mappings, and it is trivial to build a nice d dict :

d = {k : list() for k in words }
for mapping in maps:
    for k in mapping:
        d[k].append(float(mapping[k]))

Each word present in at least one topic now has as its value a list of 100 values, one per topic, with the true value where the word is present and 0 where it is not, so zip will now work fine.
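A compact Python 3 sketch of the last two steps plus the final multiplication, on toy data. The numbers are rounded values loosely based on the question, the content of the third mapping is purely hypothetical, and `dict.get` with a default is used here as a variation instead of writing the zeros into each mapping.

```python
# One mapping per topic (toy values; the third topic's content is hypothetical).
maps = [
    {"say": 0.015, "site": 0.037},
    {"say": 0.045, "Mr": 0.150},
    {"BBC": 0.013},
    {"say": 0.062, "word": 0.182},
]
weights = [0.566667, 0.0333333, 0.133333, 0.0]  # first four values of doc2

# Union of the keys of all mappings.
words = set().union(*maps)

# One list per word, one slot per topic, 0.0 where the word is absent.
d = {w: [m.get(w, 0.0) for m in maps] for w in words}

# Every list now has one entry per topic, so zip lines up with the weights.
probs = {w: sum(v * x for v, x in zip(vals, weights)) for w, vals in d.items()}

print(d["say"])
print(probs["say"])
```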

Serge Ballesta