Calculate union and intersection with Python based on score of a column

Question

I need to calculate both the union and the intersection of the weighted score of an element in a column of 2 different files.

Input file 1 and input file 2 have the same format: three tab-separated columns. Here is an example:

input1

abc with-1-rosette-n    8.1530
abc with-1-tyre-n   6.3597
abc with-1-weight-n 4.8932

input2

deg about-article-n 3.2917
deg with-1-tyre-n   3.2773
deg about-bit-n 3.4527

We want to calculate the sum of the intersection of the scores (in Col3) over the Col2 values shared by ABC and DEG, taking the min(value) of the two scores for each shared value, as well as the sum of the union of the scores (in Col3) over all Col2 values of ABC and DEG.

In this case: intersection = 3.2773 (with-1-tyre-n) and union = 29.3546.

where we get a score by dividing the intersection by the union: score(intersection) / score(union). So, from this sample dataset, the desired output is as follows:

abc deg 0.1165

I have been working hard on the script and keep running into problems. I have already incorporated suggestions from several related questions, but I have not been able to solve my problem.

Here is the relevant part of the code that I am working with:

def polyCalc(a_dict, b_dict):
    intersect = min(classA & classB)
    union = classA | classB

    score = sum(intersect) / sum(union)
    return score

def calculate_polyCalc(classB_infile, classA_infile, outfile):
    targetContext_polyCalc_A = defaultdict(dict)  # { target_lemma : {feat1 : weights, feat2: weights} ...}
    with open(classA_infile, "rb") as opened_infile_A:
        for line_A in opened_infile_A:
            target_class_A, featureA, weight = line_A.split()
            targetContext_polyCalc_A[target_class_A][featureA] = float(weight)

        targetContext_polyCalc_B = defaultdict(dict)
        with open(classB_infile, "rb") as opened_infile_B:
            for line_B in opened_infile_B:
                target_class_B, featureB, weight = line_B.split()
                targetContext_polyCalc_B[target_class_B][featureB] = float(weight)
                classA = set(targetContext_polyCalc_A[featureA])
                classB = set(targetContext_polyCalc_B[featureB])


            with open(outfile, "wb") as output_file:
                poly = polyCalc(targetContext_polyCalc_A[target_class_A], targetContext_polyCalc_B[target_class_B], score)
                outstring = "\t".join([classA, classB, str(poly)])
                output_file.write(outstring + "\n")

I have followed the instructions in the documentation and on various websites, and the code above still produces an error. Besides the errors in how the union is computed in the function, I also seem to have a problem with how I have defined the dictionaries themselves. Can anyone provide some experienced insight on how to solve this problem and reach my desired outcome?

Thank you in advance.

PS: this was written with Python 2.* in mind.

What errors are you getting? Also, is that the correct indentation of the function? — CDspace, Nov 13 '13 at 20:05
Yes, and at the moment this is the error that I am getting in the traceback: ` File "trial.py", line 68 return score SyntaxError: 'return' outside function ` — owwoow14, Nov 13 '13 at 20:10
In that case, from the `with` down needs to be tabbed in one more level. As written, the function is only one line, the rest is outside the function, thus the error — CDspace, Nov 13 '13 at 20:51
you were right, however see updated code in question which now gives the following error: ` File "trial_poly_calc.py", line 93, in main(sys.argv) File "trial_poly_calc.py", line 89, in main calculate_polyCalc(classB_infile, classA_infile, outfile) File "trial_poly_calc.py", line 56, in calculate_polyCalc poly = polyCalc(targetContext_polyCalc_A[target_class_A], targetContext_polyCalc_B[target_class_B], score) NameError: global name 'score' is not defined` — owwoow14, Nov 14 '13 at 09:30

ely · Accepted Answer · 2013-11-13T20:17:20.187

I might solve this by making my own class that behaves like a set but can also hold values like a dict. I call it setmap below (maybe something like this already exists? Or maybe you can get away with just using dict.keys() like a set?)

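class setmap(set):
    """A set of keys that also remembers a value for each key."""

    def __init__(self, val_dict):
        # Name the class explicitly: super(self.__class__, self) recurses
        # forever if setmap is ever subclassed.
        super(setmap, self).__init__(val_dict.keys())
        self.val_dict = val_dict

    def __getitem__(self, itm):
        # Returns None for keys that are not present.
        return self.val_dict.get(itm)

    def add(self, key, val):
        # Keep the set of keys and the value mapping in sync.
        super(setmap, self).add(key)
        self.val_dict[key] = val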

Then something like this would work:

In [131]: t = setmap({'a':1, 'b':2, 'c':3})

In [132]: t1 = setmap({'a':3, 'd':8})

In [133]: t.intersection(t1)
Out[133]: set(['a'])

In [134]: {x:(t[x] + t1[x]) for x in t.intersection(t1)}
Out[134]: {'a': 4}

Then your goal is just to process your data into different setmaps, one for the abc data, one for the deg data, etc.
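
As a rough sketch of that processing step, the snippet below parses lines in the question's three-column format into one feature-to-weight dict per target. The function name load_weights and the choice to skip malformed lines are assumptions made for illustration, and plain dicts stand in for setmap here (each inner dict could be passed to setmap afterwards). It is written to run on both Python 2 and Python 3.

```python
from collections import defaultdict


def load_weights(lines):
    """Parse 'target feature weight' lines into {target: {feature: weight}}.

    `lines` can be any iterable of strings, e.g. an open file object.
    (Hypothetical helper, not part of the original answer.)
    """
    weights = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # assumption: silently skip blank or malformed lines
        target, feature, weight = parts
        weights[target][feature] = float(weight)
    return weights


# The sample rows from the question, used in place of a real file.
sample_abc = [
    "abc\twith-1-rosette-n\t8.1530",
    "abc\twith-1-tyre-n\t6.3597",
    "abc\twith-1-weight-n\t4.8932",
]
abc = load_weights(sample_abc)
```

With a real file, the same function can be called as load_weights(fh) inside a `with open(path) as fh:` block.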

To get some of the other stats you mentioned, you can utilize the same kind of dict comprehension, but with set.difference (first t.difference(t1) then t1.difference(t)) and then add in the result from using intersection. It is in some ways similar to an Inclusion/Exclusion Principle based approach to the problem.
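
A minimal sketch of that difference-plus-intersection idea follows, using plain dicts whose keys are treated as sets. The name overlap_score is hypothetical, and the definition of the union total is an assumption: here it adds up every weight from both inputs, while the intersection total takes the smaller weight of each shared feature, as the question describes. On the question's sample rows this reading gives about 0.1114; the question quotes 0.1165, so the asker's exact union definition may differ.

```python
def overlap_score(a, b):
    """Return sum(intersection) / sum(union) for two feature->weight dicts.

    Assumptions (not confirmed by the question): shared features contribute
    min(weight) to the intersection, and the union adds every weight from
    both sides.
    """
    shared = set(a) & set(b)
    only_a = set(a) - set(b)
    only_b = set(b) - set(a)

    intersection = sum(min(a[f], b[f]) for f in shared)
    union = (sum(a[f] for f in only_a)
             + sum(b[f] for f in only_b)
             + sum(a[f] + b[f] for f in shared))
    if union == 0:
        return 0.0  # avoid dividing by zero when both inputs are empty
    return intersection / union


# The sample data from the question.
abc = {"with-1-rosette-n": 8.1530, "with-1-tyre-n": 6.3597, "with-1-weight-n": 4.8932}
deg = {"about-article-n": 3.2917, "with-1-tyre-n": 3.2773, "about-bit-n": 3.4527}

score = overlap_score(abc, deg)
print("abc\tdeg\t%.4f" % score)
```

If the union is instead meant to count a shared feature only once (for example with its larger weight), only the last term of the union sum needs to change.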

This is kind of a cutesy way to do it and I don't claim that it is best for performance if loading lots of data. One other option is to load your data directly to a Pandas DataFrame object, then group by the middle column of data and aggregate as needed.
