I have the following dataset
s1, s2, count
1, 2, x1
1, 3, x2
1, 4, x3
2, 1, y1
2, 3, y2
2, 4, y3
3, 1, z1
3, 2, z2
I want to get the following output
s1, s2, count
1, 2, x1-y1
1, 3, x2-z1
1, 4, x3
2, 3, y2-z2
2, 4, y3
The idea is that s1 is an entity which is favored over s2. And I have tuples such that s1 (say = 1) was favored over s2 (say = 2) by x1 times AND s1 (say = 2) was favored over s2 (say = 1) by y1 times. What I need is a sub O(n^2) algorithm to compute absolute number of times s1 was favored over s2 (or other way round). (x1-y1)
The problem is that there are 230 million such tuples and I can't have O(n^2) algorithm to compute this.
One observation is that the tuples are sorted on s1 since they are results of another MR output.
Please help me find a better solution.