
I am using Java, and my current datasets look like this:

Dataset d1 (column c1 holds both integer and string values):
c1, c2, c3
12, ab, a
xy, ah, ab
19, a, ad
a, b, c

Dataset d2:
c1, c2, c3
12, ab, a
10, ah, ab
19, a, xy
1, b, c

Now I want to join the two datasets with an OR condition, like this:

d1.col("c1").equalTo(d2.col("c1")).or(d1.col("c1").equalTo(d2.col("c3")))
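To make the intended result concrete, here is a minimal plain-Java sketch (no Spark involved) of what the condition above should match on the sample rows. The class and method names (`OrJoinSketch`, `indexBy`, `orJoin`, `sampleD1`, `sampleD2`) are hypothetical and only for illustration. All values are treated as strings, and each match is reported as a pair of zero-based row indices `d1Row-d2Row`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Spark or of my actual job.
class OrJoinSketch {

    // Maps each value found in the given column to the indices of the rows holding it.
    static Map<String, List<Integer>> indexBy(List<String[]> rows, int column) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            index.computeIfAbsent(rows.get(i)[column], k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    // Returns "i-j" for every pair where d1[i].c1 equals d2[j].c1 OR d2[j].c3.
    static List<String> orJoin(List<String[]> d1, List<String[]> d2) {
        Map<String, List<Integer>> byC1 = indexBy(d2, 0); // lookup for d1.c1 = d2.c1
        Map<String, List<Integer>> byC3 = indexBy(d2, 2); // lookup for d1.c1 = d2.c3
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < d1.size(); i++) {
            String key = d1.get(i)[0];
            // A set keeps a d2 row that satisfies both branches from appearing twice.
            Set<Integer> matches = new LinkedHashSet<>();
            matches.addAll(byC1.getOrDefault(key, Collections.<Integer>emptyList()));
            matches.addAll(byC3.getOrDefault(key, Collections.<Integer>emptyList()));
            for (int j : matches) {
                pairs.add(i + "-" + j);
            }
        }
        return pairs;
    }

    // Sample rows of d1 from the question, as (c1, c2, c3).
    static List<String[]> sampleD1() {
        return Arrays.asList(
            new String[] {"12", "ab", "a"},
            new String[] {"xy", "ah", "ab"},
            new String[] {"19", "a", "ad"},
            new String[] {"a", "b", "c"});
    }

    // Sample rows of d2 from the question, as (c1, c2, c3).
    static List<String[]> sampleD2() {
        return Arrays.asList(
            new String[] {"12", "ab", "a"},
            new String[] {"10", "ah", "ab"},
            new String[] {"19", "a", "xy"},
            new String[] {"1", "b", "c"});
    }

    public static void main(String[] args) {
        // prints [0-0, 1-2, 2-2, 3-0]
        System.out.println(orJoin(sampleD1(), sampleD2()));
    }
}
```

Each branch of the OR is a plain equality, so in this sketch each branch gets its own hash lookup and the two result sets are merged per row. This only illustrates the expected output; it makes no claim about how Spark plans or executes the join.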

The join above works on small datasets, but on large ones (about 8 billion rows joined with 1 million rows) it runs forever. I am not sure why this happens. Any leads?

I have also tried the following:

when(condition, value1).otherwise(value2)

That did not work either. Searching online turned up nothing useful, and the approach in the Stack Overflow post "Conditional Join in Spark DataFrame" did not work for me.

Prateek Jain
  • Take a look at [Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each](https://stackoverflow.com/q/50088548) - this is exactly the same category of problems. – zero323 Sep 21 '18 at 15:46
  • Hi, thanks for sharing the post, but it still doesn't answer the question of how I can join two datasets on one column against several columns using OR. – Prateek Jain Sep 21 '18 at 15:57
