Beam SQL / Apache Beam is Slower when Running Multiple Joins

Question

While doing joins on 2 tables using Beam SQL then it's working properly provide expected performance but as my Joining Tables increases then the performance become worst.

Below is my snippet which might help you to debug my Joining condition in Beam SQL to get better performance.

PCollection<Row> outputStream2 = PCollectionTuple.of(new TupleTag<>("corporation1"), sourceData)
                .and(new TupleTag<>("dim"), dimtable).and(new TupleTag<>("place"), placeData)
                .and(new TupleTag<>("principle"), principle).apply(SqlTransform.query(
                        "Select d.merchant,d.corporation1,d.place,d.principal,c.corporation1_sk,r.place_sk,p.principal_sk FROM dim d LEFT JOIN corporation1 c ON c.corporation1 = d.corporation1 LEFT JOIN place p ON p.place = d.place and c.corporation1 = p.corporation1 "));

Any better way I can do Joins on Beam SQL/ Apache Beam because Sequential Joins in Table,

Where the previous output is responsible for next table Joins. I have tried using Co-GroupBy and SideInput mixed approach also where data in Table is below 5K I took the SideInput and where data is above 50K used Co-GroupBy for Joining but the performance is not upto the mark.

Hi, got any solution for this? I'm also facing same issue. – Syed Mohammed Mehdi Aug 21 '20 at 19:46 — Syed Mohammed Mehdi, Aug 21 '20 at 19:46

score 3 · Accepted Answer · answered Aug 14 '18 at 16:46

It looks like you're seeing something similar to this issue, which currently doesn't have an ETA for a fix. Beam SQL itself doesn't do a lot of JOIN optimizations at the moment, it chooses the most appropriate approach (side-input, CoGBK) based on what kind of inputs it receives, but that's pretty much it, you cannot control it otherwise.

It is hard to tell for sure without knowing your specific setup, e.g. what kind of data sources you have, or how did you make sure to use side-input vs CoGBK, or what runner you use, or what performance you expect vs what you're actually observing.

Relevant links, if you need more details:

Design Your Pipeline / Multiple Sources;
CoGroupByKey;
Side-inputs;
Beam SQL JOIN Implementation;
CoGBK-based Join library, this is what Beam SQL uses for unbounded joins;

No, there is no new guidance for this, sorry. I suggest leaving the comment in https://issues.apache.org/jira/browse/BEAM-5049 so that it gets higher visibility — Anton, Aug 24 '20 at 23:57

Beam SQL / Apache Beam is Slower when Running Multiple Joins

1 Answers1