I have a basic Beam pipeline that reads from GCS, does a Beam SQL transform and writes the results to BigQuery.

When I don't do any aggregation in my SQL statement, it works fine:

..
PCollection<Row> outputStream =
                sqlRows.apply(
                        "sql_transform",
                        SqlTransform.query("select views from PCOLLECTION"));
outputStream.setCoder(SCHEMA.getRowCoder());
..
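
For context, SCHEMA and sqlRows are set up roughly like this. This is a simplified sketch: the bucket path and the CSV parsing are illustrative, not the exact code (the results are later converted and written to BigQuery, elided here):

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

private static final Schema SCHEMA = Schema.builder()
        .addStringField("wikimedia_project")
        .addInt64Field("views")
        .build();

// pipeline was created earlier via Pipeline.create(options)
PCollection<Row> sqlRows = pipeline
        .apply("read_from_gcs", TextIO.read().from("gs://<my-bucket>/wikimedia/*.csv"))
        .apply("to_rows", ParDo.of(new DoFn<String, Row>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // naive CSV split, illustrative only
                String[] fields = c.element().split(",");
                c.output(Row.withSchema(SCHEMA)
                        .addValues(fields[0], Long.parseLong(fields[1]))
                        .build());
            }
        }))
        .setCoder(SCHEMA.getRowCoder());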

However, when I try to aggregate with a sum, it fails with a CannotPlanException:

..
PCollection<Row> outputStream =
                sqlRows.apply(
                        "sql_transform",
                        SqlTransform.query("select wikimedia_project, sum(views) from PCOLLECTION group by wikimedia_project"));
outputStream.setCoder(SCHEMA.getRowCoder());
..

Stacktrace:

Step #1: 11:47:37,562 0    [main] INFO  org.apache.beam.runners.dataflow.DataflowRunner - PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 117 files. Enable logging at DEBUG level to see which files will be staged.
Step #1: 11:47:39,845 2283 [main] INFO  org.apache.beam.sdk.extensions.sql.impl.BeamQueryPlanner - SQL:
Step #1: SELECT `PCOLLECTION`.`wikimedia_project`, SUM(`PCOLLECTION`.`views`)
Step #1: FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
Step #1: GROUP BY `PCOLLECTION`.`wikimedia_project`
Step #1: 11:47:40,387 2825 [main] INFO  org.apache.beam.sdk.extensions.sql.impl.BeamQueryPlanner - SQLPlan>
Step #1: LogicalAggregate(group=[{0}], EXPR$1=[SUM($1)])
Step #1:   BeamIOSourceRel(table=[[beam, PCOLLECTION]])
Step #1: 
Step #1: Exception in thread "main" org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.plan.RelOptPlanner$CannotPlanException: Node [rel#7:Subset#1.BEAM_LOGICAL.[]] could not be implemented; planner state:
Step #1: 
Step #1: Root: rel#7:Subset#1.BEAM_LOGICAL.[]
Step #1: Original rel:
Step #1: LogicalAggregate(subset=[rel#7:Subset#1.BEAM_LOGICAL.[]], group=[{0}], EXPR$1=[SUM($1)]): rowcount = 10.0, cumulative cost = {11.375000476837158 rows, 0.0 cpu, 0.0 io}, id = 5
Step #1:   BeamIOSourceRel(subset=[rel#4:Subset#0.BEAM_LOGICAL.[]], table=[[beam, PCOLLECTION]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 2
Step #1: 
Step #1: Sets:
Step #1: Set#0, type: RecordType(VARCHAR wikimedia_project, BIGINT views)
Step #1:        rel#4:Subset#0.BEAM_LOGICAL.[], best=rel#2, importance=0.81
Step #1:                rel#2:BeamIOSourceRel.BEAM_LOGICAL.[](table=[beam, PCOLLECTION]), rowcount=100.0, cumulative cost={100.0 rows, 101.0 cpu, 0.0 io}
Step #1:        rel#10:Subset#0.ENUMERABLE.[], best=rel#9, importance=0.405
Step #1:                rel#9:BeamEnumerableConverter.ENUMERABLE.[](input=rel#4:Subset#0.BEAM_LOGICAL.[]), rowcount=100.0, cumulative cost={1.7976931348623157E308 rows, 1.7976931348623157E308 cpu, 1.7976931348623157E308 io}
Step #1: Set#1, type: RecordType(VARCHAR wikimedia_project, BIGINT EXPR$1)
Step #1:        rel#6:Subset#1.NONE.[], best=null, importance=0.9
Step #1:                rel#5:LogicalAggregate.NONE.[](input=rel#4:Subset#0.BEAM_LOGICAL.[],group={0},EXPR$1=SUM($1)), rowcount=10.0, cumulative cost={inf}
Step #1:        rel#7:Subset#1.BEAM_LOGICAL.[], best=null, importance=1.0
Step #1:                rel#8:AbstractConverter.BEAM_LOGICAL.[](input=rel#6:Subset#1.NONE.[],convention=BEAM_LOGICAL,sort=[]), rowcount=10.0, cumulative cost={inf}
Step #1: 
Step #1: 
Step #1:        at org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.plan.volcano.RelSubset$CheapestPlanReplacer.visit(RelSubset.java:448)
Step #1:        at org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.plan.volcano.RelSubset.buildCheapestPlan(RelSubset.java:298)
Step #1:        at org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:666)
Step #1:        at org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
Step #1:        at org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:336)
Step #1:        at org.apache.beam.sdk.extensions.sql.impl.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:138)
Step #1:        at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:105)
Step #1:        at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:96)
Step #1:        at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:79)
Step #1:        at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
Step #1:        at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488)
Step #1:        at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:338)
Step #1:        at org.polleyg.TemplatePipeline.main(TemplatePipeline.java:59)
Step #1: :run FAILED
Step #1: 
Step #1: FAILURE: Build failed with an exception.

I'm using Beam 2.6.0.

Am I missing something obvious?

Graham Polley

1 Answer

This should work; it's a bug. Filed BEAM-5384.

If you look at the plan, it has a LogicalAggregate operation that represents the aggregation and needs to be implemented by Beam. Due to how Beam works, in order to implement an aggregation it also needs to pull some information from the LogicalProject operation that represents the field access in select f1, f2, and that's what is missing here. It is not yet clear whether this is a bug where the query is over-optimized and the projection is removed from the plan, or whether it is a valid use case that Beam should support.

One suggestion I have is to try modifying the select clause, e.g. re-ordering the fields or adding more of them (see the sketch below).
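
For instance, an untested sketch of that suggestion, simply re-ordering the fields in the select list:

PCollection<Row> outputStream =
        sqlRows.apply(
                "sql_transform",
                SqlTransform.query(
                        "select sum(views), wikimedia_project from PCOLLECTION group by wikimedia_project"));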

Update:

There was at least one issue causing this. Basically, when your schema has only the fields that you use in the query, there is no need for a projection, so Calcite doesn't add one to the plan. However, Beam aggregation needs a projection node to extract windowing information from (that's the current implementation; it is likely not the correct thing to do).

Workaround: in order to fix this specific query, you can add extra fields to the schema and not use them in the query. This will cause Calcite to add a projection node to the plan, and Beam SQL aggregation will then apply (see the sketch below).
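
A sketch of that workaround (the extra "language" field is an assumed example; any field the query never references should do, as long as your rows actually populate it):

private static final Schema SCHEMA = Schema.builder()
        .addStringField("wikimedia_project")
        .addInt64Field("views")
        .addStringField("language") // extra field, never referenced by the query
        .build();

// same query as before; the unused schema field now forces Calcite to keep
// a LogicalProject node in the plan, which Beam's aggregation requires
PCollection<Row> outputStream =
        sqlRows.apply(
                "sql_transform",
                SqlTransform.query(
                        "select wikimedia_project, sum(views) from PCOLLECTION group by wikimedia_project"));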

Beam HEAD now has this specific problem fixed: https://github.com/apache/beam/commit/8c35781d62846211e43b6b122b557f8c3fdaec6d#diff-4f4ffa265fe666e99c37c346d50da67dR637

Anton
  • Thanks Anton. I've already tried reordering the fields and adding more too, but neither works. This is a pretty nasty bug if Beam SQL can't do basic aggregation. – Graham Polley Sep 13 '18 at 21:47
  • Beam itself supports aggregation (including basic things like SUMs, even UDAFs): https://github.com/apache/beam/blob/06128f27d1780f25c23ca65cc7ace693a78dac80/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlDslAggregationTest.java#L152 https://github.com/apache/beam/blob/a2b0ad14f1525d1a645cb26f5b8ec45692d9d54e/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlDslUdfUdafTest.java#L40 – Anton Sep 13 '18 at 22:42
  • It looks like something in this specific case forces the optimizer to convert the plan into something that is unsupported, and I am not sure what. Can you share the full schema that you're using? I want to try to reproduce this. – Anton Sep 13 '18 at 22:43
  • Sure Anton. All the code (one class) is here: https://github.com/polleyg/gcp-batch-ingestion-bigquery/blob/beam_sql/src/main/java/org/polleyg/TemplatePipeline.java – Graham Polley Sep 14 '18 at 06:28
  • Updated the post with the follow-up and a workaround – Anton Oct 25 '18 at 02:24