
I've been playing with the Beam SQL DSL and I'm unable to use the output from a query without manually providing a coder that's aware of the output schema. Can I infer the output schema rather than hardcoding it?

Neither the walkthrough nor the examples actually use the output from a query. I'm using Scio rather than the plain Java API to keep the code relatively readable and concise; I don't think that makes a difference for this question.

Here's an example of what I mean.

Given an input schema inSchema and a data source that is mapped onto a Row as follows (Avro-based in this example, but again, I don't think that matters):

// Read Avro Foo records, map each onto a Row with schema inSchema, query with Beam SQL.
sc.avroFile[Foo](args("input"))
  .map(fooToRow)
  .setCoder(inSchema.getRowCoder)
  .applyTransform(SqlTransform.query("SELECT COUNT(1) FROM PCOLLECTION"))
  .saveAsTextFile(args("output"))
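
For reference, inSchema and fooToRow are along these lines (a simplified sketch; the field names below are purely illustrative and the exact fields of Foo don't matter for this question):

import org.apache.beam.sdk.schemas.Schema
import org.apache.beam.sdk.values.Row

// Hypothetical schema; substitute Foo's real fields.
val inSchema: Schema = Schema.builder()
  .addStringField("name")
  .addInt64Field("value")
  .build()

// Maps an Avro Foo record onto a Beam Row with the schema above.
def fooToRow(foo: Foo): Row =
  Row.withSchema(inSchema)
    .addValues(foo.getName, foo.getValue)
    .build()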

Running this pipeline results in a KryoException as follows:

com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
fieldIndices (org.apache.beam.sdk.schemas.Schema)
schema (org.apache.beam.sdk.values.RowWithStorage)
org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException

However, after inserting a RowCoder matching the SQL output, in this case a single int64 count column:

   ...snip...
   .applyTransform(SqlTransform.query("SELECT COUNT(1) FROM PCOLLECTION"))
   .setCoder(Schema.builder().addInt64Field("count").build().getRowCoder)
   .saveAsTextFile(args("output"))

Now the pipeline runs just fine.

Having to manually tell the pipeline how to encode the SQL output seems unnecessary, given that we specify the input schema/coder(s) and a query. It seems to me that we should be able to infer the output schema from those, but I can't see how, other than maybe by using Calcite directly?
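
In principle, something like the following might work: read the schema back off the output PCollection once the transform has been applied, instead of rebuilding it by hand. This is an untested sketch; it assumes SCollection#internal exposes the underlying Beam PCollection and that SqlTransform has attached a schema to its output (PCollection#getSchema throws if no schema is set):

// Untested sketch: reuse the schema SqlTransform attached to its output.
val counted = sc.avroFile[Foo](args("input"))
  .map(fooToRow)
  .setCoder(inSchema.getRowCoder)
  .applyTransform(SqlTransform.query("SELECT COUNT(1) FROM PCOLLECTION"))

counted
  .setCoder(counted.internal.getSchema.getRowCoder)
  .saveAsTextFile(args("output"))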

Before raising a ticket on the Beam Jira, I thought I'd check I wasn't missing something obvious!

brabster

1 Answer


Output schema inference should work; your expectation is correct. This seems like a bug (either in Beam or Scio). I've filed BEAM-5335 for investigation.

Anton
  • I think it is a bug in Scio: the pipeline works in plain Java, as it turns out. Then I saw that Scio sets the coder itself after applying a transform; maybe that is the problem. I'll raise a ticket on the repo: https://github.com/spotify/scio/blob/e95310ec0828e60209e279baa5047a6b62288c5a/scio-core/src/main/scala/com/spotify/scio/values/SCollection.scala#L149 – brabster Sep 07 '18 at 14:31
  • Thank you for following up! I'll update the Jira with your findings. I'm not familiar with how Scio sets coders when expanding PTransforms, but it does look like that's where the bug would live. – Anton Sep 07 '18 at 16:49
  • Overall, Beam SQL and Schemas are very much under active development, so there can be bugs and breaking changes now and then. Plus, to my knowledge, neither Beam SQL nor Schemas are currently being explicitly tested against Scio, so there's an extra layer of possible issues there. – Anton Sep 07 '18 at 16:52