
I'm trying to perform ETL which involves loading files from HDFS, applying transforms, and writing them to Hive. While using SqlTransform to perform the transformations by following this doc, I'm encountering the issue below. Can you please help?

java.lang.IllegalStateException: Cannot call getSchema when there is no schema
    at org.apache.beam.sdk.values.PCollection.getSchema(PCollection.java:328)
    at org.apache.beam.sdk.extensions.sql.impl.schema.BeamPCollectionTable.<init>(BeamPCollectionTable.java:34)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.toTableMap(SqlTransform.java:105)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:90)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:77)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:471)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:339)
    at org.apache.beam.examples.SqlTest.runSqlTest(SqlTest.java:107)
    at org.apache.beam.examples.SqlTest.main(SqlTest.java:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)

Code Snippet:

PCollection<String> data = p.apply("ReadLines", TextIO.read().from(options.getInputFile()));

if (options.getOutput().equals("hive")) {
    Schema hiveTableSchema = Schema.builder()
            .addStringField("eid")
            .addStringField("name")
            .addStringField("salary")
            .addStringField("destination")
            .build();

    data.apply(ParDo.of(new DoFn<String, Row>() {
            @ProcessElement
            public void processElement(@Element String input, OutputReceiver<Row> out) {
                String[] values = input.split(",");
                System.out.println(values);
                Row row = Row.withSchema(hiveTableSchema)
                        .addValues(values)
                        .build();
                out.output(row);
            }
        }))
        .apply(SqlTransform.query("select eid, destination from PCOLLECTION"))
        .apply(ParDo.of(new DoFn<Row, HCatRecord>() {
            @ProcessElement
            public void processElement(@Element Row input, OutputReceiver<HCatRecord> out) {
                HCatRecord record = new DefaultHCatRecord(input.getFieldCount());
                for (int i = 0; i < input.getFieldCount(); i++) {
                    record.set(i, input.getString(i));
                }
                out.output(record);
            }
        }))
        .apply("WriteData", HCatalogIO.write()
                .withConfigProperties(configProperties)
                .withDatabase("wmrpoc")
                .withTable(options.getOutputTableName()));
}
Bluecrow

1 Answer


It looks like you need to set the schema on the PCollection. In the walkthrough you linked, the Create...withCoder() step handles that. In your case the schema cannot be inferred from your ParDo: the only information Beam can potentially get by looking at it is that it outputs elements of type Row, and there is no way for Beam to know whether your ParDo even adheres to a single schema for all outputs. So you need to call pcollection.setRowSchema() before you apply SqlTransform, to tell Beam what schema you're planning to get out of your conversion ParDo.
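For illustration, a minimal sketch of where that call goes, reusing the data and hiveTableSchema variables from your snippet (the rest of the pipeline stays the same):

PCollection<Row> rows = data
    .apply(ParDo.of(new DoFn<String, Row>() {
        @ProcessElement
        public void processElement(@Element String input, OutputReceiver<Row> out) {
            out.output(Row.withSchema(hiveTableSchema)
                    .addValues(input.split(","))
                    .build());
        }
    }))
    // Declare the schema of the Rows emitted by the ParDo above; without this,
    // SqlTransform cannot resolve the schema of the PCOLLECTION table.
    .setRowSchema(hiveTableSchema);

PCollection<Row> result = rows.apply(
        SqlTransform.query("select eid, destination from PCOLLECTION"));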

Update

And it looks like most of what you're doing before HCatalog will likely be simplified a lot eventually, e.g. imagine only having to specify something like pipeline.apply(TextIO.readCsvRows(schema)).apply(sqlTransform).... In fact, Beam SQL already supports reading CSV files without extra conversion ParDos (through TextTableProvider), but that is not wired up to SqlTransform yet and is only accessible through the Beam SQL CLI.

Anton
  • Thanks a lot for replying. It solved the above error, but I now get a new one: `java.lang.RuntimeException: java.lang.RuntimeException: Property 'org.apache.beam.sdk.extensions.sql.impl.planner.BeamRelDataTypeSystem' not valid for plugin type org.apache.calcite.rel.type.RelDataTypeSystem` – Bluecrow Oct 25 '18 at 16:18
  • Update on trace: `An exception occured while executing the Java class. java.lang.RuntimeException: Property 'org.apache.beam.sdk.extensions.sql.impl.planner.BeamRelDataTypeSystem' not valid for plugin type org.apache.calcite.rel.type.RelDataTypeSystem: Cannot cast org.apache.beam.sdk.extensions.sql.impl.planner.BeamRelDataTypeSystem to org.apache.calcite.rel.type.RelDataTypeSystem -> [Help 1]` – Bluecrow Oct 25 '18 at 16:24
  • Hm, not sure what is happening here; `BeamRelDataTypeSystem` is a subclass of `RelDataTypeSystem` (https://github.com/apache/beam/blob/a2b0ad14f1525d1a645cb26f5b8ec45692d9d54e/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRelDataTypeSystem.java#L24). Are you building the Beam SDK from sources or using Beam as a maven dependency? – Anton Oct 25 '18 at 16:27
  • As a maven dependency: `<dependency> <groupId>org.apache.beam</groupId> <artifactId>beam-sdks-java-extensions-sql</artifactId> <version>${beam.version}</version> </dependency>`. And the Beam version is 2.7.0 – Bluecrow Oct 25 '18 at 16:29
  • I don't understand what could be causing this; I filed https://issues.apache.org/jira/browse/BEAM-5858 to follow up. Unfortunately I don't know of a workaround at the moment either :( I will update the answer when I figure it out – Anton Oct 25 '18 at 16:38