Apache Beam: running SQL on Kafka data fails with "Cannot call getSchema when there is no schema"

Question

I write data from multiple tables to Kafka, and Beam runs SQL on the data after reading it, but I now get the following error:

Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
    at org.apache.beam.sdk.values.PCollection.getSchema(PCollection.java:328)
    at org.apache.beam.sdk.extensions.sql.impl.schema.BeamPCollectionTable.<init>(BeamPCollectionTable.java:34)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.toTableMap(SqlTransform.java:141)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:102)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:82)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:539)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473)
    at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:248)
    at BeamSqlTest.main(BeamSqlTest.java:65)

Is there a feasible solution? Please help me!

score 1 · Accepted Answer · answered Nov 22 '19 at 18:00

I think you need to set a schema on your input collection, PCollection<Row> apply, with setRowSchema() or setSchema(). The problem is that your schema is dynamic and defined at runtime (I'm not sure whether Beam supports this). Could you use a static schema and define it before you start processing the input data?
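A minimal sketch of the static-schema approach, not the asker's actual fix. It assumes the four fields from the sample record in the question (id, name, age, score). The class name StaticSchemaSketch, the helper toRow, and the comma-separated input format are hypothetical and only keep the example self-contained; Create.of stands in for the Kafka source.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class StaticSchemaSketch {
    // Schema fixed before the pipeline is built (assumed fields, taken from the sample record).
    static final Schema USER_SCHEMA = Schema.builder()
            .addInt32Field("id")
            .addStringField("name")
            .addInt32Field("age")
            .addInt32Field("score")
            .build();

    // Hypothetical helper: parses "id,name,age,score" and converts each value
    // to the type declared in USER_SCHEMA, in schema field order.
    static Row toRow(String line) {
        String[] f = line.split(",");
        return Row.withSchema(USER_SCHEMA)
                .addValues(Integer.parseInt(f[0]), f[1], Integer.parseInt(f[2]), Integer.parseInt(f[3]))
                .build();
    }

    static class ToRowFn extends DoFn<String, Row> {
        @ProcessElement
        public void processElement(@Element String line, OutputReceiver<Row> out) {
            out.output(toRow(line));
        }
    }

    static PCollection<Row> buildRows(Pipeline p) {
        return p.apply(Create.of("1,test01,29,99")) // stand-in for the Kafka read
                .apply(ParDo.of(new ToRowFn()))
                // Attach the schema to the PCollection itself; building each Row
                // with a schema inside the DoFn is not enough for SqlTransform.
                .setRowSchema(USER_SCHEMA);
    }

    public static void main(String[] args) {
        PCollection<Row> rows = buildRows(Pipeline.create());
        System.out.println("hasSchema: " + rows.hasSchema());
    }
}
```

The key line is the setRowSchema call on the output of the ParDo: SqlTransform reads the schema from the PCollection, so it has to be known when the pipeline is constructed.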

Also, since your input source is unbounded, you need to define windows before applying SqlTransform.
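A rough sketch of windowing the rows before the query, under stated assumptions: the one-minute fixed window is an arbitrary choice, the class name WindowedSqlSketch and the helper sampleRows are hypothetical, and Create.of stands in for the unbounded Kafka source. The query and the USER_TABLE tag are the ones from the question.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

public class WindowedSqlSketch {
    // Assumed static schema, matching the sample record in the question.
    static final Schema USER_SCHEMA = Schema.builder()
            .addInt32Field("id")
            .addStringField("name")
            .addInt32Field("age")
            .addInt32Field("score")
            .build();

    // Hypothetical stand-in for the Kafka source: one row that already carries the schema.
    static PCollection<Row> sampleRows(Pipeline p) {
        Row sample = Row.withSchema(USER_SCHEMA).addValues(1, "test01", 29, 99).build();
        return p.apply(Create.of(sample).withRowSchema(USER_SCHEMA));
    }

    // Assign each row to a one-minute fixed window (the duration is an arbitrary choice).
    static PCollection<Row> window(PCollection<Row> rows) {
        return rows.apply(Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1))));
    }

    // Run the aggregation from the question on the windowed rows.
    static PCollection<Row> query(PCollection<Row> windowed) {
        return PCollectionTuple.of(new TupleTag<Row>("USER_TABLE"), windowed)
                .apply(SqlTransform.query(
                        "SELECT COUNT(id) total_count, SUM(score) total_score "
                                + "FROM USER_TABLE GROUP BY id"));
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<Row> result = query(window(sampleRows(p)));
        System.out.println(result.getSchema());
    }
}
```

With an unbounded source, the GROUP BY is then evaluated per window rather than over the whole stream.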

Alexey Romanenko


Thanks a lot! After adding a schema and a window, it works! – smarctor Nov 27 '19 at 08:04
Great that it works! Please, accept this answer if it helped you. Thanks! – Alexey Romanenko Nov 27 '19 at 10:18

score 0 · Answer 2 · answered Nov 22 '19 at 10:18

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.beam.repackaged.sql.com.google.common.collect.ImmutableMap;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.ArrayList;
import java.util.List;

class BeamSqlTest {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(PipelineOptions.class);
        options.setRunner(DirectRunner.class);
        Pipeline p = Pipeline.create(options);

        PCollection<KafkaRecord<String, String>> lines = p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("192.168.8.16")
                .withTopic("tmp_table.reuslt")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withConsumerConfigUpdates(ImmutableMap.of("group.id", "beam_app"))
                .withReadCommitted()
                .commitOffsetsInFinalize());

        PCollection<Row> apply = lines.apply(ParDo.of(new DoFn<KafkaRecord<String, String>,Row>(){
            @ProcessElement
            public void processElement(ProcessContext c) {
                String jsonData = c.element().getKV().getValue(); //data: {id:0001@int,name:test01@string,age:29@int,score:99@int}
                if(!"data_increment_heartbeat".equals(jsonData)){ //Filter out heartbeat information
                    JSONObject jsonObject = JSON.parseObject(jsonData);
                    Schema.Builder builder = Schema.builder();
                    //A data pipeline may have data from multiple tables so the Schema is obtained dynamically
                    //This assumes data from a single table
                    List<Object> list = new ArrayList<Object>();
                    for(String s : jsonObject.keySet()) {
                        String[] dataType = jsonObject.get(s).toString().split("@");   //data@field type
                        if(dataType[1].equals("int")){
                            builder.addInt32Field(s);
                        }else if(dataType[1].equals("string")){
                            builder.addStringField(s);
                        }
                        list.add(dataType[0]);
                    }
                    Schema schema = builder.build();
                    Row row = Row.withSchema(schema).addValues(list).build();
                    System.out.println(row);
                    c.output(row);
                }
            }
        }));

        PCollection<Row> result = PCollectionTuple.of(new TupleTag<>("USER_TABLE"), apply)
                .apply(SqlTransform.query("SELECT COUNT(id) total_count, SUM(score) total_score FROM USER_TABLE GROUP BY id"));

        result.apply( "log_result", MapElements.via( new SimpleFunction<Row, Row>() {
            @Override
            public Row apply(Row input) {
                System.out.println("USER_TABLE result: " + input.getValues());
                return input;
            }
        }));

    }
}