
I am trying to apply SqlTransform to a PCollection<Object>. Here, the CustomSource transform generates a POJO at runtime, so the type of the object that SqlTransform runs on is not known at compile time.

        Pipeline p = Pipeline.create(options);

        PCollection<Object> objs = p.apply(new CustomSource());

        Schema type = Schema.builder().addInt32Field("c1").addStringField("c2").addDoubleField("c3").build();
        PCollectionTuple.of(new TupleTag<>("somedata"), objs).apply(SqlTransform.query("SELECT c1 FROM somedata"))
                .setSchema(type, SerializableFunctions.identity(), SerializableFunctions.identity());
        p.run().waitUntilFinish();

I have provided the schema to SqlTransform with setSchema, yet I still receive the following error:

java.lang.IllegalStateException: Cannot call getSchema when there is no schema
    at org.apache.beam.sdk.values.PCollection.getSchema(PCollection.java:328)
    at org.apache.beam.sdk.extensions.sql.impl.schema.BeamPCollectionTable.<init>(BeamPCollectionTable.java:34)

Is it possible to generate POJO objects at runtime and run SqlTransform on them by providing the schema information to the transform?

Here's the CustomSource class for reference:

import java.util.HashMap;
import java.util.Map;

import com.beaconinside.messages.PojoGenerator;

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

import javassist.CannotCompileException;
import javassist.NotFoundException;

public class CustomSource extends PTransform<PBegin, PCollection<Object>> {

    Map<String, Class<?>> props;
    Class<?> clazz;
    String data = "{\"c1\": 1, \"c2\": \"row\", \"c3\": 2.0}";

    public CustomSource() throws NotFoundException, CannotCompileException {
        props = new HashMap<String, Class<?>>();
        props.put("c1", Integer.class);
        props.put("c2", String.class);
        props.put("c3", Double.class);
        clazz = PojoGenerator.generate("net.javaforge.blog.javassist.PojoGenerated", props);
    }

    @Override
    public PCollection<Object> expand(PBegin input) {
        return input.apply(Create.of(data)).setCoder(StringUtf8Coder.of()).apply(new SensorSource(clazz, props));
        // return input.apply(Create.of(data));
    }

}
Akshata

3 Answers


I think your setSchema call only sets the schema of the output PCollection produced by SqlTransform. You should also set the schema on the input PCollection<Object> objs.
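A minimal sketch of what that looks like. This builds Beam Row values directly rather than runtime-generated POJOs (with POJOs you would pass toRow/fromRow conversion functions to setSchema instead); the key point is that the schema is attached to the input PCollection before SqlTransform is applied:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

public class SchemaOnInput {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        Schema type = Schema.builder()
                .addInt32Field("c1").addStringField("c2").addDoubleField("c3").build();

        // Build a Row matching the question's sample data.
        Row row = Row.withSchema(type).addValues(1, "row", 2.0).build();

        // Attach the schema to the *input* PCollection, before SqlTransform.
        PCollection<Row> rows = p.apply(Create.of(row)).setRowSchema(type);

        PCollectionTuple.of(new TupleTag<>("somedata"), rows)
                .apply(SqlTransform.query("SELECT c1 FROM somedata"));

        p.run().waitUntilFinish();
    }
}
```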

Rui Wang

The answer above is right: the PCollection<Object> should also call setSchema to define the input data schema and the row/object conversion functions. If you build a PCollectionTuple from multiple PCollections, each PCollection should call setSchema individually. The PCollectionTuple itself does not need setSchema, because the output schema can be inferred from the SQL query.
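To illustrate the multiple-input case, here is a sketch with two hypothetical tables (the orders/prices names and fields are invented for the example, not from the question). Each input PCollection carries its own schema; the tuple does not:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

public class MultiInputSql {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        Schema orders = Schema.builder().addInt32Field("id").addStringField("item").build();
        Schema prices = Schema.builder().addStringField("item").addDoubleField("price").build();

        // Each input PCollection sets its own schema...
        PCollection<Row> orderRows = p
                .apply("orders", Create.of(Row.withSchema(orders).addValues(1, "apple").build()))
                .setRowSchema(orders);
        PCollection<Row> priceRows = p
                .apply("prices", Create.of(Row.withSchema(prices).addValues("apple", 0.5).build()))
                .setRowSchema(prices);

        // ...while the tuple needs none; the output schema comes from the query.
        PCollectionTuple.of(new TupleTag<>("orders"), orderRows)
                .and(new TupleTag<>("prices"), priceRows)
                .apply(SqlTransform.query(
                        "SELECT o.id, p.price FROM orders o JOIN prices p ON o.item = p.item"));

        p.run().waitUntilFinish();
    }
}
```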

ifeng

Use setRowSchema, as below:

PCollection<Row> testApps = PBegin.in(p).apply(Create.of(row1,row2,row3).withCoder(RowCoder.of(appSchema)))
                .setRowSchema(appSchema);
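Filled out as a self-contained sketch: the appSchema and row1..row3 names come from the snippet above but are undefined there, so the schema fields and values here are illustrative. With a single input PCollection, SqlTransform exposes it under the table name PCOLLECTION:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class SetRowSchemaExample {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        // Illustrative schema; reuses the question's field names.
        Schema appSchema = Schema.builder()
                .addInt32Field("c1").addStringField("c2").addDoubleField("c3").build();
        Row row1 = Row.withSchema(appSchema).addValues(1, "a", 1.0).build();
        Row row2 = Row.withSchema(appSchema).addValues(2, "b", 2.0).build();
        Row row3 = Row.withSchema(appSchema).addValues(3, "c", 3.0).build();

        PCollection<Row> testApps = PBegin.in(p)
                .apply(Create.of(row1, row2, row3).withCoder(RowCoder.of(appSchema)))
                .setRowSchema(appSchema);

        // A lone input PCollection is queried as PCOLLECTION.
        testApps.apply(SqlTransform.query("SELECT c1, c3 FROM PCOLLECTION"));
        p.run().waitUntilFinish();
    }
}
```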
Syed Mohammed Mehdi