I am doing some experiments with Beam SQL. I get a PCollection<Row> from the transform SampleSource and pass its output to a SqlTransform.

String sql1 = "select c1, c2, c3 from PCOLLECTION where c1 > 1";

The code below runs without any error.

POutput it = p.apply(new SampleSource()).apply(SqlTransform.query(sql1));
p.run().waitUntilFinish();

However, when I try the following lines of code, I get a runtime error.

POutput it = p.apply(new SampleSource());
it.getPipeline().apply(SqlTransform.query(sql1));
p.run().waitUntilFinish();

The error details are

Caused by: org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.sql.validate.SqlValidatorException: Object 'PCOLLECTION' not found

Please provide some pointers.


1 Answer


It doesn't work because you're applying a SqlTransform to a pipeline, not a PCollection.

You probably want to change it along these lines:


// source probably returns a PCollection,
// would make sense to change 'it' to PCollection:
PCollection<...> it = p.apply(new SampleSource());

// then apply SqlTransform to the PCollection from the previous step,
// that is apply it directly to 'it':
it.apply(SqlTransform.query(sql1));

...

How a Beam pipeline works, from a high-level perspective (a minimal sketch in code follows the list):

  • create a pipeline;
  • apply an IO PTransform that reads from some source and produces a PCollection of the elements it reads;
  • chain-apply more PTransforms to the PCollection from the previous step to process the data (conceptually, different PCollections will be produced at each step);
  • repeat;
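
For concreteness, here is a minimal sketch of that shape (the file path and the specific transforms are made up for illustration, they are not from your code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PipelineShape {
  public static void main(String[] args) {
    // create a pipeline:
    Pipeline p = Pipeline.create();

    // apply an IO PTransform; it produces a PCollection of elements:
    PCollection<String> lines = p.apply(TextIO.read().from("/tmp/input.txt"));

    // chain-apply another PTransform to the PCollection from the previous
    // step; conceptually this produces a new PCollection:
    PCollection<Integer> lengths = lines.apply(
        MapElements.into(TypeDescriptors.integers())
            .via((String line) -> line.length()));

    p.run().waitUntilFinish();
  }
}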

SqlTransform is a normal PTransform: it is applied to a PCollection of elements and outputs another PCollection as a result. The query you pass to SqlTransform.query() runs against a magical PCOLLECTION table that represents the PCollection you apply the SqlTransform to.
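
For example, here is a self-contained sketch (the schema, the field names, and the in-memory Create source are assumptions for illustration; this targets a reasonably recent Beam Java SDK):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class SqlOnPCollection {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    Schema schema = Schema.builder()
        .addInt32Field("c1")
        .addInt32Field("c2")
        .addInt32Field("c3")
        .build();

    // an in-memory stand-in for your SampleSource:
    PCollection<Row> rows = p.apply(Create.of(
            Row.withSchema(schema).addValues(1, 10, 100).build(),
            Row.withSchema(schema).addValues(2, 20, 200).build())
        .withRowSchema(schema));

    // inside the query, PCOLLECTION refers to 'rows', i.e. to the
    // PCollection that the SqlTransform is applied to:
    PCollection<Row> filtered = rows.apply(
        SqlTransform.query("select c1, c2, c3 from PCOLLECTION where c1 > 1"));

    p.run().waitUntilFinish();
  }
}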

What you are doing in your example is different:

  • create a pipeline;
  • apply a source PTransform that produces a POutput, not necessarily a PCollection;
  • ignore the output of your source and instead take the original pipeline and apply the SqlTransform directly to it;

So SqlTransform in this case is applied to the 'root' of the pipeline, not to the PCollection that comes out of the source. Instead of a chain of PTransforms applied one after another, you now have two PTransforms applied to the root independently of each other.

One more caveat is that SqlTransform expects input elements to be Rows, because SQL as a language works only on data that is represented as rows. There are two ways to achieve this:

  • manually convert the elements produced by the source to Rows by applying a ParDo between the source and the SqlTransform (sketched below);
  • use Beam's Schema framework (e.g. check out the PCollection.setSchema() method), which allows Beam SQL to automatically convert input elements to Rows.
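
A sketch of the first option, assuming a hypothetical MyEvent element type whose fields happen to match the query (MyEvent and its fields are invented for this example):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// hypothetical element type produced by the source (invented for this sketch):
class MyEvent implements java.io.Serializable {
  int c1;
  int c2;
  int c3;
}

public class ToRows {
  static final Schema SCHEMA = Schema.builder()
      .addInt32Field("c1")
      .addInt32Field("c2")
      .addInt32Field("c3")
      .build();

  static PCollection<Row> convert(PCollection<MyEvent> events) {
    return events
        .apply(ParDo.of(new DoFn<MyEvent, Row>() {
          @ProcessElement
          public void processElement(@Element MyEvent e, OutputReceiver<Row> out) {
            // build a Row with the same values as the source element:
            out.output(Row.withSchema(SCHEMA).addValues(e.c1, e.c2, e.c3).build());
          }
        }))
        // tell Beam the schema of the Rows so SqlTransform can validate the query:
        .setRowSchema(SCHEMA);
  }
}

The SqlTransform can then be applied to the PCollection<Row> returned by convert(...).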