Read Parquet file using Apache Beam Java SDK without providing schema

Question

It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.

Is there a way to avoid the need to pass in the schema?
Isn't the schema included in the Parquet file?
What if I am trying to read multiple Parquet files with different schemas?

score 0 · Answer 1 · answered Nov 26 '19 at 15:04

Please find my responses inline.

Is there a way to avoid the need to pass in the schema? Currently, there is no mechanism to avoid passing the schema of the Parquet files.
Isn't the schema included in the Parquet file? Yes, that is correct: the metadata in the file's footer holds the schema definition. Please refer to BEAM-8344, an open feature request to support schema inference.
What if I am trying to read multiple Parquet files with different schemas? You can do something like the following, passing a file pattern together with the schema of the files it matches. For files with different schemas, apply the read once per schema, each with the file pattern that matches it.

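  // Match the input files, then open each match as a ReadableFile.
  PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());

  // Decode the files into Avro GenericRecords using the supplied Avro schema.
  PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));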
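
On the second point, a small sketch may help show where the schema lives. A Parquet file ends with the Thrift-encoded file metadata (which includes the schema), followed by a 4-byte little-endian metadata length and the 4-byte magic `PAR1`. The class below only locates that footer using the JDK; it does not decode the schema. The class and method names are hypothetical, and it assumes a file with an unencrypted footer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: it is not part of Beam or Parquet.
class ParquetFooterProbe {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  /** Returns the byte length of the footer metadata, which is where the schema is stored. */
  static int footerLength(String path) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      long size = file.length();
      // Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
      if (size < 12) {
        throw new IOException("Too small to be a Parquet file: " + path);
      }
      // The last 8 bytes are: 4-byte little-endian footer length, then the magic.
      byte[] tail = new byte[8];
      file.seek(size - 8);
      file.readFully(tail);
      for (int i = 0; i < MAGIC.length; i++) {
        if (tail[4 + i] != MAGIC[i]) {
          throw new IOException("Missing trailing PAR1 magic: " + path);
        }
      }
      return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Footer metadata length: " + footerLength(args[0]) + " bytes");
  }
}
```

Decoding the schema from those bytes needs a Parquet library, which is why a reader that infers the schema has to open each file's footer first.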

It would be better to read **file patterns and schema** pairs to read multiple parquet files with different schemas to avoid reading all the files for each schema. — Bruno, Sep 02 '21 at 01:48
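
The suggestion in the comment above could be sketched roughly as follows: keep a map from file pattern to Avro schema and apply one read per entry, so each file is only read with the schema that belongs to it. This is a minimal sketch, not a tested pipeline. The class name `MultiSchemaParquetRead`, the method `readAll`, and the parameter `schemaByPattern` are hypothetical; it assumes the Beam Java SDK and the Beam Parquet IO module are on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper, for illustration only.
class MultiSchemaParquetRead {

  /**
   * Applies one match/read chain per (file pattern, schema) pair and returns
   * the resulting collections keyed by file pattern.
   */
  static Map<String, PCollection<GenericRecord>> readAll(
      Pipeline pipeline, Map<String, Schema> schemaByPattern) {
    Map<String, PCollection<GenericRecord>> recordsByPattern = new LinkedHashMap<>();
    int index = 0;
    for (Map.Entry<String, Schema> entry : schemaByPattern.entrySet()) {
      // Each chain gets its own step-name prefix so transform names stay unique.
      String step = "Read" + index++;
      PCollection<GenericRecord> records = pipeline
          .apply(step + "/Match", FileIO.match().filepattern(entry.getKey()))
          .apply(step + "/ReadMatches", FileIO.readMatches())
          .apply(step + "/Parquet", ParquetIO.readFiles(entry.getValue()));
      recordsByPattern.put(entry.getKey(), records);
    }
    return recordsByPattern;
  }
}
```

The schemas still have to be supplied by the caller; this only avoids applying every schema to every file.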
