It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.

  • Is there a way to avoid the need to pass in the schema?
  • Isn't the schema included in the Parquet file?
  • What if I am trying to read multiple Parquet files with different schema?
3thanZ

1 Answer

Please find my responses inline:

  • Is there a way to avoid the need to pass in the schema? Currently there is no mechanism to avoid passing the schema of the Parquet files.

  • Isn't the schema included in the Parquet file? Yes, that is correct: the metadata in the file footer holds the schema definition of the file. Please refer to BEAM-8344, an open feature request to support schema inference.

  • What if I am trying to read multiple Parquet files with different schema? You can do something like the following, passing file patterns or paths and specifying a different schema for each read.

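  // Match the input file pattern and turn each match into a readable file.
  PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());

  // Read the matched files as Avro GenericRecords using the supplied schema.
  PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));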
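
On the second point, about the schema living inside the file: a Parquet file ends with the serialized file metadata (which carries the schema), then a 4-byte little-endian length of that metadata, then the 4-byte magic PAR1. The sketch below is only an illustration using the Java standard library, not Beam or parquet-mr code. It assumes an unencrypted file, and the class and method names (ParquetFooterSketch, footerLength, syntheticFile) are made up for this example. It locates the footer length on a synthetic byte array; it does not decode the Thrift-encoded schema itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Beam or parquet-mr.
final class ParquetFooterSketch {
    // Unencrypted Parquet files begin and end with the 4-byte magic "PAR1".
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the length in bytes of the footer metadata (where the schema is stored),
     * read from the 4-byte little-endian integer just before the trailing magic.
     */
    static int footerLength(byte[] file) {
        if (file.length < 12) { // leading magic + footer length + trailing magic
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        byte[] tail = Arrays.copyOfRange(file, file.length - 4, file.length);
        if (!Arrays.equals(tail, MAGIC)) {
            throw new IllegalArgumentException("missing trailing PAR1 magic");
        }
        int len = ByteBuffer.wrap(file, file.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        if (len < 0 || len > file.length - 12) {
            throw new IllegalArgumentException("implausible footer length: " + len);
        }
        return len;
    }

    /** Builds a fake file: magic, zeroed data, zeroed footer, footer length, magic. */
    static byte[] syntheticFile(int dataLen, int footerLen) {
        ByteBuffer buf = ByteBuffer.allocate(4 + dataLen + footerLen + 4 + 4)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(MAGIC);
        buf.position(4 + dataLen + footerLen); // skip over the placeholder bytes
        buf.putInt(footerLen);
        buf.put(MAGIC);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] file = syntheticFile(16, 7);
        System.out.println("footer length = " + footerLength(file));
    }
}
```

Because the schema sits at the end of the file, a reader has to open each file and seek to its tail before it knows the record type, which is one reason a pipeline that needs the type up front asks for the schema explicitly.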
Jayadeep Jayaraman
  • It would be better to supply **file pattern and schema** pairs when reading multiple Parquet files with different schemas, so that you avoid reading all the files once per schema. – Bruno Sep 02 '21 at 01:48
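
The pairing idea in the comment above can be sketched without Beam. The following is a minimal illustration in plain Java, under the assumption that each schema is identified by a name and each file pattern is a glob; the class SchemaRoutingSketch, the method groupBySchema, and the sample paths and schema names are hypothetical. It assigns every file to the schema of the first pattern it matches, so each file is handled exactly once.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Beam.
final class SchemaRoutingSketch {
    /**
     * Groups file paths by the schema name of the first glob pattern they match.
     * Files that match no pattern are skipped.
     */
    static Map<String, List<String>> groupBySchema(
            Map<String, String> patternToSchema, List<String> files) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String file : files) {
            Path path = Paths.get(file);
            for (Map.Entry<String, String> pair : patternToSchema.entrySet()) {
                PathMatcher matcher =
                        FileSystems.getDefault().getPathMatcher("glob:" + pair.getKey());
                if (matcher.matches(path)) {
                    grouped.computeIfAbsent(pair.getValue(), k -> new ArrayList<>()).add(file);
                    break; // first matching pattern wins, so each file is used once
                }
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("data/users/*.parquet", "UserSchema");   // sample pattern and schema name
        pairs.put("data/orders/*.parquet", "OrderSchema"); // sample pattern and schema name
        List<String> files = List.of("data/users/a.parquet", "data/orders/b.parquet");
        System.out.println(groupBySchema(pairs, files));
    }
}
```

In a Beam pipeline the same effect would presumably come from running one FileIO.match().filepattern(...) plus ParquetIO.readFiles(schema) branch per pair, rather than applying every schema to one shared set of files.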