I have a folder of parquet files with different schemas; they all share one common field that is guaranteed to exist. I want to filter the rows according to that field and write the remaining rows back to another parquet file.
A similar action in Spark would be fairly simple and would look something like:
val filtered = rawsDF.filter(!col("id").isin(idsToDelete: _*))
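To make the target behavior concrete, this is roughly the whole Spark job I am comparing against (paths and the id list are placeholders, and the schema handling is left to Spark's inference):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("filter-ids").getOrCreate()
val idsToDelete = Seq(1, 2, 3)                        // placeholder: ids to drop
val rawsDF = spark.read.parquet("s3a://foo/day=01/")  // placeholder input path
val filtered = rawsDF.filter(!col("id").isin(idsToDelete: _*))
filtered.write.parquet("s3a://foo/day=01-filtered/")  // placeholder output path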
The problem is that if I extend ParquetInputFormat I also have to supply the schema, which can differ from file to file:
class ParquetInputFmt(path: Path, messageType: MessageType) extends ParquetInputFormat[User](path, messageType)
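One way I could obtain that per-file schema would be to read it from each file's footer before constructing the format, roughly like this (it is the same footer read that the source function below also does; path and configuration are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.schema.MessageType

// Read the MessageType from one file's footer so it can be passed to the format.
def schemaOf(path: Path, conf: Configuration): MessageType = {
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))
  try reader.getFileMetaData.getSchema
  finally reader.close()
}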
Alternatively, I can use a source function like this:
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.ParquetReadOptions
import org.apache.parquet.column.page.PageReadStore
import org.apache.parquet.example.data.Group
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.io.ColumnIOFactory

class ParquetSourceFunction extends SourceFunction[String] {
  private val listOfIds = Set(1, 2, 3) // placeholder: ids to drop (same role as idsToDelete above)

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    val inputPath = "s3a://foo/day=01/"
    val conf = new Configuration()
    conf.setBoolean("recursive.file.enumeration", true)
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    val hadoopFile = HadoopInputFile.fromPath(new Path(inputPath), conf)

    // Read this file's schema from its footer.
    val readFooter = ParquetFileReader.open(hadoopFile)
    val metadata = readFooter.getFileMetaData
    val schema = metadata.getSchema

    val parquetFileReader = new ParquetFileReader(hadoopFile, ParquetReadOptions.builder().build())
    var pages: PageReadStore = null
    try {
      while ({ pages = parquetFileReader.readNextRowGroup(); pages != null }) {
        val rows = pages.getRowCount
        val columnIO = new ColumnIOFactory().getColumnIO(schema)
        val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
        (0L until rows).foreach { _ =>
          val group: Group = recordReader.read()
          val ind = group.getType.getFieldIndex("id")
          val id = group.getInteger(ind, 0) // (field index, repetition index)
          if (!listOfIds.contains(id))
            ctx.collect(???) // how can I get the original row ?
        }
      }
    } finally {
      parquetFileReader.close()
      readFooter.close()
    }
  }

  override def cancel(): Unit = {}
}
My problem with this source-function approach is that I cannot get the original raw record back in order to emit it and write it out.
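For reference, the write side I am aiming for, assuming I could somehow recover each surviving row as a Group, would look roughly like the following; the output path and schema are placeholders, and ExampleParquetWriter (parquet's example Group writer) is just one possible choice, not necessarily what I will end up using:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.example.ExampleParquetWriter
import org.apache.parquet.schema.MessageType

// Rough sketch: write the kept Groups back out using the schema read from the source file's footer.
def writeGroups(groups: Iterator[Group], schema: MessageType, outPath: Path, conf: Configuration): Unit = {
  val writer = ExampleParquetWriter
    .builder(outPath)    // placeholder output path
    .withType(schema)    // schema taken from the source file's footer
    .withConf(conf)
    .build()
  try groups.foreach(g => writer.write(g))
  finally writer.close()
}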
Any ideas?