I have a folder of parquet files with different schemas; they all share one common field that is guaranteed to exist. I want to filter the rows according to that field and write the remaining rows back to another parquet file.
A similar action in Spark would be fairly simple and would look something like:
val filtered = rawsDF.filter(!col("id").isin(idsToDelete: _*))
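To make the target behavior concrete, this is roughly the whole Spark job I am comparing against (paths and the id list are placeholders, and the schema handling is left to Spark's inference):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("filter-ids").getOrCreate()
val idsToDelete = Seq(1, 2, 3)                        // placeholder: ids to drop
val rawsDF = spark.read.parquet("s3a://foo/day=01/")  // placeholder input path
val filtered = rawsDF.filter(!col("id").isin(idsToDelete: _*))
filtered.write.parquet("s3a://foo/day=01-filtered/")  // placeholder output path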
The problem is that if I extend ParquetInputFormat I also have to supply the schema, which can differ from file to file:
class ParquetInputFmt(path: Path, messageType: MessageType) extends ParquetInputFormat[User](path, messageType)
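One way I could obtain that per-file schema would be to read it from each file's footer before constructing the format, roughly like this (it is the same footer read that the source function below also does; path and configuration are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.schema.MessageType

// Read the MessageType from one file's footer so it can be passed to the format.
def schemaOf(path: Path, conf: Configuration): MessageType = {
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))
  try reader.getFileMetaData.getSchema
  finally reader.close()
}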
Alternatively, I can use a source function like this:
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.ParquetReadOptions
import org.apache.parquet.column.page.PageReadStore
import org.apache.parquet.example.data.Group
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.io.ColumnIOFactory

class ParquetSourceFunction extends SourceFunction[String] {
  private val listOfIds = Set(1, 2, 3) // placeholder: ids to drop (same role as idsToDelete above)

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    val inputPath = "s3a://foo/day=01/"
    val conf = new Configuration()
    conf.setBoolean("recursive.file.enumeration", true)
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    val hadoopFile = HadoopInputFile.fromPath(new Path(inputPath), conf)

    // Read this file's schema from its footer.
    val readFooter = ParquetFileReader.open(hadoopFile)
    val metadata = readFooter.getFileMetaData
    val schema = metadata.getSchema

    val parquetFileReader = new ParquetFileReader(hadoopFile, ParquetReadOptions.builder().build())
    var pages: PageReadStore = null
    try {
      while ({ pages = parquetFileReader.readNextRowGroup(); pages != null }) {
        val rows = pages.getRowCount
        val columnIO = new ColumnIOFactory().getColumnIO(schema)
        val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
        (0L until rows).foreach { _ =>
          val group: Group = recordReader.read()
          val ind = group.getType.getFieldIndex("id")
          val id = group.getInteger(ind, 0) // (field index, repetition index)
          if (!listOfIds.contains(id))
            ctx.collect(???) // how can I get the original row ?
        }
      }
    } finally {
      parquetFileReader.close()
      readFooter.close()
    }
  }

  override def cancel(): Unit = {}
}
My problem with this source-function approach is that I cannot get the original raw record back in order to emit it and write it out.
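For reference, the write side I am aiming for, assuming I could somehow recover each surviving row as a Group, would look roughly like the following; the output path and schema are placeholders, and ExampleParquetWriter (parquet's example Group writer) is just one possible choice, not necessarily what I will end up using:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.example.ExampleParquetWriter
import org.apache.parquet.schema.MessageType

// Rough sketch: write the kept Groups back out using the schema read from the source file's footer.
def writeGroups(groups: Iterator[Group], schema: MessageType, outPath: Path, conf: Configuration): Unit = {
  val writer = ExampleParquetWriter
    .builder(outPath)    // placeholder output path
    .withType(schema)    // schema taken from the source file's footer
    .withConf(conf)
    .build()
  try groups.foreach(g => writer.write(g))
  finally writer.close()
}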
Any ideas?