Suppose you have very large Parquet files from which you want to filter a subset and save it:
val df = spark.read.parquet(inputFileS3Path)
  .select("c1", "c2", "c3")
  .where("c1 = '38940f'")
df.write.parquet(outputFileS3Path)
Does Spark read all the Parquet files into memory first and only then apply the filter? Or is there a way for Spark to, for example, read one batch at a time and keep in memory only the records that satisfy the filter condition?
I am running Spark 2.2 in a Zeppelin notebook, and what seems to be happening is that everything is read into memory and only then filtered, which sometimes makes the job crash (in the Spark Web UI the stage input is > 1 TB, while the output written to S3 is about 1 MB).
Is there a more efficient way to filter these files (whether that means changing the code, the file format, the Spark version, etc.)? I already select only a subset of the columns, but that doesn't seem to be enough.
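For reference, here is a minimal sketch of how the physical plan can be inspected to see whether the predicate actually reaches the Parquet reader (same placeholder column names as above); when pushdown works, the predicate shows up under PushedFilters in the FileScan node:

// Sketch: check whether Spark pushes the filter down to the Parquet scan.
// If pushdown works, the FileScan line of the physical plan lists something
// like "PushedFilters: [IsNotNull(c1), EqualTo(c1,38940f)]".
val df = spark.read.parquet(inputFileS3Path)
  .select("c1", "c2", "c3")
  .where("c1 = '38940f'")

df.explain(true)   // prints the parsed, analyzed, optimized and physical plans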
UPDATE
After further investigation, I noticed that Spark reads everything in when the filter is on a nested field:
val df = spark.read.parquet(inputFileS3Path)
  .select("c1", "c2", "c3")
  .where("c1.a = '38940f'")
df.write.parquet(outputFileS3Path)
I believe this functionality is still not implemented: Parquet filter pushdown does not handle struct fields, so the whole dataset is read (see https://issues.apache.org/jira/browse/SPARK-17636). Do you have any tips besides rewriting all the Parquet files with the nested fields flattened into top-level columns? Is there a way to force the optimizer to build a better plan?
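To make the question concrete, this is the kind of one-off rewrite I would rather avoid: materializing the nested field as a top-level (and optionally partition) column so that later reads can use pushdown or partition pruning. The output path flattenedFileS3Path and the column name c1_a are just placeholders:

import org.apache.spark.sql.functions.col

// Sketch of the rewrite I would rather avoid: materialize the nested field
// c1.a as a top-level column and (optionally) partition by it, so later
// queries filtering on it can prune files instead of scanning everything.
spark.read.parquet(inputFileS3Path)
  .withColumn("c1_a", col("c1.a"))
  .write
  .partitionBy("c1_a")            // optional: enables partition pruning on reads
  .parquet(flattenedFileS3Path)   // placeholder output path

Whether partitioning on c1_a makes sense depends on its cardinality; if it has too many distinct values, keeping it only as a regular top-level column would at least allow ordinary predicate pushdown.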