I'm using Spark 1.2.1 (ancient, I know, but it's what I can use for the moment) and trying to read a Parquet file of about 4.5 GB with Spark SQL, like this (I'll leave out most of the boilerplate):
val schemaRDD: SchemaRDD = parquetFile("data.parquet")
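For context, the setup I'm leaving out looks roughly like this (just a sketch; the app name and variable names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed later for groupByKey/join on 1.2.x
import org.apache.spark.sql.{SQLContext, SchemaRDD}

val conf = new SparkConf().setAppName("parquet-read") // illustrative app name
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext._ // brings parquetFile into scope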
The problem is that in the first stage (the one that reads the Parquet file) Spark creates a single partition instead of generating at least one partition per block. How can I change this behavior? I want, for example, to read that file into 32 partitions. I've also tried the following:
val schemaRDD: SchemaRDD = parquetFile("data.parquet")
val repartitioned = schemaRDD.repartition(32)
// force evaluation with an action
repartitioned.first()
but the stage that reads the Parquet file still runs as a single partition; the 32 partitions only show up after the repartition shuffle.
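This is roughly how I'm checking the partition counts (a quick sketch):

val schemaRDD = parquetFile("data.parquet")
println(schemaRDD.partitions.length)       // 1
val repartitioned = schemaRDD.repartition(32)
println(repartitioned.partitions.length)   // 32, but the read itself is still a single task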
This is the code that I'm running:
val rdd: SchemaRDD = parquetFile("data.parquet")

// Row in 1.2.x is accessed by ordinal: 0 = id, 1 = value1, 2 = value2
val grouped = rdd.map { row =>
  row.getInt(0) -> (row.getInt(1), row.getInt(2))
}.groupByKey()
I then use grouped to run a join against another (key, value) RDD.
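Roughly like this (just a sketch; otherRDD stands in for my real (key, value) RDD, which is built elsewhere):

// stand-in for the real (key, value) RDD I join against
val otherRDD = sc.parallelize(Seq(1 -> "a", 2 -> "b"))

// RDD[(Int, (Iterable[(Int, Int)], String))]
val joined = grouped.join(otherRDD)
joined.count() // just to force the job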