
I'm using Spark 1.2.1 (ancient, I know, but it's what I have to use for the moment) and trying to read a Parquet file of about 4.5 GB with Spark SQL like this (I'll skip the boilerplate):

val schemaRDD: SchemaRDD = parquetFile("data.parquet")

The problem is that in the first stage (the one that reads the Parquet file) it creates a single partition, instead of generating at least one partition per block. How can I change this behavior? I want, for example, to read that file into 32 partitions. I've also tried this:

val schemaRDD: SchemaRDD = parquetFile("data.parquet")
val repartitioned = schemaRDD.repartition(32)

// Just forcing an action so the repartition actually runs
repartitioned.first()

But the Parquet file is still read into a single partition.
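
For reference, this is roughly how I'm checking the partition counts (this check isn't in my original code, it's only to illustrate what I see), using the same names as above:

println(schemaRDD.partitions.length)      // prints 1: the read stage is a single task
println(repartitioned.partitions.length)  // prints 32, but only after a shuffle; the read itself still has 1 partition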

This is the code that I'm running:

val rdd: SchemaRDD = parquetFile("data.parquet")

rdd.map { row =>
  // Row in Spark 1.2 only has positional getters, so the columns (id, value1, value2) are accessed by index
  row.getInt(0) -> (row.getInt(1), row.getInt(2))
}.groupByKey()

I then use the result of this to run a join against another (key, value) RDD.
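
A minimal sketch of that downstream join, assuming the other RDD is keyed by the same Int id (otherRdd and its contents are made up here just to show the shape):

import org.apache.spark.SparkContext._   // pair-RDD implicits needed in Spark 1.2
import org.apache.spark.rdd.RDD

val grouped: RDD[(Int, Iterable[(Int, Int)])] =
  rdd.map(row => row.getInt(0) -> (row.getInt(1), row.getInt(2))).groupByKey()

// otherRdd stands in for the real (key, value) RDD I'm joining against
val otherRdd: RDD[(Int, String)] = sc.parallelize(Seq(1 -> "a", 2 -> "b"))

val joined: RDD[(Int, (Iterable[(Int, Int)], String))] = grouped.join(otherRdd)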
