
I'm using Spark 1.2.1 (ancient, I know, but it's what I have to use for the moment) and trying to read a Parquet file of about 4.5 GB with Spark SQL like this (I'll skip the boilerplate):

val schemaRDD: SchemaRDD = parquetFile("data.parquet")

The problem is that in the first stage (the one that reads the Parquet file) it creates a single partition, instead of generating at least one partition per block. How can I change this behavior? I want, for example, to read that file into 32 partitions. I've also tried this:

val schemaRDD: SchemaRDD = parquetFile("data.parquet")
val repartitioned = schemaRDD.repartition(32)

// Just forcing an action so the repartition actually runs
repartitioned.first()

But the Parquet file is still read into a single partition.
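
For reference, this is roughly how I'm checking the partition counts (this check isn't in my original code, it's only to illustrate what I see), using the same names as above:

println(schemaRDD.partitions.length)      // prints 1: the read stage is a single task
println(repartitioned.partitions.length)  // prints 32, but only after a shuffle; the read itself still has 1 partition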

This is the code that I'm running:

val rdd: SchemaRDD = parquetFile("data.parquet")

rdd.map { row =>
  // Row in Spark 1.2 only has positional getters, so the columns (id, value1, value2) are accessed by index
  row.getInt(0) -> (row.getInt(1), row.getInt(2))
}.groupByKey()

I then use the result of this to run a join against another (key, value) RDD.
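
A minimal sketch of that downstream join, assuming the other RDD is keyed by the same Int id (otherRdd and its contents are made up here just to show the shape):

import org.apache.spark.SparkContext._   // pair-RDD implicits needed in Spark 1.2
import org.apache.spark.rdd.RDD

val grouped: RDD[(Int, Iterable[(Int, Int)])] =
  rdd.map(row => row.getInt(0) -> (row.getInt(1), row.getInt(2))).groupByKey()

// otherRdd stands in for the real (key, value) RDD I'm joining against
val otherRdd: RDD[(Int, String)] = sc.parallelize(Seq(1 -> "a", 2 -> "b"))

val joined: RDD[(Int, (Iterable[(Int, Int)], String))] = grouped.join(otherRdd)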
