
I need to load a Hive table using spark-sql and then run some machine-learning algorithm on it. I do that by writing:

val dataSet = sqlContext.sql(" select * from table")

It works well, but if I want to increase the number of partitions of the dataSet DataFrame, how can I do that? With a normal RDD I can write:

val dataSet = sc.textFile(" .... ", N )

where N is the number of partitions I want to have.

Thanks

Edge07

1 Answer


You can `coalesce` or `repartition` the resulting DataFrame, e.g.:

val dataSet = sqlContext.sql(" select * from table").coalesce(N)
mgaido
  • It is a pretty expensive operation, right? The coalesce overhead should anyway be offset by the faster training step. Thanks – Edge07 Dec 02 '15 at 11:25
  • Yes, it is. It involves transferring all the data among the nodes of the cluster. Another option may be trying to set the `spark.default.parallelism` configuration property, but you have to try, I don't know if it works... – mgaido Dec 02 '15 at 11:35
  • You can also check this link https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/ – Noman Khan Jun 06 '17 at 02:36
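
For completeness, the `spark.default.parallelism` idea from the comments would be set on the SparkConf before the SparkContext is created. A minimal sketch, assuming Spark 1.x, a hypothetical app name, and an arbitrary value of 200; as the comment itself notes, it is not guaranteed to change the partitioning of a DataFrame returned by `sqlContext.sql`:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app name; 200 is an arbitrary example value.
// spark.default.parallelism mainly affects RDD operations such as joins
// and reduceByKey when no parent partitioning is available.
val conf = new SparkConf()
  .setAppName("hive-table-example")
  .set("spark.default.parallelism", "200")
val sc = new SparkContext(conf)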