
I need to load a Hive table using spark-sql and then run some machine-learning algorithm on it. I do that by writing:

val dataSet = sqlContext.sql(" select * from table")

It works well, but if I want to increase the number of partitions of the dataSet DataFrame, how can I do that? With a normal RDD I can write:

val dataSet = sc.textFile(" .... ", N )

where N is the number of partitions I want to have.

Thanks

Edge07

1 Answer


You can `coalesce` or `repartition` the resulting DataFrame, e.g.:

val dataSet = sqlContext.sql(" select * from table").coalesce(N)
mgaido
  • It is a pretty expensive operation, right? The coalesce overhead should anyway be offset by the faster training step. Thanks – Edge07 Dec 02 '15 at 11:25
  • Yes, it is. It involves transferring all the data among the nodes of the cluster. Another option may be trying to set the `spark.default.parallelism` configuration property, but you have to try, I don't know if it works... – mgaido Dec 02 '15 at 11:35
  • You can also check this link https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/ – Noman Khan Jun 06 '17 at 02:36
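
For completeness, the `spark.default.parallelism` idea from the comments would be set on the SparkConf before the SparkContext is created. A minimal sketch, assuming Spark 1.x, a hypothetical app name, and an arbitrary value of 200; as the comment itself notes, it is not guaranteed to change the partitioning of a DataFrame returned by `sqlContext.sql`:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app name; 200 is an arbitrary example value.
// spark.default.parallelism mainly affects RDD operations such as joins
// and reduceByKey when no parent partitioning is available.
val conf = new SparkConf()
  .setAppName("hive-table-example")
  .set("spark.default.parallelism", "200")
val sc = new SparkContext(conf)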