
I am trying to run a group-by query on a dataset of 1.4 million records.
With Hive it takes about 2 minutes, while in Spark it takes ~40 minutes with the same resources.
I am sure I am doing something wrong, because such a difference between Hive and Spark on a simple, basic query makes no sense.
I tried to do it in 2 ways:
1.

// groupBy (capital B) returns a RelationalGroupedDataset, so an aggregation
// such as count() is needed to get back a Dataset<Row>:
Dataset<Row> ds = batchDs
        .select(col("key"), col("ts"))
        .groupBy(col("key"), col("ts"))
        .count();


2.

sparkSession.sql("select key, ts from x group by key, ts")


Both queries take 40 minutes. I know that in this case I could just use distinct, but that is not my real problem.
I am actually trying to use an over-partition (window) query and getting the same bad performance, so I simplified the problem to a more basic operation that is very similar to over partition: group by.
Any ideas? Thank you.
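To pin down exactly what result I expect from the group-by, here is a plain-Java sketch (no Spark involved, and the sample rows are made up): grouping by (key, ts) with no aggregates should just deduplicate the (key, ts) pairs.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class GroupBySemantics {
    public static void main(String[] args) {
        // Made-up sample rows as (key, ts) pairs.
        List<String[]> rows = List.of(
                new String[]{"a", "1"},
                new String[]{"a", "1"},
                new String[]{"b", "2"});

        // "group by key, ts" with no aggregate == distinct over the two columns.
        Set<List<String>> grouped = rows.stream()
                .map(r -> List.of(r[0], r[1]))
                .collect(Collectors.toSet());

        System.out.println(grouped.size()); // 2 distinct (key, ts) pairs
    }
}
```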

user3100708