
I am trying to run a group-by query on a dataset of 1.4 million records.
With Hive it takes about 2 minutes, while in Spark it takes ~40 minutes with the same resources.
I am sure I am doing something wrong, because such a difference between Hive and Spark on a simple, basic query makes no sense.
I tried to do it in 2 ways:
1.

// groupBy (capital B) returns a RelationalGroupedDataset, so an aggregation
// such as count() is needed to get back a Dataset<Row>:
Dataset<Row> ds = batchDs
        .select(col("key"), col("ts"))
        .groupBy(col("key"), col("ts"))
        .count();


2.

sparkSession.sql("select key, ts from x group by key, ts")


Both queries take 40 minutes. I know that in this case I could just use distinct, but that is not my real problem.
I am actually trying to use an over-partition (window) query and getting the same bad performance, so I simplified the problem to a more basic operation that is very similar to over partition: group by.
Any ideas? Thank you.
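To pin down exactly what result I expect from the group-by, here is a plain-Java sketch (no Spark involved, and the sample rows are made up): grouping by (key, ts) with no aggregates should just deduplicate the (key, ts) pairs.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class GroupBySemantics {
    public static void main(String[] args) {
        // Made-up sample rows as (key, ts) pairs.
        List<String[]> rows = List.of(
                new String[]{"a", "1"},
                new String[]{"a", "1"},
                new String[]{"b", "2"});

        // "group by key, ts" with no aggregate == distinct over the two columns.
        Set<List<String>> grouped = rows.stream()
                .map(r -> List.of(r[0], r[1]))
                .collect(Collectors.toSet());

        System.out.println(grouped.size()); // 2 distinct (key, ts) pairs
    }
}
```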

user3100708