
I have a large Hive table (~9 billion records, ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but every operation on it takes too long. Just a count on the input DataFrame takes ~11 minutes to complete, and min, max and avg on any single column take more than one and a half hours.

I am working on a limited-resource cluster (as it is the only one available): a total of 9 executors, each with 2 cores and 5 GB of memory, spread over 3 physical nodes.
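
For reference, this is roughly how I set up the job (a sketch; the master host name is a placeholder, the numbers mirror the description above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// 9 executors x 2 cores x 5 GB on a standalone master (host name hypothetical)
val conf = new SparkConf()
  .setAppName("table-profiling")
  .setMaster("spark://master-host:7077")
  .set("spark.executor.memory", "5g")   // 5 GB per executor
  .set("spark.executor.cores", "2")     // 2 cores per executor
  .set("spark.cores.max", "18")         // 9 executors x 2 cores in total
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)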

Is there any way to optimise this, say bring the time to run all the aggregate functions on each column down to under 30 minutes at least with the same cluster, or is bumping up my resources the only way? I am personally not very keen on that. One solution I came across to speed up DataFrame operations is to cache them, but I don't think it's a feasible option in my case.
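
What I mean by caching is roughly the following (a sketch; the table name is a placeholder and hiveContext is the HiveContext from above). With only ~45 GB of total executor memory, most of the cached data spills to disk, which is why I doubt it helps here:

import org.apache.spark.storage.StorageLevel

// Cache the table once and reuse it for every aggregate query.
val df = hiveContext.table("my_table")           // table name hypothetical
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()                                       // first action materialises the cache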

All the real world scenarios I came across use huge clusters for this kind of load.

Any help is appreciated. I am using Spark 1.6.0 in standalone mode with the Kryo serializer.

aladeen
  • Can you add some code and the config you use to run the job? How is the table partitioned, if at all? – morfious902002 Jul 10 '17 at 14:19
  • @morfious902002 I have already mentioned my environment. And I run dataframe.count() or hiveContext.sql("select max(column), min(column), avg(column) from table"). The Hive table is not partitioned, but it is bucketed on a column that contains the parameter name. – aladeen Jul 10 '17 at 14:39
  • Try 3 executors with 3-5 cores each and 15 GB memory. For cache you could try .persist(StorageLevel.MEMORY_AND_DISK_SER()) – morfious902002 Jul 10 '17 at 14:58
  • @morfious902002 I had already tried persisting with MEMORY_AND_DISK_SER, which itself takes around 90 minutes, and the majority goes to disk since the total RAM is only 45 GB, so it doesn't help much. It does help with smaller amounts of data that fit completely into memory, though. I will try combining executors into bigger ones, thank you, but could you please explain how that helps? – aladeen Jul 11 '17 at 05:05
  • You have too many small executors compared to physical machines. This in turn causes more overhead. Also, try to filter and select only the columns you need (if not done already) before using the functions. – morfious902002 Jul 11 '17 at 13:28
  • @morfious902002 I did combine my executors and it doesn't help much. My table has only 4 columns, all of which are needed, and when running the aggregation query I run it on only one column at a time. – aladeen Jul 11 '17 at 14:13

1 Answer


There are some cool features in Spark SQL, like:

Cluster By / Distribute By / Sort By

Spark allows you to write queries in a SQL-like language, HiveQL. HiveQL lets you control the partitioning of data, and we can use the same clauses in Spark SQL queries as well.

Distribute By

In Spark, when a DataFrame is distributed by some expression, all the rows for which that expression evaluates to the same value end up in the same partition.

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key

So, here is how it works. Given an initial partitioning like:

par1: [(1,c), (3,b)]
par2: [(3,c), (1,b), (3,d)]
par3: [(3,a),(2,a)]

This will transform into:

par1: [(1,c), (3,b), (3,c), (1,b), (3,d), (3,a)]
par2: [(2,a)]
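
If you are on the DataFrame API rather than SQL, the rough equivalent in Spark 1.6 (a sketch, assuming df is an existing DataFrame with a column named key) is to repartition by that expression:

// Hash-partition df into 2 partitions by "key"; rows with equal keys
// land in the same partition, like DISTRIBUTE BY above.
val distributed = df.repartition(2, df("key"))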

Sort By

SELECT * FROM df SORT BY key

For the same data it will look like:

par1: [(1,c), (1,b), (3,b), (3,c), (3,d), (3,a)]
par2: [(2,a)]
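
The DataFrame-side counterpart in 1.6 (a sketch, under the same df assumption) is sortWithinPartitions, which sorts inside each partition without a global sort:

// Sort rows by "key" inside each partition only (no total ordering),
// which is what SORT BY gives you.
val sorted = df.sortWithinPartitions("key")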

Cluster By

This is a shortcut for using Distribute By and Sort By together on the same set of expressions.

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
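
In DataFrame terms (same assumptions as above), that is just the two calls chained:

// DISTRIBUTE BY key + SORT BY key in one go, i.e. CLUSTER BY key.
val clustered = df.repartition(2, df("key")).sortWithinPartitions("key")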

Note: This is only basic information. Let me know if it helps; otherwise there are various other methods we can use to optimize your Spark jobs and queries, depending on the situation and settings.

arglee