1

I am trying to query huge data from postgres using SparkSQl. I do see 100 partionion on the query stage , however , there is only one query running and only one executer executing.

Code :

df = sqlcontext.read.format('jdbc').options(url=params['url']
                                            ,driver=params["driver"]
                                            ,dbtable=tableName
                                            ,user=params['user']
                                            ,password=params['password']
                                            ,numPartitions=numberOfPartitions
                                            ,partitionColumn=partitionC
                                            ,lowerBound=lowerB
                                            ,upperBound=upperB).load()

The partitionC is of type date, I even tried a similar thing on a numeric column. I also made sure that the data is well balanced.

enter image description here

enter image description here

How to make spark execute multiple queries to postgres?

already referred to SparkSQL PostgresQL Dataframe partitions

Neha Jirafe
  • 741
  • 5
  • 14

2 Answers2

2

Truns out you need to have a Index on the column in Database on which you are partitioning the data.

So the following implemnetation is quite valid.

df = sqlcontext.read.format('jdbc').options(url=params['url']
                                            ,driver=params["driver"]
                                            ,dbtable=tableName
                                            ,user=params['user']
                                            ,password=params['password']
                                            ,numPartitions=numberOfPartitions
                                            ,partitionColumn=partitionC
                                            ,lowerBound=lowerB
                                            ,upperBound=upperB).load()

The partitionC is of type numeric as a timestamp, with index in the databse

Neha Jirafe
  • 741
  • 5
  • 14
0

Please provide numeric column as the partition column.

Provide the lower value and the upper value of the numeric column.

Please keep number of partions as 10. If there are to many database might clog.

loneStar
  • 3,780
  • 23
  • 40