4

We have around 40 million records in the table.

How do I choose the values of lowerBound, upperBound, and numPartitions while reading data from Oracle with Spark 1.6.2?

I have a partition key. We have a DataStax cluster with 3 nodes, 18 cores each, and 27 GB of RAM each.

Thanks in advance.


2 Answers

4

The simplest heuristic is to calculate the min and max of the partitioning column. In plain SQL (normal JDBC) you can run:

select min(column) as lowerBound, max(column) as upperBound from table;

Then set lowerBound and upperBound to the results of that query.
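For example, here is a minimal sketch of that approach. The JDBC URL, credentials, and the table and column names are placeholders you must replace, and spark is assumed to be a SparkSession (on Spark 1.6.2 you would start from sqlContext.read instead):

import java.util.Properties

// Placeholder connection details - replace with your own Oracle URL and credentials.
val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/service"
val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

// 1. Fetch the bounds with a one-row query (Oracle subqueries take an alias without AS).
val bounds = spark.read
  .jdbc(jdbcUrl, "(select min(column) as lowerBound, max(column) as upperBound from table) b", props)
  .first()
val lowerBound = bounds.getAs[Number](0).longValue
val upperBound = bounds.getAs[Number](1).longValue

// 2. Use the bounds for a partitioned read; numPartitions is discussed below.
val df = spark.read.jdbc(
  jdbcUrl,
  "table",
  columnName = "column",
  lowerBound = lowerBound,
  upperBound = upperBound,
  numPartitions = 54,
  connectionProperties = props)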

But it's not always so simple. Sometimes the column has a very skewed distribution - e.g., SomeFunnyID may have 100 values in the range [0..100] and then 100,000,000 values higher than 100,000. Then your read will be very skewed and, because of that, slow.

In such cases I recommend setting the dbtable parameter to select mod(s.someFunnyID, partitionCount) as partition_key, s.* from table s, then setting lowerBound to 0 and upperBound (and numPartitions) to partitionCount:

val partitionCount = 54 // choose the partition count (heuristic below)
val df = spark.read
    .format("jdbc")
    .option("url", jdbcUrl) // your Oracle JDBC URL (plus user/password options)
    // the subquery must be wrapped in parentheses and given an alias
    .option("dbtable", s"(select mod(s.someFunnyID, $partitionCount) as partition_key, s.* from table s) t")
    .option("partitionColumn", "partition_key")
    .option("lowerBound", "0")
    .option("upperBound", partitionCount.toString)
    .option("numPartitions", partitionCount.toString)
    .load()

For partitionCount, I have only one simple heuristic: number of executors * executor cores.
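Applied to the cluster from the question (3 nodes, 18 cores each), and assuming one executor per node, that heuristic gives:

val partitionCount = 3 * 18 // 3 executors * 18 cores = 54 partitions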

Be aware that you must always benchmark your configuration to see if it's correct in your case.
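One simple way to do that benchmarking, assuming the df built above: force a full read with an action, time it, and watch the Spark UI for straggler tasks.

// Time a full scan; rerun with different partitionCount / bounds and compare.
val start = System.nanoTime()
val rows = df.count() // forces every partition to be read from Oracle
println(s"Read $rows rows in ${(System.nanoTime() - start) / 1e9} s")
// In the Spark UI, one task running much longer than the rest means the
// partition key is still skewed.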

T. Gawęda
  • Thanks for the prompt reply. Can you please suggest how to choose partitionCount and the partitioning for ~40 million records? I see partition_key is not used anywhere. – Chandra Feb 01 '18 at 17:36
  • @Chandra Sorry, my mistake - it's used in the partitionColumn. If you have 3 nodes with 18 cores each, I suggest 3*18 as the partitionCount. If you have a lower number of cores per executor, then use that number instead :) – T. Gawęda Feb 01 '18 at 18:05
  • Looks like we need to set partitionCount for numPartitions, but we are setting it for upperBound. I applied CAST(mod(fa_id, $partitionCount) AS INT) as partition_key with partitionCount given as 10. I see 10 tasks running, but one task is taking more time. Looks like the data is not distributed equally across tasks. Can you suggest something? – Chandra Feb 04 '18 at 05:44
-1

@Chandra, you might try adding .repartition(partitionCount) after the .load(). This change doesn't help with the initial load reflecting the skewed data, but it can make many subsequent operations against the DataFrame far more efficient, because the data should no longer be skewed.
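A minimal sketch of that suggestion, reusing the df and partitionCount from the accepted answer:

// Shuffle the skewed input partitions into partitionCount evenly sized ones.
val balanced = df.repartition(partitionCount)
balanced.cache() // optional: only worth it if the DataFrame is reused several times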

  • @T. Gawęda, I am trying to load from Oracle. I have given it like this: ``` var local_ora_df: DataFrameReader = ora_df; local_ora_df.option("partitionColumn", "FISCAL_YEAR") local_ora_df.option("schema", schema).option("dbtable", query).load() ``` Even though there are 11 years of data, it comes back as only one partition. How do I get the data partitioned enough? – BdEngineer Feb 20 '19 at 04:26
  • @T. Gawęda, any help please? https://stackoverflow.com/questions/55965978/how-to-set-jdbc-partitioncolumn-type-to-date-in-spark-2-4-1/55966481#55966481 – BdEngineer May 03 '19 at 12:18