The simplest heuristic is to take the min and max of the column values. In SQL (over a normal JDBC connection) you can run:
select min(column) as lowerBound, max(column) as upperBound from table;
and then set lowerBound and upperBound to the results of that query.
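If you'd rather not leave Spark to run that query, here is a minimal sketch of fetching the bounds through Spark's JDBC reader and feeding them into the partitioned read. The jdbcUrl, the column name id, the table name table, and the numPartitions value of 8 are all placeholders, and the code assumes the column is numeric:

val bounds = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)  // placeholder: your JDBC connection string
  .option("dbtable", "(select min(id) as lowerBound, max(id) as upperBound from table) b")
  .load()
  .first()                 // a single row holding the two bounds

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "table")
  .option("partitionColumn", "id")
  .option("lowerBound", bounds.get(0).toString)  // min(id)
  .option("upperBound", bounds.get(1).toString)  // max(id)
  .option("numPartitions", "8")                  // placeholder; see the heuristic below
  .load()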
But it's not always that simple. Sometimes the column has a very skewed distribution: for example, someFunnyID may have 100 values in the range [0..100] and then 100,000,000 values above 100,000. Your reads will then be very skewed and, because of that, slower.
In such cases I recommend setting the dbtable parameter to a subquery like select mod(s.someFunnyID, partitionCount) as partition_key, s.* from table s, and then setting lowerBound to 0 and upperBound to partitionCount:
val partitionCount = 10  // placeholder; choose a partition count (see the heuristic below)
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)  // placeholder: your JDBC connection string
  .option("dbtable", s"(select mod(s.someFunnyID, $partitionCount) as partition_key, s.* from table s) t")
  .option("partitionColumn", "partition_key")
  .option("lowerBound", "0")
  .option("upperBound", partitionCount.toString)
  .option("numPartitions", partitionCount.toString)  // required whenever partitionColumn is set
  .load()
For partitionCount, I have only one simple heuristic: number of executors * executor cores
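A minimal sketch of that heuristic, assuming the SparkSession is already running on the cluster (on YARN/standalone, defaultParallelism is usually the total number of executor cores, which matches executors * cores per executor):

// Derive the partition count from the cluster's total core count.
val partitionCount = spark.sparkContext.defaultParallelism
// Or, if both settings are present in your config, compute it explicitly:
// val partitionCount = spark.conf.get("spark.executor.instances").toInt *
//                      spark.conf.get("spark.executor.cores").toInt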
Be aware that you must benchmark your configuration each time to see whether it's correct for your case.