           jdbc(String url,
                     String table,
                     String columnName,
                     long lowerBound,
                     long upperBound,
                     int numPartitions,
                     java.util.Properties connectionProperties)

Hello, I want to import a few tables from Oracle to HDFS using Spark JDBC connectivity. To ensure parallelism, I want to choose the correct upperBound for each table. I am planning to use row_number as my partition column and the count of the table as the upperBound. Is there a better way to choose the upperBound, since I have to connect to the table first just to get the count? Please help.
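Roughly what I have in mind (the table name, connection URL, credentials and partition count below are only placeholders, and I assume `spark` is an existing SparkSession):

    // Sketch of the row_number-based plan: one trip for the count, a second for the data
    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"          // placeholder URL
    val props = new java.util.Properties()
    props.setProperty("user", "my_user")                      // placeholder credentials
    props.setProperty("password", "my_password")

    // First connection: fetch the row count to use as upperBound
    val rowCount = spark.read
      .jdbc(url, "(SELECT COUNT(*) AS CNT FROM MY_TABLE) c", props)
      .selectExpr("CAST(CNT AS BIGINT)")
      .first().getLong(0)

    // Second connection: partitioned read using ROW_NUMBER() as the partition column
    val withRn = "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY ROWID) AS RN FROM MY_TABLE t) s"
    val df = spark.read.jdbc(url, withRn, "RN", 1L, rowCount, 8, props)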

Akhil
  • What is the `row_number` here? Is it precomputed, or will it be in the Oracle query? If it is in the query, then I don't think it will be performant; it will put a huge overhead on the DB. – Sanket9394 Jul 08 '21 at 19:49
  • It is a row_number in the Spark SQL query. – Akhil Jul 09 '21 at 06:23

1 Answer


Generally, the better way to use partitioning in Spark JDBC is:

  • Choose a numeric or date type column.
  • Set the upper bound to the maximum value of the column.
  • Set the lower bound to the minimum value of the column.

(If there is skew, there are other ways to handle it; generally this approach is good enough.)

Obviously the above requires some querying and handling:

  • Keep a mapping of table → partition column (probably in an external store).
  • Query and fetch the min and max (see the sketch just after this list).
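
A minimal sketch of those two steps, assuming `spark` is an existing SparkSession and using placeholder table, column and connection names:

    // 1 small query: fetch min/max of the numeric partition column
    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"          // placeholder URL
    val props = new java.util.Properties()
    props.setProperty("user", "my_user")                      // placeholder credentials
    props.setProperty("password", "my_password")

    val bounds = spark.read
      .jdbc(url, "(SELECT MIN(ID) AS LO, MAX(ID) AS HI FROM MY_TABLE) b", props)
      .selectExpr("CAST(LO AS BIGINT)", "CAST(HI AS BIGINT)")
      .first()

    // Direct partitioned read: 8 partitions, each scanning its own range of ID values
    val df = spark.read.jdbc(
      url, "MY_TABLE", "ID",
      bounds.getLong(0), bounds.getLong(1),
      8, props)

Each partition then issues its own `WHERE ID >= ... AND ID < ...` query, so the executors read their slices in parallel.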

Another tip for skipping the query: if you can find a date-based column, you can probably use upper = today's date and lower = some date back in the 2000s. But again, that is subject to your data; the assumed values might not hold true.
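
For example, a sketch assuming Spark 2.4 or later (where the option-based JDBC reader accepts date/timestamp partition columns) and placeholder table/column/connection names:

    // No lookup query at all: guess the bounds from a date column
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")    // placeholder URL
      .option("dbtable", "MY_TABLE")                            // placeholder table
      .option("user", "my_user")
      .option("password", "my_password")
      .option("partitionColumn", "CREATED_DATE")                // assumed date column
      .option("lowerBound", "2000-01-01")                       // guessed earliest date
      .option("upperBound", java.time.LocalDate.now().toString) // today's date
      .option("numPartitions", "8")
      .load()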

From your question I believe you are looking for something generic which you can easily apply to all tables. I understand that's the desired state, but if it were as straightforward as using row_number in the DB, Spark would already be doing that by default.

Such functions may technically work, but they will definitely be slower than the above steps, and they put extra load on your database.

Sanket9394
  • "Set the upper bound to the maximum value of the column" – for this, I first need to connect to the DB through JDBC, right? Which causes slowness. – Akhil Jul 09 '21 at 06:21
  • Yes, but you would anyway have to compute `row_number`, which is definitely slower than `max(column)`. – Sanket9394 Jul 09 '21 at 07:00
  • But if it's all strings, then yeah, you don't have any choice other than depending on a function's value. – Sanket9394 Jul 09 '21 at 07:02
  • So if we have to use `max(column)`, we connect to the DB with spark.read.jdbc, right? So are we doing spark.read.jdbc two times? – Akhil Jul 09 '21 at 10:00
  • There may be two connections, but look at the complexity: `1 small query + direct partition query` vs `a single query which computes row_number first and then derives the partition query`. You can always give it a practical try on a large table and compare the results. – Sanket9394 Jul 09 '21 at 12:33
  • Please consider upvoting/accepting if it was useful to you. @Akhil – Sanket9394 Jul 10 '21 at 07:31