
When I run the Spark application for table synchronization, the error message is as follows:

19/10/16 01:37:40 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 51)
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
    at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:590)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:57)
    at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:1606)
    at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:633)
    at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:347)
    at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:219)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I think this is caused by the large amount of data in the table. I have previously used the Mongo partitioning parameters, such as spark.mongodb.input.partitioner and spark.mongodb.input.partitionerOptions.partitionSizeMB. I want to know whether Spark has similar parameters for partitioning when reading from an RDBMS via JDBC.


1 Answer


Below are the parameters, along with their descriptions, that can be used when reading an RDBMS table with the Spark JDBC data source.

partitionColumn, lowerBound, upperBound - These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are only used to decide the partition stride, not to filter the rows in the table, so all rows in the table will be partitioned and returned. This option applies only to reading.
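
For intuition, here is a rough, illustrative sketch (not Spark's actual source code) of how lowerBound, upperBound and numPartitions translate into per-partition predicates on the partition column; the column name id below is just a placeholder:

    // Illustrative only: approximates how Spark splits the range into numPartitions strides.
    val lowerBound = 10L
    val upperBound = 10000L
    val numPartitions = 10
    val stride = (upperBound - lowerBound) / numPartitions   // 999 here

    val predicates = (0 until numPartitions).map { i =>
      val start = lowerBound + i * stride
      val end   = start + stride
      if (i == 0) s"id < $end OR id IS NULL"                 // rows below lowerBound also land here
      else if (i == numPartitions - 1) s"id >= $start"       // rows above upperBound land here
      else s"id >= $start AND id < $end"
    }
    predicates.foreach(println)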

numPartitions - The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, Spark decreases it to this limit by calling coalesce(numPartitions) before writing.
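
As a side note, numPartitions is honoured on the write path as well. A minimal write-side sketch (assuming an existing DataFrame df and the same driver/url/user/password variables) might look like this:

    // Spark coalesces the DataFrame down to at most numPartitions before writing,
    // so at most that many concurrent JDBC connections are opened.
    df.write.format("jdbc").
      option("driver", driver).
      option("url", url).
      option("dbtable", "target_table").
      option("numPartitions", 10).
      option("user", user).
      option("password", password).
      mode("append").
      save()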

fetchsize - The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows). This option applies only to reading.

Please note that all of the above parameters should be used together. Below is an example:

  val df = spark.read.format("jdbc").
      option("driver", driver).
      option("url", url).
      option("partitionColumn", partitionColumn).   // a numeric, date, or timestamp column
      option("lowerBound", 10).
      option("upperBound", 10000).
      option("numPartitions", 10).
      option("fetchsize", 1000).
      option("dbtable", query).
      option("user", user).
      option("password", password).
      load()