I have been trying to replicate a couple of huge tables from an Oracle DB into HDFS. I use PySpark with JDBC to read the tables from the source and save them as Hive partitioned tables.
I have already managed to replicate and save these tables to HDFS, going straight from the JDBC read into a Hive partitioned table.
The problem with that approach is that it creates tons of small files in each HDFS partition. So, in an attempt to avoid this, I am trying to repartition the data read from JDBC before writing it to HDFS, doing something like:
# key_col, min_val and max_val hold the numeric split column and its
# bounds, computed earlier from the source table
partition_cols = ["col1", "col2"]

df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT * FROM table) T") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("user", "user") \
    .option("password", "password") \
    .option("numPartitions", 128) \
    .option("fetchsize", 32000) \
    .option("partitionColumn", key_col) \
    .option("lowerBound", min_val) \
    .option("upperBound", max_val) \
    .load()

# shuffle by the Hive partition columns before writing
df = df.repartition(*partition_cols)

df.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy(*partition_cols) \
    .saveAsTable("some_table")
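For completeness, repartition also accepts an explicit partition count in front of the columns; a minimal sketch of that variant (the count of 64 is an arbitrary value for illustration, not something I have tuned):

# same shuffle, but with an explicit partition count instead of the
# default spark.sql.shuffle.partitions; rows sharing the same
# (col1, col2) values still land in a single task, so each Hive
# partition directory is written by one task
df = df.repartition(64, *partition_cols)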
When I run that, I get the following error:
org.apache.spark.shuffle.FetchFailedException: Read error or truncated source
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:470)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Read error or truncated source
at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:102)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:364)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:351)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:351)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:372)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:462)
... 26 more
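The Caused by block points at com.github.luben.zstd.ZstdInputStream, so the failure seems to happen while decompressing Zstd-compressed shuffle blocks. A minimal sketch of how the shuffle codec could be switched when building the session, in case that helps isolate the problem (the app name and the lz4 choice are assumptions on my part, not a confirmed fix):

from pyspark.sql import SparkSession

# spark.io.compression.codec controls the codec used for shuffle data;
# it has to be set before the SparkContext is created, so it goes on
# the builder ("oracle_to_hive" is a hypothetical app name)
spark = SparkSession.builder \
    .appName("oracle_to_hive") \
    .config("spark.io.compression.codec", "lz4") \
    .getOrCreate()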
Any ideas on why this error happens would be welcome; so far I have not been able to find any useful information about this issue.
- Spark 2.4.0
- Oracle JDBC driver (ojdbc8)
- Python 2.7
- Hive 2.1.1
- Hadoop 3.0.0