I have been trying to replicate a couple of huge tables from an Oracle DB into HDFS. I use PySpark with JDBC to read the tables from the source and save them as Hive partitioned tables.
I have already managed to replicate and save these tables to HDFS, going straight from the JDBC read into a Hive partitioned table.
The problem with that approach is that it creates tons of small files in each HDFS partition. So, in an attempt to avoid this, I am trying to repartition the data read from JDBC before writing it to HDFS, doing something like:
# key_col, min_val and max_val hold the numeric split column and its
# bounds, computed earlier from the source table
partition_cols = ["col1", "col2"]

df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT * FROM table) T") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("user", "user") \
    .option("password", "password") \
    .option("numPartitions", 128) \
    .option("fetchsize", 32000) \
    .option("partitionColumn", key_col) \
    .option("lowerBound", min_val) \
    .option("upperBound", max_val) \
    .load()

# shuffle by the Hive partition columns before writing
df = df.repartition(*partition_cols)

df.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy(*partition_cols) \
    .saveAsTable("some_table")
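For completeness, repartition also accepts an explicit partition count in front of the columns; a minimal sketch of that variant (the count of 64 is an arbitrary value for illustration, not something I have tuned):

# same shuffle, but with an explicit partition count instead of the
# default spark.sql.shuffle.partitions; rows sharing the same
# (col1, col2) values still land in a single task, so each Hive
# partition directory is written by one task
df = df.repartition(64, *partition_cols)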
When I run that, I get the following error:
org.apache.spark.shuffle.FetchFailedException: Read error or truncated source
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:470)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Read error or truncated source
at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:102)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:364)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:351)
at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:351)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:372)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:462)
... 26 more
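The Caused by block points at com.github.luben.zstd.ZstdInputStream, so the failure seems to happen while decompressing Zstd-compressed shuffle blocks. A minimal sketch of how the shuffle codec could be switched when building the session, in case that helps isolate the problem (the app name and the lz4 choice are assumptions on my part, not a confirmed fix):

from pyspark.sql import SparkSession

# spark.io.compression.codec controls the codec used for shuffle data;
# it has to be set before the SparkContext is created, so it goes on
# the builder ("oracle_to_hive" is a hypothetical app name)
spark = SparkSession.builder \
    .appName("oracle_to_hive") \
    .config("spark.io.compression.codec", "lz4") \
    .getOrCreate()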
Any ideas on why this error happens would be welcome; so far I have not been able to find any useful information about this issue.
- Spark 2.4.0
- Oracle JDBC driver (ojdbc8)
- Python 2.7
- Hive 2.1.1
- Hadoop 3.0.0