Is there a way to efficiently ingest large (e.g. 50 GB) bz2 files in Spark? I'm using Spark 1.6.1 with 8 executors, each with 30 GB of RAM. Initially, each executor had 4 cores.

However, opening bz2 files with textFile() throws ArrayIndexOutOfBoundsException. As reported in this thread (and others across the web) http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ArrayIndexOutOfBoundsException-using-sc-textFile-on-BZ2-compressed-files-td22905.html, the bz2 decompressor that Hadoop uses isn't thread safe, which causes problems in a multi-threaded environment like Spark. To get around this, I set the number of cores per executor to 1, as suggested in the thread above, but that slows down the overall computation.
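For reference, here is roughly what the job looks like at the moment; the HDFS path and the minimum partition count below are placeholders, not the real values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Current setup (sketch): one core per executor so only a single task
    // per JVM touches the non-thread-safe bz2 decompressor at a time.
    val conf = new SparkConf()
      .setAppName("bz2-ingest")
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "30g")
    val sc = new SparkContext(conf)

    // bz2 is block-splittable, so the 50 GB file is still read as many partitions.
    val lines = sc.textFile("hdfs:///data/big-file.bz2", 400)  // placeholder path and partition count
    println(lines.count())

    sc.stop()

In practice I pass the same settings through spark-submit (--executor-cores 1, --num-executors 8, --executor-memory 30g); the snippet is just to show which knobs I'm turning.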
I'm using Hadoop 2.4.0.2.1.1.0-390. Any ideas on how to handle this more efficiently?
Thanks,
Marco