Is there a way to efficiently ingest large (e.g. 50 GB) bz2 files in Spark? I'm using Spark 1.6.1 with 8 executors, each with 30 GB of RAM. Initially, each executor had 4 cores.

However, opening bz2 files with textFile() throws ArrayIndexOutOfBoundsException. As reported in this thread (and others across the web) http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ArrayIndexOutOfBoundsException-using-sc-textFile-on-BZ2-compressed-files-td22905.html, the bz2 decompressor that Hadoop uses isn't thread safe, which causes problems in a multi-threaded environment like Spark. To get around this, I set the number of cores per executor to 1, as suggested in the thread above, but that slows down the overall computation.
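For reference, here is roughly what the job looks like at the moment; the HDFS path and the minimum partition count below are placeholders, not the real values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Current setup (sketch): one core per executor so only a single task
    // per JVM touches the non-thread-safe bz2 decompressor at a time.
    val conf = new SparkConf()
      .setAppName("bz2-ingest")
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "30g")
    val sc = new SparkContext(conf)

    // bz2 is block-splittable, so the 50 GB file is still read as many partitions.
    val lines = sc.textFile("hdfs:///data/big-file.bz2", 400)  // placeholder path and partition count
    println(lines.count())

    sc.stop()

In practice I pass the same settings through spark-submit (--executor-cores 1, --num-executors 8, --executor-memory 30g); the snippet is just to show which knobs I'm turning.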
I'm using Hadoop 2.4.0.2.1.1.0-390. Any ideas on how to handle this more efficiently?
Thanks,
Marco