I currently have a problem with Spark reading bz2 files. I'm using Spark 1.2.0 (prebuilt for Hadoop 2.4, but the files are currently only read locally). For testing there are ~1500 files, each about 50 KB in size.
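For completeness, test data of that shape can be generated with something along these lines (a minimal sketch; file names, line contents, and the loop counts are placeholders, only the .log.bz2 naming, the ~1500 file count, and the ~50 KB size come from the setup above):

import bz2
import os

out_dir = '/files/bzipped'          # directory read by the Spark script below
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

line = 'dummy log line with some payload text\n'
for i in range(1500):               # ~1500 test files
    path = os.path.join(out_dir, 'test_%04d.log.bz2' % i)
    f = bz2.BZ2File(path, 'w')
    try:
        for _ in range(1200):       # ~50 KB of uncompressed text per file
            f.write(line)
    finally:
        f.close()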
The following script count_log_lines.py illustrates the problem:
from pyspark import SparkConf, SparkContext

spark_conf = SparkConf().setAppName("SparkTest")
sc = SparkContext(conf=spark_conf)

# read all bzipped log files and count the total number of lines
overall_log_lines = sc.textFile('/files/bzipped/*.log.bz2')
line_count = overall_log_lines.count()
print line_count
Running the script locally on one core works as expected:
spark/bin/spark-submit --master local[1] count_log_lines.py
Running the script on 2 cores using
spark/bin/spark-submit --master local[2] count_log_lines.py
fails with error messages from the Hadoop bzip2 library, such as
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 60 in stage 0.0 failed 1 times, most recent failure: Lost task 60.0 in stage 0.0 (TID 60, localhost): java.io.IOException: unexpected end of stream
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.bsGetBit(CBZip2InputStream.java:626)
When I decompress the files beforehand and read the uncompressed log files instead of the bzipped ones, i.e. sc.textFile('/files/unzipped/*.log'), the script works as expected, even on multiple cores.
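For reference, the decompression step is nothing special, roughly this (a minimal sketch using Python's bz2 module; the /files/unzipped target directory and the file naming are assumptions based on the paths above):

import bz2
import glob
import os

dst_dir = '/files/unzipped'
if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)

for src in glob.glob('/files/bzipped/*.log.bz2'):
    # strip the '.bz2' suffix, keeping the '.log' name
    dst = os.path.join(dst_dir, os.path.basename(src)[:-4])
    out = open(dst, 'wb')
    try:
        out.write(bz2.BZ2File(src).read())
    finally:
        out.close()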
My question: What is going wrong here? Why doesn't the Spark job read the bz2 files properly when running on more than one core?
Thanks for your help!