I currently have a problem with Spark reading bz2 files. I'm using Spark 1.2.0 (prebuilt for Hadoop 2.4, but the files are currently only read locally). For testing there are ~1500 files, each about 50 KB in size.

The following script, count_log_lines.py, illustrates the problem:

 from pyspark import SparkConf, SparkContext
 spark_conf = SparkConf().setAppName("SparkTest")
 sc = SparkContext(conf=spark_conf)

 # read all bzipped log files and count the total number of lines
 overall_log_lines = sc.textFile('/files/bzipped/*.log.bz2')
 line_count = overall_log_lines.count()
 print line_count

Running the script locally on one core works as expected:

spark/bin/spark-submit --master local[1] count_log_lines.py

Running the script on 2 cores using

spark/bin/spark-submit --master local[2] count_log_lines.py

fails with error messages from the Hadoop bzip2 library, such as:

 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 60 in stage 0.0 failed 1 times, most recent failure: Lost task 60.0 in stage 0.0 (TID 60, localhost): java.io.IOException: unexpected end of stream
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.bsGetBit(CBZip2InputStream.java:626)

When I decompress the files beforehand and read the uncompressed log files instead of the bzipped ones, i.e. sc.textFile('/files/unzipped/*.log'), the script works as expected, also on multiple cores.
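
For reference, the decompression step beforehand is nothing Spark-specific; a minimal sketch in plain Python, assuming the same /files/bzipped and /files/unzipped paths as above:

 import bz2
 import glob
 import os

 # decompress every .log.bz2 from /files/bzipped into /files/unzipped
 for src in glob.glob('/files/bzipped/*.log.bz2'):
     dst = os.path.join('/files/unzipped', os.path.basename(src)[:-len('.bz2')])
     with bz2.BZ2File(src) as f_in, open(dst, 'wb') as f_out:
         f_out.write(f_in.read())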

My question: What's wrong here? Why doesn't the Spark job read the bz2 files properly when running on more than one core?

Thank you for your help!

siggi_42

1 Answer

I'm not really sure whether textFile supports bz2 files.

You might have a look at the PySpark newAPIHadoopFile or hadoopFile APIs. If the split bz2 files contain text (such as logs), you can use:

 stdout = sc.newAPIHadoopFile(
     path="/HDFSpath/to/folder/containing/bz2/",
     inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
     keyClass="org.apache.hadoop.io.LongWritable",
     valueClass="org.apache.hadoop.io.Text",
     keyConverter=None, valueConverter=None, conf=None, batchSize=5)
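
If I'm not mistaken, counting the lines is then just a matter of taking the values of the resulting pair RDD (with TextInputFormat the key should be the byte offset and the value the line text), roughly:

 # sketch: the values of the (offset, line) pairs are the log lines themselves
 log_lines = stdout.values()
 print log_lines.count()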

Source: http://spark.apache.org/docs/1.2.0/api/python/pyspark.html

hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)

Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile.

A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.

Parameters:

 path – path to Hadoop file
 inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
 keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
 valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
 keyConverter – (None by default)
 valueConverter – (None by default)
 conf – Hadoop configuration, passed in as a dict (None by default)
 batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
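
For the ‘old’ API just quoted, the call should look roughly the same; an untested sketch with the same placeholder path as above:

 # untested sketch using the 'old' mapred TextInputFormat
 old_lines = sc.hadoopFile(
     path="/HDFSpath/to/folder/containing/bz2/",
     inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
     keyClass="org.apache.hadoop.io.LongWritable",
     valueClass="org.apache.hadoop.io.Text")
 print old_lines.values().count()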

or

newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile.

A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.

Parameters:

 path – path to Hadoop file
 inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
 keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
 valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
 keyConverter – (None by default)
 valueConverter – (None by default)
 conf – Hadoop configuration, passed in as a dict (None by default)
 batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)

Rgs,

K

KiiMo