
I have compressed a file using python-snappy and put it in my HDFS store. I am now trying to read it in, but I get the traceback below. I can't find an example of how to read the file in so I can process it. I can read the uncompressed text file version fine. Should I be using sc.sequenceFile? Thanks!

I first compressed the file and pushed it to HDFS:

python -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
hdfs dfs -put gene_regions.vcf.snappy /

I then added the following to spark-env.sh
export SPARK_EXECUTOR_MEMORY=16G                                                
export HADOOP_HOME=/usr/local/hadoop                                            

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native             
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native                 
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native           
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar

I then launch my Spark master and slave, and finally my IPython notebook, where I execute the code below.

a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
a_file.first()

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 a_file.first()

/home/user/Software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in first(self)
   1244         if rs:
   1245             return rs[0]
-> 1246         raise ValueError("RDD is empty")
   1247
   1248     def isEmpty(self):

ValueError: RDD is empty

Working code for the uncompressed text file:
a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf")
a_file.first()

output: u'##fileformat=VCFv4.1'

Levi Pierce

4 Answers


The issue here is that python-snappy is not compatible with Hadoop's snappy codec, which is what Spark uses to read the data when it sees a ".snappy" suffix. They are based on the same underlying algorithm, but they aren't compatible: you can't compress with one and decompress with the other.

You can make this work either by writing your data out to snappy in the first place using Spark or Hadoop, or by having Spark read your data as binary blobs and then invoking the python-snappy decompression yourself (see binaryFiles here: http://spark.apache.org/docs/latest/api/python/pyspark.html). The binary-blob approach is a bit more brittle because it needs to fit each entire input file in memory. But if your data is small enough, that will work.
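A minimal sketch of the binary-blob route, reusing the path from the question; it assumes each whole file fits in executor memory, and the decompressor must match how the file was written (StreamDecompressor for the framing format that python -m snappy -c produces, snappy.decompress for raw blocks). The helper name decompress_blob is just illustrative:

import snappy

def decompress_blob(path_and_bytes):
    # binaryFiles yields (filename, file_contents_as_bytes) pairs
    path, data = path_and_bytes
    decompressor = snappy.StreamDecompressor()  # framing-format stream
    return decompressor.decompress(data).decode("utf-8").splitlines()

raw = sc.binaryFiles("hdfs://master:54310/gene_regions.vcf.snappy")
lines = raw.flatMap(decompress_blob)
lines.first()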

  • Thanks Patrick, that makes a lot of sense. I read some more about Hadoop's snappy codec, which it seems is used for the intermediate files produced by the mapper prior to reducing everything back down. Is there a command-line utility I can use to compress my text files with the Hadoop snappy codec before pushing them to the HDFS store? I basically have about 10,000 text files of roughly 50 million lines each. Looks like this might work... https://github.com/kubo/snzip – Levi Pierce Apr 27 '15 at 00:22
  • This is now outdated; python-snappy supports hadoop-snappy, although it isn't very clear. – Jeroen Mar 18 '21 at 09:40

The accepted answer is now outdated. You can use python-snappy to compress to the hadoop-snappy format, but the documentation is virtually absent. Example:

import json
import snappy

with open('test.json.snappy', 'wb') as out_file:
    data = json.dumps({'test': 'somevalue', 'test2': 'somevalue2'}).encode('utf-8')
    compressor = snappy.hadoop_snappy.StreamCompressor()
    compressed = compressor.compress(data)
    out_file.write(compressed)

You can also use the command line, where the option is a bit more straightforward: pass the -t hadoop_snappy flag. Example:

echo "{'test':'somevalue','test2':'somevalue2'}" | python -m snappy -t hadoop_snappy -c - test.json.snappy
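Once the file is written with the hadoop-snappy framing and pushed to HDFS (the path below is illustrative), Spark's built-in Hadoop snappy codec should decode it transparently:

a_file = sc.textFile("hdfs://master:54310/test.json.snappy")
a_file.first()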

Jeroen

Not sure exactly which snappy codec my files have, but spark.read.text worked without incident for me.
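For reference, a minimal sketch of that DataFrame-based read, assuming the path from the question:

df = spark.read.text("hdfs://master:54310/gene_regions.vcf.snappy")
df.first()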

ijoseph

Alright I found a solution!

Build this: https://github.com/liancheng/snappy-utils. On Ubuntu 14.10 I had to install gcc-4.4 to get it to build; I commented on the error I was seeing here: https://code.google.com/p/hadoop-snappy/issues/detail?id=9

I can now compress the text files using snappy at the command line like so:

snappy -c gene_regions.vcf -o gene_regions.vcf.snappy

dump it into HDFS:

hdfs dfs -put gene_regions.vcf.snappy

and then load it in PySpark:

a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
a_file.first()

Voilà! The header of the VCF...

u'##fileformat=VCFv4.1'
Levi Pierce