
We have huge legacy files sitting in our Hadoop cluster in compressed SequenceFile format. The sequence files were created by a Hive ETL job. Let's say I have a table in Hive created with the following DDL:

CREATE TABLE sequence_table(
  col1 string,
  col2 int)
STORED AS SEQUENCEFILE;

Here is the script used to load the above sequence table:

set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert into table sequence_table
select * from text_table;

We have since exported the data from the sequence file location to S3 for archival, and I am now trying to process those files with Spark on AWS EMR. How can I read the sequence files in Spark? I looked at a sample file, which has the header shown below, and from it I gather the sequence file stores <K,V> pairs of <BytesWritable,Text>:

SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)
org.apache.hadoop.io.compress.SnappyCodec
Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E  £   =£ 
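
To double-check this, I believe the key/value classes and codec recorded in the header can also be read programmatically with Hadoop's SequenceFile.Reader. A minimal sketch (untested; the file name 000000_0 is one of the partition files mentioned below):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Open a single file and print the metadata stored in its header.
val reader = new SequenceFile.Reader(
  new Configuration(),
  SequenceFile.Reader.file(
    new Path("s3://viewershipforneo4j/viewership/event_date=2017-09-17/000000_0")))
println(reader.getKeyClassName)      // expect org.apache.hadoop.io.BytesWritable
println(reader.getValueClassName)    // expect org.apache.hadoop.io.Text
println(reader.getCompressionCodec)  // expect SnappyCodec
reader.close()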

I have tried reading it this way:

import org.apache.hadoop.io.{BytesWritable, Text}

val file = sc.sequenceFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[BytesWritable],
  classOf[Text])
file.take(10)

But it generates this error:

18/03/09 16:48:07 ERROR TaskSetManager: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
Serialization stack:
        - object not serializable (class: org.apache.hadoop.io.BytesWritable, value: )
        - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
        - object (class scala.Tuple2, (,8255909859607837R188956557001505628000150562818115056280001505628181TimerRecord9558SRM-1528454-0PiYo Workout!FRFM1810000002017-09-17 01:29:29))
        - element of array (index: 0)
        - array (class [Lscala.Tuple2;, size 1); not retrying
18/03/09 16:48:07 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
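
From what I've read, the Hadoop Writable classes are not java.io.Serializable, which is why take(10) blows up when records are shipped back to the driver (note the key in the trace is empty; the actual row lives in the Text value). So presumably I need to convert each record to plain Scala types before any action that collects results. Something like this (an untested sketch, building on the file RDD above):

// Copy each Writable into plain, serializable Scala types before the
// action; the record reader reuses its Writable instances, and they do
// not implement java.io.Serializable.
val rows = file.map { case (k, v) => (k.copyBytes, v.toString) }
rows.take(10)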

I also tried the newer Hadoop API, but with no luck either:

scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
<console>:31: error: class SequenceFileInputFormat takes type parameters
       val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
                                                                                                           ^

scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
<console>:1: error: identifier expected but ']' found.
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
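
I suspect the second attempt fails because I wrote the type parameters with Java's angle brackets; in Scala they presumably go in square brackets inside classOf, so I would guess something like this (untested; conf is the same Configuration used above):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

// Fully parameterize the input format class; Scala uses [] for generics.
val data = sc.newAPIHadoopFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[SequenceFileInputFormat[BytesWritable, Text]],
  classOf[BytesWritable],
  classOf[Text],
  conf)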
  • What is the exact name of the file? – Xavier Guihot Mar 10 '18 at 06:19
  • It is a partitioned table with table location "s3://viewershipforneo4j/viewership/". I have started by playing around with just one partition. The file names are like 000000_0, 000000_1, and so on. – kalyan chakravarthy Mar 10 '18 at 06:29
  • Do the files have a .snappy extension? – Xavier Guihot Mar 10 '18 at 06:30
  • The file names don't have a .snappy extension. But when I open a file in Notepad it has both text and binary characters in it. The top line of the file is: SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)org.apache.hadoop.io.compress.SnappyCodec Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E £ =£ – kalyan chakravarthy Mar 10 '18 at 06:33
  • Can you rename a file to include the .snappy extension and try again? – Xavier Guihot Mar 10 '18 at 06:35
