We have huge legacy files sitting in our Hadoop cluster in compressed SequenceFile format. The sequence files were created by a Hive ETL. Let's say I had a table in Hive created with the following DDL:
CREATE TABLE sequence_table(
col1 string,
col2 int)
stored as sequencefile;
Here is the script used to load the above sequence table:
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert into table sequence_table
select * from text_table;
We have now exported the data from the sequence file location to S3 for archival, and I am trying to process those files with Spark on AWS EMR. How can I read these sequence files in Spark? I took a look at a sample file, whose header (below) told me the sequence file's <K,V> is <BytesWritable,Text>:
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)
org.apache.hadoop.io.compress.SnappyCodec
Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E £ =£
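(As a sanity check, the key/value classes and codec can also be read straight from the header with Hadoop's SequenceFile.Reader; a minimal sketch, assuming one part file has been copied down to a hypothetical local path:)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

val conf = new Configuration()
// hypothetical local copy of one archived part file
val reader = new SequenceFile.Reader(conf,
  SequenceFile.Reader.file(new Path("/tmp/part-00000")))
println(reader.getKeyClassName)     // org.apache.hadoop.io.BytesWritable
println(reader.getValueClassName)   // org.apache.hadoop.io.Text
println(reader.getCompressionCodec) // SnappyCodec
reader.close()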
I have tried it this way:

import org.apache.hadoop.io.{BytesWritable, Text}

val file = sc.sequenceFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[BytesWritable],
  classOf[Text])
file.take(10)
But it generates this error:
18/03/09 16:48:07 ERROR TaskSetManager: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.BytesWritable, value: )
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (,8255909859607837R188956557001505628000150562818115056280001505628181TimerRecord9558SRM-1528454-0PiYo Workout!FRFM1810000002017-09-17 01:29:29))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1); not retrying
18/03/09 16:48:07 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
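From what I can tell, the root cause is that Hadoop Writables do not implement java.io.Serializable (and Spark reuses the same Writable instance for every record), so they have to be converted to plain JVM types before anything like take or collect. A minimal sketch of the conversion I believe is needed (untested against these exact files; copyBytes/toString are the standard Writable accessors):

import org.apache.hadoop.io.{BytesWritable, Text}

val rows = sc.sequenceFile(
    "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
    classOf[BytesWritable],
    classOf[Text])
  .map { case (k, v) =>
    // materialize the reused Writables into serializable objects
    (k.copyBytes(), v.toString)
  }
rows.take(10).foreach(println)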
I then tried the following, but still no luck:
scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
<console>:31: error: class SequenceFileInputFormat takes type parameters
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
^
scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
<console>:1: error: identifier expected but ']' found.
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
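For completeness, the compile errors above look purely syntactic: SequenceFileInputFormat needs its type parameters, and Scala writes them with square brackets, not Java's angle brackets. A sketch of what I believe the compiler is asking for (again untested on these files; the same Writable-to-plain-type conversion would still be needed before collecting):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val data = sc.newAPIHadoopFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[SequenceFileInputFormat[BytesWritable, Text]], // [K, V], not <K, V>
  classOf[BytesWritable],
  classOf[Text],
  sc.hadoopConfiguration)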