
We have huge legacy files sitting in our Hadoop cluster in compressed SequenceFile format. The sequence files were created by a Hive ETL job. Let's say I have a table in Hive created with the following DDL:

CREATE TABLE sequence_table(
  col1 string,
  col2 int)
STORED AS SEQUENCEFILE;

Here is the script used to load the above sequence table:

set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert into table sequence_table
select * from text_table;

We have since exported the data from the sequence file location to S3 for archival, and I am now trying to process those files with Spark on AWS EMR. How can I read the sequence files in Spark? I looked at a sample file, which has the header shown below, and from it I gather the sequence file stores <K,V> pairs of <BytesWritable,Text>:

SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)
org.apache.hadoop.io.compress.SnappyCodec
Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E  £   =£ 
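
To double-check this, I believe the key/value classes and codec recorded in the header can also be read programmatically with Hadoop's SequenceFile.Reader. A minimal sketch (untested; the file name 000000_0 is one of the partition files mentioned below):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Open a single file and print the metadata stored in its header.
val reader = new SequenceFile.Reader(
  new Configuration(),
  SequenceFile.Reader.file(
    new Path("s3://viewershipforneo4j/viewership/event_date=2017-09-17/000000_0")))
println(reader.getKeyClassName)      // expect org.apache.hadoop.io.BytesWritable
println(reader.getValueClassName)    // expect org.apache.hadoop.io.Text
println(reader.getCompressionCodec)  // expect SnappyCodec
reader.close()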

I have tried reading it this way:

import org.apache.hadoop.io.{BytesWritable, Text}

val file = sc.sequenceFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[BytesWritable],
  classOf[Text])
file.take(10)

But it generates this error:

18/03/09 16:48:07 ERROR TaskSetManager: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
Serialization stack:
        - object not serializable (class: org.apache.hadoop.io.BytesWritable, value: )
        - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
        - object (class scala.Tuple2, (,8255909859607837R188956557001505628000150562818115056280001505628181TimerRecord9558SRM-1528454-0PiYo Workout!FRFM1810000002017-09-17 01:29:29))
        - element of array (index: 0)
        - array (class [Lscala.Tuple2;, size 1); not retrying
18/03/09 16:48:07 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
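
From what I've read, the Hadoop Writable classes are not java.io.Serializable, which is why take(10) blows up when records are shipped back to the driver (note the key in the trace is empty; the actual row lives in the Text value). So presumably I need to convert each record to plain Scala types before any action that collects results. Something like this (an untested sketch, building on the file RDD above):

// Copy each Writable into plain, serializable Scala types before the
// action; the record reader reuses its Writable instances, and they do
// not implement java.io.Serializable.
val rows = file.map { case (k, v) => (k.copyBytes, v.toString) }
rows.take(10)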

I also tried the newer Hadoop API, but with no luck either:

scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
<console>:31: error: class SequenceFileInputFormat takes type parameters
       val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
                                                                                                           ^

scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
<console>:1: error: identifier expected but ']' found.
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
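
I suspect the second attempt fails because I wrote the type parameters with Java's angle brackets; in Scala they presumably go in square brackets inside classOf, so I would guess something like this (untested; conf is the same Configuration used above):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

// Fully parameterize the input format class; Scala uses [] for generics.
val data = sc.newAPIHadoopFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[SequenceFileInputFormat[BytesWritable, Text]],
  classOf[BytesWritable],
  classOf[Text],
  conf)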
  • What is the exact name of the file? – Xavier Guihot Mar 10 '18 at 06:19
  • It is a partitioned table with table location "s3://viewershipforneo4j/viewership/". I have started by playing around with just one partition. The file names are like 000000_0, 000000_1, and so on. – kalyan chakravarthy Mar 10 '18 at 06:29
  • Do the files have a .snappy extension? – Xavier Guihot Mar 10 '18 at 06:30
  • The file names don't have a .snappy extension. But when I open a file in Notepad it has both text and binary characters in it. The top line of the file is: SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)org.apache.hadoop.io.compress.SnappyCodec Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E £ =£ – kalyan chakravarthy Mar 10 '18 at 06:33
  • Can you rename a file to include the .snappy extension and try again? – Xavier Guihot Mar 10 '18 at 06:35
