
I have to read a compressed file that is uploaded to S3.

Functionality: When any file is uploaded to S3, a Lambda function is triggered, which in turn triggers a Spark job.

Where should I read the file: in AWS Lambda or through Apache Spark? Which one would be beneficial? And how should I read compressed files in Spark?

best wishes
Etisha

1 Answer


You ask multiple questions, so I'll try to answer each of them.

Where should I read the file: through Lambda or through Spark, and which one would be beneficial?

You can let S3 trigger a Lambda function, and have the Lambda function trigger a Spark job on EMR.

There are many examples of this pattern online; a minimal sketch is shown below.
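As a hedged sketch only (the cluster ID, bucket name, and job script path below are assumptions, not from the original answer), a Python Lambda handler can submit a spark-submit step to an already-running EMR cluster with boto3's add_job_flow_steps:

    import boto3

    emr = boto3.client("emr")

    # Hypothetical ID of an already-running EMR cluster.
    CLUSTER_ID = "j-XXXXXXXXXXXXX"

    def handler(event, context):
        # The S3 event that triggered this Lambda carries the bucket
        # and key of the uploaded file.
        s3 = event["Records"][0]["s3"]
        path = "s3://{}/{}".format(s3["bucket"]["name"], s3["object"]["key"])

        # Submit a spark-submit step; command-runner.jar is the standard
        # way to run spark-submit as an EMR step.
        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                "Name": "process-uploaded-file",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # s3://my-bucket/jobs/process.py is a hypothetical job script.
                    "Args": ["spark-submit", "s3://my-bucket/jobs/process.py", path],
                },
            }],
        )

An alternative is run_job_flow, which spins up a transient cluster per upload: this avoids keeping a cluster running but adds startup latency.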

How should I read compressed files in spark?

First, what kind of compression? Spark and Hadoop support the following compression types:

name    | ext      | codec class
-------------------------------------------------------------
bzip2   | .bz2     | org.apache.hadoop.io.compress.BZip2Codec 
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec 
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec 
gzip    | .gz      | org.apache.hadoop.io.compress.GzipCodec 
lz4     | .lz4     | org.apache.hadoop.io.compress.Lz4Codec 
snappy  | .snappy  | org.apache.hadoop.io.compress.SnappyCodec

If your compression type is supported, you can read compressed files with the following example code:

# Spark picks the decompression codec from the file extension (.gz here).
rdd = sc.textFile("s3://bucket/project/logfilexxxxx.*.gz")
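One caveat worth knowing: gzip is not a splittable format, so Spark reads each .gz file in a single task. If you are loading a few large gzipped files, repartitioning after the read spreads the work across executors; a minimal sketch (the partition count of 8 is an arbitrary assumption):

    # Each .gz file arrives in one partition, so redistribute records
    # after reading (8 partitions is an arbitrary choice).
    rdd = sc.textFile("s3://bucket/project/logfilexxxxx.*.gz").repartition(8)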

howie