
I am fairly new to the Spark world. I am trying to write an optimised solution for the use case below:

  1. Read streaming data from Kafka, where each message is essentially an S3 filepath to a compressed file.
  2. Read the compressed file from that filepath, process it, and store the result back to an S3 bucket.

I am able to read the Kafka topic and get the filepath, but I am not sure how to read the file at that path. Something like spark.read.binaryFile(filePath)?
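For illustration, this is roughly the kind of single-file read I have in mind (a sketch only; it assumes Spark 3.0+, where a binaryFile data source is available, and the bucket and key are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("singleFileRead").getOrCreate()

    // the binaryFile source returns one row per file, with path, modificationTime,
    // length and content (the raw bytes) columns
    val raw = spark.read.format("binaryFile").load("s3a://some-bucket/some-prefix/file.gz")
    raw.select("path", "length").show(false)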

Any help or guidance would be appreciated.

cody
  • Does this answer your question? [Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming](https://stackoverflow.com/questions/65777481/read-file-path-from-kafka-topic-and-then-read-file-and-write-to-deltalake-in-str) – Michael Heil Apr 09 '21 at 17:34
  • Thank you Mike. I think the above solution will work for me with Kafka, but I am trying to use Spark Streaming instead of Structured Streaming, as the streaming source may be different later on. Once I read the stream, each record in the stream should be the filepath to the actual file stored in S3. Next I have to read the file from this filepath (which is unstructured), process it, and finally store it. – cody Apr 10 '21 at 19:48
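For reference, a minimal sketch of the DStream-based variant described in the comment above (assuming the spark-streaming-kafka-0-10 integration; the broker address, topic, group id, batch interval and output bucket are all placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val spark = SparkSession.builder().appName("pathStreamSketch").master("local[*]").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "s3-path-consumer",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("your-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // each record value is assumed to be an S3 path to a compressed file;
      // the paths themselves are tiny, so collecting them to the driver is cheap
      val paths = rdd.map(_.value()).collect()
      paths.foreach { path =>
        val df = spark.read.text(path) // .gz files are decompressed transparently by extension
        // ... process df, then write it back
        df.write.mode("append").text("s3a://output-bucket/processed/")
      }
    }

    ssc.start()
    ssc.awaitTermination()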

1 Answer


This has many examples.

Read a compressed file:

rdd = sc.textFile("s3://bucket/lahs/blahblah.*.gz")
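If you prefer the DataFrame API, a roughly equivalent read might look like the sketch below (the bucket and path are placeholders; Spark decompresses .gz files transparently based on the extension, but gzip is not splittable, so each file ends up in a single partition):

    // read the compressed text files as a DataFrame; the path is a placeholder
    val lines = spark.read.text("s3a://bucket/path/blahblah.*.gz")

    // or, if the files contain JSON and you have a schema for them:
    // val data = spark.read.schema(yourFileSchema).json("s3a://bucket/path/blahblah.*.gz")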

Without your code it's hard to be specific, but here is an outline of reading the paths and writing the data back.

The rest is adapted from this answer:

    import org.apache.spark.sql.{Dataset, SparkSession}
    import org.apache.spark.sql.types.StructType

    val spark = SparkSession.builder()
      .appName("myKfconsumer")
      .master("local[*]")
      .getOrCreate()

    // ... create the schema of your files; the field below is a placeholder
    val yourFileSchema = new StructType()
      .add("someField", "string")

    // your path
    val filePath = "file:///tmp/spark/Blah/blah"
    // create it as your batch data
    // someBatchData

    // now read it; you need your schema, and write it back in the process section below
    import spark.implicits._

    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:0001")
      .option("subscribe", "blah")
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "true") // stop and debug it
      .load()
      .selectExpr("CAST(value AS STRING)") // the Kafka value is binary; cast it to the path string
      .as[String]

    kafkaStream.writeStream.foreachBatch((someBatchData: Dataset[String], batchId: Long) => {
      // bring the batch of file paths to the driver
      val records = someBatchData.collect()
      // go through all the records
      records.foreach((path: String) => {
        // read the file at the path that came from Kafka, using the schema defined above
        val yourData = spark.read.schema(yourFileSchema).json(path)
        // write it back as you wanted..
      })
    }).start()

    spark.streams.awaitAnyTermination()
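One possible variation on the loop above, prompted by the discussion in the comments below: since each record is only a small path string, collect just the paths and pass them all to a single read, so Spark parallelises the file reads itself rather than reading the files one by one. A sketch, reusing the placeholder names from the snippet above:

    kafkaStream.writeStream.foreachBatch((someBatchData: Dataset[String], batchId: Long) => {
      val paths = someBatchData.collect() // only the small path strings reach the driver
      if (paths.nonEmpty) {
        // one read over all files in the micro-batch; yourFileSchema and the output path are placeholders
        val data = spark.read.schema(yourFileSchema).json(paths: _*)
        // ... process `data`, then write it back, e.g.
        // data.write.mode("append").parquet("s3a://output-bucket/processed/")
      }
    }).start()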
Transformer
  • `someBatchData.collect()` should generally be avoided. Shouldn't that be a loop over the partitions? – OneCricketeer Apr 10 '21 at 13:51
  • Hello Transformer, if I understand it right, calling collect on the batch would bring the data onto the driver as Array[String], and then calling foreach on this would result in sequential processing, wouldn't it? Is it possible to process the records in parallel? Sorry if this question doesn't make any sense. – cody Apr 10 '21 at 20:10
  • Hi @cody, I gave sectional answers/samples to help with your current question. This works for me; if it works for you, mark it as the answer, and ask a new question and I will try to answer that as well. Thanks – Transformer Apr 11 '21 at 03:39
  • @OneCricketeer I didn't know collect() was bad... can you share more about this? If you see my code comment under that, it clearly says to go through all the records... a loop is fine, or a delegate, etc. – Transformer Apr 11 '21 at 03:44
  • You should be using foreachPartition rather than collect to keep data isolated to the executors, not sent to the driver. This solution may work, but it's not optimal. cc @cody – OneCricketeer Apr 11 '21 at 14:14
  • If you know the answer you can simply edit mine... rather than sit on the sidelines and poke holes :) I am happy to enhance it. – Transformer Apr 12 '21 at 22:23