
I have an external data source which sends the data through Kafka.

In fact, this is not the real data, but links to the data:

"type": "job_type_1"
"urls": [
  "://some_file"
  "://some_file"
]
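
For illustration only, assuming the payload is JSON shaped like the snippet above, the message could be modeled in Scala roughly as follows (json4s is my assumption here, not part of the original setup; any JSON library would do):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical model of one Kafka message: a job type plus a list of file URLs.
case class JobMessage(`type`: String, urls: List[String])

object JobMessage {
  implicit val formats: Formats = DefaultFormats

  // Parse the raw message value into the model above.
  def fromJson(value: String): JobMessage =
    parse(value).extract[JobMessage]
}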

There is a single topic, but each message contains a type field, based on which I need to execute one of several jobs.

The data is not continuous, but more like jobs: each message contains a set of data which should be processed in a single batch. The next job is independent. All jobs of the same type should be processed synchronously.

Options:

  1. Use Spark Streaming.

    This does not look like an appropriate solution for my scenario, and there is no built-in ability to treat the value not as data, but as a list of paths.

  2. Create an intermediate service which will dispatch requests and start a concrete job. In this case, what is the best approach to pass 20 KB+ of data to the job, since spark-submit may not accept that much as an argument?

  3. Create a long-running Spark app which will contain a plain Kafka consumer, and on each message it will create a Spark session and execute the job.

    I'm not sure whether this will work properly, how to stop it, etc.

  4. ???

UPDATE

As of now, my solution is to create a long-running Spark job which will connect to Kafka using the Kafka API (not the Spark one), subscribe, retrieve the list of URLs, dispatch on the job type, and then execute the Spark job with those URLs, so the Spark app will use the standard spark.read().load(urls) API.
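
A rough, minimal sketch of that approach, not the actual implementation: the topic name, group id, bootstrap servers, job type names and the use of json4s for parsing are all placeholders or assumptions. Offsets are committed only after the Spark job finishes, and max.poll.records is set to 1 so messages are pulled one by one:

import java.time.Duration
import java.util.{Collections, Properties}

import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object JobDispatcher {

  // Same hypothetical message model as in the earlier sketch.
  case class JobMessage(`type`: String, urls: List[String])
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-job-dispatcher").getOrCreate()

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "job-dispatcher")          // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    // Pull messages one by one and commit manually, so a message is only
    // acknowledged after its Spark job has completed.
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1")
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("jobs-topic")) // placeholder topic name

    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1)).asScala
      records.foreach { record =>
        val msg = parse(record.value()).extract[JobMessage]

        // Dispatch on the job type and run a batch job over the referenced files.
        msg.`type` match {
          case "job_type_1" => runJobType1(spark, msg.urls)
          case "job_type_2" => runJobType2(spark, msg.urls)
          case other        => println(s"Unknown job type: $other")
        }

        // Acknowledge only after the job has finished.
        consumer.commitSync()
      }
    }
  }

  // Placeholder job implementations; the real logic and file format depend on the data
  // (spark.read.load defaults to Parquet unless a format is specified).
  def runJobType1(spark: SparkSession, urls: List[String]): Unit =
    spark.read.load(urls: _*).show()

  def runJobType2(spark: SparkSession, urls: List[String]): Unit =
    spark.read.load(urls: _*).count()
}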

dr11
  • https://github.com/spark-jobserver/spark-jobserver could be useful. Write a kafka consumer which would read job info record from kafka and initiate the corresponding job using SparkJobServer. – Ajay Srivastava Jul 31 '19 at 16:18
  • Thank you. In the end, I'm not looking to replace the current infra, as that's close to impossible. – dr11 Jul 31 '19 at 18:49
  • Out of curiosity, how are you managing job processing acknowledgements? Is processing a job twice a cause for concern? – D3V Aug 02 '19 at 07:39
  • A message is considered read only after the Spark job is completed. The app pulls messages one by one. – dr11 Aug 05 '19 at 00:27

1 Answer


You can have multiple Spark jobs running within a single Spark session. Start a Spark Streaming job on the incoming stream, collect the results to the driver, and in parallel fire off the queries. For example:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamingExample {

  val conf = new SparkConf().setAppName("KafkaStreamingExample")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("topicA", "topicB")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { (rdd, time) =>
      // Bring the message values (the queries) back to the driver...
      val queriesToRun = rdd.map(_.value()).collect()

      // ...and fire them off in parallel within the same Spark session.
      queriesToRun.par.foreach { query =>
        spark.sql(query)
      }
    }

    // Without these the streaming job never actually starts.
    ssc.start()
    ssc.awaitTermination()
  }
}
Andrew Long
  • Eventually, I'm using the DataFrame API. But from what I see, this example will read and collect all the data from the topics, while I have only URLs in the topics. – dr11 Aug 01 '19 at 12:24