
My pipeline is the following:

Source-webservices ---> Kafka Producer --> topics --> sparkJobs --> hdfs/hive

I have three design-related questions:

  1. I need to pull the data from the DataSource APIs (web service URLs) and push it onto the topics. If I use a Kafka producer, can the producer be written as part of a Spark job, or should it be a stand-alone Java application? Is it possible to write a Kafka producer as a Spark job? If so, how?

  2. I have different kinds of data that come through various topics, but some of the data depends on data from other topics, so I need to achieve some ordering of the data. For example, data from topic_3 can't be processed unless the data from topic_1 and topic_2 is available. How do I handle this kind of dependency?

  3. What is the best place to enforce this ordering: on the Kafka producer side or on the consumer side?

BdEngineer
  • Do not use the Spark support mailing list or those of other products to make your question visible, especially with hidden recipients. – Kiwy May 17 '19 at 07:52

2 Answers


Spark provides a Kafka connector through which you can connect to any Kafka topic available in your cluster. Once connected to the topic, you can read or write the data.

Example code:

Streaming read:

val kafkaStream = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", KAFKA_BROKERS)
      .option("subscribe", src_topic_name)
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "false")
      .load()

Batch read:

val kafkaDf = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", KAFKA_BROKERS)
      .option("subscribe", src_topic_name)
      .option("startingOffsets", "earliest")  // "latest" is not allowed as a starting offset for batch reads
      .option("endingOffsets", "latest")
      .option("failOnDataLoss", "false")
      .load()

Now, using kafkaStream, you can read the data from src_topic_name (we are using readStream here).
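Since your pipeline ends in HDFS/Hive, a minimal sketch of the consuming side could look like the following. The output path, checkpoint location, trigger interval, and column handling are assumptions for illustration, not part of your setup:

import org.apache.spark.sql.streaming.Trigger

// Kafka delivers key/value as binary, so cast the value to STRING first.
val parsed = kafkaStream.selectExpr("CAST(value AS STRING) AS json", "timestamp")

// Continuously write the parsed records to HDFS as Parquet.
val query = parsed.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/landing/src_topic_name")              // hypothetical output path
      .option("checkpointLocation", "hdfs:///checkpoints/src_topic_name") // hypothetical checkpoint dir
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

query.awaitTermination()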

Ref: Spark Streaming with Kafka

This blog may be helpful to you.

arctic_Oak
  • Thank you, but the question is more about the design aspect of how to structure this. – BdEngineer May 17 '19 at 11:36
  • You are supposed to do research for this; you can't ask a question asking others to design the complete architecture of your system. You can ask questions specific to the problems you face during implementation. Try to break the problem into sub-problems and research each part. In your case, the first part is getting from the source web service to Kafka; complete that part first, then move ahead. – arctic_Oak May 17 '19 at 11:46

1) I am not sure about your pipeline. Your question suggests the opposite flow, i.e. from a Dataset to Kafka.

Of course a Kafka producer can be used inside your Spark DAG. There are a couple of options. I understand you meant the Dataset API, not the DataSource API. On a Dataset you can always add a terminal node with 'foreach' and emit every element from it. You can also be a bit more efficient and add a terminal node with 'foreachPartition', where you reuse the same producer for every element in a given partition, as sketched below.
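A minimal sketch of the foreachPartition variant, assuming a Dataset[String] called dataset, with a hypothetical broker list and target topic:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

dataset.foreachPartition { partition: Iterator[String] =>
  // One producer per partition, reused for every record in that partition.
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092,broker2:9092") // hypothetical brokers
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  partition.foreach { record =>
    producer.send(new ProducerRecord[String, String]("target_topic", record)) // hypothetical topic
  }
  producer.flush()
  producer.close()
}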

2) In Kafka, strict ordering is guaranteed only within a single topic partition. Therefore, if you need to keep the order of your different types of data, you need to send them to the same topic/partition (multiplex them) and make sure that your consumer is capable of demultiplexing that heterogeneous stream. To keep your data within the same topic partition, either use the same message key and rely on the default partitioner (recommended) or provide your own.
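As an illustration of the same-key approach (the topic name, key, and payloads below are made up):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// All three records share the same key, so the default partitioner routes them
// to the same partition and their relative order is preserved for the consumer.
val key = "order-42" // hypothetical business key
producer.send(new ProducerRecord[String, String]("orders", key, "created"))
producer.send(new ProducerRecord[String, String]("orders", key, "paid"))
producer.send(new ProducerRecord[String, String]("orders", key, "shipped"))
producer.close()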

L. F.