My pipeline is the following:
Source web services --> Kafka producer --> topics --> Spark jobs --> HDFS/Hive
I have two design-related questions:
I need to pull data from the data-source APIs (web service URLs) and push it onto Kafka topics. If I use a Kafka producer for this, can the producer be written as part of a Spark job, or should it be a stand-alone Java application? If it is possible to run a Kafka producer inside a Spark job, how is that done?
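To make the first question concrete, here is a minimal sketch of the stand-alone-application shape I have in mind: a polling loop that fetches a batch from the web service and forwards each record to a topic. All names here (`fetchBatch`, `IngestLoop`) are hypothetical, and a `BlockingQueue` stands in for the real `KafkaProducer.send(...)` call so the structure is self-contained:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

// Stand-alone ingestion sketch: poll a web-service API, forward each record
// to a topic. The BlockingQueue is a stand-in for a Kafka topic; real code
// would call producer.send(new ProducerRecord<>("topic_1", record)) instead.
public class IngestLoop {
    private final Supplier<List<String>> fetchBatch; // wraps the data-source API call (hypothetical)
    private final BlockingQueue<String> topic;       // stand-in for a Kafka topic

    public IngestLoop(Supplier<List<String>> fetchBatch, BlockingQueue<String> topic) {
        this.fetchBatch = fetchBatch;
        this.topic = topic;
    }

    // One polling pass: fetch a batch from the source, publish every record.
    public int pollOnce() throws InterruptedException {
        List<String> batch = fetchBatch.get();
        for (String record : batch) {
            topic.put(record); // real code: producer.send(...)
        }
        return batch.size();
    }
}
```

The same loop body could instead run inside a Spark job (e.g. per partition), but then the pull schedule is tied to the job's batch cycle, which is part of what I am asking about.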
Different kinds of data arrive on different topics, and some of that data depends on data from other topics, so I need to enforce some ordering. For example, a record from topic_3 cannot be processed until the corresponding data from topic_1 and topic_2 is available. How should I handle this kind of dependency?
Where is the best place to enforce this ordering: on the Kafka producer side or on the consumer side?
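To illustrate the consumer-side option I am considering, here is a minimal stdlib-only sketch, under the assumption that records on the three topics share a join key: topic_3 records are buffered until the same key has been seen on topic_1 and topic_2. The names (`onRecord`, `DependencyGate`) are hypothetical; in a real Spark job this state would live in the streaming job's keyed state or an external store, not plain in-memory collections:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Consumer-side gating sketch: hold topic_3 records until their key has been
// seen on both topic_1 and topic_2, then release them for processing.
public class DependencyGate {
    private final Set<String> seenTopic1 = new HashSet<>();
    private final Set<String> seenTopic2 = new HashSet<>();
    private final Queue<String> pendingTopic3 = new ArrayDeque<>(); // keys still waiting on dependencies
    private final List<String> processed = new ArrayList<>();       // keys released in order

    // Called once per consumed record, in arrival order.
    public void onRecord(String topic, String key) {
        switch (topic) {
            case "topic_1": seenTopic1.add(key); break;
            case "topic_2": seenTopic2.add(key); break;
            case "topic_3": pendingTopic3.add(key); break;
        }
        drain();
    }

    // Release any buffered topic_3 keys whose dependencies have now arrived.
    private void drain() {
        int n = pendingTopic3.size();
        for (int i = 0; i < n; i++) {
            String key = pendingTopic3.poll();
            if (seenTopic1.contains(key) && seenTopic2.contains(key)) {
                processed.add(key);     // both dependencies satisfied: process now
            } else {
                pendingTopic3.add(key); // still waiting: keep buffered
            }
        }
    }

    public List<String> processed() { return processed; }
}
```

The producer-side alternative would be to delay publishing to topic_3 until topic_1 and topic_2 have been written, which is what I would like opinions on.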