I have an external data source that sends data through Kafka.
In fact, the messages do not contain the real data, but links to the data:
"type": "job_type_1"
"urls": [
"://some_file"
"://some_file"
]
There is a single topic, but every message contains a `type` field, and based on it I need to execute one of several jobs.
The data is not continuous; it is more like jobs: each message describes a set of data that should be processed in a single batch. Messages are independent of each other, but jobs of the same type should be processed synchronously, one at a time (see the sketch below).
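For clarity, a minimal sketch of how such a message could be modeled and dispatched on its `type` field. Jackson is assumed as the JSON library; the class and job-runner names are hypothetical:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

// Hypothetical POJO matching the message shown above.
class JobMessage {
    public String type;        // e.g. "job_type_1"
    public List<String> urls;  // links to the actual data, not the data itself
}

class Dispatcher {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static void dispatch(String json) throws Exception {
        JobMessage msg = MAPPER.readValue(json, JobMessage.class);
        switch (msg.type) {
            case "job_type_1": runJobType1(msg.urls); break;  // hypothetical job runners
            case "job_type_2": runJobType2(msg.urls); break;
            default: throw new IllegalArgumentException("Unknown job type: " + msg.type);
        }
    }

    static void runJobType1(List<String> urls) { /* batch processing for type 1 */ }
    static void runJobType2(List<String> urls) { /* batch processing for type 2 */ }
}
```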
Options:
- Use Spark Streaming. This does not look like an appropriate solution for my scenario, and there is no built-in ability to treat the message value not as data, but as a list of paths.
- Create an intermediate service which dispatches requests and starts a concrete job. In this case, what is the best approach to pass 20 KB+ of data to the job, since spark-submit may not accept that much as an argument?
- Create a long-running Spark app which contains a plain Kafka consumer and, on each message, creates a Spark session and executes the job. I am not sure whether this will work properly, how to stop it, etc.
- ???
UPDATE
As of now my solution is to create a long-running Spark job which connects to Kafka using the Kafka API (not the Spark one), subscribes, retrieves the list of URLs, dispatches on the job type, and then executes a Spark job with those URLs, so the Spark app will use the standard `spark.read().load(urls)` API.
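A rough sketch of that long-running driver, assuming the plain `kafka-clients` consumer API and Jackson for parsing the message; the bootstrap servers, group id, topic name and the `runJob` method are placeholders, not part of the original question:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class JobDriver {
    public static void main(String[] args) throws Exception {
        // One SparkSession for the lifetime of the driver; each Kafka message
        // triggers a separate batch job on it.
        SparkSession spark = SparkSession.builder().appName("kafka-job-driver").getOrCreate();

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // placeholder
        props.put("group.id", "job-driver");            // placeholder
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("jobs"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    JsonNode msg = mapper.readTree(record.value());
                    String type = msg.get("type").asText();
                    String[] urls = mapper.convertValue(msg.get("urls"), String[].class);

                    // Messages are handled one at a time, so jobs of the same
                    // type never run in parallel.
                    Dataset<Row> data = spark.read().load(urls);  // the standard API mentioned above
                    runJob(type, data);                           // hypothetical per-type processing
                }
                consumer.commitSync();  // commit only after the batch has finished
            }
        }
    }

    private static void runJob(String type, Dataset<Row> data) {
        // dispatch on the job type; actual processing omitted
    }
}
```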