
My application is configured to read from a topic on a configured Kafka broker, then write the transformed result to Hadoop HDFS. To do so, it needs to be launched on a YARN cluster node.

We'd like to use Spring Cloud Data Flow for this. But since this application doesn't need any input from another flow (it already knows where to pull its source from) and outputs nothing, how can I create a valid Data Flow stream from it? In other words, this would be a stream composed of only one app, which should run indefinitely on a YARN node.

Alexandre FILLATRE

1 Answer


In this case you need a stream definition that connects to a named destination in Kafka and writes to HDFS.

For instance, the stream would look like this:

stream create a1 --definition ":myKafkaTopic > hdfs"

See the Spring Cloud Data Flow reference documentation on named destinations for more info on this.
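For completeness, a minimal shell session might look like the following sketch. The registration URI is a placeholder, and the stream name a1 and topic myKafkaTopic are only examples; the hdfs app is assumed to be the out-of-the-box Spring Cloud Stream HDFS sink:

# register the HDFS sink (URI depends on your SCDF distribution)
app register --name hdfs --type sink --uri <hdfs-sink-artifact-uri>

# create the single-app stream: the named destination ":myKafkaTopic" acts as the source
stream create --name a1 --definition ":myKafkaTopic > hdfs"

# deploy it; the app runs until the stream is explicitly undeployed
stream deploy --name a1

At deployment, Data Flow sets the sink's input binding destination (spring.cloud.stream.bindings.input.destination) to myKafkaTopic, so the sink consumes from that topic without any Kafka-specific configuration inside the app.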

Ilayaperumal Gopinathan
  • Thanks, that's what I did to make it work, since I didn't have any other choice anyway. Does the topic name really matter here, since everything is already configured in the application itself? Should I change the application's behaviour to use a Sink as input, rather than configuring the Kafka polling directly in it? – Alexandre FILLATRE Dec 02 '16 at 06:26
  • As long as the HDFS sink application uses the `kafka` stream binder, you don't have to make any changes. The topic name really matters, and you don't need to configure anything in the sink application if you are using Data Flow. – Ilayaperumal Gopinathan Dec 02 '16 at 08:28
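To illustrate the comment thread: if the custom application were restructured as a standard Spring Cloud Stream sink rather than polling Kafka itself, a minimal sketch could look like the class below. This assumes Spring Cloud Stream 1.x with the Kafka binder; the class name and the transform/write logic are illustrative only, not taken from the question:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;

// A custom sink: Data Flow wires Sink.INPUT to the named destination
// (the Kafka topic) at deployment time, so the app needs no explicit
// Kafka consumer configuration of its own.
@SpringBootApplication
@EnableBinding(Sink.class)
public class TransformAndWriteSink {

    @StreamListener(Sink.INPUT)
    public void handle(String payload) {
        // transform the payload and write the result to HDFS here
    }

    public static void main(String[] args) {
        SpringApplication.run(TransformAndWriteSink.class, args);
    }
}

This is why the topic name in the stream definition matters: it is what Data Flow injects as the input binding's destination, regardless of what the application configures internally.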