
I've been trying to figure out how to write a Flink program that receives events from 3 Kafka topics and sums them up for today, yesterday, and the day before yesterday.

So the first question is: how can I sum the transactions for 3 different days and extract them as a JSON file?

TheEliteOne
  • Your question is somewhat unclear. When you say "3 kafka's streaming", are you referring to the Kafka Streams library, or did you mean to say "3 kafka topics"? In any event, I suggest you start with the docs https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html or some examples http://training.data-artisans.com/exercises/toFromKafka.html. – David Anderson Feb 06 '18 at 15:17
  • @alpinegizmo Thank you for answering; I have modified the question. – TheEliteOne Feb 06 '18 at 17:26
  • How do you want to group your transactions? If you have the following transaction ids: 1,2,3,4,5,6, should it group 1,2,3 and 4,5,6, or must the groups be sliding groups: 1,2,3 - 2,3,4 - 3,4,5 - ...? – diegoreico Feb 07 '18 at 07:44
  • @DiegoReirizCores By grouping I meant grouping all the transactions of the day and returning their count as JSON. – TheEliteOne Feb 07 '18 at 09:20
  • Oh, then in my examples the window size must be changed to 1 to group all the messages in a day, instead of all messages from the last 3 days. – diegoreico Feb 07 '18 at 09:34

1 Answer


If you want to read from 3 different Kafka topics or partitions, you have to create 3 Kafka sources.

See Flink's documentation about the Kafka consumer:

val env = StreamExecutionEnvironment.getExecutionEnvironment()
val consumer0 = new FlinkKafkaConsumer08[String](...)
val consumer1 = new FlinkKafkaConsumer08[String](...)
val consumer2 = new FlinkKafkaConsumer08[String](...)
consumer0.setStartFromGroupOffsets()
consumer1.setStartFromGroupOffsets()
consumer2.setStartFromGroupOffsets()

val stream0 = env.addSource(consumer0)
val stream1 = env.addSource(consumer1)
val stream2 = env.addSource(consumer2)

val unitedStream = stream0.union(stream1,stream2)

/* Logic to group transactions from 3 days */
/* I need more info, but it should be a sliding or fixed window keyed by the id of the transactions */

val windowSize = 1 // number of days the window uses to group events
val windowStep = 1 // the window slides 1 day

val reducedStream = unitedStream
    .map(transaction => {
        // mark each event as a single transaction before windowing
        // (a WindowedStream has no map, so this must happen before keyBy)
        transaction.numberOfTransactions = 1
        transaction
    })
    .keyBy("transactionId") // or any field that groups transactions in the same group
    .timeWindow(Time.days(windowSize), Time.days(windowStep))
    .sum("numberOfTransactions")

val streamFormatedAsJson = reducedStream.map(functionToParseDataAsJson)
// you can use a library like GSON for this
// or a scala string template

streamFormatedAsJson.addSink(yourFavoriteSinkToWriteYourData)
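
For illustration, here is a minimal sketch of what `functionToParseDataAsJson` could look like using a Scala string template. The `Transaction` case class and its fields are assumptions, not part of the original example; adapt them to your actual event type:

// Hypothetical event type; replace it with your real transaction class
case class Transaction(transactionId: String, var numberOfTransactions: Int)

// Builds one JSON string per aggregated transaction via a string template
val functionToParseDataAsJson = (t: Transaction) =>
  s"""{"transactionId": "${t.transactionId}", "numberOfTransactions": ${t.numberOfTransactions}}"""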

If your topic names can be matched with a regular expression, you can create only one Kafka consumer, as follows:

val env = StreamExecutionEnvironment.getExecutionEnvironment()

val consumer = new FlinkKafkaConsumer08[String](
  java.util.regex.Pattern.compile("day-[1-3]"),
  ..., //check documentation to know how to fill this field
  ...) //check documentation to know how to fill this field

val stream = env.addSource(consumer)
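
As a sketch of how those elided constructor fields are typically filled: the second argument is a deserialization schema and the third is the consumer properties. Everything below (broker addresses, group id, topic pattern) is a placeholder assumption; check the linked documentation to confirm that your connector version supports pattern-based subscription:

import java.util.Properties
import java.util.regex.Pattern

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08

// Placeholder connection settings; adjust them to your cluster
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181") // the 0.8 consumer still needs ZooKeeper
properties.setProperty("group.id", "transactions-consumer")

val consumer = new FlinkKafkaConsumer08[String](
  Pattern.compile("day-[1-3]"),  // subscribe to every topic matching the pattern
  new SimpleStringSchema(),      // deserialize record values as UTF-8 strings
  properties)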

The most common approach is to have all transactions inside the same Kafka topic rather than in different topics. In that case the code is simpler, because you only need a window to process your data:

Day 1 -> 11111 -\
Day 2 -> 22222 --> 1111122222333 -> Window -> 11111 22222 333 -> reduce operation per window partition
Day 3 -> 333   --/                            |-----|-----|---|

Example code

val env = StreamExecutionEnvironment.getExecutionEnvironment()
val consumer = new FlinkKafkaConsumer08[String](...)
consumer.setStartFromGroupOffsets()

val stream = env.addSource(consumer)

/* Logic to group transactions from 3 days */
/* I need more info, but it should be a sliding or fixed window keyed by the id of the transactions */

val windowSize = 1 // number of days the window uses to group events
val windowStep = 1 // the window slides 1 day

val reducedStream = stream
    .map(transaction => {
        // mark each event as a single transaction before windowing
        // (a WindowedStream has no map, so this must happen before keyBy)
        transaction.numberOfTransactions = 1
        transaction
    })
    .keyBy("transactionId") // or any field that groups transactions in the same group
    .timeWindow(Time.days(windowSize), Time.days(windowStep))
    .sum("numberOfTransactions")

val streamFormatedAsJson = reducedStream.map(functionToParseDataAsJson)
// you can use a library like GSON for this
// or a scala string template

streamFormatedAsJson.addSink(yourFavoriteSinkToWriteYourData)
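
One caveat the examples gloss over: `timeWindow(Time.days(...))` follows the environment's time characteristic. If the daily windows should be based on event time (a timestamp carried inside each transaction), timestamps and watermarks must be assigned before keying. A minimal sketch, assuming a stream `transactionStream` of already-parsed `Transaction` objects and a hypothetical epoch-millisecond `timestamp` field:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Tolerate events arriving up to one minute out of order
val withTimestamps = transactionStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[Transaction](Time.minutes(1)) {
    override def extractTimestamp(t: Transaction): Long = t.timestamp // assumed field
  })

The windowed aggregation above would then run on `withTimestamps` instead of the raw stream.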
diegoreico
  • May God bless you for the help! – TheEliteOne Feb 07 '18 at 09:08
  • No problem ;) Just post another comment if you need more help, and mark the question as solved if this works for you, so other people who answer questions know it is already solved and can focus on helping others. – diegoreico Feb 07 '18 at 09:20
  • I'll send you a link to my GitHub so you can check what I have done :) if you are not against it, of course :D Btw, I really want to learn Flink/Kafka; do you know any online courses? – TheEliteOne Feb 07 '18 at 09:48
  • Sure! I will check it ;) People from [Data Artisans have some good examples](http://training.data-artisans.com/); beyond that, I'm learning Flink and Kafka by trying different things each time I build something. – diegoreico Feb 07 '18 at 09:51
  • This is the link: https://github.com/anvarknian/flinkquickstartscala/ Check it out and tell me if everything is alright :D – TheEliteOne Feb 07 '18 at 11:24
  • @TheEliteOne It was an error of mine: those groupBy calls should be keyBy. I have already changed the example. – diegoreico Feb 07 '18 at 15:50