
I have a use case where I want to build a real-time decision tree evaluator using Flink. I have a decision tree like the one below:


Root Node (Product A) ---> Check if price of Product A increased by $10 in the last 10 mins

----------------------------
If Yes ---> Left Child of A (Product B) ---> Check if price of Product B increased by $20 in the last 10 mins ---> If not, output Product B

----------------------------
If No ---> Right Child of A (Product C) ---> Check if price of Product C increased by $20 in the last 10 mins ---> If not, output Product C

Note: This is just an example of one decision tree. I have multiple such decision trees with different product types, numbers of nodes, and conditions, and I want to write a common Flink app to evaluate all of them.
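To make the tree shape concrete, here is a minimal plain-Java sketch (no Flink API involved) of how one such tree could be represented as data and evaluated once the price difference of every product is known. The names `TreeNode` and `TreeEvaluator` are hypothetical. The sketch also assumes that "increased by $10" means a difference of at least 10, and that nothing is output when a leaf's own check is met, since the example does not say what happens in that case.

```java
import java.util.Map;

// Hypothetical node type: one product plus the price increase that counts as "Yes".
final class TreeNode {
    final String product;
    final double threshold; // e.g. 10.0 for "increased by $10 in the last 10 mins"
    final TreeNode ifYes;   // child followed when the increase meets the threshold
    final TreeNode ifNo;    // child followed otherwise

    TreeNode(String product, double threshold, TreeNode ifYes, TreeNode ifNo) {
        this.product = product;
        this.threshold = threshold;
        this.ifYes = ifYes;
        this.ifNo = ifNo;
    }

    boolean isLeaf() {
        return ifYes == null && ifNo == null;
    }
}

final class TreeEvaluator {
    private TreeEvaluator() {}

    // Walks the tree using precomputed price differences (product -> difference).
    // Returns the output product, or null when no product is output (assumption).
    static String evaluate(TreeNode root, Map<String, Double> priceDiffs) {
        TreeNode node = root;
        while (node != null) {
            Double diff = priceDiffs.get(node.product);
            if (diff == null) {
                throw new IllegalStateException("No price difference for " + node.product);
            }
            boolean increased = diff >= node.threshold;
            if (node.isLeaf()) {
                // Leaf rule from the example: "If not, output Product X".
                return increased ? null : node.product;
            }
            node = increased ? node.ifYes : node.ifNo;
        }
        return null; // reached a missing child: nothing to output
    }
}
```

With the example tree (A at $10, B and C at $20), differences of A = 12, B = 5, C = 0 lead to Product B, while A = 3 leads to Product C. Keeping the tree as plain data like this means one evaluator can serve all trees.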

As input, I receive a data stream with the prices of all product types (A, B and C) every minute. One approach I can think of is as follows:

  1. Filter the input stream by product type.
  2. For each product type, apply a sliding window over the last X minutes (X depending on the product type), triggered every minute.
  3. Use a process window function to compute the price difference for each product type and emit it to an output stream.
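The per-product computation in steps 2 and 3 can be sketched in plain Java to show the intended logic. This is not Flink code: in Flink the buffering would be handled by a sliding window keyed by product type. The class names `PricePoint` and `SlidingPriceDiff` are hypothetical, and the sketch assumes that readings arrive in timestamp order per product and that the "price difference" is the latest price minus the oldest price still inside the window.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for one price reading.
final class PricePoint {
    final long timestampMillis;
    final double price;

    PricePoint(long timestampMillis, double price) {
        this.timestampMillis = timestampMillis;
        this.price = price;
    }
}

// Keeps the readings of the last windowMillis per product and reports the price change.
final class SlidingPriceDiff {
    private final long windowMillis;
    private final Map<String, Deque<PricePoint>> byProduct = new HashMap<>();

    SlidingPriceDiff(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Adds a reading and returns (latest price - oldest price still in the window).
    double add(String product, long timestampMillis, double price) {
        Deque<PricePoint> points = byProduct.computeIfAbsent(product, k -> new ArrayDeque<>());
        points.addLast(new PricePoint(timestampMillis, price));
        // Evict readings that are older than the window relative to the new reading.
        while (points.peekFirst().timestampMillis < timestampMillis - windowMillis) {
            points.removeFirst();
        }
        return price - points.peekFirst().price;
    }
}
```

For a 10-minute window, a product priced at 100 at minute 0 and 112 at minute 10 yields a difference of 12. One minute later the minute-0 reading is evicted and the difference is taken against the minute-1 price.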

Once we have the price difference for each product type (that is, for each node of the tree), we can evaluate the decision tree logic. To do this, we have to make sure the price difference calculation for every product type in a decision tree (Products A, B and C in the example above) has completed before determining the output. One way is to write the outputs for all these products from the output stream to a datastore and poll it from an EC2 instance every 5s or so to check whether all of these price computations have completed. Once they have, execute the decision tree logic to determine the output product.
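The "wait until every product has reported" step does not need polling in principle. A minimal plain-Java sketch of the bookkeeping: buffer the differences per timestamp and release them as soon as the set is complete. The class `DiffCollector` and its method names are hypothetical. In a Flink job the `pending` map would presumably live in keyed state (for example inside a `KeyedProcessFunction` keyed by tree) rather than in a `HashMap`, and incomplete entries would need some expiry, which this sketch omits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Collects price differences for one decision tree until all its products have reported.
final class DiffCollector {
    private final Set<String> requiredProducts;
    // timestamp -> (product -> price difference); stands in for Flink keyed state
    private final Map<Long, Map<String, Double>> pending = new HashMap<>();

    DiffCollector(Set<String> requiredProducts) {
        this.requiredProducts = requiredProducts;
    }

    // Records one difference. Returns the full set for that timestamp once every
    // required product has reported, otherwise an empty Optional.
    Optional<Map<String, Double>> onDiff(long timestampMillis, String product, double diff) {
        if (!requiredProducts.contains(product)) {
            return Optional.empty(); // product is not part of this tree
        }
        Map<String, Double> diffs =
                pending.computeIfAbsent(timestampMillis, k -> new HashMap<>());
        diffs.put(product, diff);
        if (diffs.keySet().containsAll(requiredProducts)) {
            pending.remove(timestampMillis); // complete: free the buffer
            return Optional.of(diffs);
        }
        return Optional.empty();
    }

    int pendingTimestamps() {
        return pending.size();
    }
}
```

When `onDiff` returns a non-empty result, the decision tree for that timestamp can be evaluated immediately, so no external datastore or polling instance is involved.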

I wanted to understand whether there is another way to do this entire computation in Flink itself, without needing any other components (datastore/EC2). I am fairly new to Flink, so any leads would be highly appreciated!

Cross
  • Is the set of trees sufficiently static that it's okay to redeploy when they change, or do you require more dynamism than that? – David Anderson Jun 05 '21 at 19:16
  • Yes, it's okay to redeploy when the tree changes. – Cross Jun 06 '21 at 06:36
  • Yes, this can be done entirely in Flink. But getting into a detailed architectural design of possible solutions is rather out of scope for Stack Overflow. Perhaps you can work your way through the Flink tutorials -- https://ci.apache.org/projects/flink/flink-docs-stable/docs/learn-flink/overview/ -- and then return with more specific questions. – David Anderson Jun 06 '21 at 19:05
  • Sure, thanks David! One high-level question I had was as follows: once I filter, window and calculate the price diff for each product type individually, I need to know whether the price diffs for all product types in a given decision tree have been processed for a given minute/timestamp. How can we achieve this in Flink? (Will come back with more questions after going through the link you shared) – Cross Jun 07 '21 at 05:22
  • I can't really respond without more information. Flink can keep state for each product type and each timeframe, but without knowing what "all product types" really means, I wouldn't know how to implement this. Ultimately this should become a new question, where you explain what you've tried, and where you are stuck/uncertain. – David Anderson Jun 07 '21 at 10:29
  • Not sure that I understand the question correctly, correct me if I'm wrong: join the 3 streams into one and key it, then use keyed state (RocksDB) to track which product evaluations have finished. If all of them are in the state, then evaluate your trees. – Georgi Stoyanov Jun 07 '21 at 14:35

0 Answers