
I would like a simpler, better, and more elegant way of approaching the problem below. I have not come across any documentation on the topic, and I am sure my current approach has some bottlenecks. Thank you.

I have a stream where JSON is mapped to a POJO:

DataStream<MyPOJO> stream = env
             .addSource( <<kafkaSource>>).map(new EventToPOJO());

Some of the POJOs will have a populated primary key, some will have a populated alternate key, and some will have both. The only example of working with two keys that I have found in the Flink documentation uses a KeySelector for a composite key; there is nothing for alternate keys.

My current approach is as follows :

  1. Use a RichFlatMapFunction to collect all elements that have a primary key into a stream, AStream
  2. Use a RichFlatMapFunction to collect all elements that have an alternate key into a stream, BStream
  3. Use a RichFlatMapFunction to collect the items that have both a primary and an alternate key into a stream, CStream
  4. Join AStream with CStream on the primary key
  5. Join BStream with CStream on the alternate key
  6. Finally, keyBy the primary key

 DataStream<MyPOJO> primaryKey = stream.flatMap(new RichFlatMapFunction<MyPOJO, MyPOJO>() {
            @Override
            public void flatMap(MyPOJO mypojo, Collector<MyPOJO> collector) throws Exception {
                // Forward only the records that carry a primary key.
                if(mypojo.getPrimaryKey() != null){
                    collector.collect(mypojo);
                }
            }
        });


 DataStream<MyPOJO> alternateKey = stream.flatMap(new RichFlatMapFunction<MyPOJO, MyPOJO>() {
            @Override
            public void flatMap(MyPOJO mypojo, Collector<MyPOJO> collector) throws Exception {
                if(mypojo.getAlternateKey() != null){
                 
                    collector.collect(mypojo);
                }
            }
        });


 DataStream<MyPOJO> both = stream.flatMap(new RichFlatMapFunction<MyPOJO, MyPOJO>() {
            @Override
            public void flatMap(MyPOJO mypojo, Collector<MyPOJO> collector) throws Exception {
                if(mypojo.getAlternateKey() != null && mypojo.getPrimaryKey() !=null ){
                 
                    collector.collect(mypojo);
                }
            }
        });



//Join them 

   both.join(alternateKey)
                .where(MyPOJO::getAlternateKey)
                .equalTo(MyPOJO::getAlternateKey)
                .window(TumblingEventTimeWindows.of(Time.milliseconds(1)))
                // The third type parameter must match the return type of join().
                .apply (new JoinFunction<MyPOJO, MyPOJO, StateObject>(){

                    @Override
                    public StateObject join(MyPOJO mypojo, MyPOJO mypojo2) throws Exception {

                      // Some join logic to keep both states (elided); it produces stateObject2
                        return stateObject2;
                    }
                });

:: repeat for primary key stream ...


// keyby at the end
both.keyBy(MyPOJO::getPrimaryKey)


I'm sure I could use a filter function as well to produce the three streams, but I would rather not have to split into three streams in the first place. Please note that I have simplified the above for readability's sake, so please don't mind any syntax errors you may find.
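
For illustration, the three splits described above reduce to three null-check predicates. The sketch below is plain Java with no Flink dependency: `KeyedEvent` is a hypothetical stand-in for MyPOJO, `SplitPredicates` and `select` are made-up names, and `java.util.stream` stands in for the DataStream. In a Flink job the same conditions would presumably be passed to `DataStream.filter` instead.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for MyPOJO: only the two key fields are modelled.
final class KeyedEvent {
    final String primaryKey;   // may be null
    final String alternateKey; // may be null

    KeyedEvent(String primaryKey, String alternateKey) {
        this.primaryKey = primaryKey;
        this.alternateKey = alternateKey;
    }
}

final class SplitPredicates {
    // One predicate per stream from the list of steps (AStream, BStream, CStream).
    static final Predicate<KeyedEvent> HAS_PRIMARY = e -> e.primaryKey != null;
    static final Predicate<KeyedEvent> HAS_ALTERNATE = e -> e.alternateKey != null;
    static final Predicate<KeyedEvent> HAS_BOTH = HAS_PRIMARY.and(HAS_ALTERNATE);

    // Keeps the events that satisfy the given predicate, preserving order.
    static List<KeyedEvent> select(List<KeyedEvent> events, Predicate<KeyedEvent> p) {
        return events.stream().filter(p).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<KeyedEvent> events = List.of(
                new KeyedEvent("p1", null),
                new KeyedEvent(null, "a1"),
                new KeyedEvent("p2", "a2"));
        System.out.println("primary:   " + select(events, HAS_PRIMARY).size());
        System.out.println("alternate: " + select(events, HAS_ALTERNATE).size());
        System.out.println("both:      " + select(events, HAS_BOTH).size());
    }
}
```

Note that an event carrying both keys satisfies all three predicates, so it lands in all three streams, exactly as it does with the flatMap version.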

Gabriel
  • What is the logic for two records to be joined? Is it `((primary == primary) OR (alternative == alternative))`? Or is there additional logic when both primary & alternative keys exist? – kkrugler May 13 '21 at 14:03
  • Right now, yes: it's prime = prime, alternate = alternate. What I'd rather have is: if not prime = prime, then alternate = alternate – Gabriel May 13 '21 at 14:54
  • Some questions: why are you using a RichFlatMap if you can use a simple FilterFunction? https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/common/functions/FilterFunction.html And, if you want to manage complex keys, I think you can use the KeySelector functions ( https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/java/functions/KeySelector.html ) – Antonio Miranda May 14 '21 at 08:50
  • Yes, a simple filter would be more appropriate; however, that doesn't affect the solution. A key selector needs to be deterministic, so it will not return the correlated identities – Gabriel May 16 '21 at 10:13
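
To make the matching rule from the comments concrete, here is a minimal sketch of one possible reading of it: compare primary keys when both records have one, otherwise fall back to the alternate keys. The names `SensorEvent`, `MatchRule`, and `matches` are hypothetical, and the fallback behaviour is an assumption, since the thread never pins the rule down.

```java
// Hypothetical stand-in for MyPOJO: only the two key fields are modelled.
final class SensorEvent {
    final String primaryKey;   // may be null
    final String alternateKey; // may be null

    SensorEvent(String primaryKey, String alternateKey) {
        this.primaryKey = primaryKey;
        this.alternateKey = alternateKey;
    }
}

final class MatchRule {
    // Assumed rule: use the primary keys when both sides have one,
    // otherwise fall back to comparing the alternate keys.
    static boolean matches(SensorEvent a, SensorEvent b) {
        if (a.primaryKey != null && b.primaryKey != null) {
            return a.primaryKey.equals(b.primaryKey);
        }
        // A missing alternate key on either side never matches.
        return a.alternateKey != null && a.alternateKey.equals(b.alternateKey);
    }

    public static void main(String[] args) {
        SensorEvent full = new SensorEvent("p1", "a1");
        System.out.println(matches(full, new SensorEvent("p1", "a2"))); // primary keys decide
        System.out.println(matches(full, new SensorEvent(null, "a1"))); // falls back to alternate
        System.out.println(matches(full, new SensorEvent("p2", "a1"))); // primary keys decide
    }
}
```

A rule like this is a pairwise comparison, not a function of a single record, which is why it cannot be expressed directly as a deterministic key selector.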

1 Answer


You should implement a custom POJO that contains the primary & secondary keys. It needs to have equals() and hashCode() methods, which implement your required logic(*) of when two records are equal. See hashCode() and equals() method for custom classes in flink for more details on why you have to do this.

Add a MyPOJO.getJoiningKey() that returns this custom POJO.

Then just do a single join based on .where(r -> r.getJoiningKey()).equalTo(r -> r.getJoiningKey()).

(*) I'm still not sure of what you want your logic to be. E.g. if left-side primary & secondary key is not-null, and right-side primary key is null but secondary key is not-null, what would you want to compare?

kkrugler
  • So what you're saying is, if I override the hashCode and equals methods, then I can keyBy a composite object. That's brilliant, this might work. I will try it out. Basically I just want to group states by the same key or alternate key. I am using different types of sensors: some will return the PK in the event, while others can only return a relative ID, a.k.a. the alternate key. The above logic is a simple example. Should I elaborate further? – Gabriel May 16 '21 at 10:55
  • Seems like it's worth a try, at least. Pretty easy to create a unit test to validate. – kkrugler May 17 '21 at 14:36
  • Did it work? Adding a follow-up is very helpful for others who might run into this same situation, thanks. – kkrugler Jun 18 '21 at 18:39
  • Hi, unfortunately it didn't work: the hash value needs to be deterministic, and sometimes one of the key fields won't have a value, so the hash won't be the same and the keys will not be treated as the same object. Thank you for the suggestion though – Gabriel Jun 20 '21 at 07:58
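
The failure described in the last comment can be reproduced outside Flink. The sketch below is a plain-Java illustration, not the code from the thread: `JoiningKey` is a hypothetical composite key whose equals() treats two keys as equal when either the primary or the alternate part matches, with a field-based hashCode(). A `HashMap` stands in for any hash-based grouping, which is presumably what keyed operations rely on as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical composite key with "primary OR alternate" equality.
final class JoiningKey {
    final String primary;   // may be null
    final String alternate; // may be null

    JoiningKey(String primary, String alternate) {
        this.primary = primary;
        this.alternate = alternate;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof JoiningKey)) {
            return false;
        }
        JoiningKey other = (JoiningKey) o;
        boolean samePrimary = primary != null && primary.equals(other.primary);
        boolean sameAlternate = alternate != null && alternate.equals(other.alternate);
        return samePrimary || sameAlternate;
    }

    @Override
    public int hashCode() {
        // Depends on both fields, so a missing field changes the hash.
        return Objects.hash(primary, alternate);
    }
}

final class JoiningKeyDemo {
    public static void main(String[] args) {
        JoiningKey full = new JoiningKey("p1", "a1");
        JoiningKey alternateOnly = new JoiningKey(null, "a1");
        JoiningKey primaryOnly = new JoiningKey("p1", null);

        Map<JoiningKey, String> state = new HashMap<>();
        state.put(full, "state for p1/a1");

        System.out.println("equal:       " + full.equals(alternateOnly));
        System.out.println("same hash:   " + (full.hashCode() == alternateOnly.hashCode()));
        System.out.println("lookup:      " + state.get(alternateOnly));
        // OR-equality is also not transitive: both partial keys equal `full`,
        // but they do not equal each other.
        System.out.println("transitive:  " + primaryOnly.equals(alternateOnly));
    }
}
```

Two keys that are equal but hash differently break the hashCode() contract, so the lookup misses. The only hash consistent with this kind of OR-equality is a constant, which would place every record in the same group.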