Flink operator design advice

Question

I have some data that I need to process and aggregate, ideally avoiding data skew but I am unsure how to design a topology that would do so.

As an example, my data would look like this:

struct Object {
    int a
    int b
    int c
}

Now, the goal is to aggregate data via combinations of int a and b, and determine how many times int c has occured. So as example, say we have data:

Object(1,2,3)
Object(1,2,1)
Object(1,4,1)
Object(1,4,1)

The final goal is to look like

Result(1,2,{3:1, 1:1}) // Key 3 and 1 both occured only once
Result(1,4,{1:2})     // Key 1 has occured twice in the datastream

My current approach, is to KeyBy a Tuple2 of int a and b, then use a session window to aggregate the occurences of int c and return output. I am using a session window because the same int a and b combinations happen during similar time frames, one can understand this as an user's action.

However, this presents a dataskew problem as I am struggling to distribute workload evenly. Is there a better design to approach this problem? Thanks.

score 0 · Accepted Answer · answered Apr 15 '23 at 22:15

0

If you know for sure that data skew is a problem, then you could first do a count of occurrences using all three values as the key, and then do a second aggregation of those results with a keyBy of just the first two values.

Though this would only help if you know that you're getting a reasonable number of occurrences for each unique key from all three values.

answered Apr 15 '23 at 22:15

kkrugler

8,145
6
24
18

Yes I have already encountered data skew. I see that your solution is to further increase variety via combining all three values, then aggregate? Can you help explain how that would help with reducing data skew, since I still see that this process aggregates by the first two values again. Thanks! – Baiqing Apr 17 '23 at 17:29
As an extreme example, if you have occurring 1M times, and occurring 1M times, the first step would use two groups and give you single records with and . Then when you key by , you'd only be processing 2 records, not 2M records. – kkrugler Apr 18 '23 at 23:10

Flink operator design advice

1 Answers1