1

My question is whether custom classes in Flink with Java need to override the hashCode() and equals() methods. I have read on this page that hashCode() MUST never be implemented in distributed systems, and Apache Flink is one of them.

Example: I have this class:

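import java.sql.Timestamp; // assumed: the original does not show its imports
import java.util.Date;     // assumed: could also be java.sql.Date

public class EventCounter {
    public String Id;
    public long count;
    public Timestamp firstEvent;
    public Timestamp lastEvent;
    public Date date;

    // Flink's POJO rules require a public no-argument constructor.
    public EventCounter() {
    }
}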

Do I need to implement hashCode() and equals() for this kind of class in Flink, or is it better for performance to let Flink manage those methods on its own?

Kind regards!

David Anderson
Alter

3 Answers

3

Types that you want to use as keys in Flink (i.e., as values you return from a KeySelector) must have valid implementations of hashCode and equals. In particular, hashCode must be deterministic across JVMs (which is why arrays and enums don't work as keys in Flink).
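To make the determinism requirement concrete, here is a minimal sketch in plain Java (no Flink dependency). `EventKey`, `Level`, and `KeyHashDemo` are hypothetical names invented for illustration; they are not part of the question or of Flink's API. The sketch builds a hash code only from `String` and `int` values, whose hash codes are specified by the Java API, and contrasts that with the identity-based hashing used by enums and arrays.

```java
import java.util.Objects;

// Hypothetical composite key type, not part of the original question.
final class EventKey {
    private final String id;
    private final int shard;

    EventKey(String id, int shard) {
        this.id = id;
        this.shard = shard;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EventKey)) return false;
        EventKey other = (EventKey) o;
        return shard == other.shard && Objects.equals(id, other.id);
    }

    // Derived only from a String and an int, whose hash codes are specified
    // by the Java API, so the result is the same in every JVM.
    @Override
    public int hashCode() {
        return Objects.hash(id, shard);
    }
}

// Hypothetical enum used only to show identity-based hashing.
enum Level { LOW, HIGH }

class KeyHashDemo {
    public static void main(String[] args) {
        EventKey a = new EventKey("sensor-1", 3);
        EventKey b = new EventKey("sensor-1", 3);
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true

        // Enum.hashCode() is the identity hash and may differ between JVM runs;
        // name().hashCode() is a String hash and is stable.
        System.out.println(Level.LOW.hashCode());
        System.out.println(Level.LOW.name().hashCode());

        // Arrays also inherit identity-based equals/hashCode from Object.
        int[] x = {1, 2};
        int[] y = {1, 2};
        System.out.println(x.equals(y));                  // false
    }
}
```

If an enum value has to take part in a key, one option under these assumptions is to key by its `name()` (or `ordinal()`) rather than by the enum object itself.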

David Anderson
  • Let's assume this scenario for the same class above: `DataStream stream = env.addSource(...); KeyedStream keyed = stream.keyBy(k -> k.Id); keyed.flatMap(new customRichFlatMapClass())` or `keyed.window(TumblingEventTimeWindows).process(new ProcessFunctionClass());` The classes `customRichFlatMapClass` and `ProcessFunctionClass` work with state, and I have a `final TypeInformation info = Types.POJO(EventCounter.class);` to serialize the object into the state. Do I need `hashCode()` and `equals()` in `EventCounter`? Thanks – Alter Sep 01 '20 at 14:25
  • No, you do not. The only case where you would need to think about it is if you want to do `stream.keyBy(k -> k)`. – David Anderson Sep 01 '20 at 15:23
  • Thanks a lot @David. That was my understanding after researching this at length, but I wanted to be sure that this is exactly how it works, because otherwise I would be spending extra CPU on creating those hashes without needing them. Thanks a lot once more. – Alter Sep 01 '20 at 15:34
  • I agree with David's answer. However, in Java it is good practice to override hashCode/equals for any business POJO. Regarding `I'm having an extra CPU utilization without been needed to create those hashes`: I think CPU is spent on them only when you (or internal code) call those methods. And when they are called, you get the desired result only if you have overridden them. – Jaya Ananthram Sep 05 '20 at 18:48
0

Before writing the two methods, think about whether your class needs its equals() to be symmetric, transitive, and consistent.

These methods are designed for hash-based algorithms, so you need to make sure you implement them properly. As a side note, computing a hash code costs CPU time.

prostý člověk
0

The hashCode() and equals() methods need to be implemented only when the object/class itself is going to be used as a key in Flink, for example:

DataStream<EventCounter> stream = env.addSource(...);
// k is the EventCounter object itself, so the key type is EventCounter
KeyedStream<EventCounter, EventCounter> keyed = stream.keyBy(k -> k);
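As a rough illustration of why the whole-object case differs, the sketch below uses a plain `HashMap` as a stand-in for per-key state. This is an assumption made for illustration only and does not show Flink's actual key handling. `PlainCounter`, `KeyableCounter`, and `WholeObjectKeyDemo` are hypothetical, trimmed-down stand-ins for `EventCounter`. A second object with the same field values plays the role of a copy of the key, such as one produced after serialization.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for EventCounter WITHOUT equals/hashCode.
class PlainCounter {
    public String id;
    public long count;

    PlainCounter(String id, long count) {
        this.id = id;
        this.count = count;
    }
}

// Same fields, but with value-based equals/hashCode.
class KeyableCounter {
    public String id;
    public long count;

    KeyableCounter(String id, long count) {
        this.id = id;
        this.count = count;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof KeyableCounter)) return false;
        KeyableCounter other = (KeyableCounter) o;
        return count == other.count && Objects.equals(id, other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, count);
    }
}

class WholeObjectKeyDemo {
    // Looks up a second object with identical field values; identity-based
    // equality means it is treated as a different key.
    static boolean plainLookupWorks() {
        Map<PlainCounter, String> state = new HashMap<>();
        state.put(new PlainCounter("a", 1L), "state for a");
        return state.containsKey(new PlainCounter("a", 1L));
    }

    // Same lookup, but value-based equality finds the entry.
    static boolean keyableLookupWorks() {
        Map<KeyableCounter, String> state = new HashMap<>();
        state.put(new KeyableCounter("a", 1L), "state for a");
        return state.containsKey(new KeyableCounter("a", 1L));
    }

    public static void main(String[] args) {
        System.out.println(plainLookupWorks());   // false
        System.out.println(keyableLookupWorks()); // true
    }
}
```

Note that the fields feeding hashCode() here are mutable, so changing `count` on an object already used as a key would change its hash. Keying by a stable field such as the id avoids this.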
Alter