
Let's assume I have a job with max.parallelism=4 and a RichFlatMapFunction that works with MapState. What is the best way to create the MapStateDescriptor? Should I create it inside the RichFlatMapFunction, which means each instance of the class has its own descriptor, or create a single instance, for example a public static MapStateDescriptor descriptor in one class, and reference it from the RichFlatMapFunction? Doing it that way I would have just one MapStateDescriptor instead of 4, or did I misunderstand something?

Kind regards!

Alter

1 Answer


A few points...

  1. Since each of your RichFlatMapFunction sub-tasks can be running in a different JVM on a different server, how would they share a static MapStateDescriptor?
  2. Note that Flink's "max parallelism" isn't the same as the default environment parallelism. In general you want to leave the max parallelism value alone, and (if necessary) set your environment parallelism equal to the number of slots in your cluster.
  3. The MapStateDescriptor doesn't store state. It tells Flink how to create the state. Your RichFlatMapFunction's open() method is where you'll create the state using the state descriptor.
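To make point 3 concrete, here is a toy model in plain Java with no Flink dependency. `ToyDescriptor`, `ToySubtask`, and `DescriptorDemo` are hypothetical stand-ins, not Flink classes; the sketch only illustrates that a descriptor is a recipe, so sharing one descriptor does not share any state.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a state descriptor: immutable metadata only, no data.
final class ToyDescriptor {
    final String name;

    ToyDescriptor(String name) {
        this.name = name;
    }
}

// Toy stand-in for one parallel subtask of an operator.
final class ToySubtask {
    private String stateName;
    private Map<String, Long> state; // created per instance in open()

    void open(ToyDescriptor descriptor) {
        // The descriptor is only read here; the map belongs to this subtask alone.
        stateName = descriptor.name;
        state = new HashMap<>();
    }

    long increment(String key) {
        return state.merge(key, 1L, Long::sum);
    }
}

final class DescriptorDemo {
    // One descriptor shared by every subtask in this JVM.
    static final ToyDescriptor SHARED = new ToyDescriptor("counts");

    // Opens two subtasks with the same descriptor and returns each one's count for "x".
    static long[] run() {
        ToySubtask a = new ToySubtask();
        ToySubtask b = new ToySubtask();
        a.open(SHARED);
        b.open(SHARED);
        a.increment("x");
        long aCount = a.increment("x"); // second increment on a
        long bCount = b.increment("x"); // first increment on b
        return new long[] {aCount, bCount};
    }

    public static void main(String[] args) {
        long[] counts = run();
        System.out.println("a=" + counts[0] + " b=" + counts[1]); // prints "a=2 b=1"
    }
}
```

Both subtasks use the same descriptor object, yet their counts stay independent, which is why making the descriptor static changes nothing about how much state exists.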

So net-net: don't bother using a static MapStateDescriptor; it won't help. Just create your state (as in many examples) in your open() method.
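A minimal sketch of that pattern follows, assuming the Flink 1.x DataStream API, where `open(Configuration)` is the lifecycle hook (newer releases changed the `open` signature, so adjust as needed). The class name `WordCounter`, the state name `"counts"`, and the counting logic are hypothetical, chosen only for illustration.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical example: counts how often each word is seen for the current key.
// Keyed state requires a keyed stream, e.g. stream.keyBy(...).flatMap(new WordCounter()).
class WordCounter extends RichFlatMapFunction<String, Tuple2<String, Long>> {

    // Instance field, not static: each parallel subtask holds its own state handle.
    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration parameters) throws Exception {
        // The descriptor only names the state and declares its key/value types.
        MapStateDescriptor<String, Long> descriptor =
                new MapStateDescriptor<>("counts", String.class, Long.class);
        // Flink creates (or restores) the actual state for this subtask here.
        counts = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(String word, Collector<Tuple2<String, Long>> out) throws Exception {
        Long previous = counts.get(word); // null if this word is new for the current key
        long updated = (previous == null) ? 1L : previous + 1L;
        counts.put(word, updated);
        out.collect(Tuple2.of(word, updated));
    }
}
```

Because the descriptor is used only inside `open()`, declaring it as a local variable keeps it next to its single use; moving it to a static field would compile and behave the same.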

kkrugler
  • To add: it's really important that you don't share the resulting `MapState` across threads/function instances - it's not thread-safe. Make sure you have a normal instance field per function, such that each task thread has its own `MapState`. For descriptors, it's not uncommon to share them across all threads in one task manager, but it's really not saving much (a few bytes of main memory). – Arvid Heise Sep 10 '20 at 20:19
  • I'm not sharing a `MapState` across threads; I have one instance of `MapState` per subtask. I was wondering about `MapStateDescriptor`: I only need to declare one to tell my subtasks how their `MapState` should behave, so I'm not sure whether I need one `MapStateDescriptor` per subtask or whether it's better to have just one. – Alter Sep 11 '20 at 13:39
  • The state descriptor is only used once (per sub-task), in the `open()` method, so it doesn't matter whether it's a static or not. – kkrugler Sep 11 '20 at 21:15
  • You will declare the state descriptor only once but Flink internally will create different Descriptors for each subtask based on the different keys – Haseeb Asif Sep 15 '20 at 18:44
  • @kkrugler is there a reason the descriptor is instantiated in the `open()` method and not in the field definition? I noticed all the docs do this but have not seen why – Manos Ntoulias Sep 13 '22 at 14:51
  • No specific reason why it's instantiated in the `open()` method, other than statics are typically used when you've got multiple references to the same thing. So if you only use it in the open method, you might as well create it in the same method. – kkrugler Sep 14 '22 at 15:58