TL;DR: I am running a Flink streaming job in batch execution mode on EMR. I have tried several EMR cluster configurations, but none of them works as required, and some do not work at all. The workflow is very network-intensive, which causes most of the problems.

Question: Which EMR cluster configuration (EC2 instance types) would you recommend for this use case?

--

The job has the following stages:

  1. Read from MySQL
  2. KeyBy user_id
  3. Reduce by user_id
  4. Async I/O enriching from Redis
  5. Async I/O enriching from other Redis
  6. Async I/O enriching from REST #1
  7. Async I/O enriching from REST #2
  8. Async I/O enriching from REST #2
  9. Write to Elasticsearch

Other info:

  • Flink version: 1.13.1
  • EMR version: 6.4.0
  • Java version: JDK Corretto-8.302.08.1 (provided by EMR)
  • Input data size: ~800 GB
  • Output data size: ~300 GB

Flink configuration:

  • "taskmanager.network.sort-shuffle.min-parallelism": 1
  • "taskmanager.memory.framework.off-heap.batch-shuffle.size": 256m
  • "taskmanager.network.sort-shuffle.min-buffers": 2048
  • "taskmanager.network.blocking-shuffle.compression.enabled": true
  • "taskmanager.memory.framework.off-heap.size": 512m
  • "taskmanager.memory.network.max": 2g
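To sanity-check these values against each other, here is a minimal sketch in plain Java (no Flink dependency). It rests on two things I am assuming rather than have verified on my cluster: that, as I read the Flink 1.13 docs, the batch-shuffle read budget is taken out of framework off-heap memory, and that the memory segment size is the Flink default of 32 KiB (I have not set it explicitly). The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: cross-checks the shuffle-related settings listed above.
class ShuffleConfigCheck {

    // Parses sizes written like "32k", "256m" or "2g" into bytes (binary units).
    static long parseBytes(String size) {
        String s = size.trim().toLowerCase();
        long value = Long.parseLong(s.substring(0, s.length() - 1));
        switch (s.charAt(s.length() - 1)) {
            case 'k': return value << 10;
            case 'm': return value << 20;
            case 'g': return value << 30;
            default: throw new IllegalArgumentException("unsupported size: " + size);
        }
    }

    // The settings exactly as listed in the question.
    static Map<String, String> questionConfig() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("taskmanager.network.sort-shuffle.min-parallelism", "1");
        conf.put("taskmanager.memory.framework.off-heap.batch-shuffle.size", "256m");
        conf.put("taskmanager.network.sort-shuffle.min-buffers", "2048");
        conf.put("taskmanager.network.blocking-shuffle.compression.enabled", "true");
        conf.put("taskmanager.memory.framework.off-heap.size", "512m");
        conf.put("taskmanager.memory.network.max", "2g");
        return conf;
    }

    // Assumption: the batch-shuffle budget must not exceed framework off-heap memory.
    static boolean batchShuffleFits(Map<String, String> conf) {
        long shuffle = parseBytes(conf.get("taskmanager.memory.framework.off-heap.batch-shuffle.size"));
        long offHeap = parseBytes(conf.get("taskmanager.memory.framework.off-heap.size"));
        return shuffle <= offHeap;
    }

    // Bytes covered by sort-shuffle.min-buffers for a given segment size (assumed "32k").
    static long minSortBufferBytes(Map<String, String> conf, String segmentSize) {
        long buffers = Long.parseLong(conf.get("taskmanager.network.sort-shuffle.min-buffers"));
        return buffers * parseBytes(segmentSize);
    }

    public static void main(String[] args) {
        Map<String, String> conf = questionConfig();
        System.out.println("batch-shuffle fits in framework off-heap: " + batchShuffleFits(conf));
        System.out.println("min sort buffers in MiB (32 KiB segments): "
                + (minSortBufferBytes(conf, "32k") >> 20));
    }
}
```

Under those assumptions, 2048 buffers of 32 KiB amount to 64 MiB, which is well inside the 2g network memory cap.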

Configurations we tried:

#1

master: r6g.xlarge

core: r6g.xlarge (price per hour: $0.2; CPU: 4; RAM: 32 GiB; disk: EBS 128 GB; network: 1.25 Gigabit baseline with burst up to 10 Gigabit)

min_scale: 2

max_scale: 25

  • expected: finishes within 24 hours
  • actual: works with sort-based shuffling enabled, but very slowly (~36h). This instance type has baseline and burst network performance; once the burst credits are exhausted, bandwidth drops to the 1.25 Gigabit baseline, which slows down I/O. With hash-based shuffling, the job fails on KeyBy -> Reduce with "Connection reset by peer": the Task Manager fails, then the job fails, and the Job Manager is not able to restart it.
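For scale, here is a rough back-of-the-envelope sketch of the average throughput needed to move the data within a given time. It is a simplification: it counts only the ~800 GB read and the ~300 GB write once each, averaged over the whole cluster, and ignores the KeyBy shuffle, the enrichment calls, and any per-node hotspots, so real traffic is higher. The class and method names are made up for illustration.

```java
// Hypothetical helper: average throughput needed to move a data volume in a time window.
class ThroughputEstimate {

    // Average megabits per second needed to move `gigabytes` (decimal GB) in `hours`.
    static double requiredMbps(double gigabytes, double hours) {
        double megabits = gigabytes * 1000.0 * 8.0; // GB -> MB -> Mbit
        return megabits / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double inputGb = 800.0;   // input size from the question
        double outputGb = 300.0;  // output size from the question
        double total = inputGb + outputGb;
        System.out.printf("24 h target:   %.1f Mbit/s%n", requiredMbps(total, 24.0));
        System.out.printf("36 h observed: %.1f Mbit/s%n", requiredMbps(total, 36.0));
    }
}
```

With these numbers the 24-hour target works out to roughly 100 Mbit/s averaged across the cluster, before shuffle and enrichment traffic are added on top.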

#2

master: m5.xlarge

core: r6g.12xlarge (price per hour: $2.4; CPU: 48; RAM: 384 GiB; disk: EBS 1.5 TB; network: 20 Gigabit)

min_scale: 1

max_scale: 4

  • expected: finishes within 24 hours, as there is much higher network bandwidth
  • actual: does not work. With sort-based shuffling, it fails in the write phase with the exception "Failed to transfer file from TaskExecutor". With hash-based shuffling, it fails at the same stage with "Connection reset by peer".
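To compare the two setups side by side, here is a small sketch that totals the core-node resources at max_scale. It assumes both clusters actually scale out to their maximum, leaves the master node out, and uses the per-node figures quoted above (baseline bandwidth for r6g.xlarge). The class name and fields are made up for illustration.

```java
// Hypothetical helper: totals for a core-node group at maximum scale.
class ClusterOption {
    final String instanceType;
    final int maxNodes;
    final int vcpusPerNode;
    final int ramGiBPerNode;
    final double usdPerNodeHour;
    final double gbitPerNode;

    ClusterOption(String instanceType, int maxNodes, int vcpusPerNode,
                  int ramGiBPerNode, double usdPerNodeHour, double gbitPerNode) {
        this.instanceType = instanceType;
        this.maxNodes = maxNodes;
        this.vcpusPerNode = vcpusPerNode;
        this.ramGiBPerNode = ramGiBPerNode;
        this.usdPerNodeHour = usdPerNodeHour;
        this.gbitPerNode = gbitPerNode;
    }

    int totalVcpus() { return maxNodes * vcpusPerNode; }
    int totalRamGiB() { return maxNodes * ramGiBPerNode; }
    double totalUsdPerHour() { return maxNodes * usdPerNodeHour; }
    double totalGbit() { return maxNodes * gbitPerNode; }

    String summary() {
        return String.format("%s x%d: %d vCPU, %d GiB RAM, %.2f Gbit aggregate, $%.2f/h",
                instanceType, maxNodes, totalVcpus(), totalRamGiB(), totalGbit(), totalUsdPerHour());
    }

    public static void main(String[] args) {
        // Per-node figures as quoted in the question; 1.25 Gbit is the r6g.xlarge baseline.
        ClusterOption first = new ClusterOption("r6g.xlarge", 25, 4, 32, 0.2, 1.25);
        ClusterOption second = new ClusterOption("r6g.12xlarge", 4, 48, 384, 2.4, 20.0);
        System.out.println("#1 " + first.summary());
        System.out.println("#2 " + second.summary());
    }
}
```

At max scale this puts #1 at 100 vCPUs and 800 GiB RAM, and #2 at 192 vCPUs and 1536 GiB RAM, so #2 is larger on paper even though it is the one that does not finish.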
