In my pipeline I am using PyFlink to load and transform data from an RDS database and sink it into MySQL. Using Flink CDC I am able to get the data I want from RDS, and with the JDBC connector I sink it into MySQL. My aim is to read one table and create 10 others from it in a single job, using a sample of the code below (basically breaking a huge table into smaller tables).

The problem I am facing is that, despite using RocksDB as the state backend and setting Flink CDC options such as scan.incremental.snapshot.chunk.size, scan.snapshot.fetch.size and debezium.min.row.count.to.stream.results, the memory usage keeps growing, causing a TaskManager with 2 GB of memory to fail. My intuition here is that a simple select-insert query loads the whole table into memory no matter what! If so, can I somehow avoid that? The table size is around 500k rows.
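
For context, the RocksDB state backend is typically enabled on the execution environment in PyFlink roughly like this (the checkpoint path and interval below are placeholders, not the job's actual settings):

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import RocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()
# keyed state lives in embedded RocksDB instances (backed by local disk) instead of
# the JVM heap; RocksDB itself still needs memory for memtables, block cache, etc.
env.set_state_backend(RocksDBStateBackend("file:///tmp/flink-checkpoints"))
env.enable_checkpointing(60000)  # checkpoint every 60 s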

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
stmt_set = t_env.create_statement_set()


# MySQL CDC source (connection values anonymized)
create_kafka_source = (
        """
        CREATE TABLE somethin(
            bla INT,
            bla1 DOUBLE,
            bla2 TIMESTAMP(3),
            PRIMARY KEY(bla2) NOT ENFORCED
        ) WITH (
        'connector' = 'mysql-cdc',
        'server-id' = '1000',
        'debezium.snapshot.mode' = 'when_needed',
        'debezium.poll.interval.ms' = '5000',
        'hostname' = 'som2',
        'port' = 'som2',
        'database-name' = 'som3',
        'username' = 'som4',
        'password' = 'somepass',
        'table-name' = 'atable'
        )
        """
    )
# JDBC sink (one of the smaller target tables)
create_kafka_dest = (
        """
        CREATE TABLE IF NOT EXISTS atable(
            time1 TIMESTAMP(3),
            blah2 DOUBLE,
            PRIMARY KEY(time1) NOT ENFORCED
        ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://name1:3306/name1',
        'table-name' = 't1',
        'username' = 'user123',
        'password' = '123'
        )
        """
    )



t_env.execute_sql(create_kafka_source)
t_env.execute_sql(create_kafka_dest)

stmt_set.add_insert_sql(
    "INSERT INTO atable SELECT DISTINCT bla2, bla1 "
    "FROM somethin"
)

# submit all buffered INSERTs as a single job
stmt_set.execute()

– gy_e-Aa

1 Answer

Using DISTINCT in a streaming query is expensive, especially when there aren't any temporal constraints on the distinctiveness (e.g., counting unique visitors per day). I imagine that's why your query needs a lot of state.
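
One common way to put a temporal bound on this kind of state, purely as an illustration, is Flink SQL's idle state retention (state TTL); note that it changes the semantics of DISTINCT, since a key that reappears after its state has expired is treated as new:

from datetime import timedelta

# assumes t_env is the StreamTableEnvironment from the job above
t_env.get_config().set_idle_state_retention_time(
    timedelta(hours=12),  # keep idle state at least this long
    timedelta(hours=24),  # clean it up at the latest after this
)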

However, you should be able to get this to work. RocksDB isn't always well-behaved; sometimes it will consume more memory than it has been allocated.

What version of Flink are you using? Improvements were made in Flink 1.11 (by switching to jemalloc) and further improvements came in Flink 1.14 (by upgrading to a newer version of RocksDB). So upgrading Flink might fix this. Otherwise you may need to basically lie and tell Flink it has somewhat less memory than it actually has, so that when RocksDB steps out of bounds it doesn't cause out-of-memory errors.
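
A minimal sketch of that last idea, assuming a standalone/dockerized TaskManager whose container is capped at 2 GB (the numbers are illustrative, not a recommendation):

# flink-conf.yaml
taskmanager.memory.process.size: 1600m      # declare less than the 2 GB the container actually has
state.backend.rocksdb.memory.managed: true  # keep RocksDB within Flink's managed memory (the default)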

– David Anderson
  • David, thanks for the reply. To be precise, I am using Flink 1.13 in a dockerized environment. Removing DISTINCT (and relying on the primary key) does exactly the same thing: it seems to load the whole table into memory. As far as I understand, a select-insert should not use any state at all! Can I somehow tell Flink to drop the data from state once it has been sunk to MySQL? – gy_e-Aa Nov 25 '21 at 08:19
  • I guess I misdiagnosed the problem. Another hypothesis: perhaps the problem is that Flink SQL is materializing the end result of the incoming CDC/update stream. That would explain why it's storing the entire table in Flink state. Can you share the result of using EXPLAIN so we can see what the execution plan looks like? – David Anderson Nov 25 '21 at 09:33
  • Dear David, you can find the execution plan attached here: https://gist.github.com/Giwrgosyea/8dcfb28c2b797865eb137dc2a00a46c7 – gy_e-Aa Nov 25 '21 at 10:03
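
For reference, the plan David asks for can be produced from the same job, for example like this (run before stmt_set.execute(), while the INSERT is still buffered):

# prints the optimized plan for every buffered INSERT in the statement set
print(stmt_set.explain())

# or for a single statement
print(t_env.explain_sql(
    "INSERT INTO atable SELECT DISTINCT bla2, bla1 FROM somethin"
))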