In my pipeline I am using PyFlink to load and transform data from an RDS database and sink it to MySQL. With Flink CDC I can read the data I want from RDS, and with the JDBC connector I sink it to MySQL. My aim is to read 1 table and create 10 others from it in a single job, using a sample of the code below (basically breaking a huge table into smaller tables).

The problem I am facing is that, despite using RocksDB as the state backend and setting Flink CDC options such as scan.incremental.snapshot.chunk.size, scan.snapshot.fetch.size and debezium.min.row.count.to.stream.results, memory usage keeps growing until a TaskManager with 2 GB of memory fails. My intuition is that a simple select-insert query loads the whole table into memory no matter what. If so, can I somehow avoid that? The table is around 500k rows.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
stmt_set = t_env.create_statement_set()
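# Not part of the original sample: a minimal sketch of how RocksDB is enabled as the
# state backend from PyFlink (assumes Flink 1.13+ where EmbeddedRocksDBStateBackend is
# available; the same can be done with state.backend: rocksdb in flink-conf.yaml).
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend
env.set_state_backend(EmbeddedRocksDBStateBackend())
env.enable_checkpointing(60000)  # periodic checkpoints, recommended for the mysql-cdc source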
create_kafka_source = (
    """
    CREATE TABLE somethin(
        bla INT,
        bla1 DOUBLE,
        bla2 TIMESTAMP(3),
        PRIMARY KEY(bla2) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'server-id' = '1000',
        'debezium.snapshot.mode' = 'when_needed',
        'debezium.poll.interval.ms' = '5000',
        'hostname' = 'som2',
        'port' = 'som2',
        'database-name' = 'som3',
        'username' = 'som4',
        'password' = 'somepass',
        'table-name' = 'atable'
    )
    """
)
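# Not shown above to keep the sample short: the tuning options mentioned in the question
# go into the same WITH clause, e.g. (values here are only illustrative):
#   'scan.incremental.snapshot.chunk.size' = '8096',
#   'scan.snapshot.fetch.size' = '1024',
#   'debezium.min.row.count.to.stream.results' = '1000'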
create_kafka_dest = (
    """
    CREATE TABLE IF NOT EXISTS atable(
        time1 TIMESTAMP(3),
        blah2 DOUBLE,
        PRIMARY KEY(time1) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://name1:3306/name1',
        'table-name' = 't1',
        'username' = 'user123',
        'password' = '123'
    )
    """
)
t_env.execute_sql(create_kafka_source)
t_env.execute_sql(create_kafka_dest)
stmt_set.add_insert_sql(
    "INSERT INTO atable SELECT DISTINCT bla2, bla1 "
    "FROM somethin"
)
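The other target tables are registered and filled the same way, and the whole set is submitted as one job, roughly like this (the extra table and variable names below are made up):

# further sinks would be added the same way, e.g. (hypothetical names):
# t_env.execute_sql(create_other_dest)
# stmt_set.add_insert_sql("INSERT INTO another_table SELECT DISTINCT bla2, bla1 FROM somethin")

# submit every INSERT in the statement set as a single Flink job
stmt_set.execute().wait()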