UPD: I actually found a Jira ticket that describes my problem: https://issues.apache.org/jira/browse/FLINK-30314. Waiting for its resolution...
I've run into a strange issue and I need to ask you guys whether I'm missing something. The real problem is parsing gzipped JSON in a plain file, but I've cut it down to a much simpler case:
I have a filesystem source with the raw format and a simple SQL query that counts lines. For an uncompressed test file of 1k lines, I get 1k as the result of the count. For the same file, gzipped in the terminal, I get 12.
The strangest thing is that when applied to a JSON log file (my initial task), Flink actually parses PART of the JSON objects from the gzipped file.
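For reference, the test input described above can be recreated like this (paths and the 1000-line content are placeholders, not my actual log file; `gzip -k` keeps the original alongside the archive):

```shell
# Build a 1000-line plain file and a gzipped copy of it.
seq 1000 > /tmp/flink_test.log
gzip -kf /tmp/flink_test.log            # writes /tmp/flink_test.log.gz

wc -l < /tmp/flink_test.log             # 1000 lines in the plain file
gunzip -c /tmp/flink_test.log.gz | wc -l  # 1000 -- the archive itself is intact
```

So the archive decompresses cleanly outside Flink; only the Flink source sees 12 rows.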
This is my code:
from pyflink.table import EnvironmentSettings, TableEnvironment


def main():
    # Streaming-mode environment; logs_path is defined elsewhere and points
    # at the directory containing the *.log(.gz) files.
    table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    table_env.execute_sql(f"""
        CREATE TABLE logs_source (
            raw_row STRING
        ) WITH (
            'connector' = 'filesystem',
            'path' = '{logs_path}',
            'source.monitor-interval' = '10',
            'format' = 'raw'
        )
    """)

    table_env.execute_sql("""
        CREATE TABLE print_sink (
            ip_number BIGINT NOT NULL
        ) WITH (
            'connector' = 'print'
        )
    """)

    table_env.execute_sql("""
        INSERT INTO print_sink
        SELECT
            COUNT(raw_row)
        FROM logs_source
    """).wait()
The documentation says that gzip is decoded on the fly, based on the file extension (my filenames look like *.log.gz).
I searched for any option or parameter to enable parsing of gzipped files specifically, but I failed to find one...
Flink version 1.16.0; I'm using PyFlink with Python 3.9.
What's the issue here? Thanks for any ideas!