
UPD: I actually found a Jira ticket which describes my problem: https://issues.apache.org/jira/browse/FLINK-30314. Waiting for its resolution...

I've run into a strange issue and need to ask whether I'm missing something. My real problem is parsing gzipped JSON from plain files, but I've cut it down to a much simpler case:

I have a filesystem source with the raw format and a simple SQL query that counts lines. For a non-compressed test file of 1k lines, I get 1k as the count. For the same file, gzipped from the terminal, I get 12.

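For reference, the test file is just a plain text file of 1,000 lines; roughly how it was produced (a sketch using Python's gzip module instead of the terminal's gzip; file names and line contents are placeholders):

import gzip
import shutil

# Write a 1000-line plain-text test file (contents are arbitrary placeholders).
with open("test.log", "w") as f:
    for i in range(1000):
        f.write(f"line {i}\n")

# Gzip a copy of it, keeping the .log.gz extension that Flink uses to detect compression.
with open("test.log", "rb") as src, gzip.open("test.log.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
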
The strangest thing is that when applied to a JSON log file (my initial task), Flink actually parses PART of the JSON objects from the gzipped file.
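
For context, the original JSON source table looks roughly like this (the column names here are placeholders, not my real schema):

table_env.execute_sql(f"""
    CREATE TABLE json_logs_source (
        ip STRING,
        msg STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = '{logs_path}',
        'source.monitor-interval' = '10',
        'format' = 'json'
    )
""")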

This is my SQL for the simplified (raw-format) case:

from pyflink.table import EnvironmentSettings, TableEnvironment


def main(logs_path):
    # Streaming mode, since the source directory is monitored for new files.
    table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Raw source: every record is one line of the file as a plain string.
    table_env.execute_sql(f"""
        CREATE TABLE logs_source (
            raw_row STRING
        ) WITH (
            'connector' = 'filesystem',
            'path' = '{logs_path}',
            'source.monitor-interval' = '10',
            'format' = 'raw'
        )
    """)

    table_env.execute_sql("""
        CREATE TABLE print_sink (
            ip_number BIGINT NOT NULL
        ) WITH (
            'connector' = 'print'
        )
    """)

    # Count all lines read from the source and print the running result.
    table_env.execute_sql("""
        INSERT INTO print_sink
            SELECT COUNT(raw_row)
            FROM logs_source
    """).wait()

The documentation says somewhere that gzip is decoded on the fly, based on the file extension (my filenames look like *.log.gz).
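
The archive itself is fine: since it was gzipped from the known-good 1k-line file, reading it back outside Flink gives all 1k lines. A quick sanity check with Python's gzip module (file name as in the repro sketch above):

import gzip

# Count lines in the gzipped test file outside of Flink.
with gzip.open("test.log.gz", "rt") as f:
    print(sum(1 for _ in f))  # prints 1000 for the 1k-line test file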

I searched for any option or parameter to enable parsing of gzipped files specifically, but found nothing.

Flink version 1.16.0, using PyFlink with Python 3.9.

What's the issue here? Thanks for any ideas!
