I'm using Flink 1.16.0 with Kotlin to read and process (snappy-compressed) Parquet files that were generated by Spark, and I keep running into ClassNotFoundException: org.apache.hadoop.conf.Configuration. The files are on Google Cloud Storage (gs://), but the problem also reproduces with MinIO and s3a://.

My jobs use the Table API to read the Parquet files, convert the table to a DataStream, and then do some stream processing into a (Kotlin) data class, which is persisted with AvroParquetWriters. A minimal version:

val query = """
    CREATE TABLE data (letter STRING, timedigit BIGINT) WITH (
        'connector' = 'filesystem',
        'path' = '${inputPath}',
        'format' = 'parquet'
    )
    """

tableEnv.executeSql(query)

val dataTable = tableEnv.from("data")

val parquetSink = FileSink
    .forBulkFormat(Path(outputPath), AvroParquetWriters.forReflectRecord(Data::class.java))
    .build()

tableEnv
    .toDataStream(dataTable)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.noWatermarks<Row>().withTimestampAssigner { row, _ -> row.getFieldAs("timedigit") }
    )
    .map { row -> Data(row.getFieldAs("timedigit"), row.getFieldAs("letter")) }
    .sinkTo(parquetSink)

env.execute("Read from and write to parquet")
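
For reference, Data is just a plain Kotlin data class whose shape follows from the map above (a sketch; the version in the repo may carry extra annotations):

// Target type for the stream, inferred from the map() call above (sketch).
data class Data(val timedigit: Long, val letter: String)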

My build.gradle is set up as described in the Flink docs and includes the necessary dependencies for Parquet:

dependencies {  
    implementation "org.apache.flink:flink-streaming-java:${flinkVersion}"  
    implementation "org.apache.flink:flink-table-api-java:${flinkVersion}"  
    implementation "org.apache.flink:flink-table-api-java-bridge:${flinkVersion}"  
    implementation "org.apache.flink:flink-clients:${flinkVersion}"  
    implementation "org.apache.flink:flink-connector-files:${flinkVersion}"  

    flinkShadowJar "org.apache.flink:flink-parquet:${flinkVersion}"  
    flinkShadowJar("org.apache.parquet:parquet-avro:1.12.2") {  
        exclude group: "org.apache.hadoop", module: "hadoop-client"  
        exclude group: "it.unimi.dsi", module: "fastutil"  
    }    flinkShadowJar ("org.xerial.snappy:snappy-java:1.1.8.4") {  
        exclude group: "org.osgi", module: "core"  
    }  

    runtimeOnly "org.apache.logging.log4j:log4j-slf4j-impl:${log4jVersion}"  
    runtimeOnly "org.apache.logging.log4j:log4j-api:${log4jVersion}"  
    runtimeOnly "org.apache.logging.log4j:log4j-core:${log4jVersion}"  
}
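
For completeness, flinkShadowJar is the custom configuration from the Flink Gradle quickstart, whose contents end up in the shadow jar but not on the provided compile classpath. Roughly (reproduced from memory; the exact exclude list in the repo may differ):

// Custom configuration for dependencies that go into the shadow jar,
// as in the Flink Gradle quickstart (sketch from memory).
configurations {
    flinkShadowJar

    // provided by the Flink runtime, so keep them out of the fat jar
    flinkShadowJar.exclude group: "com.google.code.findbugs", module: "jsr305"
    flinkShadowJar.exclude group: "org.slf4j"
    flinkShadowJar.exclude group: "org.apache.logging.log4j"
}

shadowJar {
    configurations = [project.configurations.flinkShadowJar]
}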

I'm running the job from a Docker image on native Kubernetes in application mode. This is the Dockerfile for the s3a case, where I provide the S3 Hadoop plugin in the plugins directory:

# PART 1 - Build jar via gradle  
FROM gradle:7.5.1-jdk11 AS gradle
WORKDIR /usr/src/app  
COPY build.gradle build.gradle  
COPY settings.gradle settings.gradle  
COPY src src    
RUN gradle clean installShadowDist  

# PART 2 - Create Flink container with jar
FROM flink:1.16.0-java11    
ARG kotlinVersion=1.7.21  
ARG flinkVersion=1.16.0  
USER flink  

## Provide Kotlin jar  
RUN wget -P $FLINK_HOME/lib https://repo1.maven.org/maven2/org/jetbrains/kotlin/kotlin-stdlib/${kotlinVersion}/kotlin-stdlib-${kotlinVersion}.jar  
  
## Provide Job jar  
RUN mkdir -p $FLINK_HOME/usrlib  
COPY --from=gradle /usr/src/app/build/libs/flink-parquet-batch-demo-1.0-all.jar $FLINK_HOME/usrlib/flink-parquet-batch-demo.jar  
  
## Provide S3 plugin  
RUN mkdir -p $FLINK_HOME/plugins/s3-fs-hadoop  
RUN cp $FLINK_HOME/opt/flink-s3-fs-hadoop-${flinkVersion}.jar $FLINK_HOME/plugins/s3-fs-hadoop/
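
For the gs:// case the image is identical except for the filesystem plugin, roughly as follows (a sketch; as far as I can tell the flink-gs-fs-hadoop jar ships in $FLINK_HOME/opt of the official image):

## Provide GCS plugin instead (gs:// variant of the image)
RUN mkdir -p $FLINK_HOME/plugins/gs-fs-hadoop
RUN cp $FLINK_HOME/opt/flink-gs-fs-hadoop-${flinkVersion}.jar $FLINK_HOME/plugins/gs-fs-hadoop/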

I'm using the same image to launch the job from within Kubernetes, i.e. I create a Pod based on this image, ssh into it, and launch the job via:

./bin/flink run-application \
    --target kubernetes-application \
    --class demo.Job \
    -Dkubernetes.cluster-id=flink-application-cluster \
    -Dkubernetes.container.image=flink-custom:latest \
    -Dkubernetes.service-account=flink-service-account \
    -Dparallelism.default=2 \
    -Dexecution.runtime-mode=BATCH \
    -Ds3.endpoint=http://172.17.0.4:9000 \
    -Ds3.path-style=true \
    -Ds3.access-key=minioadmin \
    -Ds3.secret-key=minioadmin \
    local:///opt/flink/usrlib/flink-parquet-batch-demo.jar \
    --input_path s3a://data/parquet-data/ \
    --output_path s3a://data/output-data/
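
The --input_path / --output_path arguments become the inputPath / outputPath used in the CREATE TABLE statement and the FileSink above. The parsing is roughly the following (a sketch assuming ParameterTool; the repo may do this slightly differently):

import org.apache.flink.api.java.utils.ParameterTool

// Sketch: map the CLI arguments onto the paths used in the pipeline above.
// Assumes ParameterTool; the actual parsing in the repo may differ.
fun main(args: Array<String>) {
    val params = ParameterTool.fromArgs(args)
    val inputPath = params.getRequired("input_path")   // e.g. s3a://data/parquet-data/
    val outputPath = params.getRequired("output_path") // e.g. s3a://data/output-data/
    // ... set up env / tableEnv and run the pipeline shown earlier ...
}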

What I've tried

Things I've tried that unfortunately lead to the same ClassNotFoundException:

  • Switched the entire project to Maven
  • Job submission via the Flink Kubernetes operator (instead of the scripts that ship with Flink)
  • Job submission via session mode (instead of application mode)
  • Running everything in Docker instead of k8s
  • Removed the parquet-avro dependency
  • Removed the snappy-java dependency
  • Left out the gs / s3a plugins
  • Included org.apache.flink:flink-hadoop-compatibility_2.12 via flinkShadowJar
  • Included org.apache.flink:flink-sql-parquet:1.16.0 via flinkShadowJar

When I include either hadoop-core or hadoop-client via flinkShadowJar, the exception disappears and the TaskManagers launch, but I then get an org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a" / No FileSystem for scheme "gs". Also, to my understanding, adding Hadoop dependencies to the job jar is discouraged anyway.
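
(For reference, that experiment was just one extra line in the dependencies block, along the lines of the following — the version here is only an example, not necessarily the one I used:)

flinkShadowJar "org.apache.hadoop:hadoop-client:3.3.4"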

I've pushed a minimal reproduction project to https://github.com/mcpeanutbutter/Flink-parquet-batch-demo that includes the full log and stack trace (output.log).

  • My initial thought was "I've seen something similar like this" and remember ticket https://issues.apache.org/jira/browse/FLINK-29729 - Could that be the issue for you too? – Martijn Visser Nov 28 '22 at 13:39
  • Thank you @martijn, yes, this might be strongly related! If the contents of `flink-conf.yaml` are ignored when the parquet reader is created, it might not be able to get a Hadoop conf at all. I wonder, however, if the PR referenced in the ticket would solve my issue entirely, since as far as I can tell, the "vanilla" `flink-conf.yaml` doesn't even include any Hadoop configuration keys. – McPeanutbutter Nov 28 '22 at 20:04
