I'm using Flink 1.16.0 with Kotlin to read and process (snappy-compressed) Parquet files that were generated by Spark, and I keep running into ClassNotFoundException: org.apache.hadoop.conf.Configuration. The files are on Google Cloud Storage (gs://), but the problem also reproduces with MinIO and s3a://.

My jobs use the Table API to read the Parquet files, convert the table to a DataStream, and then do some stream processing into a (Kotlin) data class, which is persisted with AvroParquetWriters. A minimal version:

val query = """
    CREATE TABLE data (letter STRING, timedigit BIGINT) WITH (
        'connector' = 'filesystem',
        'path' = '${inputPath}',
        'format' = 'parquet'
    )
    """

tableEnv.executeSql(query)

val dataTable = tableEnv.from("data")

val parquetSink = FileSink
    .forBulkFormat(Path(outputPath), AvroParquetWriters.forReflectRecord(Data::class.java))
    .build()

tableEnv
    .toDataStream(dataTable)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.noWatermarks<Row>().withTimestampAssigner { row, _ -> row.getFieldAs("timedigit") }
    )
    .map { row -> Data(row.getFieldAs("timedigit"), row.getFieldAs("letter")) }
    .sinkTo(parquetSink)

env.execute("Read from and write to parquet")
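
For reference, Data is just a plain Kotlin data class whose shape follows from the map above (a sketch; the version in the repo may carry extra annotations):

// Target type for the stream, inferred from the map() call above (sketch).
data class Data(val timedigit: Long, val letter: String)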

My build.gradle is set up as described in the Flink docs and includes the necessary dependencies for Parquet:

dependencies {  
    implementation "org.apache.flink:flink-streaming-java:${flinkVersion}"  
    implementation "org.apache.flink:flink-table-api-java:${flinkVersion}"  
    implementation "org.apache.flink:flink-table-api-java-bridge:${flinkVersion}"  
    implementation "org.apache.flink:flink-clients:${flinkVersion}"  
    implementation "org.apache.flink:flink-connector-files:${flinkVersion}"  

    flinkShadowJar "org.apache.flink:flink-parquet:${flinkVersion}"  
    flinkShadowJar("org.apache.parquet:parquet-avro:1.12.2") {  
        exclude group: "org.apache.hadoop", module: "hadoop-client"  
        exclude group: "it.unimi.dsi", module: "fastutil"  
    }    flinkShadowJar ("org.xerial.snappy:snappy-java:1.1.8.4") {  
        exclude group: "org.osgi", module: "core"  
    }  

    runtimeOnly "org.apache.logging.log4j:log4j-slf4j-impl:${log4jVersion}"  
    runtimeOnly "org.apache.logging.log4j:log4j-api:${log4jVersion}"  
    runtimeOnly "org.apache.logging.log4j:log4j-core:${log4jVersion}"  
}
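
For completeness, flinkShadowJar is the custom configuration from the Flink Gradle quickstart, whose contents end up in the shadow jar but not on the provided compile classpath. Roughly (reproduced from memory; the exact exclude list in the repo may differ):

// Custom configuration for dependencies that go into the shadow jar,
// as in the Flink Gradle quickstart (sketch from memory).
configurations {
    flinkShadowJar

    // provided by the Flink runtime, so keep them out of the fat jar
    flinkShadowJar.exclude group: "com.google.code.findbugs", module: "jsr305"
    flinkShadowJar.exclude group: "org.slf4j"
    flinkShadowJar.exclude group: "org.apache.logging.log4j"
}

shadowJar {
    configurations = [project.configurations.flinkShadowJar]
}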

I'm running the job from a Docker image on native Kubernetes in application mode. This is the Dockerfile for the s3a case, where I provide the S3 Hadoop plugin in the plugins directory:

# PART 1 - Build jar via gradle  
FROM gradle:7.5.1-jdk11 AS gradle
WORKDIR /usr/src/app  
COPY build.gradle build.gradle  
COPY settings.gradle settings.gradle  
COPY src src    
RUN gradle clean installShadowDist  

# PART 2 - Create Flink container with jar
FROM flink:1.16.0-java11    
ARG kotlinVersion=1.7.21  
ARG flinkVersion=1.16.0  
USER flink  

## Provide Kotlin jar  
RUN wget -P $FLINK_HOME/lib https://repo1.maven.org/maven2/org/jetbrains/kotlin/kotlin-stdlib/${kotlinVersion}/kotlin-stdlib-${kotlinVersion}.jar  
  
## Provide Job jar  
RUN mkdir -p $FLINK_HOME/usrlib  
COPY --from=gradle /usr/src/app/build/libs/flink-parquet-batch-demo-1.0-all.jar $FLINK_HOME/usrlib/flink-parquet-batch-demo.jar  
  
## Provide S3 plugin  
RUN mkdir -p $FLINK_HOME/plugins/s3-fs-hadoop  
RUN cp $FLINK_HOME/opt/flink-s3-fs-hadoop-${flinkVersion}.jar $FLINK_HOME/plugins/s3-fs-hadoop/
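
For the gs:// case the image is identical except for the filesystem plugin, roughly as follows (a sketch; as far as I can tell the flink-gs-fs-hadoop jar ships in $FLINK_HOME/opt of the official image):

## Provide GCS plugin instead (gs:// variant of the image)
RUN mkdir -p $FLINK_HOME/plugins/gs-fs-hadoop
RUN cp $FLINK_HOME/opt/flink-gs-fs-hadoop-${flinkVersion}.jar $FLINK_HOME/plugins/gs-fs-hadoop/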

I'm using the same image to launch the job from within Kubernetes, i.e. I create a Pod based on this image, ssh into it, and launch the job via:

./bin/flink run-application \
    --target kubernetes-application \
    --class demo.Job \
    -Dkubernetes.cluster-id=flink-application-cluster \
    -Dkubernetes.container.image=flink-custom:latest \
    -Dkubernetes.service-account=flink-service-account \
    -Dparallelism.default=2 \
    -Dexecution.runtime-mode=BATCH \
    -Ds3.endpoint=http://172.17.0.4:9000 \
    -Ds3.path-style=true \
    -Ds3.access-key=minioadmin \
    -Ds3.secret-key=minioadmin \
    local:///opt/flink/usrlib/flink-parquet-batch-demo.jar \
    --input_path s3a://data/parquet-data/ \
    --output_path s3a://data/output-data/
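
The --input_path / --output_path arguments become the inputPath / outputPath used in the CREATE TABLE statement and the FileSink above. The parsing is roughly the following (a sketch assuming ParameterTool; the repo may do this slightly differently):

import org.apache.flink.api.java.utils.ParameterTool

// Sketch: map the CLI arguments onto the paths used in the pipeline above.
// Assumes ParameterTool; the actual parsing in the repo may differ.
fun main(args: Array<String>) {
    val params = ParameterTool.fromArgs(args)
    val inputPath = params.getRequired("input_path")   // e.g. s3a://data/parquet-data/
    val outputPath = params.getRequired("output_path") // e.g. s3a://data/output-data/
    // ... set up env / tableEnv and run the pipeline shown earlier ...
}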

What I've tried

Things I've tried that unfortunately lead to the same ClassNotFoundException:

  • Switched the entire project to Maven
  • Job submission via the Flink Kubernetes operator (instead of the scripts that ship with Flink)
  • Job submission via session mode (instead of application mode)
  • Running everything in Docker instead of k8s
  • Removed the parquet-avro dependency
  • Removed the snappy-java dependency
  • Left out the gs / s3a plugins
  • Included org.apache.flink:flink-hadoop-compatibility_2.12 via flinkShadowJar
  • Included org.apache.flink:flink-sql-parquet:1.16.0 via flinkShadowJar

When I include either hadoop-core or hadoop-client via flinkShadowJar, the exception disappears and the TaskManagers launch, but I then get an org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a" / No FileSystem for scheme "gs". Also, to my understanding, adding Hadoop dependencies to the job jar is discouraged anyway.
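
(For reference, that experiment was just one extra line in the dependencies block, along the lines of the following — the version here is only an example, not necessarily the one I used:)

flinkShadowJar "org.apache.hadoop:hadoop-client:3.3.4"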

I've pushed a minimal reproduction project to https://github.com/mcpeanutbutter/Flink-parquet-batch-demo that includes the full log and stack trace (output.log).

  • My initial thought was "I've seen something similar like this" and remember ticket https://issues.apache.org/jira/browse/FLINK-29729 - Could that be the issue for you too? – Martijn Visser Nov 28 '22 at 13:39
  • Thank you @martijn, yes, this might be strongly related! If the contents of `flink-conf.yaml` are ignored when the parquet reader is created, it might not be able to get a Hadoop conf at all. I wonder, however, if the PR referenced in the ticket would solve my issue entirely, since as far as I can tell, the "vanilla" `flink-conf.yaml` doesn't even include any Hadoop configuration keys. – McPeanutbutter Nov 28 '22 at 20:04
