I'm using Flink 1.16.0 with Kotlin to read and process (snappy-compressed) parquet files that were generated by Spark, and I keep running into ClassNotFoundException: org.apache.hadoop.conf.Configuration. The files are on Google Cloud Storage (gs://), but the problem also reproduces with MinIO and s3a://.
My jobs use the Table API to read the parquet files, convert the table to a stream, and then do some stream processing into a (Kotlin) data class, which is persisted with AvroParquetWriters. A minimal version:
// Register the parquet input as a filesystem table
val query = """
    CREATE TABLE data (letter STRING, timedigit BIGINT) WITH (
        'connector' = 'filesystem',
        'path' = '${inputPath}',
        'format' = 'parquet'
    )
"""
tableEnv.executeSql(query)
val dataTable = tableEnv.from("data")

// Bulk sink that writes the Kotlin data class back out as parquet
val parquetSink = FileSink
    .forBulkFormat(Path(outputPath), AvroParquetWriters.forReflectRecord(Data::class.java))
    .build()

tableEnv
    .toDataStream(dataTable)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.noWatermarks<Row>()
            .withTimestampAssigner { row, _ -> row.getFieldAs("timedigit") }
    )
    .map { row -> Data(row.getFieldAs("timedigit"), row.getFieldAs("letter")) }
    .sinkTo(parquetSink)

env.execute("Read from and write to parquet")
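For reference, Data is just a plain Kotlin data class; a minimal sketch of what it looks like (field names taken from the job above, field types assumed from the table schema):

data class Data(val timedigit: Long, val letter: String) // BIGINT -> Long, STRING -> String (assumed)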
My build.gradle is constructed as instructed in the Flink docs and includes the necessary dependencies for parquet:
dependencies {
    implementation "org.apache.flink:flink-streaming-java:${flinkVersion}"
    implementation "org.apache.flink:flink-table-api-java:${flinkVersion}"
    implementation "org.apache.flink:flink-table-api-java-bridge:${flinkVersion}"
    implementation "org.apache.flink:flink-clients:${flinkVersion}"
    implementation "org.apache.flink:flink-connector-files:${flinkVersion}"

    flinkShadowJar "org.apache.flink:flink-parquet:${flinkVersion}"
    flinkShadowJar("org.apache.parquet:parquet-avro:1.12.2") {
        exclude group: "org.apache.hadoop", module: "hadoop-client"
        exclude group: "it.unimi.dsi", module: "fastutil"
    }
    flinkShadowJar("org.xerial.snappy:snappy-java:1.1.8.4") {
        exclude group: "org.osgi", module: "core"
    }

    runtimeOnly "org.apache.logging.log4j:log4j-slf4j-impl:${log4jVersion}"
    runtimeOnly "org.apache.logging.log4j:log4j-api:${log4jVersion}"
    runtimeOnly "org.apache.logging.log4j:log4j-core:${log4jVersion}"
}
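For context, flinkShadowJar is the extra configuration from the Flink Gradle quickstart that controls what ends up in the fat jar; my build declares it roughly like this (paraphrased from the quickstart, not verbatim from my repo):

configurations {
    flinkShadowJar  // dependencies that should go into the shadow (fat) jar
    // keep logging classes out of the fat jar, as in the quickstart
    flinkShadowJar.exclude group: "com.google.code.findbugs", module: "jsr305"
    flinkShadowJar.exclude group: "org.slf4j"
    flinkShadowJar.exclude group: "org.apache.logging.log4j"
    // (the quickstart also adds flinkShadowJar to the compile/runtime classpaths; omitted here)
}

shadowJar {
    configurations = [project.configurations.flinkShadowJar]
}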
I'm running the job with a Docker image on native Kubernetes in application mode. This is the Dockerfile for the s3a case, where I provide the S3 Hadoop plugin in the plugins directory:
# Part 1 - Build jar via Gradle
FROM gradle:7.5.1-jdk11 AS gradle
WORKDIR /usr/src/app
COPY build.gradle build.gradle
COPY settings.gradle settings.gradle
COPY src src
RUN gradle clean installShadowDist
# Part 2 - Create flink container with jar
FROM flink:1.16.0-java11
ARG kotlinVersion=1.7.21
ARG flinkVersion=1.16.0
USER flink
## Provide Kotlin jar
RUN wget -P $FLINK_HOME/lib https://repo1.maven.org/maven2/org/jetbrains/kotlin/kotlin-stdlib/${kotlinVersion}/kotlin-stdlib-${kotlinVersion}.jar
## Provide Job jar
RUN mkdir -p $FLINK_HOME/usrlib
COPY --from=gradle /usr/src/app/build/libs/flink-parquet-batch-demo-1.0-all.jar $FLINK_HOME/usrlib/flink-parquet-batch-demo.jar
## Provide S3 plugin
RUN mkdir -p $FLINK_HOME/plugins/s3-fs-hadoop
RUN cp $FLINK_HOME/opt/flink-s3-fs-hadoop-${flinkVersion}.jar $FLINK_HOME/plugins/s3-fs-hadoop/
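The gs:// image is identical except for the plugin step; there I copy the GCS filesystem plugin from opt instead (sketched here, assuming the jar ships under the same naming pattern as the S3 one in the Flink distribution):

## Provide GCS plugin (gs:// case)
RUN mkdir -p $FLINK_HOME/plugins/gs-fs-hadoop
RUN cp $FLINK_HOME/opt/flink-gs-fs-hadoop-${flinkVersion}.jar $FLINK_HOME/plugins/gs-fs-hadoop/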
I'm using the same image to launch the job from within Kubernetes, i.e. I create a pod based on this image, ssh into it, and launch the job via:
./bin/flink run-application \
--target kubernetes-application --class demo.Job \
-Dkubernetes.cluster-id=flink-application-cluster \
-Dkubernetes.container.image=flink-custom:latest \
-Dkubernetes.service-account=flink-service-account \
-Dparallelism.default=2 \
-Dexecution.runtime-mode=BATCH \
-Ds3.endpoint=http://172.17.0.4:9000 \
-Ds3.path-style=true \
-Ds3.access-key=minioadmin \
-Ds3.secret-key=minioadmin \
local:///opt/flink/usrlib/flink-parquet-batch-demo.jar \
--input_path s3a://data/parquet-data/ \
--output_path s3a://data/output-data/
What I've tried
Things I've tried that unfortunately led to the same ClassNotFoundException:
- Switched the entire project to Maven
- Job submission via the Flink Kubernetes operator (instead of the scripts that ship with Flink)
- Job submission via session mode (instead of application mode)
- Running everything in Docker instead of k8s
- Removed the parquet-avro dependency
- Removed the snappy-java dependency
- Left out the gs/s3a plugins
- Included org.apache.flink:flink-hadoop-compatibility_2.12 via flinkShadowJar
- Included org.apache.flink:flink-sql-parquet:1.16.0 via flinkShadowJar (both of these are sketched below)
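The last two attempts were just additional lines in the dependencies block, along these lines:

flinkShadowJar "org.apache.flink:flink-hadoop-compatibility_2.12:${flinkVersion}"
flinkShadowJar "org.apache.flink:flink-sql-parquet:${flinkVersion}"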
When I include either hadoop-core or hadoop-client via flinkShadowJar, the exception disappears and the taskmanagers launch, but then I get an org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a" (or No FileSystem for scheme "gs", respectively). Also, to my understanding, adding Hadoop dependencies to the job jar like this is discouraged anyway.
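That Hadoop experiment was also just one more flinkShadowJar line, e.g. (the version here is only an example, not the point):

// makes the ClassNotFoundException go away, but then the s3a/gs schemes can no longer be resolved
flinkShadowJar "org.apache.hadoop:hadoop-client:3.3.4"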
I've committed a minimal project at https://github.com/mcpeanutbutter/Flink-parquet-batch-demo that includes the full log and stack trace (output.log).