
I have instances of MinIO and a Jupyter PySpark notebook running locally in separate Docker containers. I am able to use the minio Python package to view buckets and objects in MinIO, but when I try to load a Parquet file from a bucket using PySpark I get the error below:

Code:

salesDF_path = 's3a://{}:{}/data/sales/example.parquet'.format(host, port)
df = spark.read.load(salesDF_path)

Error:

Py4JJavaError: An error occurred while calling o22.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 30 more

I'm trying to write a script that spins up these containers, runs some tests, then tears them down. Is there some configuration that I need to include somewhere?
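
For context, the harness I have in mind looks roughly like the sketch below (the compose file name, service definitions, and test path are placeholders, not something I have working yet):

import subprocess
import time

def run_suite():
    try:
        # Start the MinIO and Jupyter/PySpark containers in the background
        subprocess.run(["docker", "compose", "up", "-d"], check=True)
        # Give MinIO a moment to come up (a proper health check would be better)
        time.sleep(10)
        # Run the tests against the running containers
        subprocess.run(["pytest", "tests/"], check=True)
    finally:
        # Always tear the containers down, even if the tests fail
        subprocess.run(["docker", "compose", "down"], check=True)

if __name__ == "__main__":
    run_suite()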

Cam

3 Answers


Please follow these steps:

1. Relevant AWS JARs

Make sure you have the following JARs installed: aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar.

You can then run your PySpark application with something like (directly, or from a Docker CMD):

pyspark --jars "aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar"

2. Inside the notebook, set the Hadoop configuration:

# MinIO credentials
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret_key")
# Point s3a at the MinIO container ("minio" is the container hostname, 9000 the MinIO port)
sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", "minio")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "minio")
sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", "9000")
# MinIO needs path-style access; plain HTTP for a local setup
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
# The filesystem implementation the error says is missing
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

Please check the s3a client configuration documentation for the full list of parameters.

Now you should be able to query data from MinIO, for example:

sc.textFile("s3a://<file path>")
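
Tying this to the Parquet read from the question, here is a minimal sketch. Treat the endpoint, credentials, bucket, and key as placeholders for your setup; one common variant is to put the full host:port into fs.s3a.endpoint instead of using the proxy settings above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("minio-parquet-example").getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# Placeholder endpoint and credentials for a local MinIO container
hadoop_conf.set("fs.s3a.endpoint", "http://minio:9000")
hadoop_conf.set("fs.s3a.access.key", "access_key")
hadoop_conf.set("fs.s3a.secret.key", "secret_key")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")

# Note: the part after s3a:// is the bucket name, not the MinIO host:port --
# the host comes from fs.s3a.endpoint above.
df = spark.read.parquet("s3a://data/sales/example.parquet")
df.show()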

Aaron_ab

Could you show how you installed and initialized Spark? It looks like you need to download the Java library that provides org.apache.hadoop.fs.s3a.S3AFileSystem. Can you make sure you have the hadoop-aws and aws-java-sdk JARs installed? I use http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar and http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar.
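
If you would rather not download the JARs by hand, here is a sketch of pulling them at session start via spark.jars.packages. The versions here are examples and must match the Hadoop version your Spark build ships with.

from pyspark.sql import SparkSession

# Resolve hadoop-aws and the matching aws-java-sdk from Maven at startup
spark = (
    SparkSession.builder
    .appName("minio-example")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4",
    )
    .getOrCreate()
)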

JQadrad

See this comment here: https://github.com/jupyter/docker-stacks/issues/272#issuecomment-244278586

Specifically:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0,com.databricks:spark-csv_2.11:1.4.0 pyspark-shell'

This helped get rid of that ClassNotFoundException.
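
One caveat worth noting: PYSPARK_SUBMIT_ARGS is only read when the JVM gateway is launched, so it has to be set before the SparkContext/SparkSession is created. A minimal sketch (package versions taken from the snippet above; adjust them to your Spark build):

import os

# Must be set before the SparkSession is created, otherwise it has no effect
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.amazonaws:aws-java-sdk:1.10.34,"
    "org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("minio-example").getOrCreate()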

Lulu Cheng