
I have a server with Redis and Maven configured. I then create the following SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('local[4]') \
    .appName('try_one_core') \
    .config("spark.redis.host", "XX.XXX.XXX.XXX") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "XXXX") \
    .getOrCreate()
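An alternative to passing jars on the command line (a sketch, not confirmed by the answers below) is to set `spark.jars.packages` in the builder itself, so Spark fetches the connector from Maven at session start. The coordinates `com.redislabs:spark-redis:2.4.0` are an assumption; use the version matching your Spark build.

```python
from pyspark.sql import SparkSession

# Sketch: ask Spark to resolve the spark-redis connector from Maven Central
# before the session starts, so the data source class is on the classpath.
spark = (
    SparkSession.builder
    .master('local[4]')
    .appName('try_one_core')
    .config("spark.jars.packages", "com.redislabs:spark-redis:2.4.0")  # assumed version
    .config("spark.redis.host", "XX.XXX.XXX.XXX")
    .config("spark.redis.port", "6379")
    .config("spark.redis.auth", "XXXX")
    .getOrCreate()
)
```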

I am trying to connect to a remote Redis server and write/load data from it. However, when I try to save with the following command:

df.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "df") \
    .option("key.column", "case_id") \
    .save()

I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o327.save. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.redis. Please find packages at http://spark.apache.org/third-party-projects.html

Is there any fix to this?

SJN

2 Answers


It means that spark-redis-<version>-jar-with-dependencies.jar is not loaded in Spark.

You have to run pyspark with the following arguments as stated in the documentation:

$ bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar --conf "spark.redis.host=localhost" --conf "spark.redis.port=6379" --conf "spark.redis.auth=passwd"

fe2s
  • I have this server configured with Redis to be accessed remotely; on my computer I am establishing a connection through PySpark code with the specified Spark session. However, should I also have Maven installed on my computer, or only on the remote server that is being accessed? – Jose Gutierrez Feb 19 '20 at 16:58

In addition to @fe2s' answer: instead of loading the jar from disk or network storage, it can also be fetched directly from Maven:

bin/pyspark --packages com.redislabs:spark-redis:2.4.0

The --packages and --jars arguments can also be used with the normal spark-submit command.
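For example (a sketch: the script name and jar path are placeholders, and 2.4.0 is the version cited in this answer):

```shell
# Resolve the connector from Maven at submit time:
spark-submit --packages com.redislabs:spark-redis:2.4.0 my_job.py

# Or point at a locally downloaded fat jar:
spark-submit --jars /path/to/spark-redis-2.4.0-jar-with-dependencies.jar my_job.py
```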

dre-hh