
I am following the docs and trying to run a simple script found here: https://docs.snowflake.com/en/user-guide/spark-connector-use.html

Py4JJavaError: An error occurred while calling o37.load.
: java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake.

My code is below. I also tried setting the config option with the paths to the JDBC and spark-snowflake jars located in the /Users/Hana/spark-sf/ directory, but no luck.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config('spark.jars','/Users/Hana/spark-sf/snowflake-jdbc-3.12.9.jar,/Users/Hana/spark-sf/spark-snowflake_2.12-2.8.1-spark_3.0.jar') \
    .getOrCreate()

# Set options below
sfOptions = {
  "sfURL" : "<account_name>.snowflakecomputing.com",
  "sfUser" : "<user_name>",
  "sfPassword" : "<password>",
  "sfDatabase" : "<database>",
  "sfSchema" : "<schema>",
  "sfWarehouse" : "<warehouse>"
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"


df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
  .options(**sfOptions) \
  .option("query",  "select * from table limit 200") \
  .load()

df.show()

How should I properly set these variables, and which ones need to be set? If someone could list out these steps I would greatly appreciate it!

Hana

2 Answers


Can you try the format as "snowflake" only?

So your dataframe read will be:

df = spark.read.format("snowflake") \
  .options(**sfOptions) \
  .option("query",  "select * from table limit 200") \
  .load()

or set the SNOWFLAKE_SOURCE_NAME variable to

SNOWFLAKE_SOURCE_NAME = "snowflake"
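
If the ClassNotFoundException persists with this format name, the connector classes are most likely not on the classpath at all. A minimal sketch of launching the shell so the packages are resolved automatically (the versions here match the jars from the question and are an assumption):

pyspark --packages net.snowflake:snowflake-jdbc:3.12.9,net.snowflake:spark-snowflake_2.12:2.8.1-spark_3.0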
demircioglu
  • Thanks for your input - I tried that and I'm now getting this error: java.lang.ClassNotFoundException: Failed to find data source: snowflake. – Hana Jul 20 '20 at 00:10
  • Can you follow this link : https://stackoverflow.com/questions/62957222/facing-classnotfound-exception-while-reading-a-snowflake-table-using-spark/62959579#62959579 – Ankur Srivastava Jul 20 '20 at 15:59

I was also struggling with the "Failed to find data source" error while configuring my local development environment.
After installing and setting my environment variables for HADOOP_HOME, SCALA_HOME, SPARK_CLASSPATH, and SPARK_HOME, and updating the Path variable, I installed the snowflake connector:

pip install -r https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.7.2/tested_requirements/requirements_39.reqs
pip install snowflake-connector-python==2.7.2
pyspark --packages net.snowflake:snowflake-jdbc:3.13.10,net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1
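
For a standalone script rather than the interactive shell, the same Maven coordinates can be passed to spark-submit (a sketch; my_script.py is a placeholder name):

spark-submit --packages net.snowflake:snowflake-jdbc:3.13.10,net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1 my_script.py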

I was trying to use the snowflake datasource in my Python code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('MYAPP').setMaster("local[*]")
sc = SparkContext(conf=conf)
spark_sql = SQLContext(sc)

df = spark_sql.read.format("snowflake")\
  .options(**sfOptions)\
  .option("query", query)\
  .load()

But PySpark still did not seem to know about the jars required to support the snowflake datasource.

To solve the issue, I finally added the required packages to the SparkConf:

conf = SparkConf() \
    .setAppName('MYAPP') \
    .setMaster("local[*]") \
    .set("spark.jars.packages", "net.snowflake:snowflake-jdbc:3.13.10,net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1")

Note that I used spark.jars.packages with the Maven coordinates of the packages instead of spark.jars with full paths to the jars. With spark.jars.packages, Spark resolves and downloads the jars and their dependencies automatically at startup.
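
For completeness, the same fix expressed with the SparkSession builder from the question (a sketch; sfOptions and the query are placeholders as defined there):

from pyspark.sql import SparkSession

# spark.jars.packages pulls the connector and JDBC driver from Maven at startup
spark = SparkSession.builder \
    .appName('MYAPP') \
    .master("local[*]") \
    .config("spark.jars.packages", "net.snowflake:snowflake-jdbc:3.13.10,net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1") \
    .getOrCreate()

df = spark.read.format("snowflake") \
    .options(**sfOptions) \
    .option("query", "select * from table limit 200") \
    .load()
df.show()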

raul7