I am using Great Expectations in Databricks.
I am on a shared cluster, runtime version 13.1 Beta (includes Apache Spark 3.4.0, Scala 2.12):
- py4j version 0.10.9.7
- pyspark version 3.4.0
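I confirmed those versions from the notebook itself (a quick sanity check, assuming py4j's usual version module; nothing GX-specific here):

import pyspark
from py4j.version import __version__ as py4j_version

print(pyspark.__version__)  # 3.4.0
print(py4j_version)         # 0.10.9.7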
Here is my code:
%pip install great_expectations
dbutils.library.restartPython()  # restart so the freshly installed library is importable
import great_expectations as gx
from great_expectations.checkpoint import SimpleCheckpoint
context_root_dir = "abfss://<container>@<acc>.dfs.core.windows.net/tmp/great_expectations/"
context = gx.get_context(context_root_dir=context_root_dir)
print(context)
from pyspark.sql import SparkSession
import pandas as pd
session_name = 'mk_spark_session'
# getOrCreate() returns the session Databricks already provides, so this is effectively the active session
spark = SparkSession.builder.appName(session_name).getOrCreate()
query = "SELECT * FROM my_test_table limit 10"
spark_df = spark.sql(query)
# print(spark_df)
# (returns --> DataFrame[<data>])
dataframe_datasource = context.sources.add_or_update_spark(
    name="my_spark_in_memory_datasource",
)
print(dataframe_datasource)
# (returns -->
#   name: my_spark_in_memory_datasource
#   type: spark)
dataframe_asset = dataframe_datasource.add_dataframe_asset(
    name="MK_DF_asset",
    dataframe=spark_df,
)
print(dataframe_asset)
# (returns -->
#   batch_metadata: {}
#   name: MK_DF_asset
#   type: dataframe)
# Not sure why batch_metadata is blank?
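(My best guess, from reading the fluent datasource docs, is that batch_metadata is empty simply because I never supplied any; the keyword below is what I understand the API to accept, and the value is just an illustrative placeholder:)

dataframe_asset = dataframe_datasource.add_dataframe_asset(
    name="MK_DF_asset",
    dataframe=spark_df,
    batch_metadata={"source_query": "SELECT * FROM my_test_table limit 10"},  # placeholder metadata
)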
batch_request = dataframe_asset.build_batch_request()
print(batch_request)
# (returns --> datasource_name='my_spark_in_memory_datasource' data_asset_name='MK_DF_asset' options={})
# create the expectation suite
expectation_suite_name = "MK_expectation_suite"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)
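Once get_validator works, the plan is to attach expectations to this suite through the validator, e.g. (the column name here is just a placeholder):

validator.expect_column_values_to_not_be_null("some_column")  # placeholder column name
validator.save_expectation_suite(discard_failed_expectations=False)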
########################################################################
# and I get error on the following command
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name=expectation_suite_name,
)
########################################################################
print(validator.head())
and I get the following error:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted.

Py4JError: An error occurred while calling None.org.apache.spark.SparkConf. Trace:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted.
	at py4j.security.WhitelistingPy4JSecurityManager.checkConstructor(WhitelistingPy4JSecurityManager.java:451)
	at py4j.Gateway.invoke(Gateway.java:256)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
I couldn't figure out why I am getting this error. I suspected a compatibility issue, but as far as I can tell I am on the latest versions of everything.
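For what it's worth, I can reproduce the same exception without Great Expectations at all (my own minimal repro; it assumes, as the trace above suggests, that constructing a SparkConf from Python goes through the JVM gateway):

from pyspark import SparkConf

# On a shared-access-mode cluster this raises the same Py4JSecurityException,
# because it invokes the JVM constructor org.apache.spark.SparkConf(boolean),
# which is not on the cluster's Py4J whitelist.
conf = SparkConf()

That makes me think the issue is the shared cluster's Py4J security whitelist (presumably Great Expectations constructs a SparkConf internally during get_validator) rather than the library versions.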