3

I have successfully created a Great Expectations result, and I would like to output the results of the expectation to an HTML file.

There are a few links explaining how to show the results in human-readable form using what is called 'Data Docs', for example https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started/set_up_data_docs.html#tutorials-getting-started-set-up-data-docs

But to be quite honest, the documentation is extremely hard to follow.

My expectation simply verifies that the number of passengers in my dataset falls between 1 and 6. I would like help outputting the results to a folder using 'Data Docs', or by whatever other means it is possible to output the results to a folder:

import great_expectations as ge
import great_expectations.dataset.sparkdf_dataset
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
from great_expectations.data_asset import DataAsset

from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, FilesystemStoreBackendDefaults
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.resource_identifiers import ValidationResultIdentifier
from datetime import datetime


df_taxi = spark.read.csv('abfss://root@adlspretbiukadlsdev.dfs.core.windows.net/RAW/LANDING/yellow_trip_data_sample_2019-01.csv', inferSchema=True, header=True)

taxi_rides = SparkDFDataset(df_taxi)

# Check that the passenger_count values (not their string lengths) are between 1 and 6
taxi_rides.expect_column_values_to_be_between(column='passenger_count', min_value=1, max_value=6)

taxi_rides.save_expectation_suite()

The code is run from Apache Spark.

If someone could just point me in the right direction, I will be able to figure it out.

Alex Ott
Patterson

2 Answers

5

You can visualize Data Docs on Databricks: you just need to use the correct renderer* combined with DefaultJinjaPageView, which renders the result into HTML that can then be shown with displayHTML. First, import the necessary classes and functions:

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from great_expectations.render.renderer import *
from great_expectations.render.view import DefaultJinjaPageView

To see the result of profiling, use ProfilingResultsPageRenderer:

expectation_suite, validation_result = BasicDatasetProfiler.profile(SparkDFDataset(df))
document_model = ProfilingResultsPageRenderer().render(validation_result)
displayHTML(DefaultJinjaPageView().render(document_model))

It will show something like this:

[Image: rendered profiling results page]

We can visualize the results of validation with ValidationResultsPageRenderer:

gdf = SparkDFDataset(df)
gdf.expect_column_values_to_be_of_type("county", "StringType")
gdf.expect_column_values_to_be_between("cases", 0, 1000)
validation_result = gdf.validate()
document_model = ValidationResultsPageRenderer().render(validation_result)
displayHTML(DefaultJinjaPageView().render(document_model))

It will show something like this:

[Image: rendered validation results page]

Or we can render the expectation suite itself with ExpectationSuitePageRenderer:

gdf = SparkDFDataset(df)
gdf.expect_column_values_to_be_of_type("county", "StringType")
document_model = ExpectationSuitePageRenderer().render(gdf.get_expectation_suite())
displayHTML(DefaultJinjaPageView().render(document_model))

It will show something like this:

[Image: rendered expectation suite page]

If you're not using Databricks, you can render the results to HTML in the same way and store the HTML as files wherever you need them.
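As a minimal sketch of the "store it as files" step: the rendered page is an ordinary Python string, so it can be written out with the standard library. The helper name `save_html`, the folder and file names, and the placeholder `html` value are illustrative assumptions, not part of Great Expectations; in practice `html` would be the return value of `DefaultJinjaPageView().render(document_model)`.

```python
from pathlib import Path
import tempfile


def save_html(html: str, path) -> Path:
    """Write an HTML string to `path` as UTF-8, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target


# Placeholder for: html = DefaultJinjaPageView().render(document_model)
html = "<html><body><h1>Validation results</h1></body></html>"

# A temporary folder is used here only so the sketch runs anywhere;
# replace it with the folder you actually want the report in.
out_dir = Path(tempfile.mkdtemp())
report = save_html(html, out_dir / "data_docs" / "validation_result.html")
print(report)
```

The same string can be handed to any storage client instead of a local path, since nothing in it depends on Databricks.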

* The correct renderer* documentation link above is technically "Legacy" now but still valid. The new docs site version lacks detail at the time of this writing.

Eric Seastrand
Alex Ott
  • Hi @alex ott, you have come through for me again. Can I assume that I can use the same principle with Azure Blob or ADLS Gen2? The company I'm working for at the moment isn't using Databricks to access Apache Spark; they're using an Azure Synapse workspace with Apache Spark pools. Therefore, I will need to work with Azure Blob. Will your explanation above work exactly as you have described with Azure Blob? Thanks – Patterson Jul 14 '21 at 10:52
  • I don't understand what you mean by 'expectation_suite,' – Patterson Jul 14 '21 at 11:17
  • yes, it should work with azure blob as well. `DefaultJinjaPageView().render` just return HTML that you can store where you want. – Alex Ott Jul 14 '21 at 11:20
  • Hi @alex ott, I'm getting the following error: ```SyntaxError: unexpected EOF while parsing (, line 3) File "", line 3 displayHTML(DefaultJinjaPageView().render(document_model)``` – Patterson Jul 14 '21 at 11:20
  • I should point out that I'm running your suggested code from within Synapse on Apache Spark – Patterson Jul 14 '21 at 11:22
  • Hi @Alex, I am having some mild success. Ideally, I would like to get the Data Docs from the following code ```validation_results = ge_df2.validate(expectation_suite='gregs_expectations.json', only_return_failures=True)``` – Patterson Jul 14 '21 at 11:38
  • yes, for validation results you need just use corresponding renderer – Alex Ott Jul 14 '21 at 12:02
  • Hi Alex, can I ask for one more favour? Is it possible to save the results? I tried mydisplay = displayHTML(DefaultJinjaPageView().render(document_model)) and then display(mydisplay) to download the results, but that didn't work. Alternatively, can I save the displayHTML output to a blob folder? At the moment, the results are appearing on my screen – Patterson Jul 14 '21 at 12:27
  • you just need to do `html = DefaultJinjaPageView().render(document_model)`, and then save it as it was described in the question about storing GE expectations into blob storage – Alex Ott Jul 14 '21 at 12:49
  • So @Alex I did ```html = DefaultJinjaPageView().render(document_model)``` And then I tried to save it as described in the other SO question you just helped me with as ```html.save_expectation_suite('/tmp/gregs_expectations.json')``` but I got AttributeError: 'str' object has no attribute 'save_expectation_suite' – Patterson Jul 14 '21 at 17:57
  • `html` is just HTML string. you need to use `create_file/append_data` functions to save it. If you want to save expectation suite itself - it's another function from GE itself – Alex Ott Jul 14 '21 at 18:00
  • In the SO question you helped me with https://stackoverflow.com/questions/68307596/how-to-save-an-great-expectation-to-azure-data-lake-or-blob-store/68313830?noredirect=1#comment120853523_68313830 I created the file. But before creating the file, I saved the suite using ```ge_df.save_expectation_suite('gregs_expectations.json')```. I'm trying to do the same thing here, i.e. save the ```displayHTML(DefaultJinjaPageView().render(document_model))``` output and then create a file from it. – Patterson Jul 14 '21 at 18:12
  • I feel as though I'm missing something really simple – Patterson Jul 14 '21 at 18:18
  • you need to remove `displayHTML`, you just need `html = ....; file = DataLakeFileClient.from_connection_string.....; file.create_file (); file.append_data(html, offset=0, length=len(html)); file.flush_data(len(html))` – Alex Ott Jul 14 '21 at 18:24
  • finally got it.... you really a star. Thanks ever so much man. And thanks for your patience – Patterson Jul 14 '21 at 18:43
  • Just revisiting this question, Alex. Can you remind me what you meant when you said 'you just need html = ....;'? – Patterson Feb 01 '22 at 13:07
  • `html=DefaultJinjaPageView().render(document_model)`, etc. – Alex Ott Feb 01 '22 at 18:19
  • Hi Alex Ott, I got it before looking at your post. But thanks anyway. You have been such a great help on my Databricks / Apache Spark journey. Thanks man – Patterson Feb 01 '22 at 20:20
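The save steps described in the comments above can be sketched as a small helper. This is a hedged sketch, not a tested Synapse recipe: `upload_html` and `RecordingClient` are hypothetical names, and the real client would be an `azure.storage.filedatalake.DataLakeFileClient` created with `DataLakeFileClient.from_connection_string(...)` as in the comment. The string is encoded first so that the lengths passed to `append_data` and `flush_data` are byte counts, which can differ from `len(html)` when the page contains non-ASCII characters.

```python
def upload_html(file_client, html: str) -> int:
    """Write an HTML string through a DataLakeFileClient-like object.

    Returns the number of bytes written.
    """
    data = html.encode("utf-8")          # lengths below are byte counts
    file_client.create_file()            # create the target file
    file_client.append_data(data, offset=0, length=len(data))
    file_client.flush_data(len(data))    # commit everything appended so far
    return len(data)


class RecordingClient:
    """Hypothetical stand-in for DataLakeFileClient, for a dry run without Azure."""

    def __init__(self):
        self.calls = []
        self.content = b""
        self.flushed = None

    def create_file(self):
        self.calls.append("create_file")

    def append_data(self, data, offset, length=None):
        self.calls.append("append_data")
        self.content += data[:length]

    def flush_data(self, offset):
        self.calls.append("flush_data")
        self.flushed = offset


# Dry run: the page contains one non-ASCII character on purpose.
client = RecordingClient()
written = upload_html(client, "<html><body>café</body></html>")
print(written, client.calls)
```

With the real SDK, `file_client` would be built as in the comment, for example `DataLakeFileClient.from_connection_string(conn_str, file_system_name=..., file_path=...)`, where the connection string, file system, and path are your own values.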
0

I have been in touch with the developers of Great Expectations in connection with this question. They have informed me that Data Docs is not currently available with Azure Synapse or Databricks.

Patterson