I am trying to a do some data profiling/quality checks on data in Snowflake. I've already tried to implement some using SQL but saw there is also an option for a Python worksheet.
My current code:
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, count, countDistinct, min, max, length, lit, avg, sum
def main(session: snowpark.Session):
# Your code goes here, inside the "main" handler.
df_campaign = session.table('liveperson_campaign')
# Only check on the top 1000 rows
df_campaign_top = df_campaign.limit(1000)
# Initialize empty list to store results
data = []
# Iterate over columns to compute completeness
for column_name in df_campaign_top.columns:
completeness = df_campaign_top.select((count(column_name)/df_campaign_top.count()).alias('completeness')).collect()[0]['completeness']
data.append((column_name, completeness))
# Create new dataframe for completeness
df_completeness = session.createDataFrame(data, schema=['column', 'completeness'])
# Print a sample of the dataframe to standard output.
df_completeness.show()
# Return value will appear in the Results tab.
return df_completeness
I currently get the following error:
Traceback (most recent call last):
Worksheet, line 19, in main
File "snowflake/snowpark/row.py", line 151, in __getitem__
return self._named_values[item]
KeyError: 'completeness'
Where line 19 is
completeness = df_campaign_top.select((count(column_name)/df_campaign_top.count()).alias('completeness')).collect()[0]['completeness']
I would like to achieve the following: check the completeness for every column in the table, put this into a new dataframe and show the results. Eventually, I would also add checks like uniqueness, min/max values etc. to the same results dataframe. For now, I would just be happy if this works.
Is there a way to achieve this, I don't really understand the error I'm getting. In addition, I'm wondering if snowpark is the right way to tackle this. Would it be better to just keep using the SQL worksheets? I've used pandas before and that worked great, is there a way to embed Pandas into snowpark in some way?
Thanks for your help in advance!