
I am trying to do some data profiling/quality checks on data in Snowflake. I've already implemented some using SQL, but saw there is also an option for a Python worksheet.

My current code:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, count, countDistinct, min, max, length, lit, avg, sum

def main(session: snowpark.Session): 
    # Your code goes here, inside the "main" handler.
    df_campaign = session.table('liveperson_campaign')

    # Only check on the top 1000 rows 
    df_campaign_top = df_campaign.limit(1000)

    # Initialize empty list to store results 
    data = []

    # Iterate over columns to compute completeness 
    for column_name in df_campaign_top.columns:
        completeness = df_campaign_top.select((count(column_name)/df_campaign_top.count()).alias('completeness')).collect()[0]['completeness']
        data.append((column_name, completeness))

    # Create new dataframe for completeness 
    df_completeness = session.createDataFrame(data, schema=['column', 'completeness'])
        
    # Print a sample of the dataframe to standard output.
    df_completeness.show()

    # Return value will appear in the Results tab.
    return df_completeness

I currently get the following error:

Traceback (most recent call last):
  Worksheet, line 19, in main
  File "snowflake/snowpark/row.py", line 151, in __getitem__
    return self._named_values[item]
KeyError: 'completeness'

Where line 19 is

completeness = df_campaign_top.select((count(column_name)/df_campaign_top.count()).alias('completeness')).collect()[0]['completeness']

I would like to achieve the following: check the completeness of every column in the table, put the results into a new dataframe, and show them. Eventually, I would also add checks like uniqueness, min/max values, etc. to the same results dataframe. For now, I would just be happy if this works.
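
For the later checks, this is roughly what I have in mind, untested and just sketching the idea with the functions I already imported:

# Rough idea (untested): compute several metrics per column in a single select
for column_name in df_campaign_top.columns:
    df_checks = df_campaign_top.select(
        count(col(column_name)).alias('non_null_count'),
        countDistinct(col(column_name)).alias('distinct_count'),
        min(col(column_name)).alias('min_value'),
        max(col(column_name)).alias('max_value'),
    )
    df_checks.show()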

Is there a way to achieve this? I don't really understand the error I'm getting. In addition, I'm wondering if Snowpark is the right way to tackle this, or whether it would be better to just keep using the SQL worksheets. I've used pandas before and that worked great; is there a way to embed pandas into Snowpark in some way?
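
Something like this is what I had in mind, assuming a Snowpark dataframe can be converted to pandas with to_pandas() (I haven't tried this yet):

# Idea (untested): pull the 1000-row sample into pandas and profile it there
pdf = df_campaign_top.to_pandas()
completeness = pdf.notna().mean()  # fraction of non-null values per column
print(completeness)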

Thanks for your help in advance!

Leonie
  • When you work with Snowpark dataframes, think of your columns as SQL table columns: in Snowflake, columns are created uppercase by default, but Python is case-sensitive, so on line 19 you are using lowercase letters to access the column. Try this (the full loop with this fix is shown after these comments): ```completeness = dataframe.select((count(column_name)/dataframe.count()).alias('completeness')).collect()[0]['COMPLETENESS']``` – aek Jul 31 '23 at 13:34
  • Thanks, this works. It was not something I would've thought of myself. When I run my check, it no longer takes only the top 1000 rows. Does this also have something to do with the relation to SQL? – Leonie Jul 31 '23 at 14:39
  • You mean it takes more than 1000? By the way, you can see the actual SQL queries generated by Snowpark: if you log in to the Snowflake UI and go to Activity > Query History, you will see exactly what is being run; maybe that will help you find the problem. I tried to reproduce the problem myself and have not spotted any issues except the one above. My dataset is different, though. – aek Jul 31 '23 at 14:49
  • Hey! I'm working on https://github.com/SuperiorityComplex/data_checks, which lets you use this existing Python code to run, schedule, and do alerting on data quality checks. – josh Aug 30 '23 at 17:20
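
For reference, here is the loop from the question with aek's fix applied; the only change is the uppercase key on the collected row:

for column_name in df_campaign_top.columns:
    # Snowflake folds unquoted identifiers to uppercase, so the alias
    # 'completeness' comes back on the Row under the key 'COMPLETENESS'
    completeness = df_campaign_top.select(
        (count(column_name) / df_campaign_top.count()).alias('completeness')
    ).collect()[0]['COMPLETENESS']
    data.append((column_name, completeness))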

0 Answers