
I am developing an application that performs data quality checks over input files and captures counts based on reported DQ failures in the data. Does the approach I use make sense, or would you recommend a better way to do this?

I am trying to write an application in Python that captures DQ errors in the data and gathers counts. I could have used Pandas or NumPy for this; however, since the data volume is huge (~100 GB), I decided to do it with Spark. This is my third application in Python, so while I can write code in it, I am out of my depth on whether this is really the optimal way to do it.

So, to summarize: I read multiple CSV files, write them into a single Parquet file, create a temp view over it that I can query to find DQ issues, capture the results of each query, and append them to a list. That list is later used to write a CSV that becomes the input for a dashboard report. Code below.

# Importing required libraries
import datetime
from pyspark.sql import SparkSession

# Initiating Spark Session
spark = SparkSession.builder.appName("DQ-checks").getOrCreate()

# Initializing Variables
time1 = datetime.datetime.now()
src_file_01 = r'\All kinds of data files\craigslistVehicles.csv'
target_filename = r'\All kinds of data files\craigslistVehicles.parquet'

# Read the CSV file through Spark
df = spark.read.csv(src_file_01, header=True, inferSchema=True)

# Update code to make it flexible enough to read multiple files
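# (Sketch, not part of the original code.) spark.read.csv accepts a list of paths
# or a glob pattern, so multiple files could be read in a single pass, e.g.:
#     df = spark.read.csv([src_file_01, src_file_02], header=True, inferSchema=True)
#     df = spark.read.csv(r'\All kinds of data files\*.csv', header=True, inferSchema=True)
# The second file name and the glob pattern above are placeholders.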

# Write the contents of the CSV file into a Parquet file
df.write.mode("overwrite").parquet(target_filename)
print("created a parquet file")

# Create a temporary view over the Parquet file to query data
df2 = spark.read.parquet(target_filename)
df2.createOrReplaceTempView("craigslistVehicles")

# Create a column list from the header of the Spark View
column_list = df2.columns
print(column_list)

# Capturing time before start of the query for benchmarking
time2 = datetime.datetime.now()
result_store = []
# Iterate through all the columns and capture null counts for each column
rule_type = 'Null Check'
for column_name in column_list:
    # Backticks protect column names that contain spaces or special characters
    query = "select count(*) as null_count from craigslistVehicles where `{}` is null".format(column_name)
#    print(query)
    null_count = spark.sql(query).first()['null_count']
    # Append one row per column so the list can later be written out as a CSV
    result_store.append([rule_type, column_name, null_count])
    print([rule_type, column_name, null_count])

# Next Steps - Update code to add more data quality checks based on requirement.
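# (Sketch, not part of the original code.) Additional rules could follow the same
# loop pattern; for example, an "Empty Check" counting blank strings per column:
#     for column_name in column_list:
#         query = ("select count(*) as empty_count from craigslistVehicles "
#                  "where trim(`{}`) = ''".format(column_name))
#         empty_count = spark.sql(query).first()['empty_count']
#         result_store.append(['Empty Check', column_name, empty_count])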


# Next Steps - Insert results of the queries into a spark table that can be used as a log and becomes an input for a dashboard report.
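# (Sketch, not part of the original code.) One way to persist the results is to
# turn result_store into a small DataFrame and write it out; the output path is
# a placeholder:
#     results_df = spark.createDataFrame(result_store,
#                                        schema=["rule_type", "column_name", "failure_count"])
#     results_df.coalesce(1).write.mode("overwrite").csv(r'\All kinds of data files\dq_results',
#                                                        header=True)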


# Capturing time after end of the query for benchmarking
time3 = datetime.datetime.now()
print("Query time is.{}",time3-time2)
print("Total Job run time is.{}",time3-time1)

# Spark Session Stop
spark.stop()

Currently, this works. I am able to process a 1.1 GB file in under a minute.

My questions are:

Does this design make sense? How would you do it if you had to? Is there something obvious I can change to make the code cleaner?

Hardy
  • Why exactly do you need to read all the CSVs, create one Parquet file from them, and then read the Parquet file back to create a temp view? I think you are doing unnecessary extra work – Kafels Jul 09 '19 at 20:01
  • And counting null values for all columns can be done with one select; see this example: [How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?](https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe) (a sketch of this appears after these comments) – Kafels Jul 09 '19 at 20:03
  • Just in addition to Kafels: in general you should try to avoid calling `collect()`. It pulls all the data back to the Spark driver and worsens your performance. – cronoik Jul 09 '19 at 23:31
  • This might be worth checking out as an alternative for doing data quality checks with spark: https://github.com/timgent/data-flare – Timmeh Aug 16 '20 at 11:42
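Following up on Kafels' and cronoik's comments, here is a rough sketch (an assumption about how it could be wired into the code above, not verified against the data) of counting nulls for every column in a single select, which also avoids one `collect()` call per column. It works directly on the DataFrame `df2` from the question; the Parquet round-trip could even be skipped by calling `createOrReplaceTempView` on `df` directly.

from pyspark.sql import functions as F

# One aggregation pass that counts nulls for every column of df2
null_counts = df2.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df2.columns]
).first()
result_store = [['Null Check', c, null_counts[c]] for c in df2.columns]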

1 Answer


You should consider the following strategies for data quality in your environment:

  • embed quality checks into your ETL process,
  • integrate with alerting systems,
  • implement both record-level and batch-level checks,
  • build tooling that generates profiling-check code.

Then you're set up for success.
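For illustration only (not part of the answer): a minimal sketch of what a record-level check versus a batch-level check could look like against the craigslistVehicles view from the question. The column name `price` is an assumption.

# Record-level: flag each row that fails a rule (assumed column `price`)
record_level = spark.sql("""
    select *, (price is null or price < 0) as price_check_failed
    from craigslistVehicles
""")

# Batch-level: a single aggregate over the whole batch
batch_stats = spark.sql("""
    select count(*) as total_rows,
           sum(case when price is null then 1 else 0 end) as null_price_rows
    from craigslistVehicles
""").first()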

nefo_x